[jira] [Comment Edited] (SPARK-22936) providing HttpStreamSource and HttpStreamSink

2018-01-02 Thread bluejoe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309220#comment-16309220
 ] 

bluejoe edited comment on SPARK-22936 at 1/3/18 7:25 AM:
-

The latest spark-http-stream artifact has been released to the central Maven 
repository and is free to use in any Java/Scala project.

I would very much appreciate it if spark-http-stream were listed as a third-party 
Source/Sink in the Spark documentation.


was (Author: bluejoe):
The latest spark-http-stream artifact has been released to the central Maven 
repository and is free to use in any Java/Scala project.

I would appreciate it if spark-http-stream were listed as a third-party 
Source/Sink in the Spark documentation.

> providing HttpStreamSource and HttpStreamSink
> -
>
> Key: SPARK-22936
> URL: https://issues.apache.org/jira/browse/SPARK-22936
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: bluejoe
>
> Hi, in my project I developed spark-http-stream, which is now available at 
> https://github.com/bluejoe2008/spark-http-stream. I wonder whether it would be 
> useful to others and whether it could be integrated as a part of Spark.
> spark-http-stream transfers Spark structured streams over the HTTP protocol. 
> Unlike TCP streams, Kafka streams, and HDFS file streams, HTTP streams often 
> flow across distributed big data centers on the Web. This is very helpful for 
> building global data-processing pipelines across different data centers 
> (scientific research institutes, for example) that own separate data sets.
> The following code shows how to load messages from an HttpStreamSource:
> ```
> val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/")
>   .option("topic", "topic-1")
>   .option("includesTimestamp", "true")
>   .load()
> ```
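> The write side might look like the following. This is a hypothetical sketch: the 
> HttpStreamSinkProvider class name and the sink option names are assumed by analogy 
> with the source example above, not confirmed here.
> ```
> // hypothetical sketch -- sink provider class and option names are assumptions
> val query = lines.writeStream
>   .format(classOf[HttpStreamSinkProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/")
>   .option("topic", "topic-1")
>   .start()
> ```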



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22936) providing HttpStreamSource and HttpStreamSink

2018-01-02 Thread bluejoe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309220#comment-16309220
 ] 

bluejoe commented on SPARK-22936:
-

The latest spark-http-stream artifact has been released to the central Maven 
repository and is free to use in any Java/Scala project.

I would appreciate it if spark-http-stream were listed as a third-party 
Source/Sink in the Spark documentation.

> providing HttpStreamSource and HttpStreamSink
> -
>
> Key: SPARK-22936
> URL: https://issues.apache.org/jira/browse/SPARK-22936
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: bluejoe
>
> Hi, in my project I developed spark-http-stream, which is now available at 
> https://github.com/bluejoe2008/spark-http-stream. I wonder whether it would be 
> useful to others and whether it could be integrated as a part of Spark.
> spark-http-stream transfers Spark structured streams over the HTTP protocol. 
> Unlike TCP streams, Kafka streams, and HDFS file streams, HTTP streams often 
> flow across distributed big data centers on the Web. This is very helpful for 
> building global data-processing pipelines across different data centers 
> (scientific research institutes, for example) that own separate data sets.
> The following code shows how to load messages from an HttpStreamSource:
> ```
> val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/")
>   .option("topic", "topic-1")
>   .option("includesTimestamp", "true")
>   .load()
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes

2018-01-02 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309053#comment-16309053
 ] 

Hossein Falaki commented on SPARK-17762:


I think SPARK-17790 is one place where this limit causes problems. Anywhere 
else we call {{writeBin}}, we face a similar limitation.

> invokeJava fails when serialized argument list is larger than INT_MAX 
> (2,147,483,647) bytes
> ---
>
> Key: SPARK-17762
> URL: https://issues.apache.org/jira/browse/SPARK-17762
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> We call {{writeBin}} within {{writeRaw}}, which is called from {{invokeJava}} on 
> the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded 
> limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base R). 
> To work around it, we can check for this case and serialize the batch in 
> multiple parts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309037#comment-16309037
 ] 

Takeshi Yamamuro commented on SPARK-22942:
--

Since Spark passes null to UDFs when evaluating expressions in optimizer rules, 
you need to make UDFs null-safe.
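
For reference, a null-safe version of the {{hasBlack}} function from the description 
could look like this (a sketch; the same rewrite is suggested elsewhere in this thread):
{code}
val hasBlack: Seq[Row] => Boolean = (s: Seq[Row]) => {
  if (s != null) {
    // same predicate as before, but tolerate a null array from the outer join
    s.exists { case Row(num: Int, color: String) => color == "black" }
  } else {
    false
  }
}
{code}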

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +++---+
> |name|infos   |infos  |
> +++---+
> |b   |[[1,blue], [2,black]]   |null   |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null|[[5,white], [4,orange]]|
> |d   |null|[[4,orange], [2,black]]|
> +++---+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> rightBreakdown.filter("has_black == true").show(false)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
> boolean)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138)
>   at 
> 

[jira] [Resolved] (SPARK-22898) collect_set aggregation on bucketed table causes an exchange stage

2018-01-02 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-22898.
-
Resolution: Duplicate

> collect_set aggregation on bucketed table causes an exchange stage
> --
>
> Key: SPARK-22898
> URL: https://issues.apache.org/jira/browse/SPARK-22898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Modi Tamam
>  Labels: bucketing
>
> I'm using Spark 2.2 to do a proof of concept of Spark's bucketing. I've created a bucketed 
> table; here's the {{desc formatted my_bucketed_tbl}} output:
> +++---+
> |col_nam| data_type|comment|
> +++---+
> |  bundle| string|   null|
> | ifa| string|   null|
> |   date_|date|   null|
> |hour| int|   null|
> |||   |
> |# Detailed Table ...||   |
> |Database| default|   |
> |   Table| my_bucketed_tbl|
> |   Owner|zeppelin|   |
> | Created|Thu Dec 21 13:43:...|   |
> | Last Access|Thu Jan 01 00:00:...|   |
> |Type|EXTERNAL|   |
> |Provider| orc|   |
> | Num Buckets|  16|   |
> |  Bucket Columns| [`ifa`]|   |
> |Sort Columns| [`ifa`]|   |
> |Table Properties|[transient_lastDd...|   |
> |Location|hdfs:/user/hive/w...|   |
> |   Serde Library|org.apache.hadoop...|   |
> | InputFormat|org.apache.hadoop...|   |
> |OutputFormat|org.apache.hadoop...|   |
> |  Storage Properties|[serialization.fo...|   |
> +++---+
> When I execute an explain of a group-by query, I can see that the exchange 
> phase is avoided:
> {code:java}
> sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain
> == Physical Plan ==
> SortAggregate(key=[ifa#932], functions=[max(bundle#920)])
> +- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)])
>+- *Sort [ifa#932 ASC NULLS FIRST], false, 0
>   +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<bundle:string,ifa:string>
> {code}
> But when I replace Spark's max function with collect_set, I can see that the 
> execution plan is the same as for a non-bucketed table, meaning the exchange 
> phase is not avoided:
> {code:java}
> sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by 
> ifa").explain
> == Physical Plan ==
> ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, 
> 0)])
> +- Exchange hashpartitioning(ifa#1010, 200)
>+- ObjectHashAggregate(keys=[ifa#1010], 
> functions=[partial_collect_set(bundle#998, 0, 0)])
>   +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<bundle:string,ifa:string> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22898) collect_set aggregation on bucketed table causes an exchange stage

2018-01-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308970#comment-16308970
 ] 

Liang-Chi Hsieh commented on SPARK-22898:
-

If there is no problem, I will resolve this as a duplicate. You can re-open it 
if you have other questions.

> collect_set aggregation on bucketed table causes an exchange stage
> --
>
> Key: SPARK-22898
> URL: https://issues.apache.org/jira/browse/SPARK-22898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Modi Tamam
>  Labels: bucketing
>
> I'm using Spark 2.2 to do a proof of concept of Spark's bucketing. I've created a bucketed 
> table; here's the {{desc formatted my_bucketed_tbl}} output:
> +++---+
> |col_nam| data_type|comment|
> +++---+
> |  bundle| string|   null|
> | ifa| string|   null|
> |   date_|date|   null|
> |hour| int|   null|
> |||   |
> |# Detailed Table ...||   |
> |Database| default|   |
> |   Table| my_bucketed_tbl|
> |   Owner|zeppelin|   |
> | Created|Thu Dec 21 13:43:...|   |
> | Last Access|Thu Jan 01 00:00:...|   |
> |Type|EXTERNAL|   |
> |Provider| orc|   |
> | Num Buckets|  16|   |
> |  Bucket Columns| [`ifa`]|   |
> |Sort Columns| [`ifa`]|   |
> |Table Properties|[transient_lastDd...|   |
> |Location|hdfs:/user/hive/w...|   |
> |   Serde Library|org.apache.hadoop...|   |
> | InputFormat|org.apache.hadoop...|   |
> |OutputFormat|org.apache.hadoop...|   |
> |  Storage Properties|[serialization.fo...|   |
> +++---+
> When I execute an explain of a group-by query, I can see that the exchange 
> phase is avoided:
> {code:java}
> sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain
> == Physical Plan ==
> SortAggregate(key=[ifa#932], functions=[max(bundle#920)])
> +- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)])
>+- *Sort [ifa#932 ASC NULLS FIRST], false, 0
>   +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<bundle:string,ifa:string>
> {code}
> But when I replace Spark's max function with collect_set, I can see that the 
> execution plan is the same as for a non-bucketed table, meaning the exchange 
> phase is not avoided:
> {code:java}
> sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by 
> ifa").explain
> == Physical Plan ==
> ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, 
> 0)])
> +- Exchange hashpartitioning(ifa#1010, 200)
>+- ObjectHashAggregate(keys=[ifa#1010], 
> functions=[partial_collect_set(bundle#998, 0, 0)])
>   +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, 
> Format: ORC, Location: 
> InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<bundle:string,ifa:string> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22943:
--

 Summary: OneHotEncoder supports manual specification of 
categorySizes
 Key: SPARK-22943
 URL: https://issues.apache.org/jira/browse/SPARK-22943
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


OneHotEncoder (OHE) should support configurable categorySizes, like n_values in 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, 
which allows consistent and predictable conversion.
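
A minimal sketch of the motivation in spark-shell (Spark 2.2 API); the 
setCategorySizes call at the end is the proposed parameter, not an existing one:
{code}
import org.apache.spark.ml.feature.OneHotEncoder
import spark.implicits._

// The output vector width currently depends on the largest index seen in the data,
// so two datasets with different max categories produce incompatible vectors.
val train = Seq(0.0, 1.0, 2.0).toDF("category")   // max index 2
val score = Seq(0.0, 1.0).toDF("category")        // max index 1

val encoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")

encoder.transform(train).show()   // vectors of size 2 (with the default dropLast = true)
encoder.transform(score).show()   // vectors of size 1 -- inconsistent with train

// With a manually specified category size (the proposal here), the width would be fixed:
// encoder.setCategorySizes(Array(3))   // hypothetical, not yet in the API
{code}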




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308882#comment-16308882
 ] 

Matthew Fishkin edited comment on SPARK-22942 at 1/2/18 11:27 PM:
--

I would expect that to work too. I'm more curious why the null pointer is 
occurring when none of the data is null.

Interestingly, I found the following. When you change from 
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))
{code}
to
{code:java}
val rightOnly = joined.filter("l.infos is null and r.infos is not 
null").select($"name", $"r.infos".as("r_infos"))
{code}

the rest of the code above works. 

But I am pretty sure 
*"l.infos is null and r.infos is not null"* is equal to *"l.infos is null"*. If 
one column of an outer join is null, the other must be defined.

If we join columns _A, B_ and _A, C_ on _A_, then it is guaranteed that either _B_ 
or _C_ is defined.


was (Author: mjfish93):
I would expect that to work too. I'm more curious why the null pointer is 
occurring when none of the data is null.

Interestingly, I found the following. When you change from 
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))
{code}
to
{code:java}
val rightOnly = joined.filter("l.infos is null and r.infos is not 
null").select($"name", $"r.infos".as("r_infos"))
{code}

the rest of the code above works. 

But I am pretty sure 
"l.infos is null and r.infos is not null" is equal to "l.infos is null". If one 
column of an outer join is null, the other must be defined.

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +++---+
> |name|infos   |infos  |
> +++---+
> |b   |[[1,blue], [2,black]]   |null   |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null|[[5,white], [4,orange]]|
> |d   |null|[[4,orange], [2,black]]|
> +++---+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> 

[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308882#comment-16308882
 ] 

Matthew Fishkin edited comment on SPARK-22942 at 1/2/18 11:25 PM:
--

I would expect that to work too. I'm more curious why the null pointer is 
occurring when none of the data is null.

Interestingly, I found the following. When you change from 
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))
{code}
to
{code:java}
val rightOnly = joined.filter("l.infos is null and r.infos is not 
null").select($"name", $"r.infos".as("r_infos"))
{code}

the rest of the code above works. 

But I am pretty sure 
"l.infos is null and r.infos is not null" is equal to "l.infos is null". If one 
column of an outer join is null, the other must be defined.


was (Author: mjfish93):
I would expect that to work too. I'm more curious why the null pointer is 
occurring when none of the data is null.

Interestingly, I found the following. When you change from 
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))
{code}

{code:java}
val rightOnly = joined.filter("l.infos is null and r.infos is not 
null").select($"name", $"r.infos".as("r_infos"))
{code}

the rest of the code above works. 

But I am pretty sure 
"l.infos is null and r.infos is not null" is equal to "l.infos is null". If one 
column of an outer join is null, the other must be defined.

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +++---+
> |name|infos   |infos  |
> +++---+
> |b   |[[1,blue], [2,black]]   |null   |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null|[[5,white], [4,orange]]|
> |d   |null|[[4,orange], [2,black]]|
> +++---+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> 

[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308882#comment-16308882
 ] 

Matthew Fishkin commented on SPARK-22942:
-

I would expect that to work too. I'm more curious why the null pointer is 
occurring when none of the data is null.

Interestingly, I found the following. When you change from 
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))
{code}

{code:java}
val rightOnly = joined.filter("l.infos is null and r.infos is not 
null").select($"name", $"r.infos".as("r_infos"))
{code}

the rest of the code above works. 

But I am pretty sure 
"l.infos is null and r.infos is not null" is equal to "l.infos is null". If one 
column of an outer join is null, the other must be defined.

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +++---+
> |name|infos   |infos  |
> +++---+
> |b   |[[1,blue], [2,black]]   |null   |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null|[[5,white], [4,orange]]|
> |d   |null|[[4,orange], [2,black]]|
> +++---+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> rightBreakdown.filter("has_black == true").show(false)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
> boolean)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> 

[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2018-01-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308868#comment-16308868
 ] 

Bryan Cutler commented on SPARK-22126:
--

Thanks for taking a look [~josephkb]!  I believe it's possible to still use the 
current fit() API that returns a {{Seq[Model[_]]}} and avoid materializing all 
models in memory at once.  If an estimator has model-specific optimizations and 
is creating multiple models, it could return a lazy sequence (such as a SeqView 
or Stream).  Then if the CrossValidator just converts the sequence of Models to 
an iterator, it will only hold a reference to the model currently being 
evaluated and previous models can be GC'd.  Does that sound like it would be 
worth a shot here?
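
A generic sketch of that idea (illustrative stand-ins only, not Spark's Estimator API): 
build the models as a lazy Stream and walk them through an iterator, so each model can 
be garbage-collected once it has been evaluated.
{code}
object LazyFitSketch {
  final case class ParamMap(regParam: Double)        // stand-in for ml.param.ParamMap
  final case class Model(regParam: Double)           // stand-in for a trained model

  def fit(pm: ParamMap): Model = Model(pm.regParam)  // pretend this is an expensive fit()

  def main(args: Array[String]): Unit = {
    val paramMaps = Array(ParamMap(0.1), ParamMap(0.3))
    // A Stream is lazy in its tail: models are materialized as the iterator advances,
    // and the iterator drops its reference to each consumed element, so evaluated
    // models become eligible for GC.
    val models: Iterator[Model] = paramMaps.toStream.map(fit).iterator
    models.foreach(m => println(s"evaluating $m"))
  }
}
{code}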

As for the issue of parallelism in model-specific optimizations, it's true 
there might be some benefit in the Estimator being able to allow the 
CrossValidator to handle the parallelism under certain cases.  But until there 
are more examples of this to look at, it's hard to know if making a new API for 
it is worth that benefit.  I saw a reference to a Keras model in another JIRA, 
is that an example that could have a model-specific optimization while still 
allowing the CrossValidator to parallelize over it?

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357; more discussion is here:
> https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy the discussion from the gist here:
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, and 
> models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in 
> another thread. So there are 4 paramMaps, but we can generate at most two 
> threads to compute the models for them.
> The API above allows "callable.call()" to return multiple models; the return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap index 
> of the corresponding model. Using the example above, there are 4 paramMaps, but 
> only 2 callable objects are returned: one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> The default "fitCallables/fit with paramMaps" can be implemented as follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   fitCallables(dataset, paramMaps).flatMap { _.call().toSeq }
>     .sortBy(_._1).map(_._2)
> }
> {code}
> If we use the API I proposed above, the code in 
> [CrossValidator|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
> 

[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308830#comment-16308830
 ] 

Takeshi Yamamuro commented on SPARK-22942:
--

I think you just need NULL checks;
{code}
val hasBlack: Seq[Row] => Boolean = (s: Seq[Row]) => {
  if (s != null) {
s.exists{ case Row(num: Int, color: String) =>
  color == "black"
}
  } else {
false
  }
}
{code}
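
Applied the same way as in the description, the filter should then go through (sketch):
{code}
val rightBreakdown = rightOnly.withColumn("has_black", udf(hasBlack).apply($"r_infos"))
rightBreakdown.filter("has_black == true").show(false)  // the optimizer's null probe no longer throws
{code}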

> Spark Sql UDF throwing NullPointer when adding a filter on a columns that 
> uses that UDF
> ---
>
> Key: SPARK-22942
> URL: https://issues.apache.org/jira/browse/SPARK-22942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0
>Reporter: Matthew Fishkin
>
> I ran into an interesting issue when trying to do a `filter` on a dataframe 
> that has columns that were added using a UDF. I am able to replicate the 
> problem with a smaller set of data.
> Given the dummy case classes:
> {code:java}
> case class Info(number: Int, color: String)
> case class Record(name: String, infos: Seq[Info])
> {code}
> And the following data:
> {code:java}
> val blue = Info(1, "blue")
> val black = Info(2, "black")
> val yellow = Info(3, "yellow")
> val orange = Info(4, "orange")
> val white = Info(5, "white")
> val a  = Record("a", Seq(blue, black, white))
> val a2 = Record("a", Seq(yellow, white, orange))
> val b = Record("b", Seq(blue, black))
> val c = Record("c", Seq(white, orange))
>  val d = Record("d", Seq(orange, black))
> {code}
> Create two dataframes (we will call them left and right)
> {code:java}
> val left = Seq(a, b).toDF
> val right = Seq(a2, c, d).toDF
> {code}
> Join those two dataframes with an outer join (so two of our columns are 
> nullable now).
> {code:java}
> val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
> joined.show(false)
> res0:
> +++---+
> |name|infos   |infos  |
> +++---+
> |b   |[[1,blue], [2,black]]   |null   |
> |a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
> |c   |null|[[5,white], [4,orange]]|
> |d   |null|[[4,orange], [2,black]]|
> +++---+
> {code}
> Then, take only the `name`s that exist in the right Dataframe
> {code:java}
> val rightOnly = joined.filter("l.infos is null").select($"name", 
> $"r.infos".as("r_infos"))
> rightOnly.show(false)
> res1:
> ++---+
> |name|r_infos|
> ++---+
> |c   |[[5,white], [4,orange]]|
> |d   |[[4,orange], [2,black]]|
> ++---+
> {code}
> Now, add a new column called `has_black` which will be true if the `r_infos` 
> contains _black_ as a color
> {code:java}
> def hasBlack = (s: Seq[Row]) => {
>   s.exists{ case Row(num: Int, color: String) =>
> color == "black"
>   }
> }
> val rightBreakdown = rightOnly.withColumn("has_black", 
> udf(hasBlack).apply($"r_infos"))
> rightBreakdown.show(false)
> res2:
> ++---+-+
> |name|r_infos|has_black|
> ++---+-+
> |c   |[[5,white], [4,orange]]|false|
> |d   |[[4,orange], [2,black]]|true |
> ++---+-+
> {code}
> So far, _exactly_ what we expected. 
> *However*, when I try to filter `rightBreakdown`, it fails.
> {code:java}
> rightBreakdown.filter("has_black == true").show(false)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
> boolean)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
>   at 
> 

[jira] [Comment Edited] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-01-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308153#comment-16308153
 ] 

Takeshi Yamamuro edited comment on SPARK-21687 at 1/2/18 10:26 PM:
---

I feel this makes some sense (but this is not a bug, so there is less 
opportunity to backport it to 2.3). cc: [~dongjoon]


was (Author: maropu):
I feel this make some sense (But, this is a not bug, so less opportunity to 
backport to 2.3) cc: [~dongjoon]

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create a partition for a partitioned table. `createTime` is important 
> information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as shown below:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}
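> A minimal sketch of what the change could look like (hedged; the eventual patch 
> may differ):
> {code:scala}
> // inside toHivePartition, before building the HivePartition:
> tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)  // the metastore stores seconds
> new HivePartition(ht, tpart)
> {code}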



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2018-01-02 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308802#comment-16308802
 ] 

Devaraj K commented on SPARK-22404:
---

Thanks [~irashid] for the comment.

bq. can you provide a little more explanation for the point of this?

An unmanaged AM is an AM that is not launched and managed by the RM. The client 
creates a new application on the RM and negotiates a new attempt id. Then it 
waits for the RM app state to reach YarnApplicationState.ACCEPTED, after 
which it spawns the AM in the same or another process and passes it the container 
id via the env variable Environment.CONTAINER_ID. The AM (as part of the same or a 
different process) can register with the RM using the attempt id obtained from the 
container id and proceed as normal.

In this PR/JIRA, I am providing a new configuration, "spark.yarn.un-managed-am" 
(defaults to false), to enable an unmanaged AM application in YARN client mode, 
which starts the Application Master service as part of the Client. It utilizes 
the existing code for communication between the Application Master <-> Task 
Scheduler for the container requests/allocations/launches, and eliminates these:
*   Allocating and launching the Application Master container
*   Remote node/process communication between the Application Master <-> Task 
Scheduler

bq. how much time does this save for you?
It removes the AM container scheduling and launching time, and eliminates the 
AM acting as a proxy for requesting, launching, and removing executors. I can 
post comparison results here with and without the unmanaged AM.

bq. What's the downside of an unmanaged AM?
The unmanaged AM service would run as part of the Client, so the Client has to 
handle anything that goes wrong with the unmanaged AM service itself, unlike a 
managed AM container that gets relaunched on failures.

bq. the idea makes sense, but the yarn interaction and client mode is already 
pretty complicated so I'd like good justification for this
In this PR, it reuses most of the existing code for communication between the 
AM <-> Task Scheduler, but it happens in the same process. The Client starts the 
AM service in the same process when the application's state is ACCEPTED and 
proceeds as usual without disrupting the existing flow.
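
For illustration, enabling the proposed flag from application code could look like 
this (the configuration key is the one proposed in this JIRA, not an existing Spark 
setting):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("yarn")                              // yarn-client mode
  .appName("unmanaged-am-example")
  .config("spark.yarn.un-managed-am", "true")  // proposed flag; defaults to false
  .getOrCreate()
{code}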


> Provide an option to use unmanaged AM in yarn-client mode
> -
>
> Key: SPARK-22404
> URL: https://issues.apache.org/jira/browse/SPARK-22404
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> There was an issue, SPARK-1200, to provide such an option, but it was closed 
> without being fixed.
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster by not requiring the container-launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Fishkin updated SPARK-22942:

Description: 
I ran into an interesting issue when trying to do a `filter` on a dataframe 
that has columns that were added using a UDF. I am able to replicate the 
problem with a smaller set of data.

Given the dummy case classes:

{code:java}
case class Info(number: Int, color: String)
case class Record(name: String, infos: Seq[Info])
{code}

And the following data:
{code:java}
val blue = Info(1, "blue")
val black = Info(2, "black")
val yellow = Info(3, "yellow")
val orange = Info(4, "orange")
val white = Info(5, "white")

val a  = Record("a", Seq(blue, black, white))
val a2 = Record("a", Seq(yellow, white, orange))
val b = Record("b", Seq(blue, black))
val c = Record("c", Seq(white, orange))
 val d = Record("d", Seq(orange, black))
{code}

Create two dataframes (we will call them left and right)
{code:java}
val left = Seq(a, b).toDF
val right = Seq(a2, c, d).toDF
{code}

Join those two dataframes with an outer join (so two of our columns are 
nullable now).
{code:java}
val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
joined.show(false)

res0:
+++---+
|name|infos   |infos  |
+++---+
|b   |[[1,blue], [2,black]]   |null   |
|a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
|c   |null|[[5,white], [4,orange]]|
|d   |null|[[4,orange], [2,black]]|
+++---+
{code}

Then, take only the `name`s that exist in the right Dataframe
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))

rightOnly.show(false)

res1:
++---+
|name|r_infos|
++---+
|c   |[[5,white], [4,orange]]|
|d   |[[4,orange], [2,black]]|
++---+
{code}

Now, add a new column called `has_black` which will be true if the `r_infos` 
contains _black_ as a color

{code:java}
def hasBlack = (s: Seq[Row]) => {
  s.exists{ case Row(num: Int, color: String) =>
color == "black"
  }
}

val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
++---+-+
|name|r_infos|has_black|
++---+-+
|c   |[[5,white], [4,orange]]|false|
|d   |[[4,orange], [2,black]]|true |
++---+-+
{code}

So far, _exactly_ what we expected. 

*However*, when I try to filter `rightBreakdown`, it fails.
{code:java}
rightBreakdown.filter("has_black == true").show(false)

org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
boolean)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
  at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
  at scala.collection.immutable.List.exists(List.scala:84)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at 

[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308765#comment-16308765
 ] 

Shivaram Venkataraman commented on SPARK-16693:
---

Did we have the discussion on dev@? I think it's a good idea to remove this, but 
I just want to make sure we gave enough of a heads-up on dev@ and user@.

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Fishkin updated SPARK-22942:

Description: 
I ran into an interesting issue when trying to do a `filter` on a dataframe 
that has columns that were added using a UDF. I am able to replicate the 
problem with a smaller set of data.

Given the dummy case classes:

{code:java}
case class Info(number: Int, color: String)
case class Record(name: String, infos: Seq[Info])
{code}

And the following data:
{code:java}
val blue = Info(1, "blue")
val black = Info(2, "black")
val yellow = Info(3, "yellow")
val orange = Info(4, "orange")
val white = Info(5, "white")

val a  = Record("a", Seq(blue, black, white))
val a2 = Record("a", Seq(yellow, white, orange))
val b = Record("b", Seq(blue, black))
val c = Record("c", Seq(white, orange))
 val d = Record("d", Seq(orange, black))
{code}

Create two dataframes (we will call them left and right)
{code:java}
val left = Seq(a, b).toDF
val right = Seq(a2, c, d).toDF
{code}

Join those two dataframes with an outer join (so two of our columns are 
nullable now).
{code:java}
val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
joined.show(false)

res0:
+++---+
|name|infos   |infos  |
+++---+
|b   |[[1,blue], [2,black]]   |null   |
|a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
|c   |null|[[5,white], [4,orange]]|
|d   |null|[[4,orange], [2,black]]|
+++---+
{code}

Then, take only the `name`s that exist in the right Dataframe
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))

rightOnly.show(false)

res1:
++---+
|name|r_infos|
++---+
|c   |[[5,white], [4,orange]]|
|d   |[[4,orange], [2,black]]|
++---+
{code}

Now, add a new column called `has_black` which will be true if the `r_infos` 
contains _black_ as a color

{code:java}
def hasBlack = (s: Seq[Row]) => {
  s.exists{ case Row(num: Int, color: String) =>
color == "black"
  }
}

val rightBreakdown = rightOnly.withColumn("has_black", 
udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
++---+-+
|name|r_infos|has_black|
++---+-+
|c   |[[5,white], [4,orange]]|false|
|d   |[[4,orange], [2,black]]|true |
++---+-+
{code}

So far, _exactly_ what we expected. 

*However*, when I try to filter `rightBreakdown`, it fails.
{code:java}
rightBreakdown.filter("has_black == true").show(false)

org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
boolean)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
  at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
  at scala.collection.immutable.List.exists(List.scala:84)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at 

[jira] [Created] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)
Matthew Fishkin created SPARK-22942:
---

 Summary: Spark Sql UDF throwing NullPointer when adding a filter 
on a columns that uses that UDF
 Key: SPARK-22942
 URL: https://issues.apache.org/jira/browse/SPARK-22942
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.2.0
Reporter: Matthew Fishkin


I ran into an interesting issue when trying to do a `filter` on a dataframe 
that has columns that were added using a UDF. I am able to replicate the 
problem with a smaller set of data.

Given the dummy case classes:

{code:scala}
case class Info(number: Int, color: String)
case class Record(name: String, infos: Seq[Info])
{code}

And the following data:
{code:scala}
val blue = Info(1, "blue")
val black = Info(2, "black")
val yellow = Info(3, "yellow")
val orange = Info(4, "orange")
val white = Info(5, "white")

val a  = Record("a", Seq(blue, black, white))
val a2 = Record("a", Seq(yellow, white, orange))
val b = Record("b", Seq(blue, black))
val c = Record("c", Seq(white, orange))
 val d = Record("d", Seq(orange, black))
{code}

Create two dataframes (we will call them left and right)
{code:scala}
val left = Seq(a, b).toDF
val right = Seq(a2, c, d).toDF
{code}

Join those two dataframes with an outer join (so two of our columns are 
nullable now).
{code:scala}
val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
joined.show(false)

res0:
+++---+
|name|infos   |infos  |
+++---+
|b   |[[1,blue], [2,black]]   |null   |
|a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
|c   |null|[[5,white], [4,orange]]|
|d   |null|[[4,orange], [2,black]]|
+++---+
{code}

Then, take only the `name`s that exist in the right Dataframe
{code:scala}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))

rightOnly.show(false)

res1:
++---+
|name|r_infos|
++---+
|c   |[[5,white], [4,orange]]|
|d   |[[4,orange], [2,black]]|
++---+
{code}

Now, add a new column called `has_black` which will be true if the `r_infos` 
contains _black_ as a color

{code:scala}
def hasBlack = (s: Seq[Row]) => {
  s.exists{ case Row(num: Int, color: String) =>
color == "black"
  }
}

val rightBreakdown = rightOnly.withColumn("has_black", udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}

So far, *exactly* what we expected. However, when I try to filter 
`rightBreakdown`, it fails.
{code:scala}
rightBreakdown.filter("has_black == true").show(false)

org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
boolean)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
  at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
  at scala.collection.immutable.List.exists(List.scala:84)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
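
A possible mitigation, sketched below under the assumption that the failure comes from the optimizer evaluating the UDF with a null input while checking whether the outer join can be narrowed, is to make the UDF null-tolerant. This is an illustrative sketch mirroring the code above, not a verified fix.
{code:scala}
// Hedged workaround sketch: same shape as hasBlack above, with a null guard
// so that evaluation against a null input row returns false instead of
// throwing a NullPointerException.
def hasBlackSafe = (s: Seq[Row]) => {
  s != null && s.exists { case Row(_: Int, color: String) => color == "black" }
}

val rightBreakdownSafe = rightOnly.withColumn("has_black", udf(hasBlackSafe).apply($"r_infos"))
rightBreakdownSafe.filter("has_black == true").show(false)
{code}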

[jira] [Updated] (SPARK-22942) Spark SQL UDF throwing NullPointer when adding a filter on a column that uses that UDF

2018-01-02 Thread Matthew Fishkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Fishkin updated SPARK-22942:

Description: 
I ran into an interesting issue when trying to do a `filter` on a dataframe 
that has columns that were added using a UDF. I am able to replicate the 
problem with a smaller set of data.

Given the dummy case classes:

{code:java}
case class Info(number: Int, color: String)
case class Record(name: String, infos: Seq[Info])
{code}

And the following data:
{code:java}
val blue = Info(1, "blue")
val black = Info(2, "black")
val yellow = Info(3, "yellow")
val orange = Info(4, "orange")
val white = Info(5, "white")

val a  = Record("a", Seq(blue, black, white))
val a2 = Record("a", Seq(yellow, white, orange))
val b = Record("b", Seq(blue, black))
val c = Record("c", Seq(white, orange))
 val d = Record("d", Seq(orange, black))
{code}

Create two dataframes (we will call them left and right)
{code:java}
val left = Seq(a, b).toDF
val right = Seq(a2, c, d).toDF
{code}

Join those two dataframes with an outer join (so two of our columns are 
now nullable).
{code:java}
val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
joined.show(false)

res0:
+----+--------------------------------+-----------------------------------+
|name|infos                           |infos                              |
+----+--------------------------------+-----------------------------------+
|b   |[[1,blue], [2,black]]           |null                               |
|a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
|c   |null                            |[[5,white], [4,orange]]            |
|d   |null                            |[[4,orange], [2,black]]            |
+----+--------------------------------+-----------------------------------+
{code}

Then, take only the `name`s that exist in the right Dataframe
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", 
$"r.infos".as("r_infos"))

rightOnly.show(false)

res1:
+----+-----------------------+
|name|r_infos                |
+----+-----------------------+
|c   |[[5,white], [4,orange]]|
|d   |[[4,orange], [2,black]]|
+----+-----------------------+
{code}

Now, add a new column called `has_black` which will be true if the `r_infos` 
contains _black_ as a color

{code:java}
def hasBlack = (s: Seq[Row]) => {
  s.exists{ case Row(num: Int, color: String) =>
color == "black"
  }
}

val rightBreakdown = rightOnly.withColumn("has_black", udf(hasBlack).apply($"r_infos"))

rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}

So far, *exactly* what we expected. However, when I try to filter 
`rightBreakdown`, it fails.
{code:java}
rightBreakdown.filter("has_black == true").show(false)

org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$hasBlack$1: (array<struct<number:int,color:string>>) => 
boolean)
  at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
  at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138)
  at 
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
  at scala.collection.immutable.List.exists(List.scala:84)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152)
  at 
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at 

[jira] [Created] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-01-02 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22941:
--

 Summary: Allow SparkSubmit to throw exceptions instead of exiting 
/ printing errors.
 Key: SPARK-22941
 URL: https://issues.apache.org/jira/browse/SPARK-22941
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


{{SparkSubmit}} now can be called by the {{SparkLauncher}} library (see 
SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
to the app, {{SparkSubmit}} will print errors to the output and call 
{{System.exit}}, which is not very user friendly in this code path.

We should modify {{SparkSubmit}} to be more friendly when called this way, 
while still maintaining the old behavior when called from the command line.
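
As a rough illustration of the idea only (the class and method names below are hypothetical, not Spark's actual internals), the error path could go through an overridable hook: the command line keeps the print-and-exit behavior, while an in-process caller gets an exception it can catch.
{code:scala}
// Illustrative sketch; names are made up for this note.
class SparkSubmitArgumentException(msg: String) extends IllegalArgumentException(msg)

trait SubmitErrorHandler {
  def error(msg: String): Nothing
}

object CommandLineErrorHandler extends SubmitErrorHandler {
  // existing behavior: print the error and exit the JVM
  def error(msg: String): Nothing = { Console.err.println(s"Error: $msg"); sys.exit(1) }
}

object InProcessErrorHandler extends SubmitErrorHandler {
  // behavior for library callers: throw instead of exiting
  def error(msg: String): Nothing = throw new SparkSubmitArgumentException(msg)
}
{code}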



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20664:


Assignee: (was: Apache Spark)

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20664:


Assignee: Apache Spark

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20664) Remove stale applications from SHS listing

2018-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308579#comment-16308579
 ] 

Apache Spark commented on SPARK-20664:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20138

> Remove stale applications from SHS listing
> --
>
> Key: SPARK-20664
> URL: https://issues.apache.org/jira/browse/SPARK-20664
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task is actually not explicit in the spec, and it's also an issue with 
> the current SHS. But having the SHS persist listing data makes it worse.
> Basically, the SHS currently does not detect when files are deleted from the 
> event log directory manually; so those applications are still listed, and 
> trying to see their UI will either show the UI (if it's loaded) or an error 
> (if it's not).
> With the new SHS, that also means that data is leaked in the disk stores used 
> to persist listing and UI data, making the problem worse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak

2018-01-02 Thread William Kinney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308566#comment-16308566
 ] 

William Kinney commented on SPARK-21319:


Is there a workaround for this for version 2.2.0?

> UnsafeExternalRowSorter.RowComparator memory leak
> -
>
> Key: SPARK-21319
> URL: https://issues.apache.org/jira/browse/SPARK-21319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: James Baker
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
> Attachments: 
> 0001-SPARK-21319-Fix-memory-leak-in-UnsafeExternalRowSort.patch, hprof.png
>
>
> When we wish to sort within partitions, we produce an 
> UnsafeExternalRowSorter. This contains an UnsafeExternalSorter, which 
> contains the UnsafeExternalRowComparator.
> The UnsafeExternalSorter adds a task completion listener which performs any 
> additional required cleanup. The upshot of this is that we maintain a 
> reference to the UnsafeExternalRowSorter.RowComparator until the end of the 
> task.
> The RowComparator looks like
> {code:java}
>   private static final class RowComparator extends RecordComparator {
> private final Ordering<InternalRow> ordering;
> private final int numFields;
> private final UnsafeRow row1;
> private final UnsafeRow row2;
> RowComparator(Ordering<InternalRow> ordering, int numFields) {
>   this.numFields = numFields;
>   this.row1 = new UnsafeRow(numFields);
>   this.row2 = new UnsafeRow(numFields);
>   this.ordering = ordering;
> }
> @Override
> public int compare(Object baseObj1, long baseOff1, Object baseObj2, long 
> baseOff2) {
>   // TODO: Why are the sizes -1?
>   row1.pointTo(baseObj1, baseOff1, -1);
>   row2.pointTo(baseObj2, baseOff2, -1);
>   return ordering.compare(row1, row2);
> }
> }
> {code}
> which means that this will contain references to the last baseObjs that were 
> passed in, and without tracking them for purposes of memory allocation.
> We have a job which sorts within partitions and then coalesces partitions - 
> this has a tendency to OOM because of the references to old UnsafeRows that 
> were used during the sorting.
> Attached is a screenshot of a memory dump during a task - our JVM has two 
> executor threads.
> It can be seen that we have 2 references inside of row iterators, and 11 more 
> which are only known in the task completion listener or as part of memory 
> management.
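
To make the retention pattern described above concrete, here is a toy illustration (not Spark code): a comparator that caches its last operands keeps them reachable for as long as the comparator itself is reachable, which in the scenario above means until task completion.
{code:scala}
// Toy example of the leak pattern only; this is not the Spark class.
class CachingComparator extends Ordering[Array[Byte]] {
  // these fields keep the last compared buffers alive after compare() returns
  private var last1: Array[Byte] = _
  private var last2: Array[Byte] = _

  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    last1 = a
    last2 = b
    a.length compare b.length  // the comparison logic is irrelevant to the leak
  }
}
{code}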



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308491#comment-16308491
 ] 

Sean Owen commented on SPARK-22940:
---

Agreed. Many other scripts require wget, though at least one will use either. 
The easiest option is just to document that you need wget, which is simple 
enough. I forget whether I have it via Xcode tools or brew, but {{brew install 
wget}} should always be sufficient.

Using a library is just fine too and probably more desirable within the test 
code.

Feel free to open a PR for any of the options, I'd say.

> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation)
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> it relies on another external program if wget is not installed.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g, redirects, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-02 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-22940:
--
Description: 
On platforms that don't have wget installed (e.g., Mac OS X), test suite 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception 
and aborts:

java.io.IOException: Cannot run program "wget": error=2, No such file or 
directory

HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark 
for compatibility testing. First it uses wget to find a suitable mirror, and 
then it uses wget to download a tar file from the mirror.

There are several ways to fix this (in reverse order of difficulty of 
implementation)

1. Require Mac OS X users to install wget if they wish to run unit tests (or at 
the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, 
update documentation to make this requirement explicit.
2. Fall back on curl when wget is not available.
3. Use an HTTP library to query for a suitable mirror and download the tar file.

Number 2 is easy to implement, and I did so to get the unit test to run. But it 
relies on another external program if wget is not installed.

Number 3 is probably slightly more complex to implement and requires more 
corner-case checking (e.g, redirects, etc.).



  was:
On platforms that don't have wget installed (e.g., Mac OS X), test suite 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception 
and aborts:

java.io.IOException: Cannot run program "wget": error=2, No such file or 
directory

HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark 
for compatibility testing. First it uses wget to find a suitable mirror, and 
then it uses wget to download a tar file from the mirror.

There are several ways to fix this (in reverse order of difficulty of 
implementation)

1. Require Mac OS X users to install wget if they wish to run unit tests (or at 
the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, 
update documentation to make this requirement explicit.
2. Fall back on curl when wget is not available.
3. Use an HTTP library to query for a suitable mirror and download the tar file.

Number 2 is easy to implement, and I did so to get the unit test to run. But 
relies on another external program.

Number 3 is probably slightly more complex to implement and requires more 
corner-case checking (e.g, redirects, etc.).




> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation)
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> it relies on another external program if wget is not installed.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g, redirects, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-02 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-22940:
-

 Summary: Test suite HiveExternalCatalogVersionsSuite fails on 
platforms that don't have wget installed
 Key: SPARK-22940
 URL: https://issues.apache.org/jira/browse/SPARK-22940
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.1
 Environment: MacOS Sierra 10.12.6
Reporter: Bruce Robbins
Priority: Minor


On platforms that don't have wget installed (e.g., Mac OS X), test suite 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception 
and aborts:

java.io.IOException: Cannot run program "wget": error=2, No such file or 
directory

HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark 
for compatibility testing. First it uses wget to find a suitable mirror, and 
then it uses wget to download a tar file from the mirror.

There are several ways to fix this (in reverse order of difficulty of 
implementation)

1. Require Mac OS X users to install wget if they wish to run unit tests (or at 
the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, 
update documentation to make this requirement explicit.
2. Fall back on curl when wget is not available.
3. Use an HTTP library to query for a suitable mirror and download the tar file.

Number 2 is easy to implement, and I did so to get the unit test to run. But it 
relies on another external program.

Number 3 is probably slightly more complex to implement and requires more 
corner-case checking (e.g, redirects, etc.).
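
A minimal sketch of option 2, assuming nothing about the actual test helper (the method name and command-line flags below are illustrative only): try wget first and fall back to curl when wget is missing.
{code:scala}
// Hedged sketch of a wget-then-curl fallback; not the code in the suite.
import scala.sys.process._
import scala.util.Try

def download(url: String, dest: String): Unit = {
  val wget = Seq("wget", "-q", "-O", dest, url)
  val curl = Seq("curl", "-sSfL", "-o", dest, url)
  // Try(...) absorbs the IOException thrown when the binary is not installed
  val ok = Try(wget.!).getOrElse(1) == 0 || Try(curl.!).getOrElse(1) == 0
  require(ok, s"could not download $url with either wget or curl")
}
{code}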





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed

2018-01-02 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-22940:
--
Description: 
On platforms that don't have wget installed (e.g., Mac OS X), test suite 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception 
and aborts:

java.io.IOException: Cannot run program "wget": error=2, No such file or 
directory

HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark 
for compatibility testing. First it uses wget to find a suitable mirror, and 
then it uses wget to download a tar file from the mirror.

There are several ways to fix this (in reverse order of difficulty of 
implementation)

1. Require Mac OS X users to install wget if they wish to run unit tests (or at 
the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, 
update documentation to make this requirement explicit.
2. Fall back on curl when wget is not available.
3. Use an HTTP library to query for a suitable mirror and download the tar file.

Number 2 is easy to implement, and I did so to get the unit test to run. But 
relies on another external program.

Number 3 is probably slightly more complex to implement and requires more 
corner-case checking (e.g, redirects, etc.).



  was:
On platforms that don't have wget installed (e.g., Mac OS X), test suite 
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception 
and aborts:

java.io.IOException: Cannot run program "wget": error=2, No such file or 
directory

HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark 
for compatibility testing. First it uses wget to find a a suitable mirror, and 
then it uses wget to download a tar file from the mirror.

There are several ways to fix this (in reverse order of difficulty of 
implementation)

1. Require Mac OS X users to install wget if they wish to run unit tests (or at 
the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, 
update documentation to make this requirement explicit.
2. Fall back on curl when wget is not available.
3. Use an HTTP library to query for a suitable mirror and download the tar file.

Number 2 is easy to implement, and I did so to get the unit test to run. But 
relies on another external program.

Number 3 is probably slightly more complex to implement and requires more 
corner-case checking (e.g, redirects, etc.).




> Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't 
> have wget installed
> -
>
> Key: SPARK-22940
> URL: https://issues.apache.org/jira/browse/SPARK-22940
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
> Environment: MacOS Sierra 10.12.6
>Reporter: Bruce Robbins
>Priority: Minor
>
> On platforms that don't have wget installed (e.g., Mac OS X), test suite 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an 
> exception and aborts:
> java.io.IOException: Cannot run program "wget": error=2, No such file or 
> directory
> HiveExternalCatalogVersionsSuite uses wget to download older versions of 
> Spark for compatibility testing. First it uses wget to find a suitable 
> mirror, and then it uses wget to download a tar file from the mirror.
> There are several ways to fix this (in reverse order of difficulty of 
> implementation)
> 1. Require Mac OS X users to install wget if they wish to run unit tests (or 
> at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit.
> 2. Fall back on curl when wget is not available.
> 3. Use an HTTP library to query for a suitable mirror and download the tar 
> file.
> Number 2 is easy to implement, and I did so to get the unit test to run. But 
> relies on another external program.
> Number 3 is probably slightly more complex to implement and requires more 
> corner-case checking (e.g, redirects, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-01-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308378#comment-16308378
 ] 

Dongjoon Hyun commented on SPARK-21687:
---

Thank you for ccing me, [~maropu]. I agree with that.

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create a partition for a partitioned table. `createTime` is important 
> information for managing data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition` as follows:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}
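
The change being asked for amounts to roughly one extra call when building the partition; the sketch below shows the idea, with the exact placement inside toHivePartition being an assumption of this note.
{code:scala}
// Hedged sketch: set createTime (the metastore stores it as seconds) on the
// Thrift Partition object before wrapping it in HivePartition.
tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)
{code}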



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308367#comment-16308367
 ] 

Felix Cheung commented on SPARK-16693:
--

These are all non public methods, so officially not public APIs, but people 
have been known to call them.




> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException

2018-01-02 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308363#comment-16308363
 ] 

Jacek Laskowski commented on SPARK-22935:
-

It does not seem to be the case as described in 
https://stackoverflow.com/q/48026060/1305344, where the OP wanted to 
{{inferSchema}} with no values that would be of {{Date}} or {{Timestamp}}, and 
hence Spark SQL infers strings.

But...I think all's fine though as I said at SO:

{quote}
TL;DR Define the schema explicitly since the input dataset does not have values 
to infer types from (for java.sql.Date fields).
{quote}

I think you can close the issue as {{Invalid}}.
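
For completeness, "define the schema explicitly" can look like the sketch below (Scala here for brevity, while the report itself uses the Java API; the column name matches the sample CSV).
{code:scala}
// Hedged sketch of reading with an explicit schema instead of inferSchema.
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

val schema = StructType(Seq(StructField("timestamp", TimestampType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(schema)              // explicit schema, no inference
  .csv("CDR_SAMPLE.csv")
{code}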



> Dataset with Java Beans for java.sql.Date throws CompileException
> -
>
> Key: SPARK-22935
> URL: https://issues.apache.org/jira/browse/SPARK-22935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>
> The following code can throw an exception with or without whole-stage codegen.
> {code}
>   public void SPARK22935() {
> Dataset<CDR> cdr = spark
> .read()
> .format("csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", ";")
> .csv("CDR_SAMPLE.csv")
> .as(Encoders.bean(CDR.class));
> Dataset<CDR> ds = cdr.filter((FilterFunction<CDR>) x -> (x.timestamp != 
> null));
> long c = ds.count();
> cdr.show(2);
> ds.show(2);
> System.out.println("cnt=" + c);
>   }
> // CDR.java
> public class CDR implements java.io.Serializable {
>   public java.sql.Date timestamp;
>   public java.sql.Date getTimestamp() { return this.timestamp; }
>   public void setTimestamp(java.sql.Date timestamp) { this.timestamp = 
> timestamp; }
> }
> // CDR_SAMPLE.csv
> timestamp
> 2017-10-29T02:37:07.815Z
> 2017-10-29T02:38:07.815Z
> {code}
> result
> {code}
> 12:17:10.352 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 61, Column 70: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 61, Column 70: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
>   at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22939) Support Spark UDF in registerFunction

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22939:


Assignee: Apache Spark

> Support Spark UDF in registerFunction
> -
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22939) Support Spark UDF in registerFunction

2018-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308252#comment-16308252
 ] 

Apache Spark commented on SPARK-22939:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20137

> Support Spark UDF in registerFunction
> -
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22939) Support Spark UDF in registerFunction

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22939:


Assignee: (was: Apache Spark)

> Support Spark UDF in registerFunction
> -
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22939) Support Spark UDF in registerFunction

2018-01-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22939:

Summary: Support Spark UDF in registerFunction  (was: registerFunction 
accepts Spark UDF )

> Support Spark UDF in registerFunction
> -
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22939) registerFunction accepts Spark UDF

2018-01-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22939:

Summary: registerFunction accepts Spark UDF   (was: registerFunction also 
accepts Spark UDF )

> registerFunction accepts Spark UDF 
> ---
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22939) registerFunction also accepts Spark UDF

2018-01-02 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22939:
---

 Summary: registerFunction also accepts Spark UDF 
 Key: SPARK-22939
 URL: https://issues.apache.org/jira/browse/SPARK-22939
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


{noformat}
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
random_udf = udf(lambda: int(random.random() * 100), 
IntegerType()).asNondeterministic()
spark.catalog.registerFunction("random_udf", random_udf, StringType())
spark.sql("SELECT random_udf()").collect()
{noformat}

We will get the following error.
{noformat}
Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22897) Expose stageAttemptId in TaskContext

2018-01-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22897.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20082
[https://github.com/apache/spark/pull/20082]

> Expose  stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, there's no easy way for an executor to detect that a new stage has 
> been launched, as stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and will send a 
> pr if community thinks it's a good thing.
> cc [~cloud_fan]
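
To illustrate how executor-side code could use the field once it is exposed (the accessor name below is an assumption of this note, not necessarily the final API):
{code:scala}
// Hedged sketch: read the proposed stage attempt id inside a task.
import org.apache.spark.TaskContext

val tagged = spark.sparkContext.parallelize(1 to 10).map { i =>
  val tc = TaskContext.get()
  // stageAttemptNumber() is the proposed accessor; the name may differ
  (tc.stageId(), tc.stageAttemptNumber(), i)
}
{code}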



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22897) Expose stageAttemptId in TaskContext

2018-01-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22897:
---

Assignee: Xianjin YE

> Expose  stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
>
> Currently, there's no easy way for an executor to detect that a new stage has 
> been launched, as stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and will send a 
> pr if community thinks it's a good thing.
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException

2018-01-02 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308194#comment-16308194
 ] 

Kazuaki Ishizaki commented on SPARK-22935:
--

[~jlaskowski]

When you look at the schema of this Dataset, {{timestamp}} is a {{timestamp}}, not 
a {{date}}. Schema inference always maps time values to {{timestamp}}. 
If you change the declaration of {{timestamp}} in the {{CDR}} class from 
{{java.sql.Date}} to {{java.sql.Timestamp}} as below, it works well.

{code}
Dataset<Row> df = spark
.read()
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", ";")
.csv("CDR_SAMPLE.csv");
df.printSchema();
Dataset<CDR> cdr = df
.as(Encoders.bean(CDR.class));
cdr.printSchema();
Dataset<CDR> ds = cdr.filter((FilterFunction<CDR>) x -> (x.timestamp != 
null));
...

// result
root
 |-- timestamp: timestamp (nullable = true)
{code}

{code}
// CDR.java
public class CDR implements java.io.Serializable {
  public java.sql.Timestamp timestamp;
  public java.sql.Timestamp getTimestamp() { return this.timestamp; }
  public void setTimestamp(java.sql.Timestamp  timestamp) { this.timestamp = 
timestamp; }
}
{code}


> Dataset with Java Beans for java.sql.Date throws CompileException
> -
>
> Key: SPARK-22935
> URL: https://issues.apache.org/jira/browse/SPARK-22935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Kazuaki Ishizaki
>
> The following code can throw an exception with or without whole-stage codegen.
> {code}
>   public void SPARK22935() {
> Dataset<CDR> cdr = spark
> .read()
> .format("csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", ";")
> .csv("CDR_SAMPLE.csv")
> .as(Encoders.bean(CDR.class));
> Dataset<CDR> ds = cdr.filter((FilterFunction<CDR>) x -> (x.timestamp != 
> null));
> long c = ds.count();
> cdr.show(2);
> ds.show(2);
> System.out.println("cnt=" + c);
>   }
> // CDR.java
> public class CDR implements java.io.Serializable {
>   public java.sql.Date timestamp;
>   public java.sql.Date getTimestamp() { return this.timestamp; }
>   public void setTimestamp(java.sql.Date timestamp) { this.timestamp = 
> timestamp; }
> }
> // CDR_SAMPLE.csv
> timestamp
> 2017-10-29T02:37:07.815Z
> 2017-10-29T02:38:07.815Z
> {code}
> result
> {code}
> 12:17:10.352 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 61, Column 70: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 61, Column 70: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
>   at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22936) providing HttpStreamSource and HttpStreamSink

2018-01-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308177#comment-16308177
 ] 

Sean Owen commented on SPARK-22936:
---

I think this can and should start as you have started it, as an external 
project. If there's clear demand, it could be moved into the project itself 
later. Right now people can just add your artifact and use it with no problem 
right?

> providing HttpStreamSource and HttpStreamSink
> -
>
> Key: SPARK-22936
> URL: https://issues.apache.org/jira/browse/SPARK-22936
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: bluejoe
>
> Hi, in my project I completed a spark-http-stream, which is now available on 
> https://github.com/bluejoe2008/spark-http-stream. I am thinking if it is 
> useful to others and is ok to be integrated as a part of Spark.
> spark-http-stream transfers Spark structured stream over HTTP protocol. 
> Unlike tcp streams, Kafka streams and HDFS file streams, http streams often 
> flow across distributed big data centers on the Web. This feature is very 
> helpful to build global data processing pipelines across different data 
> centers (scientific research institutes, for example) who own separated data 
> sets.
> The following code shows how to load messages from a HttpStreamSource:
> ```
> val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName)
>   .option("httpServletUrl", "http://localhost:8080/;)
>   .option("topic", "topic-1");
>   .option("includesTimestamp", "true")
>   .load();
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308161#comment-16308161
 ] 

Sean Owen commented on SPARK-16693:
---

Would this be a breaking change though?

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-01-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308153#comment-16308153
 ] 

Takeshi Yamamuro commented on SPARK-21687:
--

I feel this makes some sense (but this is not a bug, so there is less reason to 
backport it to 2.3). cc: [~dongjoon]

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create a partition for a partitioned table. `createTime` is important 
> information for managing data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition` as follows:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308130#comment-16308130
 ] 

Apache Spark commented on SPARK-22938:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/20136

> Assert that SQLConf.get is accessed only on the driver.
> ---
>
> Key: SPARK-22938
> URL: https://issues.apache.org/jira/browse/SPARK-22938
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>
> Assert if code tries to access SQLConf.get on executor.
> This can lead to hard to detect bugs, where the executor will read 
> fallbackConf, falling back to default config values, ignoring potentially 
> changed non-default configs.
> If a config is to be passed to executor code, it needs to be read on the 
> driver, and passed explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22938:


Assignee: (was: Apache Spark)

> Assert that SQLConf.get is accessed only on the driver.
> ---
>
> Key: SPARK-22938
> URL: https://issues.apache.org/jira/browse/SPARK-22938
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>
> Assert if code tries to access SQLConf.get on executor.
> This can lead to hard to detect bugs, where the executor will read 
> fallbackConf, falling back to default config values, ignoring potentially 
> changed non-default configs.
> If a config is to be passed to executor code, it needs to be read on the 
> driver, and passed explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22938:


Assignee: Apache Spark

> Assert that SQLConf.get is accessed only on the driver.
> ---
>
> Key: SPARK-22938
> URL: https://issues.apache.org/jira/browse/SPARK-22938
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>
> Assert if code tries to access SQLConf.get on an executor.
> This can lead to hard-to-detect bugs, where the executor reads fallbackConf, 
> falling back to default config values and ignoring potentially changed 
> non-default configs.
> If a config is to be passed to executor code, it needs to be read on the 
> driver and passed explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.

2018-01-02 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22938:
-

 Summary: Assert that SQLConf.get is accessed only on the driver.
 Key: SPARK-22938
 URL: https://issues.apache.org/jira/browse/SPARK-22938
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.1
Reporter: Juliusz Sompolski


Assert if code tries to access SQLConf.get on an executor.
This can lead to hard-to-detect bugs, where the executor reads fallbackConf, 
falling back to default config values and ignoring potentially changed 
non-default configs.
If a config is to be passed to executor code, it needs to be read on the 
driver and passed explicitly.
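
A minimal sketch of the kind of driver-only guard the issue describes (not the 
actual patch), assuming only that org.apache.spark.TaskContext.get() returns a 
non-null context while a task is running on an executor and null on the driver:

{code:scala}
import org.apache.spark.TaskContext

// Hypothetical guard, for illustration only: TaskContext.get() is non-null
// only inside a running task, i.e. on an executor.
object SQLConfAccessGuard {
  def assertOnDriver(): Unit = {
    if (TaskContext.get() != null) {
      throw new IllegalStateException(
        "SQLConf should only be accessed on the driver; read the value on " +
          "the driver and pass it to executor code explicitly.")
    }
  }
}
{code}

Such a check could be invoked from SQLConf.get itself, possibly only in test 
builds, so that production jobs are unaffected.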



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22937) SQL elt for binary inputs

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22937:


Assignee: Apache Spark

> SQL elt for binary inputs
> -
>
> Key: SPARK-22937
> URL: https://issues.apache.org/jira/browse/SPARK-22937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> SQL elt should output binary for binary inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22937) SQL elt for binary inputs

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22937:


Assignee: (was: Apache Spark)

> SQL elt for binary inputs
> -
>
> Key: SPARK-22937
> URL: https://issues.apache.org/jira/browse/SPARK-22937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> SQL elt should output binary for binary inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22937) SQL elt for binary inputs

2018-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308062#comment-16308062
 ] 

Apache Spark commented on SPARK-22937:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20135

> SQL elt for binary inputs
> -
>
> Key: SPARK-22937
> URL: https://issues.apache.org/jira/browse/SPARK-22937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> SQL elt should output binary for binary inputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-01-02 Thread Abhay Pradhan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308059#comment-16308059
 ] 

Abhay Pradhan commented on SPARK-22918:
---

Confirmed that our team is also affected by this issue.

We have unit tests that spin up a local Spark session with Hive support 
enabled. Upgrading to 2.2.1 causes our tests to fail.

[~Daimon], thank you for the workaround.
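
For readers hitting the same failure, a commonly mentioned mitigation (which 
may differ from the workaround referred to above) is to make sure no 
restrictive SecurityManager is installed, or to grant Derby's SystemPermission 
via a policy file, before the first Hive-enabled session is created in the 
test JVM. A rough Scala sketch under those assumptions:

{code:scala}
import org.apache.spark.sql.SparkSession

object HiveTestSessionSetup {
  // Illustration only, not necessarily the workaround referenced above:
  // Derby 10.12+ checks SystemPermission("engine", "usederbyinternals") when a
  // SecurityManager is present, which some test harnesses install. Clearing it
  // (or granting the permission via a java.policy file) before the metastore
  // is bootstrapped avoids the AccessControlException.
  def createSession(): SparkSession = {
    System.setSecurityManager(null)
    SparkSession.builder()
      .master("local[*]")
      .appName("hive-enabled-tests")
      .enableHiveSupport()
      .getOrCreate()
  }
}
{code}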

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>
> After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started 
> to fail with following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
>   at 
> org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
>   at 
> org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> 

[jira] [Created] (SPARK-22937) SQL elt for binary inputs

2018-01-02 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-22937:


 Summary: SQL elt for binary inputs
 Key: SPARK-22937
 URL: https://issues.apache.org/jira/browse/SPARK-22937
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Takeshi Yamamuro
Priority: Minor


SQL elt should output binary for binary inputs.
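
For illustration, a hedged Scala sketch of what the proposed behaviour means 
for callers, assuming a local session and Spark SQL's X'...' binary literals; 
the result type before this change depends on the Spark version:

{code:scala}
import org.apache.spark.sql.SparkSession

object EltBinaryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("elt-binary")
      .getOrCreate()

    // elt(n, ...) returns its n-th argument; with BINARY inputs (hex literals),
    // this ticket proposes that the result type be BINARY rather than STRING.
    val df = spark.sql("SELECT elt(2, X'1A2B', X'3C4D') AS picked")
    df.printSchema() // after the change, 'picked' should be of binary type

    spark.stop()
  }
}
{code}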



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22613:


Assignee: (was: Apache Spark)

> Make UNCACHE TABLE behaviour consistent with CACHE TABLE
> 
>
> Key: SPARK-22613
> URL: https://issues.apache.org/jira/browse/SPARK-22613
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>
> The Spark SQL function CACHE TABLE is eager by default. Therefore, it offers 
> an optional LAZY keyword in case you do not want to cache the complete table 
> immediately (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html).
> But the corresponding Spark SQL function UNCACHE TABLE is lazy by default 
> and doesn't offer an EAGER option (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html,
> https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql).
> So one cannot cache and uncache a table eagerly using Spark SQL.
> As a user, I want an EAGER option for UNCACHE TABLE. An alternative would be 
> to change the behaviour of UNCACHE TABLE to be eager by default (consistent 
> with CACHE TABLE) and then also offer a LAZY option for UNCACHE TABLE.
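
For illustration, a short Scala sketch of the asymmetry described in the 
issue, assuming a local session and a temp view named t (syntax as documented 
for Spark 2.2; this is a sketch, not a proposed patch):

{code:scala}
import org.apache.spark.sql.SparkSession

object CacheUncacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-uncache")
      .getOrCreate()
    spark.range(10).createOrReplaceTempView("t")

    spark.sql("CACHE LAZY TABLE t") // opt-in lazy variant of CACHE TABLE
    spark.sql("UNCACHE TABLE t")    // lazy; no EAGER keyword is accepted
    spark.sql("CACHE TABLE t")      // eager by default: materialized immediately

    spark.stop()
  }
}
{code}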



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE

2018-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308019#comment-16308019
 ] 

Apache Spark commented on SPARK-22613:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/20134

> Make UNCACHE TABLE behaviour consistent with CACHE TABLE
> 
>
> Key: SPARK-22613
> URL: https://issues.apache.org/jira/browse/SPARK-22613
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>
> The Spark SQL function CACHE TABLE is eager by default. Therefore, it offers 
> an optional LAZY keyword in case you do not want to cache the complete table 
> immediately (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html).
> But the corresponding Spark SQL function UNCACHE TABLE is lazy by default 
> and doesn't offer an EAGER option (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html,
> https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql).
> So one cannot cache and uncache a table eagerly using Spark SQL.
> As a user, I want an EAGER option for UNCACHE TABLE. An alternative would be 
> to change the behaviour of UNCACHE TABLE to be eager by default (consistent 
> with CACHE TABLE) and then also offer a LAZY option for UNCACHE TABLE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE

2018-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22613:


Assignee: Apache Spark

> Make UNCACHE TABLE behaviour consistent with CACHE TABLE
> 
>
> Key: SPARK-22613
> URL: https://issues.apache.org/jira/browse/SPARK-22613
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Assignee: Apache Spark
>Priority: Minor
>
> The Spark SQL function CACHE TABLE is eager by default. Therefore, it offers 
> an optional LAZY keyword in case you do not want to cache the complete table 
> immediately (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html).
> But the corresponding Spark SQL function UNCACHE TABLE is lazy by default 
> and doesn't offer an EAGER option (see 
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html,
> https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql).
> So one cannot cache and uncache a table eagerly using Spark SQL.
> As a user, I want an EAGER option for UNCACHE TABLE. An alternative would be 
> to change the behaviour of UNCACHE TABLE to be eager by default (consistent 
> with CACHE TABLE) and then also offer a LAZY option for UNCACHE TABLE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-01-02 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307955#comment-16307955
 ] 

Gabor Somogyi commented on SPARK-21687:
---

[~srowen] I've just seen the Branch 2.3 cut mail. Should we backport it there?

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is important 
> information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as follows:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}
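
A hedged sketch of the kind of one-line change the issue suggests (not the 
merged patch), assuming the Thrift-generated metastore Partition class, whose 
create time is stored in epoch seconds:

{code:scala}
import org.apache.hadoop.hive.metastore.api.Partition

object PartitionCreateTime {
  // Sketch only: the Thrift-generated metastore Partition exposes
  // getCreateTime/setCreateTime in epoch seconds, so toHivePartition could
  // stamp the partition just before wrapping it in a HivePartition.
  def stampIfMissing(tpart: Partition): Partition = {
    if (tpart.getCreateTime == 0) {
      tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)
    }
    tpart
  }
}
{code}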



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-01-02 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307877#comment-16307877
 ] 

Gabor Somogyi commented on SPARK-21687:
---

I would like to work on this. Please notify me if somebody already started.

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is important 
> information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as follows:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org