[jira] [Comment Edited] (SPARK-22936) providing HttpStreamSource and HttpStreamSink
[ https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309220#comment-16309220 ] bluejoe edited comment on SPARK-22936 at 1/3/18 7:25 AM: - The latest spark-http-stream artifact has been released to the central maven repository and is free to use in any java/scala projects. I would be very appreciated that spark-http-stream is listed as a third-party Source/Sink in Spark documentation. was (Author: bluejoe): The latest spark-http-stream artifact has been released to the central maven repository and is free to use in any java/scala projects. I will be appreciated that spark-http-stream is listed as a third-party Source/Sink in Spark documentation. > providing HttpStreamSource and HttpStreamSink > - > > Key: SPARK-22936 > URL: https://issues.apache.org/jira/browse/SPARK-22936 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: bluejoe > > Hi, in my project I completed a spark-http-stream, which is now available on > https://github.com/bluejoe2008/spark-http-stream. I am thinking if it is > useful to others and is ok to be integrated as a part of Spark. > spark-http-stream transfers Spark structured stream over HTTP protocol. > Unlike tcp streams, Kafka streams and HDFS file streams, http streams often > flow across distributed big data centers on the Web. This feature is very > helpful to build global data processing pipelines across different data > centers (scientific research institutes, for example) who own separated data > sets. > The following code shows how to load messages from a HttpStreamSource: > ``` > val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName) > .option("httpServletUrl", "http://localhost:8080/";) > .option("topic", "topic-1"); > .option("includesTimestamp", "true") > .load(); > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22936) providing HttpStreamSource and HttpStreamSink
[ https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309220#comment-16309220 ] bluejoe commented on SPARK-22936: - The latest spark-http-stream artifact has been released to the central maven repository and is free to use in any java/scala projects. I will be appreciated that spark-http-stream is listed as a third-party Source/Sink in Spark documentation. > providing HttpStreamSource and HttpStreamSink > - > > Key: SPARK-22936 > URL: https://issues.apache.org/jira/browse/SPARK-22936 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: bluejoe > > Hi, in my project I completed a spark-http-stream, which is now available on > https://github.com/bluejoe2008/spark-http-stream. I am thinking if it is > useful to others and is ok to be integrated as a part of Spark. > spark-http-stream transfers Spark structured stream over HTTP protocol. > Unlike tcp streams, Kafka streams and HDFS file streams, http streams often > flow across distributed big data centers on the Web. This feature is very > helpful to build global data processing pipelines across different data > centers (scientific research institutes, for example) who own separated data > sets. > The following code shows how to load messages from a HttpStreamSource: > ``` > val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName) > .option("httpServletUrl", "http://localhost:8080/";) > .option("topic", "topic-1"); > .option("includesTimestamp", "true") > .load(); > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
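The readStream snippet quoted in both messages above would not compile as pasted: the stray semicolon after the topic option breaks the fluent chain, and the trailing `";` on the URL is a mailing-list escaping artifact. Below is a cleaned-up sketch of consuming the source. The provider class and option names are copied from the description; the console sink used to start the query is standard Structured Streaming, added here only to materialize the stream, and the import of HttpStreamSourceProvider (package omitted) must come from the spark-http-stream artifact.

```
// Minimal sketch, assuming the spark-http-stream artifact is on the classpath
// and HttpStreamSourceProvider has been imported from it.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("http-stream-source-demo").getOrCreate()

val lines = spark.readStream
  .format(classOf[HttpStreamSourceProvider].getName)
  .option("httpServletUrl", "http://localhost:8080/")
  .option("topic", "topic-1")
  .option("includesTimestamp", "true")
  .load()

// Write to the console sink just to verify that messages arrive.
val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```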
[jira] [Commented] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes
[ https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309053#comment-16309053 ] Hossein Falaki commented on SPARK-17762: I think SPARK-17790 is one place where this limit causes problems. Anywhere else we call {{writeBin}}, we face a similar limitation. > invokeJava fails when serialized argument list is larger than INT_MAX > (2,147,483,647) bytes > --- > > Key: SPARK-17762 > URL: https://issues.apache.org/jira/browse/SPARK-17762 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > We call {{writeBin}} within {{writeRaw}}, which is called from invokeJava on > the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded > limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). > To work around it, we can check for this case and serialize the batch in > multiple parts. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
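The proposed workaround is language-agnostic, although the actual fix would live in SparkR's R-side serializer around {{writeRaw}}/{{writeBin}}. A rough Scala sketch of the idea, writing the batch as length-prefixed parts that each stay below a configurable limit so the receiver can reassemble them:

{code:scala}
import java.io.DataOutputStream

// Sketch only, not SparkR code: emit one logical batch as several bounded,
// length-prefixed parts. The reader loops, reading a length and then that many
// bytes, until it sees the zero-length terminator, and concatenates the parts.
def writeInParts(out: DataOutputStream, payload: Array[Byte], maxPartBytes: Int): Unit = {
  require(maxPartBytes > 0, "part size must be positive")
  var offset = 0
  while (offset < payload.length) {
    val len = math.min(maxPartBytes, payload.length - offset)
    out.writeInt(len)               // length prefix for this part
    out.write(payload, offset, len) // the part itself
    offset += len
  }
  out.writeInt(0)                   // zero-length part marks the end of the batch
}
{code}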
[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309037#comment-16309037 ] Takeshi Yamamuro commented on SPARK-22942: -- Since spark passes null to udfs in optimizer rules, you need to make udfs null-safe. > Spark Sql UDF throwing NullPointer when adding a filter on a columns that > uses that UDF > --- > > Key: SPARK-22942 > URL: https://issues.apache.org/jira/browse/SPARK-22942 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0 >Reporter: Matthew Fishkin > > I ran into an interesting issue when trying to do a `filter` on a dataframe > that has columns that were added using a UDF. I am able to replicate the > problem with a smaller set of data. > Given the dummy case classes: > {code:java} > case class Info(number: Int, color: String) > case class Record(name: String, infos: Seq[Info]) > {code} > And the following data: > {code:java} > val blue = Info(1, "blue") > val black = Info(2, "black") > val yellow = Info(3, "yellow") > val orange = Info(4, "orange") > val white = Info(5, "white") > val a = Record("a", Seq(blue, black, white)) > val a2 = Record("a", Seq(yellow, white, orange)) > val b = Record("b", Seq(blue, black)) > val c = Record("c", Seq(white, orange)) > val d = Record("d", Seq(orange, black)) > {code} > Create two dataframes (we will call them left and right) > {code:java} > val left = Seq(a, b).toDF > val right = Seq(a2, c, d).toDF > {code} > Join those two dataframes with an outer join (So two of our columns are > nullable now. > {code:java} > val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") > joined.show(false) > res0: > +++---+ > |name|infos |infos | > +++---+ > |b |[[1,blue], [2,black]] |null | > |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| > |c |null|[[5,white], [4,orange]]| > |d |null|[[4,orange], [2,black]]| > +++---+ > {code} > Then, take only the `name`s that exist in the right Dataframe > {code:java} > val rightOnly = joined.filter("l.infos is null").select($"name", > $"r.infos".as("r_infos")) > rightOnly.show(false) > res1: > ++---+ > |name|r_infos| > ++---+ > |c |[[5,white], [4,orange]]| > |d |[[4,orange], [2,black]]| > ++---+ > {code} > Now, add a new column called `has_black` which will be true if the `r_infos` > contains _black_ as a color > {code:java} > def hasBlack = (s: Seq[Row]) => { > s.exists{ case Row(num: Int, color: String) => > color == "black" > } > } > val rightBreakdown = rightOnly.withColumn("has_black", > udf(hasBlack).apply($"r_infos")) > rightBreakdown.show(false) > res2: > ++---+-+ > |name|r_infos|has_black| > ++---+-+ > |c |[[5,white], [4,orange]]|false| > |d |[[4,orange], [2,black]]|true | > ++---+-+ > {code} > So far, _exactly_ what we expected. > *However*, when I try to filter `rightBreakdown`, it fails. 
> {code:java} > rightBreakdown.filter("has_black == true").show(false) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$hasBlack$1: (array>) => > boolean) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) > at scala.collection.immutable.List.exists(List.scala:84) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138) > at > org.apache.spark.sql.catalyst.optimizer.EliminateO
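To make the advice above concrete: as the stack trace shows, the optimizer rule EliminateOuterJoin evaluates the filter predicate against a null row while deciding whether the outer join can be narrowed, so the UDF must tolerate a null input even though no null ever appears in the data it sees at runtime. A null-tolerant sketch, assuming the same spark-shell session as the repro (so {{rightOnly}} and the {{$}} interpolator are in scope):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Null-safe variant of the hasBlack UDF from the description: a null r_infos
// column simply yields false instead of throwing a NullPointerException.
val hasBlackSafe = udf { (s: Seq[Row]) =>
  Option(s).exists(_.exists { case Row(_: Int, color: String) => color == "black" })
}

val rightBreakdown = rightOnly.withColumn("has_black", hasBlackSafe($"r_infos"))
rightBreakdown.filter("has_black == true").show(false)
{code}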
[jira] [Resolved] (SPARK-22898) collect_set aggregation on bucketed table causes an exchange stage
[ https://issues.apache.org/jira/browse/SPARK-22898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-22898. - Resolution: Duplicate > collect_set aggregation on bucketed table causes an exchange stage > -- > > Key: SPARK-22898 > URL: https://issues.apache.org/jira/browse/SPARK-22898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Modi Tamam > Labels: bucketing > > I'm using Spark-2.2. I'm POCing Spark's bucketing. I've created a bucketed > table, here's the desc formatted my_bucketed_tbl output: > +++---+ > |col_nam| data_type|comment| > +++---+ > | bundle| string| null| > | ifa| string| null| > | date_|date| null| > |hour| int| null| > ||| | > |# Detailed Table ...|| | > |Database| default| | > | Table| my_bucketed_tbl| > | Owner|zeppelin| | > | Created|Thu Dec 21 13:43:...| | > | Last Access|Thu Jan 01 00:00:...| | > |Type|EXTERNAL| | > |Provider| orc| | > | Num Buckets| 16| | > | Bucket Columns| [`ifa`]| | > |Sort Columns| [`ifa`]| | > |Table Properties|[transient_lastDd...| | > |Location|hdfs:/user/hive/w...| | > | Serde Library|org.apache.hadoop...| | > | InputFormat|org.apache.hadoop...| | > |OutputFormat|org.apache.hadoop...| | > | Storage Properties|[serialization.fo...| | > +++---+ > When I'm executing an explain of a group by query, I can see that we've > spared the exchange phase : > {code:java} > sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain > == Physical Plan == > SortAggregate(key=[ifa#932], functions=[max(bundle#920)]) > +- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)]) >+- *Sort [ifa#932 ASC NULLS FIRST], false, 0 > +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > {code} > But, when I replace Spark's max function with collect_set, I can see that the > execution plan is the same as a non-bucketed table, means, the exchange phase > is not spared : > {code:java} > sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by > ifa").explain > == Physical Plan == > ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, > 0)]) > +- Exchange hashpartitioning(ifa#1010, 200) >+- ObjectHashAggregate(keys=[ifa#1010], > functions=[partial_collect_set(bundle#998, 0, 0)]) > +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
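For anyone trying to reproduce the report, a table with the same bucketing layout (16 buckets on {{ifa}}, sorted by {{ifa}}, stored as ORC) can be written with the standard bucketing API. The sketch below assumes a DataFrame {{df}} with at least the {{bundle}} and {{ifa}} columns; it creates a Spark-managed table rather than the external Hive table from the report, but it should exhibit the same plan difference between the two queries.

{code:scala}
// Create a bucketed, sorted ORC table comparable to the one in the description.
df.write
  .format("orc")
  .bucketBy(16, "ifa")
  .sortBy("ifa")
  .saveAsTable("my_bucketed_tbl")

// max() is computed per bucket, so no Exchange appears in the plan:
spark.sql("select ifa, max(bundle) from my_bucketed_tbl group by ifa").explain()

// collect_set() still introduces an Exchange, which is what this ticket reports:
spark.sql("select ifa, collect_set(bundle) from my_bucketed_tbl group by ifa").explain()
{code}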
[jira] [Commented] (SPARK-22898) collect_set aggregation on bucketed table causes an exchange stage
[ https://issues.apache.org/jira/browse/SPARK-22898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308970#comment-16308970 ] Liang-Chi Hsieh commented on SPARK-22898: - If no problem I will resolve this as duplicate. You can re-open it if you have other questions. > collect_set aggregation on bucketed table causes an exchange stage > -- > > Key: SPARK-22898 > URL: https://issues.apache.org/jira/browse/SPARK-22898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Modi Tamam > Labels: bucketing > > I'm using Spark-2.2. I'm POCing Spark's bucketing. I've created a bucketed > table, here's the desc formatted my_bucketed_tbl output: > +++---+ > |col_nam| data_type|comment| > +++---+ > | bundle| string| null| > | ifa| string| null| > | date_|date| null| > |hour| int| null| > ||| | > |# Detailed Table ...|| | > |Database| default| | > | Table| my_bucketed_tbl| > | Owner|zeppelin| | > | Created|Thu Dec 21 13:43:...| | > | Last Access|Thu Jan 01 00:00:...| | > |Type|EXTERNAL| | > |Provider| orc| | > | Num Buckets| 16| | > | Bucket Columns| [`ifa`]| | > |Sort Columns| [`ifa`]| | > |Table Properties|[transient_lastDd...| | > |Location|hdfs:/user/hive/w...| | > | Serde Library|org.apache.hadoop...| | > | InputFormat|org.apache.hadoop...| | > |OutputFormat|org.apache.hadoop...| | > | Storage Properties|[serialization.fo...| | > +++---+ > When I'm executing an explain of a group by query, I can see that we've > spared the exchange phase : > {code:java} > sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain > == Physical Plan == > SortAggregate(key=[ifa#932], functions=[max(bundle#920)]) > +- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)]) >+- *Sort [ifa#932 ASC NULLS FIRST], false, 0 > +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > {code} > But, when I replace Spark's max function with collect_set, I can see that the > execution plan is the same as a non-bucketed table, means, the exchange phase > is not spared : > {code:java} > sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by > ifa").explain > == Physical Plan == > ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, > 0)]) > +- Exchange hashpartitioning(ifa#1010, 200) >+- ObjectHashAggregate(keys=[ifa#1010], > functions=[partial_collect_set(bundle#998, 0, 0)]) > +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, > Format: ORC, Location: > InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
yuhao yang created SPARK-22943: -- Summary: OneHotEncoder supports manual specification of categorySizes Key: SPARK-22943 URL: https://issues.apache.org/jira/browse/SPARK-22943 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor OHE should support configurable categorySizes, like n_values in http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, which allows consistent and predictable conversion. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
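For context, the current encoder infers the category count from whatever indices it sees, which is what makes the conversion inconsistent across datasets. A sketch of today's API with the proposed knob; the setCategorySizes call is hypothetical (it is exactly what this ticket asks for) and the column names are placeholders:

{code:scala}
import org.apache.spark.ml.feature.OneHotEncoder

// Today: the output vector length is inferred from the maximum index present in
// the data, so two datasets with different category coverage encode differently.
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

// Proposed (hypothetical, not an existing method): pin the number of categories
// up front so the conversion is consistent and foreseeable across datasets.
// encoder.setCategorySizes(Array(5))
{code}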
[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308882#comment-16308882 ] Matthew Fishkin edited comment on SPARK-22942 at 1/2/18 11:27 PM: -- I would expect that to work too. I'm more curious why the null pointer is occurring when none of the data is null. Interestingly, I found the following. When you change from {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) {code} to {code:java} val rightOnly = joined.filter("l.infos is null and r.infos is not null").select($"name", $"r.infos".as("r_infos")) {code} the rest of the code above works. But I am pretty sure *"l.infos is null and r.infos is not null"* is equal to *"l.infos is null"*. If one column of an outer join is null, the other must be defined. Joining columns _A, B_ and _A, B_ on _A_, then it is guaranteed that either _B_ or _C_ is defined was (Author: mjfish93): I would expect that to work too. I'm more curious why the null pointer is occurring when none of the data is null. Interestingly, I found the following. When you change from {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) {code} to {code:java} val rightOnly = joined.filter("l.infos is null and r.infos is not null").select($"name", $"r.infos".as("r_infos")) {code} the rest of the code above works. But I am pretty sure "l.infos is null and r.infos is not null" is equal to "l.infos is null". If one column of an outer join is null, the other must be defined. > Spark Sql UDF throwing NullPointer when adding a filter on a columns that > uses that UDF > --- > > Key: SPARK-22942 > URL: https://issues.apache.org/jira/browse/SPARK-22942 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0 >Reporter: Matthew Fishkin > > I ran into an interesting issue when trying to do a `filter` on a dataframe > that has columns that were added using a UDF. I am able to replicate the > problem with a smaller set of data. > Given the dummy case classes: > {code:java} > case class Info(number: Int, color: String) > case class Record(name: String, infos: Seq[Info]) > {code} > And the following data: > {code:java} > val blue = Info(1, "blue") > val black = Info(2, "black") > val yellow = Info(3, "yellow") > val orange = Info(4, "orange") > val white = Info(5, "white") > val a = Record("a", Seq(blue, black, white)) > val a2 = Record("a", Seq(yellow, white, orange)) > val b = Record("b", Seq(blue, black)) > val c = Record("c", Seq(white, orange)) > val d = Record("d", Seq(orange, black)) > {code} > Create two dataframes (we will call them left and right) > {code:java} > val left = Seq(a, b).toDF > val right = Seq(a2, c, d).toDF > {code} > Join those two dataframes with an outer join (So two of our columns are > nullable now. 
> {code:java} > val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") > joined.show(false) > res0: > +++---+ > |name|infos |infos | > +++---+ > |b |[[1,blue], [2,black]] |null | > |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| > |c |null|[[5,white], [4,orange]]| > |d |null|[[4,orange], [2,black]]| > +++---+ > {code} > Then, take only the `name`s that exist in the right Dataframe > {code:java} > val rightOnly = joined.filter("l.infos is null").select($"name", > $"r.infos".as("r_infos")) > rightOnly.show(false) > res1: > ++---+ > |name|r_infos| > ++---+ > |c |[[5,white], [4,orange]]| > |d |[[4,orange], [2,black]]| > ++---+ > {code} > Now, add a new column called `has_black` which will be true if the `r_infos` > contains _black_ as a color > {code:java} > def hasBlack = (s: Seq[Row]) => { > s.exists{ case Row(num: Int, color: String) => > color == "black" > } > } > val rightBreakdown = rightOnly.withColumn("has_black", > udf(hasBlack).apply($"r_infos")) > rightBreakdown.show(false) > res2: > ++---+-+ > |name|r_infos|has_black| > ++---+-+ > |c |[[5,white], [4,orange]]|false| > |d |[[4,orange], [2,black]]|true | > ++---+-+ > {code} > So far, _exactly_ what w
[jira] [Comment Edited] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308882#comment-16308882 ] Matthew Fishkin edited comment on SPARK-22942 at 1/2/18 11:25 PM: -- I would expect that to work too. I'm more curious why the null pointer is occurring when none of the data is null. Interestingly, I found the following. When you change from {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) {code} to {code:java} val rightOnly = joined.filter("l.infos is null and r.infos is not null").select($"name", $"r.infos".as("r_infos")) {code} the rest of the code above works. But I am pretty sure "l.infos is null and r.infos is not null" is equal to "l.infos is null". If one column of an outer join is null, the other must be defined. was (Author: mjfish93): I would expect that to work too. I'm more curious why the null pointer is occurring when none of the data is null. Interestingly, I found the following. When you change from {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) {code} {code:java} val rightOnly = joined.filter("l.infos is null and r.infos is not null").select($"name", $"r.infos".as("r_infos")) {code} the rest of the code above works. But I am pretty sure "l.infos is null and r.infos is not null" is equal to "l.infos is null". If one column of an outer join is null, the other must be defined. > Spark Sql UDF throwing NullPointer when adding a filter on a columns that > uses that UDF > --- > > Key: SPARK-22942 > URL: https://issues.apache.org/jira/browse/SPARK-22942 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0 >Reporter: Matthew Fishkin > > I ran into an interesting issue when trying to do a `filter` on a dataframe > that has columns that were added using a UDF. I am able to replicate the > problem with a smaller set of data. > Given the dummy case classes: > {code:java} > case class Info(number: Int, color: String) > case class Record(name: String, infos: Seq[Info]) > {code} > And the following data: > {code:java} > val blue = Info(1, "blue") > val black = Info(2, "black") > val yellow = Info(3, "yellow") > val orange = Info(4, "orange") > val white = Info(5, "white") > val a = Record("a", Seq(blue, black, white)) > val a2 = Record("a", Seq(yellow, white, orange)) > val b = Record("b", Seq(blue, black)) > val c = Record("c", Seq(white, orange)) > val d = Record("d", Seq(orange, black)) > {code} > Create two dataframes (we will call them left and right) > {code:java} > val left = Seq(a, b).toDF > val right = Seq(a2, c, d).toDF > {code} > Join those two dataframes with an outer join (So two of our columns are > nullable now. 
> {code:java} > val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") > joined.show(false) > res0: > +++---+ > |name|infos |infos | > +++---+ > |b |[[1,blue], [2,black]] |null | > |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| > |c |null|[[5,white], [4,orange]]| > |d |null|[[4,orange], [2,black]]| > +++---+ > {code} > Then, take only the `name`s that exist in the right Dataframe > {code:java} > val rightOnly = joined.filter("l.infos is null").select($"name", > $"r.infos".as("r_infos")) > rightOnly.show(false) > res1: > ++---+ > |name|r_infos| > ++---+ > |c |[[5,white], [4,orange]]| > |d |[[4,orange], [2,black]]| > ++---+ > {code} > Now, add a new column called `has_black` which will be true if the `r_infos` > contains _black_ as a color > {code:java} > def hasBlack = (s: Seq[Row]) => { > s.exists{ case Row(num: Int, color: String) => > color == "black" > } > } > val rightBreakdown = rightOnly.withColumn("has_black", > udf(hasBlack).apply($"r_infos")) > rightBreakdown.show(false) > res2: > ++---+-+ > |name|r_infos|has_black| > ++---+-+ > |c |[[5,white], [4,orange]]|false| > |d |[[4,orange], [2,black]]|true | > ++---+-+ > {code} > So far, _exactly_ what we expected. > *However*, when I try to filter `rightBreakdown`, it fails. > {code:java} > rightBreakdown.
[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308882#comment-16308882 ] Matthew Fishkin commented on SPARK-22942: - I would expect that to work too. I'm more curious why the null pointer is occurring when none of the data is null. Interestingly, I found the following. When you change from {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) {code} {code:java} val rightOnly = joined.filter("l.infos is null and r.infos is not null").select($"name", $"r.infos".as("r_infos")) {code} the rest of the code above works. But I am pretty sure "l.infos is null and r.infos is not null" is equal to "l.infos is null". If one column of an outer join is null, the other must be defined. > Spark Sql UDF throwing NullPointer when adding a filter on a columns that > uses that UDF > --- > > Key: SPARK-22942 > URL: https://issues.apache.org/jira/browse/SPARK-22942 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0 >Reporter: Matthew Fishkin > > I ran into an interesting issue when trying to do a `filter` on a dataframe > that has columns that were added using a UDF. I am able to replicate the > problem with a smaller set of data. > Given the dummy case classes: > {code:java} > case class Info(number: Int, color: String) > case class Record(name: String, infos: Seq[Info]) > {code} > And the following data: > {code:java} > val blue = Info(1, "blue") > val black = Info(2, "black") > val yellow = Info(3, "yellow") > val orange = Info(4, "orange") > val white = Info(5, "white") > val a = Record("a", Seq(blue, black, white)) > val a2 = Record("a", Seq(yellow, white, orange)) > val b = Record("b", Seq(blue, black)) > val c = Record("c", Seq(white, orange)) > val d = Record("d", Seq(orange, black)) > {code} > Create two dataframes (we will call them left and right) > {code:java} > val left = Seq(a, b).toDF > val right = Seq(a2, c, d).toDF > {code} > Join those two dataframes with an outer join (So two of our columns are > nullable now. > {code:java} > val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") > joined.show(false) > res0: > +++---+ > |name|infos |infos | > +++---+ > |b |[[1,blue], [2,black]] |null | > |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| > |c |null|[[5,white], [4,orange]]| > |d |null|[[4,orange], [2,black]]| > +++---+ > {code} > Then, take only the `name`s that exist in the right Dataframe > {code:java} > val rightOnly = joined.filter("l.infos is null").select($"name", > $"r.infos".as("r_infos")) > rightOnly.show(false) > res1: > ++---+ > |name|r_infos| > ++---+ > |c |[[5,white], [4,orange]]| > |d |[[4,orange], [2,black]]| > ++---+ > {code} > Now, add a new column called `has_black` which will be true if the `r_infos` > contains _black_ as a color > {code:java} > def hasBlack = (s: Seq[Row]) => { > s.exists{ case Row(num: Int, color: String) => > color == "black" > } > } > val rightBreakdown = rightOnly.withColumn("has_black", > udf(hasBlack).apply($"r_infos")) > rightBreakdown.show(false) > res2: > ++---+-+ > |name|r_infos|has_black| > ++---+-+ > |c |[[5,white], [4,orange]]|false| > |d |[[4,orange], [2,black]]|true | > ++---+-+ > {code} > So far, _exactly_ what we expected. > *However*, when I try to filter `rightBreakdown`, it fails. 
> {code:java} > rightBreakdown.filter("has_black == true").show(false) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$hasBlack$1: (array>) => > boolean) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) > at > org.apache.spark.sql.catalyst.optimizer.E
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308868#comment-16308868 ] Bryan Cutler commented on SPARK-22126: -- Thanks for taking a look [~josephkb]! I believe it's possible to still use the current fit() API that returns a {{Seq[Model[_]]}} and avoid materializing all models in memory at once. If an estimator has model-specific optimizations and is creating multiple models, it could return a lazy sequence (such as a SeqVew or Stream). Then if the CrossValidator just converts the sequence of Models to an iterator, it will only hold a reference to the model currently being evaluated and previous models can be GC'd. Does that sound like it would be worth a shot here? As for the issue of parallelism in model-specific optimizations, it's true there might be some benefit in the Estimator being able to allow the CrossValidator to handle the parallelism under certain cases. But until there are more examples of this to look at, it's hard to know if making a new API for it is worth that benefit. I saw a reference to a Keras model in another JIRA, is that an example that could have a model-specific optimization while still allowing the CrossValidator to parallelize over it? > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. 
Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for traini
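A small sketch of the lazy-sequence idea suggested in the comment at the top of this message: the estimator-side fit returns a Stream so each model is materialized only when the tuning loop reaches it, and the caller walks it through an iterator without retaining the Stream head, which lets already-evaluated models be garbage collected. The {{estimator}}, {{training}}, {{validation}} and {{evaluator}} names in the usage comment are placeholders, and this is not Spark's actual implementation of fitMultiple.

{code:scala}
import org.apache.spark.ml.param.ParamMap

// Each model is built lazily as the Stream is traversed. Because a Stream
// memoizes its elements, the caller must keep only the iterator (not the
// Stream itself) for earlier models to become unreachable and be GC'd.
def lazyFit[M](paramMaps: Array[ParamMap])(fitOne: ParamMap => M): Seq[M] =
  paramMaps.toStream.map(fitOne)

// Caller side, e.g. a cross validator evaluating one model at a time:
// val it = lazyFit(paramMaps)(pm => estimator.fit(training, pm)).iterator
// val metrics = it.map(m => evaluator.evaluate(m.transform(validation))).toArray
{code}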
[jira] [Commented] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308830#comment-16308830 ] Takeshi Yamamuro commented on SPARK-22942: -- I think you just need NULL checks; {code} val hasBlack: Seq[Row] => Boolean = (s: Seq[Row]) => { if (s != null) { s.exists{ case Row(num: Int, color: String) => color == "black" } } else { false } } {code} > Spark Sql UDF throwing NullPointer when adding a filter on a columns that > uses that UDF > --- > > Key: SPARK-22942 > URL: https://issues.apache.org/jira/browse/SPARK-22942 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.0 >Reporter: Matthew Fishkin > > I ran into an interesting issue when trying to do a `filter` on a dataframe > that has columns that were added using a UDF. I am able to replicate the > problem with a smaller set of data. > Given the dummy case classes: > {code:java} > case class Info(number: Int, color: String) > case class Record(name: String, infos: Seq[Info]) > {code} > And the following data: > {code:java} > val blue = Info(1, "blue") > val black = Info(2, "black") > val yellow = Info(3, "yellow") > val orange = Info(4, "orange") > val white = Info(5, "white") > val a = Record("a", Seq(blue, black, white)) > val a2 = Record("a", Seq(yellow, white, orange)) > val b = Record("b", Seq(blue, black)) > val c = Record("c", Seq(white, orange)) > val d = Record("d", Seq(orange, black)) > {code} > Create two dataframes (we will call them left and right) > {code:java} > val left = Seq(a, b).toDF > val right = Seq(a2, c, d).toDF > {code} > Join those two dataframes with an outer join (So two of our columns are > nullable now. > {code:java} > val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") > joined.show(false) > res0: > +++---+ > |name|infos |infos | > +++---+ > |b |[[1,blue], [2,black]] |null | > |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| > |c |null|[[5,white], [4,orange]]| > |d |null|[[4,orange], [2,black]]| > +++---+ > {code} > Then, take only the `name`s that exist in the right Dataframe > {code:java} > val rightOnly = joined.filter("l.infos is null").select($"name", > $"r.infos".as("r_infos")) > rightOnly.show(false) > res1: > ++---+ > |name|r_infos| > ++---+ > |c |[[5,white], [4,orange]]| > |d |[[4,orange], [2,black]]| > ++---+ > {code} > Now, add a new column called `has_black` which will be true if the `r_infos` > contains _black_ as a color > {code:java} > def hasBlack = (s: Seq[Row]) => { > s.exists{ case Row(num: Int, color: String) => > color == "black" > } > } > val rightBreakdown = rightOnly.withColumn("has_black", > udf(hasBlack).apply($"r_infos")) > rightBreakdown.show(false) > res2: > ++---+-+ > |name|r_infos|has_black| > ++---+-+ > |c |[[5,white], [4,orange]]|false| > |d |[[4,orange], [2,black]]|true | > ++---+-+ > {code} > So far, _exactly_ what we expected. > *However*, when I try to filter `rightBreakdown`, it fails. 
> {code:java} > rightBreakdown.filter("has_black == true").show(false) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$hasBlack$1: (array>) => > boolean) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) > at > scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) > at scala.collection.immutable.List.exists(List.scala:84) > at > org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) > at > org.apache.spark.sq
[jira] [Comment Edited] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308153#comment-16308153 ] Takeshi Yamamuro edited comment on SPARK-21687 at 1/2/18 10:26 PM: --- I feel this make some sense (But, this is not a bug, so less opportunity to backport to 2.3) cc: [~dongjoon] was (Author: maropu): I feel this make some sense (But, this is a not bug, so less opportunity to backport to 2.3) cc: [~dongjoon] > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwite table t partition(p=xx)` to > create partition for partitioned table. `createTime` is an important > information to manage data lifecycle, e.g TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
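For reference, the change being requested boils down to one extra call on the Thrift Partition object built in {{toHivePartition}}. A hedged sketch of what that line might look like (Hive stores createTime as seconds since the epoch); this is the ticket's suggestion, not a merged implementation:

{code:scala}
import java.util.concurrent.TimeUnit

// Inside HiveClientImpl#toHivePartition, before `new HivePartition(ht, tpart)`:
// populate createTime so downstream lifecycle/TTL tooling can rely on it.
tpart.setCreateTime(TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis()).toInt)
{code}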
[jira] [Commented] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308802#comment-16308802 ] Devaraj K commented on SPARK-22404: --- Thanks [~irashid] for the comment. bq. can you provide a little more explanation for the point of this? An unmanagedAM is an AM that is not launched and managed by the RM. The client creates a new application on the RM and negotiates a new attempt id. Then it waits for the RM app state to reach be YarnApplicationState.ACCEPTED after which it spawns the AM in same/another process and passes it the container id via env variable Environment.CONTAINER_ID. The AM(as part of same or different process) can register with the RM using the attempt id obtained from the container id and proceed as normal. In this PR/JIRA, providing a new configuration "spark.yarn.un-managed-am" (defaults to false) to enable the Unmanaged AM Application in Yarn Client mode which starts the Application Master service as part of the Client. It utilizes the existing code for communicating between the Application Master <-> Task Scheduler for the container requests/allocations/launch, and eliminates these, * Allocating and launching the Application Master container * Remote Node/Process communication between Application Master <-> Task Scheduler bq. how much time does this save for you? It removes the AM container scheduling and launching time, and eliminates the AM acting as proxy for requesting, launching and removing executors. I can post the comparison results here with and without unmanaged am. bq. What's the downside of an unmanaged AM? Unmanaged AM service would run as part of the Client, Client can handle if anything goes wrong with the unmanaged AM service unlike relaunching the AM container for failures. bq. the idea makes sense, but the yarn interaction and client mode is already pretty complicated so I'd like good justication for this In this PR, it reuses the most of the existing code for communication between AM <-> Task Scheduler but happens in the same process. The Client starts the AM service in the same process when the applications state is ACCEPTED and proceeds as usual without disrupting existing flow. > Provide an option to use unmanaged AM in yarn-client mode > - > > Key: SPARK-22404 > URL: https://issues.apache.org/jira/browse/SPARK-22404 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Devaraj K > > There was an issue SPARK-1200 to provide an option but was closed without > fixing. > Using an unmanaged AM in yarn-client mode would allow apps to start up > faster, but not requiring the container launcher AM to be launched on the > cluster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
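From a user's perspective, the proposal reduces to flipping one flag in yarn-client mode. A sketch; note that {{spark.yarn.un-managed-am}} is only the configuration name quoted in the comment above (default false) and may change before anything is merged:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: run in yarn-client mode with the proposed unmanaged-AM flag, so
// the AM service is started inside the client process instead of in a container.
val spark = SparkSession.builder()
  .master("yarn")                              // client deploy mode is the default here
  .appName("unmanaged-am-demo")
  .config("spark.yarn.un-managed-am", "true")  // proposed setting, not yet in Spark
  .getOrCreate()
{code}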
[jira] [Updated] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Fishkin updated SPARK-22942: Description: I ran into an interesting issue when trying to do a `filter` on a dataframe that has columns that were added using a UDF. I am able to replicate the problem with a smaller set of data. Given the dummy case classes: {code:java} case class Info(number: Int, color: String) case class Record(name: String, infos: Seq[Info]) {code} And the following data: {code:java} val blue = Info(1, "blue") val black = Info(2, "black") val yellow = Info(3, "yellow") val orange = Info(4, "orange") val white = Info(5, "white") val a = Record("a", Seq(blue, black, white)) val a2 = Record("a", Seq(yellow, white, orange)) val b = Record("b", Seq(blue, black)) val c = Record("c", Seq(white, orange)) val d = Record("d", Seq(orange, black)) {code} Create two dataframes (we will call them left and right) {code:java} val left = Seq(a, b).toDF val right = Seq(a2, c, d).toDF {code} Join those two dataframes with an outer join (So two of our columns are nullable now. {code:java} val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") joined.show(false) res0: +++---+ |name|infos |infos | +++---+ |b |[[1,blue], [2,black]] |null | |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| |c |null|[[5,white], [4,orange]]| |d |null|[[4,orange], [2,black]]| +++---+ {code} Then, take only the `name`s that exist in the right Dataframe {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) rightOnly.show(false) res1: ++---+ |name|r_infos| ++---+ |c |[[5,white], [4,orange]]| |d |[[4,orange], [2,black]]| ++---+ {code} Now, add a new column called `has_black` which will be true if the `r_infos` contains _black_ as a color {code:java} def hasBlack = (s: Seq[Row]) => { s.exists{ case Row(num: Int, color: String) => color == "black" } } val rightBreakdown = rightOnly.withColumn("has_black", udf(hasBlack).apply($"r_infos")) rightBreakdown.show(false) res2: ++---+-+ |name|r_infos|has_black| ++---+-+ |c |[[5,white], [4,orange]]|false| |d |[[4,orange], [2,black]]|true | ++---+-+ {code} So far, _exactly_ what we expected. *However*, when I try to filter `rightBreakdown`, it fails. 
{code:java} rightBreakdown.filter("has_black == true").show(false) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$hasBlack$1: (array>) => boolean) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scal
[jira] [Commented] (SPARK-16693) Remove R deprecated methods
[ https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308765#comment-16308765 ] Shivaram Venkataraman commented on SPARK-16693: --- Did we have the discussion on dev@? I think it's a good idea to remove this, but I just want to make sure we gave enough of a heads up on dev@ and user@ > Remove R deprecated methods > --- > > Key: SPARK-16693 > URL: https://issues.apache.org/jira/browse/SPARK-16693 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Fishkin updated SPARK-22942: Description: I ran into an interesting issue when trying to do a `filter` on a dataframe that has columns that were added using a UDF. I am able to replicate the problem with a smaller set of data. Given the dummy case classes: {code:java} case class Info(number: Int, color: String) case class Record(name: String, infos: Seq[Info]) {code} And the following data: {code:java} val blue = Info(1, "blue") val black = Info(2, "black") val yellow = Info(3, "yellow") val orange = Info(4, "orange") val white = Info(5, "white") val a = Record("a", Seq(blue, black, white)) val a2 = Record("a", Seq(yellow, white, orange)) val b = Record("b", Seq(blue, black)) val c = Record("c", Seq(white, orange)) val d = Record("d", Seq(orange, black)) {code} Create two dataframes (we will call them left and right) {code:java} val left = Seq(a, b).toDF val right = Seq(a2, c, d).toDF {code} Join those two dataframes with an outer join (So two of our columns are nullable now. {code:java} val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") joined.show(false) res0: +++---+ |name|infos |infos | +++---+ |b |[[1,blue], [2,black]] |null | |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| |c |null|[[5,white], [4,orange]]| |d |null|[[4,orange], [2,black]]| +++---+ {code} Then, take only the `name`s that exist in the right Dataframe {code:java} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) rightOnly.show(false) res1: ++---+ |name|r_infos| ++---+ |c |[[5,white], [4,orange]]| |d |[[4,orange], [2,black]]| ++---+ {code} Now, add a new column called `has_black` which will be true if the `r_infos` contains _black_ as a color {code:java} def hasBlack = (s: Seq[Row]) => { s.exists{ case Row(num: Int, color: String) => color == "black" } } val rightBreakdown = rightOnlyInfos.withColumn("has_black", udf(hasBlack).apply($"r_infos")) rightBreakdown.show(false) res2: ++---+-+ |name|r_infos|has_black| ++---+-+ |c |[[5,white], [4,orange]]|false| |d |[[4,orange], [2,black]]|true | ++---+-+ {code} So far, _exactly_ what we expected. *However*, when I try to filter `rightBreakdown`, it fails. 
{code:java} rightBreakdown.filter("has_black == true").show(false) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$hasBlack$1: (array>) => boolean) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode
[jira] [Created] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF
Matthew Fishkin created SPARK-22942: --- Summary: Spark Sql UDF throwing NullPointer when adding a filter on a columns that uses that UDF Key: SPARK-22942 URL: https://issues.apache.org/jira/browse/SPARK-22942 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.2.0 Reporter: Matthew Fishkin I ran into an interesting issue when trying to do a `filter` on a dataframe that has columns that were added using a UDF. I am able to replicate the problem with a smaller set of data. Given the dummy case classes: {code:scala} case class Info(number: Int, color: String) case class Record(name: String, infos: Seq[Info]) {code} And the following data: {code:scala} val blue = Info(1, "blue") val black = Info(2, "black") val yellow = Info(3, "yellow") val orange = Info(4, "orange") val white = Info(5, "white") val a = Record("a", Seq(blue, black, white)) val a2 = Record("a", Seq(yellow, white, orange)) val b = Record("b", Seq(blue, black)) val c = Record("c", Seq(white, orange)) val d = Record("d", Seq(orange, black)) {code} Create two dataframes (we will call them left and right) {code:scala} val left = Seq(a, b).toDF val right = Seq(a2, c, d).toDF {code} Join those two dataframes with an outer join (So two of our columns are nullable now. {code:scala} val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer") joined.show(false) res0: +++---+ |name|infos |infos | +++---+ |b |[[1,blue], [2,black]] |null | |a |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]| |c |null|[[5,white], [4,orange]]| |d |null|[[4,orange], [2,black]]| +++---+ {code} Then, take only the `name`s that exist in the right Dataframe {code:scala} val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos")) rightOnly.show(false) res1: ++---+ |name|r_infos| ++---+ |c |[[5,white], [4,orange]]| |d |[[4,orange], [2,black]]| ++---+ {code} Now, add a new column called `has_black` which will be true if the `r_infos` contains _black_ as a color {code:scala} def hasBlack = (s: Seq[Row]) => { s.exists{ case Row(num: Int, color: String) => color == "black" } } val rightBreakdown = rightOnlyInfos.withColumn("has_black", udf(hasBlack).apply($"r_infos")) rightBreakdown.show(false) res2: ++---+-+ |name|r_infos|has_black| ++---+-+ |c |[[5,white], [4,orange]]|false| |d |[[4,orange], [2,black]]|true | ++---+-+ {code} So far, *exactly* what we expected. However, when I try to filter `rightBreakdown`, it fails. 
{code:scala} rightBreakdown.filter("has_black == true").show(false) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$hasBlack$1: (array>) => boolean) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.
[jira] [Updated] (SPARK-22942) Spark Sql UDF throwing NullPointer when adding a filter on a column that uses that UDF
[ https://issues.apache.org/jira/browse/SPARK-22942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Fishkin updated SPARK-22942: Description: I ran into an interesting issue when trying to do a `filter` on a dataframe that has columns that were added using a UDF. I am able to replicate the problem with a smaller set of data. Given the dummy case classes:
{code:java}
case class Info(number: Int, color: String)
case class Record(name: String, infos: Seq[Info])
{code}
And the following data:
{code:java}
val blue = Info(1, "blue")
val black = Info(2, "black")
val yellow = Info(3, "yellow")
val orange = Info(4, "orange")
val white = Info(5, "white")

val a = Record("a", Seq(blue, black, white))
val a2 = Record("a", Seq(yellow, white, orange))
val b = Record("b", Seq(blue, black))
val c = Record("c", Seq(white, orange))
val d = Record("d", Seq(orange, black))
{code}
Create two dataframes (we will call them left and right):
{code:java}
val left = Seq(a, b).toDF
val right = Seq(a2, c, d).toDF
{code}
Join those two dataframes with an outer join (so two of our columns are nullable now):
{code:java}
val joined = left.alias("l").join(right.alias("r"), Seq("name"), "full_outer")
joined.show(false)

res0:
+----+--------------------------------+-----------------------------------+
|name|infos                           |infos                              |
+----+--------------------------------+-----------------------------------+
|b   |[[1,blue], [2,black]]           |null                               |
|a   |[[1,blue], [2,black], [5,white]]|[[3,yellow], [5,white], [4,orange]]|
|c   |null                            |[[5,white], [4,orange]]            |
|d   |null                            |[[4,orange], [2,black]]            |
+----+--------------------------------+-----------------------------------+
{code}
Then, take only the `name`s that exist in the right Dataframe:
{code:java}
val rightOnly = joined.filter("l.infos is null").select($"name", $"r.infos".as("r_infos"))
rightOnly.show(false)

res1:
+----+-----------------------+
|name|r_infos                |
+----+-----------------------+
|c   |[[5,white], [4,orange]]|
|d   |[[4,orange], [2,black]]|
+----+-----------------------+
{code}
Now, add a new column called `has_black` which will be true if the `r_infos` contains _black_ as a color:
{code:java}
def hasBlack = (s: Seq[Row]) => {
  s.exists {
    case Row(num: Int, color: String) => color == "black"
  }
}

val rightBreakdown = rightOnly.withColumn("has_black", udf(hasBlack).apply($"r_infos"))
rightBreakdown.show(false)

res2:
+----+-----------------------+---------+
|name|r_infos                |has_black|
+----+-----------------------+---------+
|c   |[[5,white], [4,orange]]|false    |
|d   |[[4,orange], [2,black]]|true     |
+----+-----------------------+---------+
{code}
So far, *exactly* what we expected. However, when I try to filter `rightBreakdown`, it fails.
{code:java} rightBreakdown.filter("has_black == true").show(false) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$hasBlack$1: (array>) => boolean) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:411) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(joins.scala:127) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$rightHasNonNullPredicate$lzycompute$1$1.apply(joins.scala:138) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$lzycompute$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.rightHasNonNullPredicate$1(joins.scala:138) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(joins.scala:145) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:152) at org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$2.applyOrElse(joins.scala:150) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.sc
[jira] [Created] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
Marcelo Vanzin created SPARK-22941: -- Summary: Allow SparkSubmit to throw exceptions instead of exiting / printing errors. Key: SPARK-22941 URL: https://issues.apache.org/jira/browse/SPARK-22941 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Marcelo Vanzin {{SparkSubmit}} can now be called by the {{SparkLauncher}} library (see SPARK-11035). But if the caller provides incorrect or inconsistent parameters to the app, {{SparkSubmit}} will print errors to the output and call {{System.exit}}, which is not very user friendly in this code path. We should modify {{SparkSubmit}} to be more friendly when called this way, while still maintaining the old behavior when called from the command line. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
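For context on the code path in question, a minimal launcher-side sketch (the jar path and main class are placeholders; this is not code from the ticket): callers reach {{SparkSubmit}} programmatically through the launcher library, where printed errors and {{System.exit}} are much less useful than an exception they could catch.
{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Start an application programmatically instead of invoking spark-submit by hand;
// the ticket asks that bad arguments surface here as catchable errors.
val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.Main")     // placeholder
  .setMaster("local[*]")
  .startApplication()
{code}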
[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing
[ https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20664: Assignee: (was: Apache Spark) > Remove stale applications from SHS listing > -- > > Key: SPARK-20664 > URL: https://issues.apache.org/jira/browse/SPARK-20664 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task is actually not explicit in the spec, and it's also an issue with > the current SHS. But having the SHS persist listing data makes it worse. > Basically, the SHS currently does not detect when files are deleted from the > event log directory manually; so those applications are still listed, and > trying to see their UI will either show the UI (if it's loaded) or an error > (if it's not). > With the new SHS, that also means that data is leaked in the disk stores used > to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20664) Remove stale applications from SHS listing
[ https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20664: Assignee: Apache Spark > Remove stale applications from SHS listing > -- > > Key: SPARK-20664 > URL: https://issues.apache.org/jira/browse/SPARK-20664 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task is actually not explicit in the spec, and it's also an issue with > the current SHS. But having the SHS persist listing data makes it worse. > Basically, the SHS currently does not detect when files are deleted from the > event log directory manually; so those applications are still listed, and > trying to see their UI will either show the UI (if it's loaded) or an error > (if it's not). > With the new SHS, that also means that data is leaked in the disk stores used > to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20664) Remove stale applications from SHS listing
[ https://issues.apache.org/jira/browse/SPARK-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308579#comment-16308579 ] Apache Spark commented on SPARK-20664: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20138 > Remove stale applications from SHS listing > -- > > Key: SPARK-20664 > URL: https://issues.apache.org/jira/browse/SPARK-20664 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task is actually not explicit in the spec, and it's also an issue with > the current SHS. But having the SHS persist listing data makes it worse. > Basically, the SHS currently does not detect when files are deleted from the > event log directory manually; so those applications are still listed, and > trying to see their UI will either show the UI (if it's loaded) or an error > (if it's not). > With the new SHS, that also means that data is leaked in the disk stores used > to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak
[ https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308566#comment-16308566 ] William Kinney commented on SPARK-21319: Is there a workaround for this for version 2.2.0? > UnsafeExternalRowSorter.RowComparator memory leak > - > > Key: SPARK-21319 > URL: https://issues.apache.org/jira/browse/SPARK-21319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: James Baker >Assignee: Wenchen Fan > Fix For: 2.3.0 > > Attachments: > 0001-SPARK-21319-Fix-memory-leak-in-UnsafeExternalRowSort.patch, hprof.png > > > When we wish to sort within partitions, we produce an > UnsafeExternalRowSorter. This contains an UnsafeExternalSorter, which > contains the UnsafeExternalRowComparator. > The UnsafeExternalSorter adds a task completion listener which performs any > additional required cleanup. The upshot of this is that we maintain a > reference to the UnsafeExternalRowSorter.RowComparator until the end of the > task. > The RowComparator looks like > {code:java} > private static final class RowComparator extends RecordComparator { > private final Ordering ordering; > private final int numFields; > private final UnsafeRow row1; > private final UnsafeRow row2; > RowComparator(Ordering ordering, int numFields) { > this.numFields = numFields; > this.row1 = new UnsafeRow(numFields); > this.row2 = new UnsafeRow(numFields); > this.ordering = ordering; > } > @Override > public int compare(Object baseObj1, long baseOff1, Object baseObj2, long > baseOff2) { > // TODO: Why are the sizes -1? > row1.pointTo(baseObj1, baseOff1, -1); > row2.pointTo(baseObj2, baseOff2, -1); > return ordering.compare(row1, row2); > } > } > {code} > which means that this will contain references to the last baseObjs that were > passed in, and without tracking them for purposes of memory allocation. > We have a job which sorts within partitions and then coalesces partitions - > this has a tendency to OOM because of the references to old UnsafeRows that > were used during the sorting. > Attached is a screenshot of a memory dump during a task - our JVM has two > executor threads. > It can be seen that we have 2 references inside of row iterators, and 11 more > which are only known in the task completion listener or as part of memory > management. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed
[ https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308491#comment-16308491 ] Sean Owen commented on SPARK-22940: --- Agreed. Many other scripts require wget, though at least one will use either. The easiest option is just to document that you need wget, which is simple enough. I forget whether I have it via XCode tools or brew, but, brew install wget should always be sufficient. Using a library is just fine too and probably more desirable within the test code. Feel free to open a PR for any of the options, I'd say. > Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't > have wget installed > - > > Key: SPARK-22940 > URL: https://issues.apache.org/jira/browse/SPARK-22940 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.1 > Environment: MacOS Sierra 10.12.6 >Reporter: Bruce Robbins >Priority: Minor > > On platforms that don't have wget installed (e.g., Mac OS X), test suite > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an > exception and aborts: > java.io.IOException: Cannot run program "wget": error=2, No such file or > directory > HiveExternalCatalogVersionsSuite uses wget to download older versions of > Spark for compatibility testing. First it uses wget to find a suitable > mirror, and then it uses wget to download a tar file from the mirror. > There are several ways to fix this (in reverse order of difficulty of > implementation) > 1. Require Mac OS X users to install wget if they wish to run unit tests (or > at the very least if they wish to run HiveExternalCatalogVersionsSuite). > Also, update documentation to make this requirement explicit. > 2. Fall back on curl when wget is not available. > 3. Use an HTTP library to query for a suitable mirror and download the tar > file. > Number 2 is easy to implement, and I did so to get the unit test to run. But > it relies on another external program if wget is not installed. > Number 3 is probably slightly more complex to implement and requires more > corner-case checking (e.g, redirects, etc.). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
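As a rough illustration of option 2 (an editor's sketch only, not the suite's actual code; the command-line flags are assumptions), the download step could try wget first and fall back to curl:
{code:scala}
import scala.sys.process._
import scala.util.Try

// Returns true if either wget or curl downloads the file successfully. A missing
// binary throws an IOException, which Try turns into a failed (non-zero) attempt.
def download(url: String, dest: String): Boolean = {
  val wgetExit = Try(Seq("wget", "-q", "-O", dest, url).!).getOrElse(-1)
  wgetExit == 0 || Try(Seq("curl", "-s", "-L", "-o", dest, url).!).getOrElse(-1) == 0
}
{code}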
[jira] [Updated] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed
[ https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-22940: -- Description: On platforms that don't have wget installed (e.g., Mac OS X), test suite org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception and aborts: java.io.IOException: Cannot run program "wget": error=2, No such file or directory HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark for compatibility testing. First it uses wget to find a suitable mirror, and then it uses wget to download a tar file from the mirror. There are several ways to fix this (in reverse order of difficulty of implementation) 1. Require Mac OS X users to install wget if they wish to run unit tests (or at the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, update documentation to make this requirement explicit. 2. Fall back on curl when wget is not available. 3. Use an HTTP library to query for a suitable mirror and download the tar file. Number 2 is easy to implement, and I did so to get the unit test to run. But it relies on another external program if wget is not installed. Number 3 is probably slightly more complex to implement and requires more corner-case checking (e.g, redirects, etc.). was: On platforms that don't have wget installed (e.g., Mac OS X), test suite org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception and aborts: java.io.IOException: Cannot run program "wget": error=2, No such file or directory HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark for compatibility testing. First it uses wget to find a suitable mirror, and then it uses wget to download a tar file from the mirror. There are several ways to fix this (in reverse order of difficulty of implementation) 1. Require Mac OS X users to install wget if they wish to run unit tests (or at the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, update documentation to make this requirement explicit. 2. Fall back on curl when wget is not available. 3. Use an HTTP library to query for a suitable mirror and download the tar file. Number 2 is easy to implement, and I did so to get the unit test to run. But relies on another external program. Number 3 is probably slightly more complex to implement and requires more corner-case checking (e.g, redirects, etc.). > Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't > have wget installed > - > > Key: SPARK-22940 > URL: https://issues.apache.org/jira/browse/SPARK-22940 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.1 > Environment: MacOS Sierra 10.12.6 >Reporter: Bruce Robbins >Priority: Minor > > On platforms that don't have wget installed (e.g., Mac OS X), test suite > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an > exception and aborts: > java.io.IOException: Cannot run program "wget": error=2, No such file or > directory > HiveExternalCatalogVersionsSuite uses wget to download older versions of > Spark for compatibility testing. First it uses wget to find a suitable > mirror, and then it uses wget to download a tar file from the mirror. > There are several ways to fix this (in reverse order of difficulty of > implementation) > 1. Require Mac OS X users to install wget if they wish to run unit tests (or > at the very least if they wish to run HiveExternalCatalogVersionsSuite). 
> Also, update documentation to make this requirement explicit. > 2. Fall back on curl when wget is not available. > 3. Use an HTTP library to query for a suitable mirror and download the tar > file. > Number 2 is easy to implement, and I did so to get the unit test to run. But > it relies on another external program if wget is not installed. > Number 3 is probably slightly more complex to implement and requires more > corner-case checking (e.g, redirects, etc.). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed
Bruce Robbins created SPARK-22940: - Summary: Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed Key: SPARK-22940 URL: https://issues.apache.org/jira/browse/SPARK-22940 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.2.1 Environment: MacOS Sierra 10.12.6 Reporter: Bruce Robbins Priority: Minor On platforms that don't have wget installed (e.g., Mac OS X), test suite org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception and aborts: java.io.IOException: Cannot run program "wget": error=2, No such file or directory HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark for compatibility testing. First it uses wget to find a a suitable mirror, and then it uses wget to download a tar file from the mirror. There are several ways to fix this (in reverse order of difficulty of implementation) 1. Require Mac OS X users to install wget if they wish to run unit tests (or at the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, update documentation to make this requirement explicit. 2. Fall back on curl when wget is not available. 3. Use an HTTP library to query for a suitable mirror and download the tar file. Number 2 is easy to implement, and I did so to get the unit test to run. But relies on another external program. Number 3 is probably slightly more complex to implement and requires more corner-case checking (e.g, redirects, etc.). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22940) Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't have wget installed
[ https://issues.apache.org/jira/browse/SPARK-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-22940: -- Description: On platforms that don't have wget installed (e.g., Mac OS X), test suite org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception and aborts: java.io.IOException: Cannot run program "wget": error=2, No such file or directory HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark for compatibility testing. First it uses wget to find a suitable mirror, and then it uses wget to download a tar file from the mirror. There are several ways to fix this (in reverse order of difficulty of implementation) 1. Require Mac OS X users to install wget if they wish to run unit tests (or at the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, update documentation to make this requirement explicit. 2. Fall back on curl when wget is not available. 3. Use an HTTP library to query for a suitable mirror and download the tar file. Number 2 is easy to implement, and I did so to get the unit test to run. But relies on another external program. Number 3 is probably slightly more complex to implement and requires more corner-case checking (e.g, redirects, etc.). was: On platforms that don't have wget installed (e.g., Mac OS X), test suite org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an exception and aborts: java.io.IOException: Cannot run program "wget": error=2, No such file or directory HiveExternalCatalogVersionsSuite uses wget to download older versions of Spark for compatibility testing. First it uses wget to find a a suitable mirror, and then it uses wget to download a tar file from the mirror. There are several ways to fix this (in reverse order of difficulty of implementation) 1. Require Mac OS X users to install wget if they wish to run unit tests (or at the very least if they wish to run HiveExternalCatalogVersionsSuite). Also, update documentation to make this requirement explicit. 2. Fall back on curl when wget is not available. 3. Use an HTTP library to query for a suitable mirror and download the tar file. Number 2 is easy to implement, and I did so to get the unit test to run. But relies on another external program. Number 3 is probably slightly more complex to implement and requires more corner-case checking (e.g, redirects, etc.). > Test suite HiveExternalCatalogVersionsSuite fails on platforms that don't > have wget installed > - > > Key: SPARK-22940 > URL: https://issues.apache.org/jira/browse/SPARK-22940 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.1 > Environment: MacOS Sierra 10.12.6 >Reporter: Bruce Robbins >Priority: Minor > > On platforms that don't have wget installed (e.g., Mac OS X), test suite > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite throws an > exception and aborts: > java.io.IOException: Cannot run program "wget": error=2, No such file or > directory > HiveExternalCatalogVersionsSuite uses wget to download older versions of > Spark for compatibility testing. First it uses wget to find a suitable > mirror, and then it uses wget to download a tar file from the mirror. > There are several ways to fix this (in reverse order of difficulty of > implementation) > 1. Require Mac OS X users to install wget if they wish to run unit tests (or > at the very least if they wish to run HiveExternalCatalogVersionsSuite). > Also, update documentation to make this requirement explicit. > 2. 
Fall back on curl when wget is not available. > 3. Use an HTTP library to query for a suitable mirror and download the tar > file. > Number 2 is easy to implement, and I did so to get the unit test to run. But > relies on another external program. > Number 3 is probably slightly more complex to implement and requires more > corner-case checking (e.g, redirects, etc.). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308378#comment-16308378 ] Dongjoon Hyun commented on SPARK-21687: --- Thank you for ccing me, [~maropu]. I agree with that. > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwite table t partition(p=xx)` to > create partition for partitioned table. `createTime` is an important > information to manage data lifecycle, e.g TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16693) Remove R deprecated methods
[ https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308367#comment-16308367 ] Felix Cheung commented on SPARK-16693: -- These are all non public methods, so officially not public APIs, but people have been known to call them. > Remove R deprecated methods > --- > > Key: SPARK-16693 > URL: https://issues.apache.org/jira/browse/SPARK-16693 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308363#comment-16308363 ] Jacek Laskowski commented on SPARK-22935: - It does not seem to be the case as described in https://stackoverflow.com/q/48026060/1305344 where the OP wanted to {{inferSchema}} with no values that would be of {{Date}} or {{Timestamp}} and hence Spark SQL infers strings. But...I think all's fine though as I said at SO: {quote} TL;DR Define the schema explicitly since the input dataset does not have values to infer types from (for java.sql.Date fields). {quote} I think you can close the issue as {{Invalid}}. > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset ds = cdr.filter((FilterFunction) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
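For readers following along, the "define the schema explicitly" suggestion looks roughly like this (a Scala sketch with a single assumed {{timestamp}} column; the ticket's repro is Java, but the reader API is the same):
{code:scala}
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Supply the schema up front instead of relying on inferSchema, so the column's
// type no longer depends on what can be inferred from the sample data.
val schema = StructType(Seq(StructField("timestamp", TimestampType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(schema)
  .csv("CDR_SAMPLE.csv")

df.printSchema()
{code}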
[jira] [Assigned] (SPARK-22939) Support Spark UDF in registerFunction
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22939: Assignee: Apache Spark > Support Spark UDF in registerFunction > - > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22939) Support Spark UDF in registerFunction
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308252#comment-16308252 ] Apache Spark commented on SPARK-22939: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20137 > Support Spark UDF in registerFunction > - > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22939) Support Spark UDF in registerFunction
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22939: Assignee: (was: Apache Spark) > Support Spark UDF in registerFunction > - > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22939) Support Spark UDF in registerFunction
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22939: Summary: Support Spark UDF in registerFunction (was: registerFunction accepts Spark UDF ) > Support Spark UDF in registerFunction > - > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22939) registerFunction accepts Spark UDF
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22939: Summary: registerFunction accepts Spark UDF (was: registerFunction also accepts Spark UDF ) > registerFunction accepts Spark UDF > --- > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22939) registerFunction also accepts Spark UDF
Xiao Li created SPARK-22939: --- Summary: registerFunction also accepts Spark UDF Key: SPARK-22939 URL: https://issues.apache.org/jira/browse/SPARK-22939 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Xiao Li {noformat} import random from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType, StringType random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic() spark.catalog.registerFunction("random_udf", random_udf, StringType()) spark.sql("SELECT random_udf()").collect() {noformat} We will get the following error. {noformat} Py4JError: An error occurred while calling o29.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22897. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20082 [https://github.com/apache/spark/pull/20082] > Expose stageAttemptId in TaskContext > - > > Key: SPARK-22897 > URL: https://issues.apache.org/jira/browse/SPARK-22897 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.3.0 > > > Currently, there's no easy way for Executor to detect a new stage is launched > as stageAttemptId is missing. > I'd like to propose exposing stageAttemptId in TaskContext, and will send a > pr if community thinks it's a good thing. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
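Executor-side code can then read the attempt roughly as below (a sketch assuming the accessor is exposed as {{stageAttemptNumber}}, its name in the released 2.3.0 {{TaskContext}} API; verify against your Spark version):
{code:scala}
import org.apache.spark.TaskContext

// Print which attempt of the stage each task belongs to. Assumes a SparkContext
// named sc (as in spark-shell) and the stageAttemptNumber accessor.
sc.parallelize(1 to 4, 2).foreach { _ =>
  val tc = TaskContext.get()
  println(s"stage=${tc.stageId()} attempt=${tc.stageAttemptNumber()} partition=${tc.partitionId()}")
}
{code}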
[jira] [Assigned] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22897: --- Assignee: Xianjin YE > Expose stageAttemptId in TaskContext > - > > Key: SPARK-22897 > URL: https://issues.apache.org/jira/browse/SPARK-22897 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > > Currently, there's no easy way for Executor to detect a new stage is launched > as stageAttemptId is missing. > I'd like to propose exposing stageAttemptId in TaskContext, and will send a > pr if community thinks it's a good thing. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308194#comment-16308194 ] Kazuaki Ishizaki commented on SPARK-22935: -- [~jlaskowski] When you see the scheme of this Dataset, {{timestamp}} is {{timestamp}}, is not {{date}}. The inferSchema always sets type for time into {{timestamp}}. If you change declaration of {{timestamp}} in {{CDR}} class from {{java.sql.Date}} to {{java.sql.Timestamp}} as below, it works well. {code} Dataset df = spark .read() .format("csv") .option("header", "true") .option("inferSchema", "true") .option("delimiter", ";") .csv("CDR_SAMPLE.csv"); df.printSchema(); Dataset cdr = df .as(Encoders.bean(CDR.class)); cdr.printSchema(); Dataset ds = cdr.filter((FilterFunction) x -> (x.timestamp != null)); ... // result root |-- timestamp: timestamp (nullable = true) {code} {code} // CDR.java public class CDR implements java.io.Serializable { public java.sql.Timestamp timestamp; public java.sql.Timestamp getTimestamp() { return this.timestamp; } public void setTimestamp(java.sql.Timestamp timestamp) { this.timestamp = timestamp; } } {code} > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset ds = cdr.filter((FilterFunction) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22936) providing HttpStreamSource and HttpStreamSink
[ https://issues.apache.org/jira/browse/SPARK-22936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308177#comment-16308177 ] Sean Owen commented on SPARK-22936: --- I think this can and should start as you have started it, as an external project. If there's clear demand, it could be moved into the project itself later. Right now people can just add your artifact and use it with no problem right? > providing HttpStreamSource and HttpStreamSink > - > > Key: SPARK-22936 > URL: https://issues.apache.org/jira/browse/SPARK-22936 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: bluejoe > > Hi, in my project I completed a spark-http-stream, which is now available on > https://github.com/bluejoe2008/spark-http-stream. I am thinking if it is > useful to others and is ok to be integrated as a part of Spark. > spark-http-stream transfers Spark structured stream over HTTP protocol. > Unlike tcp streams, Kafka streams and HDFS file streams, http streams often > flow across distributed big data centers on the Web. This feature is very > helpful to build global data processing pipelines across different data > centers (scientific research institutes, for example) who own separated data > sets. > The following code shows how to load messages from a HttpStreamSource: > ``` > val lines = spark.readStream.format(classOf[HttpStreamSourceProvider].getName) > .option("httpServletUrl", "http://localhost:8080/";) > .option("topic", "topic-1"); > .option("includesTimestamp", "true") > .load(); > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16693) Remove R deprecated methods
[ https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308161#comment-16308161 ] Sean Owen commented on SPARK-16693: --- Would this be a breaking change though? > Remove R deprecated methods > --- > > Key: SPARK-16693 > URL: https://issues.apache.org/jira/browse/SPARK-16693 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308153#comment-16308153 ] Takeshi Yamamuro commented on SPARK-21687: -- I feel this makes some sense (but this is not a bug, so there is less opportunity to backport it to 2.3). cc: [~dongjoon] > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwite table t partition(p=xx)` to > create partition for partitioned table. `createTime` is an important > information to manage data lifecycle, e.g TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
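For illustration, the missing call would be along these lines (an editor's sketch, not the actual patch; the Thrift-generated metastore {{Partition}} stores its creation time as whole seconds in an {{int}}):
{code:scala}
import org.apache.hadoop.hive.metastore.api.{Partition => MetastorePartition}

// Stamp a freshly built Hive partition with a creation time so TTL-style
// lifecycle tooling can tell when it was created.
def withCreateTime(tpart: MetastorePartition): MetastorePartition = {
  tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)
  tpart
}
{code}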
[jira] [Commented] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.
[ https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308130#comment-16308130 ] Apache Spark commented on SPARK-22938: -- User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/20136 > Assert that SQLConf.get is accessed only on the driver. > --- > > Key: SPARK-22938 > URL: https://issues.apache.org/jira/browse/SPARK-22938 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Juliusz Sompolski > > Assert if code tries to access SQLConf.get on executor. > This can lead to hard to detect bugs, where the executor will read > fallbackConf, falling back to default config values, ignoring potentially > changed non-default configs. > If a config is to be passed to executor code, it needs to be read on the > driver, and passed explicitly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.
[ https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22938: Assignee: (was: Apache Spark) > Assert that SQLConf.get is accessed only on the driver. > --- > > Key: SPARK-22938 > URL: https://issues.apache.org/jira/browse/SPARK-22938 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Juliusz Sompolski > > Assert if code tries to access SQLConf.get on executor. > This can lead to hard to detect bugs, where the executor will read > fallbackConf, falling back to default config values, ignoring potentially > changed non-default configs. > If a config is to be passed to executor code, it needs to be read on the > driver, and passed explicitly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.
[ https://issues.apache.org/jira/browse/SPARK-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22938: Assignee: Apache Spark > Assert that SQLConf.get is accessed only on the driver. > --- > > Key: SPARK-22938 > URL: https://issues.apache.org/jira/browse/SPARK-22938 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Juliusz Sompolski >Assignee: Apache Spark > > Assert if code tries to access SQLConf.get on executor. > This can lead to hard to detect bugs, where the executor will read > fallbackConf, falling back to default config values, ignoring potentially > changed non-default configs. > If a config is to be passed to executor code, it needs to be read on the > driver, and passed explicitly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22938) Assert that SQLConf.get is accessed only on the driver.
Juliusz Sompolski created SPARK-22938: - Summary: Assert that SQLConf.get is accessed only on the driver. Key: SPARK-22938 URL: https://issues.apache.org/jira/browse/SPARK-22938 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.1 Reporter: Juliusz Sompolski Assert if code tries to access SQLConf.get on executor. This can lead to hard to detect bugs, where the executor will read fallbackConf, falling back to default config values, ignoring potentially changed non-default configs. If a config is to be passed to executor code, it needs to be read on the driver, and passed explicitly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
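The shape of the proposed check, as a rough sketch (not the code in the pull request): executors always have a {{TaskContext}} set for the running task, so its absence can stand in for "we are on the driver".
{code:scala}
import org.apache.spark.TaskContext

// Fail fast if configuration is read while a task is running (i.e. on an executor),
// instead of silently falling back to default values there.
def assertOnDriver(): Unit = {
  assert(TaskContext.get() == null,
    "SQLConf.get should only be called on the driver; read the needed values there " +
      "and pass them to executor-side code explicitly.")
}
{code}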
[jira] [Assigned] (SPARK-22937) SQL elt for binary inputs
[ https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22937: Assignee: Apache Spark > SQL elt for binary inputs > - > > Key: SPARK-22937 > URL: https://issues.apache.org/jira/browse/SPARK-22937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > SQL elt should output binary for binary inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22937) SQL elt for binary inputs
[ https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22937: Assignee: (was: Apache Spark) > SQL elt for binary inputs > - > > Key: SPARK-22937 > URL: https://issues.apache.org/jira/browse/SPARK-22937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > SQL elt should output binary for binary inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22937) SQL elt for binary inputs
[ https://issues.apache.org/jira/browse/SPARK-22937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308062#comment-16308062 ] Apache Spark commented on SPARK-22937: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20135 > SQL elt for binary inputs > - > > Key: SPARK-22937 > URL: https://issues.apache.org/jira/browse/SPARK-22937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > SQL elt should output binary for binary inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
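A quick way to see the behavior under discussion (a sketch; the reported result type is what the ticket proposes to change for binary inputs):
{code:scala}
// elt(n, x1, x2, ...) returns the n-th argument; with binary inputs the ticket
// asks for a binary result type rather than a string one.
spark.sql("SELECT elt(1, cast('spark' AS binary), cast('sql' AS binary))").printSchema()
{code}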
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308059#comment-16308059 ] Abhay Pradhan commented on SPARK-22918: --- Confirmed that our team is also affected by this issue. We have unit tests that spin up a local Spark session with Hive support enabled. Upgrading to 2.2.1 causes our tests to fail. [~Daimon] thank you for the workaround. > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot > > After upgrading 2.2.0 -> 2.2.1, the sbt test command in one of my projects started > to fail with the following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > 
org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(
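The workaround mentioned above is not reproduced in this thread. For illustration only, a generic approach for test setup is sketched below: installing a permissive java.security.Policy before the Hive-enabled session is created, so that embedded Derby can obtain its internal SystemPermission. This is an assumption for the sketch, not necessarily the workaround [~Daimon] describes:
{code:scala}
import java.security.{AllPermission, CodeSource, PermissionCollection, Permissions, Policy}

// Test-only policy that grants every permission, which also covers the
// org.apache.derby.security.SystemPermission("engine", "usederbyinternals")
// requested by embedded Derby when the Hive metastore starts.
object PermissiveTestPolicy extends Policy {
  override def getPermissions(codeSource: CodeSource): PermissionCollection = {
    val perms = new Permissions()
    perms.add(new AllPermission())
    perms
  }
}

// Install in test setup (e.g. beforeAll) before creating the SparkSession:
// Policy.setPolicy(PermissiveTestPolicy)
{code}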
[jira] [Created] (SPARK-22937) SQL elt for binary inputs
Takeshi Yamamuro created SPARK-22937: Summary: SQL elt for binary inputs Key: SPARK-22937 URL: https://issues.apache.org/jira/browse/SPARK-22937 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Takeshi Yamamuro Priority: Minor SQL elt should output binary for binary inputs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
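For illustration, a small sketch of the proposed behaviour, assuming a local SparkSession (the column names are made up for the example):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// elt(n, ...) returns the n-th argument. With all-binary inputs, the
// proposal is that the result stays binary instead of being cast to string.
val df = Seq((Array[Byte](1, 2), Array[Byte](3, 4))).toDF("a", "b")
val out = df.selectExpr("elt(2, a, b) AS result")
out.printSchema()   // expected after the change: result: binary (nullable = true)
{code}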
[jira] [Assigned] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22613: Assignee: (was: Apache Spark) > Make UNCACHE TABLE behaviour consistent with CACHE TABLE > > > Key: SPARK-22613 > URL: https://issues.apache.org/jira/browse/SPARK-22613 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The Spark SQL function CACHE TABLE is eager by default. Therefore it offers > an optional keyword LAZY in case you do not want to cache the complete table > immediately (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html). > But the corresponding Spark SQL function UNCACHE TABLE is lazy by default > and doesn't offer an option EAGER (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html, > > https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql). > So one cannot cache and uncache a table in an eager way using Spark SQL. > As a user I want an option EAGER for UNCACHE TABLE. An alternative could be > to change the behaviour of UNCACHE TABLE to be eager by default (consistent > with CACHE TABLE) and then offer an option LAZY also for UNCACHE TABLE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308019#comment-16308019 ] Apache Spark commented on SPARK-22613: -- User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/20134 > Make UNCACHE TABLE behaviour consistent with CACHE TABLE > > > Key: SPARK-22613 > URL: https://issues.apache.org/jira/browse/SPARK-22613 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The Spark SQL function CACHE TABLE is eager by default. Therefore it offers > an optional keyword LAZY in case you do not want to cache the complete table > immediately (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html). > But the corresponding Spark SQL function UNCACHE TABLE is lazy by default > and doesn't offer an option EAGER (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html, > > https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql). > So one cannot cache and uncache a table in an eager way using Spark SQL. > As a user I want an option EAGER for UNCACHE TABLE. An alternative could be > to change the behaviour of UNCACHE TABLE to be eager by default (consistent > with CACHE TABLE) and then offer an option LAZY also for UNCACHE TABLE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22613) Make UNCACHE TABLE behaviour consistent with CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-22613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22613: Assignee: Apache Spark > Make UNCACHE TABLE behaviour consistent with CACHE TABLE > > > Key: SPARK-22613 > URL: https://issues.apache.org/jira/browse/SPARK-22613 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Assignee: Apache Spark >Priority: Minor > > The Spark SQL function CACHE TABLE is eager by default. Therefore it offers > an optional keyword LAZY in case you do not want to cache the complete table > immediately (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html). > But the corresponding Spark SQL function UNCACHE TABLE is lazy by default > and doesn't offer an option EAGER (See > https://docs.databricks.com/spark/latest/spark-sql/language-manual/uncache-table.html, > > https://stackoverflow.com/questions/47226494/is-uncache-table-a-lazy-operation-in-spark-sql). > So one cannot cache and uncache a table in an eager way using Spark SQL. > As a user I want an option EAGER for UNCACHE TABLE. An alternative could be > to change the behaviour of UNCACHE TABLE to be eager by default (consistent > with CACHE TABLE) and then offer an option LAZY also for UNCACHE TABLE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
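For illustration, the asymmetry described above expressed as SQL statements run from Scala; the EAGER variant in the last line is the requested, hypothetical syntax, not something current Spark accepts:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

spark.sql("CACHE TABLE t")        // eager (default): scans and caches the table immediately
spark.sql("CACHE LAZY TABLE t")   // lazy: the table is cached on first use

// Per the description above, UNCACHE TABLE is lazy and offers no eager option.
spark.sql("UNCACHE TABLE t")

// Requested, hypothetical syntax:
// spark.sql("UNCACHE EAGER TABLE t")
{code}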
[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307955#comment-16307955 ] Gabor Somogyi commented on SPARK-21687: --- [~srowen] I've just seen the Branch 2.3 cut mail. Should we backport it there? > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to > create a partition for a partitioned table. `createTime` is important > information for managing the data lifecycle, e.g. TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307877#comment-16307877 ] Gabor Somogyi commented on SPARK-21687: --- I would like to work on this. Please notify me if somebody has already started. > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to > create a partition for a partitioned table. `createTime` is important > information for managing the data lifecycle, e.g. TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
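For illustration, a minimal standalone sketch of the missing call: the Thrift-generated Partition class stores createTime as an int of epoch seconds, so in `HiveClientImpl#toHivePartition` this would be a one-line addition next to the other tpart setters quoted above (the exact placement and value are assumptions for the sketch, not an actual patch):
{code:scala}
import org.apache.hadoop.hive.metastore.api.Partition

// Requires hive-metastore on the classpath.
val tpart = new Partition()
tpart.setDbName("db")
tpart.setTableName("t")
// createTime is seconds since the epoch, stored as an int.
tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)
{code}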