[jira] [Resolved] (SPARK-24337) Improve the error message for invalid SQL conf value
[ https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24337. - Resolution: Fixed Fix Version/s: 2.4.0 > Improve the error message for invalid SQL conf value > > > Key: SPARK-24337 > URL: https://issues.apache.org/jira/browse/SPARK-24337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Right now Spark will throw the following error message when a config is set > to an invalid value. It would be great if the error message contains the > config key so that it's easy to tell which one is wrong. > {code} > Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes > (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. > Fractional values are not supported. Input was: 1.6 > at > org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291) > at > org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66) > at > org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234) > at > org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234) > at > org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90) > at > 
org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071) > ... 59 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
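The requested improvement can be sketched as follows. This is a hypothetical model, not Spark's code: `parse_byte_string` is a simplified stand-in for `JavaUtils.byteStringAs`, and the wrapper shows how the config key could be attached when conversion fails.

```python
# Hypothetical sketch of the requested improvement: wrap the value-conversion
# step so a parse failure is re-raised with the offending config key attached.
# parse_byte_string is a simplified stand-in for JavaUtils.byteStringAs; names
# and the exact message format are illustrative, not Spark's code.
UNITS = {"b": 1, "k": 1 << 10, "m": 1 << 20,
         "g": 1 << 30, "t": 1 << 40, "p": 1 << 50}

def parse_byte_string(value: str) -> int:
    value = value.strip().lower()
    if not value or value[-1] not in UNITS or not value[:-1].isdigit():
        raise ValueError(
            "Size must be specified as bytes (b), kibibytes (k), mebibytes (m), "
            "gibibytes (g), tebibytes (t), or pebibytes (p). "
            f"Fractional values are not supported. Input was: {value}")
    return int(value[:-1]) * UNITS[value[-1]]

def set_conf_string(key: str, value: str) -> int:
    try:
        return parse_byte_string(value)
    except ValueError as e:
        # The improvement: mention the config key, so the user can tell
        # which setting carries the invalid value.
        raise ValueError(f"{key}: {e}") from e
```

With this shape, setting a fractional size like `1.6` would fail with a message that begins with the key, instead of the bare size-format error in the stack trace above.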
[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496132#comment-16496132 ] Liang-Chi Hsieh edited comment on SPARK-24410 at 5/31/18 5:39 AM: -- We can verify the partitioning of the union DataFrame: {code} val df1 = spark.table("a1").select(spark_partition_id(), $"key") val df2 = spark.table("a2").select(spark_partition_id(), $"key") df1.union(df2).select(spark_partition_id(), $"key").collect {code} {code} res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], [2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], [5,9], [5,5], [5,8], [5,1], [5,4]) {code} From the above result, we can see that rows with the same {{key}} from {{df1}} and {{df2}} are in different partitions. E.g., key {{6}} is in partition {{0}} and partition {{3}}. So we still need a shuffle to get the correct results. was (Author: viirya): We can verify the partitioning of the union DataFrame: {code} val df1 = spark.table("a1").select(spark_partition_id(), $"key") val df2 = spark.table("a2").select(spark_partition_id(), $"key") df1.union(df2).select(spark_partition_id(), $"key").collect {code} {code} res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], [2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], [5,9], [5,5], [5,8], [5,1], [5,4]) {code} From the above result, we can see that rows with the same {{key}} from {{df1}} and {{df2}} are in different partitions. E.g., key {{6}} is in partition {{0}} and partition {{3}}. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. We do this by unioning the two > tables and aggregating again. 
> we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). > for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, 
count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA
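The plans above can be condensed into a toy model. This is an illustration, not Spark's code: `Union` simply concatenates its children's partitions, so bucket `i` of table `a1` becomes output partition `i` while the same bucket of `a2` becomes output partition `num_buckets + i` — the union is no longer hash-partitioned, which is why the `Exchange` remains.

```python
# Toy model of why the shuffle survives: Union concatenates child partitions,
# so rows with the same key end up in two different output partitions.
# bucket() is a stand-in for Spark's hash-based bucketing, not its real hash.
def bucket(key: int, num_buckets: int = 3) -> int:
    return key % num_buckets

def union_partition_ids(keys, num_buckets: int = 3):
    from_a1 = [(bucket(k, num_buckets), k) for k in keys]
    from_a2 = [(bucket(k, num_buckets) + num_buckets, k) for k in keys]
    return from_a1 + from_a2
```

In this model every key shows up under two distinct partition ids (`i` and `i + 3`), mirroring what the `spark_partition_id()` output in the comment shows, e.g. key 6 appearing in partitions 0 and 3.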
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496131#comment-16496131 ] Teddy Choi commented on SPARK-21187: Hello [~bryanc], I'm working on a Hive-Spark connector with Arrow. So I'm also interested in this issue. Can I work on the MapType implementation? Thanks. > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map > * -*Decimal*- > * *Binary* - in pyspark > Some things to do before closing this out: > * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal)- > * -Need to add some user docs- > * -Make sure Python tests are thorough- > * Check into complex type support mentioned in comments by [~leif], should > we support multi-indexing? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
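For the MapType work discussed above, the target layout can be sketched in plain Python. Per the Arrow columnar format, a map column is physically a list of key/value entries: one offsets array plus flat child arrays for keys and values. Function and variable names here are illustrative, not part of Arrow's or Spark's API.

```python
# Hedged sketch of Arrow's MapType physical layout: offsets delimit each
# row's entries inside flat keys/values child arrays (list<struct<key,value>>).
def maps_to_arrow_layout(maps):
    offsets, keys, values = [0], [], []
    for m in maps:
        if m is not None:
            for k, v in m.items():
                keys.append(k)
                values.append(v)
        offsets.append(len(keys))   # one offset per row, prefix-sum style
    return offsets, keys, values
```

A converter for Spark's MapType would essentially have to produce these three buffers (plus validity bitmaps) from the row values.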
[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496132#comment-16496132 ] Liang-Chi Hsieh commented on SPARK-24410: - We can verify the partitioning of the union DataFrame: {code} val df1 = spark.table("a1").select(spark_partition_id(), $"key") val df2 = spark.table("a2").select(spark_partition_id(), $"key") df1.union(df2).select(spark_partition_id(), $"key").collect {code} {code} res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], [2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], [5,9], [5,5], [5,8], [5,1], [5,4]) {code} From the above result, we can see that rows with the same {{key}} from {{df1}} and {{df2}} are in different partitions. E.g., key {{6}} is in partition {{0}} and partition {{3}}. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. We do this by unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). 
> for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > 
InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24337) Improve the error message for invalid SQL conf value
[ https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24337: --- Assignee: (was: Xiao Li) > Improve the error message for invalid SQL conf value > > > Key: SPARK-24337 > URL: https://issues.apache.org/jira/browse/SPARK-24337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Priority: Major > > Right now Spark will throw the following error message when a config is set > to an invalid value. It would be great if the error message contains the > config key so that it's easy to tell which one is wrong. > {code} > Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes > (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. > Fractional values are not supported. Input was: 1.6 > at > org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291) > at > org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66) > at > org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234) > at > org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234) > at > org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90) > at > 
org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88) > at > org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071) > ... 59 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23649. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 2.2.2 > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > Attachments: utf8xFF.csv > > > Schema inferring of CSV files fails if the file contains a char that starts with > *0xFF*. > {code:java} > spark.read.option("header", "true").csv("utf8xFF.csv") > {code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 63 > at > org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191) > at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206) > {code} > Here is the content of the file: > {code:java} > hexdump -C ~/tmp/utf8xFF.csv > 63 68 61 6e 6e 65 6c 2c 63 6f 64 65 0d 0a 55 6e |channel,code..Un| > 0010 69 74 65 64 2c 31 32 33 0d 0a 41 42 47 55 4e ff |ited,123..ABGUN.| > 0020 2c 34 35 36 0d|,456.| > 0025 > {code} > Schema inferring doesn't fail in multiline mode: > {code} > spark.read.option("header", "true").option("multiline", > "true").csv("utf8xFF.csv") > {code} > {code:java} > +---+-+ > |channel|code| > +---+-+ > | United| 123| > | ABGUN�| 456| > +---+-+ > {code} > and Spark is able to read the csv file if the schema is specified: > {code} > import org.apache.spark.sql.types._ > val schema = new StructType().add("channel", StringType).add("code", > StringType) > spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show > {code} > {code:java} > +---++ > |channel|code| > +---++ > | United| 123| > | ABGUN�| 456| > +---++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
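The failure mode above can be reconstructed in a few lines. This is a hedged, simplified model, not Spark's code: a UTF-8 sequence length is derived from the leading byte, but 0xF8–0xFF are never valid leading bytes, and indexing a lookup table of valid lead bytes with 0xFF gives 255 − 192 = 63 — which lines up with the `ArrayIndexOutOfBoundsException: 63` in the report.

```python
# Hedged reconstruction (not Spark's code) of UTF-8 lead-byte handling:
# valid lead bytes determine the sequence length; anything else (including
# 0xFF, as in the attached CSV) should fall back to length 1 rather than
# index past a lookup table.
def utf8_seq_length(lead: int) -> int:
    if lead < 0x80:
        return 1                      # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Continuation byte or invalid lead byte (0x80-0xC1, 0xF5-0xFF):
    # treat as a single malformed byte instead of failing.
    return 1
```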
[jira] [Commented] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark
[ https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495998#comment-16495998 ] Huaxin Gao commented on SPARK-24439: I will work on this. > Add distanceMeasure to BisectingKMeans in PySpark > - > > Key: SPARK-24439 > URL: https://issues.apache.org/jira/browse/SPARK-24439 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to > BisectingKMeans. We will do the same for PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark
Huaxin Gao created SPARK-24439: -- Summary: Add distanceMeasure to BisectingKMeans in PySpark Key: SPARK-24439 URL: https://issues.apache.org/jira/browse/SPARK-24439 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Huaxin Gao https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to BisectingKMeans. We will do the same for PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
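What the `distanceMeasure` parameter chooses between can be shown with plain-Python versions of the two measures (these are illustrations, not MLlib's implementations):

```python
import math

# The two distance measures BisectingKMeans supports: Euclidean (the
# default) and cosine. Cosine distance ignores magnitude, so (1, 0) and
# (5, 0) are identical to it, while Euclidean distance sees them 4.0 apart.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

On the PySpark side, the change would presumably mirror the Scala API from SPARK-23412, i.e. something like `BisectingKMeans(k=4, distanceMeasure="cosine")` (hypothetical until the PR lands).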
[jira] [Resolved] (SPARK-24436) Add large dataset to examples sub-directory.
[ https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24436. -- Resolution: Invalid > Add large dataset to examples sub-directory. > > > Key: SPARK-24436 > URL: https://issues.apache.org/jira/browse/SPARK-24436 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Varun Vishwanathan >Priority: Trivial > > Fire Incidents from San francisco is a large dataset that may help developers > work with more data when fixing bugs or developing features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24436) Add large dataset to examples sub-directory.
[ https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495984#comment-16495984 ] Hyukjin Kwon commented on SPARK-24436: -- I don't think we should add a large dataset to the code base. You can share the link on the user mailing list, post it in a blog, or elsewhere. An example is an example; it doesn't need large data. > Add large dataset to examples sub-directory. > > > Key: SPARK-24436 > URL: https://issues.apache.org/jira/browse/SPARK-24436 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Varun Vishwanathan >Priority: Trivial > > Fire Incidents from San Francisco is a large dataset that may help developers > work with more data when fixing bugs or developing features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet
[ https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495982#comment-16495982 ] Hyukjin Kwon commented on SPARK-24427: -- Doesn't it sound like you have multiple versions of Spark on your classpath? > Spark 2.2 - Exception occurred while saving table in spark. Multiple sources > found for parquet > > > Key: SPARK-24427 > URL: https://issues.apache.org/jira/browse/SPARK-24427 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: Ashok Rai >Priority: Major > > We are getting the below error while loading into a Hive table. In our code, we use > "saveAsTable", which, as per the documentation, automatically chooses the format > that the table was created with. We have now tested by creating the table as > Parquet as well as ORC. In both cases - the same error occurred. > > - > 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving > table in spark. > org.apache.spark.sql.AnalysisException: Multiple sources found for parquet > (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, > org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please > specify the fully qualified class name.; > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > ~[scala-library-2.11.8.jar:?] > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > ~[scala-library-2.11.8.jar:?] > at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) > ~[scala-library-2.11.8.jar:?] > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at scala.collection.immutable.List.foreach(List.scala:381) > ~[scala-library-2.11.8.jar:?] 
> at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50) > ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] > at >
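The resolution logic behind this error can be modeled roughly as follows. This is a hypothetical sketch (field names are illustrative, not Spark's `DataSource.lookupDataSource`): each datasource registers a short name, and when a classpath carries two Spark builds, two providers answer to "parquet" and the lookup refuses to guess.

```python
# Toy model of datasource resolution: ambiguous short names raise the
# "Multiple sources found" error; a fully qualified class name disambiguates.
def lookup_data_source(name, providers):
    matches = [p for p in providers
               if name in (p["short_name"], p["qualified_name"])]
    if not matches:
        raise ValueError(f"Failed to find data source: {name}")
    if len(matches) > 1:
        classes = ", ".join(p["qualified_name"] for p in matches)
        raise ValueError(f"Multiple sources found for {name} ({classes}), "
                         "please specify the fully qualified class name.")
    return matches[0]
```

In practice the fix is usually to remove the duplicate spark-sql jar from the classpath rather than to switch to the fully qualified name.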
[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495979#comment-16495979 ] Cyril Scetbon commented on SPARK-16367: --- Is there a better way to do it today? I see that this ticket has been open for almost 2 years. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Semet >Priority: Major > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rationale* > It is recommended, in order to deploy packages written in Scala, to > build big fat jar files. This allows you to have all dependencies in one package, > so the only "cost" is the copy time to deploy this file on every Spark node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to mess with the IT to deploy > the packages into the virtualenv of each node. > This ticket proposes to give users the ability to deploy their jobs as > "wheel" packages. The Python community is strongly advocating to promote > this way of packaging and distributing Python applications as the "standard way > of deploying Python apps". In other words, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal on the two following bugs related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587 ("Support virtualenv in PySpark") > The first part of my proposal is to merge the two, in order to support wheel installation > and virtualenv creation. > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheel" file format, which goes > further than the good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. 
You can have several wheels for > a given package version, each specific to an architecture or environment. > For example, look at https://pypi.python.org/pypi/numpy for all the different > versions of wheel available. > The {{pip}} tool knows how to select the right wheel file matching the > current system, and how to install the package at light speed (without > compilation). Said otherwise, a package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installed > from a wheel file. > {{pypi.python.org}} already provides wheels for major Python versions. If the > wheel is not available, pip will compile it from source anyway. Mirroring of > Pypi is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the Pypi mirror support on Artifactory (tested personally). > {{pip}} also provides the ability to easily generate all the wheels of all > packages used for a given project which is inside a "virtualenv". This is > called a "wheelhouse". You can even skip this compilation and > retrieve the wheels directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here is my first proposal for a deployment workflow, in the case where the Spark > cluster does not have any internet connectivity or access to a Pypi mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then ship the complete "wheelhouse": > - you are writing a PySpark script that increases in terms of size and > dependencies. Deploying on Spark, for example, requires building numpy or > Theano and other dependencies > - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid, pip-tools, pylint, unittest2 > spark-testing-base==0.0.7.post2 > traceback2==1.4.0 # via unittest2 > unittest2==1.1.0 # via spark-testing-base > wheel==0.29.0 > wrapt==1.10.8 # via astroid > {code} > -- write a setup.py with some entry points or package. Use >
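The workflow the proposal describes can be sketched as a command pipeline. This is a hedged sketch: the `pip wheel` flags (`-r`, `-w`) are standard, but the archive name, job file, and the exact `spark-submit` distribution step are hypothetical stand-ins for what the ticket proposes.

```python
# Hedged sketch of the "ship a wheelhouse" workflow: freeze dependencies,
# build all wheels into a local directory, bundle it, and send the bundle
# alongside the job so executors can pip-install offline.
def wheelhouse_commands(requirements="requirements.txt",
                        wheelhouse="wheelhouse",
                        job="my_job.py"):
    return [
        ["pip", "wheel", "-r", requirements, "-w", wheelhouse],  # build wheels
        ["zip", "-r", f"{wheelhouse}.zip", wheelhouse],          # bundle them
        ["spark-submit", "--master", "yarn",                     # ship with job
         "--files", f"{wheelhouse}.zip", job],
    ]
```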
[jira] [Created] (SPARK-24438) Empty strings and null strings are written to the same partition
Mukul Murthy created SPARK-24438: Summary: Empty strings and null strings are written to the same partition Key: SPARK-24438 URL: https://issues.apache.org/jira/browse/SPARK-24438 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Mukul Murthy When you partition on a string column that has empty strings and nulls, they are both written to the same default partition. When you read the data back, all those values get read back as null. {code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.catalyst.encoders.RowEncoder val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null)) val schema = new StructType().add("a", IntegerType).add("b", StringType) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) display(df) => a b 1 2 3 4 hello 5 null df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4") val df2 = spark.read.load("/home/mukul/weird_test_data4") display(df2) => a b 4 hello 3 null 2 null 1 null 5 null {code} Seems to affect multiple types of tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24333: Assignee: Apache Spark > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24333: Assignee: (was: Apache Spark) > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495904#comment-16495904 ] Apache Spark commented on SPARK-24333: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21465 > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495843#comment-16495843 ] Shixiong Zhu commented on SPARK-23649: -- [~cloud_fan] looks like this is fixed? > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf8xFF.csv > > > Schema inference for CSV files fails if the file contains a character starting with > *0xFF*. > {code:java} > spark.read.option("header", "true").csv("utf8xFF.csv") > {code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 63 > at > org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191) > at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206) > {code} > Here is the content of the file: > {code:java} > hexdump -C ~/tmp/utf8xFF.csv > 63 68 61 6e 6e 65 6c 2c 63 6f 64 65 0d 0a 55 6e |channel,code..Un| > 0010 69 74 65 64 2c 31 32 33 0d 0a 41 42 47 55 4e ff |ited,123..ABGUN.| > 0020 2c 34 35 36 0d|,456.| > 0025 > {code} > Schema inference doesn't fail in multiline mode: > {code} > spark.read.option("header", "true").option("multiline", > "true").csv("utf8xFF.csv") > {code} > {code:java} > +---+-+ > |channel|code > +---+-+ > | United| 123 > | ABGUN�| 456 > +---+-+ > {code} > and Spark is able to read the csv file if the schema is specified: > {code} > import org.apache.spark.sql.types._ > val schema = new StructType().add("channel", StringType).add("code", > StringType) > spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show > {code} > {code:java} > +---++ > |channel|code| > +---++ > | United| 123| > | ABGUN�| 456| > +---++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
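The `ArrayIndexOutOfBoundsException: 63` above is consistent with a lead-byte lookup table that omits the two byte values (0xFE and 0xFF) that are never valid UTF-8 lead bytes. A simplified Python sketch of a `numBytesForFirstByte`-style lookup (illustrative; the actual table layout in `UTF8String` may differ):

```python
# Simplified sketch (not Spark's exact code). Multi-byte UTF-8 lead bytes
# are 0xC0..0xFD; if the lookup table only covers those 62 values, the
# invalid lead bytes 0xFE/0xFF index past the end of the table.
# 0xFF - 192 = 63, matching the ArrayIndexOutOfBoundsException: 63 above.
BYTES_FOR_LEAD = (
    [2] * 32 +  # 0xC0..0xDF: 2-byte sequences
    [3] * 16 +  # 0xE0..0xEF: 3-byte sequences
    [4] * 8 +   # 0xF0..0xF7: 4-byte sequences
    [5] * 4 +   # 0xF8..0xFB: 5-byte (historical)
    [6] * 2     # 0xFC..0xFD: 6-byte (historical)
)  # 62 entries; 0xFE and 0xFF are absent

def num_bytes_for_first_byte(b: int) -> int:
    offset = (b & 0xFF) - 192
    return BYTES_FOR_LEAD[offset] if offset >= 0 else 1  # 0xFF -> IndexError

print(num_bytes_for_first_byte(0x63))  # 1 (ASCII 'c')
print(num_bytes_for_first_byte(0xC3))  # 2
```

This also explains why specifying the schema avoids the crash: with an explicit schema, inference never walks the malformed bytes character by character.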
[jira] [Resolved] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN
[ https://issues.apache.org/jira/browse/SPARK-24276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24276. - Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 2.4.0 > semanticHash() returns different values for semantically the same IS IN > --- > > Key: SPARK-24276 > URL: https://issues.apache.org/jira/browse/SPARK-24276 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > When a plan is canonicalized any set-based operation, such as IS IN, should > have its expressions ordered as the order of expressions does not matter in > the evaluation of the operator. > For instance: > {code:scala} > val df = spark.createDataFrame(Seq((1, 2))) > val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized > val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized > val h1 = p1.semanticHash > val h2 = p2.semanticHash > {code} > {code} > df: org.apache.spark.sql.DataFrame = [_1: int, _2: int] > p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > 'Filter '_1 IN (1,2) > +- LocalRelation [_1#0, _2#1] > p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > 'Filter '_1 IN (2,1) > +- LocalRelation [_1#0, _2#1] > h1: Int = -1384236508 > h2: Int = 939549189 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
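The core idea of the fix described above — ordering set-semantics operands during canonicalization — can be sketched in a few lines (illustrative Python, not Spark's Catalyst code; `semantic_hash_in` is a hypothetical helper):

```python
# Sketch of the canonicalization idea (not Spark's implementation): for a
# set-semantics operator like IN, sort the value list into a deterministic
# order before hashing, so semantically equal plans get equal hashes.
def semantic_hash_in(column: str, values) -> int:
    # key=repr gives a stable ordering even for mixed literal types
    canonical = (column, "IN", tuple(sorted(values, key=repr)))
    return hash(canonical)

h1 = semantic_hash_in("_1", [1, 2])
h2 = semantic_hash_in("_1", [2, 1])
print(h1 == h2)  # True: the order of the IN list no longer matters
```

Without the sort, `'_1 IN (1,2)` and `'_1 IN (2,1)` canonicalize to different trees and therefore different hashes, exactly as shown by `h1` and `h2` in the report.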
[jira] [Assigned] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN
[ https://issues.apache.org/jira/browse/SPARK-24276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24276: --- Assignee: Marco Gaido (was: Maxim Gekk) > semanticHash() returns different values for semantically the same IS IN > --- > > Key: SPARK-24276 > URL: https://issues.apache.org/jira/browse/SPARK-24276 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > When a plan is canonicalized any set-based operation, such as IS IN, should > have its expressions ordered as the order of expressions does not matter in > the evaluation of the operator. > For instance: > {code:scala} > val df = spark.createDataFrame(Seq((1, 2))) > val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized > val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized > val h1 = p1.semanticHash > val h2 = p2.semanticHash > {code} > {code} > df: org.apache.spark.sql.DataFrame = [_1: int, _2: int] > p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > 'Filter '_1 IN (1,2) > +- LocalRelation [_1#0, _2#1] > p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > 'Filter '_1 IN (2,1) > +- LocalRelation [_1#0, _2#1] > h1: Int = -1384236508 > h2: Int = 939549189 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495779#comment-16495779 ] Stavros Kontopoulos edited comment on SPARK-24434 at 5/30/18 10:12 PM: --- [~eje] I agree will give it a shot and try exhaust options, will see if I can come up with pros and cons etc. This also blocks the affinity/anti-affinity work. was (Author: skonto): [~eje] I agree will give it a shot and try exhaust options, will see if I can come up with pros and cons etc. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495779#comment-16495779 ] Stavros Kontopoulos commented on SPARK-24434: - [~eje] I agree will give it a shot and try exhaust options, will see if I can come up with pros and cons etc. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bimalendu Choudhary updated SPARK-24316: Summary: Spark sql queries stall for column width more than 6k for parquet based table (was: Spark sql queries stall for column width more 6k for parquet based table) > Spark sql queries stall for column width more than 6k for parquet based table > -- > > Key: SPARK-24316 > URL: https://issues.apache.org/jira/browse/SPARK-24316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0 >Reporter: Bimalendu Choudhary >Priority: Major > > When we create a table from a data frame using Spark SQL with around > 6k columns or more, even a simple query fetching 70k rows takes 20 minutes, while > for the same table created through Hive with the same data, the same query > takes just 5 minutes. > > Instrumenting the code, we see that the executors are looping in the while > loop of the function initializeInternal(). The majority of the time is > spent in the for loop in the code below, looping through the columns, and the > executor appears to be stalled for a long time. > > {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid} > private void initializeInternal() .. > .. > for (int i = 0; i < requestedSchema.getFieldCount(); ++i) > { ... } > } > {code} > > When Spark SQL creates the table, it also stores the metadata in > TBLPROPERTIES in JSON format. We see that if we remove this metadata from the > table, the queries become fast, which is the case when we create the same > table through Hive. The exact same table takes 5 times longer with the > JSON metadata than without it. 
> > So it looks like, as the number of columns grows beyond 5 to 6k, the > processing and comparison of the metadata becomes more and more expensive > and performance degrades drastically. > To recreate the problem, simply run the following query: > import org.apache.spark.sql.SparkSession > val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7") > resp_data.write.format("csv").save("/tmp/filename") > > The table should be created by Spark SQL from a DataFrame so that the JSON > metadata is stored. For example: > val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv") > dff.createOrReplaceTempView("my_temp_table") > val tmp = spark.sql("Create table tableName stored as parquet as select * > from my_temp_table") > > > from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > resp_data = spark.sql("select * from test").limit(2000) > print(resp_data.count()) > (resp_data.write.option('header', > False).mode('overwrite').csv('/tmp/2.csv') ) > > > The performance seems to be even slower in the loop if the schema does not > match or the fields are empty, and the code goes into the if condition where > the missing column is marked true: > missingColumns[i] = true; > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
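The size of the JSON schema blob stored in TBLPROPERTIES grows linearly with the column count, which is consistent with the slowdown only appearing past a few thousand columns. A rough illustration in plain Python, using a Spark-style struct JSON layout (sizes are indicative only, not measurements from Spark itself):

```python
import json

def schema_json(num_cols: int) -> str:
    """Build a Spark-style JSON schema blob for num_cols string columns."""
    fields = [{"name": f"c{i}", "type": "string", "nullable": True, "metadata": {}}
              for i in range(num_cols)]
    return json.dumps({"type": "struct", "fields": fields})

small = schema_json(10)
wide = schema_json(6000)
# The 6k-column blob is hundreds of times larger than the 10-column one,
# so any per-column work that touches it repeatedly gets expensive fast.
print(len(small), len(wide))
```

If that blob is consulted for each of the ~6k iterations of the `initializeInternal()` loop above, the cost compounds, matching the stall the reporter observes.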
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495695#comment-16495695 ] Li Jin commented on SPARK-24373: [~smilegator] Thank you for the suggestion. > "df.cache() df.count()" no longer eagerly caches data when the analyzed plans > are different after re-analyzing the plans > > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Assignee: Marco Gaido >Priority: Blocker > Fix For: 2.3.1, 2.4.0 > > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
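The cache miss can be modeled as a lookup keyed on the analyzed plan (a toy Python sketch, not Spark's `CacheManager`): the analyzed plans above differ in that re-analysis wraps the UDF in a second null check, so a key computed from the re-analyzed plan no longer matches the cached one.

```python
# Toy model of cache lookup by plan equality (not Spark's code). The cache
# is keyed on the (canonicalized) analyzed plan; if re-analysis rewrites
# the plan even slightly, the lookup misses and the data is recomputed.
cache = {}

def cache_plan(plan: str) -> None:
    cache[plan] = "InMemoryRelation"

def lookup(plan: str):
    return cache.get(plan)

cached_plan = "Project [if (isnull(value)) null else UDF(value)]"
cache_plan(cached_plan)

# The same query re-analyzed picks up a second null-check wrapper,
# mirroring the doubled "if (isnull(...))" in the analyzed plan above:
reanalyzed = "Project [if (isnull(value)) null else if (isnull(value)) null else UDF(value)]"
print(lookup(cached_plan))  # InMemoryRelation
print(lookup(reanalyzed))   # None -> cache miss, plan recomputed
```

This is why the missing `InMemoryTableScan` in the second `explain()` output corresponds to the UDF being re-executed instead of served from cache.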
[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans
[ https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495691#comment-16495691 ] Xiao Li commented on SPARK-24373: - [~icexelloss] This is still possible since the query plans are changed. I am also fine to do it without a flag. If you apply the fix to your internal fork, I would suggest to add a flag. At least, you can turn it off when anything unexpected happens. > "df.cache() df.count()" no longer eagerly caches data when the analyzed plans > are different after re-analyzing the plans > > > Key: SPARK-24373 > URL: https://issues.apache.org/jira/browse/SPARK-24373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenbo Zhao >Assignee: Marco Gaido >Priority: Blocker > Fix For: 2.3.1, 2.4.0 > > > Here is the code to reproduce in local mode > {code:java} > scala> val df = sc.range(1, 2).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> val myudf = udf({x: Long => println(""); x + 1}) > myudf: org.apache.spark.sql.expressions.UserDefinedFunction = > UserDefinedFunction(,LongType,Some(List(LongType))) > scala> val df1 = df.withColumn("value1", myudf(col("value"))) > df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint] > scala> df1.cache > res0: df1.type = [value: bigint, value1: bigint] > scala> df1.count > res1: Long = 1 > scala> df1.count > res2: Long = 1 > scala> df1.count > res3: Long = 1 > {code} > > in Spark 2.2, you could see it prints "". > In the above example, when you do explain. 
You could see > {code:java} > scala> df1.explain(true) > == Parsed Logical Plan == > 'Project [value#2L, UDF('value) AS value1#5] > +- AnalysisBarrier > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > value: bigint, value1: bigint > Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > == Physical Plan == > *(1) InMemoryTableScan [value#2L, value1#5L] > +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L] > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > but the InMemoryTableScan is missing in the following explain() > {code:java} > scala> df1.groupBy().count().explain(true) > == Parsed Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS > value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Analyzed Logical Plan == > count: bigint > Aggregate [count(1) AS count#170L] > +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) > null else UDF(value#2L) AS value1#5L] > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Optimized Logical Plan == > Aggregate [count(1) AS count#170L] > +- Project > +- SerializeFromObject [input[0, bigint, false] AS value#2L] > +- ExternalRDD [obj#1L] > == Physical Plan == > *(2) HashAggregate(keys=[], 
functions=[count(1)], output=[count#170L]) > +- Exchange SinglePartition > +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#175L]) > +- *(1) Project > +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L] > +- Scan ExternalRDDScan[obj#1L] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24436) Add large dataset to examples sub-directory.
[ https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495684#comment-16495684 ] Varun Vishwanathan commented on SPARK-24436: Going to add a parquet file with the data. > Add large dataset to examples sub-directory. > > > Key: SPARK-24436 > URL: https://issues.apache.org/jira/browse/SPARK-24436 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.3.1 >Reporter: Varun Vishwanathan >Priority: Trivial > > Fire Incidents from San Francisco is a large dataset that may help developers > work with more data when fixing bugs or developing features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24437) Memory leak in UnsafeHashedRelation
[ https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gagan taneja updated SPARK-24437: - Attachment: Screen Shot 2018-05-30 at 2.07.22 PM.png > Memory leak in UnsafeHashedRelation > --- > > Key: SPARK-24437 > URL: https://issues.apache.org/jira/browse/SPARK-24437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: gagan taneja >Priority: Critical > Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot > 2018-05-30 at 2.07.22 PM.png > > > There seems to be a memory leak in > org.apache.spark.sql.execution.joins.UnsafeHashedRelation > We have a long-running instance of STS. > With each query execution requiring Broadcast Join, UnsafeHashedRelation is > getting added for cleanup in ContextCleaner. This reference to > UnsafeHashedRelation is being held in some other collection and not becoming > eligible for GC, and because of this the ContextCleaner is not able to clean it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24437) Memory leak in UnsafeHashedRelation
[ https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gagan taneja updated SPARK-24437: - Attachment: Screen Shot 2018-05-30 at 2.05.40 PM.png > Memory leak in UnsafeHashedRelation > --- > > Key: SPARK-24437 > URL: https://issues.apache.org/jira/browse/SPARK-24437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: gagan taneja >Priority: Critical > Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png > > > There seems to be a memory leak in > org.apache.spark.sql.execution.joins.UnsafeHashedRelation > We have a long-running instance of STS. > With each query execution requiring Broadcast Join, UnsafeHashedRelation is > getting added for cleanup in ContextCleaner. This reference to > UnsafeHashedRelation is being held in some other collection and not becoming > eligible for GC, and because of this the ContextCleaner is not able to clean it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24437) Memory leak in UnsafeHashedRelation
gagan taneja created SPARK-24437: Summary: Memory leak in UnsafeHashedRelation Key: SPARK-24437 URL: https://issues.apache.org/jira/browse/SPARK-24437 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: gagan taneja There seems to be a memory leak in org.apache.spark.sql.execution.joins.UnsafeHashedRelation We have a long-running instance of STS. With each query execution requiring Broadcast Join, UnsafeHashedRelation is getting added for cleanup in ContextCleaner. This reference to UnsafeHashedRelation is being held in some other collection and not becoming eligible for GC, and because of this the ContextCleaner is not able to clean it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
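The leak pattern described — a weakly tracked object kept alive by a stray strong reference — can be reproduced in miniature. This is a Python analogy of the mechanism, not Spark's ContextCleaner code:

```python
import gc
import weakref

class HashedRelation:
    """Stand-in for UnsafeHashedRelation."""

rel = HashedRelation()
tracked = weakref.ref(rel)   # ContextCleaner-style weak tracking for cleanup
other_collection = [rel]     # the stray strong reference described above

rel = None
gc.collect()
leaked = tracked() is not None  # True: still strongly held, cleaner can't run
print(leaked)

other_collection.clear()
gc.collect()
collected = tracked() is None   # True once the stray reference is dropped
print(collected)
```

As long as any other collection strongly references the relation, the weak-reference-based cleanup never fires, which matches the heap screenshots attached to the issue.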
[jira] [Created] (SPARK-24436) Add large dataset to examples sub-directory.
Varun Vishwanathan created SPARK-24436: -- Summary: Add large dataset to examples sub-directory. Key: SPARK-24436 URL: https://issues.apache.org/jira/browse/SPARK-24436 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 2.3.1 Reporter: Varun Vishwanathan Fire Incidents from San Francisco is a large dataset that may help developers work with more data when fixing bugs or developing features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495642#comment-16495642 ] Erik Erlandson commented on SPARK-24434: [~skonto] given the number of ideas that have gotten tossed around for this over time, an 'alternatives considered' section for a design doc will definitely be valuable > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495637#comment-16495637 ] Yinan Li commented on SPARK-24434: -- [~eje] That's a good question. I think we need to compare both and have a thorough discussion in the community once the design is out. There are pros and cons with each of them. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495619#comment-16495619 ] Erik Erlandson commented on SPARK-24091: If we support user-supplied yaml, that may become a source of config map specifications > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced an internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}} and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This is worth further discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495615#comment-16495615 ] Erik Erlandson commented on SPARK-24434: Is the template-based solution being explicitly favored over other options, e.g. pod presets or webhooks, etc? > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions
[ https://issues.apache.org/jira/browse/SPARK-24435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson resolved SPARK-24435. Resolution: Duplicate > Support user-supplied YAML that can be merged with k8s pod descriptions > --- > > Key: SPARK-24435 > URL: https://issues.apache.org/jira/browse/SPARK-24435 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Erik Erlandson >Priority: Major > Labels: features, kubernetes > Fix For: 2.4.0 > > > Kubernetes supports a large variety of configurations to Pods. Currently only > some of these are configurable from Spark, and they all operate by being > plumbed from --conf arguments through to pod creation in the code. > To avoid the anti-pattern of trying to expose an unbounded Pod feature set > through Spark configuration keywords, the community is interested in working > out a sane way of allowing users to supply "arbitrary" Pod YAML which can be > merged with the pod configurations created by the kube backend. > Multiple solutions have been considered, including Pod Pre-sets and loading > Pod template objects. A requirement is that the policy for how user-supplied > YAML interacts with the configurations created by the kube back-end must be > easy to reason about, and also that whatever Kubernetes features the solution > uses are supported on the Kubernetes roadmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions
Erik Erlandson created SPARK-24435: -- Summary: Support user-supplied YAML that can be merged with k8s pod descriptions Key: SPARK-24435 URL: https://issues.apache.org/jira/browse/SPARK-24435 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.3.0 Reporter: Erik Erlandson Fix For: 2.4.0 Kubernetes supports a large variety of configurations to Pods. Currently only some of these are configurable from Spark, and they all operate by being plumbed from --conf arguments through to pod creation in the code. To avoid the anti-pattern of trying to expose an unbounded Pod feature set through Spark configuration keywords, the community is interested in working out a sane way of allowing users to supply "arbitrary" Pod YAML which can be merged with the pod configurations created by the kube backend. Multiple solutions have been considered, including Pod Pre-sets and loading Pod template objects. A requirement is that the policy for how user-supplied YAML interacts with the configurations created by the kube back-end must be easy to reason about, and also that whatever Kubernetes features the solution uses are supported on the Kubernetes roadmap. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
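The merge-policy question raised above (how user-supplied YAML should interact with the pod configuration created by the kube backend) can be made concrete with a toy recursive overlay. This is an illustrative sketch only; `merge_pod_spec` and the sample dicts are hypothetical and do not correspond to any actual or proposed Spark API:

```python
def merge_pod_spec(base, overlay):
    """Recursively overlay one pod-spec-like dict onto another.
    Scalar and list values from `overlay` win; nested dicts are merged."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pod_spec(merged[key], value)
        else:
            merged[key] = value
    return merged

# User-supplied template as the base, backend-generated fields on top.
user = {"metadata": {"labels": {"team": "data"}},
        "spec": {"nodeSelector": {"disk": "ssd"}}}
backend = {"metadata": {"labels": {"spark-role": "driver"}},
           "spec": {"containers": [{"name": "spark-kubernetes-driver"}]}}

merged = merge_pod_spec(user, backend)
print(merged["metadata"]["labels"])  # {'team': 'data', 'spark-role': 'driver'}
```

A real policy would also have to define list semantics (e.g. containers keyed by name, as in Kubernetes strategic merge patch), which is exactly the kind of rule the issue says must be easy to reason about.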
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495601#comment-16495601 ] Stavros Kontopoulos edited comment on SPARK-24434 at 5/30/18 7:55 PM: -- [~liyinan926] I will work on a design document. From a first glance K8s pod spec could be enough or a subset of it at least. was (Author: skonto): [~liyinan926] I will work on a design document. From a first glance K8s pod spec could be enough or a subset of it. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495601#comment-16495601 ] Stavros Kontopoulos commented on SPARK-24434: - [~liyinan926] I will work on a design document. From a first glance K8s pod spec could be enough or a subset of it. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24434) Support user-specified driver and executor pod templates
Yinan Li created SPARK-24434: Summary: Support user-specified driver and executor pod templates Key: SPARK-24434 URL: https://issues.apache.org/jira/browse/SPARK-24434 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li With more requests for customizing the driver and executor pods coming, the current approach of adding new Spark configuration options has some serious drawbacks: 1) it means more Kubernetes specific configuration options to maintain, and 2) it widens the gap between the declarative model used by Kubernetes and the configuration model used by Spark. We should start designing a solution that allows users to specify pod templates as central places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495580#comment-16495580 ] Ted Yu commented on SPARK-18057: There are compilation errors in KafkaTestUtils.scala against Kafka 1.0.1 release. What's the preferred way forward: * use reflection in KafkaTestUtils.scala * create external/kafka-1-0-sql module Thanks > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24433) Add Spark R support
Yinan Li created SPARK-24433: Summary: Add Spark R support Key: SPARK-24433 URL: https://issues.apache.org/jira/browse/SPARK-24433 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li This is the ticket to track work on adding support for R binding into the Kubernetes mode. The feature is available in our fork at github.com/apache-spark-on-k8s/spark and needs to be upstreamed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24432) Support for dynamic resource allocation
Yinan Li created SPARK-24432: Summary: Support for dynamic resource allocation Key: SPARK-24432 URL: https://issues.apache.org/jira/browse/SPARK-24432 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li This is an umbrella ticket for work on adding support for dynamic resource allocation into the Kubernetes mode. This requires a Kubernetes-specific external shuffle service. The feature is available in our fork at github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24432: - Summary: Add support for dynamic resource allocation (was: Support for dynamic resource allocation) > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24417) Build and Run Spark on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-24417: - Description: This is an umbrella JIRA for Apache Spark to support Java 9+ As Java 8 is going away soon, Java 11 will be LTS and GA this Sep, and many companies are testing Java 9 or Java 10 to prepare for Java 11, it's best to start the traumatic process of supporting newer versions of Java in Apache Spark as a background activity. The subtasks are what have to be done to support Java 9+. was: This is an umbrella JIRA for Apache Spark to support JDK9+ As Java 8 is going way soon, JDK11 will be LTS and GA this Sep, and companies are testing JDK9 or JDK10 to prepare for JDK11. It's best to start the traumatic process of supporting newer version of JDK in Apache Spark as a background activity. The subtasks are what have to be done to support JDK9+. > Build and Run Spark on JDK9+ > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > This is an umbrella JIRA for Apache Spark to support Java 9+ > As Java 8 is going away soon, Java 11 will be LTS and GA this Sep, and many > companies are testing Java 9 or Java 10 to prepare for Java 11, it's best to > start the traumatic process of supporting newer versions of Java in Apache > Spark as a background activity. > The subtasks are what have to be done to support Java 9+. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23901. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21246 [https://github.com/apache/spark/pull/21246] > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
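For reference, the Hive masking functions listed above replace characters by class; per the Hive UDF manual linked in the issue, the defaults are 'X' for upper-case letters, 'x' for lower-case letters, and 'n' for digits. A minimal Python sketch of that default behavior (illustrative only, not Spark's implementation):

```python
def mask(s, upper_char="X", lower_char="x", digit_char="n"):
    """Replace upper-case letters, lower-case letters, and digits with
    fixed replacement characters; leave all other characters as-is."""
    out = []
    for ch in s:
        if ch.isupper():
            out.append(upper_char)
        elif ch.islower():
            out.append(lower_char)
        elif ch.isdigit():
            out.append(digit_char)
        else:
            out.append(ch)
    return "".join(out)

def mask_show_first_n(s, n=4):
    """Keep the first n characters visible and mask the rest."""
    return s[:n] + mask(s[n:])

print(mask("Card-1234"))                  # Xxxx-nnnn
print(mask_show_first_n("Card-1234", 4))  # Card-nnnn
```

The remaining functions in the list (mask_first_n, mask_last_n, mask_show_last_n, mask_hash) are variations on the same idea, selecting which slice of the string is masked or hashing it instead.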
[jira] [Assigned] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23901: - Assignee: Marco Gaido > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23161. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21413 [https://github.com/apache/spark/pull/21413] > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Huaxin Gao >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-23161: Assignee: Huaxin Gao > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Huaxin Gao >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24384. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21426 [https://github.com/apache/spark/pull/21426] > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > When the given Python file is a .py file (a zip file seems fine), the > Python path seems to be added dynamically only after the context is > initialized. > With this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
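The root cause reduces to import ordering: a module can only be imported once its location is on sys.path, and in the failing case above that registration happens as a side effect of session creation. The timing can be illustrated without Spark at all; here a throwaway module stands in for the --py-files dependency:

```python
import os
import sys
import tempfile

# Create a throwaway module to stand in for the --py-files dependency.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "tmpmod.py"), "w") as f:
    f.write("def testtest():\n    return 1\n")

# Importing before the directory is registered fails...
try:
    import tmpmod
except ImportError:
    print("import failed: path not registered yet")

# ...and succeeds once the directory is on sys.path -- the same ordering
# that SparkSession.builder.getOrCreate() triggers for --py-files in
# client mode, per the issue description.
sys.path.insert(0, tmpdir)
import tmpmod
print(tmpmod.testtest())  # 1
```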
[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization
[ https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24384: -- Assignee: Hyukjin Kwon > spark-submit --py-files with .py files doesn't work in client mode before > context initialization > > > Key: SPARK-24384 > URL: https://issues.apache.org/jira/browse/SPARK-24384 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 2.3.0, 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > When the given Python file is a .py file (a zip file seems fine), the > Python path seems to be added dynamically only after the context is > initialized. > With this pyFile: > {code} > $ cat /home/spark/tmp.py > def testtest(): > return 1 > {code} > This works: > {code} > $ cat app.py > import pyspark > pyspark.sql.SparkSession.builder.getOrCreate() > import tmp > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > ... > 1 > {code} > but this doesn't: > {code} > $ cat app.py > import pyspark > import tmp > pyspark.sql.SparkSession.builder.getOrCreate() > print("%s" % tmp.testtest()) > $ ./bin/spark-submit --master yarn --deploy-mode client --py-files > /home/spark/tmp.py app.py > Traceback (most recent call last): > File "/home/spark/spark/app.py", line 2, in > import tmp > ImportError: No module named tmp > {code} > See > https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7
[ https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-24419. - Resolution: Fixed > Upgrade SBT to 0.13.17 with Scala 2.10.7 > > > Key: SPARK-24419 > URL: https://issues.apache.org/jira/browse/SPARK-24419 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23754: - Fix Version/s: 2.3.1 > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Emilio Dorigatti >Priority: Blocker > Fix For: 2.3.1, 2.4.0 > > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24369) A bug when having multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24369. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21443 [https://github.com/apache/spark/pull/21443] > A bug when having multiple distinct aggregations > > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24369) A bug when having multiple distinct aggregations
[ https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24369: --- Assignee: Takeshi Yamamuro > A bug when having multiple distinct aggregations > > > Key: SPARK-24369 > URL: https://issues.apache.org/jira/browse/SPARK-24369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > {code} > SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM > (VALUES >(1, 1), >(2, 2), >(2, 2) > ) t(x, y) > {code} > It returns > {code} > java.lang.RuntimeException > You hit a query analyzer bug. Please report your query to Spark user mailing > list. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
Xinyong Tian created SPARK-24431: Summary: wrong areaUnderPR calculation in BinaryClassificationEvaluator Key: SPARK-24431 URL: https://issues.apache.org/jira/browse/SPARK-24431 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Xinyong Tian My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select the best model. When the regParam in logistic regression is very high, no variable is selected (no model), i.e., every row's prediction is the same, e.g., equal to the event rate (baseline frequency). But at this point, BinaryClassificationEvaluator scores areaUnderPR highest, so the best model selected is effectively no model. The reason is the following: with no model, the precision-recall curve has only two points. At recall=0, precision should be set to zero, but the software sets it to 1; at recall=1, precision equals the event rate. As a result, areaUnderPR will be close to 0.5 (my event rate is very low), which is the maximum. The solution is to set precision=0 when recall=0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
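The reporter's arithmetic can be checked directly. With a constant prediction the PR curve degenerates to two points, (recall=0, p0) and (recall=1, event rate); assuming linear (trapezoidal) interpolation between them, the two conventions for p0 give very different areas. This is an illustrative sketch of the two-point argument, not Spark's exact interpolation:

```python
def area_under_pr_two_points(precision_at_zero_recall, event_rate):
    # Trapezoidal area between (recall=0, p0) and (recall=1, event_rate).
    return 0.5 * (precision_at_zero_recall + event_rate)

event_rate = 0.01  # rare positive class

# Convention criticized in the issue: precision at recall=0 pinned to 1.0.
inflated = area_under_pr_two_points(1.0, event_rate)
# Proposed convention: precision at recall=0 set to 0.0.
corrected = area_under_pr_two_points(0.0, event_rate)

print(inflated)   # 0.505 -- close to 0.5, spuriously high for a no-op model
print(corrected)  # 0.005 -- close to the event rate, as expected
```

With a low event rate the pinned convention makes the constant predictor score near 0.5, which is why CrossValidator ends up preferring the degenerate model.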
[jira] [Commented] (SPARK-24430) CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'.
[ https://issues.apache.org/jira/browse/SPARK-24430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495374#comment-16495374 ] Volodymyr Glushak commented on SPARK-24430: --- I wonder if that is legal code, and whether this request should be moved to the HIVE JIRA. > CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'. > > > Key: SPARK-24430 > URL: https://issues.apache.org/jira/browse/SPARK-24430 > Project: Spark > Issue Type: Request > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Volodymyr Glushak >Priority: Major > > When I execute the following SQL statement: > {code} > spark.sql('CREATE VIEW view_12 AS > SELECT * FROM ( > SELECT * FROM table1 > UNION ALL > SELECT * FROM table2 > ) UT') > {code} > > It successfully creates a view in HIVE, which I can query via Apache Spark. > However, if I try to query the same view directly via HIVE, I get an > error: > {code} > org.apache.hive.service.cli.HiveSQLException: Error while compiling > statement: FAILED: ParseException line 6:6 Failed to recognize predicate > 'UNION'. > Failed rule: 'identifier' in subquery source > {code} > > *Investigation* > Under the hood, Spark generates the following SQL statement for this view: > {code} > CREATE VIEW `view_12` AS > SELECT * > FROM (SELECT * FROM > ( > (SELECT * FROM (SELECT * FROM `db1`.`table1`) AS > gen_subquery_0) > UNION ALL > (SELECT * FROM (SELECT * FROM `db1`.`tabl2`) AS > gen_subquery_1) > ) AS UT > ) AS UT > {code} > If I try to execute this statement in HIVE, it fails for the same reason. > The easiest way to fix it is to unwrap the outer queries on lines 5 and 7. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24430) CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'.
Volodymyr Glushak created SPARK-24430: - Summary: CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'. Key: SPARK-24430 URL: https://issues.apache.org/jira/browse/SPARK-24430 Project: Spark Issue Type: Request Components: Spark Core, SQL Affects Versions: 2.2.1 Reporter: Volodymyr Glushak When I execute the following SQL statement: {code} spark.sql('CREATE VIEW view_12 AS SELECT * FROM ( SELECT * FROM table1 UNION ALL SELECT * FROM table2 ) UT') {code} It successfully creates a view in HIVE, which I can query via Apache Spark. However, if I try to query the same view directly via HIVE, I get an error: {code} org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 6:6 Failed to recognize predicate 'UNION'. Failed rule: 'identifier' in subquery source {code} *Investigation* Under the hood, Spark generates the following SQL statement for this view: {code} CREATE VIEW `view_12` AS SELECT * FROM (SELECT * FROM ( (SELECT * FROM (SELECT * FROM `db1`.`table1`) AS gen_subquery_0) UNION ALL (SELECT * FROM (SELECT * FROM `db1`.`tabl2`) AS gen_subquery_1) ) AS UT ) AS UT {code} If I try to execute this statement in HIVE, it fails for the same reason. The easiest way to fix it is to unwrap the outer queries on lines 5 and 7. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495268#comment-16495268 ] Apache Spark commented on SPARK-23754: -- User 'e-dorigatti' has created a pull request for this issue: https://github.com/apache/spark/pull/21463 > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Emilio Dorigatti >Priority: Blocker > Fix For: 2.4.0 > > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
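A general remedy for this class of bug, and a sketch of the direction such a fix can take (not necessarily the exact change in the PR above), is to wrap the user function so that a StopIteration escaping it is converted into an ordinary error instead of quietly terminating the surrounding iteration with a partial result:

```python
import functools

def fail_on_stopiteration(f):
    """Wrap a user function so a StopIteration escaping it becomes a
    RuntimeError. Evaluation loops that consume the function through
    generators then fail loudly instead of quietly ending early."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError("StopIteration raised in user code") from exc
    return wrapper

@fail_on_stopiteration
def foo(x):
    raise StopIteration()

try:
    foo(1)
except RuntimeError as exc:
    print("task failed:", exc)  # task failed: StopIteration raised in user code
```

Normal return values pass through the wrapper unchanged, so it can be applied unconditionally to every user-supplied UDF.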
[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24428: Affects Version/s: (was: 2.3.0) 2.4.0 > Remove unused code and fix any related doc in K8s module > > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24415: -- Priority: Critical (was: Major) > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Critical > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495224#comment-16495224 ] Thomas Graves commented on SPARK-24415: --- ok so the issue here is in the AppStatusListener where its only updating the task metrics for liveStages. It gets the second taskEnd event after it cancelled stage 2 so its no longer in the live stages array. > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
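The failure mode described in the comment above can be illustrated with a minimal sketch. This is not Spark's actual AppStatusListener (the class and method names below are simplified assumptions); it only models the pattern the comment identifies: per-stage executor metrics live in a map of "live" stages, cancellation removes the stage from that map, and a taskEnd event arriving afterwards is silently dropped.

```python
# Simplified sketch of the reported bug: metrics are tracked only for live
# stages, so a late taskEnd (after stage cancellation) is lost.
class ListenerSketch:
    def __init__(self):
        self.live_stages = {}   # stage_id -> {executor_id: failed task count}
        self.aggregated = {}    # stage_id -> snapshot taken at completion

    def on_stage_submitted(self, stage_id):
        self.live_stages[stage_id] = {}

    def on_stage_completed(self, stage_id):
        # Cancellation removes the stage from the live set; the UI renders
        # whatever was accumulated up to this point.
        self.aggregated[stage_id] = self.live_stages.pop(stage_id, {})

    def on_task_end(self, stage_id, executor_id, failed):
        metrics = self.live_stages.get(stage_id)
        if metrics is None:
            return  # event for a non-live stage: the update is lost
        if failed:
            metrics[executor_id] = metrics.get(executor_id, 0) + 1


listener = ListenerSketch()
listener.on_stage_submitted(2)
listener.on_task_end(2, "2", failed=True)   # first failure is counted
listener.on_stage_completed(2)              # stage 2 cancelled by blacklisting
listener.on_task_end(2, "2", failed=True)   # second failure arrives too late
```

With this interleaving the aggregated metrics report one failed task for executor 2 even though two tasks failed, matching the screenshot described in the issue.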
[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885 ] Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 2:20 PM: -- I've done some experiments locally. But the results show that in above case seems we don't guarantee that the data distribution is the same when reading two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. That said, for FileScan nodes at (1) and (2): {code:java} *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are partitioned by {{key#28}}. But it doesn't guarantee that the same values of {{key#25}} and {{key#28}} are located in the same partition. was (Author: viirya): I've done some experiments locally. But the results show that in above case seems we don't guarantee that the data distribution is the same when reading two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. That said, for FileScan nodes at (1) and (2): {code:java} *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are partitioned by {{key#28}}. 
But it doesn't guarantee that the values of {{key#25}} and {{key#28}} are located in the same partition. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. we do this my unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). > for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > 
InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1)
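The planner-side check behind the shuffle above can be sketched as follows. The helper name and the tuple encoding of an output partitioning are assumptions for illustration, not Spark internals: the point is that the optimizer compares partitioning *expressions*, and the two scans expose different attributes (key#25L vs key#28L), so their outputs are not recognized as sharing a distribution even though both tables were bucketed with the same bucket count.

```python
# Illustrative sketch: compatibility is decided by expression equality on the
# reported output partitionings, not by inspecting where key values land.
def same_partitioning(p1, p2):
    # p1, p2 are (partitioning expression, number of buckets)
    return p1 == p2

a1_scan = ("key#25L", 3)   # FileScan (1): bucketed by key#25L into 3 buckets
a2_scan = ("key#28L", 3)   # FileScan (2): bucketed by key#28L into 3 buckets

# Different expressions -> the union's children are not treated as
# co-partitioned, and an Exchange (shuffle) is inserted before the aggregate.
needs_shuffle = not same_partitioning(a1_scan, a2_scan)
```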
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495204#comment-16495204 ] Thomas Graves commented on SPARK-24415: --- It also looks like in the history server they show up properly in the aggregated metrics, although if you look at the Tasks (for all stages) column on the jobs page, it only lists a single task failure where it should list 2. > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-24429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24429: Description: Right now in cluster mode only extraJavaOptions targeting the executor are set. According to the implementation and the docs: "In client mode, this config must not be set through the {{SparkConf}} directly in your application, because the driver JVM has already started at that point. Instead, please set this through the {{--driver-java-options}} command line option or in your default properties file." A typical driver launch in cluster mode will eventually use client mode to run the Spark-submit and looks like: "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1" Also at the entrypoint.sh file there is no management of the driver's java opts. We propose to set an env var to pass the extra java opts to the driver (like in the case of the executor), rename the env vars in the container as the one for the executor is a bit misleading, and use --driver-java-options to pass the required options. was: Right now in cluster mode only extraJavaOptions targeting the executor are set. According to the implementation and the docs: "In client mode, this config must not be set through the {{SparkConf}} directly in your application, because the driver JVM has already started at that point. Instead, please set this through the {{--driver-java-options}} command line option or in your default properties file." 
A typical driver launch in cluster mode will eventually use client mode to run the Spark-submit and looks like: "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1" Also at the entrypoint.sh file there is no management of the driver's java Opts. We propose to set an env var to pass the extra java opts to the driver (like in the case of the executor), rename the env vars in the container as the one for the executor is a bit misleading, and use --driver-java-options to pass the required options. > Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s > -- > > Key: SPARK-24429 > URL: https://issues.apache.org/jira/browse/SPARK-24429 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > > Right now in cluster mode only extraJavaOptions targeting the executor are > set. > According to the implementation and the docs: > "In client mode, this config must not be set through the {{SparkConf}} > directly in your application, because the driver JVM has already started at > that point. Instead, please set this through the {{--driver-java-options}} > command line option or in your default properties file." > A typical driver launch in cluster mode will eventually use client mode to > run the Spark-submit and looks like: > "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit > --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal 1" > Also at the entrypoint.sh file there is no management of the driver's java > opts. 
> We propose to set an env var to pass the extra java opts to the driver (like > in the case of the executor), rename the env vars in the container as the one > for the executor is a bit misleading, and use --driver-java-options to pass > the required options. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
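The proposal above can be sketched as a small function that assembles the client-mode spark-submit command. The env var name `SPARK_DRIVER_JAVA_OPTS` and the function name are assumptions for illustration (the actual variable naming is what the issue proposes to decide); the mechanism shown — forwarding the options via `--driver-java-options`, as the quoted docs require for client mode — is the one the issue describes.

```python
# Hypothetical sketch of the entrypoint logic: read driver JVM options from a
# container env var (name assumed) and pass them via --driver-java-options.
def build_submit_cmd(env):
    cmd = [
        "/usr/lib/jvm/java-1.8-openjdk/bin/java", "-cp",
        "/opt/spark/conf/:/opt/spark/jars/*", "-Xmx1g",
        "org.apache.spark.deploy.SparkSubmit", "--deploy-mode", "client",
    ]
    opts = env.get("SPARK_DRIVER_JAVA_OPTS")  # assumed env var name
    if opts:
        # Per the docs quoted above, client mode must use this flag rather
        # than setting spark.driver.extraJavaOptions in SparkConf.
        cmd += ["--driver-java-options", opts]
    cmd += ["--class", "org.apache.spark.examples.SparkPi",
            "spark-internal", "1"]
    return cmd


cmd = build_submit_cmd({"SPARK_DRIVER_JAVA_OPTS": "-Dlog4j.debug=true"})
```

This mirrors how executor options are already forwarded through the container environment.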
[jira] [Updated] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-24429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24429: Description: Right now in cluster mode only extraJavaOptions targeting the executor are set. According to the implementation and the docs: "In client mode, this config must not be set through the {{SparkConf}} directly in your application, because the driver JVM has already started at that point. Instead, please set this through the {{--driver-java-options}} command line option or in your default properties file." A typical driver launch in cluster mode will eventually use client mode to run the Spark-submit and looks like: "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1" Also at the entrypoint.sh file there is no management of the driver's java Opts. We propose to set an env var to pass the extra java opts to the driver (like in the case of the executor), rename the env vars in the container as the one for the executor is a bit misleading, and use --driver-java-options to pass the required options. was: Right now in cluster mode only extraJavaOptions targeting the executor are set. According to the implementation and the docs: "In client mode, this config must not be set through the {{SparkConf}} directly in your application, because the driver JVM has already started at that point. Instead, please set this through the {{--driver-java-options}} command line option or in your default properties file." 
A typical driver call now is of the form: "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1" Also at the entrypoint.sh file there is no management of the driver's java Opts. We propose to set an env var to pass the extra java opts to the driver (like in the case of the executor), rename the env vars in the container as the one for the executor is a bit misleading, and use --driver-java-options to pass the required options. > Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s > -- > > Key: SPARK-24429 > URL: https://issues.apache.org/jira/browse/SPARK-24429 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > > Right now in cluster mode only extraJavaOptions targeting the executor are > set. > According to the implementation and the docs: > "In client mode, this config must not be set through the {{SparkConf}} > directly in your application, because the driver JVM has already started at > that point. Instead, please set this through the {{--driver-java-options}} > command line option or in your default properties file." > A typical driver launch in cluster mode will eventually use client mode to > run the Spark-submit and looks like: > "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp > /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit > --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal 1" > Also at the entrypoint.sh file there is no management of the driver's java > Opts. 
> We propose to set an env var to pass the extra java opts to the driver (like > in the case of the executor), rename the env vars in the container as the one > for the executor is a bit misleading, and use --driver-java-options to pass > the required options. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495200#comment-16495200 ] Thomas Graves commented on SPARK-24415: --- this might actually be an order of events type thing. You will note that the config I have is stage.maxFailedTasksPerExecutor=1 so it should really only have 1 failed task, but looking at the log it seems it starts the second task before totally handling the blacklist from the first failure: 18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 bytes) [Stage 2:> (0 + 1) / 10]18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None) 18/05/30 13:57:21 INFO Client: Application report for application_1526529576371_25524 (state: RUNNING) 18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB) 18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 bytes) 18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad executor 18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2 18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2 18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled 18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at :26) failed in 12.063 s due to Job aborted due to stage failure: 18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at :26, took 12.069052 s The thing is though that the executors page shows that it had 2 task failures on that node, its just in the aggregated metrics for that stage that 
doesn't have it. > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
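The ordering race described in the comment above can be modeled with a short sketch. This is a deliberate simplification (the scheduler loop and function names are assumptions, not Spark's TaskSetManager): the scheduling decision for task 1.0 uses the blacklist state known at offer time, before the failure of task 0.0 has been handled, so the stage accumulates two failed tasks despite `spark.blacklist.stage.maxFailedTasksPerExecutor=1`.

```python
# Sketch of the interleaving seen in the log: offer task 0.0, offer task 1.0,
# then handle both failures; the blacklist update lands after the second offer.
limit = 1          # spark.blacklist.stage.maxFailedTasksPerExecutor
blacklisted = set()
failed = 0

def offer_task(executor):
    # Uses only the blacklist state known at offer time.
    return executor not in blacklisted

def handle_failure(executor):
    global failed
    failed += 1
    if failed >= limit:
        blacklisted.add(executor)

started = []
if offer_task("2"):
    started.append("0.0")
if offer_task("2"):
    started.append("1.0")   # blacklist not yet updated: task is launched
handle_failure("2")          # task 0.0 fails; executor 2 is now blacklisted
handle_failure("2")          # task 1.0 was already running, so it fails too
```

Under this interleaving `failed` ends at 2, consistent with the two failures the executors page reports for the node.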
[jira] [Created] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
Stavros Kontopoulos created SPARK-24429: --- Summary: Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s Key: SPARK-24429 URL: https://issues.apache.org/jira/browse/SPARK-24429 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Reporter: Stavros Kontopoulos Right now in cluster mode only extraJavaOptions targeting the executor are set. According to the implementation and the docs: "In client mode, this config must not be set through the {{SparkConf}} directly in your application, because the driver JVM has already started at that point. Instead, please set this through the {{--driver-java-options}} command line option or in your default properties file." A typical driver call now is of the form: "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal 1" Also at the entrypoint.sh file there is no management of the driver's java Opts. We propose to set an env var to pass the extra java opts to the driver (like in the case of the executor), rename the env vars in the container as the one for the executor is a bit misleading, and use --driver-java-options to pass the required options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24428: Summary: Remove unused code and fix any related doc in K8s module (was: Remove unused code and fix any related doc) > Remove unused code and fix any related doc in K8s module > > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring: like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24428: Description: There are some relics of previous refactoring like: [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] Target is to cleanup anything not used. was: There are some relics of previous refactoring: like: [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] Target is to cleanup anything not used. > Remove unused code and fix any related doc in K8s module > > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24428) Remove unused code and fix any related doc
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24428: Assignee: (was: Apache Spark) > Remove unused code and fix any related doc > -- > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring: like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24428) Remove unused code and fix any related doc
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495184#comment-16495184 ] Apache Spark commented on SPARK-24428: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/21462 > Remove unused code and fix any related doc > -- > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring: like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24428) Remove unused code and fix any related doc
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24428: Assignee: Apache Spark > Remove unused code and fix any related doc > -- > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Minor > > There are some relics of previous refactoring: like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > Target is to cleanup anything not used. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24415: -- Description: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify its not fixed. I will attach screen shot. To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" import org.apache.spark.SparkEnv sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() was: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify its not fixed. I will attach screen shot. 
To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. 
> To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24415: -- Description: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify its not fixed. I will attach screen shot. To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() was: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify its not fixed. I will attach screen shot. 
To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify it's not fixed. > I will attach screen shot.
> To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > > > sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()
[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24415: -- Description: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify it's not fixed. I will attach screen shot. To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() was: Running with spark 2.3 on yarn and having task failures and blacklisting, the aggregated metrics by executor are not correct. In my example it should have 2 failed tasks but it only shows one. Note I tested with master branch to verify its not fixed. I will attach screen shot.
To reproduce: $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf "spark.blacklist.killBlacklistedExecutors=true" sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify it's not fixed. > I will attach screen shot.
> To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > > sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()
[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24428: Description: There are some relics of previous refactoring, like: [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] The target is to clean up anything not used. was: There are some relics of previous refactoring: like: [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] Target is to cleanup anything no used. > Remove unused code and fix any related doc > -- > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring, like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > The target is to clean up anything not used. >
[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc
[ https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-24428: Summary: Remove unused code and fix any related doc (was: Remove unused code and Fix docs) > Remove unused code and fix any related doc > -- > > Key: SPARK-24428 > URL: https://issues.apache.org/jira/browse/SPARK-24428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > There are some relics of previous refactoring, like: > [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] > The target is to clean up anything not used. >
[jira] [Created] (SPARK-24428) Remove unused code and Fix docs
Stavros Kontopoulos created SPARK-24428: --- Summary: Remove unused code and Fix docs Key: SPARK-24428 URL: https://issues.apache.org/jira/browse/SPARK-24428 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Stavros Kontopoulos There are some relics of previous refactoring, like: [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63] The target is to clean up anything not used.
[jira] [Created] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet
Ashok Rai created SPARK-24427: - Summary: Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet Key: SPARK-24427 URL: https://issues.apache.org/jira/browse/SPARK-24427 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.2.0 Reporter: Ashok Rai We are getting below error while loading into Hive table. In our code, we use "saveAsTable" - which as per documentation automatically chooses the format that the table was created on. We have now tested by creating the table as Parquet as well as ORC. In both cases - the same error occurred. - 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving table in spark. org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please specify the fully qualified class name.; at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) 
~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) ~[scala-library-2.11.8.jar:?] at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) ~[scala-library-2.11.8.jar:?] at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) ~[scala-library-2.11.8.jar:?] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at scala.collection.immutable.List.foreach(List.scala:381) ~[scala-library-2.11.8.jar:?] 
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1] at
[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL
[ https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495068#comment-16495068 ] Marco Gaido commented on SPARK-24395: - The main issue here is that {{(null, null) = (1, 2)}} in Spark evaluates to {{false}}, while on other DBs it evaluates to null. So we should revisit all our comparisons for structs. > Fix Behavior of NOT IN with Literals Containing NULL > > > Key: SPARK-24395 > URL: https://issues.apache.org/jira/browse/SPARK-24395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Miles Yucht >Priority: Major > > Spark does not return the correct answer when evaluating NOT IN in some > cases. For example: > {code:java} > CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES > (null, null) > AS m(a, b); > SELECT * > FROM m > WHERE a IS NULL AND b IS NULL >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1))));{code} > According to the semantics of null-aware anti-join, this should return no > rows. However, it actually returns the row {{NULL NULL}}. This was found by > inspecting the unit tests added for SPARK-24381 > ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822]).
> *Acceptance Criteria*: > * We should be able to add the following test cases back to > {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}: > {code:java} > -- Case 2 > -- (subquery contains a row with null in all columns -> row not returned) > SELECT * > FROM m > WHERE (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1)))); > -- Case 3 > -- (probe-side columns are all null -> row not returned) > SELECT * > FROM m > WHERE a IS NULL AND b IS NULL -- Matches only (null, null) >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1)))); > -- Case 4 > -- (one column null, other column matches a row in the subquery result -> > row not returned) > SELECT * > FROM m > WHERE b = 1.0 -- Matches (null, 1.0) >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1)))); > {code} > > cc [~smilegator] [~juliuszsompolski]
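The expected no-rows results in the cases above follow from SQL's three-valued logic, which is also the point of Marco Gaido's comment: {{(null, null) = (1, 2)}} should be UNKNOWN (null), not false. Below is a minimal Python sketch of those semantics, for illustration only — it is not Spark code, and the helper names are invented:

```python
# SQL three-valued logic modeled with True, False, and None (UNKNOWN).

def sql_eq(a, b):
    # Scalar SQL equality: NULL compared with anything is UNKNOWN.
    if a is None or b is None:
        return None
    return a == b

def sql_and(x, y):
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True

def sql_or(x, y):
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def sql_not(x):
    return None if x is None else (not x)

def row_eq(row, value):
    # Struct equality is the AND of the element-wise comparisons,
    # so (null, null) = (1, 2) is UNKNOWN, not false.
    result = True
    for a, b in zip(row, value):
        result = sql_and(result, sql_eq(a, b))
    return result

def not_in(row, values):
    # row NOT IN (v1, ..., vn)  ==  NOT (row = v1 OR ... OR row = vn)
    any_eq = False
    for v in values:
        any_eq = sql_or(any_eq, row_eq(row, v))
    return sql_not(any_eq)

# The probe row (NULL, NULL) from the report against the literal list:
# the predicate is UNKNOWN, so WHERE filters the row out.
print(row_eq((None, None), (1, 2)))                            # None (UNKNOWN)
print(not_in((None, None), [(0, 1.0), (2, 3.0), (4, None)]))   # None (UNKNOWN)
```

Under these rules the NOT IN predicate is UNKNOWN (never TRUE) for every case above, so the queries should return no rows — matching the acceptance criteria.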
[jira] [Updated] (SPARK-24426) Unexpected combination of cache and join on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-24426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krzysztof Skulski updated SPARK-24426: -- Description: I have unexpected results when I cache a DataFrame and try to do another grouping on it. New DataFrames based on the cached groupBy DataFrame work OK, but when I try to join them to another DataFrame, it seems like the second join adds a new column whose data is copied from the first joined DataFrame. See the example below (userAgentType is OK, userChannelType is OK, userOrigin is not OK). When I remove the cache from the aggregated DataFrame, it works OK. {code:scala} val aggregated = dataFrame.cache() val userAgentType = aggregated.groupBy("id", "agentType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("agentType").as("agentType")) val userChannelType = aggregated.groupBy("id", "channelType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("channelType").as("channelType")) val userOrigin = userInfo .join(userAgentType, Seq("id"), "left") .join(userChannelType, Seq("id"), "left") {code} was: I have unexpected results, when I cache DataFrame and try to do another grouping on it. New DataFrames based on cached groupBy DataFrame works ok, but when i try join it to anohter DataFrame it seems like second join is adding new column but the data is copy from first joined DataFrame. Example below (userAgentType - is ok, userChannelType - is ok, userOrigin - is not ok). When I remove cache from aggregated DataFrame it works ok.
{code} val aggregated = dataFrame.cache() val userAgentType = aggregated.groupBy("id", "agentType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("agentType").as("agentType")) val userChannelType = aggregated.groupBy("id", "channelType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("channelType").as("channelType")) val userOrigin = userInfo .join(userAgentType, Seq("id"), "left") .join(userChannelType, Seq("id"), "left") {code} > Unexpected combination of cache and join on DataFrame > - > > Key: SPARK-24426 > URL: https://issues.apache.org/jira/browse/SPARK-24426 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Krzysztof Skulski >Priority: Major > > I have unexpected results when I cache a DataFrame and try to do another > grouping on it. New DataFrames based on the cached groupBy DataFrame work OK, > but when I try to join them to another DataFrame, it seems like the second join > adds a new column whose data is copied from the first joined DataFrame. See the example > below (userAgentType is OK, > userChannelType is OK, userOrigin is not OK). > When I remove the cache from the aggregated DataFrame, it works OK. > > {code:scala} > val aggregated = dataFrame.cache() > val userAgentType = aggregated.groupBy("id", "agentType").count() >.orderBy(asc("id"), > desc("count")).groupBy("id").agg(first("agentType").as("agentType")) > val userChannelType = aggregated.groupBy("id", "channelType").count() >.orderBy(asc("id"), > desc("count")).groupBy("id").agg(first("channelType").as("channelType")) > val userOrigin = userInfo >.join(userAgentType, Seq("id"), "left") >.join(userChannelType, Seq("id"), "left") > {code}
[jira] [Created] (SPARK-24426) Unexpected combination of cache and join on DataFrame
Krzysztof Skulski created SPARK-24426: - Summary: Unexpected combination of cache and join on DataFrame Key: SPARK-24426 URL: https://issues.apache.org/jira/browse/SPARK-24426 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Krzysztof Skulski I have unexpected results when I cache a DataFrame and try to do another grouping on it. New DataFrames based on the cached groupBy DataFrame work OK, but when I try to join them to another DataFrame, it seems like the second join adds a new column whose data is copied from the first joined DataFrame. See the example below (userAgentType is OK, userChannelType is OK, userOrigin is not OK). When I remove the cache from the aggregated DataFrame, it works OK. {code} val aggregated = dataFrame.cache() val userAgentType = aggregated.groupBy("id", "agentType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("agentType").as("agentType")) val userChannelType = aggregated.groupBy("id", "channelType").count() .orderBy(asc("id"), desc("count")).groupBy("id").agg(first("channelType").as("channelType")) val userOrigin = userInfo .join(userAgentType, Seq("id"), "left") .join(userChannelType, Seq("id"), "left") {code}
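For readers puzzling over the orderBy/groupBy/first chains above: they compute, per id, the most frequent agentType (and channelType). A pure-Python sketch of that intended result, with made-up sample data — it illustrates only the expected answer, not the caching bug itself:

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for the `aggregated` DataFrame.
rows = [
    {"id": 1, "agentType": "mobile"},
    {"id": 1, "agentType": "mobile"},
    {"id": 1, "agentType": "web"},
    {"id": 2, "agentType": "web"},
]

def most_frequent_per_id(rows, col):
    # Equivalent of groupBy(id, col).count().orderBy(desc("count"))
    # .groupBy(id).agg(first(col)): pick the most common value per id.
    counts = defaultdict(Counter)
    for r in rows:
        counts[r["id"]][r[col]] += 1
    return {i: c.most_common(1)[0][0] for i, c in counts.items()}

print(most_frequent_per_id(rows, "agentType"))  # {1: 'mobile', 2: 'web'}
```

As an aside, relying on `first` after `orderBy` in Spark is itself fragile, since row ordering is not guaranteed to survive a subsequent groupBy aggregation; an explicit max-by-count aggregation is the safer formulation.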
[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494975#comment-16494975 ] Apache Spark commented on SPARK-23754: -- User 'e-dorigatti' has created a pull request for this issue: https://github.com/apache/spark/pull/21461 > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Emilio Dorigatti >Priority: Blocker > Fix For: 2.4.0 > > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case
[jira] [Assigned] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23754: Assignee: Emilio Dorigatti > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Emilio Dorigatti >Priority: Blocker > Fix For: 2.4.0 > > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case
[jira] [Resolved] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23754. -- Resolution: Fixed Fix Version/s: 2.4.0 Fixed in https://github.com/apache/spark/pull/21383 Backporting is needed. > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Blocker > Fix For: 2.4.0 > > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case
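The mechanism behind the partial result in SPARK-23754 can be reproduced in plain CPython with no Spark involved: a StopIteration raised by a mapped function propagates out of the iterator, and the consumer cannot distinguish it from normal exhaustion, so iteration silently stops. A minimal sketch:

```python
def foo(x):
    # A "UDF" that mistakenly raises StopIteration instead of a normal error.
    raise StopIteration()

# list() drives the map iterator via next(); when foo raises StopIteration,
# it escapes map.__next__ and list() reads it as "no more items",
# silently producing a truncated (here, empty) result instead of failing.
result = list(map(foo, range(5)))
print(result)  # []
```

Inside generators this foot-gun was closed by PEP 479 (a StopIteration leaking into a generator frame becomes RuntimeError), but iterator pipelines like `map` above still swallow it — which, as I understand it, is why the fix wraps user functions so that a StopIteration raised by a UDF is re-raised as a different exception and the task fails loudly.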
[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885 ] Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 8:41 AM: -- I've done some experiments locally. But the results show that in the above case we don't seem to guarantee that the data distribution is the same when reading the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. That said, for FileScan nodes at (1) and (2): {code:java} *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are partitioned by {{key#28}}. But it doesn't guarantee that the values of {{key#25}} and {{key#28}} are located in the same partition. was (Author: viirya): I've done some experiments locally. But the results show that in above case seems we don't guarantee that the data distribution is the same when reading two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. That said, for FileScan nodes at (1) and (2): {code:java} *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are partitioned by {{key#28}}.
But it doesn't guarantee that the values of {{key#25}} and {{key#28}} are located in the same partition. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. we do this by unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). > for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project
InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project
[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885 ] Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 8:41 AM: -- I've done some experiments locally. But the results show that in the above case we don't seem to guarantee that the data distribution is the same when reading the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. That said, for FileScan nodes at (1) and (2): {code:java} *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are partitioned by {{key#28}}. But it doesn't guarantee that the values of {{key#25}} and {{key#28}} are located in the same partition. was (Author: viirya): I've done some experiments locally. But the results show that in above case seems we don't guarantee that the data distribution is the same when reading two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get the correct results. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. we do this by unioning the two > tables and aggregating again.
> we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). > for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, 
count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
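For intuition, here is a small Python sketch of why the reporter's expectation is physically reasonable, and what the planner is missing. It uses Python's built-in `hash` rather than Spark's Murmur3-based `HashPartitioning`, so the bucket ids are illustrative only: two tables written with the same bucketing spec do put equal keys into equal bucket ids on disk. What the planner lacks is a proof of that fact — each scan's output partitioning is expressed over its own attribute (`key#25` vs `key#28`), and after the Union there is no single attribute the combined output can be declared partitioned on, so an Exchange is inserted.

```python
# Illustrative sketch only: Python's hash() stands in for Spark's
# Murmur3-based bucket hash, so the bucket ids here are NOT Spark's.
def bucket_id(key: int, num_buckets: int) -> int:
    return hash(key) % num_buckets

NUM_BUCKETS = 3
keys = range(20)

# Both tables are written with the same bucketing spec, so the same key
# always lands in the same bucket id in a1 and in a2.
a1_buckets = {b: sorted(k for k in keys if bucket_id(k, NUM_BUCKETS) == b)
              for b in range(NUM_BUCKETS)}
a2_buckets = {b: sorted(k for k in keys if bucket_id(k, NUM_BUCKETS) == b)
              for b in range(NUM_BUCKETS)}

assert a1_buckets == a2_buckets  # physically co-partitioned on disk
```

So an optimizer could in principle zip bucket i of a1 with bucket i of a2; the missing piece is a partitioning property that survives the Union in the physical plan.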
[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885 ] Liang-Chi Hsieh commented on SPARK-24410: - I've done some experiments locally, but the results show that in the above case we don't seem to guarantee that the data distribution is the same when reading the two tables {{a1}} and {{a2}}. By removing the shuffle, you don't actually get the correct results. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is a partially aggregated table plus daily > increments that we need to aggregate further. we do this by unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the fact that the tables are > bucketed (the way the join operator does). 
> for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > 
InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24425) Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
[ https://issues.apache.org/jira/browse/SPARK-24425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494864#comment-16494864 ] Unai Sarasola commented on SPARK-24425: --- Totally agree with you, Sam. It's up to the developer to decide whether to maintain or optimize the number of partitions. This makes it impossible to back up your files in HDFS using Spark (I know that's not the main task of Spark, but it's a limitation we didn't have with the previous versions). I'm also playing with the same properties mentioned by [~sams], without luck. > Regression from 1.6 to 2.x - Spark no longer respects input partitions, > unnecessary shuffle required > > > Key: SPARK-24425 > URL: https://issues.apache.org/jira/browse/SPARK-24425 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: sam >Priority: Major > > I think this is a regression. We used to be able to easily control the number > of output files / tasks based on num files and coalesce. Now I have to use > `repartition` to get the desired num files / partitions, which is > unnecessarily expensive. > I've tried playing with spark.sql.files.maxPartitionBytes and > spark.sql.files.openCostInBytes to see if I can force the conventional > behaviour. > {code:java} > val ss = SparkSession.builder().appName("uber-cp").master(conf.master()) > .config("spark.sql.files.maxPartitionBytes", 1) > .config("spark.sql.files.openCostInBytes", Long.MaxValue) > {code} > This didn't work. Spark just squashes all my parquet files into fewer > partitions. > Suggest a simple `option` on DataFrameReader that can disable this (or enable > it; the default behaviour should be the same as 1.6). > > This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if > SPARK-5997 was implemented this ticket wouldn't really be necessary. 
> > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
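For reference, a simplified Python model of how Spark 2.x sizes and packs non-bucketed file reads (modelled loosely on FileSourceScanExec's non-bucketed read path; treat the formula and packing below as a sketch of the documented behaviour, not the exact implementation):

```python
MB = 1024 * 1024

def max_split_bytes(total_bytes, num_files, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    # Roughly Spark 2.x: spread the data (plus a per-file open cost) over the
    # default parallelism, clamped by spark.sql.files.maxPartitionBytes and
    # floored by spark.sql.files.openCostInBytes.
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def pack_files(file_sizes, split, open_cost=4 * MB):
    # Greedy packing: close the current read partition when the next file
    # would push it past the split size (file splitting omitted for brevity).
    partitions, current, size = [], [], 0
    for f in sorted(file_sizes, reverse=True):
        if current and size + f > split:
            partitions.append(current)
            current, size = [], 0
        current.append(f)
        size += f + open_cost
    if current:
        partitions.append(current)
    return partitions

# 100 small files get coalesced into far fewer read partitions -- the
# behaviour the reporter wants to be able to switch off.
split = max_split_bytes(total_bytes=200 * MB, num_files=100, default_parallelism=8)
parts = pack_files([2 * MB] * 100, split)
```

Under this model the input partition count is a function of total size, file count, and the two configs, not of the on-disk file layout, which is why tuning those configs alone cannot recover the 1.6 "one task per input file" behaviour in general.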
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494852#comment-16494852 ] sam commented on SPARK-20144: - Regarding the original issue of sorting, I agree with [~srowen] in that it should be up to the user to explicitly ask for sorted data. This is because fundamentally Spark implements the Map Reduce programming paradigm which is defined in terms of multisets. [~icexelloss] Please read [http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf] Regarding my issue of Spark reducing the number of partitions without any ask from the user I've created a separate issue: https://issues.apache.org/jira/browse/SPARK-24425 > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24425) Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
sam created SPARK-24425: --- Summary: Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required Key: SPARK-24425 URL: https://issues.apache.org/jira/browse/SPARK-24425 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Reporter: sam I think this is a regression. We used to be able to easily control the number of output files / tasks based on num files and coalesce. Now I have to use `repartition` to get the desired num files / partitions which is unnecessarily expensive. I've tried playing with spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes to see if I can force the conventional behaviour. {code:java} val ss = SparkSession.builder().appName("uber-cp").master(conf.master()) .config("spark.sql.files.maxPartitionBytes", 1) .config("spark.sql.files.openCostInBytes", Long.MaxValue) {code} This didn't work. Spark just squashes all my parquet files into less partitions. Suggest a simple `option` on DataFrameReader that can disable this (or enable it, default behaviour should be same as 1.6). This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if SPARK-5997 was implemented this ticket wouldn't really be necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494832#comment-16494832 ] Ruben Berenguel commented on SPARK-23904: - [~igreenfi] that's what I mean, removing the code (no-op = no operation). I don't get an OOM due to this string being generated; all the OOMs I manage to get are due to too-large tree plans in catalyst (which seems expected, I have tried more than 10 times already with different settings). You mentioned on StackOverflow that even with spark.ui disabled, the string was sent through the listenerBus anyway? Where and how have you seen this happening? I guess we are doing something differently, and I can't figure out what it is to reproduce your OOM. > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I created a question on > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark creates the text representation of the query in any case, even if I > don't need it. > That causes many garbage objects and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494810#comment-16494810 ] Unai Sarasola commented on SPARK-20144: --- But if you want to have an exact copy of your data in HDFS, for example, this would make it impossible for Spark to maintain the copy of the files exactly in both places. Is that right? > spark.read.parquet no long maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin >Priority: Major > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was produced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494806#comment-16494806 ] Apache Spark commented on SPARK-23442: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21460 > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
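The effect the reporter describes can be sketched in a few lines of Python (the file-name pattern and function names below are simplified assumptions for illustration, not Spark's exact code): the bucketed read path groups every data file under the table by its bucket id, so the number of read partitions is always bucketSpec.numBuckets, no matter how many Hive-style partitions contributed files.

```python
import re

def bucket_id_from_name(name: str) -> int:
    # Assumed, simplified bucketed-file naming: the "_NNNNN" group before the
    # extension carries the bucket id (e.g. part-00000-<uuid>_00042.c000.parquet).
    return int(re.search(r"_(\d{5})\.", name).group(1))

def create_bucketed_read_partitions(files, num_buckets):
    # One read partition per bucket id, regardless of how many files exist.
    parts = [[] for _ in range(num_buckets)]
    for f in files:
        parts[bucket_id_from_name(f)].append(f)
    return parts

# 100 date partitions x 50 buckets = 5000 files, but only 50 read partitions:
files = [f"dt={d}/part-00000-uuid_{b:05d}.c000.parquet"
         for d in range(100) for b in range(50)]
parts = create_bucketed_read_partitions(files, 50)
```

With a 600GB table this grouping caps read parallelism at 50 tasks, which is exactly the complaint in this ticket.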
[jira] [Assigned] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23442: Assignee: Apache Spark > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Assignee: Apache Spark >Priority: Major > > Through the DataFrameWriter[T] interface I have created a external HIVE table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this Dataframe(from _spark.table("tablename")_), works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apachttps:/github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.she/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.sc]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23442: Assignee: (was: Apache Spark) > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created a external HIVE table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this Dataframe(from _spark.table("tablename")_), works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apachttps:/github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.she/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.sc]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
[ https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494771#comment-16494771 ] Dilip Biswal commented on SPARK-24424: -- [~smilegator] Thank you. I would like to give it a try. > Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET > --- > > Key: SPARK-24424 > URL: https://issues.apache.org/jira/browse/SPARK-24424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Currently, our Group By clause follows the Hive syntax > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]. > However, this does not match the ANSI SQL standard. The proposal is to update > our parser and analyzer for ANSI compliance. > For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SETS ...
> {code}
> It would be nice to also support the ANSI SQL syntax:
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SETS(...)
> {code}
> Note, we only need to support one-level grouping sets at this stage. That
> means nested grouping sets are not supported.
> Note, we should not break the existing syntax. The parser changes should be
> like
> {code:sql}
> group-by-expressions
>   GROUP BY { hive-sql-group-by-expressions | ansi-sql-grouping-set-expressions }
>
> hive-sql-group-by-expressions
>   expression [, expression ...] [ WITH CUBE | WITH ROLLUP ]
>   | GROUPING SETS ( grouping-set-expressions )
>
> grouping-expression-list
>   expression [, expression ...]
>
> grouping-set-expressions
>   { expression | ( expression [, expression ...] ) } [, ...]
>
> ansi-sql-grouping-set-expressions
>   ROLLUP ( grouping-expression-list )
>   | CUBE ( grouping-expression-list )
>   | GROUPING SETS ( grouping-set-expressions )
> {code}
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
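The Hive-style and ANSI-style syntaxes discussed in this ticket denote the same expansion into grouping sets, which is easy to state precisely. The sketch below (plain Python, not Spark's analyzer) shows the expansion that ROLLUP and CUBE are shorthand for:

```python
from itertools import combinations

def rollup(*cols):
    # ROLLUP(c1, ..., cn) groups by every prefix: (c1..cn), (c1..cn-1), ..., ()
    return [cols[:i] for i in range(len(cols), -1, -1)]

def cube(*cols):
    # CUBE(c1, ..., cn) groups by every subset of the columns: 2^n grouping sets.
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]

# GROUP BY col1, col2 WITH ROLLUP is equivalent to GROUP BY ROLLUP(col1, col2):
assert rollup("col1", "col2") == [("col1", "col2"), ("col1",), ()]
```

Since both syntaxes lower to the same grouping-set list, supporting the ANSI form is purely a parser change, as the ticket says.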
[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL
[ https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494769#comment-16494769 ] Xiao Li commented on SPARK-24395: - I think Oracle returns a different answer. We should fix them. > Fix Behavior of NOT IN with Literals Containing NULL > > > Key: SPARK-24395 > URL: https://issues.apache.org/jira/browse/SPARK-24395 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Miles Yucht >Priority: Major > > Spark does not return the correct answer when evaluating NOT IN in some > cases. For example: > {code:java} > CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES > (null, null) > AS m(a, b); > SELECT * > FROM m > WHERE a IS NULL AND b IS NULL >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1;{code} > According to the semantics of null-aware anti-join, this should return no > rows. However, it actually returns the row {{NULL NULL}}. This was found by > inspecting the unit tests added for SPARK-24381 > ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822).] 
> *Acceptance Criteria*: > * We should be able to add the following test cases back to > {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}: > {code:java} > -- Case 2 > -- (subquery contains a row with null in all columns -> row not returned) > SELECT * > FROM m > WHERE (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1; > -- Case 3 > -- (probe-side columns are all null -> row not returned) > SELECT * > FROM m > WHERE a IS NULL AND b IS NULL -- Matches only (null, null) >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1; > -- Case 4 > -- (one column null, other column matches a row in the subquery result -> > row not returned) > SELECT * > FROM m > WHERE b = 1.0 -- Matches (null, 1.0) >AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, > 1; > {code} > > cc [~smilegator] [~juliuszsompolski] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
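The expected behaviour follows mechanically from SQL's three-valued logic, which the following self-contained Python sketch models (None stands for SQL NULL; this illustrates the semantics, not Spark's implementation). A WHERE clause keeps a row only when its predicate evaluates to TRUE, and `(a, b) NOT IN (...)` evaluates to UNKNOWN, not TRUE, once NULLs are involved, so the row must be filtered out:

```python
# Three-valued logic: True, False, and None (SQL's UNKNOWN/NULL).
def sql_not(v):
    return None if v is None else (not v)

def sql_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_eq(x, y):
    # Any comparison with NULL is UNKNOWN.
    return None if x is None or y is None else x == y

def row_eq(r1, r2):
    v = True
    for x, y in zip(r1, r2):
        v = sql_and(v, sql_eq(x, y))
    return v

def not_in(row, in_list):
    # row NOT IN (r1, r2, ...) == NOT (row = r1 OR row = r2 OR ...)
    v = False
    for r in in_list:
        v = sql_or(v, row_eq(row, r))
    return sql_not(v)

# (NULL, NULL) NOT IN ((0, 1.0), (2, 3.0), (4, NULL)) -> UNKNOWN,
# so the WHERE clause must NOT return the row.
assert not_in((None, None), [(0, 1.0), (2, 3.0), (4, None)]) is None
```

Returning `NULL NULL` therefore means Spark treated an UNKNOWN predicate as TRUE, which is the bug this ticket fixes.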
[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources
[ https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24423: Description: Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. {code} val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", "dbName.tableName") .options(jdbcCredentials: Map) .load() {code} Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small subset of the table. For advanced users, they can pass a subquery as the option. {code} val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", query) .options(jdbcCredentials: Map) .load() {code} However, this is not straightforward for end users. We should simply allow users to specify the query by a new option `query`. We will handle the complexity for them. {code} val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("*{color:#ff}query{color}*", query) .options(jdbcCredentials: Map) .load() {code} Users are not allowed to specify query and dbtable at the same time. was: Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. {code} val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", "dbName.tableName") .options(jdbcCredentials: Map) .load() {code} Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small subset of the table. For advanced users, they can pass a subquery as the option. {code} val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", query) .options(jdbcCredentials: Map) .load() {code} However, this is not straightforward for end users. We should simply allow users to specify the query by a new option `query`. 
We will handle the complexity for them. {code} val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("*{color:#ff}query{color}*", query) .options(jdbcCredentials: Map) .load() {code} Users are not allowed to specify query and dbtable at the same time. > Add a new option `query` for JDBC sources > - > > Key: SPARK-24423 > URL: https://issues.apache.org/jira/browse/SPARK-24423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Currently, our JDBC connector provides the option `dbtable` for users to > specify the to-be-loaded JDBC source table. > {code} > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", "dbName.tableName") > .options(jdbcCredentials: Map) > .load() > {code} > Normally, users do not fetch the whole JDBC table due to the poor > performance/throughput of JDBC. Thus, they normally just fetch a small subset of the > table. For advanced users, they can pass a subquery as the option. > {code} > val query = """ (select * from tableName limit 10) as tmp """ > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", query) > .options(jdbcCredentials: Map) > .load() > {code} > However, this is not straightforward for end users. We should simply allow users > to specify the query by a new option `query`. We will handle the complexity > for them. > {code} > val query = """select * from tableName limit 10""" > val jdbcDf = spark.read > .format("jdbc") > .option("*{color:#ff}query{color}*", query) > .options(jdbcCredentials: Map) > .load() > {code} > Users are not allowed to specify query and dbtable at the same time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
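A sketch of the proposed behaviour (the function name and the generated alias below are made up for illustration; the actual implementation may differ): the new `query` option can be lowered onto the existing `dbtable` code path by wrapping the query as an aliased derived table, which is exactly the workaround advanced users write by hand today.

```python
def dbtable_from_options(options: dict) -> str:
    # Hypothetical sketch: lower `query` onto the existing `dbtable` code path.
    if "query" in options and "dbtable" in options:
        raise ValueError(
            "Options `query` and `dbtable` can not both be specified.")
    if "query" in options:
        # Wrap the user's query as an aliased derived table; the alias name
        # here is an assumption, not necessarily what Spark generates.
        return f"({options['query']}) SPARK_GEN_SUBQ_0"
    return options["dbtable"]

# The user writes only the query; the connector handles the wrapping.
assert (dbtable_from_options({"query": "select * from tableName limit 10"})
        == "(select * from tableName limit 10) SPARK_GEN_SUBQ_0")
```

Rejecting `query` together with `dbtable`, as the description requires, keeps the two options unambiguous.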
[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources
[ https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24423: Description: Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. {code} val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", "dbName.tableName") .options(jdbcCredentials: Map) .load() {code} Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small subset of the table. For advanced users, they can pass a subquery as the option. {code} val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", query) .options(jdbcCredentials: Map) .load() {code} However, this is not straightforward for end users. We should simply allow users to specify the query by a new option `query`. We will handle the complexity for them. {code} val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("*{color:#ff}query{color}*", query) .options(jdbcCredentials: Map) .load() {code} Users are not allowed to specify query and dbtable at the same time. was: Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", "dbName.tableName") .options(jdbcCredentials: Map) .load() Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small subset of the table. For advanced users, they can pass a subquery as the option. val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", query) .options(jdbcCredentials: Map) .load() However, this is not straightforward for end users. We should simply allow users to specify the query by a new option `query`. 
We will handle the complexity for them. val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("*{color:#ff}query{color}*", query) .options(jdbcCredentials: Map) .load() Users are not allowed to specify query and dbtable at the same time. > Add a new option `query` for JDBC sources > - > > Key: SPARK-24423 > URL: https://issues.apache.org/jira/browse/SPARK-24423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Currently, our JDBC connector provides the option `dbtable` for users to > specify the to-be-loaded JDBC source table. > {code} > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", "dbName.tableName") > .options(jdbcCredentials: Map) > .load() > {code} > > Normally, users do not fetch the whole JDBC table due to the poor > performance/throughput of JDBC. Thus, they normally just fetch a small subset of the > table. For advanced users, they can pass a subquery as the option. > > {code} > val query = """ (select * from tableName limit 10) as tmp """ > val jdbcDf = spark.read > .format("jdbc") > .option("*dbtable*", query) > .options(jdbcCredentials: Map) > .load() > {code} > > However, this is not straightforward for end users. We should simply allow users > to specify the query by a new option `query`. We will handle the complexity > for them. > > {code} > val query = """select * from tableName limit 10""" > val jdbcDf = spark.read > .format("jdbc") > .option("*{color:#ff}query{color}*", query) > .options(jdbcCredentials: Map) > .load() > {code} > > Users are not allowed to specify query and dbtable at the same time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
[ https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494754#comment-16494754 ] Xiao Li commented on SPARK-24424: - Also cc [~dkbiswal] > Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET > --- > > Key: SPARK-24424 > URL: https://issues.apache.org/jira/browse/SPARK-24424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Currently, our Group By clause follows the Hive syntax > [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]. > However, this does not match the ANSI SQL standard. The proposal is to update > our parser and analyzer for ANSI compliance. > For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SETS ...
> {code}
> It would be nice to also support the ANSI SQL syntax:
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SETS(...)
> {code}
> Note, we only need to support one-level grouping sets at this stage. That
> means nested grouping sets are not supported.
> Note, we should not break the existing syntax. The parser changes should be
> like
> {code:sql}
> group-by-expressions
>   GROUP BY { hive-sql-group-by-expressions | ansi-sql-grouping-set-expressions }
>
> hive-sql-group-by-expressions
>   expression [, expression ...] [ WITH CUBE | WITH ROLLUP ]
>   | GROUPING SETS ( grouping-set-expressions )
>
> grouping-expression-list
>   expression [, expression ...]
>
> grouping-set-expressions
>   { expression | ( expression [, expression ...] ) } [, ...]
>
> ansi-sql-grouping-set-expressions
>   ROLLUP ( grouping-expression-list )
>   | CUBE ( grouping-expression-list )
>   | GROUPING SETS ( grouping-set-expressions )
> {code}
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org