[jira] [Resolved] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24337.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Improve the error message for invalid SQL conf value
> 
>
> Key: SPARK-24337
> URL: https://issues.apache.org/jira/browse/SPARK-24337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Right now Spark will throw the following error message when a config is set 
> to an invalid value. It would be great if the error message contains the 
> config key so that it's easy to tell which one is wrong.
> {code}
> Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
> (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
> Fractional values are not supported. Input was: 1.6
>   at 
> org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
>   at 
> org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
>   ... 59 more
> {code}
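
A minimal way to reproduce this (my own sketch; the conf key below is just one
example of a size-typed SQL conf, not taken from the report above) is to set a
bytes-typed conf to a fractional value:

{code}
// Hypothetical reproduction sketch in spark-shell (Scala).
// "1.6" is fractional, so JavaUtils.byteStringAs rejects it; the proposal is
// that the resulting message should also name the offending conf key.
spark.conf.set("spark.sql.files.maxPartitionBytes", "1.6")
// Today the message only says: "Fractional values are not supported. Input was: 1.6"
// The improvement would mention spark.sql.files.maxPartitionBytes as well.
{code}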



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496132#comment-16496132
 ] 

Liang-Chi Hsieh edited comment on SPARK-24410 at 5/31/18 5:39 AM:
--

We can verify the partition of union dataframe:
{code}
val df1 = spark.table("a1").select(spark_partition_id(), $"key")
val df2 = spark.table("a2").select(spark_partition_id(), $"key")
df1.union(df2).select(spark_partition_id(), $"key").collect
{code}
{code}
res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], 
[2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], 
[5,9], [5,5], [5,8], [5,1], [5,4])
{code}

From the above result, we can see that the same {{key}} from {{df1}} and 
{{df2}} ends up in different partitions. E.g., key {{6}} is at partition {{0}} 
and partition {{3}}. So we still need a shuffle to get the correct results.





was (Author: viirya):
We can verify the partition of union dataframe:
{code}
val df1 = spark.table("a1").select(spark_partition_id(), $"key")
val df2 = spark.table("a2").select(spark_partition_id(), $"key")
df1.union(df2).select(spark_partition_id(), $"key").collect
{code}
{code}
res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], 
[2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], 
[5,9], [5,5], [5,8], [5,1], [5,4])
{code}

From the above result, we can see that the same {{key}} from {{df1}} and 
{{df2}} ends up in different partitions. E.g., key {{6}} is at partition {{0}} 
and partition {{3}}.




> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (the way the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
> {code}



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-05-30 Thread Teddy Choi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496131#comment-16496131
 ] 

Teddy Choi commented on SPARK-21187:


Hello [~bryanc], I'm working on a Hive-Spark connector with Arrow, so I'm also 
interested in this issue. Can I work on the MapType implementation? Thanks.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * *Binary* - in pyspark
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496132#comment-16496132
 ] 

Liang-Chi Hsieh commented on SPARK-24410:
-

We can verify the partition of union dataframe:
{code}
val df1 = spark.table("a1").select(spark_partition_id(), $"key")
val df2 = spark.table("a2").select(spark_partition_id(), $"key")
df1.union(df2).select(spark_partition_id(), $"key").collect
{code}
{code}
res8: Array[org.apache.spark.sql.Row] = Array([0,6], [0,7], [0,3], [1,0], 
[2,4], [2,8], [2,9], [2,2], [2,1], [2,5], [3,3], [3,7], [3,6], [4,0], [5,2], 
[5,9], [5,5], [5,8], [5,1], [5,4])
{code}

From the above result, we can see that the same {{key}} from {{df1}} and 
{{df2}} ends up in different partitions. E.g., key {{6}} is at partition {{0}} 
and partition {{3}}.




> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (the way the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24337:
---

Assignee: (was: Xiao Li)

> Improve the error message for invalid SQL conf value
> 
>
> Key: SPARK-24337
> URL: https://issues.apache.org/jira/browse/SPARK-24337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now Spark will throw the following error message when a config is set 
> to an invalid value. It would be great if the error message contains the 
> config key so that it's easy to tell which one is wrong.
> {code}
> Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
> (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
> Fractional values are not supported. Input was: 1.6
>   at 
> org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
>   at 
> org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
>   ... 59 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-05-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23649.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1
   2.2.2

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
> Attachments: utf8xFF.csv
>
>
> Schema inferring of CSV files fails if the file contains a char starting with 
> *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                     |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code
> +-------+----+
> | United| 123
> | ABGUN�| 456
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-05-30 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495998#comment-16495998
 ] 

Huaxin Gao commented on SPARK-24439:


I will work on this. 

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-05-30 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24439:
--

 Summary: Add distanceMeasure to BisectingKMeans in PySpark
 Key: SPARK-24439
 URL: https://issues.apache.org/jira/browse/SPARK-24439
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Huaxin Gao


https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
BisectingKMeans. We will do the same for PySpark.
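
For reference, a rough sketch of the Scala API added by SPARK-23412 that the
PySpark change would mirror (my own illustration, not taken from the ticket):

{code}
import org.apache.spark.ml.clustering.BisectingKMeans

// Sketch of the existing Scala-side param; the PySpark work is to expose the
// same distanceMeasure param ("euclidean" or "cosine") in the Python wrapper.
val bkm = new BisectingKMeans()
  .setK(3)
  .setDistanceMeasure("cosine")
// val model = bkm.fit(dataset)  // dataset: a DataFrame with a "features" column
{code}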



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24436) Add large dataset to examples sub-directory.

2018-05-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24436.
--
Resolution: Invalid

> Add large dataset to examples sub-directory.
> 
>
> Key: SPARK-24436
> URL: https://issues.apache.org/jira/browse/SPARK-24436
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Varun Vishwanathan
>Priority: Trivial
>
> Fire Incidents from San Francisco is a large dataset that may help developers 
> work with more data when fixing bugs or developing features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24436) Add large dataset to examples sub-directory.

2018-05-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495984#comment-16495984
 ] 

Hyukjin Kwon commented on SPARK-24436:
--

I don't think we should add a large dataset to the code base. You can leave the 
link on the user mailing list, post it in a blog, or somewhere else. An example 
is an example, and it doesn't have to have large data.

> Add large dataset to examples sub-directory.
> 
>
> Key: SPARK-24436
> URL: https://issues.apache.org/jira/browse/SPARK-24436
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Varun Vishwanathan
>Priority: Trivial
>
> Fire Incidents from San Francisco is a large dataset that may help developers 
> work with more data when fixing bugs or developing features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495982#comment-16495982
 ] 

Hyukjin Kwon commented on SPARK-24427:
--

Doesn't it sound like you specified multiple versions of Spark in your classpath?
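
As a possible workaround hinted at by the error text itself (a sketch only, not
verified against the reporter's setup; {{df}} and {{target_table}} are
placeholders), the fully qualified class name can be passed to {{format()}}:

{code}
// Sketch: the error asks for a fully qualified class name when several sources
// register the short name "parquet". This sidesteps the ambiguous lookup; the
// real fix is removing the duplicate Spark jars from the classpath.
df.write
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .mode("overwrite")
  .saveAsTable("target_table")
{code}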

>  Spark 2.2 - Exception occurred while saving table in spark. Multiple sources 
> found for parquet 
> 
>
> Key: SPARK-24427
> URL: https://issues.apache.org/jira/browse/SPARK-24427
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Ashok Rai
>Priority: Major
>
> We are getting the below error while loading into a Hive table. In our code, 
> we use "saveAsTable", which per the documentation automatically chooses the 
> format that the table was created with. We have now tested by creating the 
> table as Parquet as well as ORC. In both cases the same error occurred.
>  
> -
> 2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
> table in spark.
>  org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
> specify the fully qualified class name.;
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
> ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  ~[scala-library-2.11.8.jar:?]
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  ~[scala-library-2.11.8.jar:?]
>  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at scala.collection.immutable.List.foreach(List.scala:381) 
> ~[scala-library-2.11.8.jar:?]
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>  ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
>  ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
>  at 
> 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2018-05-30 Thread Cyril Scetbon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495979#comment-16495979
 ] 

Cyril Scetbon commented on SPARK-16367:
---

Is there a better way to do it today? I see that this ticket has been open for 
almost 2 years.

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>Priority: Major
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy packages written in Scala, it is recommended to build big fat jar 
> files. This puts all dependencies in one package, so the only "cost" is the 
> copy time to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "wheel" packages. The Python community is strongly advocating this way of 
> packaging and distributing Python applications as the "standard way of 
> deploying Python apps". In other words, this is the "Pythonic way of 
> deployment".
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal was to merge them, in order to support wheel 
> installation and virtualenv creation 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. For example, look at https://pypi.python.org/pypi/numpy to see 
> all the different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install this package at light speed (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installing 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate all the wheels of all 
> the packages used by a given project inside a "virtualenv". This is called a 
> "wheelhouse". You can even skip this compilation entirely and retrieve the 
> wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then ship the complete "wheelhouse": 
> - you are writing a PySpark script that grows in size and dependencies. 
> Deploying it on Spark, for example, requires building numpy or Theano and 
> other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend specifying all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> 

[jira] [Created] (SPARK-24438) Empty strings and null strings are written to the same partition

2018-05-30 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-24438:


 Summary: Empty strings and null strings are written to the same 
partition
 Key: SPARK-24438
 URL: https://issues.apache.org/jira/browse/SPARK-24438
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Mukul Murthy


When you partition on a string column that has empty strings and nulls, they 
are both written to the same default partition. When you read the data back, 
all those values get read back as null.


{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, 
null))
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
display(df) 
=> 
a b
1 
2 
3 
4 hello
5 null

df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
val df2 = spark.read.load("/home/mukul/weird_test_data4")
display(df2)
=> 
a b
4 hello
3 null
2 null
1 null
5 null
{code}

Seems to affect multiple types of tables.
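
One way to confirm the behavior on disk (my own sketch; it assumes the path used
above and the usual {{__HIVE_DEFAULT_PARTITION__}} naming for null partition
values) is to list the partition directories the write produced:

{code}
// Sketch: after the partitionBy("b") write above, both the empty-string rows
// and the null row are expected to land under a single default partition
// directory (e.g. b=__HIVE_DEFAULT_PARTITION__), next to b=hello.
new java.io.File("/home/mukul/weird_test_data4")
  .listFiles()
  .filter(_.isDirectory)
  .map(_.getName)
  .sorted
  .foreach(println)
{code}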



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24333:


Assignee: Apache Spark

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
>
> Python version of API added by [SPARK-7132]
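
For context, a rough sketch of the Scala-side API added by SPARK-7132 that this
ticket would mirror in Python (my own illustration; the column name is a
placeholder):

{code}
import org.apache.spark.ml.regression.GBTRegressor

// Sketch: rows whose indicator column is true are treated as the validation
// set, used to decide when to stop adding trees early.
val gbt = new GBTRegressor()
  .setMaxIter(100)
  .setValidationIndicatorCol("isValidation")  // placeholder column name
// val model = gbt.fit(trainAndValidationDF)
{code}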



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24333:


Assignee: (was: Apache Spark)

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495904#comment-16495904
 ] 

Apache Spark commented on SPARK-24333:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21465

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-05-30 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495843#comment-16495843
 ] 

Shixiong Zhu commented on SPARK-23649:
--

[~cloud_fan] looks like this is fixed?

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inferring of CSV files fails if the file contains a char starting with 
> *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                     |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code
> +-------+----+
> | United| 123
> | ABGUN�| 456
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24276.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> semanticHash() returns different values for semantically the same IS IN
> ---
>
> Key: SPARK-24276
> URL: https://issues.apache.org/jira/browse/SPARK-24276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> When a plan is canonicalized, any set-based operation, such as IS IN, should 
> have its expressions ordered, as the order of expressions does not matter in 
> the evaluation of the operator.
> For instance:
> {code:scala}
> val df = spark.createDataFrame(Seq((1, 2)))
> val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized
> val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized
> val h1 = p1.semanticHash
> val h2 = p2.semanticHash
> {code}
> {code}
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
> p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Filter '_1 IN (1,2)
> +- LocalRelation [_1#0, _2#1]
> p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Filter '_1 IN (2,1)
> +- LocalRelation [_1#0, _2#1]
> h1: Int = -1384236508
> h2: Int = 939549189
> {code}
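
A quick way to check the intended behavior (my own sketch, reusing {{p1}} and
{{p2}} from the snippet above):

{code}
// Sketch: once IN expressions are ordered during canonicalization, the two
// plans should be recognized as semantically equal.
p1.sameResult(p2)                     // expected: true after the fix
p1.semanticHash == p2.semanticHash    // expected: true after the fix
{code}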



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24276) semanticHash() returns different values for semantically the same IS IN

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24276:
---

Assignee: Marco Gaido  (was: Maxim Gekk)

> semanticHash() returns different values for semantically the same IS IN
> ---
>
> Key: SPARK-24276
> URL: https://issues.apache.org/jira/browse/SPARK-24276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> When a plan is canonicalized, any set-based operation, such as IS IN, should 
> have its expressions ordered, as the order of expressions does not matter in 
> the evaluation of the operator.
> For instance:
> {code:scala}
> val df = spark.createDataFrame(Seq((1, 2)))
> val p1 = df.where('_1.isin(1, 2)).queryExecution.logical.canonicalized
> val p2 = df.where('_1.isin(2, 1)).queryExecution.logical.canonicalized
> val h1 = p1.semanticHash
> val h2 = p2.semanticHash
> {code}
> {code}
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
> p1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Filter '_1 IN (1,2)
> +- LocalRelation [_1#0, _2#1]
> p2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> 'Filter '_1 IN (2,1)
> +- LocalRelation [_1#0, _2#1]
> h1: Int = -1384236508
> h2: Int = 939549189
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495779#comment-16495779
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 5/30/18 10:12 PM:
---

[~eje] I agree. I will give it a shot and try to exhaust the options; I will 
see if I can come up with pros and cons, etc.

This also blocks the affinity/anti-affinity work.


was (Author: skonto):
[~eje] I agree. I will give it a shot and try to exhaust the options; I will 
see if I can come up with pros and cons, etc.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495779#comment-16495779
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

[~eje] I agree. I will give it a shot and try to exhaust the options; I will 
see if I can come up with pros and cons, etc.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table

2018-05-30 Thread Bimalendu Choudhary (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bimalendu Choudhary updated SPARK-24316:

Summary: Spark sql queries stall for  column width more than 6k for parquet 
based table  (was: Spark sql queries stall for  column width more 6k for 
parquet based table)

> Spark sql queries stall for  column width more than 6k for parquet based table
> --
>
> Key: SPARK-24316
> URL: https://issues.apache.org/jira/browse/SPARK-24316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> When we create a table from a data frame using Spark SQL with around 6k 
> columns or more, even simple queries fetching 70k rows take 20 minutes, while 
> for the same table created through Hive with the same data, the same query 
> takes just 5 minutes.
>  
> Instrumenting the code, we see that the executors are looping in the while 
> loop of the function initializeInternal(). The majority of the time is spent 
> in the for loop in the code below, looping through the columns, and the 
> executor appears to be stalled for a long time.
>   
> {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
> private void initializeInternal() {
>   ..
>   for (int i = 0; i < requestedSchema.getFieldCount(); ++i) { ... }
> }
> {code}
>  
> When Spark SQL creates the table, it also stores the metadata in 
> TBLPROPERTIES in JSON format. We see that if we remove this metadata from the 
> table, the queries become fast, which is the case when we create the same 
> table through Hive. The exact same table takes 5 times more time with the 
> JSON metadata than without it.
>  
> So it looks like as the number of columns grows beyond 5 to 6k, processing 
> and comparing the metadata becomes more and more expensive and the 
> performance degrades drastically.
> To recreate the problem simply run the following query:
> import org.apache.spark.sql.SparkSession
> val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
>  resp_data.write.format("csv").save("/tmp/filename")
>  
> The table should be created by spark sql from dataframe so that the Json meta 
> data is stored. For ex:-
> val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")
> dff.createOrReplaceTempView("my_temp_table")
>  val tmp = spark.sql("Create table tableName stored as parquet as select * 
> from my_temp_table")
>  
>  
> from pyspark.sql import SQLContext 
>  sqlContext = SQLContext(sc) 
>  resp_data = spark.sql( " select * from test").limit(2000) 
>  print resp_data_fgv_1k.count() 
>  (resp_data_fgv_1k.write.option('header', 
> False).mode('overwrite').csv('/tmp/2.csv') ) 
>  
>  
> The performance seems to be even slower in the loop if the schema does not 
> match or the fields are empty and the code goes into the if condition where 
> the missing column is marked true:
> missingColumns[i] = true;
>  
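
A quick way to look at the table metadata the description refers to (my own
sketch; the table name is a placeholder) is to dump the table properties, where
the JSON schema is stored under the spark.sql.sources.schema.* keys:

{code}
// Sketch: inspect the JSON schema metadata Spark SQL stores for a table it
// created from a DataFrame; for very wide tables this payload gets large.
spark.sql("SHOW TBLPROPERTIES tableName").show(100, false)
{code}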



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-30 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495695#comment-16495695
 ] 

Li Jin commented on SPARK-24373:


[~smilegator] Thank you for the suggestion.

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-30 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495691#comment-16495691
 ] 

Xiao Li commented on SPARK-24373:
-

[~icexelloss] This is still possible since the query plans are changed. I am 
also fine with doing it without a flag. If you apply the fix to your internal 
fork, I would suggest adding a flag. At least then you can turn it off if 
anything unexpected happens.

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see it prints "". 
> In the above example, when you do explain, you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain()
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24436) Add large dataset to examples sub-directory.

2018-05-30 Thread Varun Vishwanathan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495684#comment-16495684
 ] 

Varun Vishwanathan commented on SPARK-24436:


Going to add a parquet file with the data.

> Add large dataset to examples sub-directory.
> 
>
> Key: SPARK-24436
> URL: https://issues.apache.org/jira/browse/SPARK-24436
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.3.1
>Reporter: Varun Vishwanathan
>Priority: Trivial
>
> Fire Incidents from San Francisco is a large dataset that may help developers 
> work with more data when fixing bugs or developing features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-30 Thread gagan taneja (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gagan taneja updated SPARK-24437:
-
Attachment: Screen Shot 2018-05-30 at 2.07.22 PM.png

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (the Spark Thrift Server).
> With each query execution requiring a Broadcast Join, an UnsafeHashedRelation is 
> registered for cleanup in ContextCleaner. This reference to the 
> UnsafeHashedRelation is held by some other collection and never becomes 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-30 Thread gagan taneja (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gagan taneja updated SPARK-24437:
-
Attachment: Screen Shot 2018-05-30 at 2.05.40 PM.png

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (the Spark Thrift Server).
> With each query execution requiring a Broadcast Join, an UnsafeHashedRelation is 
> registered for cleanup in ContextCleaner. This reference to the 
> UnsafeHashedRelation is held by some other collection and never becomes 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-05-30 Thread gagan taneja (JIRA)
gagan taneja created SPARK-24437:


 Summary: Memory leak in UnsafeHashedRelation
 Key: SPARK-24437
 URL: https://issues.apache.org/jira/browse/SPARK-24437
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: gagan taneja


There seems to be a memory leak with 
org.apache.spark.sql.execution.joins.UnsafeHashedRelation.

We have a long-running instance of STS (the Spark Thrift Server).

With each query execution requiring a Broadcast Join, an UnsafeHashedRelation is 
registered for cleanup in ContextCleaner. This reference to the 
UnsafeHashedRelation is held by some other collection and never becomes 
eligible for GC, so ContextCleaner is not able to clean it.
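
As a rough illustration only (the table names, sizes, and loop count below are made up rather than taken from the report), this is the kind of repeated broadcast-join workload on a long-running service that registers one UnsafeHashedRelation per execution with the ContextCleaner:

{code:java}
import org.apache.spark.sql.functions.broadcast

// Hypothetical tables; a real STS workload would query Hive/metastore tables instead.
val small = spark.range(1000).withColumnRenamed("id", "k")
val big   = spark.range(1000000).withColumnRenamed("id", "k")

(1 to 100).foreach { _ =>
  // Each execution builds and broadcasts a fresh hashed relation for `small`.
  big.join(broadcast(small), "k").count()
}
// Comparing heap dumps before and after such a loop is one way to check whether
// the UnsafeHashedRelation instances ever become eligible for GC.
{code}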



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24436) Add large dataset to examples sub-directory.

2018-05-30 Thread Varun Vishwanathan (JIRA)
Varun Vishwanathan created SPARK-24436:
--

 Summary: Add large dataset to examples sub-directory.
 Key: SPARK-24436
 URL: https://issues.apache.org/jira/browse/SPARK-24436
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 2.3.1
Reporter: Varun Vishwanathan


Fire Incidents from San Francisco is a large dataset that may help developers 
work with more data when fixing bugs or developing features.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495642#comment-16495642
 ] 

Erik Erlandson commented on SPARK-24434:


[~skonto] given the number of ideas that have gotten tossed around for this 
over time, an 'alternatives considered' section for a design doc will 
definitely be valuable

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495637#comment-16495637
 ] 

Yinan Li commented on SPARK-24434:
--

[~eje] That's a good question. I think we need to compare both and have a 
thorough discussion in the community once the design is out. There are pros and 
cons with each of them.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495619#comment-16495619
 ] 

Erik Erlandson commented on SPARK-24091:


If we support user-supplied yaml, that may become a source of config map 
specifications

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This warrants further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495615#comment-16495615
 ] 

Erik Erlandson commented on SPARK-24434:


Is the template-based solution being explicitly favored over other options, 
e.g. pod presets or webhooks, etc?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-24435.

Resolution: Duplicate

> Support user-supplied YAML that can be merged with k8s pod descriptions
> ---
>
> Key: SPARK-24435
> URL: https://issues.apache.org/jira/browse/SPARK-24435
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: features, kubernetes
> Fix For: 2.4.0
>
>
> Kubernetes supports a large variety of configurations to Pods. Currently only 
> some of these are configurable from Spark, and they all operate by being 
> plumbed from --conf arguments through to pod creation in the code.
> To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
> through Spark configuration keywords, the community is interested in working 
> out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
> merged with the pod configurations created by the kube backend.
> Multiple solutions have been considered, including Pod Pre-sets and loading 
> Pod template objects.  A requirement is that the policy for how user-supplied 
> YAML interacts with the configurations created by the kube back-end must be 
> easy to reason about, and also that whatever kubernetes features the solution 
> uses are supported on the kubernetes roadmap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24435) Support user-supplied YAML that can be merged with k8s pod descriptions

2018-05-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-24435:
--

 Summary: Support user-supplied YAML that can be merged with k8s 
pod descriptions
 Key: SPARK-24435
 URL: https://issues.apache.org/jira/browse/SPARK-24435
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Erik Erlandson
 Fix For: 2.4.0


Kubernetes supports a large variety of configurations to Pods. Currently only 
some of these are configurable from Spark, and they all operate by being 
plumbed from --conf arguments through to pod creation in the code.

To avoid the anti-pattern of trying to expose an unbounded Pod feature set 
through Spark configuration keywords, the community is interested in working 
out a sane way of allowing users to supply "arbitrary" Pod YAML which can be 
merged with the pod configurations created by the kube backend.

Multiple solutions have been considered, including Pod Pre-sets and loading Pod 
template objects.  A requirement is that the policy for how user-supplied YAML 
interacts with the configurations created by the kube back-end must be easy to 
reason about, and also that whatever kubernetes features the solution uses are 
supported on the kubernetes roadmap.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495601#comment-16495601
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 5/30/18 7:55 PM:
--

[~liyinan926] I will work on a design document. At first glance the K8s pod 
spec could be enough, or at least a subset of it.


was (Author: skonto):
[~liyinan926] I will work on a design document. From a first glance K8s pod 
spec could be enough or a subset of it.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495601#comment-16495601
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

[~liyinan926] I will work on a design document. From a first glance K8s pod 
spec could be enough or a subset of it.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24434) Support user-specified driver and executor pod templates

2018-05-30 Thread Yinan Li (JIRA)
Yinan Li created SPARK-24434:


 Summary: Support user-specified driver and executor pod templates
 Key: SPARK-24434
 URL: https://issues.apache.org/jira/browse/SPARK-24434
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Yinan Li


With more requests for customizing the driver and executor pods coming, the 
current approach of adding new Spark configuration options has some serious 
drawbacks: 1) it means more Kubernetes specific configuration options to 
maintain, and 2) it widens the gap between the declarative model used by 
Kubernetes and the configuration model used by Spark. We should start designing 
a solution that allows users to specify pod templates as central places for all 
customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-05-30 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495580#comment-16495580
 ] 

Ted Yu commented on SPARK-18057:


There are compilation errors in KafkaTestUtils.scala against Kafka 1.0.1 
release.

What's the preferred way forward:

* use reflection in KafkaTestUtils.scala
* create external/kafka-1-0-sql module

Thanks

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24433) Add Spark R support

2018-05-30 Thread Yinan Li (JIRA)
Yinan Li created SPARK-24433:


 Summary: Add Spark R support
 Key: SPARK-24433
 URL: https://issues.apache.org/jira/browse/SPARK-24433
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Yinan Li


This is the ticket to track work on adding support for R binding into the 
Kubernetes mode. The feature is available in our fork at 
github.com/apache-spark-on-k8s/spark and needs to be upstreamed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24432) Support for dynamic resource allocation

2018-05-30 Thread Yinan Li (JIRA)
Yinan Li created SPARK-24432:


 Summary: Support for dynamic resource allocation
 Key: SPARK-24432
 URL: https://issues.apache.org/jira/browse/SPARK-24432
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Yinan Li


This is an umbrella ticket for work on adding support for dynamic resource 
allocation into the Kubernetes mode. This requires a Kubernetes-specific 
external shuffle service. The feature is available in our fork at 
github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24432) Add support for dynamic resource allocation

2018-05-30 Thread Yinan Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li updated SPARK-24432:
-
Summary: Add support for dynamic resource allocation  (was: Support for 
dynamic resource allocation)

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24417) Build and Run Spark on JDK9+

2018-05-30 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24417:

Description: 
This is an umbrella JIRA for Apache Spark to support Java 9+

As Java 8 is going away soon, Java 11 will be LTS and GA this Sep, and many 
companies are testing Java 9 or Java 10 to prepare for Java 11, so it's best to 
start the traumatic process of supporting newer versions of Java in Apache Spark 
as a background activity. 

The subtasks are what have to be done to support Java 9+.

  was:
This is an umbrella JIRA for Apache Spark to support JDK9+

As Java 8 is going way soon, JDK11 will be LTS and GA this Sep, and companies 
are testing JDK9 or JDK10 to prepare for JDK11. It's best to start the 
traumatic process of supporting newer version of JDK in Apache Spark as a 
background activity. 

The subtasks are what have to be done to support JDK9+.


> Build and Run Spark on JDK9+
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> This is an umbrella JIRA for Apache Spark to support Java 9+
> As Java 8 is going away soon, Java 11 will be LTS and GA this Sep, and many 
> companies are testing Java 9 or Java 10 to prepare for Java 11, so it's best to 
> start the traumatic process of supporting newer versions of Java in Apache 
> Spark as a background activity. 
> The subtasks are what have to be done to support Java 9+.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23901) Data Masking Functions

2018-05-30 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23901.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21246
[https://github.com/apache/spark/pull/21246]

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23901) Data Masking Functions

2018-05-30 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23901:
-

Assignee: Marco Gaido

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23161.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21413
[https://github.com/apache/spark/pull/21413]

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23161:


Assignee: Huaxin Gao

> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24384.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21426
[https://github.com/apache/spark/pull/21426]

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> When the given Python file is a .py file (a zip file seems fine), the Python 
> path seems to be added dynamically only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24384) spark-submit --py-files with .py files doesn't work in client mode before context initialization

2018-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24384:
--

Assignee: Hyukjin Kwon

> spark-submit --py-files with .py files doesn't work in client mode before 
> context initialization
> 
>
> Key: SPARK-24384
> URL: https://issues.apache.org/jira/browse/SPARK-24384
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> When the given Python file is a .py file (a zip file seems fine), the Python 
> path seems to be added dynamically only after the context is initialized.
> with this pyFile:
> {code}
> $ cat /home/spark/tmp.py
> def testtest():
> return 1
> {code}
> This works:
> {code}
> $ cat app.py
> import pyspark
> pyspark.sql.SparkSession.builder.getOrCreate()
> import tmp
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> ...
> 1
> {code}
> but this doesn't:
> {code}
> $ cat app.py
> import pyspark
> import tmp
> pyspark.sql.SparkSession.builder.getOrCreate()
> print("%s" % tmp.testtest())
> $ ./bin/spark-submit --master yarn --deploy-mode client --py-files 
> /home/spark/tmp.py app.py
> Traceback (most recent call last):
>   File "/home/spark/spark/app.py", line 2, in 
> import tmp
> ImportError: No module named tmp
> {code}
> See 
> https://issues.apache.org/jira/browse/SPARK-21945?focusedCommentId=16488486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16488486



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-30 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24419.
-
Resolution: Fixed

> Upgrade SBT to 0.13.17 with Scala 2.10.7
> 
>
> Key: SPARK-24419
> URL: https://issues.apache.org/jira/browse/SPARK-24419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23754:
-
Fix Version/s: 2.3.1

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24369) A bug when having multiple distinct aggregations

2018-05-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24369.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21443
[https://github.com/apache/spark/pull/21443]

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24369) A bug when having multiple distinct aggregations

2018-05-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24369:
---

Assignee: Takeshi Yamamuro

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-05-30 Thread Xinyong Tian (JIRA)
Xinyong Tian created SPARK-24431:


 Summary: wrong areaUnderPR calculation in 
BinaryClassificationEvaluator 
 Key: SPARK-24431
 URL: https://issues.apache.org/jira/browse/SPARK-24431
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Xinyong Tian


My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select the 
best model. When the regParam in logistic regression is very high, no variable 
will be selected (no model), i.e. every row's prediction is the same, e.g. equal 
to the event rate (baseline frequency). But at this point 
BinaryClassificationEvaluator gives the areaUnderPR its highest value. As a 
result the best model selected is a "no model".

The reason is the following. With no model, the precision-recall curve has only 
two points: at recall = 0, precision should be set to zero, while the software 
sets it to 1; at recall = 1, precision is the event rate. As a result, the 
areaUnderPR will be close to 0.5 (my event rate is very low), which is the maximum.

The solution is to set precision = 0 when recall = 0.
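
As a sketch of the degenerate case described above, using the underlying MLlib BinaryClassificationMetrics API with made-up numbers (5% positives, every row scored identically):

{code:java}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Made-up data: 5% positives, and every row gets the same score (the "no model" case).
val eventRate = 0.05
val scoreAndLabel = sc.parallelize(
  (1 to 1000).map(i => (0.5, if (i <= (1000 * eventRate).toInt) 1.0 else 0.0)))

val metrics = new BinaryClassificationMetrics(scoreAndLabel)
// On the affected version, with precision taken as 1.0 at recall = 0, the area is
// roughly (1 + eventRate) / 2, i.e. about 0.525 here; setting precision = 0 at
// recall = 0 instead would give roughly eventRate / 2.
println(metrics.areaUnderPR())
println(metrics.pr().collect().mkString(", "))  // the two-point precision-recall curve
{code}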



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24430) CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'.

2018-05-30 Thread Volodymyr Glushak (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495374#comment-16495374
 ] 

Volodymyr Glushak commented on SPARK-24430:
---

I wonder if that is legal code and whether this request should be moved to the HIVE JIRA.

> CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'.
> 
>
> Key: SPARK-24430
> URL: https://issues.apache.org/jira/browse/SPARK-24430
> Project: Spark
>  Issue Type: Request
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Volodymyr Glushak
>Priority: Major
>
> When I execute the following SQL statement:
> {code}
> spark.sql('CREATE VIEW view_12 AS
> SELECT * FROM (
>   SELECT * FROM table1
>   UNION ALL
>   SELECT * FROM table2
> ) UT')
> {code}
>  
> It successfully creates a view in HIVE, which I can query via Apache Spark.
> However, if I try to query the same view directly via HIVE, I get an 
> error:
> {code}
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED: ParseException line 6:6 Failed to recognize predicate 
> 'UNION'.
> Failed rule: 'identifier' in subquery source
> {code}
>  
> *Investigation*
> Under the hood, Spark generates the following SQL statement for this view:
> {code}
> CREATE VIEW `view_12` AS
> SELECT *
> FROM (SELECT * FROM
>     (
>    (SELECT * FROM (SELECT * FROM `db1`.`table1`) AS 
> gen_subquery_0)
>      UNION ALL
>     (SELECT * FROM (SELECT  * FROM `db1`.`tabl2`) AS 
> gen_subquery_1)
>     ) AS UT
>   ) AS UT
> {code}
> If I try to execute this statement in HIVE, it fails for the same reason.
> The easiest way to fix it is to unwrap the outer queries on lines 5 and 7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24430) CREATE VIEW with UNION statement: Failed to recognize predicate 'UNION'.

2018-05-30 Thread Volodymyr Glushak (JIRA)
Volodymyr Glushak created SPARK-24430:
-

 Summary: CREATE VIEW with UNION statement: Failed to recognize 
predicate 'UNION'.
 Key: SPARK-24430
 URL: https://issues.apache.org/jira/browse/SPARK-24430
 Project: Spark
  Issue Type: Request
  Components: Spark Core, SQL
Affects Versions: 2.2.1
Reporter: Volodymyr Glushak


When I execute the following SQL statement:

{code}
spark.sql('CREATE VIEW view_12 AS
SELECT * FROM (
  SELECT * FROM table1
  UNION ALL
  SELECT * FROM table2
) UT')
{code}

It successfully creates a view in HIVE, which I can query via Apache Spark.

However, if I try to query the same view directly via HIVE, I get an error:

{code}
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: ParseException line 6:6 Failed to recognize predicate 'UNION'.
Failed rule: 'identifier' in subquery source
{code}

*Investigation*

Under the hood, Spark generates the following SQL statement for this view:

{code}
CREATE VIEW `view_12` AS
SELECT *
FROM (SELECT * FROM
    (
   (SELECT * FROM (SELECT * FROM `db1`.`table1`) AS gen_subquery_0)
     UNION ALL
    (SELECT * FROM (SELECT  * FROM `db1`.`tabl2`) AS gen_subquery_1)
    ) AS UT
  ) AS UT
{code}

If I try to execute this statement in HIVE, it fails for the same reason.

The easiest way to fix it is to unwrap the outer queries on lines 5 and 7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495268#comment-16495268
 ] 

Apache Spark commented on SPARK-23754:
--

User 'e-dorigatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/21463

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24428:

Affects Version/s: (was: 2.3.0)
   2.4.0

> Remove unused code and fix any related doc in K8s module
> 
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> The target is to clean up anything that is not used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Priority: Critical  (was: Major)

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example there should be 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495224#comment-16495224
 ] 

Thomas Graves commented on SPARK-24415:
---

OK, so the issue here is in the AppStatusListener, where it only updates the 
task metrics for liveStages. It gets the second taskEnd event after it has 
cancelled stage 2, so the stage is no longer in the live stages array.

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example there should be 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885
 ] 

Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 2:20 PM:
--

I've done some experiments locally, but the results show that in the above case 
we don't seem to guarantee that the data distribution is the same when reading 
the two tables {{a1}} and {{a2}}. By removing the shuffle, you actually don't get 
the correct results.

That said, for FileScan nodes at (1) and (2):
{code:java}
*(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code}

FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are 
partitioned by {{key#28}}. But it doesn't guarantee that the same values of 
{{key#25}} and {{key#28}} are located in the same partition.


was (Author: viirya):
I've done some experiments locally. But the results show that in above case 
seems we don't guarantee that the data distribution is the same when reading 
two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't get 
the correct results.

That said, for FileScan nodes at (1) and (2):
{code:java}
*(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code}

FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are 
partitioned by {{key#28}}. But it doesn't guarantee that the values of 
{{key#25}} and {{key#28}} are located in the same partition.

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to aggregate further. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (as the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> For the join query we get the "SortMergeJoin":
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> But for aggregation after the union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) 

[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495204#comment-16495204
 ] 

Thomas Graves commented on SPARK-24415:
---

It also looks like in the history server they show up properly in the 
aggregated metrics, although if you look at the Tasks (for all stages) column 
on the jobs page, it only lists a single task failure where it should list 2.

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN with task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example there should be 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24429:

Description: 
 

Right now in cluster mode only extraJavaOptions targeting the executor are set. 

According to the implementation and the docs:

"In client mode, this config must not be set through the {{SparkConf}} directly 
in your application, because the driver JVM has already started at that point. 
Instead, please set this through the {{--driver-java-options}} command line 
option or in your default properties file."

A typical driver launch in cluster mode will eventually use client mode to run 
the Spark-submit and looks like:

"/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* 
-Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf 
spark.driver.bindAddress=9.0.7.116 --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 1"

Also, the entrypoint.sh file does not manage the driver's Java 
opts. 

We propose to set an env var to pass the extra Java opts to the driver (as is 
done for the executor), rename the env vars in the container (the one for 
the executor is a bit misleading), and use --driver-java-options to pass the 
required options.

 

  was:
 

Right now in cluster mode only extraJavaOptions targeting the executor are set. 

According to the implementation and the docs:

"In client mode, this config must not be set through the {{SparkConf}} directly 
in your application, because the driver JVM has already started at that point. 
Instead, please set this through the {{--driver-java-options}} command line 
option or in your default properties file."

A typical driver launch in cluster mode will eventually use client mode to run 
the Spark-submit and looks like:

"/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* 
-Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf 
spark.driver.bindAddress=9.0.7.116 --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 1"

Also at the entrypoint.sh file there is no management of the driver's java 
Opts. 

We propose to set an env var to pass the extra java opts to the driver (like in 
the case of the executor), rename the env vars in the container as the one for 
the executor is a bit misleading, and use --driver-java-options to pass the 
required options.

 


> Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
> --
>
> Key: SPARK-24429
> URL: https://issues.apache.org/jira/browse/SPARK-24429
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
>  
> Right now in cluster mode only extraJavaOptions targeting the executor are 
> set. 
> According to the implementation and the docs:
> "In client mode, this config must not be set through the {{SparkConf}} 
> directly in your application, because the driver JVM has already started at 
> that point. Instead, please set this through the {{--driver-java-options}} 
> command line option or in your default properties file."
> A typical driver launch in cluster mode will eventually use client mode to 
> run the Spark-submit and looks like:
> "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit 
> --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal 1"
> Also at the entrypoint.sh file there is no management of the driver's java 
> opts. 
> We propose to set an env var to pass the extra java opts to the driver (like 
> in the case of the executor), rename the env vars in the container as the one 
> for the executor is a bit misleading, and use --driver-java-options to pass 
> the required options.
>  






[jira] [Updated] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24429:

Description: 
 

Right now in cluster mode only extraJavaOptions targeting the executor are set. 

According to the implementation and the docs:

"In client mode, this config must not be set through the {{SparkConf}} directly 
in your application, because the driver JVM has already started at that point. 
Instead, please set this through the {{--driver-java-options}} command line 
option or in your default properties file."

A typical driver launch in cluster mode will eventually use client mode to run 
the Spark-submit and looks like:

"/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* 
-Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf 
spark.driver.bindAddress=9.0.7.116 --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 1"

Also at the entrypoint.sh file there is no management of the driver's java 
Opts. 

We propose to set an env var to pass the extra java opts to the driver (like in 
the case of the executor), rename the env vars in the container as the one for 
the executor is a bit misleading, and use --driver-java-options to pass the 
required options.

 

  was:
 

Right now in cluster mode only extraJavaOptions targeting the executor are set. 

According to the implementation and the docs:

"In client mode, this config must not be set through the {{SparkConf}} directly 
in your application, because the driver JVM has already started at that point. 
Instead, please set this through the {{--driver-java-options}} command line 
option or in your default properties file."

A typical driver call now is of the form:

"/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* 
-Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf 
spark.driver.bindAddress=9.0.7.116 --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 1"

Also at the entrypoint.sh file there is no management of the driver's java 
Opts. 

We propose to set an env var to pass the extra java opts to the driver (like in 
the case of the executor), rename the env vars in the container as the one for 
the executor is a bit misleading, and use --driver-java-options to pass the 
required options.

 


> Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s
> --
>
> Key: SPARK-24429
> URL: https://issues.apache.org/jira/browse/SPARK-24429
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
>  
> Right now in cluster mode only extraJavaOptions targeting the executor are 
> set. 
> According to the implementation and the docs:
> "In client mode, this config must not be set through the {{SparkConf}} 
> directly in your application, because the driver JVM has already started at 
> that point. Instead, please set this through the {{--driver-java-options}} 
> command line option or in your default properties file."
> A typical driver launch in cluster mode will eventually use client mode to 
> run the Spark-submit and looks like:
> "/usr/lib/jvm/java-1.8-openjdk/bin/java -cp 
> /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit 
> --deploy-mode client --conf spark.driver.bindAddress=9.0.7.116 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal 1"
> Also at the entrypoint.sh file there is no management of the driver's java 
> Opts. 
> We propose to set an env var to pass the extra java opts to the driver (like 
> in the case of the executor), rename the env vars in the container as the one 
> for the executor is a bit misleading, and use --driver-java-options to pass 
> the required options.
>  






[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495200#comment-16495200
 ] 

Thomas Graves commented on SPARK-24415:
---

This might actually be an ordering-of-events issue. You will note that the 
config I have is stage.maxFailedTasksPerExecutor=1, so it should really only 
have 1 failed task, but looking at the log it seems it starts the second task 
before fully handling the blacklist from the first failure:

 

18/05/30 13:57:20 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, 
gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 0, PROCESS_LOCAL, 7746 
bytes)
[Stage 2:> (0 + 1) / 10]18/05/30 13:57:20 INFO BlockManagerMasterEndpoint: 
Registering block manager gsrd259n13.red.ygrid.yahoo.com:43203 with 912.3 MB 
RAM, BlockManagerId(2, gsrd259n13.red.ygrid.yahoo.com, 43203, None)
18/05/30 13:57:21 INFO Client: Application report for 
application_1526529576371_25524 (state: RUNNING)
18/05/30 13:57:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
gsrd259n13.red.ygrid.yahoo.com:43203 (size: 1941.0 B, free: 912.3 MB)
18/05/30 13:57:21 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, 
gsrd259n13.red.ygrid.yahoo.com, executor 2, partition 1, PROCESS_LOCAL, 7747 
bytes)
18/05/30 13:57:21 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
gsrd259n13.red.ygrid.yahoo.com, executor 2): java.lang.RuntimeException: Bad 
executor



18/05/30 13:57:21 INFO TaskSetBlacklist: Blacklisting executor 2 for stage 2

18/05/30 13:57:21 INFO YarnScheduler: Cancelling stage 2
18/05/30 13:57:21 INFO YarnScheduler: Stage 2 was cancelled
18/05/30 13:57:21 INFO DAGScheduler: ShuffleMapStage 2 (map at :26) 
failed in 12.063 s due to Job aborted due to stage failure:

18/05/30 13:57:21 INFO DAGScheduler: Job 1 failed: collect at :26, 
took 12.069052 s

 

The thing is, though, that the executors page shows that it had 2 task failures 
on that node; it's just the aggregated metrics for that stage that don't 
reflect it.
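
For reference, the stage/application blacklist settings from the reproduce command 
can also be set programmatically. A minimal sketch (the conf keys are copied 
verbatim from the spark-shell command; everything else is illustrative):

{code:scala}
import org.apache.spark.SparkConf

// Programmatic equivalent of the --conf flags used in the reproduce command above.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "1")
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "1")
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
{code}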

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should have 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()






[jira] [Created] (SPARK-24429) Add support for spark.driver.extraJavaOptions in cluster mode for Spark on K8s

2018-05-30 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24429:
---

 Summary: Add support for spark.driver.extraJavaOptions in cluster 
mode for Spark on K8s
 Key: SPARK-24429
 URL: https://issues.apache.org/jira/browse/SPARK-24429
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Stavros Kontopoulos


 

Right now in cluster mode only extraJavaOptions targeting the executor are set. 

According to the implementation and the docs:

"In client mode, this config must not be set through the {{SparkConf}} directly 
in your application, because the driver JVM has already started at that point. 
Instead, please set this through the {{--driver-java-options}} command line 
option or in your default properties file."

A typical driver call now is of the form:

"/usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* 
-Xmx1g org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf 
spark.driver.bindAddress=9.0.7.116 --properties-file 
/opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi 
spark-internal 1"

Also at the entrypoint.sh file there is no management of the driver's java 
Opts. 

We propose to set an env var to pass the extra java opts to the driver (like in 
the case of the executor), rename the env vars in the container as the one for 
the executor is a bit misleading, and use --driver-java-options to pass the 
required options.
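
From the user's side the expectation is simply that the usual conf key also takes 
effect in cluster mode on K8s. A minimal, hedged sketch of how that would be 
supplied through the launcher API (master URL, jar path and option values are 
illustrative placeholders, not taken from this ticket):

{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Illustrative only: pass driver JVM options as a normal Spark conf and let the
// K8s cluster-mode submission propagate them to the driver container.
val handle = new SparkLauncher()
  .setMaster("k8s://https://example.com:6443")          // placeholder master URL
  .setDeployMode("cluster")
  .setMainClass("org.apache.spark.examples.SparkPi")
  .setAppResource("local:///opt/spark/examples/jars/spark-examples.jar") // placeholder
  .setConf("spark.driver.extraJavaOptions", "-XX:+UseG1GC -Dsample.flag=1")
  .startApplication()
{code}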

 






[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24428:

Summary: Remove unused code and fix any related doc in K8s module  (was: 
Remove unused code and fix any related doc)

> Remove unused code and fix any related doc in K8s module
> 
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc in K8s module

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24428:

Description: 
There are some relics of previous refactoring like: 
[https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]

Target is to cleanup anything not used.

 

  was:
There are some relics of previous refactoring: like: 
[https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]

Target is to cleanup anything not used.

 


> Remove unused code and fix any related doc in K8s module
> 
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Assigned] (SPARK-24428) Remove unused code and fix any related doc

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24428:


Assignee: (was: Apache Spark)

> Remove unused code and fix any related doc
> --
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Commented] (SPARK-24428) Remove unused code and fix any related doc

2018-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495184#comment-16495184
 ] 

Apache Spark commented on SPARK-24428:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/21462

> Remove unused code and fix any related doc
> --
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Assigned] (SPARK-24428) Remove unused code and fix any related doc

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24428:


Assignee: Apache Spark

> Remove unused code and fix any related doc
> --
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Description: 
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

import org.apache.spark.SparkEnv 

sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 
1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()

  was:
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

 

sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 
1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
executor") else (x % 3, x)  }.reduceByKey((a, b) => a + b).collect()


> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should have 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()






[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Description: 
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

 

sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt >= 
1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
executor") else (x % 3, x)  }.reduceByKey((a, b) => a + b).collect()

  was:
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map\{ x => | if (SparkEnv.get.executorId.toInt 
>= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
executor") | else (x % 3, x) | }.reduceByKey((a, b) => a + b).collect()


> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should have 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
>  
>  
> sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x)  }.reduceByKey((a, b) => a + b).collect()






[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Description: 
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map\{ x => | if (SparkEnv.get.executorId.toInt 
>= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
executor") | else (x % 3, x) | }.reduceByKey((a, b) => a + b).collect()

  was:
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct. In my example it should have 2 
failed tasks but it only shows one. Note I tested with the master branch to 
verify it's not fixed.

I will attach a screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map

{ x => | if (SparkEnv.get.executorId.toInt >= 1 && 
SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad executor") 
| else (x % 3, x) | }

.reduceByKey((a, b) => a + b).collect()


> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should have 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it's not fixed.
> I will attach a screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
>  
> sc.parallelize(1 to 1, 10).map\{ x => | if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") | else (x % 3, x) | }.reduceByKey((a, b) => a + b).collect()






[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24428:

Description: 
There are some relics of previous refactoring: like: 
[https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]

Target is to cleanup anything not used.

 

  was:
There are some relics of previous refactoring: like: 
[https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]

Target is to cleanup anything no used.

 


> Remove unused code and fix any related doc
> --
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything not used.
>  






[jira] [Updated] (SPARK-24428) Remove unused code and fix any related doc

2018-05-30 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24428:

Summary: Remove unused code and fix any related doc  (was: Remove unused 
code and Fix docs)

> Remove unused code and fix any related doc
> --
>
> Key: SPARK-24428
> URL: https://issues.apache.org/jira/browse/SPARK-24428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> There are some relics of previous refactoring: like: 
> [https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]
> Target is to cleanup anything no used.
>  






[jira] [Created] (SPARK-24428) Remove unused code and Fix docs

2018-05-30 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24428:
---

 Summary: Remove unused code and Fix docs
 Key: SPARK-24428
 URL: https://issues.apache.org/jira/browse/SPARK-24428
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


There are some relics of previous refactoring: like: 
[https://github.com/apache/spark/blob/9e7bad0edd9f6c59c0af21c95e5df98cf82150d3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L63]

Target is to cleanup anything no used.

 






[jira] [Created] (SPARK-24427) Spark 2.2 - Exception occurred while saving table in spark. Multiple sources found for parquet

2018-05-30 Thread Ashok Rai (JIRA)
Ashok Rai created SPARK-24427:
-

 Summary:  Spark 2.2 - Exception occurred while saving table in 
spark. Multiple sources found for parquet 
 Key: SPARK-24427
 URL: https://issues.apache.org/jira/browse/SPARK-24427
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.2.0
Reporter: Ashok Rai


We are getting the below error while loading into a Hive table. In our code, we use 
"saveAsTable", which as per the documentation automatically chooses the format 
that the table was created with. We have now tested by creating the table as 
Parquet as well as ORC. In both cases the same error occurred.

 

-

2018-05-29 12:25:07,433 ERROR [main] ERROR - Exception occurred while saving 
table in spark.

 org.apache.spark.sql.AnalysisException: Multiple sources found for parquet 
(org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat, 
org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please 
specify the fully qualified class name.;
 at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:584)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:111)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) 
~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) 
~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) 
~[scala-library-2.11.8.jar:?]
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 ~[scala-library-2.11.8.jar:?]
 at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) 
~[scala-library-2.11.8.jar:?]
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
 ~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at scala.collection.immutable.List.foreach(List.scala:381) 
~[scala-library-2.11.8.jar:?]
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) 
~[spark-catalyst_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67) 
~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
 ~[spark-sql_2.11-2.2.0.2.6.4.25-1.jar:2.2.0.2.6.4.25-1]
 at 
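
The error text itself points at a possible workaround: name the data source by its 
fully qualified class instead of the ambiguous short name "parquet". A minimal 
sketch of what that looks like (it may or may not help in this exact saveAsTable 
path; the SparkSession, DataFrame and table name below are illustrative 
placeholders):

{code:scala}
// Assumes an existing SparkSession named `spark`; data and table name are placeholders.
// The class name is the one listed in the error message above.
val df = spark.range(10).toDF("id")
df.write
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .mode("overwrite")
  .saveAsTable("parquet_fqcn_demo")
{code}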

[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL

2018-05-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495068#comment-16495068
 ] 

Marco Gaido commented on SPARK-24395:
-

The main issue here is that {{(null, null) = (1, 2)}} in Spark evaluates to 
{{false}}, while on other DBs it evaluates to null. So we should revisit all our 
comparisons for structs.
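
A minimal sketch to observe the comparison described above (assumes an existing 
SparkSession named {{spark}}; the struct literals are only illustrative):

{code:scala}
// Compare a struct of NULL fields against a struct of literals, mirroring the
// (null, null) = (1, 2) case discussed above.
spark.sql(
  "SELECT struct(CAST(NULL AS INT), CAST(NULL AS INT)) = struct(1, 2) AS eq"
).show()
// Per the comment above, Spark reports false here, while other DBs would report NULL.
{code}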

> Fix Behavior of NOT IN with Literals Containing NULL
> 
>
> Key: SPARK-24395
> URL: https://issues.apache.org/jira/browse/SPARK-24395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Spark does not return the correct answer when evaluating NOT IN in some 
> cases. For example:
> {code:java}
> CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
>   (null, null)
>   AS m(a, b);
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;{code}
> According to the semantics of null-aware anti-join, this should return no 
> rows. However, it actually returns the row {{NULL NULL}}. This was found by 
> inspecting the unit tests added for SPARK-24381 
> ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822).]
> *Acceptance Criteria*:
>  * We should be able to add the following test cases back to 
> {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}:
> {code:java}
>   -- Case 2
>   -- (subquery contains a row with null in all columns -> row not returned)
> SELECT *
> FROM   m
> WHERE  (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1;
>   -- Case 3
>   -- (probe-side columns are all null -> row not returned)
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL -- Matches only (null, null)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;
>   -- Case 4
>   -- (one column null, other column matches a row in the subquery result -> 
> row not returned)
> SELECT *
> FROM   m
> WHERE  b = 1.0 -- Matches (null, 1.0)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1; 
> {code}
>  
> cc [~smilegator] [~juliuszsompolski]






[jira] [Updated] (SPARK-24426) Unexpected combination of cache and join on DataFrame

2018-05-30 Thread Krzysztof Skulski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krzysztof Skulski updated SPARK-24426:
--
Description: 
I get unexpected results when I cache a DataFrame and try to do another 
grouping on it. New DataFrames based on the cached groupBy DataFrame work ok, but 
when I try to join them to another DataFrame it seems like the second join adds a 
new column whose data is copied from the first joined DataFrame. Example below 
(userAgentType is ok, userChannelType is ok, userOrigin is not ok). 
When I remove the cache from the aggregated DataFrame it works ok.

 
{code:scala}
 val aggregated = dataFrame.cache()

 val userAgentType = aggregated.groupBy("id", "agentType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("agentType").as("agentType"))
 val userChannelType = aggregated.groupBy("id", "channelType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("channelType").as("channelType"))

val userOrigin =  userInfo
   .join(userAgentType, Seq("id"), "left")
   .join(userChannelType, Seq("id"), "left")
{code}

  was:
I get unexpected results when I cache a DataFrame and try to do another 
grouping on it. New DataFrames based on the cached groupBy DataFrame work ok, but 
when I try to join them to another DataFrame it seems like the second join adds a 
new column whose data is copied from the first joined DataFrame. Example below 
(userAgentType is ok, userChannelType is ok, userOrigin is not ok). 
When I remove the cache from the aggregated DataFrame it works ok.

 
{code}
 val aggregated = dataFrame.cache()

 val userAgentType = aggregated.groupBy("id", "agentType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("agentType").as("agentType"))
 val userChannelType = aggregated.groupBy("id", "channelType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("channelType").as("channelType"))

val userOrigin =  userInfo
   .join(userAgentType, Seq("id"), "left")
   .join(userChannelType, Seq("id"), "left")
{code}


> Unexpected combination of cache and join on DataFrame
> -
>
> Key: SPARK-24426
> URL: https://issues.apache.org/jira/browse/SPARK-24426
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Krzysztof Skulski
>Priority: Major
>
> I get unexpected results when I cache a DataFrame and try to do another 
> grouping on it. New DataFrames based on the cached groupBy DataFrame work ok, 
> but when I try to join them to another DataFrame it seems like the second join 
> adds a new column whose data is copied from the first joined DataFrame. Example 
> below (userAgentType is ok, userChannelType is ok, userOrigin is not ok). 
> When I remove the cache from the aggregated DataFrame it works ok.
>  
> {code:scala}
>  val aggregated = dataFrame.cache()
>  val userAgentType = aggregated.groupBy("id", "agentType").count()
>.orderBy(asc("id"), 
> desc("count")).groupBy("id").agg(first("agentType").as("agentType"))
>  val userChannelType = aggregated.groupBy("id", "channelType").count()
>.orderBy(asc("id"), 
> desc("count")).groupBy("id").agg(first("channelType").as("channelType"))
> val userOrigin =  userInfo
>.join(userAgentType, Seq("id"), "left")
>.join(userChannelType, Seq("id"), "left")
> {code}






[jira] [Created] (SPARK-24426) Unexpected combination of cache and join on DataFrame

2018-05-30 Thread Krzysztof Skulski (JIRA)
Krzysztof Skulski created SPARK-24426:
-

 Summary: Unexpected combination of cache and join on DataFrame
 Key: SPARK-24426
 URL: https://issues.apache.org/jira/browse/SPARK-24426
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Krzysztof Skulski


I get unexpected results when I cache a DataFrame and try to do another 
grouping on it. New DataFrames based on the cached groupBy DataFrame work ok, but 
when I try to join them to another DataFrame it seems like the second join adds a 
new column whose data is copied from the first joined DataFrame. Example below 
(userAgentType is ok, userChannelType is ok, userOrigin is not ok). 
When I remove the cache from the aggregated DataFrame it works ok.

 
{code}
 val aggregated = dataFrame.cache()

 val userAgentType = aggregated.groupBy("id", "agentType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("agentType").as("agentType"))
 val userChannelType = aggregated.groupBy("id", "channelType").count()
   .orderBy(asc("id"), 
desc("count")).groupBy("id").agg(first("channelType").as("channelType"))

val userOrigin =  userInfo
   .join(userAgentType, Seq("id"), "left")
   .join(userChannelType, Seq("id"), "left")
{code}
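
A small diagnostic sketch for the behaviour described above (names follow the 
example; the check itself is illustrative and not part of the original report):

{code:scala}
import org.apache.spark.sql.functions.col

// If the second join really just copies the column from the first join, then
// agentType and channelType never differ in userOrigin.
val suspicious = userOrigin
  .where(col("agentType") =!= col("channelType"))
  .count()
// With the reported behaviour, `suspicious` comes out as 0 even when the underlying
// data has users whose dominant agentType and channelType differ.
{code}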






[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494975#comment-16494975
 ] 

Apache Spark commented on SPARK-23754:
--

User 'e-dorigatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/21461

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case






[jira] [Assigned] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23754:


Assignee: Emilio Dorigatti

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case






[jira] [Resolved] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23754.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21383

Backporting is needed.

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case






[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885
 ] 

Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 8:41 AM:
--

I've done some experiments locally. But the results show that in the above case 
we don't seem to guarantee that the data distribution is the same when reading 
the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't 
get the correct results.

That said, for FileScan nodes at (1) and (2):
{code:java}
*(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code}

FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are 
partitioned by {{key#28}}. But it doesn't guarantee that the values of 
{{key#25}} and {{key#28}} are located in the same partition.


was (Author: viirya):
I've done some experiments locally. But the results show that in the above case 
we don't seem to guarantee that the data distribution is the same when reading 
the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't 
get the correct results.

That said, for FileScan nodes at (1) and (2):
{code:java}
*(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code}

FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are 
partitioned by {{key#28}}. But it doesn't guarantee that the values of 
{{key#25}} and {{key#28}} are located in the same partition.

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is of a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator).
> For example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project 

[jira] [Comment Edited] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885
 ] 

Liang-Chi Hsieh edited comment on SPARK-24410 at 5/30/18 8:41 AM:
--

I've done some experiments locally. But the results show that in the above case 
we don't seem to guarantee that the data distribution is the same when reading 
the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't 
get the correct results.

That said, for FileScan nodes at (1) and (2):
{code:java}
*(1) FileScan parquet default.a1[key#25L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*(2) FileScan parquet default.a2[key#28L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code}

FileScan (1)'s rows are partitioned by {{key#25}} and FileScan (2)'s rows are 
partitioned by {{key#28}}. But it doesn't guarantee that the values of 
{{key#25}} and {{key#28}} are located in the same partition.


was (Author: viirya):
I've done some experiments locally. But the results show that in the above case 
we don't seem to guarantee that the data distribution is the same when reading 
the two tables {{a1}} and {{a2}}. By removing the shuffling, you actually don't 
get the correct results.

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is of a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator).
> For example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494885#comment-16494885
 ] 

Liang-Chi Hsieh commented on SPARK-24410:
-

I've done some experiments locally. The results show that in the above case we 
don't guarantee that the data distribution is the same when reading the two 
tables {{a1}} and {{a2}}. If the shuffle is removed, you don't actually get 
correct results.

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24425) Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required

2018-05-30 Thread Unai Sarasola (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494864#comment-16494864
 ] 

Unai Sarasola commented on SPARK-24425:
---

Totally agree with you, Sam. It's up to the developer to decide whether to 
maintain or optimize the number of partitions.

This makes it impossible to back up your files in HDFS using Spark (I know that 
is not Spark's main task, but it's a limitation we didn't have with previous 
versions). I'm also playing with the same properties mentioned by [~sams], 
without luck.

> Regression from 1.6 to 2.x - Spark no longer respects input partitions, 
> unnecessary shuffle required
> 
>
> Key: SPARK-24425
> URL: https://issues.apache.org/jira/browse/SPARK-24425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: sam
>Priority: Major
>
> I think this is a regression. We used to be able to easily control the number 
> of output files / tasks based on num files and coalesce. Now I have to use 
> `repartition` to get the desired num files / partitions which is 
> unnecessarily expensive.
> I've tried playing with spark.sql.files.maxPartitionBytes and 
> spark.sql.files.openCostInBytes to see if I can force the conventional 
> behaviour.
> {code:java}
> val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
>  .config("spark.sql.files.maxPartitionBytes", 1)
>  .config("spark.sql.files.openCostInBytes", Long.MaxValue)
> {code}
> This didn't work. Spark just squashes all my parquet files into fewer 
> partitions.
> Suggest a simple `option` on DataFrameReader that can disable this (or enable 
> it, default behaviour should be same as 1.6).
>  
> This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if 
> SPARK-5997 was implemented this ticket wouldn't really be necessary.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-05-30 Thread sam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494852#comment-16494852
 ] 

sam commented on SPARK-20144:
-

Regarding the original issue of sorting, I agree with [~srowen] that it 
should be up to the user to explicitly ask for sorted data. This is because, 
fundamentally, Spark implements the MapReduce programming paradigm, which is 
defined in terms of multisets. [~icexelloss] Please read 
[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf]
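
For example (a minimal sketch, assuming a running {{spark}} session and a hypothetical {{ts}} column to sort by), a job that depends on ordering can ask for it explicitly after the read:
{code}
// Ordering must be requested explicitly; the read itself guarantees none.
val df = spark.read.parquet("/path/to/data")
val globallySorted = df.orderBy("ts")              // total order, adds a shuffle
val locallySorted  = df.sortWithinPartitions("ts") // cheaper: order within each partition only
{code}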

Regarding my issue of Spark reducing the number of partitions without any ask 
from the user I've created a separate issue: 
https://issues.apache.org/jira/browse/SPARK-24425

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24425) Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required

2018-05-30 Thread sam (JIRA)
sam created SPARK-24425:
---

 Summary: Regression from 1.6 to 2.x - Spark no longer respects 
input partitions, unnecessary shuffle required
 Key: SPARK-24425
 URL: https://issues.apache.org/jira/browse/SPARK-24425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
Reporter: sam


I think this is a regression. We used to be able to easily control the number 
of output files / tasks based on num files and coalesce. Now I have to use 
`repartition` to get the desired num files / partitions which is unnecessarily 
expensive.

I've tried playing with spark.sql.files.maxPartitionBytes and 
spark.sql.files.openCostInBytes to see if I can force the conventional 
behaviour.
{code:java}
val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
 .config("spark.sql.files.maxPartitionBytes", 1)
 .config("spark.sql.files.openCostInBytes", Long.MaxValue)
{code}
This didn't work. Spark just squashes all my parquet files into fewer partitions.

Suggest a simple `option` on DataFrameReader that can disable this (or enable 
it, default behaviour should be same as 1.6).

 

This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if 
SPARK-5997 was implemented this ticket wouldn't really be necessary.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-30 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494832#comment-16494832
 ] 

Ruben Berenguel commented on SPARK-23904:
-

[~igreenfi] that's what I mean, removing the code (no-op = no operation). I 
don't get an OOM due to this string being generated; all the OOMs I manage to 
get are due to overly large tree plans in Catalyst (which seems expected; I 
have tried more than 10 times already with different settings).

You mentioned on StackOverflow that even after removing spark.ui = true, the 
string was still sent through the listenerBus? Where and how have you seen this 
happening? I guess we are doing something differently, and I can't figure out 
what it is in order to reproduce your OOM.
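
For reference, a minimal sketch of how the UI can be taken out of the picture in such an experiment (assuming a local session; the app name and master are arbitrary):
{code}
// With the web UI and event log disabled, neither of them can be the consumer
// of the generated physical-plan description string.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("spark-23904-repro")          // arbitrary
  .master("local[*]")
  .config("spark.ui.enabled", "false")
  .config("spark.eventLog.enabled", "false")
  .getOrCreate()
{code}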

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I 
> don't need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-05-30 Thread Unai Sarasola (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494810#comment-16494810
 ] 

Unai Sarasola commented on SPARK-20144:
---

But if you want to have an exact copy of your data in HDFS, for example, this 
would make it impossible for Spark to maintain an exact copy of the files in 
both cases.

Is that right?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494806#comment-16494806
 ] 

Apache Spark commented on SPARK-23442:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21460

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this Dataframe (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23442:


Assignee: Apache Spark

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Assignee: Apache Spark
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this Dataframe (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23442:


Assignee: (was: Apache Spark)

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this Dataframe (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-30 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494771#comment-16494771
 ] 

Dilip Biswal commented on SPARK-24424:
--

[~smilegator] Thank you. I would like to give it a try.

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this syntax is not ANSI SQL compliant. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It would be nice to support the ANSI SQL syntax as well.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL

2018-05-30 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494769#comment-16494769
 ] 

Xiao Li commented on SPARK-24395:
-

I think Oracle returns a different answer. We should fix them. 
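
For context, a sketch of the three-valued-logic expansion behind the expected result (engine-agnostic, not Spark's actual rewrite):
{code}
// (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 1))))
// is equivalent to
//   NOT ( (a, b) = (0, 1.0) OR (a, b) = (2, 3.0) OR (a, b) = (4, NULL) )
// With a = NULL and b = NULL every equality evaluates to UNKNOWN, the OR is
// UNKNOWN, NOT UNKNOWN is still UNKNOWN, and WHERE keeps only TRUE rows,
// so the query should return no rows.
{code}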

> Fix Behavior of NOT IN with Literals Containing NULL
> 
>
> Key: SPARK-24395
> URL: https://issues.apache.org/jira/browse/SPARK-24395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Spark does not return the correct answer when evaluating NOT IN in some 
> cases. For example:
> {code:java}
> CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
>   (null, null)
>   AS m(a, b);
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1))));{code}
> According to the semantics of null-aware anti-join, this should return no 
> rows. However, it actually returns the row {{NULL NULL}}. This was found by 
> inspecting the unit tests added for SPARK-24381 
> ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822).]
> *Acceptance Criteria*:
>  * We should be able to add the following test cases back to 
> {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}:
> {code:java}
>   -- Case 2
>   -- (subquery contains a row with null in all columns -> row not returned)
> SELECT *
> FROM   m
> WHERE  (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1))));
>   -- Case 3
>   -- (probe-side columns are all null -> row not returned)
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL -- Matches only (null, null)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1))));
>   -- Case 4
>   -- (one column null, other column matches a row in the subquery result -> 
> row not returned)
> SELECT *
> FROM   m
> WHERE  b = 1.0 -- Matches (null, 1.0)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1)))); 
> {code}
>  
> cc [~smilegator] [~juliuszsompolski]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24423:

Description: 
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
 Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option.   
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
 However, this is not straightforward for end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
 Users are not allowed to specify query and dbtable at the same time. 

  was:
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option. 
  
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 However, this is not straightforward for end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
  
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Users are not allowed to specify query and dbtable at the same time. 


> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Normally, users do not fetch the whole JDBC table due to the poor 
> performance/throughput of JDBC. Thus, they normally just fetch a small set of 
> tables. For advanced users, they can pass a subquery as the option.   
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  However, this is not straightforward for end users. We should simply allow users 
> to specify the query by a new option `query`. We will handle the complexity 
> for them. 
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24423:

Description: 
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option. 
  
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 However, this is not straightforward for end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
  
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Users are not allowed to specify query and dbtable at the same time. 

  was:
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
 
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", "dbName.tableName")
  .options(jdbcCredentials: Map)
  .load()
 
Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option. 
 
val query = """ (select * from tableName limit 10) as tmp """
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", query)
  .options(jdbcCredentials: Map)
  .load()
 
However, this is not straightforward for end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
 
val query = """select * from tableName limit 10"""
val jdbcDf = spark.read
  .format("jdbc")
  .option("*{color:#ff}query{color}*", query)
  .options(jdbcCredentials: Map)
  .load()
 
Users are not allowed to specify query and dbtable at the same time. 


> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  Normally, users do not fetch the whole JDBC table due to the poor 
> performance/throughput of JDBC. Thus, they normally just fetch a small set of 
> tables. For advanced users, they can pass a subquery as the option. 
>   
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  However, this is not straightforward for end users. We should simply allow users 
> to specify the query by a new option `query`. We will handle the complexity 
> for them. 
>   
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-30 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494754#comment-16494754
 ] 

Xiao Li commented on SPARK-24424:
-

Also cc [~dkbiswal] 

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this syntax is not ANSI SQL compliant. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It would be nice to support the ANSI SQL syntax as well.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


