[jira] [Resolved] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey

2020-12-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31948.
--
Resolution: Not A Problem

> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} always perform 
> {{mapSideCombine}};
> However, this can sometimes be skipped, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>  
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at 
> all; similar places exist in {{KMeans}}, {{GMM}}, etc.
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor 
> isn't high;
>  
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the 
> {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>  
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data 
> shuffled. Instead, it forces a lot more objects to go into old gen, and leads 
> to worse GC.
> {quote}
>  
> So what about:
> 1. exposing {{mapSideCombine}} in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so 
> that users can disable unnecessary {{mapSideCombine}} (see the sketch below);
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in 
> {{treeAggregate}} and {{treeReduce}};
> 3. disabling the unnecessary {{mapSideCombine}} in ML;
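>  
> For illustration only (not part of the original report): until such a flag is 
> exposed on the shorthand APIs, a similar effect is already reachable through the 
> public {{combineByKeyWithClassTag}}, which does take a {{mapSideCombine}} flag. 
> A minimal sketch; the helper name is hypothetical:
> {code:java}
> import scala.reflect.ClassTag
> import org.apache.spark.HashPartitioner
> import org.apache.spark.rdd.RDD
> 
> // reduceByKey semantics, but shuffling raw records instead of building
> // map-side hash maps.
> def reduceByKeyNoMapSideCombine[K: ClassTag, V: ClassTag](
>     rdd: RDD[(K, V)], numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = {
>   rdd.combineByKeyWithClassTag[V](
>     (v: V) => v,                        // createCombiner: keep the first value
>     func,                               // mergeValue: runs on the reduce side here
>     func,                               // mergeCombiners: merge partial results
>     new HashPartitioner(numPartitions),
>     mapSideCombine = false)
> }
> {code}
> The {{reduceByKey}} above could then read 
> {{reduceByKeyNoMapSideCombine(perPartitionSummaries, numFeatures)(_.merge(_))}}, 
> where {{perPartitionSummaries}} is the {{mapPartitions}} output.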
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] 
> [~viirya]  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33791) grouping__id() result is not consistent with hive's version < 2.3

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251547#comment-17251547
 ] 

Apache Spark commented on SPARK-33791:
--

User 'sqlwindspeaker' has created a pull request for this issue:
https://github.com/apache/spark/pull/30836

> grouping__id() result is not consistent with hive's version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed the semantics of its grouping__id method in version 2.3.0. Spark 
> currently does not document this inconsistency with Hive, which may lead users 
> to believe their queries are safe to migrate from Hive 1.x to Spark, which is 
> wrong.
> I guess we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.
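>  
> For illustration, a hypothetical repro sketch (a table {{t}} with columns 
> {{key}} and {{value}} is assumed): the integer returned by {{grouping__id}} for 
> the same query differs between Hive < 2.3 and Spark, which follows the newer 
> Hive semantics.
> {code:java}
> spark.sql(
>   """SELECT key, value, grouping__id, count(*)
>     |FROM t
>     |GROUP BY key, value
>     |GROUPING SETS ((key, value), (key), ())""".stripMargin).show()
> {code}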



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33791) grouping__id() result is not consistent with hive's version < 2.3

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33791:


Assignee: (was: Apache Spark)

> grouping__id() result is not consistent with hive's version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed the semantics of its grouping__id method in version 2.3.0. Spark 
> currently does not document this inconsistency with Hive, which may lead users 
> to believe their queries are safe to migrate from Hive 1.x to Spark, which is 
> wrong.
> I guess we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33791) grouping__id() result is not consistent with hive's version < 2.3

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33791:


Assignee: Apache Spark

> grouping__id() result is not consistent with hive's version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Assignee: Apache Spark
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed the semantics of its grouping__id method in version 2.3.0. Spark 
> currently does not document this inconsistency with Hive, which may lead users 
> to believe their queries are safe to migrate from Hive 1.x to Spark, which is 
> wrong.
> I guess we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26341) Expose executor memory metrics at the stage level, in the Stages tab

2020-12-17 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-26341.

Fix Version/s: 3.2.0
   3.1.0
 Assignee: angerszhu
   Resolution: Fixed

This issue is resolved in https://github.com/apache/spark/pull/30573

> Expose executor memory metrics at the stage level, in the Stages tab
> 
>
> Key: SPARK-26341
> URL: https://issues.apache.org/jira/browse/SPARK-26341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Edward Lu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0, 3.2.0
>
>
> Sub-task SPARK-23431 will add stage level executor memory metrics (peak 
> values for each stage, and peak values for each executor for the stage). This 
> information should also be exposed in the web UI, so that users can see 
> which stages are memory intensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33836:
-
Affects Version/s: (was: 3.1.0)
   3.2.0

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading 
> from and writing to tables, but only on the Scala side, so only JVM languages 
> can leverage them.
> Given that there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.
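>  
> For reference, the two Scala-only APIs in question; a minimal sketch, assuming 
> a catalog table {{in_tbl}} exists and {{/tmp/ckpt}} is a usable checkpoint 
> location:
> {code:java}
> val df = spark.readStream.table("in_tbl")   // DataStreamReader.table
> val query = df.writeStream
>   .option("checkpointLocation", "/tmp/ckpt")
>   .toTable("out_tbl")                       // DataStreamWriter.toTable starts the query
> {code}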



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33825.
--
Resolution: Invalid

Let's ask questions on the dev mailing list before filing an issue. See also 
https://spark.apache.org/community.html

> Is Spark SQL able to auto update partition stats like hive by setting 
> hive.stats.autogather=true
> 
>
> Key: SPARK-33825
> URL: https://issues.apache.org/jira/browse/SPARK-33825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: yang
>Priority: Major
>  Labels: partitionStat
>
> `spark.sql.statistics.size.autoUpdate.enabled` only works for table-level 
> stats updates. But for partition stats, I can only update them with `ANALYZE 
> TABLE tablename PARTITION(part) COMPUTE STATISTICS`. So is Spark SQL able to 
> auto-update partition stats like Hive with hive.stats.autogather=true?
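>  
> For illustration, a sketch of the situation described above ({{tablename}} and 
> {{part}} are placeholders):
> {code:java}
> // Table-level stats can be refreshed automatically after each write:
> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
> // ...but partition-level stats still require a manual command:
> spark.sql("ANALYZE TABLE tablename PARTITION(part) COMPUTE STATISTICS")
> {code}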



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33825) Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33825:
-
Target Version/s:   (was: 3.0.1)

> Is Spark SQL able to auto update partition stats like hive by setting 
> hive.stats.autogather=true
> 
>
> Key: SPARK-33825
> URL: https://issues.apache.org/jira/browse/SPARK-33825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: yang
>Priority: Major
>  Labels: partitionStat
>
> `spark.sql.statistics.size.autoUpdate.enabled` only works for table-level 
> stats updates. But for partition stats, I can only update them with `ANALYZE 
> TABLE tablename PARTITION(part) COMPUTE STATISTICS`. So is Spark SQL able to 
> auto-update partition stats like Hive with hive.stats.autogather=true?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33791) grouping__id() result is not consistent with hive's version < 2.3

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251516#comment-17251516
 ] 

Hyukjin Kwon commented on SPARK-33791:
--

Can you document it in 
http://spark.apache.org/docs/latest/sql-migration-guide.html#compatibility-with-apache-hive?

> grouping__id() result is not consistent with hive's version < 2.3
> ---
>
> Key: SPARK-33791
> URL: https://issues.apache.org/jira/browse/SPARK-33791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 3.0.1
>Reporter: Su Qilong
>Priority: Minor
>
> See this 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup]
> Hive changed the semantics of its grouping__id method in version 2.3.0. Spark 
> currently does not document this inconsistency with Hive, which may lead users 
> to believe their queries are safe to migrate from Hive 1.x to Spark, which is 
> wrong.
> I guess we should note this difference in the Hive migration guide, and add a 
> configuration to let grouping__id use the Hive 1.x compatible algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33795) gapply fails execution with rbind error

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251515#comment-17251515
 ] 

Hyukjin Kwon commented on SPARK-33795:
--

[~n8shdw] can you check if this happens in Apache Spark instead of Databricks 
Runtime?

> gapply fails execution with rbind error
> ---
>
> Key: SPARK-33795
> URL: https://issues.apache.org/jira/browse/SPARK-33795
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
> Environment: Databricks runtime 7.3 LTS ML
>Reporter: MvR
>Priority: Major
> Attachments: Rerror.log
>
>
> Executing the following code on Databricks Runtime 7.3 LTS ML errors out with 
> an rbind error, whereas it executes successfully without Arrow enabled in the 
> Spark session. Full error message attached.
>  
> ```
> library(dplyr)
> library(SparkR)
> 
> SparkR::sparkR.session(sparkConfig =
>   list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> 
> mtcars %>%
>   SparkR::as.DataFrame() %>%
>   SparkR::gapply(x = .,
>     cols = c("cyl", "vs"),
>     func = function(key, data) {
>       dt <- data[, c("mpg", "qsec")]
>       res <- apply(dt, 2, mean)      # column means of mpg and qsec
>       df <- data.frame(firstGroupKey = key[1],
>                        secondGroupKey = key[2],
>                        mean_mpg = res[1],
>                        mean_qsec = res[2])   # res[2] is the qsec mean
>       return(df)
>     },
>     schema = structType(structField("cyl", "double"),
>                         structField("vs", "double"),
>                         structField("mpg_mean", "double"),
>                         structField("qsec_mean", "double"))
>   ) %>%
>   display()
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33826) InsertIntoHiveTable generate HDFS file with invalid user

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251513#comment-17251513
 ] 

Hyukjin Kwon commented on SPARK-33826:
--

[~AlberyZJG] are you able to share a self-contained reproducer so people can 
verify it easily?

> InsertIntoHiveTable generate HDFS file with invalid user
> 
>
> Key: SPARK-33826
> URL: https://issues.apache.org/jira/browse/SPARK-33826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> *Arch:* Hive on Spark.
>  
> *Version:* Spark 2.3.2
>  
> *Conf:*
> Enable user impersonation
> hive.server2.enable.doAs=true
>  
> *Scenario:*
> Thriftserver is running as login user A, and tasks run as user A too.
> A client executes SQL as user B.
>  
> Data generated by SQL like "insert into TABLE \[tbl\] select XXX from ." is 
> written to HDFS on the executor, and the executor doesn't know user B.
>  
> *{color:#de350b}So the owner of the files written to HDFS will be user A, when 
> it should be B.{color}*
>  
> I also checked the implementation of Spark 3.0.0; it could have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33833) Allow Spark Structured Streaming report Kafka Lag through Burrow

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251512#comment-17251512
 ] 

Hyukjin Kwon commented on SPARK-33833:
--

Looks like it leverages listeners. Can you use QueryExecutionListener instead?
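
For reference, a minimal sketch of the listener-based hook the linked article 
relies on; it only prints each source's end offsets from the progress events, 
and the Burrow integration itself is omitted:
{code:java}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class OffsetReportingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // For Kafka sources, endOffset is a JSON string of topic -> partition -> offset.
    event.progress.sources.foreach { s =>
      println(s"${s.description}: ${s.endOffset}")
    }
  }
}

spark.streams.addListener(new OffsetReportingListener)
{code}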

> Allow Spark Structured Streaming report Kafka Lag through Burrow
> 
>
> Key: SPARK-33833
> URL: https://issues.apache.org/jira/browse/SPARK-33833
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Sam Davarnia
>Priority: Major
>
> Because Structured Streaming tracks Kafka offset consumption by itself, 
> it is not possible to track total Kafka lag using Burrow the way we could with 
> DStreams.
> We have used stream hooks as mentioned 
> [here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37]
> It would be great if Spark supported this feature out of the box.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33836:


Assignee: Apache Spark

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading 
> from and writing to tables, but only on the Scala side, so only JVM languages 
> can leverage them.
> Given that there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33836:


Assignee: (was: Apache Spark)

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading 
> from and writing to tables, but only on the Scala side, so only JVM languages 
> can leverage them.
> Given that there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251510#comment-17251510
 ] 

Apache Spark commented on SPARK-33836:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30835

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading 
> from and writing to tables, but only on the Scala side, so only JVM languages 
> can leverage them.
> Given that there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251509#comment-17251509
 ] 

Hyukjin Kwon commented on SPARK-33836:
--

cc [~zero323] FYI

> Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark
> -
>
> Key: SPARK-33836
> URL: https://issues.apache.org/jira/browse/SPARK-33836
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading 
> from and writing to tables, but only on the Scala side, so only JVM languages 
> can leverage them.
> Given that there are lots of PySpark users, it would be great to expose these 
> public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33836) Expose DataStreamReader.table and DataStreamWriter.toTable to PySpark

2020-12-17 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-33836:


 Summary: Expose DataStreamReader.table and 
DataStreamWriter.toTable to PySpark
 Key: SPARK-33836
 URL: https://issues.apache.org/jira/browse/SPARK-33836
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Jungtaek Lim


In SPARK-32885 and SPARK-32896 we added two public APIs to enable reading from 
and writing to tables, but only on the Scala side, so only JVM languages can 
leverage them.

Given that there are lots of PySpark users, it would be great to expose these 
public APIs to PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.0.2

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Target Version/s: 2.4.8, 3.0.2, 3.1.0

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.1.3

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251502#comment-17251502
 ] 

Dongjoon Hyun commented on SPARK-33593:
---

Although this is not a regression, I marked this as a Blocker because this is a 
correctness issue.

cc [~hyukjin.kwon] and [~cloud_fan]

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.2.3

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Priority: Blocker  (was: Major)

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: (was: 3.0.0)
   2.3.4

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 2.4.7

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33817) Use a logical plan to cache instead of dataframe

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33817.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30815
[https://github.com/apache/spark/pull/30815]

> Use a logical plan to cache instead of dataframe
> 
>
> Key: SPARK-33817
> URL: https://issues.apache.org/jira/browse/SPARK-33817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.2.0
>
>
> When caching a query, we can use a logical plan instead of a DataFrame (the 
> current implementation) to avoid creating the DataFrame.
> This is also consistent with uncaching, which already uses a logical plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33817) Use a logical plan to cache instead of dataframe

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33817:
---

Assignee: Terry Kim

> Use a logical plan to cache instead of dataframe
> 
>
> Key: SPARK-33817
> URL: https://issues.apache.org/jira/browse/SPARK-33817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> When caching a query, we can use a logical plan instead of a DataFrame (the 
> current implementation) to avoid creating the DataFrame.
> This is also consistent with uncaching, which already uses a logical plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Affects Version/s: 3.2.0
   3.0.0
   3.0.1

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Parquet vector reader incorrect with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Labels: correctness  (was: )

> Parquet vector reader incorrect with binary partition value
> ---
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33593) Vector reader got incorrect data with binary partition value

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33593:
--
Summary: Vector reader got incorrect data with binary partition value  
(was: Parquet vector reader incorrect with binary partition value)

> Vector reader got incorrect data with binary partition value
> 
>
> Key: SPARK-33593
> URL: https://issues.apache.org/jira/browse/SPARK-33593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>  Labels: correctness
>
> {code:java}
> test("Parquet vector reader incorrect with binary partition value") {
>   Seq(false, true).foreach(tag => {
>     withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
>       withTable("t1") {
>         sql(
>           """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
>             | USING PARQUET PARTITIONED BY (part)""".stripMargin)
>         sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
>         if (tag) {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", ""))
>         } else {
>           checkAnswer(
>             sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
>             Row("a", "Spark SQL", "Spark SQL"))
>         }
>       }
>     }
>   })
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251482#comment-17251482
 ] 

Apache Spark commented on SPARK-33834:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33834:


Assignee: (was: Apache Spark)

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33834:


Assignee: Apache Spark

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251481#comment-17251481
 ] 

Apache Spark commented on SPARK-33834:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/30833

> Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
> --
>
> Key: SPARK-33834
> URL: https://issues.apache.org/jira/browse/SPARK-33834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251480#comment-17251480
 ] 

Apache Spark commented on SPARK-33489:
--

User 'Cactice' has created a pull request for this issue:
https://github.com/apache/spark/pull/30832

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got the below error when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert from the pyarrow null type to the PySpark NullType and vice versa.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33489:


Assignee: Apache Spark

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Assignee: Apache Spark
>Priority: Minor
>
> I got the below error when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert from the pyarrow null type to the PySpark NullType and vice versa.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33489) Support null for conversion from and to Arrow type

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33489:


Assignee: (was: Apache Spark)

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got the below error when using from_arrow_type() in pyspark.sql.pandas.types:
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types, so it seems possible to 
> convert from the pyarrow null type to the PySpark NullType and vice versa.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33835:


Assignee: Apache Spark

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Assignee: Apache Spark
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome
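>  
> For illustration, a sketch (in Scala for brevity; the real code lives in the 
> Java launcher module) of the fallback chain that a single firstNonEmpty-style 
> lookup would collapse; the names here are hypothetical:
> {code:java}
> def resolveJavaHome(explicitJavaHome: String): String =
>   Seq(explicitJavaHome, System.getenv("JAVA_HOME"), System.getProperty("java.home"))
>     .find(s => s != null && s.nonEmpty)
>     .getOrElse(sys.error("Cannot find a usable JDK; set JAVA_HOME"))
> {code}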



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33835:


Assignee: (was: Apache Spark)

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251479#comment-17251479
 ] 

Apache Spark commented on SPARK-33835:
--

User 'offthewall123' has created a pull request for this issue:
https://github.com/apache/spark/pull/30831

> Refactor AbstractCommandBuilder
> ---
>
> Key: SPARK-33835
> URL: https://issues.apache.org/jira/browse/SPARK-33835
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2020-12-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30186:

Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied.
> {code:java}
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>   // TODO migrate dynamic-partition-pruning onto adaptive execution.
>   sanityCheck(plan) &&
>     !plan.logicalLink.exists(_.isStreaming) &&
>     !plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined) &&
>     plan.children.forall(supportAdaptive)
> }
> {code}
> This means we cannot get the performance benefit of both AE and DPP at once.
> This ticket targets making DPP + AE work together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33835) Refactor AbstractCommandBuilder

2020-12-17 Thread Xudingyu (Jira)
Xudingyu created SPARK-33835:


 Summary: Refactor AbstractCommandBuilder
 Key: SPARK-33835
 URL: https://issues.apache.org/jira/browse/SPARK-33835
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 3.0.0
Reporter: Xudingyu


Refactor AbstractCommandBuilder: use firstNonEmpty to get javaHome



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33834) Verify ALTER TABLE CHANGE COLUMN with Char and Varchar

2020-12-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-33834:


 Summary: Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
 Key: SPARK-33834
 URL: https://issues.apache.org/jira/browse/SPARK-33834
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33831.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30828
[https://github.com/apache/spark/pull/30828]

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33831:
-

Assignee: Sean R. Owen

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33822:
--
Fix Version/s: 3.0.2

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.0.2, 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at 
> 
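
The repro above also implies a session-level workaround until the fix is picked up: run the query with AQE turned back off. A minimal Scala sketch, where the data and query paths are placeholders for the files used in the repro:

{code:scala}
import org.apache.spark.sql.SparkSession

object Q5WithoutAqe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]")
      .appName("q5-no-aqe").getOrCreate()
    // Revert to the non-adaptive path that succeeded in the repro above.
    spark.sql("SET spark.sql.adaptive.enabled=false")
    val q5 = spark.sparkContext.wholeTextFiles("data/query/q5.sql").take(1)(0)._2
    spark.sql(q5).show(1)
    spark.stop()
  }
}
{code}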

[jira] [Updated] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33831:
--
Issue Type: Bug  (was: Improvement)

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33797) Update SS doc about State Store and task locality

2020-12-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33797.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30789
[https://github.com/apache/spark/pull/30789]

> Update SS doc about State Store and task locality
> -
>
> Key: SPARK-33797
> URL: https://issues.apache.org/jira/browse/SPARK-33797
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.2.0
>
>
> While running some tests for structured streaming, state store locality 
> became an issue, and it is not very straightforward for end users. It'd be 
> great if we could document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33831) Update Jetty to 9.4.34

2020-12-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-33831:
-
Description: 
We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
possible CVE fix.

https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102


  was:
We should update Jetty to 9.4.35, from 9.4.28, to pick up fixes, plus a 
possible CVE fix.

https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.35.v20201120

Environment: (was: U)
Summary: Update Jetty to 9.4.34  (was: Update Jetty to 9.4.35)

> Update Jetty to 9.4.34
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
>Reporter: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.34, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251431#comment-17251431
 ] 

Apache Spark commented on SPARK-33822:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30830

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> 

[jira] [Commented] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251430#comment-17251430
 ] 

Apache Spark commented on SPARK-33822:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/30830

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> 

[jira] [Resolved] (SPARK-33824) Restructure and improve Python package management page

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33824.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30822
[https://github.com/apache/spark/pull/30822]

> Restructure and improve Python package management page
> --
>
> Key: SPARK-33824
> URL: https://issues.apache.org/jira/browse/SPARK-33824
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> I recently wrote a blog post (pending publication soon) about Python dependency 
> management.
> This JIRA aims to add some of the contents of the blog post to the PySpark 
> documentation for users.
> Please see the linked PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33824) Restructure and improve Python package management page

2020-12-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33824:


Assignee: Hyukjin Kwon

> Restructure and improve Python package management page
> --
>
> Key: SPARK-33824
> URL: https://issues.apache.org/jira/browse/SPARK-33824
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> I recently wrote a blog post (pending publication soon) about Python dependency 
> management.
> This JIRA aims to add some of the contents of the blog post to the PySpark 
> documentation for users.
> Please see the linked PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33822:
--
Target Version/s: 3.0.2, 3.1.0  (was: 3.2.0)

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at 
> 

[jira] [Updated] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33822:
--
Fix Version/s: (was: 3.0.2)

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.1.0
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at 
> 

[jira] [Assigned] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33822:
-

Assignee: Takeshi Yamamuro

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:392)
>   at 
> 

[jira] [Resolved] (SPARK-33822) TPCDS Q5 fails if spark.sql.adaptive.enabled=true

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33822.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30818
[https://github.com/apache/spark/pull/30818]

> TPCDS Q5 fails if spark.sql.adaptive.enabled=true
> -
>
> Key: SPARK-33822
> URL: https://issues.apache.org/jira/browse/SPARK-33822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 3.1.0, 3.0.2
>
>
> **PROBLEM STATEMENT**
> {code}
> >>> tables = ['call_center', 'catalog_page', 'catalog_returns', 
> >>> 'catalog_sales', 'customer', 'customer_address', 'customer_demographics', 
> >>> 'date_dim', 'household_demographics', 'income_band', 'inventory', 'item', 
> >>> 'promotion', 'reason', 'ship_mode', 'store', 'store_returns', 
> >>> 'store_sales', 'time_dim', 'warehouse', 'web_page', 'web_returns', 
> >>> 'web_sales', 'web_site']
> >>> for t in tables:
> ... spark.sql("CREATE TABLE %s USING PARQUET LOCATION 
> '/Users/dongjoon/data/10g/%s'" % (t, t))
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> +---++-+---+-+
> |channel|  id|sales|returns|   profit|
> +---++-+---+-+
> |   null|null|1143646603.07|30617460.71|-317540732.87|
> |catalog channel|null| 393609478.06| 9451732.79| -44801262.72|
> |catalog channel|catalog_pageA...| 0.00|   39037.48|-25330.29|
> ...
> +---++-+---+-+
> >>> sql("set spark.sql.adaptive.enabled=true")
> >>> spark.sql(spark.sparkContext.wholeTextFiles("/Users/dongjoon/data/query/q5.sql").take(1)[0][1]).show(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py",
>  line 440, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/pyspark/sql/utils.py",
>  line 128, in deco
> return f(*a, **kw)
>   File 
> "/Users/dongjoon/APACHE/spark-release/spark-3.0.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
> : java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
>   at 
> 

[jira] [Commented] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251408#comment-17251408
 ] 

Apache Spark commented on SPARK-33832:
--

User 'ekoifman' has created a pull request for this issue:
https://github.com/apache/spark/pull/30829

> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.
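
A minimal Scala sketch of how such a knob would be used; the conf name below is a placeholder until the PR settles on one:

{code:scala}
import org.apache.spark.sql.SparkSession

object ForceSkewMitigation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]")
      .appName("force-skew-mitigation").getOrCreate()
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    // Placeholder name; the actual conf is whatever the PR introduces.
    spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "true")
    // With the flag on, OptimizeSkewedJoin would apply even when skew
    // mitigation introduces an extra shuffle, e.g. for small join outputs.
    spark.stop()
  }
}
{code}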



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251409#comment-17251409
 ] 

Apache Spark commented on SPARK-33832:
--

User 'ekoifman' has created a pull request for this issue:
https://github.com/apache/spark/pull/30829

> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33832:


Assignee: Apache Spark

> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Apache Spark
>Priority: Major
>
> Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33832:


Assignee: (was: Apache Spark)

> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27620) Update jetty to 9.4.18.v20190429

2020-12-17 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-27620:
---

Assignee: Yuming Wang  (was: yuming.wang)

> Update jetty to 9.4.18.v20190429
> 
>
> Key: SPARK-27620
> URL: https://issues.apache.org/jira/browse/SPARK-27620
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Update jetty to 9.4.18.v20190429 because of 
> [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2020-12-17 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251403#comment-17251403
 ] 

Takeshi Yamamuro commented on SPARK-33828:
--

Thanks for letting me know, Dongjoon~

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA 
> issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by 
> default and to collect all information needed to do QA for this feature in 
> the Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33833) Allow Spark Structured Streaming to report Kafka Lag through Burrow

2020-12-17 Thread Sam Davarnia (Jira)
Sam Davarnia created SPARK-33833:


 Summary: Allow Spark Structured Streaming to report Kafka Lag through 
Burrow
 Key: SPARK-33833
 URL: https://issues.apache.org/jira/browse/SPARK-33833
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Sam Davarnia


Because Structured Streaming tracks Kafka offset consumption by itself, 
it is not possible to track total Kafka lag with Burrow the way it was with DStreams.

We have used stream hooks as described 
[here|https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37]

It would be great if Spark supported this feature out of the box.
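
A minimal Scala sketch of the listener-based workaround mentioned above; it only logs each source's endOffset JSON, while the linked article goes further and commits the offsets back to Kafka so Burrow can see them:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class LagReportingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON map of topic -> partition -> offset; an external
    // process can diff it against the broker's log-end offsets to get lag.
    event.progress.sources.foreach { src =>
      println(s"source=${src.description} endOffset=${src.endOffset}")
    }
  }
}

// Registration, assuming an active SparkSession named `spark`:
// spark.streams.addListener(new LagReportingListener)
{code}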

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33828) SQL Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33828:
--
Description: Since SPARK-31412 was delivered in 3.0.0, we have received and 
handled many JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to 
enable it by default and to collect all information needed to do QA for this 
feature in the Apache Spark 3.2.0 timeframe.  (was: Since SPARK-31412 is delivered 
at 3.0.0, we received and handled seen many JIRA issues at 3.0.x/3.1.0/3.2.0. 
This umbrella JIRA issue aims to enable it by default and collect all 
information in order to do QA for this feature in Apache Spark 3.2.0 timeframe.)

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA 
> issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by 
> default and to collect all information needed to do QA for this feature in 
> the Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Eugene Koifman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated SPARK-33832:
---
Description: 
Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a new 
shuffle.
There are situations where it's better to mitigate skew even if it means a new 
shuffle is added, for example if the join outputs a small amount of data.

As a first step I propose adding a SQLConf option to enable this.  

I'll open a PR shortly to get feedback on the approach.

  was:
Currently {{OptimizeSkewdJoin}} will not apply if skew mitigation causes a new 
shuffle.
There are situations where it's better to mitigate skew even if it means a new 
shuffle is added, for example if the join outputs a small amount of data.

As a first step I propose adding a SQLConf option to enable this.  

I'll open a PR shortly to get feedback on the approach.


> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently {{OptimizeSkewedJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Eugene Koifman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated SPARK-33832:
---
Description: 
Currently {{OptimizeSkewdJoin}} will not apply if skew mitigation causes a new 
shuffle.
There are situations where it's better to mitigate skew even if it means a new 
shuffle is added, for example if the join outputs a small amount of data.

As a first step I propose adding a SQLConf option to enable this.  

I'll open a PR shortly to get feedback on the approach.

  was:
Currently {{OptimizeSkewJoin}} will not apply if skew mitigation causes a new 
shuffle.
There are situations where it's better to mitigate skew even if it means a new 
shuffle is added, for example if the join outputs a small amount of data.

As a first step I propose adding a SQLConf option to enable this.  

I'll open a PR shortly to get feedback on the approach.


> Add an option in AQE to mitigate skew even if it causes a new shuffle
> --
>
> Key: SPARK-33832
> URL: https://issues.apache.org/jira/browse/SPARK-33832
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently {{OptimizeSkewdJoin}} will not apply if skew mitigation causes a 
> new shuffle.
> There are situations where it's better to mitigate skew even if it means a 
> new shuffle is added, for example if the join outputs a small amount of data.
> As a first step I propose adding a SQLConf option to enable this.  
> I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33832) Add an option in AQE to mitigate skew even if it causes a new shuffle

2020-12-17 Thread Eugene Koifman (Jira)
Eugene Koifman created SPARK-33832:
--

 Summary: Add an option in AQE to mitigate skew even if it causes 
a new shuffle
 Key: SPARK-33832
 URL: https://issues.apache.org/jira/browse/SPARK-33832
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Eugene Koifman


Currently {{OptimizeSkewJoin}} will not apply if skew mitigation causes a new 
shuffle.
There are situations where it's better to mitigate skew even if it means a new 
shuffle is added, for example if the join outputs a small amount of data.

As a first step I propose adding a SQLConf option to enable this.  

I'll open a PR shortly to get feedback on the approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33831) Update Jetty to 9.4.35

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33831:


Assignee: Apache Spark

> Update Jetty to 9.4.35
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
> Environment: U
>Reporter: Sean R. Owen
>Assignee: Apache Spark
>Priority: Minor
>
> We should update Jetty to 9.4.35, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.35.v20201120



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33831) Update Jetty to 9.4.35

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33831:


Assignee: (was: Apache Spark)

> Update Jetty to 9.4.35
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
> Environment: U
>Reporter: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.35, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.35.v20201120



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33831) Update Jetty to 9.4.35

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251365#comment-17251365
 ] 

Apache Spark commented on SPARK-33831:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/30828

> Update Jetty to 9.4.35
> --
>
> Key: SPARK-33831
> URL: https://issues.apache.org/jira/browse/SPARK-33831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.1
> Environment: U
>Reporter: Sean R. Owen
>Priority: Minor
>
> We should update Jetty to 9.4.35, from 9.4.28, to pick up fixes, plus a 
> possible CVE fix.
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
> https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.35.v20201120



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33831) Update Jetty to 9.4.35

2020-12-17 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-33831:


 Summary: Update Jetty to 9.4.35
 Key: SPARK-33831
 URL: https://issues.apache.org/jira/browse/SPARK-33831
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.0.1
 Environment: U
Reporter: Sean R. Owen


We should update Jetty to 9.4.35, from 9.4.28, to pick up fixes, plus a 
possible CVE fix.

https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.35.v20201120



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27620) Update jetty to 9.4.18.v20190429

2020-12-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-27620:
-
Priority: Minor  (was: Major)

> Update jetty to 9.4.18.v20190429
> 
>
> Key: SPARK-27620
> URL: https://issues.apache.org/jira/browse/SPARK-27620
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yuming.wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Update jetty to 9.4.18.v20190429 because of 
> [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251353#comment-17251353
 ] 

Apache Spark commented on SPARK-33827:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30827

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251354#comment-17251354
 ] 

Apache Spark commented on SPARK-33827:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30827

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33827:


Assignee: Apache Spark

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33827:


Assignee: (was: Apache Spark)

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33830:


Assignee: (was: Apache Spark)

> Describe the PURGE option of ALTER TABLE .. DROP PARTITION
> --
>
> Key: SPARK-33830
> URL: https://issues.apache.org/jira/browse/SPARK-33830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33830:


Assignee: Apache Spark

> Describe the PURGE option of ALTER TABLE .. DROP PARTITION
> --
>
> Key: SPARK-33830
> URL: https://issues.apache.org/jira/browse/SPARK-33830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251302#comment-17251302
 ] 

Apache Spark commented on SPARK-33830:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30826

> Describe the PURGE option of ALTER TABLE .. DROP PARTITION
> --
>
> Key: SPARK-33830
> URL: https://issues.apache.org/jira/browse/SPARK-33830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33829) ALTER TABLE ... RENAME TO should recreate cache for v2 tables.

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251301#comment-17251301
 ] 

Apache Spark commented on SPARK-33829:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30825

> ALTER TABLE ... RENAME TO should recreate cache for v2 tables.
> --
>
> Key: SPARK-33829
> URL: https://issues.apache.org/jira/browse/SPARK-33829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, ALTER TABLE ... RENAME TO does not recreate the cache for v2 
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33829) ALTER TABLE ... RENAME TO should recreate cache for v2 tables.

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33829:


Assignee: Apache Spark

> ALTER TABLE ... RENAME TO should recreate cache for v2 tables.
> --
>
> Key: SPARK-33829
> URL: https://issues.apache.org/jira/browse/SPARK-33829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Currently, ALTER TABLE ... RENAME TO does not recreate the cache for v2 
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33829) ALTER TABLE ... RENAME TO should recreate cache for v2 tables.

2020-12-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251300#comment-17251300
 ] 

Apache Spark commented on SPARK-33829:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30825

> ALTER TABLE ... RENAME TO should recreate cache for v2 tables.
> --
>
> Key: SPARK-33829
> URL: https://issues.apache.org/jira/browse/SPARK-33829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, ALTER TABLE ... RENAME TO does not recreate the cache for v2 
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33829) ALTER TABLE ... RENAME TO should recreate cache for v2 tables.

2020-12-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33829:


Assignee: (was: Apache Spark)

> ALTER TABLE ... RENAME TO should recreate cache for v2 tables.
> --
>
> Key: SPARK-33829
> URL: https://issues.apache.org/jira/browse/SPARK-33829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, ALTER TABLE ... RENAME TO does not recreate the cache for v2 
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PARTITION

2020-12-17 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33830:
---
Summary: Describe the PURGE option of ALTER TABLE .. DROP PARTITION  (was: 
Describe the PURGE option of ALTER TABLE .. DROP PATITION)

> Describe the PURGE option of ALTER TABLE .. DROP PARTITION
> --
>
> Key: SPARK-33830
> URL: https://issues.apache.org/jira/browse/SPARK-33830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33830) Describe the PURGE option of ALTER TABLE .. DROP PATITION

2020-12-17 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33830:
--

 Summary: Describe the PURGE option of ALTER TABLE .. DROP PATITION
 Key: SPARK-33830
 URL: https://issues.apache.org/jira/browse/SPARK-33830
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk
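
For reference, a sketch of the statement this ticket will document (assuming 
Hive-style semantics, where PURGE deletes the partition data immediately 
instead of moving it to the filesystem trash; the table and partition names 
below are illustrative):

{code:scala}
// Sketch only (e.g. in spark-shell, where `spark` is the active session):
// drop a partition and purge its data immediately; without PURGE the data
// of a Hive table would normally be moved to the trash.
spark.sql("ALTER TABLE sales DROP IF EXISTS PARTITION (dt = '2020-12-17') PURGE")
{code}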






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33829) ALTER TABLE ... RENAME TO should recreate cache for v2 tables.

2020-12-17 Thread Terry Kim (Jira)
Terry Kim created SPARK-33829:
-

 Summary: ALTER TABLE ... RENAME TO should recreate cache for v2 
tables.
 Key: SPARK-33829
 URL: https://issues.apache.org/jira/browse/SPARK-33829
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


Currently, ALTER TABLE ... RENAME TO does not recreate the cache for v2 tables.
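
A hypothetical repro sketch (the v2 catalog `testcat`, namespace, and provider 
below are assumptions for illustration, not names from this ticket):

{code:scala}
// Sketch, e.g. in spark-shell with a v2 catalog registered as "testcat":
// cache a v2 table, rename it, then check whether the new name is cached.
spark.sql("CREATE TABLE testcat.ns.old_t (id INT) USING foo")
spark.sql("CACHE TABLE testcat.ns.old_t")
spark.sql("ALTER TABLE testcat.ns.old_t RENAME TO ns.new_t")
// Today the cache is not recreated for testcat.ns.new_t; after the proposed
// change the renamed table should be cached again.
{code}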



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251274#comment-17251274
 ] 

L. C. Hsieh commented on SPARK-33827:
-

Another thought: can we just make the maintenance interval a public 
configuration and document it in the SS doc? By reducing the interval and 
increasing the locality wait, I think we should be able to handle the inactive 
store issue without a non-trivial change.

Increasing the locality wait reduces the possibility of having inactive state 
stores, and reducing the maintenance interval helps unload an inactive store 
quickly if one does occur.
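
To make that concrete, a minimal sketch of the two knobs (the maintenance 
interval key is currently internal/undocumented, so treat the exact names and 
values as assumptions):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: raise the locality wait so state store tasks tend to stay on
// their previous executors, and shorten the maintenance interval so inactive
// stores are unloaded sooner.
val spark = SparkSession.builder()
  .appName("state-store-locality-sketch")
  .config("spark.locality.wait", "10s") // default 3s
  .config("spark.sql.streaming.stateStore.maintenanceInterval", "10s") // default 60s
  .getOrCreate()
{code}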



> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251265#comment-17251265
 ] 

Dongjoon Hyun commented on SPARK-33828:
---

I created this JIRA to collect all community issues for the 3.2.0 release, cc 
[~hyukjin.kwon], [~maropu], [~cloud_fan]. I hope we won't need too many JIRA 
issues under this umbrella.
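
For reference, the switch this umbrella tracks enabling by default 
(`spark.sql.adaptive.enabled` is an existing SQL conf; the per-session setting 
below is only an illustration):

{code:scala}
// Opt in to Adaptive Query Execution for the current session; SPARK-33679
// tracks making this the default in 3.2.0.
spark.conf.set("spark.sql.adaptive.enabled", "true")
{code}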

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many 
> JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it 
> by default and to collect all the information needed to QA this feature in 
> the Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251264#comment-17251264
 ] 

L. C. Hsieh commented on SPARK-33827:
-

cc [~dbtsai] [~dongjoon]

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-33827:

Description: 
SS maintains state stores in executors across batches. Due to the nature of 
Spark scheduling, a state store might be allocated on another executor in the 
next batch, and the state store from the previous batch becomes inactive.

Currently we run a maintenance task periodically to unload inactive state 
stores, so there is some delay between a state store becoming inactive and it 
being unloaded.

Per the discussion on https://github.com/apache/spark/pull/30770 with 
[~kabhwan], I think the preference is to unload inactive state stores asap.

However, we can force Spark to always allocate a state store to the same 
executor by using the task locality configuration. This can reduce the 
possibility of having inactive state stores.

Normally, I think that with the locality configuration we would rarely see 
inactive state stores. There is still a chance that an executor fails and is 
reallocated, but in that case the inactive state store is lost too, so it is 
not an issue.

So unloading inactive stores asap is only useful when we don't use task 
locality to force state store locality across batches.

The required change to make state store management bi-directional between the 
driver and executors looks non-trivial. If we can already reduce the 
possibility of inactive stores, is it still worth making a non-trivial change 
here?







  was:
SS maintains state stores in executors across batches. Due to the nature of 
Spark scheduling, a state store might be allocated on another executor in the 
next batch, and the state store from the previous batch becomes inactive.

Currently we run a maintenance task periodically to unload inactive state 
stores, so there is some delay between a state store becoming inactive and it 
being unloaded.

Per the discussion on https://github.com/apache/spark/pull/30770 with 
[~kabhwan], I think the preference is to unload inactive state asap.

However, we can force Spark to always allocate a state store to the same 
executor by using the task locality configuration. This can reduce the 
possibility of having inactive state stores.

Normally, I think that with the locality configuration we would rarely see 
inactive state stores. There is still a chance that an executor fails and is 
reallocated, but in that case the inactive state store is lost too, so it is 
not an issue.

The required change to make state store management bi-directional between the 
driver and executors looks non-trivial. If we can already reduce the 
possibility of inactive stores, is it still worth making a non-trivial change 
here?








> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state stores asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> So unloading inactive stores asap is only useful when we don't use task 
> locality to force state store locality across batches.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33828) SQL Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33828:
--
Description: Since SPARK-31412 was delivered in 3.0.0, we have received and 
handled many JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims 
to enable it by default and to collect all the information needed to QA this 
feature in the Apache Spark 3.2.0 timeframe.  (was: SPARK-31412)

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many 
> JIRA issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it 
> by default and to collect all the information needed to QA this feature in 
> the Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33828) SQL Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33828:
--
Summary: SQL Adaptive Query Execution QA  (was: Adaptive Query Execution QA)

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> SPARK-31412



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33828) Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33828:
--
Description: SPARK-31412

> Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> SPARK-31412



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33823) Use the `CastSupport.cast` method in HashJoin

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-33823.
-

> Use the `CastSupport.cast` method in HashJoin
> -
>
> Key: SPARK-33823
> URL: https://issues.apache.org/jira/browse/SPARK-33823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at fixing the bug that throws an unsupported-operation 
> exception when running TPCDS q5 with AQE enabled (this option is now enabled 
> by default):
> {code}
> java.lang.UnsupportedOperationException: BroadcastExchange does not support 
> the execute() code path.
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
>   at 
> org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   ...
> {code}
> I've checked the AQE code and found that `EnsureRequirements` wrongly puts 
> `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` 
> phase, as follows:
> {code}
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> true] as bigint)),false), [id=#2183]
>   +- BroadcastQueryStage 2
> +- ReusedExchange [d_date_sk#1086], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), 
> [id=#1963]
> {code}
> The root cause is that the `Cast` expression in the required child 
> distribution has no `timeZoneId` (`timeZoneId=None`), while the `Cast` in 
> `child.outputPartitioning` does. This difference makes the distribution 
> requirement check fail in `EnsureRequirements`:
> https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50
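
A minimal sketch of the mismatch (catalyst internals; whether the two casts 
compare as equal depends on the Spark version, so treat this as an 
illustration of the reported behavior rather than a stable test):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast}
import org.apache.spark.sql.types.{IntegerType, LongType}

// Two casts of the same column that differ only in timeZoneId. If they do not
// compare as semantically equal, the required distribution never matches
// child.outputPartitioning and EnsureRequirements inserts the redundant
// BroadcastExchange shown above.
val col = AttributeReference("d_date_sk", IntegerType)()
val required = Cast(col, LongType)              // timeZoneId = None
val actual   = Cast(col, LongType, Some("UTC")) // timeZoneId set
assert(!required.semanticEquals(actual))
{code}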



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33828) Adaptive Query Execution QA

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33828:
--
Priority: Critical  (was: Major)

> Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33679) Enable spark.sql.adaptive.enabled true by default

2020-12-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33679:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Enable spark.sql.adaptive.enabled true by default
> -
>
> Key: SPARK-33679
> URL: https://issues.apache.org/jira/browse/SPARK-33679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33794) [ANSI mode][SQL] next_day function should throw runtime exception when receiving invalid input under ANSI mode

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33794:

Parent: SPARK-33275
Issue Type: Sub-task  (was: Improvement)

> [ANSI mode][SQL] next_day function should throw runtime exception when 
> receiving invalid input under ANSI mode
> --
>
> Key: SPARK-33794
> URL: https://issues.apache.org/jira/browse/SPARK-33794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
> According to [ANSI 
> compliance|https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html#ansi-compliance],
>  the [next_day 
> function|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3095]
>  should throw a runtime exception when receiving an invalid value for 
> dayOfWeek, for example receiving "xx" instead of "SUNDAY".
>  
> A similar improvement has been done on the element_at function: 
> https://issues.apache.org/jira/browse/SPARK-33386
>  
> If you agree with this proposition, I can submit a pull request with the 
> necessary change.
>  
> Kind regards,
>  
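
For illustration, the proposed behavior (the function and the ANSI flag are 
existing Spark features; the exception on invalid input is the proposal, not 
the current behavior):

{code:scala}
// Sketch, e.g. in spark-shell. 2020-12-17 is a Thursday.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT next_day(date'2020-12-17', 'SUNDAY')").show() // 2020-12-20
spark.sql("SELECT next_day(date'2020-12-17', 'xx')").show()
// currently: returns NULL; proposed under ANSI mode: runtime exception
{code}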



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33794) next_day function should throw runtime exception when receiving invalid input under ANSI mode

2020-12-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33794:

Summary: next_day function should throw runtime exception when receiving 
invalid input under ANSI mode  (was: [ANSI mode][SQL] next_day function should 
throw runtime exception when receiving invalid input under ANSI mode)

> next_day function should throw runtime exception when receiving invalid input 
> under ANSI mode
> -
>
> Key: SPARK-33794
> URL: https://issues.apache.org/jira/browse/SPARK-33794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
> According to [ANSI 
> compliance|https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html#ansi-compliance],
>  the [next_day 
> function|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3095]
>  should throw a runtime exception when receiving an invalid value for 
> dayOfWeek, for example receiving "xx" instead of "SUNDAY".
>  
> A similar improvement has been done on the element_at function: 
> https://issues.apache.org/jira/browse/SPARK-33386
>  
> If you agree with this proposition, I can submit a pull request with the 
> necessary change.
>  
> Kind regards,
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33827) Unload State Store asap once it becomes inactive

2020-12-17 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251262#comment-17251262
 ] 

L. C. Hsieh commented on SPARK-33827:
-

[~kabhwan] Thanks for the earlier discussion.

Unloading inactive state stores asap sounds reasonable to me, but after 
rethinking it, since we can reduce inactive state stores by setting task 
locality, is it worth making a non-trivial change for it?

WDYT?

> Unload State Store asap once it becomes inactive
> 
>
> Key: SPARK-33827
> URL: https://issues.apache.org/jira/browse/SPARK-33827
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> SS maintains state stores in executors across batches. Due to the nature of 
> Spark scheduling, a state store might be allocated on another executor in the 
> next batch, and the state store from the previous batch becomes inactive.
> Currently we run a maintenance task periodically to unload inactive state 
> stores, so there is some delay between a state store becoming inactive and it 
> being unloaded.
> Per the discussion on https://github.com/apache/spark/pull/30770 with 
> [~kabhwan], I think the preference is to unload inactive state asap.
> However, we can force Spark to always allocate a state store to the same 
> executor by using the task locality configuration. This can reduce the 
> possibility of having inactive state stores.
> Normally, I think that with the locality configuration we would rarely see 
> inactive state stores. There is still a chance that an executor fails and is 
> reallocated, but in that case the inactive state store is lost too, so it is 
> not an issue.
> The required change to make state store management bi-directional between the 
> driver and executors looks non-trivial. If we can already reduce the 
> possibility of inactive stores, is it still worth making a non-trivial change 
> here?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


