[jira] [Assigned] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35838:


Assignee: Apache Spark

> Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
> ---
>
> Key: SPARK-35838
> URL: https://issues.apache.org/jira/browse/SPARK-35838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
>  
> Execute
>  
> {code:java}
> mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl external/kafka-0-10-sql
> {code}
>  
> One Scala test run aborted; the error message is:
> {code:java}
> Discovery starting.
> Discovery completed in 857 milliseconds.
> Run starting. Expected test count is: 464
> ...
> KafkaRelationSuiteV2:
> - explicit earliest to latest offsets
> - default starting and ending offsets
> - explicit offsets
> - default starting and ending offsets with headers
> - timestamp provided for starting and ending
> - timestamp provided for starting, offset provided for ending
> - timestamp provided for ending, offset provided for starting
> - timestamp provided for starting, ending not provided
> - timestamp provided for ending, starting not provided
> - global timestamp provided for starting and ending
> - no matched offset for timestamp - startingOffsets
> - preferences on offset related options
> - no matched offset for timestamp - endingOffsets
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
>   ...
>   Cause: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   ...
> {code}
>  






[jira] [Assigned] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35838:


Assignee: (was: Apache Spark)

> Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
> ---
>
> Key: SPARK-35838
> URL: https://issues.apache.org/jira/browse/SPARK-35838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
>  
> Execute
>  
> {code:java}
> mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl external/kafka-0-10-sql
> {code}
>  
> One Scala test run aborted; the error message is:
> {code:java}
> Discovery starting.
> Discovery completed in 857 milliseconds.
> Run starting. Expected test count is: 464
> ...
> KafkaRelationSuiteV2:
> - explicit earliest to latest offsets
> - default starting and ending offsets
> - explicit offsets
> - default starting and ending offsets with headers
> - timestamp provided for starting and ending
> - timestamp provided for starting, offset provided for ending
> - timestamp provided for ending, offset provided for starting
> - timestamp provided for starting, ending not provided
> - timestamp provided for ending, starting not provided
> - global timestamp provided for starting and ending
> - no matched offset for timestamp - startingOffsets
> - preferences on offset related options
> - no matched offset for timestamp - endingOffsets
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
>   ...
>   Cause: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   ...
> {code}
>  






[jira] [Commented] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366418#comment-17366418
 ] 

Apache Spark commented on SPARK-35838:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32994

> Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
> ---
>
> Key: SPARK-35838
> URL: https://issues.apache.org/jira/browse/SPARK-35838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
>  
> Execute
>  
> {code:java}
> mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -Pscala-2.13 -pl external/kafka-0-10-sql
> {code}
>  
> One Scala test run aborted; the error message is:
> {code:java}
> Discovery starting.
> Discovery completed in 857 milliseconds.
> Run starting. Expected test count is: 464
> ...
> KafkaRelationSuiteV2:
> - explicit earliest to latest offsets
> - default starting and ending offsets
> - explicit offsets
> - default starting and ending offsets with headers
> - timestamp provided for starting and ending
> - timestamp provided for starting, offset provided for ending
> - timestamp provided for ending, offset provided for starting
> - timestamp provided for starting, ending not provided
> - timestamp provided for ending, starting not provided
> - global timestamp provided for starting and ending
> - no matched offset for timestamp - startingOffsets
> - preferences on offset related options
> - no matched offset for timestamp - endingOffsets
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
>   ...
>   Cause: java.lang.ClassNotFoundException: 
> scala.collection.parallel.TaskSupport
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
>   at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
>   at 
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
>   ...
> {code}
>  






[jira] [Created] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13

2021-06-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-35838:


 Summary: Ensure kafka-0-10-sql module can maven test independently 
in Scala 2.13
 Key: SPARK-35838
 URL: https://issues.apache.org/jira/browse/SPARK-35838
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.2.0
Reporter: Yang Jie


 

Execute

 
{code:java}
mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
-Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
-Pscala-2.13 -pl external/kafka-0-10-sql
{code}
 

One Scala test run aborted; the error message is:
{code:java}
Discovery starting.
Discovery completed in 857 milliseconds.
Run starting. Expected test count is: 464
...
KafkaRelationSuiteV2:
- explicit earliest to latest offsets
- default starting and ending offsets
- explicit offsets
- default starting and ending offsets with headers
- timestamp provided for starting and ending
- timestamp provided for starting, offset provided for ending
- timestamp provided for ending, offset provided for starting
- timestamp provided for starting, ending not provided
- timestamp provided for ending, starting not provided
- global timestamp provided for starting and ending
- no matched offset for timestamp - startingOffsets
- preferences on offset related options
- no matched offset for timestamp - endingOffsets
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at 
org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217)
  ...
  Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:788)
  at org.apache.spark.SparkContext.union(SparkContext.scala:1405)
  at 
org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697)
  ...

{code}
 
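A note on the failure above: in Scala 2.13 the parallel collections (including scala/collection/parallel/TaskSupport, which SparkContext.union uses) live in the separate org.scala-lang.modules:scala-parallel-collections module, so a module tested in isolation only sees that class if the dependency is on its own test classpath. A possible direction, sketched as a hedged example only (the exact change in external/kafka-0-10-sql/pom.xml and the version handling are assumptions, not necessarily what the actual fix does):

{code:xml}
<!-- Hypothetical sketch: declare the Scala 2.13 parallel-collections module as a
     test dependency so suites that hit SparkContext.union can load TaskSupport.
     The version is normally managed by the parent POM / scala-2.13 profile. -->
<dependency>
  <groupId>org.scala-lang.modules</groupId>
  <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
  <scope>test</scope>
</dependency>
{code}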






[jira] [Commented] (SPARK-35776) Check all year-month interval types in arrow

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366409#comment-17366409
 ] 

Apache Spark commented on SPARK-35776:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32993

> Check all year-month interval types in arrow
> 
>
> Key: SPARK-35776
> URL: https://issues.apache.org/jira/browse/SPARK-35776
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Add tests to check that all year-month interval types are supported in 
> (de-)serialization from/to Arrow format.






[jira] [Assigned] (SPARK-35776) Check all year-month interval types in arrow

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35776:


Assignee: Apache Spark

> Check all year-month interval types in arrow
> 
>
> Key: SPARK-35776
> URL: https://issues.apache.org/jira/browse/SPARK-35776
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add tests to check that all year-month interval types are supported in 
> (de-)serialization from/to Arrow format.









[jira] [Assigned] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35836:


Assignee: (was: Apache Spark)

> Remove reference to spark.shuffle.push.based.enabled in 
> ShuffleBlockPusherSuite
> ---
>
> Key: SPARK-35836
> URL: https://issues.apache.org/jira/browse/SPARK-35836
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Trivial
>
> The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in 
> this suite, the configuration for push-based shuffle is incorrectly 
> referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this 
> config from here.
> {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and 
> this suite is for {{ShuffleBlockPusher}}, so no other change is required.






[jira] [Commented] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366406#comment-17366406
 ] 

Apache Spark commented on SPARK-35836:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32992

> Remove reference to spark.shuffle.push.based.enabled in 
> ShuffleBlockPusherSuite
> ---
>
> Key: SPARK-35836
> URL: https://issues.apache.org/jira/browse/SPARK-35836
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Trivial
>
> The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in 
> this suite, the configuration for push-based shuffle is incorrectly 
> referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this 
> config from here.
> {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and 
> this suite is for {{ShuffleBlockPusher}}, so no other change is required.






[jira] [Assigned] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35836:


Assignee: Apache Spark

> Remove reference to spark.shuffle.push.based.enabled in 
> ShuffleBlockPusherSuite
> ---
>
> Key: SPARK-35836
> URL: https://issues.apache.org/jira/browse/SPARK-35836
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Apache Spark
>Priority: Trivial
>
> The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in 
> this suite, the configuration for push-based shuffle is incorrectly 
> referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this 
> config from here.
> {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and 
> this suite is for {{ShuffleBlockPusher}}, so no other change is required.






[jira] [Commented] (SPARK-35776) Check all year-month interval types in arrow

2021-06-20 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366401#comment-17366401
 ] 

angerszhu commented on SPARK-35776:
---

Working on this

> Check all year-month interval types in arrow
> 
>
> Key: SPARK-35776
> URL: https://issues.apache.org/jira/browse/SPARK-35776
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Add tests to check that all year-month interval types are supported in 
> (de-)serialization from/to Arrow format.






[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-20 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature:
 1. Detect the most skewed values for a join; the user decides whether these need handling.
 2. Detect the most skewed values for a window function; the user decides whether these need handling.
 3. Detect when a scan could use a bucketed read but does not, for example because the Analyzer added a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of a partitioned table.
 5. Check join conditions, for example a cast from string to double whose result may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252

[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-20 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature:
 1. Detect the most skewed values for a join; the user decides whether these need handling.
 2. Detect the most skewed values for a window function; the user decides whether these need handling.
 3. Detect when a scan could use a bucketed read but does not, for example because the Analyzer added a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of a partitioned table.
 5. Check join conditions, for example a cast from string to double whose result may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252

[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems

2021-06-20 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35837:

Description: 
Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature:
 1. Detect the most skewed values for a join; the user decides whether these need handling.
 2. Detect the most skewed values for a window function; the user decides whether these need handling.
 3. Detect when a scan could use a bucketed read but does not, for example because the Analyzer added a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of a partitioned table.
 5. Check join conditions, for example a cast from string to double whose result may be incorrect.

For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614
  |
|   176984, 3, 81003252 

[jira] [Created] (SPARK-35837) Recommendations for Common Query Problems

2021-06-20 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35837:
---

 Summary: Recommendations for Common Query Problems
 Key: SPARK-35837
 URL: https://issues.apache.org/jira/browse/SPARK-35837
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Teradata supports [Recommendations for Common Query 
Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg].

We can implement a similar feature:
 1. Detect the most skewed values for a join; the user decides whether these need handling (a manual skew check is sketched right after this list).
 2. Detect the most skewed values for a window function; the user decides whether these need handling.
 3. Detect when a scan could be optimized to a bucketed read but is not, for example because the Analyzer added a cast to the bucket column.
 4. Recommend that the user add a filter condition on the partition column of a partitioned table.
 5. Check join conditions, for example a cast from string to double whose result may be incorrect.

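For item 1, a user can already approximate the check by hand; below is a rough sketch in the DataFrame API (the table and column names tbl5, leaf_categ_id and item_site_id are taken from the EXPLAIN OPTIMIZE output further down, and the thresholds are left to the user):

{code:scala}
// Hedged sketch: surface the most frequent (potentially skewed) join key values
// by grouping on the join columns and sorting by count, descending.
import org.apache.spark.sql.functions.desc

val skewCandidates = spark.table("tbl5")
  .groupBy("leaf_categ_id", "item_site_id")
  .count()
  .orderBy(desc("count"))
  .limit(10)

skewCandidates.show(truncate = false)
{code}
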
For example:
{code:sql}
0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE
0: jdbc:hive2://localhost:1/default> SELECT a.*,
0: jdbc:hive2://localhost:1/default>CASE
0: jdbc:hive2://localhost:1/default>  WHEN ( NOT ( a.exclude = 1
0: jdbc:hive2://localhost:1/default>   AND a.cobrand = 6
0: jdbc:hive2://localhost:1/default>   AND 
a.primary_app_id IN ( 1462, 2878, 2571 ) ) )
0: jdbc:hive2://localhost:1/default>   AND ( a.valid_page_count 
= 1 ) THEN 1
0: jdbc:hive2://localhost:1/default>  ELSE 0
0: jdbc:hive2://localhost:1/default>END AS is_singlepage,
0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *
0: jdbc:hive2://localhost:1/default> FROM   (SELECT *,
0: jdbc:hive2://localhost:1/default>'VI' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl1
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'SRP' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl2
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'My Garage' AS 
page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl3
0: jdbc:hive2://localhost:1/default> UNION
0: jdbc:hive2://localhost:1/default> SELECT *,
0: jdbc:hive2://localhost:1/default>'Motors 
Homepage' AS page_type
0: jdbc:hive2://localhost:1/default> FROM   tbl4) t
0: jdbc:hive2://localhost:1/default> WHERE  session_start_dt 
BETWEEN ( '2020-01-01' ) AND (
0: jdbc:hive2://localhost:1/default>
 CURRENT_DATE() - 10 )) a
0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id,
0: jdbc:hive2://localhost:1/default>  item_site_id,
0: jdbc:hive2://localhost:1/default>  auct_end_dt,
0: jdbc:hive2://localhost:1/default>  leaf_categ_id
0: jdbc:hive2://localhost:1/default>   FROM   tbl5
0: jdbc:hive2://localhost:1/default>   WHERE  auct_end_dt 
>= ( '2020-01-01' )) itm
0: jdbc:hive2://localhost:1/default>   ON a.item_id = 
itm.item_id
0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca
0: jdbc:hive2://localhost:1/default>   ON itm.leaf_categ_id = 
ca.leaf_categ_id
0: jdbc:hive2://localhost:1/default>  AND itm.item_site_id 
= ca.site_id;
+-+--+
| result
  |
+-+--+
| 1. Detect the most skew values for join   
  |
|   Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND 
(cast(item_site_id#1444 as decimal(9,0)) = site_id#3022))  |
| table: tbl5   
  |
|   leaf_categ_id, item_site_id, count  
  |
|   171243, 0, 115412614 

[jira] [Created] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite

2021-06-20 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-35836:
-

 Summary: Remove reference to spark.shuffle.push.based.enabled in 
ShuffleBlockPusherSuite
 Key: SPARK-35836
 URL: https://issues.apache.org/jira/browse/SPARK-35836
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 3.1.0
Reporter: Chandni Singh


The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in 
this suite, the configuration for push-based shuffle is incorrectly referenced 
as {{spark.shuffle.push.based.enabled}}. We need to remove this config from 
here.

{{ShuffleBlockPusher}} is created only when push based shuffle is enabled and 
this suite is for {{ShuffleBlockPusher}}, so no other change is required.
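For reference, a minimal sketch of what a corrected reference could look like in a test, assuming the real key is spark.shuffle.push.enabled (how the push-based shuffle flag is spelled elsewhere in Spark); the suite's actual change may simply drop the setting instead:

{code:scala}
// Hedged sketch, not the suite's real code: configure push-based shuffle with
// the assumed correct key rather than the non-existent
// "spark.shuffle.push.based.enabled".
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.master", "local[2]")
  .set("spark.shuffle.push.enabled", "true") // assumed correct key
{code}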






[jira] [Commented] (SPARK-35678) add a common softmax function

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366390#comment-17366390
 ] 

Apache Spark commented on SPARK-35678:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32991

> add a common softmax function
> -
>
> Key: SPARK-35678
> URL: https://issues.apache.org/jira/browse/SPARK-35678
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.2.0
>
>
> add a softmax function in utils, which can be used in multiple places
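As a sketch of what such a shared helper might look like (names and placement are illustrative, not the actual utility added by the pull request):

{code:scala}
// Hedged sketch of a numerically stable softmax: subtract the max before
// exponentiating so large logits do not overflow, then normalize to sum to 1.
def softmax(logits: Array[Double]): Array[Double] = {
  val max = logits.max
  val exps = logits.map(v => math.exp(v - max))
  val sum = exps.sum
  exps.map(_ / sum)
}

// softmax(Array(1.0, 2.0, 3.0)) => roughly Array(0.09, 0.24, 0.67)
{code}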






[jira] [Commented] (SPARK-35835) Select filter query on table with struct complex type fails

2021-06-20 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366386#comment-17366386
 ] 

pavithra ramachandran commented on SPARK-35835:
---

I shall raise a PR soon.

> Select filter query on table with struct complex type fails
> ---
>
> Key: SPARK-35835
> URL: https://issues.apache.org/jira/browse/SPARK-35835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
>Reporter: Chetan Bhat
>Priority: Minor
>
> [Steps]:-
> From Spark beeline, create a Parquet or ORC table having complex-type data. 
> Load data into the table and execute a select filter query.
> 0: jdbc:hive2://vm2:22550/> create table Struct_com (CUST_ID string, YEAR 
> int, MONTH int, AGE int, GENDER string, EDUCATED string, IS_MARRIED string, 
> STRUCT_INT_DOUBLE_STRING_DATE 
> struct,CARD_COUNT 
> int,DEBIT_COUNT int, CREDIT_COUNT int, DEPOSIT double, HQ_DEPOSIT double) 
> stored as parquet;
> +-+
> | Result |
> +-+
> +-+
> No rows selected (0.161 seconds)
> 0: jdbc:hive2://vm2:22550/> LOAD DATA INPATH 
> 'hdfs://hacluster/chetan/Struct.csv' OVERWRITE INTO TABLE Struct_com;
> +-+
> | Result |
> +-+
> +-+
> No rows selected (1.09 seconds)
> 0: jdbc:hive2://vm2:22550/> SELECT 
> struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum 
> FROM (select * from Struct_com) SUB_QRY WHERE 
> struct_int_double_string_date.id > 5700 GROUP BY 
> struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country
>  ORDER BY struct_int_double_string_date.COUNTRY 
> asc,struct_int_double_string_date.CHECK_DATE 
> asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_string_date.Country asc;
>  
> [Actual Issue] : - Select filter query on table with struct complex type fails
> 0: jdbc:hive2://vm2:22550/> SELECT 
> struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum 
> FROM (select * from Struct_com) SUB_QRY WHERE 
> struct_int_double_string_date.id > 5700 GROUP BY 
> struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country
>  ORDER BY struct_int_double_string_date.COUNTRY 
> asc,struct_int_double_string_date.CHECK_DATE 
> asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_string_date.Country asc;
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(COUNTRY#139896 ASC NULLS FIRST, CHECK_DATE#139897 
> ASC NULLS FIRST, CHECK_DATE#139897 ASC NULLS FIRST, COUNTRY#139896 ASC NULLS 
> FIRST, 200 ), ENSURE_REQUIREMENTS, [id=#17161]
> +- *(2) HashAggregate(keys=[_gen_alias_139928#139928, 
> _gen_alias_139929#139929], functions=[sum(cast(_gen_alias_139931#139931 as 
> bigint))], output=[COUNTRY#139896, CHECK_DATE#139897, CHECK_DATE#139898, 
> Country#139899, Sum#139877L])
> +- Exchange hashpartitioning(_gen_alias_139928#139928, 
> _gen_alias_139929#139929, 200), ENSURE_REQUIREMENTS, [id=#17157]
> +- *(1) HashAggregate(keys=[_gen_alias_139928#139928, 
> _gen_alias_139929#139929], 
> functions=[partial_sum(cast(_gen_alias_139931#139931 as bigint))], output=[_gen_alias_139928#139928, _gen_alias_139929#139929, sum#139934L])
> +- *(1) Project [STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS 
> _gen_alias_139928#139928, STRUCT_INT_DOUBLE_STRING_DATE#139885.CHECK_DATE AS 
> _gen_alias_139929#139929, STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS 
> _gen_alias_139930#139930, STRUCT_INT_DOUBLE_STRING_DATE#139885.ID AS 
> _gen_alias_139931#139931]
> +- *(1) Filter (isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885) AND 
> (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700))
> +- FileScan parquet default.struct_com[STRUCT_INT_DOUBLE_STRING_DATE#139885] 
> Batched: false, DataFilters: [isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885), (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)], Format: Parquet, 
> Location: InMemoryFileIndex[hdfs://hacluster/user/hive/warehouse/struct_com], 
> PartitionFilters: [], PushedFilters: 
> [IsNotNull(STRUCT_INT_DOUBLE_STRING_DATE), 
> GreaterThan(STRUCT_INT_DOUBLE_STRING_DATE.ID,5700)], ReadSchema: 
> struct G_DATE:struct>
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$a

[jira] [Created] (SPARK-35835) Select filter query on table with struct complex type fails

2021-06-20 Thread Chetan Bhat (Jira)
Chetan Bhat created SPARK-35835:
---

 Summary: Select filter query on table with struct complex type 
fails
 Key: SPARK-35835
 URL: https://issues.apache.org/jira/browse/SPARK-35835
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
 Environment: Spark 3.1.1
Reporter: Chetan Bhat


[Steps]:-

From Spark beeline, create a Parquet or ORC table having complex-type data. 
Load data into the table and execute a select filter query.

0: jdbc:hive2://vm2:22550/> create table Struct_com (CUST_ID string, YEAR int, 
MONTH int, AGE int, GENDER string, EDUCATED string, IS_MARRIED string, 
STRUCT_INT_DOUBLE_STRING_DATE 
struct,CARD_COUNT 
int,DEBIT_COUNT int, CREDIT_COUNT int, DEPOSIT double, HQ_DEPOSIT double) 
stored as parquet;
+-+
| Result |
+-+
+-+
No rows selected (0.161 seconds)
0: jdbc:hive2://vm2:22550/> LOAD DATA INPATH 
'hdfs://hacluster/chetan/Struct.csv' OVERWRITE INTO TABLE Struct_com;
+-+
| Result |
+-+
+-+
No rows selected (1.09 seconds)
0: jdbc:hive2://vm2:22550/> SELECT 
struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum 
FROM (select * from Struct_com) SUB_QRY WHERE struct_int_double_string_date.id 
> 5700 GROUP BY 
struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country
 ORDER BY struct_int_double_string_date.COUNTRY 
asc,struct_int_double_string_date.CHECK_DATE 
asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_string_date.Country asc;

 

[Actual Issue] : - Select filter query on table with struct complex type fails

0: jdbc:hive2://vm2:22550/> SELECT 
struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum 
FROM (select * from Struct_com) SUB_QRY WHERE struct_int_double_string_date.id 
> 5700 GROUP BY 
struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country
 ORDER BY struct_int_double_string_date.COUNTRY 
asc,struct_int_double_string_date.CHECK_DATE 
asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_string_date.Country asc;
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(COUNTRY#139896 ASC NULLS FIRST, CHECK_DATE#139897 
ASC NULLS FIRST, CHECK_DATE#139897 ASC NULLS FIRST, COUNTRY#139896 ASC NULLS 
FIRST, 200 ), ENSURE_REQUIREMENTS, [id=#17161]
+- *(2) HashAggregate(keys=[_gen_alias_139928#139928, 
_gen_alias_139929#139929], functions=[sum(cast(_gen_alias_139931#139931 as 
bigint))], output=[COUNTRY#139896, CHECK_DATE#139897, CHECK_DATE#139898, 
Country#139899, Sum#139877L])
+- Exchange hashpartitioning(_gen_alias_139928#139928, 
_gen_alias_139929#139929, 200), ENSURE_REQUIREMENTS, [id=#17157]
+- *(1) HashAggregate(keys=[_gen_alias_139928#139928, 
_gen_alias_139929#139929], functions=[partial_sum(cast(_gen_alias_139931#139931 
as bigint))], output=[_gen_alias_139928#139928, _gen_alias_139929#139929, 
sum#139934L])
+- *(1) Project [STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS 
_gen_alias_139928#139928, STRUCT_INT_DOUBLE_STRING_DATE#139885.CHECK_DATE AS 
_gen_alias_139929#139929, STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS 
_gen_alias_139930#139930, STRUCT_INT_DOUBLE_STRING_DATE#139885.ID AS 
_gen_alias_139931#139931]
+- *(1) Filter (isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885) AND 
(STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700))
+- FileScan parquet default.struct_com[STRUCT_INT_DOUBLE_STRING_DATE#139885] 
Batched: false, DataFilters: [isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885), 
(STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)], Format: Parquet, Location: 
InMemoryFileIndex[hdfs://hacluster/user/hive/warehouse/struct_com], PartitionFilters: [], PushedFilters: [IsNotNull(STRUCT_INT_DOUBLE_STRING_DATE), 
GreaterThan(STRUCT_INT_DOUBLE_STRING_DATE.ID,5700)], ReadSchema: 
struct>

at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:396)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$3(SparkExecuteStatementOperation.scala:281)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
at 
org.apache.spark.sql

[jira] [Assigned] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32041:
---

Assignee: Peter Toth

> Exchange reuse won't work in cases when DPP, subqueries are involved
> 
>
> Key: SPARK-32041
> URL: https://issues.apache.org/jira/browse/SPARK-32041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Prakhar Jain
>Assignee: Peter Toth
>Priority: Major
>
> When an Exchange node is repeated at multiple places in the physical plan, and 
> that exchange has a DPP subquery filter, then ReuseExchange does not work for 
> such an Exchange, and different stages are launched to compute the same thing.
> Example:
> {noformat}
> // generate data
> val factData = (1 to 100).map(i => (i%5, i%20, i))
> factData.toDF("store_id", "product_id", "units_sold")
>   .write
>   .partitionBy("store_id")
>   .format("parquet")
>   .saveAsTable("fact_stats")
> val dimData = Seq[(Int, String, String)](
>   (1, "AU", "US"),
>   (2, "CA", "US"),
>   (3, "KA", "IN"),
>   (4, "DL", "IN"),
>   (5, "GA", "PA"))
> dimData.toDF("store_id", "state_province", "country")
>   .write
>   .format("parquet")
>   .saveAsTable("dim_stats")
> sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> // Set Configs
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000")
> val query = """
> With view1 as (
>   SELECT product_id, f.store_id
>   FROM fact_stats f JOIN dim_stats
>   ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN')
> SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id
> """
> val df = spark.sql(query)
> println(df.queryExecution.executedPlan)
> {noformat}
> {noformat}
> Plan:
>  *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner
>  :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, 0
>  : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140]
>  : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], 
> Inner, BuildRight
>  : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : : +- *(2) Filter isnotnull(product_id#1968)
>  : : +- *(2) ColumnarToRow
>  : : +- FileScan parquet 
> default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] 
> Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [isnotnull(store_id#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], 
> PushedFilters: [IsNotNull(product_id)], ReadSchema: struct
>  : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], 
> [id=#1131|#1131]
>  : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#1021|#1021]
>  : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> true] as bigint))), [id=#1021|#1021]
>  : +- *(1) Project [store_id#1971|#1971]
>  : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND 
> isnotnull(store_id#1971))
>  : +- *(1) ColumnarToRow
>  : +- FileScan parquet 
> default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: 
> true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)|#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(country), 
> EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: 
> struct
>  +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, 0
>  +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], 
> Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026]
> {noformat}
> Issue:
>  Note the last line of the plan. It is a ReusedExchange pointing to id=1026, 
> but there is no Exchange node with ID 1026 in the plan. The ReusedExchange node 
> points to the wrong child node (1026 instead of 1140), so in practice exchange 
> reuse does not happen in this query.
> Another query where the issue is caused by ReuseSubquery:
> {noformat}
> spark.sql("set spark.sql.autoBroadcastJ

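When diagnosing this, it can help to count the exchange-related nodes in the executed plan programmatically. A rough sketch, assuming df is the DataFrame from the reproduction above (note that a plain collect does not descend into subquery plans, so DPP subqueries are not traversed here):

{code:scala}
// Hedged sketch: list exchange operators vs. reuse markers in the final plan.
import org.apache.spark.sql.execution.exchange.{Exchange, ReusedExchangeExec}

val plan = df.queryExecution.executedPlan
val exchanges = plan.collect { case e: Exchange => e }
val reused = plan.collect { case r: ReusedExchangeExec => r }
println(s"Exchange nodes: ${exchanges.size}, ReusedExchangeExec nodes: ${reused.size}")
{code}
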
[jira] [Resolved] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32041.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 28885
[https://github.com/apache/spark/pull/28885]

> Exchange reuse won't work in cases when DPP, subqueries are involved
> 
>
> Key: SPARK-32041
> URL: https://issues.apache.org/jira/browse/SPARK-32041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Prakhar Jain
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0
>
>
> When an Exchange node is repeated at multiple places in the physical plan, and 
> that exchange has a DPP subquery filter, then ReuseExchange does not work for 
> such an Exchange, and different stages are launched to compute the same thing.
> Example:
> {noformat}
> // generate data
> val factData = (1 to 100).map(i => (i%5, i%20, i))
> factData.toDF("store_id", "product_id", "units_sold")
>   .write
>   .partitionBy("store_id")
>   .format("parquet")
>   .saveAsTable("fact_stats")
> val dimData = Seq[(Int, String, String)](
>   (1, "AU", "US"),
>   (2, "CA", "US"),
>   (3, "KA", "IN"),
>   (4, "DL", "IN"),
>   (5, "GA", "PA"))
> dimData.toDF("store_id", "state_province", "country")
>   .write
>   .format("parquet")
>   .saveAsTable("dim_stats")
> sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id")
> // Set Configs
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000")
> val query = """
> With view1 as (
>   SELECT product_id, f.store_id
>   FROM fact_stats f JOIN dim_stats
>   ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN')
> SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id
> """
> val df = spark.sql(query)
> println(df.queryExecution.executedPlan)
> {noformat}
> {noformat}
> Plan:
>  *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner
>  :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, 0
>  : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140]
>  : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], 
> Inner, BuildRight
>  : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970]
>  : : +- *(2) Filter isnotnull(product_id#1968)
>  : : +- *(2) ColumnarToRow
>  : : +- FileScan parquet 
> default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] 
> Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [isnotnull(store_id#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), 
> dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], 
> PushedFilters: [IsNotNull(product_id)], ReadSchema: struct
>  : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], 
> [id=#1131|#1131]
>  : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#1021|#1021]
>  : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> true] as bigint))), [id=#1021|#1021]
>  : +- *(1) Project [store_id#1971|#1971]
>  : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND 
> isnotnull(store_id#1971))
>  : +- *(1) ColumnarToRow
>  : +- FileScan parquet 
> default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: 
> true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)|#1973), (country#1973 = IN), 
> isnotnull(store_id#1971)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(country), 
> EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: 
> struct
>  +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0
>  +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], 
> Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026]
> {noformat}
> Issue:
>  Note the last line of the plan. It is a ReusedExchange pointing to 
> id=1026, but there is no Exchange node in the plan with ID 1026. The ReusedExchange 
> node points to the wrong child node (1026 instead of 1140), so in 
> practice exchange reuse won't work.
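> A minimal sketch of how the mismatch can be surfaced programmatically (this is an 
> illustration, not part of the original report; it assumes the df built above and 
> that SparkPlan.id is the id shown as [id=#N] in the plan):
> {code:java}
> import org.apache.spark.sql.execution.exchange.{Exchange, ReusedExchangeExec}
> 
> // Ids of the exchanges that actually exist in the main physical plan, and the
> // ids that ReusedExchange nodes claim to reuse. With this bug the two do not
> // line up (here 1026 is referenced while the planned shuffle exchange is 1140).
> val plan = df.queryExecution.executedPlan
> val exchangeIds = plan.collect { case e: Exchange => e.id }
> val reusedIds   = plan.collect { case r: ReusedExchangeExec => r.child.id }
> println(s"Exchange ids in plan:             ${exchangeIds.mkString(", ")}")
> println(s"Ids referenced by ReusedExchange: ${reusedIds.mkString(", ")}")
> {code}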

[jira] [Resolved] (SPARK-28940) Subquery reuse across all subquery levels

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28940.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 28885
[https://github.com/apache/spark/pull/28885]

> Subquery reuse across all subquery levels
> -
>
> Key: SPARK-28940
> URL: https://issues.apache.org/jira/browse/SPARK-28940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently subquery reuse doesn't work across all subquery levels.
> Here is an example query:
> {noformat}
> SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM 
> testData))
> FROM testData
> LIMIT 1
> {noformat}
> where the plan now is:
> {noformat}
> CollectLimit 1
> +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS 
> scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS 
> scalarsubquery()#277]
>:  :- Subquery scalar-subquery#268, [id=#231]
>:  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
> bigint))], output=[avg(key)#272])
>:  : +- Exchange SinglePartition, true, [id=#227]
>:  :+- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L])
>:  :   +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:  :  +- Scan[obj#12]
>:  +- Subquery scalar-subquery#270, [id=#266]
>: +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS 
> scalarsubquery()#275]
>::  +- Subquery scalar-subquery#269, [id=#263]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#274])
>::+- Exchange SinglePartition, true, [id=#259]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}
> but it could be:
> {noformat}
> CollectLimit 1
> +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS 
> scalarsubquery()#249]
>:  :- ReusedSubquery Subquery scalar-subquery#241, [id=#148]
>:  +- Subquery scalar-subquery#242, [id=#164]
>: +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#247]
>::  +- Subquery scalar-subquery#241, [id=#148]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#246])
>::+- Exchange SinglePartition, true, [id=#144]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}
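> A rough way to observe the reuse from a spark-shell session (a sketch; assumes a 
> SparkSession named spark with the testData table registered):
> {code:java}
> // Render the physical plan as text and count how many plan lines reference a
> // scalar subquery and how many of those are reuse markers.
> val df = spark.sql("""
>   SELECT (SELECT avg(key) FROM testData),
>          (SELECT (SELECT avg(key) FROM testData))
>   FROM testData
>   LIMIT 1""")
> val planText = df.queryExecution.executedPlan.toString
> val subqueryLines = planText.linesIterator.count(_.contains("scalar-subquery"))
> val reusedLines   = planText.linesIterator.count(_.contains("ReusedSubquery"))
> println(s"scalar subquery references: $subqueryLines, reused: $reusedLines")
> {code}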



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28940) Subquery reuse across all subquery levels

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28940:
---

Assignee: Peter Toth

> Subquery reuse across all subquery levels
> -
>
> Key: SPARK-28940
> URL: https://issues.apache.org/jira/browse/SPARK-28940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>
> Currently subquery reuse doesn't work across all subquery levels.
> Here is an example query:
> {noformat}
> SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM 
> testData))
> FROM testData
> LIMIT 1
> {noformat}
> where the plan now is:
> {noformat}
> CollectLimit 1
> +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS 
> scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS 
> scalarsubquery()#277]
>:  :- Subquery scalar-subquery#268, [id=#231]
>:  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
> bigint))], output=[avg(key)#272])
>:  : +- Exchange SinglePartition, true, [id=#227]
>:  :+- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L])
>:  :   +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:  :  +- Scan[obj#12]
>:  +- Subquery scalar-subquery#270, [id=#266]
>: +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS 
> scalarsubquery()#275]
>::  +- Subquery scalar-subquery#269, [id=#263]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#274])
>::+- Exchange SinglePartition, true, [id=#259]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}
> but it could be:
> {noformat}
> CollectLimit 1
> +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS 
> scalarsubquery()#249]
>:  :- ReusedSubquery Subquery scalar-subquery#241, [id=#148]
>:  +- Subquery scalar-subquery#242, [id=#164]
>: +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#247]
>::  +- Subquery scalar-subquery#241, [id=#148]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#246])
>::+- Exchange SinglePartition, true, [id=#144]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29375) Exchange reuse across all subquery levels

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29375.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 28885
[https://github.com/apache/spark/pull/28885]

> Exchange reuse across all subquery levels
> -
>
> Key: SPARK-29375
> URL: https://issues.apache.org/jira/browse/SPARK-29375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently exchange reuse doesn't work across all subquery levels.
>  Here is an example query:
> {noformat}
> SELECT
>  (SELECT max(a.key) FROM testData AS a JOIN testData AS b ON b.key = a.key),
>  a.key
> FROM testData AS a
> JOIN testData AS b ON b.key = a.key{noformat}
> where the plan is:
> {noformat}
> *(5) Project [Subquery scalar-subquery#240, [id=#193] AS 
> scalarsubquery()#247, key#13]
> :  +- Subquery scalar-subquery#240, [id=#193]
> : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], 
> output=[max(key)#246])
> :+- Exchange SinglePartition, true, [id=#189]
> :   +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], 
> output=[max#251])
> :  +- *(5) Project [key#13]
> : +- *(5) SortMergeJoin [key#13], [key#243], Inner
> ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
> ::  +- Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
> :: +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
> ::+- Scan[obj#12]
> :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0
> :   +- ReusedExchange [key#243], Exchange 
> hashpartitioning(key#13, 5), true, [id=#145]
> +- *(5) SortMergeJoin [key#13], [key#241], Inner
>:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(key#13, 5), true, [id=#205]
>: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:+- Scan[obj#12]
>+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), 
> true, [id=#205]
> {noformat}
> but it could be improved as here:
> {noformat}
> *(5) Project [Subquery scalar-subquery#240, [id=#211] AS 
> scalarsubquery()#247, key#13]
> :  +- Subquery scalar-subquery#240, [id=#211]
> : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], 
> output=[max(key)#246])
> :+- Exchange SinglePartition, true, [id=#207]
> :   +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], 
> output=[max#251])
> :  +- *(5) Project [key#13]
> : +- *(5) SortMergeJoin [key#13], [key#243], Inner
> ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
> ::  +- Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
> :: +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
> ::+- Scan[obj#12]
> :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0
> :   +- ReusedExchange [key#243], Exchange 
> hashpartitioning(key#13, 5), true, [id=#145]
> +- *(5) SortMergeJoin [key#13], [key#241], Inner
>:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
>:  +- ReusedExchange [key#13], Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
>+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), 
> true, [id=#145]
> {noformat}
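> A rough check from a spark-shell session (a sketch; assumes a SparkSession named 
> spark with the testData table, and it only walks the main plan plus its 
> first-level subquery plans):
> {code:java}
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.exchange.{Exchange, ReusedExchangeExec}
> 
> val df = spark.sql("""
>   SELECT
>    (SELECT max(a.key) FROM testData AS a JOIN testData AS b ON b.key = a.key),
>    a.key
>   FROM testData AS a
>   JOIN testData AS b ON b.key = a.key""")
> 
> // Count planned vs. reused exchanges across the main plan and its subqueries.
> val root = df.queryExecution.executedPlan
> val plans: Seq[SparkPlan] = root +: root.subqueries
> val exchanges = plans.flatMap(_.collect { case e: Exchange => e })
> val reused    = plans.flatMap(_.collect { case r: ReusedExchangeExec => r })
> println(s"Exchange nodes: ${exchanges.size}, ReusedExchange nodes: ${reused.size}")
> {code}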



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29375) Exchange reuse across all subquery levels

2021-06-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29375:
---

Assignee: Peter Toth

> Exchange reuse across all subquery levels
> -
>
> Key: SPARK-29375
> URL: https://issues.apache.org/jira/browse/SPARK-29375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>
> Currently exchange reuse doesn't work across all subquery levels.
>  Here is an example query:
> {noformat}
> SELECT
>  (SELECT max(a.key) FROM testData AS a JOIN testData AS b ON b.key = a.key),
>  a.key
> FROM testData AS a
> JOIN testData AS b ON b.key = a.key{noformat}
> where the plan is:
> {noformat}
> *(5) Project [Subquery scalar-subquery#240, [id=#193] AS 
> scalarsubquery()#247, key#13]
> :  +- Subquery scalar-subquery#240, [id=#193]
> : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], 
> output=[max(key)#246])
> :+- Exchange SinglePartition, true, [id=#189]
> :   +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], 
> output=[max#251])
> :  +- *(5) Project [key#13]
> : +- *(5) SortMergeJoin [key#13], [key#243], Inner
> ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
> ::  +- Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
> :: +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
> ::+- Scan[obj#12]
> :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0
> :   +- ReusedExchange [key#243], Exchange 
> hashpartitioning(key#13, 5), true, [id=#145]
> +- *(5) SortMergeJoin [key#13], [key#241], Inner
>:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(key#13, 5), true, [id=#205]
>: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:+- Scan[obj#12]
>+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), 
> true, [id=#205]
> {noformat}
> but it could be improved as here:
> {noformat}
> *(5) Project [Subquery scalar-subquery#240, [id=#211] AS 
> scalarsubquery()#247, key#13]
> :  +- Subquery scalar-subquery#240, [id=#211]
> : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], 
> output=[max(key)#246])
> :+- Exchange SinglePartition, true, [id=#207]
> :   +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], 
> output=[max#251])
> :  +- *(5) Project [key#13]
> : +- *(5) SortMergeJoin [key#13], [key#243], Inner
> ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
> ::  +- Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
> :: +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
> ::+- Scan[obj#12]
> :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0
> :   +- ReusedExchange [key#243], Exchange 
> hashpartitioning(key#13, 5), true, [id=#145]
> +- *(5) SortMergeJoin [key#13], [key#241], Inner
>:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0
>:  +- ReusedExchange [key#13], Exchange hashpartitioning(key#13, 5), true, 
> [id=#145]
>+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), 
> true, [id=#145]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35545) Split SubqueryExpression's children field into outer attributes and join conditions

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366373#comment-17366373
 ] 

Apache Spark commented on SPARK-35545:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32990

> Split SubqueryExpression's children field into outer attributes and join 
> conditions
> ---
>
> Key: SPARK-35545
> URL: https://issues.apache.org/jira/browse/SPARK-35545
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently the children field of a subquery expression is used to store both 
> the collected outer references inside the subquery plan and the join conditions 
> after correlated predicates are pulled up. For example:
> SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2
> After analysis phase:
> scalar-subquery [t2.c1]
> After PullUpCorrelatedPredicates:
> scalar-subquery [t1.c1 = t2.c1]
> The references of a subquery expression are also confusing:
> override lazy val references: AttributeSet =
>  if (plan.resolved) super.references -- plan.outputSet else super.references 
> We should split this children field into outer attribute references and join 
> conditions.
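> A hypothetical sketch of the proposed shape (the names below are illustrative and 
> not the actual Spark classes; it only assumes spark-catalyst on the classpath):
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.Expression
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> 
> // Illustrative only: keep the subquery plan, the collected outer attribute
> // references (e.g. t2.c1 after analysis), and the join conditions produced by
> // PullUpCorrelatedPredicates (e.g. t1.c1 = t2.c1) in separate fields instead
> // of one mixed children sequence.
> case class ScalarSubqueryShape(
>     plan: LogicalPlan,
>     outerAttrs: Seq[Expression],
>     joinCond: Seq[Expression])
> {code}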



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35545) Split SubqueryExpression's children field into outer attributes and join conditions

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366372#comment-17366372
 ] 

Apache Spark commented on SPARK-35545:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/32990

> Split SubqueryExpression's children field into outer attributes and join 
> conditions
> ---
>
> Key: SPARK-35545
> URL: https://issues.apache.org/jira/browse/SPARK-35545
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently the children field of a subquery expression is used to store both 
> the collected outer references inside the subquery plan and the join conditions 
> after correlated predicates are pulled up. For example:
> SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2
> After analysis phase:
> scalar-subquery [t2.c1]
> After PullUpCorrelatedPredicates:
> scalar-subquery [t1.c1 = t2.c1]
> The references of a subquery expression are also confusing:
> override lazy val references: AttributeSet =
>  if (plan.resolved) super.references -- plan.outputSet else super.references 
> We should split this children field into outer attribute references and join 
> conditions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35811) Deprecate DataFrame.to_spark_io

2021-06-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35811.
--
  Assignee: Kevin Su
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32964

> Deprecate DataFrame.to_spark_io
> ---
>
> Key: SPARK-35811
> URL: https://issues.apache.org/jira/browse/SPARK-35811
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Kevin Su
>Priority: Major
>
> We should deprecate the 
> [DataFrame.to_spark_io|https://docs.google.com/document/d/1RxvQJVf736Vg9XU7uiCaRlNl-P7GdmFGa6U3Ab78JJk/edit#heading=h.todz8y4xdqrx]
>  since it's duplicated with 
> [DataFrame.spark.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.to_spark_io.html],
>  and it does not exist in pandas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35811) Deprecate DataFrame.to_spark_io

2021-06-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35811:
-
Fix Version/s: 3.2.0

> Deprecate DataFrame.to_spark_io
> ---
>
> Key: SPARK-35811
> URL: https://issues.apache.org/jira/browse/SPARK-35811
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Kevin Su
>Priority: Major
> Fix For: 3.2.0
>
>
> We should deprecate the 
> [DataFrame.to_spark_io|https://docs.google.com/document/d/1RxvQJVf736Vg9XU7uiCaRlNl-P7GdmFGa6U3Ab78JJk/edit#heading=h.todz8y4xdqrx]
>  since it's duplicated with 
> [DataFrame.spark.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.to_spark_io.html],
>  and it does not exist in pandas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35834:


Assignee: Apache Spark

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35834:


Assignee: (was: Apache Spark)

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366324#comment-17366324
 ] 

Apache Spark commented on SPARK-35834:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32989

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-06-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35834:
-
Issue Type: Bug  (was: Improvement)

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-06-20 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35834:


 Summary: Use the same cleanup logic as Py4J in inheritable thread 
API
 Key: SPARK-35834
 URL: https://issues.apache.org/jira/browse/SPARK-35834
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


After 
https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
 the test became flaky:

{code}
==
ERROR [71.813s]: test_save_load_pipeline_estimator 
(pyspark.ml.tests.test_tuning.CrossValidatorTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in 
test_save_load_pipeline_estimator
self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in 
_run_test_save_load_pipeline_estimator
cvModel2 = crossval2.fit(training)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
bestModel = est.fit(dataset, epm[bestIndex])
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
return self.copy(params)._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
_fit
models = pool.map(inheritable_thread_target(trainSingleClass), 
range(numClasses))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
644, in get
raise self._value
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
119, in worker
result = (True, func(*args, **kwds))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, 
in mapstar
return list(map(*args))
  File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
InheritableThread._clean_py4j_conn_for_current_thread()
  File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
_clean_py4j_conn_for_current_thread
del connections[i]
IndexError: deque index out of range

--
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-35671:
---

Assignee: Chandni Singh

> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With push-based shuffle enabled, the reducers send two different requests to 
> the ESS:
> 1. A request to fetch the metadata of the merged shuffle block.
> 2. Requests to fetch the data of the merged shuffle block in chunks, which 
> are 2 MB in size by default.
> The ESS should be able to serve these requests, and this Jira covers all the 
> changes in the network-common and network-shuffle modules needed to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-35671.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32811
[https://github.com/apache/spark/pull/32811]

> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> With push-based shuffle enabled, the reducers send two different requests to 
> the ESS:
> 1. A request to fetch the metadata of the merged shuffle block.
> 2. Requests to fetch the data of the merged shuffle block in chunks, which 
> are 2 MB in size by default.
> The ESS should be able to serve these requests, and this Jira covers all the 
> changes in the network-common and network-shuffle modules needed to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35827) Show proper error message when update column types to year-month/day-time interval

2021-06-20 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35827.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32978
[https://github.com/apache/spark/pull/32978]

> Show proper error message when update column types to year-month/day-time 
> interval
> --
>
> Key: SPARK-35827
> URL: https://issues.apache.org/jira/browse/SPARK-35827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Updating column types to interval types is prohibited for V2 source tables.
> So, if we attempt to update the type of a column to the conventional interval 
> type, an error message like "Error in query: Cannot update  field 
>  to interval type;" is shown.
> But for the year-month/day-time interval types, a different error message like 
> "Error in query: Cannot update  field : cannot be cast 
> to interval year;" is shown.
> This is not consistent.
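> A hypothetical repro sketch (catalog, table, and column names are made up; it 
> assumes a V2 catalog registered as testcat):
> {code:java}
> // Both statements should be rejected with the same style of error message.
> spark.sql("ALTER TABLE testcat.ns.tbl ALTER COLUMN col TYPE INTERVAL")
> spark.sql("ALTER TABLE testcat.ns.tbl ALTER COLUMN col TYPE INTERVAL YEAR TO MONTH")
> {code}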



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35775) Check all year-month interval types in aggregate expressions

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35775:


Assignee: Apache Spark

> Check all year-month interval types in aggregate expressions
> 
>
> Key: SPARK-35775
> URL: https://issues.apache.org/jira/browse/SPARK-35775
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Check all supported combinations of YearMonthIntervalType fields in the 
> aggregate expressions sum and avg.
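> A minimal sketch of the kind of check this covers (assumes a SparkSession named 
> spark and the ANSI year-month interval literal syntax):
> {code:java}
> // sum and avg over a YearMonthIntervalType column.
> spark.sql(
>   """SELECT sum(i) AS total, avg(i) AS mean
>     |FROM VALUES (INTERVAL '1-2' YEAR TO MONTH),
>     |            (INTERVAL '2-10' YEAR TO MONTH) AS t(i)""".stripMargin).show(false)
> {code}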



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35775) Check all year-month interval types in aggregate expressions

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366235#comment-17366235
 ] 

Apache Spark commented on SPARK-35775:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32988

> Check all year-month interval types in aggregate expressions
> 
>
> Key: SPARK-35775
> URL: https://issues.apache.org/jira/browse/SPARK-35775
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all supported combinations of YearMonthIntervalType fields in the 
> aggregate expressions sum and avg.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35775) Check all year-month interval types in aggregate expressions

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35775:


Assignee: (was: Apache Spark)

> Check all year-month interval types in aggregate expressions
> 
>
> Key: SPARK-35775
> URL: https://issues.apache.org/jira/browse/SPARK-35775
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all supported combinations of YearMonthIntervalType fields in the 
> aggregate expressions sum and avg.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35832:
-

Assignee: Dongjoon Hyun

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35832.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32986
[https://github.com/apache/spark/pull/32986]

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366181#comment-17366181
 ] 

Apache Spark commented on SPARK-35564:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32987

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially another subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.
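> A hedged Scala sketch of how to see the duplicated work today (a rough equivalent 
> of the Python example above; assumes a spark-shell session so spark.implicits is 
> in scope):
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.execution.debug._
> import spark.implicits._
> 
> val df = Seq("word", "1234").toDF("_1")
> val cleaned = regexp_replace($"_1", "\\d", "")
> val out = df.withColumn("numbers_removed", when(length(cleaned) > 0, cleaned))
> 
> // Dumps the generated code; it currently contains two regexp_replace
> // evaluations because the expression is not treated as a common subexpression.
> out.debugCodegen()
> {code}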



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35564:


Assignee: (was: Apache Spark)

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially another subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35564:


Assignee: Apache Spark

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Assignee: Apache Spark
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially another subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35833) The Statistics size of PARQUET table is not estimated correctly

2021-06-20 Thread HonglunChen (Jira)
HonglunChen created SPARK-35833:
---

 Summary: The Statistics size of PARQUET table is not estimated 
correctly
 Key: SPARK-35833
 URL: https://issues.apache.org/jira/browse/SPARK-35833
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: HonglunChen


{code:java}
// Table 'test_txt' and 'test_parquet' have the same data.
scala> val sql="select * from tmp_db.test_txt"
sql: String = select * from tmp_db.test_txt
scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res5: BigInt = 92990
scala> val sql = "select * from tmp_db.test_parquet"
sql: String = select * from tmp_db.test_parquet
scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes
res6: BigInt = 37556
{code}
Parquet files are compressed by default, so this can lead to choosing the wrong 
join type, such as a broadcast join. The driver may run out of memory in this case, 
because the actual (decompressed) size may be much greater than 
spark.sql.autoBroadcastJoinThreshold. Can we improve this?
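A hedged mitigation sketch (not a fix; the 3x factor below is just an assumed 
compression ratio):
{code:java}
// Scale file-source size estimates to account for compression.
spark.sql("SET spark.sql.sources.fileCompressionFactor=3.0")

// Or opt out of size-based broadcast joins entirely.
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
{code}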



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2021-06-20 Thread Gejun Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173
 ] 

Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:16 PM:
---

When you use a custom schema, the problem is solved according to my test.

But Spark's max precision of 38 still seems to have some gap with Oracle. For Oracle, 
I believe precision can be 39-40 in the 10g version.

From the Oracle documentation:

_p is the precision, or the maximum number of significant decimal digits, where 
the most significant digit is the left-most nonzero digit, and the least 
significant digit is the right-most known digit. Oracle guarantees the 
portability of numbers with precision of up to 20 base-100 digits, which is 
equivalent to 39 or 40 decimal digits depending on the position of the decimal 
point._


was (Author: sgejun):
When you use a custom schema, the problem is solved according to my test.

But Spark's max precision of 38 still seems to have some gap with the Oracle standard. 
For Oracle, I believe precision can be 39-40 in the 10g version.

From the Oracle documentation:

_p is the precision, or the maximum number of significant decimal digits, where 
the most significant digit is the left-most nonzero digit, and the least 
significant digit is the right-most known digit. Oracle guarantees the 
portability of numbers with precision of up to 20 base-100 digits, which is 
equivalent to 39 or 40 decimal digits depending on the position of the decimal 
point._

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a table, 
> the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table in one schema but 
> could not be inserted into the identical table in another schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2021-06-20 Thread Gejun Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173
 ] 

Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:16 PM:
---

When you use a custom schema, the problem is solved according to my test.

But Spark's max precision of 38 still seems to have some gap with the Oracle standard. 
For Oracle, I believe precision can reach 39-40 digits as of the 10g version.

From the Oracle documentation:

_p is the precision, or the maximum number of significant decimal digits, where 
the most significant digit is the left-most nonzero digit, and the least 
significant digit is the right-most known digit. Oracle guarantees the 
portability of numbers with precision of up to 20 base-100 digits, which is 
equivalent to 39 or 40 decimal digits depending on the position of the decimal 
point._


was (Author: sgejun):
When you use a custom schema, the problem is solved according to my test.

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> In Oracle exists data type NUMBER. When defining a filed in a table of type 
> NUMBER the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a filed Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was, that a data frame was read from a table on one schema but 
> could not be inserted in the identical table on other schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2021-06-20 Thread Gejun Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173
 ] 

Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:09 PM:
---

When you use a custom schema, the problem is solved according to my test.


was (Author: sgejun):
When you use a custom schema, I believe precision would also be lost.

For example,

123456789012345678901234567890123456789111 becomes something like 
1.2531261878596697E40. Is this really what we want?

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> In Oracle exists data type NUMBER. When defining a filed in a table of type 
> NUMBER the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a filed Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was, that a data frame was read from a table on one schema but 
> could not be inserted in the identical table on other schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2021-06-20 Thread Gejun Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173
 ] 

Gejun Shen commented on SPARK-20427:


When you use a custom schema, I believe precision would also be lost.

For example,

123456789012345678901234567890123456789111 becomes something like 
1.2531261878596697E40. Is this really what we want?

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> In Oracle exists data type NUMBER. When defining a filed in a table of type 
> NUMBER the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a filed Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was, that a data frame was read from a table on one schema but 
> could not be inserted in the identical table on other schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35832:


Assignee: (was: Apache Spark)

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366135#comment-17366135
 ] 

Apache Spark commented on SPARK-35832:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32986

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35832:


Assignee: Apache Spark

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35832:
--
Issue Type: Bug  (was: Improvement)

> Add LocalRootDirsTest trait
> ---
>
> Key: SPARK-35832
> URL: https://issues.apache.org/jira/browse/SPARK-35832
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35777) Check all year-month interval types in UDF

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35777:


Assignee: (was: Apache Spark)

> Check all year-month interval types in UDF
> --
>
> Key: SPARK-35777
> URL: https://issues.apache.org/jira/browse/SPARK-35777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all year-month interval types in UDF:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH
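As a rough illustration of the check described in the list above, a minimal sketch 
(assuming Spark 3.2+, where Scala UDFs receive year-month interval values as 
java.time.Period; the UDF and column names are illustrative only):
{code:java}
import org.apache.spark.sql.functions.{col, udf}

// Converts a year-month interval, surfaced as java.time.Period, to its total months.
val periodToMonths = udf((p: java.time.Period) => p.toTotalMonths)

spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH AS ym")
  .select(periodToMonths(col("ym")).as("months"))
  .show()
{code}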



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35777) Check all year-month interval types in UDF

2021-06-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35777:


Assignee: Apache Spark

> Check all year-month interval types in UDF
> --
>
> Key: SPARK-35777
> URL: https://issues.apache.org/jira/browse/SPARK-35777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Check all year-month interval types in UDF:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35777) Check all year-month interval types in UDF

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366134#comment-17366134
 ] 

Apache Spark commented on SPARK-35777:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32985

> Check all year-month interval types in UDF
> --
>
> Key: SPARK-35777
> URL: https://issues.apache.org/jira/browse/SPARK-35777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Check all year-month interval types in UDF:
> # INTERVAL YEAR
> # INTERVAL YEAR TO MONTH
> # INTERVAL MONTH



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35832) Add LocalRootDirsTest trait

2021-06-20 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35832:
-

 Summary: Add LocalRootDirsTest trait
 Key: SPARK-35832
 URL: https://issues.apache.org/jira/browse/SPARK-35832
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core, Tests
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35726) Truncate java.time.Duration by fields of day-time interval type

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366130#comment-17366130
 ] 

Apache Spark commented on SPARK-35726:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32984

> Truncate java.time.Duration by fields of day-time interval type
> ---
>
> Key: SPARK-35726
> URL: https://issues.apache.org/jira/browse/SPARK-35726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Truncate input java.time.Duration instances using fields of 
> DayTimeIntervalType. For example, if DayTimeIntervalType has the end field 
> HOUR, the granularity of DayTimeIntervalType values should be hours too.
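To illustrate the intended semantics with plain java.time (not Spark internals; the 
values are arbitrary), truncating to HOUR granularity drops the minute and second 
components:
{code:java}
import java.time.Duration

val d = Duration.ofDays(1).plusHours(3).plusMinutes(25).plusSeconds(30)
// Keep only whole hours: 1 day 3 h 25 m 30 s -> 27 hours.
val truncatedToHours = Duration.ofHours(d.toHours)
println(truncatedToHours)  // PT27H
{code}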



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35769) Truncate java.time.Period by fields of year-month interval type

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366128#comment-17366128
 ] 

Apache Spark commented on SPARK-35769:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32984

> Truncate java.time.Period by fields of year-month interval type
> ---
>
> Key: SPARK-35769
> URL: https://issues.apache.org/jira/browse/SPARK-35769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Truncate input java.time.Period instances using fields of 
> YearMonthIntervalType. For example, if YearMonthIntervalType has the end 
> field YEAR, the granularity of YearMonthIntervalType values should be years too.
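Analogously, with plain java.time (not Spark internals; the values are arbitrary), 
truncating a Period to YEAR granularity keeps only the years component:
{code:java}
import java.time.Period

val p = Period.of(2, 7, 0)           // 2 years 7 months
println(Period.ofYears(p.getYears))  // P2Y
{code}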



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35726) Truncate java.time.Duration by fields of day-time interval type

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366126#comment-17366126
 ] 

Apache Spark commented on SPARK-35726:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32984

> Truncate java.time.Duration by fields of day-time interval type
> ---
>
> Key: SPARK-35726
> URL: https://issues.apache.org/jira/browse/SPARK-35726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Truncate input java.time.Duration instances using fields of 
> DayTimeIntervalType. For example, if DayTimeIntervalType has the end field 
> HOUR, the granularity of DayTimeIntervalType values should be hours too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35700) spark.sql.orc.filterPushdown not working with Spark 3.1.1 for tables with varchar data type

2021-06-20 Thread Saurabh Chawla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366123#comment-17366123
 ] 

Saurabh Chawla commented on SPARK-35700:


[~hyukjin.kwon] / [~Qin Yao] / [~cloud_fan] - I was able to reproduce this issue 
on the master branch with the steps given in SPARK-35762.

Below are some of the scenarios found while debugging this issue.

1) The issue is reproducible for ORC data created using Hive; when the insertion is 
done using Spark, we do not get this exception.

2) We get the exception because, when the file is created using Hive, readSchema at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L63 
returns a varchar data type from reader.getSchema, whereas reading an ORC file 
created by Spark returns a string data type.

3) I tried adding validation that converts varchar to string:
{code:java}
readSchema(file, conf, ignoreCorruptFiles) match {
  case Some(schema) =>
    val orcSchema = CatalystSqlParser.parseDataType(
      schema.toString).asInstanceOf[StructType]
    val varCharExist = orcSchema.fields.exists(
      x => CharVarcharUtils.hasCharVarchar(x.dataType))
    if (varCharExist) {
      Some(CharVarcharUtils.replaceCharVarcharWithStringInSchema(orcSchema))
    } else {
      Some(orcSchema)
    }
  case None => None
}{code}
After adding this fix, varchar is converted to string and the query works fine.

4) A similar data type conversion is needed in 
`def readSchema(file: Path, conf: Configuration, ignoreCorruptFiles: Boolean)`, 
which is called by inferSchema and readOrcSchemasInParallel.

If this approach is fine, then I can go ahead and create the PR for it; otherwise, 
if we want to explore some other approach, we can discuss it here.

> spark.sql.orc.filterPushdown not working with Spark 3.1.1 for tables with 
> varchar data type
> ---
>
> Key: SPARK-35700
> URL: https://issues.apache.org/jira/browse/SPARK-35700
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark, Spark Core
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1 on K8S
>Reporter: Arghya Saha
>Priority: Major
>
> We are not able to upgrade to Spark 3.1.1 from Spark 2.4.x as the join on 
> a varchar column is failing, which is unexpected and works on Spark 3.0.0. We 
> are trying to run it on Spark 3.1.1 (MR 3.2) on K8s.
> Below is my use case:
> Tables are external Hive tables and files are stored as ORC. We do have a 
> varchar column, and when we try to perform a join on the varchar column we 
> get the exception.
> As I understand it, Spark 3.1.1 introduced the varchar data type, but it seems 
> it is not well tested with ORC and does not have backward compatibility. I have 
> even tried the below config without luck:
> *spark.sql.legacy.charVarcharAsString: "true"*
> We do not get the error when *spark.sql.orc.filterPushdown=false*
> Below is the code (here col1 is of type varchar(32) in Hive):
> {code:java}
> df = spark.sql("select col1, col2 from table1 a inner join table2 b on 
> (a.col1=b.col1 and a.col2 > b.col2)")
> df.write.format("orc").option("compression", 
> "zlib").mode("Append").save("")
> {code}
> Below is the error:
>  
> {code:java}
> Job aborted due to stage failure: Task 43 in stage 5.0 failed 4 times, most 
> recent failure: Lost task 43.3 in stage 5.0 (TID 524) (10.219.36.64 executor 
> 5): java.lang.UnsupportedOperationException: DataType: varchar(32)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.getPredicateLeafType(OrcFilters.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.getType$1(OrcFilters.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:266)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.convertibleFiltersHelper$1(OrcFilters.scala:132)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.$anonfun$convertibleFilters$4(OrcFilters.scala:135)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>   at scala.collection.immutable.List.flatMap(List.scala:355)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.convertibleFilters(OrcFilters.scala:134)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFilters$.createFilter(OrcFilters

[jira] [Commented] (SPARK-35802) Error loading the stages/stage/ page in spark UI

2021-06-20 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366122#comment-17366122
 ] 

Gengliang Wang commented on SPARK-35802:


+1 with [~sarutak]. It's an API for rendering the UI and requires certain 
parameters.
FYI [~Heltman], you may consider using the REST APIs in 
https://spark.apache.org/docs/latest/monitoring.html#rest-api instead.
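For example, a minimal sketch of querying one of those REST endpoints from Scala; the 
host, port, application id, and stage id below are placeholders:
{code:java}
import scala.io.Source

// The taskSummary endpoint is the one that already works in the report above.
val url = "http://<driver-host>:4040/api/v1/applications/<app-id>/stages/11/0/taskSummary"
println(Source.fromURL(url).mkString)
{code}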

> Error loading the stages/stage/ page in spark UI
> 
>
> Key: SPARK-35802
> URL: https://issues.apache.org/jira/browse/SPARK-35802
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1, 3.1.1, 3.1.2
>Reporter: Helt Long
>Priority: Major
>
> When I try to load the Spark UI page for a specific stage, I get the following 
> error:
> {quote}Unable to connect to the server. Looks like the Spark application must 
> have ended. Please Switch to the history UI.
> {quote}
> Obviously the server is still alive and processes new messages.
> Looking at the network tab shows that one of the requests fails:
>  
> {{curl 
> 'http://:8080/proxy/app-20201008130147-0001/api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable'
> 
> 
> 
> Error 500 Request failed.
> 
> HTTP ERROR 500
> Problem accessing 
> /api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable. Reason:
> Request failed. href="http://eclipse.org/jetty";>Powered by Jetty:// 9.4.z-SNAPSHOT
> 
> }}
> Requests to any other endpoint that I've tested seem to work, for example
>  
> {{curl 
> 'http://:8080/proxy/app-20201008130147-0001/api/v1/applications/app-20201008130147-0001/stages/11/0/taskSummary'}}
>  
> The exception is:
> {{/api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable
> javax.servlet.ServletException: java.lang.NullPointerException
> at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410)
> at 
> org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
> at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
> at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
> at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
> at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
> at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
> at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
> at 
> org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
> at 
> org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
> at 
> org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.sparkproject.jetty.server.Server.handle(Server.java:505)
> at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370)
> at 
> org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
> at 
> org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
> at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103)
> at 
> org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
> at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at 
> org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at 
> org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at 
> org.sparkproject.jetty.uti