[jira] [Assigned] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35838: Assignee: Apache Spark > Ensure kafka-0-10-sql module can maven test independently in Scala 2.13 > --- > > Key: SPARK-35838 > URL: https://issues.apache.org/jira/browse/SPARK-35838 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > > Execute > > {code:java} > mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl external/kafka-0-10-sql > {code} > > 1 scala test aborted, the error message is > {code:java} > Discovery starting. > Discovery completed in 857 milliseconds. > Run starting. Expected test count is: 464 > ... > KafkaRelationSuiteV2: > - explicit earliest to latest offsets > - default starting and ending offsets > - explicit offsets > - default starting and ending offsets with headers > - timestamp provided for starting and ending > - timestamp provided for starting, offset provided for ending > - timestamp provided for ending, offset provided for starting > - timestamp provided for starting, ending not provided > - timestamp provided for ending, starting not provided > - global timestamp provided for starting and ending > - no matched offset for timestamp - startingOffsets > - preferences on offset related options > - no matched offset for timestamp - endingOffsets > *** RUN ABORTED *** > java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) > ... > Cause: java.lang.ClassNotFoundException: > scala.collection.parallel.TaskSupport > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > ... > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35838: Assignee: (was: Apache Spark) > Ensure kafka-0-10-sql module can maven test independently in Scala 2.13 > --- > > Key: SPARK-35838 > URL: https://issues.apache.org/jira/browse/SPARK-35838 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > > Execute > > {code:java} > mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl external/kafka-0-10-sql > {code} > > 1 scala test aborted, the error message is > {code:java} > Discovery starting. > Discovery completed in 857 milliseconds. > Run starting. Expected test count is: 464 > ... > KafkaRelationSuiteV2: > - explicit earliest to latest offsets > - default starting and ending offsets > - explicit offsets > - default starting and ending offsets with headers > - timestamp provided for starting and ending > - timestamp provided for starting, offset provided for ending > - timestamp provided for ending, offset provided for starting > - timestamp provided for starting, ending not provided > - timestamp provided for ending, starting not provided > - global timestamp provided for starting and ending > - no matched offset for timestamp - startingOffsets > - preferences on offset related options > - no matched offset for timestamp - endingOffsets > *** RUN ABORTED *** > java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) > ... > Cause: java.lang.ClassNotFoundException: > scala.collection.parallel.TaskSupport > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > ... > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-35838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366418#comment-17366418 ] Apache Spark commented on SPARK-35838: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/32994 > Ensure kafka-0-10-sql module can maven test independently in Scala 2.13 > --- > > Key: SPARK-35838 > URL: https://issues.apache.org/jira/browse/SPARK-35838 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > > Execute > > {code:java} > mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > -Pscala-2.13 -pl external/kafka-0-10-sql > {code} > > 1 scala test aborted, the error message is > {code:java} > Discovery starting. > Discovery completed in 857 milliseconds. > Run starting. Expected test count is: 464 > ... > KafkaRelationSuiteV2: > - explicit earliest to latest offsets > - default starting and ending offsets > - explicit offsets > - default starting and ending offsets with headers > - timestamp provided for starting and ending > - timestamp provided for starting, offset provided for ending > - timestamp provided for ending, offset provided for starting > - timestamp provided for starting, ending not provided > - timestamp provided for ending, starting not provided > - global timestamp provided for starting and ending > - no matched offset for timestamp - startingOffsets > - preferences on offset related options > - no matched offset for timestamp - endingOffsets > *** RUN ABORTED *** > java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) > ... > Cause: java.lang.ClassNotFoundException: > scala.collection.parallel.TaskSupport > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) > at org.apache.spark.SparkContext.union(SparkContext.scala:1405) > at > org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) > ... > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35838) Ensure kafka-0-10-sql module can maven test independently in Scala 2.13
Yang Jie created SPARK-35838: Summary: Ensure kafka-0-10-sql module can maven test independently in Scala 2.13 Key: SPARK-35838 URL: https://issues.apache.org/jira/browse/SPARK-35838 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.2.0 Reporter: Yang Jie Execute {code:java} mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/kafka-0-10-sql {code} 1 scala test aborted, the error message is {code:java} Discovery starting. Discovery completed in 857 milliseconds. Run starting. Expected test count is: 464 ... KafkaRelationSuiteV2: - explicit earliest to latest offsets - default starting and ending offsets - explicit offsets - default starting and ending offsets with headers - timestamp provided for starting and ending - timestamp provided for starting, offset provided for ending - timestamp provided for ending, offset provided for starting - timestamp provided for starting, ending not provided - timestamp provided for ending, starting not provided - global timestamp provided for starting and ending - no matched offset for timestamp - startingOffsets - preferences on offset related options - no matched offset for timestamp - endingOffsets *** RUN ABORTED *** java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) ... Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) ... {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35776) Check all year-month interval types in arrow
[ https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366409#comment-17366409 ] Apache Spark commented on SPARK-35776: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32993 > Check all year-month interval types in arrow > > > Key: SPARK-35776 > URL: https://issues.apache.org/jira/browse/SPARK-35776 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Add tests to check that all year-month interval types are supported in > (de-)serialization from/to Arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35776) Check all year-month interval types in arrow
[ https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35776: Assignee: Apache Spark > Check all year-month interval types in arrow > > > Key: SPARK-35776 > URL: https://issues.apache.org/jira/browse/SPARK-35776 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Add tests to check that all year-month interval types are supported in > (de-)serialization from/to Arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35776) Check all year-month interval types in arrow
[jira] [Assigned] (SPARK-35776) Check all year-month interval types in arrow
[ https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35776: Assignee: Apache Spark > Check all year-month interval types in arrow > > > Key: SPARK-35776 > URL: https://issues.apache.org/jira/browse/SPARK-35776 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Add tests to check that all year-month interval types are supported in > (de-)serialization from/to Arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite
[ https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35836: Assignee: (was: Apache Spark) > Remove reference to spark.shuffle.push.based.enabled in > ShuffleBlockPusherSuite > --- > > Key: SPARK-35836 > URL: https://issues.apache.org/jira/browse/SPARK-35836 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Trivial > > The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in > this suite, the configuration for push-based shuffle is incorrectly > referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this > config from here. > {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and > this suite is for {{ShuffleBlockPusher}}, so no other change is required. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite
[ https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366406#comment-17366406 ] Apache Spark commented on SPARK-35836: -- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/32992 > Remove reference to spark.shuffle.push.based.enabled in > ShuffleBlockPusherSuite > --- > > Key: SPARK-35836 > URL: https://issues.apache.org/jira/browse/SPARK-35836 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Trivial > > The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in > this suite, the configuration for push-based shuffle is incorrectly > referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this > config from here. > {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and > this suite is for {{ShuffleBlockPusher}}, so no other change is required. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite
[ https://issues.apache.org/jira/browse/SPARK-35836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35836: Assignee: Apache Spark > Remove reference to spark.shuffle.push.based.enabled in > ShuffleBlockPusherSuite > --- > > Key: SPARK-35836 > URL: https://issues.apache.org/jira/browse/SPARK-35836 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Apache Spark >Priority: Trivial > > The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in > this suite, the configuration for push-based shuffle is incorrectly > referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this > config from here. > {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and > this suite is for {{ShuffleBlockPusher}}, so no other change is required. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35776) Check all year-month interval types in arrow
[ https://issues.apache.org/jira/browse/SPARK-35776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366401#comment-17366401 ] angerszhu commented on SPARK-35776: --- Working on this > Check all year-month interval types in arrow > > > Key: SPARK-35776 > URL: https://issues.apache.org/jira/browse/SPARK-35776 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Add tests to check that all year-month interval types are supported in > (de-)serialization from/to Arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35837: Description: Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. 4. Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252
[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35837: Description: Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. 4. Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN RECOMMENDATION 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252
[jira] [Updated] (SPARK-35837) Recommendations for Common Query Problems
[ https://issues.apache.org/jira/browse/SPARK-35837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-35837: Description: Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect the bucket read, for example, Analyzer add a cast to bucket column. 4. Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614 | | 176984, 3, 81003252
[jira] [Created] (SPARK-35837) Recommendations for Common Query Problems
Yuming Wang created SPARK-35837: --- Summary: Recommendations for Common Query Problems Key: SPARK-35837 URL: https://issues.apache.org/jira/browse/SPARK-35837 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang Teradata supports [Recommendations for Common Query Problems|https://docs.teradata.com/r/wada1XMYPkZVTqPKz2CNaw/JE7PEg6H~4nBZYEGphxxsg]. We can implement a similar feature. 1. Detect the most skew values for join. The user decides whether these are needed. 2. Detect the most skew values for window function. The user decides whether these are needed. 3. Detect it can be optimized to bucket read, for example, Analyzer add a cast to bucket column. 4.Recommend the user add a filter condition to the partition column of the partition table. 5. Check the condition of join, for example, the result of cast string to double may be incorrect. For example: {code:sql} 0: jdbc:hive2://localhost:1/default> EXPLAIN OPTIMIZE 0: jdbc:hive2://localhost:1/default> SELECT a.*, 0: jdbc:hive2://localhost:1/default>CASE 0: jdbc:hive2://localhost:1/default> WHEN ( NOT ( a.exclude = 1 0: jdbc:hive2://localhost:1/default> AND a.cobrand = 6 0: jdbc:hive2://localhost:1/default> AND a.primary_app_id IN ( 1462, 2878, 2571 ) ) ) 0: jdbc:hive2://localhost:1/default> AND ( a.valid_page_count = 1 ) THEN 1 0: jdbc:hive2://localhost:1/default> ELSE 0 0: jdbc:hive2://localhost:1/default>END AS is_singlepage, 0: jdbc:hive2://localhost:1/default>ca.bsns_vrtcl_name 0: jdbc:hive2://localhost:1/default> FROM (SELECT * 0: jdbc:hive2://localhost:1/default> FROM (SELECT *, 0: jdbc:hive2://localhost:1/default>'VI' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl1 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'SRP' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl2 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'My Garage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl3 0: jdbc:hive2://localhost:1/default> UNION 0: jdbc:hive2://localhost:1/default> SELECT *, 0: jdbc:hive2://localhost:1/default>'Motors Homepage' AS page_type 0: jdbc:hive2://localhost:1/default> FROM tbl4) t 0: jdbc:hive2://localhost:1/default> WHERE session_start_dt BETWEEN ( '2020-01-01' ) AND ( 0: jdbc:hive2://localhost:1/default> CURRENT_DATE() - 10 )) a 0: jdbc:hive2://localhost:1/default>LEFT JOIN (SELECT item_id, 0: jdbc:hive2://localhost:1/default> item_site_id, 0: jdbc:hive2://localhost:1/default> auct_end_dt, 0: jdbc:hive2://localhost:1/default> leaf_categ_id 0: jdbc:hive2://localhost:1/default> FROM tbl5 0: jdbc:hive2://localhost:1/default> WHERE auct_end_dt >= ( '2020-01-01' )) itm 0: jdbc:hive2://localhost:1/default> ON a.item_id = itm.item_id 0: jdbc:hive2://localhost:1/default>LEFT JOIN tbl6 ca 0: jdbc:hive2://localhost:1/default> ON itm.leaf_categ_id = ca.leaf_categ_id 0: jdbc:hive2://localhost:1/default> AND itm.item_site_id = ca.site_id; +-+--+ | result | +-+--+ | 1. Detect the most skew values for join | | Check join: Join LeftOuter, ((leaf_categ_id#1453 = leaf_categ_id#3020) AND (cast(item_site_id#1444 as decimal(9,0)) = site_id#3022)) | | table: tbl5 | | leaf_categ_id, item_site_id, count | | 171243, 0, 115412614
[jira] [Created] (SPARK-35836) Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite
Chandni Singh created SPARK-35836: - Summary: Remove reference to spark.shuffle.push.based.enabled in ShuffleBlockPusherSuite Key: SPARK-35836 URL: https://issues.apache.org/jira/browse/SPARK-35836 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 3.1.0 Reporter: Chandni Singh The test suite for ShuffleBlockPusherSuite was added with SPARK-32917 and in this suite, the configuration for push-based shuffle is incorrectly referenced as {{spark.shuffle.push.based.enabled}}. We need to remove this config from here. {{ShuffleBlockPusher}} is created only when push based shuffle is enabled and this suite is for {{ShuffleBlockPusher}}, so no other change is required. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35678) add a common softmax function
[ https://issues.apache.org/jira/browse/SPARK-35678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366390#comment-17366390 ] Apache Spark commented on SPARK-35678: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/32991 > add a common softmax function > - > > Key: SPARK-35678 > URL: https://issues.apache.org/jira/browse/SPARK-35678 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.2.0 > > > add softmax function in utils, which can be used in multi places -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35835) Select filter query on table with struct complex type fails
[ https://issues.apache.org/jira/browse/SPARK-35835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366386#comment-17366386 ] pavithra ramachandran commented on SPARK-35835: --- i shall raise a PR soon > Select filter query on table with struct complex type fails > --- > > Key: SPARK-35835 > URL: https://issues.apache.org/jira/browse/SPARK-35835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 >Reporter: Chetan Bhat >Priority: Minor > > [Steps]:- > From Spark beeline create a parquet or ORC table having complex type data. > Load data in the table and execute select filter query. > 0: jdbc:hive2://vm2:22550/> create table Struct_com (CUST_ID string, YEAR > int, MONTH int, AGE int, GENDER string, EDUCATED string, IS_MARRIED string, > STRUCT_INT_DOUBLE_STRING_DATE > struct,CARD_COUNT > int,DEBIT_COUNT int, CREDIT_COUNT int, DEPOSIT double, HQ_DEPOSIT double) > stored as parquet; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.161 seconds) > 0: jdbc:hive2://vm2:22550/> LOAD DATA INPATH > 'hdfs://hacluster/chetan/Struct.csv' OVERWRITE INTO TABLE Struct_com; > +-+ > | Result | > +-+ > +-+ > No rows selected (1.09 seconds) > 0: jdbc:hive2://vm2:22550/> SELECT > struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_in > t_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum > FROM (select * from Struct_com) SUB_QRY WHERE > struct_int_double_string_date.id > 5700 GRO UP BY > struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country > ORDER BY struct_int_double_string_date.COUNTRY > asc,struct_int_double_string_date.CHECK_DATE > asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_stri > ng_date.Country asc; > > [Actual Issue] : - Select filter query on table with struct complex type fails > 0: jdbc:hive2://vm2:22550/> SELECT > struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_in > t_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum > FROM (select * from Struct_com) SUB_QRY WHERE > struct_int_double_string_date.id > 5700 GRO UP BY > struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country > ORDER BY struct_int_double_string_date.COUNTRY > asc,struct_int_double_string_date.CHECK_DATE > asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_stri > ng_date.Country asc; > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange rangepartitioning(COUNTRY#139896 ASC NULLS FIRST, CHECK_DATE#139897 > ASC NULLS FIRST, CHECK_DATE#139897 ASC NULLS FIRST, COUNTRY#139896 ASC NULLS > FIRST, 200 ), ENSURE_REQUIREMENTS, [id=#17161] > +- *(2) HashAggregate(keys=[_gen_alias_139928#139928, > _gen_alias_139929#139929], functions=[sum(cast(_gen_alias_139931#139931 as > bigint))], output=[COUNTRY#139896, CHECK_DATE#139897, CHECK_DATE#139898, > Country#139899, Sum#139877L]) > +- Exchange hashpartitioning(_gen_alias_139928#139928, > _gen_alias_139929#139929, 200), ENSURE_REQUIREMENTS, [id=#17157] > +- *(1) HashAggregate(keys=[_gen_alias_139928#139928, > _gen_alias_139929#139929], > functions=[partial_sum(cast(_gen_alias_139931#139931 as bigint))], output=[_g > en_alias_139928#139928, _gen_alias_139929#139929, sum#139934L]) > +- *(1) Project [STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS > _gen_alias_139928#139928, STRUCT_INT_DOUBLE_STRING_DATE#139885.CHECK_DATE AS > _gen_alias_13 9929#139929, STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS > _gen_alias_139930#139930, STRUCT_INT_DOUBLE_STRING_DATE#139885.ID AS > _gen_alias_139931#139931] > +- *(1) Filter (isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885) AND > (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)) > +- FileScan parquet default.struct_com[STRUCT_INT_DOUBLE_STRING_DATE#139885] > Batched: false, DataFilters: [isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#13 > 9885), (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)], Format: Parquet, > Location: InMemoryFileIndex[hdfs://hacluster/user/hive/warehouse/struct_com], > PartitionFi lters: [], PushedFilters: > [IsNotNull(STRUCT_INT_DOUBLE_STRING_DATE), > GreaterThan(STRUCT_INT_DOUBLE_STRING_DATE.ID,5700)], ReadSchema: > struct G_DATE:struct> > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$a
[jira] [Created] (SPARK-35835) Select filter query on table with struct complex type fails
Chetan Bhat created SPARK-35835: --- Summary: Select filter query on table with struct complex type fails Key: SPARK-35835 URL: https://issues.apache.org/jira/browse/SPARK-35835 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Environment: Spark 3.1.1 Reporter: Chetan Bhat [Steps]:- >From Spark beeline create a parquet or ORC table having complex type data. >Load data in the table and execute select filter query. 0: jdbc:hive2://vm2:22550/> create table Struct_com (CUST_ID string, YEAR int, MONTH int, AGE int, GENDER string, EDUCATED string, IS_MARRIED string, STRUCT_INT_DOUBLE_STRING_DATE struct,CARD_COUNT int,DEBIT_COUNT int, CREDIT_COUNT int, DEPOSIT double, HQ_DEPOSIT double) stored as parquet; +-+ | Result | +-+ +-+ No rows selected (0.161 seconds) 0: jdbc:hive2://vm2:22550/> LOAD DATA INPATH 'hdfs://hacluster/chetan/Struct.csv' OVERWRITE INTO TABLE Struct_com; +-+ | Result | +-+ +-+ No rows selected (1.09 seconds) 0: jdbc:hive2://vm2:22550/> SELECT struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_in t_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum FROM (select * from Struct_com) SUB_QRY WHERE struct_int_double_string_date.id > 5700 GRO UP BY struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country ORDER BY struct_int_double_string_date.COUNTRY asc,struct_int_double_string_date.CHECK_DATE asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_stri ng_date.Country asc; [Actual Issue] : - Select filter query on table with struct complex type fails 0: jdbc:hive2://vm2:22550/> SELECT struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_in t_double_string_date.Country, SUM(struct_int_double_string_date.id) AS Sum FROM (select * from Struct_com) SUB_QRY WHERE struct_int_double_string_date.id > 5700 GRO UP BY struct_int_double_string_date.COUNTRY,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.CHECK_DATE,struct_int_double_string_date.Country ORDER BY struct_int_double_string_date.COUNTRY asc,struct_int_double_string_date.CHECK_DATE asc,struct_int_double_string_date.CHECK_DATE asc, struct_int_double_stri ng_date.Country asc; Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange rangepartitioning(COUNTRY#139896 ASC NULLS FIRST, CHECK_DATE#139897 ASC NULLS FIRST, CHECK_DATE#139897 ASC NULLS FIRST, COUNTRY#139896 ASC NULLS FIRST, 200 ), ENSURE_REQUIREMENTS, [id=#17161] +- *(2) HashAggregate(keys=[_gen_alias_139928#139928, _gen_alias_139929#139929], functions=[sum(cast(_gen_alias_139931#139931 as bigint))], output=[COUNTRY#139896, CHECK_DATE#139897, CHECK_DATE#139898, Country#139899, Sum#139877L]) +- Exchange hashpartitioning(_gen_alias_139928#139928, _gen_alias_139929#139929, 200), ENSURE_REQUIREMENTS, [id=#17157] +- *(1) HashAggregate(keys=[_gen_alias_139928#139928, _gen_alias_139929#139929], functions=[partial_sum(cast(_gen_alias_139931#139931 as bigint))], output=[_g en_alias_139928#139928, _gen_alias_139929#139929, sum#139934L]) +- *(1) Project [STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS _gen_alias_139928#139928, STRUCT_INT_DOUBLE_STRING_DATE#139885.CHECK_DATE AS _gen_alias_13 9929#139929, STRUCT_INT_DOUBLE_STRING_DATE#139885.COUNTRY AS _gen_alias_139930#139930, STRUCT_INT_DOUBLE_STRING_DATE#139885.ID AS _gen_alias_139931#139931] +- *(1) Filter (isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#139885) AND (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)) +- FileScan parquet default.struct_com[STRUCT_INT_DOUBLE_STRING_DATE#139885] Batched: false, DataFilters: [isnotnull(STRUCT_INT_DOUBLE_STRING_DATE#13 9885), (STRUCT_INT_DOUBLE_STRING_DATE#139885.ID > 5700)], Format: Parquet, Location: InMemoryFileIndex[hdfs://hacluster/user/hive/warehouse/struct_com], PartitionFi lters: [], PushedFilters: [IsNotNull(STRUCT_INT_DOUBLE_STRING_DATE), GreaterThan(STRUCT_INT_DOUBLE_STRING_DATE.ID,5700)], ReadSchema: struct> at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(Spar kExecuteStatementOperation.scala:396) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$3(SparkExecuteStatementOperation.scala:281) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) at org.apache.spark.sql
[jira] [Assigned] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved
[ https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32041: --- Assignee: Peter Toth > Exchange reuse won't work in cases when DPP, subqueries are involved > > > Key: SPARK-32041 > URL: https://issues.apache.org/jira/browse/SPARK-32041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Prakhar Jain >Assignee: Peter Toth >Priority: Major > > When an Exchange node is repeated at multiple places in the PhysicalPlan, and > if that exchange has some some DPP Subquery filter, then ReuseExchange > doesn't work for such Exchange and different stages are launched to compute > same thing. > Example: > {noformat} > // generate data > val factData = (1 to 100).map(i => (i%5, i%20, i)) > factData.toDF("store_id", "product_id", "units_sold") > .write > .partitionBy("store_id") > .format("parquet") > .saveAsTable("fact_stats") > val dimData = Seq[(Int, String, String)]( > (1, "AU", "US"), > (2, "CA", "US"), > (3, "KA", "IN"), > (4, "DL", "IN"), > (5, "GA", "PA")) > dimData.toDF("store_id", "state_province", "country") > .write > .format("parquet") > .saveAsTable("dim_stats") > sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id") > sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id") > // Set Configs > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000") > val query = """ > With view1 as ( > SELECT product_id, f.store_id > FROM fact_stats f JOIN dim_stats > ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN') > SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id > """ > val df = spark.sql(query) > println(df.queryExecution.executedPlan) > {noformat} > {noformat} > Plan: > *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner > :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0 > : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140] > : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], > Inner, BuildRight > : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : : +- *(2) Filter isnotnull(product_id#1968) > : : +- *(2) ColumnarToRow > : : +- FileScan parquet > default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] > Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: > Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [isnotnull(store_id#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], > PushedFilters: [IsNotNull(product_id)], ReadSchema: struct > : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], > [id=#1131|#1131] > : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#1021|#1021] > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))), [id=#1021|#1021] > : +- *(1) Project [store_id#1971|#1971] > : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND > isnotnull(store_id#1971)) > : +- *(1) ColumnarToRow > : +- FileScan parquet > default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: > true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), > isnotnull(store_id#1971)|#1973), (country#1973 = IN), > isnotnull(store_id#1971)], Format: Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [], PushedFilters: [IsNotNull(country), > EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: > struct > +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0 > +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], > Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026] > {noformat} > Issue: > Note the last line of plan. Its a ReusedExchange which is pointing to > id=1026. But There is no Exchange node in plan with ID 1026. ReusedExchange > node is pointing to incorrect Child node (1026 instead of 1140) and so in > actual, exchange reuse won't happen in this query. > Another query where issue is because of ReuseSubquery: > {noformat} > spark.sql("set spark.sql.autoBroadcastJ
[jira] [Resolved] (SPARK-32041) Exchange reuse won't work in cases when DPP, subqueries are involved
[ https://issues.apache.org/jira/browse/SPARK-32041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32041. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 28885 [https://github.com/apache/spark/pull/28885] > Exchange reuse won't work in cases when DPP, subqueries are involved > > > Key: SPARK-32041 > URL: https://issues.apache.org/jira/browse/SPARK-32041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Prakhar Jain >Assignee: Peter Toth >Priority: Major > Fix For: 3.2.0 > > > When an Exchange node is repeated at multiple places in the PhysicalPlan, and > if that exchange has some some DPP Subquery filter, then ReuseExchange > doesn't work for such Exchange and different stages are launched to compute > same thing. > Example: > {noformat} > // generate data > val factData = (1 to 100).map(i => (i%5, i%20, i)) > factData.toDF("store_id", "product_id", "units_sold") > .write > .partitionBy("store_id") > .format("parquet") > .saveAsTable("fact_stats") > val dimData = Seq[(Int, String, String)]( > (1, "AU", "US"), > (2, "CA", "US"), > (3, "KA", "IN"), > (4, "DL", "IN"), > (5, "GA", "PA")) > dimData.toDF("store_id", "state_province", "country") > .write > .format("parquet") > .saveAsTable("dim_stats") > sql("ANALYZE TABLE fact_stats COMPUTE STATISTICS FOR COLUMNS store_id") > sql("ANALYZE TABLE dim_stats COMPUTE STATISTICS FOR COLUMNS store_id") > // Set Configs > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=true") > spark.sql("set spark.sql.autoBroadcastJoinThreshold=1000") > val query = """ > With view1 as ( > SELECT product_id, f.store_id > FROM fact_stats f JOIN dim_stats > ON f.store_id = dim_stats.store_id WHERE dim_stats.country = 'IN') > SELECT * FROM view1 v1 join view1 v2 WHERE v1.product_id = v2.product_id > """ > val df = spark.sql(query) > println(df.queryExecution.executedPlan) > {noformat} > {noformat} > Plan: > *(7) SortMergeJoin [product_id#1968|#1968], [product_id#2060|#2060], Inner > :- *(3) Sort [product_id#1968 ASC NULLS FIRST|#1968 ASC NULLS FIRST], false, > 0 > : +- Exchange hashpartitioning(product_id#1968, 5), true, [id=#1140|#1140] > : +- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : +- *(2) BroadcastHashJoin [store_id#1970|#1970], [store_id#1971|#1971], > Inner, BuildRight > : :- *(2) Project [product_id#1968, store_id#1970|#1968, store_id#1970] > : : +- *(2) Filter isnotnull(product_id#1968) > : : +- *(2) ColumnarToRow > : : +- FileScan parquet > default.fact_stats[product_id#1968,store_id#1970|#1968,store_id#1970] > Batched: true, DataFilters: [isnotnull(product_id#1968)|#1968)], Format: > Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [isnotnull(store_id#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)|#1970), > dynamicpruningexpression(store_id#1970 IN dynamicpruning#2067)], > PushedFilters: [IsNotNull(product_id)], ReadSchema: struct > : : +- SubqueryBroadcast dynamicpruning#2067, 0, [store_id#1971|#1971], > [id=#1131|#1131] > : : +- ReusedExchange [store_id#1971|#1971], BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#1021|#1021] > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))), [id=#1021|#1021] > : +- *(1) Project [store_id#1971|#1971] > : +- *(1) Filter ((isnotnull(country#1973) AND (country#1973 = IN)) AND > isnotnull(store_id#1971)) > : +- *(1) ColumnarToRow > : +- FileScan parquet > default.dim_stats[store_id#1971,country#1973|#1971,country#1973] Batched: > true, DataFilters: [isnotnull(country#1973), (country#1973 = IN), > isnotnull(store_id#1971)|#1973), (country#1973 = IN), > isnotnull(store_id#1971)], Format: Parquet, Location: > InMemoryFileIndex[file:/home/prakhar/src/os/1_spark/sql/core/spark-warehouse/org.apache.spark.sql..., > PartitionFilters: [], PushedFilters: [IsNotNull(country), > EqualTo(country,IN), IsNotNull(store_id)], ReadSchema: > struct > +- *(6) Sort [product_id#2060 ASC NULLS FIRST|#2060 ASC NULLS FIRST], false, > 0 > +- ReusedExchange [product_id#2060, store_id#2062|#2060, store_id#2062], > Exchange hashpartitioning(product_id#1968, 5), true, [id=#1026|#1026] > {noformat} > Issue: > Note the last line of plan. Its a ReusedExchange which is pointing to > id=1026. But There is no Exchange node in plan with ID 1026. ReusedExchange > node is pointing to incorrect Child node (1026 instead of 1140) and so in > actual, exchange reuse won't
[jira] [Resolved] (SPARK-28940) Subquery reuse across all subquery levels
[ https://issues.apache.org/jira/browse/SPARK-28940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28940. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 28885 [https://github.com/apache/spark/pull/28885] > Subquery reuse across all subquery levels > - > > Key: SPARK-28940 > URL: https://issues.apache.org/jira/browse/SPARK-28940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.2.0 > > > Currently subquery reuse doesn't work across all subquery levels. > Here is an example query: > {noformat} > SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM > testData)) > FROM testData > LIMIT 1 > {noformat} > where the plan now is: > {noformat} > CollectLimit 1 > +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS > scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS > scalarsubquery()#277] >: :- Subquery scalar-subquery#268, [id=#231] >: : +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as > bigint))], output=[avg(key)#272]) >: : +- Exchange SinglePartition, true, [id=#227] >: :+- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L]) >: : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >: : +- Scan[obj#12] >: +- Subquery scalar-subquery#270, [id=#266] >: +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS > scalarsubquery()#275] >:: +- Subquery scalar-subquery#269, [id=#263] >:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 > as bigint))], output=[avg(key)#274]) >::+- Exchange SinglePartition, true, [id=#259] >:: +- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L]) >:: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:: +- Scan[obj#12] >:+- *(1) Scan OneRowRelation[] >+- *(1) SerializeFromObject > +- Scan[obj#12] > {noformat} > but it could be: > {noformat} > CollectLimit 1 > +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS > scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS > scalarsubquery()#249] >: :- ReusedSubquery Subquery scalar-subquery#241, [id=#148] >: +- Subquery scalar-subquery#242, [id=#164] >: +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS > scalarsubquery()#247] >:: +- Subquery scalar-subquery#241, [id=#148] >:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 > as bigint))], output=[avg(key)#246]) >::+- Exchange SinglePartition, true, [id=#144] >:: +- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L]) >:: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:: +- Scan[obj#12] >:+- *(1) Scan OneRowRelation[] >+- *(1) SerializeFromObject > +- Scan[obj#12] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28940) Subquery reuse across all subquery levels
[ https://issues.apache.org/jira/browse/SPARK-28940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28940: --- Assignee: Peter Toth > Subquery reuse across all subquery levels > - > > Key: SPARK-28940 > URL: https://issues.apache.org/jira/browse/SPARK-28940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > > Currently subquery reuse doesn't work across all subquery levels. > Here is an example query: > {noformat} > SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM > testData)) > FROM testData > LIMIT 1 > {noformat} > where the plan now is: > {noformat} > CollectLimit 1 > +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS > scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS > scalarsubquery()#277] >: :- Subquery scalar-subquery#268, [id=#231] >: : +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as > bigint))], output=[avg(key)#272]) >: : +- Exchange SinglePartition, true, [id=#227] >: :+- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L]) >: : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >: : +- Scan[obj#12] >: +- Subquery scalar-subquery#270, [id=#266] >: +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS > scalarsubquery()#275] >:: +- Subquery scalar-subquery#269, [id=#263] >:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 > as bigint))], output=[avg(key)#274]) >::+- Exchange SinglePartition, true, [id=#259] >:: +- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L]) >:: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:: +- Scan[obj#12] >:+- *(1) Scan OneRowRelation[] >+- *(1) SerializeFromObject > +- Scan[obj#12] > {noformat} > but it could be: > {noformat} > CollectLimit 1 > +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS > scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS > scalarsubquery()#249] >: :- ReusedSubquery Subquery scalar-subquery#241, [id=#148] >: +- Subquery scalar-subquery#242, [id=#164] >: +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS > scalarsubquery()#247] >:: +- Subquery scalar-subquery#241, [id=#148] >:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 > as bigint))], output=[avg(key)#246]) >::+- Exchange SinglePartition, true, [id=#144] >:: +- *(1) HashAggregate(keys=[], > functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L]) >:: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:: +- Scan[obj#12] >:+- *(1) Scan OneRowRelation[] >+- *(1) SerializeFromObject > +- Scan[obj#12] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29375) Exchange reuse across all subquery levels
[ https://issues.apache.org/jira/browse/SPARK-29375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29375. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 28885 [https://github.com/apache/spark/pull/28885] > Exchange reuse across all subquery levels > - > > Key: SPARK-29375 > URL: https://issues.apache.org/jira/browse/SPARK-29375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.2.0 > > > Currently exchange reuse doesn't work across all subquery levels. > Here is an example query: > {noformat} > SELECT > (SELECT max(a.key) FROM testData AS a JOIN testData AS b ON b.key = a.key), > a.key > FROM testData AS a > JOIN testData AS b ON b.key = a.key{noformat} > where the plan is: > {noformat} > *(5) Project [Subquery scalar-subquery#240, [id=#193] AS > scalarsubquery()#247, key#13] > : +- Subquery scalar-subquery#240, [id=#193] > : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], > output=[max(key)#246]) > :+- Exchange SinglePartition, true, [id=#189] > : +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], > output=[max#251]) > : +- *(5) Project [key#13] > : +- *(5) SortMergeJoin [key#13], [key#243], Inner > ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 > :: +- Exchange hashpartitioning(key#13, 5), true, > [id=#145] > :: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] > ::+- Scan[obj#12] > :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0 > : +- ReusedExchange [key#243], Exchange > hashpartitioning(key#13, 5), true, [id=#145] > +- *(5) SortMergeJoin [key#13], [key#241], Inner >:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(key#13, 5), true, [id=#205] >: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:+- Scan[obj#12] >+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0 > +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), > true, [id=#205] > {noformat} > but it could be improved as here: > {noformat} > *(5) Project [Subquery scalar-subquery#240, [id=#211] AS > scalarsubquery()#247, key#13] > : +- Subquery scalar-subquery#240, [id=#211] > : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], > output=[max(key)#246]) > :+- Exchange SinglePartition, true, [id=#207] > : +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], > output=[max#251]) > : +- *(5) Project [key#13] > : +- *(5) SortMergeJoin [key#13], [key#243], Inner > ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 > :: +- Exchange hashpartitioning(key#13, 5), true, > [id=#145] > :: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] > ::+- Scan[obj#12] > :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0 > : +- ReusedExchange [key#243], Exchange > hashpartitioning(key#13, 5), true, [id=#145] > +- *(5) SortMergeJoin [key#13], [key#241], Inner >:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 >: +- ReusedExchange [key#13], Exchange hashpartitioning(key#13, 5), true, > [id=#145] >+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0 > +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), > true, [id=#145] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29375) Exchange reuse across all subquery levels
[ https://issues.apache.org/jira/browse/SPARK-29375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29375: --- Assignee: Peter Toth > Exchange reuse across all subquery levels > - > > Key: SPARK-29375 > URL: https://issues.apache.org/jira/browse/SPARK-29375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > > Currently exchange reuse doesn't work across all subquery levels. > Here is an example query: > {noformat} > SELECT > (SELECT max(a.key) FROM testData AS a JOIN testData AS b ON b.key = a.key), > a.key > FROM testData AS a > JOIN testData AS b ON b.key = a.key{noformat} > where the plan is: > {noformat} > *(5) Project [Subquery scalar-subquery#240, [id=#193] AS > scalarsubquery()#247, key#13] > : +- Subquery scalar-subquery#240, [id=#193] > : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], > output=[max(key)#246]) > :+- Exchange SinglePartition, true, [id=#189] > : +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], > output=[max#251]) > : +- *(5) Project [key#13] > : +- *(5) SortMergeJoin [key#13], [key#243], Inner > ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 > :: +- Exchange hashpartitioning(key#13, 5), true, > [id=#145] > :: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] > ::+- Scan[obj#12] > :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0 > : +- ReusedExchange [key#243], Exchange > hashpartitioning(key#13, 5), true, [id=#145] > +- *(5) SortMergeJoin [key#13], [key#241], Inner >:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(key#13, 5), true, [id=#205] >: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] >:+- Scan[obj#12] >+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0 > +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), > true, [id=#205] > {noformat} > but it could be improved as here: > {noformat} > *(5) Project [Subquery scalar-subquery#240, [id=#211] AS > scalarsubquery()#247, key#13] > : +- Subquery scalar-subquery#240, [id=#211] > : +- *(6) HashAggregate(keys=[], functions=[max(key#13)], > output=[max(key)#246]) > :+- Exchange SinglePartition, true, [id=#207] > : +- *(5) HashAggregate(keys=[], functions=[partial_max(key#13)], > output=[max#251]) > : +- *(5) Project [key#13] > : +- *(5) SortMergeJoin [key#13], [key#243], Inner > ::- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 > :: +- Exchange hashpartitioning(key#13, 5), true, > [id=#145] > :: +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] > ::+- Scan[obj#12] > :+- *(4) Sort [key#243 ASC NULLS FIRST], false, 0 > : +- ReusedExchange [key#243], Exchange > hashpartitioning(key#13, 5), true, [id=#145] > +- *(5) SortMergeJoin [key#13], [key#241], Inner >:- *(2) Sort [key#13 ASC NULLS FIRST], false, 0 >: +- ReusedExchange [key#13], Exchange hashpartitioning(key#13, 5), true, > [id=#145] >+- *(4) Sort [key#241 ASC NULLS FIRST], false, 0 > +- ReusedExchange [key#241], Exchange hashpartitioning(key#13, 5), > true, [id=#145] > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35545) Split SubqueryExpression's children field into outer attributes and join conditions
[ https://issues.apache.org/jira/browse/SPARK-35545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366373#comment-17366373 ] Apache Spark commented on SPARK-35545: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32990 > Split SubqueryExpression's children field into outer attributes and join > conditions > --- > > Key: SPARK-35545 > URL: https://issues.apache.org/jira/browse/SPARK-35545 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.2.0 > > > Currently the children field of a subquery expression is used to store both > collected outer references inside the subquery plan, and also join conditions > after correlated predicates are pulled up. For example > SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2 > After analysis phase: > scalar-subquery [t2.c1] > After PullUpCorrelatedPredicates: > scalar-subquery [t1.c1 = t2.c1] > The references for a subquery expressions is also confusing: > override lazy val references: AttributeSet = > if (plan.resolved) super.references -- plan.outputSet else super.references > We should split this children field into outer attribute references and join > conditions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35545) Split SubqueryExpression's children field into outer attributes and join conditions
[ https://issues.apache.org/jira/browse/SPARK-35545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366372#comment-17366372 ] Apache Spark commented on SPARK-35545: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/32990 > Split SubqueryExpression's children field into outer attributes and join > conditions > --- > > Key: SPARK-35545 > URL: https://issues.apache.org/jira/browse/SPARK-35545 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.2.0 > > > Currently the children field of a subquery expression is used to store both > collected outer references inside the subquery plan, and also join conditions > after correlated predicates are pulled up. For example > SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2 > After analysis phase: > scalar-subquery [t2.c1] > After PullUpCorrelatedPredicates: > scalar-subquery [t1.c1 = t2.c1] > The references for a subquery expressions is also confusing: > override lazy val references: AttributeSet = > if (plan.resolved) super.references -- plan.outputSet else super.references > We should split this children field into outer attribute references and join > conditions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35811) Deprecate DataFrame.to_spark_io
[ https://issues.apache.org/jira/browse/SPARK-35811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35811. -- Assignee: Kevin Su Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32964 > Deprecate DataFrame.to_spark_io > --- > > Key: SPARK-35811 > URL: https://issues.apache.org/jira/browse/SPARK-35811 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Kevin Su >Priority: Major > > We should deprecate the > [DataFrame.to_spark_io|https://docs.google.com/document/d/1RxvQJVf736Vg9XU7uiCaRlNl-P7GdmFGa6U3Ab78JJk/edit#heading=h.todz8y4xdqrx] > since it's duplicated with > [DataFrame.spark.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.to_spark_io.html], > and it's not existed in pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35811) Deprecate DataFrame.to_spark_io
[ https://issues.apache.org/jira/browse/SPARK-35811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-35811: - Fix Version/s: 3.2.0 > Deprecate DataFrame.to_spark_io > --- > > Key: SPARK-35811 > URL: https://issues.apache.org/jira/browse/SPARK-35811 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Kevin Su >Priority: Major > Fix For: 3.2.0 > > > We should deprecate the > [DataFrame.to_spark_io|https://docs.google.com/document/d/1RxvQJVf736Vg9XU7uiCaRlNl-P7GdmFGa6U3Ab78JJk/edit#heading=h.todz8y4xdqrx] > since it's duplicated with > [DataFrame.spark.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.to_spark_io.html], > and it's not existed in pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API
[ https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35834: Assignee: Apache Spark > Use the same cleanup logic as Py4J in inheritable thread API > > > Key: SPARK-35834 > URL: https://issues.apache.org/jira/browse/SPARK-35834 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > After > https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9, > the test became flaky: > {code} > == > ERROR [71.813s]: test_save_load_pipeline_estimator > (pyspark.ml.tests.test_tuning.CrossValidatorTests) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, > in test_save_load_pipeline_estimator > self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, > in _run_test_save_load_pipeline_estimator > cvModel2 = crossval2.fit(training) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit > bestModel = est.fit(dataset, epm[bestIndex]) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit > return self.copy(params)._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in > _fit > models = pool.map(inheritable_thread_target(trainSingleClass), > range(numClasses)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 266, in map > return self._map_async(func, iterable, mapstar, chunksize).get() > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 644, in get > raise self._value > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 119, in worker > result = (True, func(*args, **kwds)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 44, in mapstar > return list(map(*args)) > File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped > InheritableThread._clean_py4j_conn_for_current_thread() > File "/__w/spark/spark/python/pyspark/util.py", line 389, in > _clean_py4j_conn_for_current_thread > del connections[i] > IndexError: deque index out of range > -- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API
[ https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35834: Assignee: (was: Apache Spark) > Use the same cleanup logic as Py4J in inheritable thread API > > > Key: SPARK-35834 > URL: https://issues.apache.org/jira/browse/SPARK-35834 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > After > https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9, > the test became flaky: > {code} > == > ERROR [71.813s]: test_save_load_pipeline_estimator > (pyspark.ml.tests.test_tuning.CrossValidatorTests) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, > in test_save_load_pipeline_estimator > self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, > in _run_test_save_load_pipeline_estimator > cvModel2 = crossval2.fit(training) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit > bestModel = est.fit(dataset, epm[bestIndex]) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit > return self.copy(params)._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in > _fit > models = pool.map(inheritable_thread_target(trainSingleClass), > range(numClasses)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 266, in map > return self._map_async(func, iterable, mapstar, chunksize).get() > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 644, in get > raise self._value > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 119, in worker > result = (True, func(*args, **kwds)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 44, in mapstar > return list(map(*args)) > File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped > InheritableThread._clean_py4j_conn_for_current_thread() > File "/__w/spark/spark/python/pyspark/util.py", line 389, in > _clean_py4j_conn_for_current_thread > del connections[i] > IndexError: deque index out of range > -- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API
[ https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366324#comment-17366324 ] Apache Spark commented on SPARK-35834: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32989 > Use the same cleanup logic as Py4J in inheritable thread API > > > Key: SPARK-35834 > URL: https://issues.apache.org/jira/browse/SPARK-35834 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > After > https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9, > the test became flaky: > {code} > == > ERROR [71.813s]: test_save_load_pipeline_estimator > (pyspark.ml.tests.test_tuning.CrossValidatorTests) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, > in test_save_load_pipeline_estimator > self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, > in _run_test_save_load_pipeline_estimator > cvModel2 = crossval2.fit(training) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit > bestModel = est.fit(dataset, epm[bestIndex]) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit > return self.copy(params)._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in > _fit > models = pool.map(inheritable_thread_target(trainSingleClass), > range(numClasses)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 266, in map > return self._map_async(func, iterable, mapstar, chunksize).get() > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 644, in get > raise self._value > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 119, in worker > result = (True, func(*args, **kwds)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 44, in mapstar > return list(map(*args)) > File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped > InheritableThread._clean_py4j_conn_for_current_thread() > File "/__w/spark/spark/python/pyspark/util.py", line 389, in > _clean_py4j_conn_for_current_thread > del connections[i] > IndexError: deque index out of range > -- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API
[ https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-35834: - Issue Type: Bug (was: Improvement) > Use the same cleanup logic as Py4J in inheritable thread API > > > Key: SPARK-35834 > URL: https://issues.apache.org/jira/browse/SPARK-35834 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > After > https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9, > the test became flaky: > {code} > == > ERROR [71.813s]: test_save_load_pipeline_estimator > (pyspark.ml.tests.test_tuning.CrossValidatorTests) > -- > Traceback (most recent call last): > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, > in test_save_load_pipeline_estimator > self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) > File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, > in _run_test_save_load_pipeline_estimator > cvModel2 = crossval2.fit(training) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit > bestModel = est.fit(dataset, epm[bestIndex]) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit > return self.copy(params)._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit > model = stage.fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit > return self._fit(dataset) > File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in > _fit > models = pool.map(inheritable_thread_target(trainSingleClass), > range(numClasses)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 266, in map > return self._map_async(func, iterable, mapstar, chunksize).get() > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 644, in get > raise self._value > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 119, in worker > result = (True, func(*args, **kwds)) > File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line > 44, in mapstar > return list(map(*args)) > File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped > InheritableThread._clean_py4j_conn_for_current_thread() > File "/__w/spark/spark/python/pyspark/util.py", line 389, in > _clean_py4j_conn_for_current_thread > del connections[i] > IndexError: deque index out of range > -- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API
Hyukjin Kwon created SPARK-35834: Summary: Use the same cleanup logic as Py4J in inheritable thread API Key: SPARK-35834 URL: https://issues.apache.org/jira/browse/SPARK-35834 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon After https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9, the test became flaky: {code} == ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator cvModel2 = crossval2.fit(training) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit bestModel = est.fit(dataset, epm[bestIndex]) File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit return self.copy(params)._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit model = stage.fit(dataset) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit model = stage.fit(dataset) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses)) File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get raise self._value File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds)) File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar return list(map(*args)) File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped InheritableThread._clean_py4j_conn_for_current_thread() File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread del connections[i] IndexError: deque index out of range -- {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors
[ https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35671: --- Assignee: Chandni Singh > Add Support in the ESS to serve merged shuffle block meta and data to > executors > --- > > Key: SPARK-35671 > URL: https://issues.apache.org/jira/browse/SPARK-35671 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > With push-based shuffle enabled, the reducers send 2 different requests to > the ESS: > 1. Request to fetch the metadata of the merged shuffle block. > 2. Requests to fetch the data of the merged shuffle block in chunks which > are by default 2MB in size. > The ESS should be able to serve these requests and this Jira targets all the > changes in the network-common and network-shuffle modules to be able to > support this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors
[ https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-35671. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32811 [https://github.com/apache/spark/pull/32811] > Add Support in the ESS to serve merged shuffle block meta and data to > executors > --- > > Key: SPARK-35671 > URL: https://issues.apache.org/jira/browse/SPARK-35671 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.2.0 > > > With push-based shuffle enabled, the reducers send 2 different requests to > the ESS: > 1. Request to fetch the metadata of the merged shuffle block. > 2. Requests to fetch the data of the merged shuffle block in chunks which > are by default 2MB in size. > The ESS should be able to serve these requests and this Jira targets all the > changes in the network-common and network-shuffle modules to be able to > support this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35827) Show proper error message when update column types to year-month/day-time interval
[ https://issues.apache.org/jira/browse/SPARK-35827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35827. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32978 [https://github.com/apache/spark/pull/32978] > Show proper error message when update column types to year-month/day-time > interval > -- > > Key: SPARK-35827 > URL: https://issues.apache.org/jira/browse/SPARK-35827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > Updating column types to interval types are prohibited for V2 source tables. > So, if we attempt to update the type of a column to the conventional interval > type, an error message like "Error in query: Cannot update field > to interval type;". > But, for year-month/day-time interval types, another error message like > "Error in query: Cannot update field : cannot be cast > to interval year;" > It's not consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35775) Check all year-month interval types in aggregate expressions
[ https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35775: Assignee: Apache Spark > Check all year-month interval types in aggregate expressions > > > Key: SPARK-35775 > URL: https://issues.apache.org/jira/browse/SPARK-35775 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Check all supported combination of YearMonthIntervalType fields in the > aggregate expression: sum and avg. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35775) Check all year-month interval types in aggregate expressions
[ https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366235#comment-17366235 ] Apache Spark commented on SPARK-35775: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32988 > Check all year-month interval types in aggregate expressions > > > Key: SPARK-35775 > URL: https://issues.apache.org/jira/browse/SPARK-35775 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Check all supported combination of YearMonthIntervalType fields in the > aggregate expression: sum and avg. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35775) Check all year-month interval types in aggregate expressions
[ https://issues.apache.org/jira/browse/SPARK-35775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35775: Assignee: (was: Apache Spark) > Check all year-month interval types in aggregate expressions > > > Key: SPARK-35775 > URL: https://issues.apache.org/jira/browse/SPARK-35775 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Check all supported combination of YearMonthIntervalType fields in the > aggregate expression: sum and avg. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35832: - Assignee: Dongjoon Hyun > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35832. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32986 [https://github.com/apache/spark/pull/32986] > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366181#comment-17366181 ] Apache Spark commented on SPARK-35564: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/32987 > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-7 added support for pulling > subexpressions out of branches of conditional expressions for expressions > present in all branches. We should be able to take this a step further and > pull out subexpressions for any branch, as long as that expression will > definitely be evaluated at least once. > Consider a common data validation example: > {code:java} > from pyspark.sql.functions import * > df = spark.createDataFrame([['word'], ['1234']]) > col = regexp_replace('_1', r'\d', '') > df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code} > We only want to keep the value if it's non-empty with numbers removed, > otherwise we want it to be null. > Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > To: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems like > it would be worth it for when it is evaluated the second time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35564: Assignee: (was: Apache Spark) > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-7 added support for pulling > subexpressions out of branches of conditional expressions for expressions > present in all branches. We should be able to take this a step further and > pull out subexpressions for any branch, as long as that expression will > definitely be evaluated at least once. > Consider a common data validation example: > {code:java} > from pyspark.sql.functions import * > df = spark.createDataFrame([['word'], ['1234']]) > col = regexp_replace('_1', r'\d', '') > df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code} > We only want to keep the value if it's non-empty with numbers removed, > otherwise we want it to be null. > Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > To: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems like > it would be worth it for when it is evaluated the second time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35564: Assignee: Apache Spark > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-7 added support for pulling > subexpressions out of branches of conditional expressions for expressions > present in all branches. We should be able to take this a step further and > pull out subexpressions for any branch, as long as that expression will > definitely be evaluated at least once. > Consider a common data validation example: > {code:java} > from pyspark.sql.functions import * > df = spark.createDataFrame([['word'], ['1234']]) > col = regexp_replace('_1', r'\d', '') > df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code} > We only want to keep the value if it's non-empty with numbers removed, > otherwise we want it to be null. > Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > To: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems like > it would be worth it for when it is evaluated the second time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35833) The Statistics size of PARQUET table is not estimated correctly
HonglunChen created SPARK-35833: --- Summary: The Statistics size of PARQUET table is not estimated correctly Key: SPARK-35833 URL: https://issues.apache.org/jira/browse/SPARK-35833 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: HonglunChen {code:java} // Table 'test_txt' and 'test_parquet' have the same data. scala> val sql="select * from tmp_db.test_txt" sql: String = select * from tmp_db.test_txt scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes res5: BigInt = 92990 scala> val sql = "select * from tmp_db.test_parquet" sql: String = select * from tmp_db.test_parquet scala> spark.sql(sql).queryExecution.optimizedPlan.stats.sizeInBytes res6: BigInt = 37556 {code} PARQUET file is compressed by default, this could lead to choose the wrong type of JOIN, like BROADCASTJOIN. Driver may be OOM in this case, because the actual size may be much greater than the AUTO_BROADCASTJOIN_THRESHOLD. Can we improve this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173 ] Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:16 PM: --- When you use custom scheme, problem solved according to my test. But spark's max precision 38 seems still have some gap with oracle. For oracle, I believe there could be 39-40 for 10g version. >From oracle document. _p is the precision, or the maximum number of significant decimal digits, where the most significant digit is the left-most nonzero digit, and the least significant digit is the right-most known digit. Oracle guarantees the portability of numbers with precision of up to 20 base-100 digits, which is equivalent to 39 or 40 decimal digits depending on the position of the decimal point._ was (Author: sgejun): When you use custom scheme, problem solved according to my test. But spark's max precision 38 seems still have some gap with oracle standard. For oracle, I believe there could be 39-40 for 10g version. >From oracle document. _p is the precision, or the maximum number of significant decimal digits, where the most significant digit is the left-most nonzero digit, and the least significant digit is the right-most known digit. Oracle guarantees the portability of numbers with precision of up to 20 base-100 digits, which is equivalent to 39 or 40 decimal digits depending on the position of the decimal point._ > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173 ] Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:16 PM: --- When you use custom scheme, problem solved according to my test. But spark's max precision 38 seems still have some gap with oracle standard. For oracle, I believe there could be 39-40 for 10g version. >From oracle document. _p is the precision, or the maximum number of significant decimal digits, where the most significant digit is the left-most nonzero digit, and the least significant digit is the right-most known digit. Oracle guarantees the portability of numbers with precision of up to 20 base-100 digits, which is equivalent to 39 or 40 decimal digits depending on the position of the decimal point._ was (Author: sgejun): When you use custom scheme, problem solved according to my test. > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173 ] Gejun Shen edited comment on SPARK-20427 at 6/20/21, 12:09 PM: --- When you use custom scheme, problem solved according to my test. was (Author: sgejun): When you use custom scheme, I believe precision would also lost. For example, 123456789012345678901234567890123456789111, it will become something like 1.2531261878596697E40. Is this really what we want? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366173#comment-17366173 ] Gejun Shen commented on SPARK-20427: When you use custom scheme, I believe precision would also lost. For example, 123456789012345678901234567890123456789111, it will become something like 1.2531261878596697E40. Is this really what we want? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35832: Assignee: (was: Apache Spark) > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366135#comment-17366135 ] Apache Spark commented on SPARK-35832: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32986 > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35832: Assignee: Apache Spark > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35832) Add LocalRootDirsTest trait
[ https://issues.apache.org/jira/browse/SPARK-35832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35832: -- Issue Type: Bug (was: Improvement) > Add LocalRootDirsTest trait > --- > > Key: SPARK-35832 > URL: https://issues.apache.org/jira/browse/SPARK-35832 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Tests >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35777) Check all year-month interval types in UDF
[ https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35777: Assignee: (was: Apache Spark) > Check all year-month interval types in UDF > -- > > Key: SPARK-35777 > URL: https://issues.apache.org/jira/browse/SPARK-35777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Check all year-month interval types in UDF: > # INTERVAL YEAR > # INTERVAL YEAR TO MONTH > # INTERVAL MONTH -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35777) Check all year-month interval types in UDF
[ https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35777: Assignee: Apache Spark > Check all year-month interval types in UDF > -- > > Key: SPARK-35777 > URL: https://issues.apache.org/jira/browse/SPARK-35777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Check all year-month interval types in UDF: > # INTERVAL YEAR > # INTERVAL YEAR TO MONTH > # INTERVAL MONTH -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35777) Check all year-month interval types in UDF
[ https://issues.apache.org/jira/browse/SPARK-35777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366134#comment-17366134 ] Apache Spark commented on SPARK-35777: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32985 > Check all year-month interval types in UDF > -- > > Key: SPARK-35777 > URL: https://issues.apache.org/jira/browse/SPARK-35777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Check all year-month interval types in UDF: > # INTERVAL YEAR > # INTERVAL YEAR TO MONTH > # INTERVAL MONTH -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35832) Add LocalRootDirsTest trait
Dongjoon Hyun created SPARK-35832: - Summary: Add LocalRootDirsTest trait Key: SPARK-35832 URL: https://issues.apache.org/jira/browse/SPARK-35832 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core, Tests Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35726) Truncate java.time.Duration by fields of day-time interval type
[ https://issues.apache.org/jira/browse/SPARK-35726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366130#comment-17366130 ] Apache Spark commented on SPARK-35726: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32984 > Truncate java.time.Duration by fields of day-time interval type > --- > > Key: SPARK-35726 > URL: https://issues.apache.org/jira/browse/SPARK-35726 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > Truncate input java.time.Duration instances using fields of > DayTimeIntervalType. For example, if DayTimeIntervalType has the end field > HOUR, granularity of DayTimeIntervalType values should hours too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35769) Truncate java.time.Period by fields of year-month interval type
[ https://issues.apache.org/jira/browse/SPARK-35769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366128#comment-17366128 ] Apache Spark commented on SPARK-35769: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32984 > Truncate java.time.Period by fields of year-month interval type > --- > > Key: SPARK-35769 > URL: https://issues.apache.org/jira/browse/SPARK-35769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > Truncate input java.time.Period instances using fields of > YearMonthIntervalType. For example, if YearMonthIntervalType has the end > field YEAR, granularity of YearMonthIntervalType values should years too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35726) Truncate java.time.Duration by fields of day-time interval type
[ https://issues.apache.org/jira/browse/SPARK-35726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366126#comment-17366126 ] Apache Spark commented on SPARK-35726: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32984 > Truncate java.time.Duration by fields of day-time interval type > --- > > Key: SPARK-35726 > URL: https://issues.apache.org/jira/browse/SPARK-35726 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > Truncate input java.time.Duration instances using fields of > DayTimeIntervalType. For example, if DayTimeIntervalType has the end field > HOUR, granularity of DayTimeIntervalType values should hours too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35700) spark.sql.orc.filterPushdown not working with Spark 3.1.1 for tables with varchar data type
[ https://issues.apache.org/jira/browse/SPARK-35700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366123#comment-17366123 ] Saurabh Chawla commented on SPARK-35700: [~hyukjin.kwon]/ [~Qin Yao] / [~cloud_fan] - I was able to reproduce this issue in the master branch for the steps given in the SPARK-35762. Below are some of the scenario found while debugging this issue. 1) This is issue is reproducible for all the orc data created using the Hive, but when the insertion is done using the SPARK, we are not getting this exception. 2) We are getting the exception due this, when the file is created using hive and while calling the readSchema the [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L63 |http://example.com/] In the reader.getSchema we get the varchar datatype. But on reading the orc file created by the Spark we are getting the string datatype. 3) I tried adding the validation for converting varchar to String {code:java} readSchema(file, conf, ignoreCorruptFiles) match { case Some(schema) => val orcSchema = CatalystSqlParser.parseDataType( schema.toString).asInstanceOf[StructType] val varCharExist = orcSchema.fields.exists( x => CharVarcharUtils.hasCharVarchar(x.dataType)) if (varCharExist) { Some(CharVarcharUtils.replaceCharVarcharWithStringInSchema(orcSchema)) } else { Some(orcSchema) }{code} After adding this fix , we are converting the varchar to string and query is working fine. 4) Similar conversion of data type change is needed on the `def readSchema(file: Path, conf: Configuration, ignoreCorruptFiles: Boolean) `, this called by inferSchema , readOrcSchemasInParallel. If this approach is fine. Than I can go head and create the PR for the same, otherwise if we want to see some other approach we can discuss on this > spark.sql.orc.filterPushdown not working with Spark 3.1.1 for tables with > varchar data type > --- > > Key: SPARK-35700 > URL: https://issues.apache.org/jira/browse/SPARK-35700 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark, Spark Core >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 on K8S >Reporter: Arghya Saha >Priority: Major > > We are not able to upgrade to Spark 3.1.1 from Spark 2.4.x as the join on > varchar column is failing which is unexpected and works on Spark 3.0.0. We > are trying to run it on Spark 3.1.1 (MR 3.2) on K8s > Below is my use case: > Tables are external hive table and files are stored as ORC. We do have > varchar column and when we are trying to perform join on varchar column we > are getting the exception. > As I understand Spark 3.1.1 have introduced varchar data type but seems its > not well tested with ORC and does not have backward compatibility. I have > even tried with below config without luck > *spark.sql.legacy.charVarcharAsString: "true"* > We are not getting the error when *spark.sql.orc.filterPushdown=false* > Below is the code: Here col1 is of type varchar(32) in hive > {code:java} > df = spark.sql("select col1, col2 from table1 a inner join table2 on b > (a.col1=b.col1 and a.col2 > b.col2 )") > df.write.format("orc").option("compression", > "zlib").mode("Append").save("") > {code} > Below is the error: > > {code:java} > Job aborted due to stage failure: Task 43 in stage 5.0 failed 4 times, most > recent failure: Lost task 43.3 in stage 5.0 (TID 524) (10.219.36.64 executor > 5): java.lang.UnsupportedOperationException: DataType: varchar(32) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.getPredicateLeafType(OrcFilters.scala:150) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.getType$1(OrcFilters.scala:222) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:266) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.convertibleFiltersHelper$1(OrcFilters.scala:132) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.$anonfun$convertibleFilters$4(OrcFilters.scala:135) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245) > at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.convertibleFilters(OrcFilters.scala:134) > at > org.apache.spark.sql.execution.datasources.orc.OrcFilters$.createFilter(OrcFilters
[jira] [Commented] (SPARK-35802) Error loading the stages/stage/ page in spark UI
[ https://issues.apache.org/jira/browse/SPARK-35802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366122#comment-17366122 ] Gengliang Wang commented on SPARK-35802: +1 with [~sarutak]. It's an API for rendering UI and requires certain parameters. FYI [~Heltman] you may consider using the rest APIs in https://spark.apache.org/docs/latest/monitoring.html#rest-api instead. > Error loading the stages/stage/ page in spark UI > > > Key: SPARK-35802 > URL: https://issues.apache.org/jira/browse/SPARK-35802 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.0.1, 3.1.1, 3.1.2 >Reporter: Helt Long >Priority: Major > > I try to load the sparkUI page for a specific stage, I get the following > error: > {quote}Unable to connect to the server. Looks like the Spark application must > have ended. Please Switch to the history UI. > {quote} > Obviously the server is still alive and process new messages. > Looking at the network tab shows one of the requests fails: > > {{curl > 'http://:8080/proxy/app-20201008130147-0001/api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable' > > > > Error 500 Request failed. > > HTTP ERROR 500 > Problem accessing > /api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable. Reason: > Request failed. href="http://eclipse.org/jetty";>Powered by Jetty:// 9.4.z-SNAPSHOT > > }} > requests to any other object that I've tested seem to work, for example > > {{curl > 'http://:8080/proxy/app-20201008130147-0001/api/v1/applications/app-20201008130147-0001/stages/11/0/taskSummary'}} > > The exception is: > {{/api/v1/applications/app-20201008130147-0001/stages/11/0/taskTable > javax.servlet.ServletException: java.lang.NullPointerException > at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) > at > org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205) > at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at > org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at > org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) > at > org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) > at > org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) > at > org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.sparkproject.jetty.server.Server.handle(Server.java:505) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370) > at > org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) > at > org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103) > at > org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) > at > org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) > at > org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) > at > org.sparkproject.jetty.uti