[jira] [Commented] (SPARK-42784) Fix the problem of incomplete creation of subdirectories in push merged localDir
[ https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700044#comment-17700044 ] Apache Spark commented on SPARK-42784: -- User 'Stove-hust' has created a pull request for this issue: https://github.com/apache/spark/pull/40412 > Fix the problem of incomplete creation of subdirectories in push merged > localDir > > > Key: SPARK-42784 > URL: https://issues.apache.org/jira/browse/SPARK-42784 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.3.2 >Reporter: Fencheng Mei >Priority: Major > > After we massively enabled push-based shuffle in our production environment, > we found warning messages appearing in the server-side logs. > The warning log looks like: > ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to > BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) > failed. > java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize > merged shuffle partition for appId application_1671244879475_44020960 > shuffleId 3 shuffleMergeId 0 reduceId 935. > After investigation, we identified the triggering mechanism of the bug. > The driver requested two different containers on the same physical machine. > During the creation of the 'push-merged' directory in the first container > (container_1), the mergeDir was created first, then the subDirs were created > based on the value of the "spark.diskStore.subDirectories" parameter. > However, the resources of container_1 were preempted during the creation of > the sub-directories, so only some of the subDirs were created. As the mergeDir still existed, the second container > (container_2) was unable to create the remaining subDirs (as it assumed that all > directories had already been created). 
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
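The race described in the issue above is a classic non-atomic check-then-create: the existence of mergeDir is used as a proxy for "all subdirectories exist", so a process that dies after creating mergeDir but before finishing the subDirs leaves a state no later process will repair. A minimal Python sketch of that failure mode (an illustration only, not Spark's actual Scala/Java code; the function name, directory layout, and the `fail_after` knob are hypothetical):

```python
import os
import tempfile

SUB_DIRS = 64  # stand-in for spark.diskStore.subDirectories (default 64)

def create_merge_dirs(local_dir, fail_after=None):
    """Create mergeDir, then its subdirs; optionally stop partway through,
    simulating container_1 being preempted mid-creation."""
    merge_dir = os.path.join(local_dir, "merge_manager")
    if os.path.exists(merge_dir):
        # Buggy assumption: mergeDir present => all subdirs present.
        return False  # skip creation entirely
    os.makedirs(merge_dir)
    for i in range(SUB_DIRS):
        if fail_after is not None and i >= fail_after:
            return False  # "preempted": subdirs left incomplete
        os.makedirs(os.path.join(merge_dir, "%02x" % i))
    return True

local_dir = tempfile.mkdtemp()
create_merge_dirs(local_dir, fail_after=10)  # container_1 dies after 10 subdirs
create_merge_dirs(local_dir)                 # container_2 sees mergeDir, skips
merge_dir = os.path.join(local_dir, "merge_manager")
print(len(os.listdir(merge_dir)))            # 10 of 64 subdirs exist
```

A fix along the lines the ticket implies would check that every expected subdirectory exists (or create each one idempotently) instead of gating on the top-level mergeDir alone.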
[jira] [Assigned] (SPARK-42784) Fix the problem of incomplete creation of subdirectories in push merged localDir
[ https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42784: Assignee: Apache Spark > Fix the problem of incomplete creation of subdirectories in push merged > localDir > > > Key: SPARK-42784 > URL: https://issues.apache.org/jira/browse/SPARK-42784 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.3.2 >Reporter: Fencheng Mei >Assignee: Apache Spark >Priority: Major > > After we massively enabled push-based shuffle in our production environment, > we found warning messages appearing in the server-side logs. > The warning log looks like: > ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to > BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) > failed. > java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize > merged shuffle partition for appId application_1671244879475_44020960 > shuffleId 3 shuffleMergeId 0 reduceId 935. > After investigation, we identified the triggering mechanism of the bug. > The driver requested two different containers on the same physical machine. > During the creation of the 'push-merged' directory in the first container > (container_1), the mergeDir was created first, then the subDirs were created > based on the value of the "spark.diskStore.subDirectories" parameter. > However, the resources of container_1 were preempted during the creation of > the sub-directories, so only some of the subDirs were created. As the mergeDir still existed, the second container > (container_2) was unable to create the remaining subDirs (as it assumed that all > directories had already been created). > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42783) Infer window group limit should run as late as possible
[ https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42783: Assignee: (was: Apache Spark) > Infer window group limit should run as late as possible > --- > > Key: SPARK-42783 > URL: https://issues.apache.org/jira/browse/SPARK-42783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42781) provide one format for writing to kafka
[ https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700041#comment-17700041 ] Apache Spark commented on SPARK-42781: -- User '1511351836' has created a pull request for this issue: https://github.com/apache/spark/pull/40411 > provide one format for writing to kafka > --- > > Key: SPARK-42781 > URL: https://issues.apache.org/jira/browse/SPARK-42781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.2 >Reporter: 董云鹏 >Priority: Minor > Fix For: 3.2.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42783) Infer window group limit should run as late as possible
[ https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700040#comment-17700040 ] Apache Spark commented on SPARK-42783: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40410 > Infer window group limit should run as late as possible > --- > > Key: SPARK-42783 > URL: https://issues.apache.org/jira/browse/SPARK-42783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42783) Infer window group limit should run as late as possible
[ https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42783: Assignee: Apache Spark > Infer window group limit should run as late as possible > --- > > Key: SPARK-42783 > URL: https://issues.apache.org/jira/browse/SPARK-42783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42782) Port TestUDFJson from Hive
[ https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42782: Assignee: Apache Spark > Port TestUDFJson from Hive > -- > > Key: SPARK-42782 > URL: https://issues.apache.org/jira/browse/SPARK-42782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42782) Port TestUDFJson from Hive
[ https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700038#comment-17700038 ] Apache Spark commented on SPARK-42782: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40409 > Port TestUDFJson from Hive > -- > > Key: SPARK-42782 > URL: https://issues.apache.org/jira/browse/SPARK-42782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42782) Port TestUDFJson from Hive
[ https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42782: Assignee: (was: Apache Spark) > Port TestUDFJson from Hive > -- > > Key: SPARK-42782 > URL: https://issues.apache.org/jira/browse/SPARK-42782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > > https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0
[ https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42780: Assignee: (was: Apache Spark) > Upgrade google Tink from 1.7.0 to 1.8.0 > --- > > Key: SPARK-42780 > URL: https://issues.apache.org/jira/browse/SPARK-42780 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284] > [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0
[ https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42780: Assignee: Apache Spark > Upgrade google Tink from 1.7.0 to 1.8.0 > --- > > Key: SPARK-42780 > URL: https://issues.apache.org/jira/browse/SPARK-42780 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284] > [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0
[ https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700016#comment-17700016 ] Apache Spark commented on SPARK-42780: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/40408 > Upgrade google Tink from 1.7.0 to 1.8.0 > --- > > Key: SPARK-42780 > URL: https://issues.apache.org/jira/browse/SPARK-42780 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284] > [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42781) provide one format for writing to kafka
[ https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42781: Assignee: (was: Apache Spark) > provide one format for writing to kafka > --- > > Key: SPARK-42781 > URL: https://issues.apache.org/jira/browse/SPARK-42781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.2 >Reporter: 董云鹏 >Priority: Minor > Fix For: 3.2.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42781) provide one format for writing to kafka
[ https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42781: Assignee: Apache Spark > provide one format for writing to kafka > --- > > Key: SPARK-42781 > URL: https://issues.apache.org/jira/browse/SPARK-42781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.2 >Reporter: 董云鹏 >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42781) provide one format for writing to kafka
[ https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700014#comment-17700014 ] Apache Spark commented on SPARK-42781: -- User '1511351836' has created a pull request for this issue: https://github.com/apache/spark/pull/40380 > provide one format for writing to kafka > --- > > Key: SPARK-42781 > URL: https://issues.apache.org/jira/browse/SPARK-42781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.2 >Reporter: 董云鹏 >Priority: Minor > Fix For: 3.2.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42778: Assignee: (was: Apache Spark) > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699974#comment-17699974 ] Apache Spark commented on SPARK-42778: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40407 > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased
[ https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42778: Assignee: Apache Spark > QueryStageExec should respect supportsRowBased > -- > > Key: SPARK-42778 > URL: https://issues.apache.org/jira/browse/SPARK-42778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699937#comment-17699937 ] Apache Spark commented on SPARK-42101: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40406 > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0 > > > The first access to a cached plan with AQE enabled is tricky. Currently, > we cannot preserve its output partitioning and ordering. > The whole query plan also misses many optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42340) Implement GroupedData.applyInPandas
[ https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699924#comment-17699924 ] Apache Spark commented on SPARK-42340: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40405 > Implement GroupedData.applyInPandas > --- > > Key: SPARK-42340 > URL: https://issues.apache.org/jira/browse/SPARK-42340 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42340) Implement GroupedData.applyInPandas
[ https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42340: Assignee: Apache Spark > Implement GroupedData.applyInPandas > --- > > Key: SPARK-42340 > URL: https://issues.apache.org/jira/browse/SPARK-42340 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42340) Implement GroupedData.applyInPandas
[ https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42340: Assignee: (was: Apache Spark) > Implement GroupedData.applyInPandas > --- > > Key: SPARK-42340 > URL: https://issues.apache.org/jira/browse/SPARK-42340 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2
[ https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699874#comment-17699874 ] Apache Spark commented on SPARK-21782: -- User 'megaserg' has created a pull request for this issue: https://github.com/apache/spark/pull/18990 > Repartition creates skews when numPartitions is a power of 2 > > > Key: SPARK-21782 > URL: https://issues.apache.org/jira/browse/SPARK-21782 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sergey Serebryakov >Assignee: Sergey Serebryakov >Priority: Major > Labels: repartition > Fix For: 2.3.0 > > Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png > > > *Problem:* > When an RDD (particularly with a low item-per-partition ratio) is > repartitioned to {{numPartitions}} = power of 2, the resulting partitions are > very uneven-sized. This affects both {{repartition()}} and > {{coalesce(shuffle=true)}}. > *Steps to reproduce:* > {code} > $ spark-shell > scala> sc.parallelize(0 until 1000, > 250).repartition(64).glom().map(_.length).collect() > res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) > {code} > *Explanation:* > Currently, the [algorithm for > repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450] > (shuffle-enabled coalesce) is as follows: > - for each initial partition {{index}}, generate {{position}} as {{(new > Random(index)).nextInt(numPartitions)}} > - then, for element number {{k}} in initial partition {{index}}, put it in > the new partition {{position + k}} (modulo {{numPartitions}}). > So, essentially elements are smeared roughly equally over {{numPartitions}} > buckets - starting from the one with number {{position+1}}. 
> Note that a new instance of {{Random}} is created for every initial partition > {{index}}, with a fixed seed {{index}}, and then discarded. So the > {{position}} is deterministic for every {{index}} for any RDD in the world. > Also, [{{nextInt(bound)}} > implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393] > has a special case when {{bound}} is a power of 2, which is basically taking > several highest bits from the initial seed, with only minimal scrambling. > Due to the deterministic seed, using the generator only once, and the lack of > scrambling, the {{position}} values for power-of-two {{numPartitions}} always > end up being almost the same regardless of the {{index}}, causing some > buckets to be much more popular than others. So, {{repartition}} will in fact > intentionally produce skewed partitions even when the partitions were > roughly equal in size before. > The behavior seems to have been introduced in SPARK-1770 by > https://github.com/apache/spark/pull/727/ > {quote} > The load balancing is not perfect: a given output partition > can have up to N more elements than the average if there are N input > partitions. However, some randomization is used to minimize the > probability that this happens. > {quote} > Another related ticket: SPARK-17817 - > https://github.com/apache/spark/pull/15445 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
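The clustering described in this ticket can be reproduced without Spark by mimicking java.util.Random's linear congruential generator and its power-of-two nextInt fast path. The sketch below is an illustration in Python (the LCG constants are JDK 8's; the helper name java_nextint_pow2 is made up here); it computes the initial position for each of 250 input partitions when repartitioning to 64:

```python
MASK = (1 << 48) - 1    # java.util.Random keeps a 48-bit state
MULT = 0x5DEECE66D      # LCG multiplier
ADD = 0xB               # LCG increment

def java_nextint_pow2(seed, bound):
    """Mimic new java.util.Random(seed).nextInt(bound) for a power-of-two
    bound: scramble the seed, take one LCG step, keep the top bits."""
    s = (seed ^ MULT) & MASK           # constructor's seed scramble
    s = (s * MULT + ADD) & MASK        # one next(31) step
    next31 = s >> (48 - 31)            # high 31 bits of the state
    return (bound * next31) >> 31      # power-of-two fast path

# One Random per initial partition, seeded with the partition index,
# used exactly once -- just like the repartition algorithm above.
positions = [java_nextint_pow2(index, 64) for index in range(250)]
print(sorted(set(positions)))  # only a couple of the 64 buckets are ever hit
```

Because only the low bits of the seed differ across partition indices and the fast path keeps only the high bits of a single LCG step, the 250 starting positions collapse into a few adjacent buckets, matching the skewed glom() counts shown in the reproduction.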
[jira] [Commented] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats
[ https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699873#comment-17699873 ] Apache Spark commented on SPARK-42777: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/40404 > Support converting TimestampNTZ catalog stats to plan stats > --- > > Key: SPARK-42777 > URL: https://issues.apache.org/jira/browse/SPARK-42777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats
[ https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42777: Assignee: Gengliang Wang (was: Apache Spark) > Support converting TimestampNTZ catalog stats to plan stats > --- > > Key: SPARK-42777 > URL: https://issues.apache.org/jira/browse/SPARK-42777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats
[ https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42777: Assignee: Apache Spark (was: Gengliang Wang)
[jira] [Assigned] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
[ https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42754: Assignee: Apache Spark > Spark 3.4 history server's SQL tab incorrectly groups SQL executions when > replaying event logs from Spark 3.3 and earlier > - > > Key: SPARK-42754 > URL: https://issues.apache.org/jira/browse/SPARK-42754 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > Attachments: example.png > > > In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL > executions when replaying event logs generated by older Spark versions. > > {*}Reproduction{*}: > {{In ./bin/spark-shell --conf spark.eventLog.enabled=true --conf > spark.eventLog.dir=eventlogs, run three non-nested SQL queries:}} > {code:java} > sql("select * from range(10)").collect() > sql("select * from range(20)").collect() > sql("select * from range(30)").collect(){code} > Exit the shell and use the Spark History Server to replay this application's > UI. > In the SQL tab I expect to see three separate queries, but Spark 3.4's > history server incorrectly groups the second and third queries as nested > queries of the first (see attached screenshot). > > {*}Root cause{*}: > [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new > *non-optional* {{rootExecutionId: Long}} field to the > SparkListenerSQLExecutionStart case class. > When JsonProtocol deserializes this event it uses the "ignore missing > properties" Jackson deserialization option, causing the > {{rootExecutionField}} to be initialized with a default value of {{{}0{}}}. > The value {{0}} is a legitimate execution ID, so in the deserialized event we > have no ability to distinguish between the absence of a value and a case > where all queries have the first query as the root. 
> *Proposed* {*}fix{*}: > I think we should change this field to be of type {{Option[Long]}} . I > believe this is a release blocker for Spark 3.4.0 because we cannot change > the type of this new field in a future release without breaking binary > compatibility. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
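The ambiguity described in the root-cause section can be sketched outside of Spark. The snippet below uses plain Python as a stand-in for Jackson's "ignore missing properties" deserialization; the field name mirrors the ticket, but the code is illustrative, not Spark's actual JsonProtocol:

```python
# Sketch: deserializing an event whose schema gained a new field.
# A non-optional field defaulting to 0 makes "field missing" indistinguishable
# from "root execution really is query 0"; an optional field (None) does not.
import json

def parse_with_default(event_json: str) -> int:
    # Mirrors a non-optional `rootExecutionId: Long` defaulting to 0.
    return json.loads(event_json).get("rootExecutionId", 0)

def parse_as_optional(event_json: str):
    # Mirrors the proposed `Option[Long]`: absence stays observable.
    return json.loads(event_json).get("rootExecutionId")  # None if missing

old_event = '{"executionId": 1}'                        # Spark 3.3 log: no field
new_event = '{"executionId": 1, "rootExecutionId": 0}'  # genuinely rooted at 0

# With the default, both cases collapse to the same value:
assert parse_with_default(old_event) == parse_with_default(new_event) == 0
# With an optional field, the history server can tell them apart:
assert parse_as_optional(old_event) is None
assert parse_as_optional(new_event) == 0
```

This is why the replayer treats every pre-3.4 query as nested under execution 0: after deserialization there is no value left that distinguishes the two cases.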
[jira] [Commented] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
[ https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699852#comment-17699852 ] Apache Spark commented on SPARK-42754: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40403
[jira] [Assigned] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier
[ https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42754: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-42020) createDataFrame with UDT
[ https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42020: Assignee: (was: Apache Spark) > createDataFrame with UDT > > > Key: SPARK-42020 > URL: https://issues.apache.org/jira/browse/SPARK-42020 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > pyspark/sql/tests/test_types.py:596 > (TypesParityTests.test_apply_schema_with_udt) > self = testMethod=test_apply_schema_with_udt> > def test_apply_schema_with_udt(self): > row = (1.0, ExamplePoint(1.0, 2.0)) > schema = StructType( > [ > StructField("label", DoubleType(), False), > StructField("point", ExamplePointUDT(), False), > ] > ) > > df = self.spark.createDataFrame([row], schema) > ../test_types.py:605: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > ../../connect/session.py:282: in createDataFrame > _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in > _data]) > pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist > ??? > pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist > ??? > pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays > ??? > pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays > ??? > pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays > ??? > pyarrow/array.pxi:320: in pyarrow.lib.array > ??? > pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array > ??? > pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status > ??? > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > > ??? 
> E pyarrow.lib.ArrowInvalid: Could not convert ExamplePoint(1.0,2.0) with > type ExamplePoint: did not recognize Python value type when inferring an > Arrow data type > pyarrow/error.pxi:100: ArrowInvalid > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
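The ArrowInvalid error above arises because `pa.Table.from_pylist` can only infer Arrow types for plain Python values, and the row still contains a raw `ExamplePoint`. A minimal sketch of the failure shape and the usual remedy (run the UDT's `serialize` before handing rows to Arrow) follows; `ExamplePoint`/`ExamplePointUDT` here are tiny stand-ins, not the real PySpark test classes, and `to_arrow_friendly` is a hypothetical helper:

```python
# Stand-in classes: a custom Python object plus the UDT that maps it to a
# Catalyst-friendly representation (here, a list of doubles).
class ExamplePoint:
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

class ExamplePointUDT:
    def serialize(self, obj: ExamplePoint):
        return [obj.x, obj.y]

def to_arrow_friendly(row, udts):
    # Replace UDT-typed values with their serialized form before conversion.
    # Without this step, Arrow sees an ExamplePoint and cannot infer a type,
    # which is exactly the ArrowInvalid failure in the traceback.
    return [udts[i].serialize(v) if i in udts else v for i, v in enumerate(row)]

row = (1.0, ExamplePoint(1.0, 2.0))
converted = to_arrow_friendly(row, {1: ExamplePointUDT()})
assert converted == [1.0, [1.0, 2.0]]  # now only Arrow-inferable values remain
```

The actual fix in Spark Connect has to perform this serialization against the declared schema; the sketch only shows why the raw row cannot be converted as-is.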
[jira] [Assigned] (SPARK-42020) createDataFrame with UDT
[ https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42020: Assignee: Apache Spark
[jira] [Commented] (SPARK-42020) createDataFrame with UDT
[ https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699825#comment-17699825 ] Apache Spark commented on SPARK-42020: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40402
[jira] [Assigned] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message
[ https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42773: Assignee: Apache Spark > Minor grammatical change to "Supports Spark Connect" message > > > Key: SPARK-42773 > URL: https://issues.apache.org/jira/browse/SPARK-42773 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Allan Folting >Assignee: Apache Spark >Priority: Major > > Changing "Support Spark Connect" to "Supports Spark Connect" in the 3.4.0 > version change message which is also used in the documentation: > > .. versionchanged:: 3.4.0 > Supports Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message
[ https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42773: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message
[ https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699781#comment-17699781 ] Apache Spark commented on SPARK-42773: -- User 'allanf-db' has created a pull request for this issue: https://github.com/apache/spark/pull/40401
[jira] [Assigned] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41359: Assignee: Apache Spark > Use `PhysicalDataType` instead of DataType in UnsafeRow > --- > > Key: SPARK-41359 > URL: https://issues.apache.org/jira/browse/SPARK-41359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699720#comment-17699720 ] Apache Spark commented on SPARK-41359: -- User 'ClownXC' has created a pull request for this issue: https://github.com/apache/spark/pull/40400
[jira] [Assigned] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41359: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699610#comment-17699610 ] Apache Spark commented on SPARK-42101: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40399 > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0 > > > The first access to a cached plan that has AQE enabled is tricky: currently we > cannot preserve its output partitioning and ordering. > The whole query plan also misses many optimizations in the AQE framework. > Wrapping InMemoryTableScanExec in a query stage resolves all of these issues.
[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699611#comment-17699611 ] Apache Spark commented on SPARK-42101: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40399
[jira] [Commented] (SPARK-42052) Codegen Support for HiveSimpleUDF
[ https://issues.apache.org/jira/browse/SPARK-42052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699595#comment-17699595 ] Apache Spark commented on SPARK-42052: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40397 > Codegen Support for HiveSimpleUDF > - > > Key: SPARK-42052 > URL: https://issues.apache.org/jira/browse/SPARK-42052 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42772) Change the default value of JDBC options about push down to true
[ https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42772: Assignee: (was: Apache Spark) > Change the default value of JDBC options about push down to true > > > Key: SPARK-42772 > URL: https://issues.apache.org/jira/browse/SPARK-42772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
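For context, the "JDBC options about push down" the ticket refers to are the per-source options of the JDBC data source. The option names below come from the Spark SQL JDBC documentation; the read pattern and the `jdbc:postgresql` URL are placeholders, not taken from the ticket. Setting the options explicitly makes behavior independent of whichever defaults a given Spark version ships with:

```python
# Sketch: enabling JDBC push-down explicitly rather than relying on the
# version-dependent defaults this ticket proposes to change to true.
pushdown_options = {
    "pushDownPredicate": "true",    # push filters into the JDBC source
    "pushDownAggregate": "true",    # push aggregates (DSv2)
    "pushDownLimit": "true",        # push LIMIT
    "pushDownTableSample": "true",  # push TABLESAMPLE
}

# With a live SparkSession this would be applied as (not executed here):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:postgresql://host/db")  # placeholder URL
#       .option("dbtable", "t")
#       .options(**pushdown_options)
#       .load())
```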
[jira] [Commented] (SPARK-42772) Change the default value of JDBC options about push down to true
[ https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699571#comment-17699571 ] Apache Spark commented on SPARK-42772: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40396
[jira] [Assigned] (SPARK-42772) Change the default value of JDBC options about push down to true
[ https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42772: Assignee: Apache Spark
[jira] [Assigned] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17
[ https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42770: Assignee: (was: Apache Spark) > SQLImplicitsTestSuite test failed with Java 17 > -- > > Key: SPARK-42770 > URL: https://issues.apache.org/jira/browse/SPARK-42770 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682] > {code:java} > [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 > milliseconds) > 4429[info] 2023-03-02T23:00:20.404434 did not equal > 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63) > 4430[info] org.scalatest.exceptions.TestFailedException: > 4431[info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > 4432[info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > 4433[info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > 4434[info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > 4435[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63) > 4436[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133) > 4437[info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 4438[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > 4439[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 4440[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > 4441[info] at org.scalatest.Transformer.apply(Transformer.scala:22) > 4442[info] at org.scalatest.Transformer.apply(Transformer.scala:20) > 4443[info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at 
org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > 4445[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > 4446[info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > 4447[info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > 4448[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > 4449[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > 4450[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 4451[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > 4452[info] at > org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > 4453[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > 4454[info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > 4455[info] at scala.collection.immutable.List.foreach(List.scala:431) > 4456[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > 4457[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > 4458[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 4459[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > 4460[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > 4461[info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > 4462[info] at org.scalatest.Suite.run(Suite.scala:1114) > 4463[info] at org.scalatest.Suite.run$(Suite.scala:1096) > 4464[info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > 4465[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > 4466[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > 4467[info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > 4468[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > 4469[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34) > 4470[info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > 4471[info] at > org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > 4472[info] at > org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > 4473[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34) > 4474[info] at >
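The failed assertion compares `2023-03-02T23:00:20.404434875` (the captured value, with nanosecond precision as Java 17's clock can report) against `2023-03-02T23:00:20.404434` (the value after a round trip through Catalyst's microsecond-resolution TimestampType). The arithmetic below is a sketch of that reading; the truncate-before-compare workaround shown is a generic test-side fix, not necessarily the one the suite adopted:

```python
# Fractional-second parts of the two timestamps in the failed assertion.
captured_nanos = 404_434_875      # Java 17 clock: nanosecond precision
round_tripped_micros = 404_434    # after Catalyst's microsecond TimestampType

# The round trip silently drops the sub-microsecond digits (875 ns here),
# so a direct equality check between the two values fails:
assert round_tripped_micros * 1_000 != captured_nanos

# Truncating the expected value to microseconds before comparing makes the
# check robust on JDKs whose system clock reports nanoseconds:
assert captured_nanos // 1_000 == round_tripped_micros
```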
[jira] [Assigned] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17
[ https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42770: Assignee: Apache Spark > SQLImplicitsTestSuite test failed with Java 17 > -- > > Key: SPARK-42770 > URL: https://issues.apache.org/jira/browse/SPARK-42770 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682] > {code:java} > [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 > milliseconds) > 4429[info] 2023-03-02T23:00:20.404434 did not equal > 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63) > 4430[info] org.scalatest.exceptions.TestFailedException: > 4431[info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > 4432[info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > 4433[info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > 4434[info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > 4435[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63) > 4436[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133) > 4437[info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 4438[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > 4439[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 4440[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > 4441[info] at org.scalatest.Transformer.apply(Transformer.scala:22) > 4442[info] at org.scalatest.Transformer.apply(Transformer.scala:20) > 4443[info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at 
org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > 4445[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > 4446[info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > 4447[info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > 4448[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > 4449[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > 4450[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 4451[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > 4452[info] at > org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > 4453[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > 4454[info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > 4455[info] at scala.collection.immutable.List.foreach(List.scala:431) > 4456[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > 4457[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > 4458[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 4459[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > 4460[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > 4461[info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > 4462[info] at org.scalatest.Suite.run(Suite.scala:1114) > 4463[info] at org.scalatest.Suite.run$(Suite.scala:1096) > 4464[info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > 4465[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > 4466[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > 4467[info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > 4468[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > 4469[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34) > 4470[info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > 4471[info] at > org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > 4472[info] at > org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > 4473[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34) > 4474[info] at >
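The failure above is a clock-precision mismatch: under Java 17, `Instant.now()` can carry nanosecond resolution, while the value that round-tripped through the encoder was truncated to microseconds (`.404434` vs `.404434875`). A minimal, hypothetical sketch of that truncation (not the suite's actual code, just the mechanism):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class InstantTruncationDemo {
    public static void main(String[] args) {
        // Nanosecond-precision instant, mirroring the value in the failed assertion
        Instant fine = Instant.parse("2023-03-02T23:00:20.404434875Z");
        // Truncating to microseconds yields the other value the test compared against
        Instant coarse = fine.truncatedTo(ChronoUnit.MICROS);
        System.out.println(fine.equals(coarse)); // false: the trailing 875 ns are lost
        System.out.println(coarse);              // 2023-03-02T23:00:20.404434Z
    }
}
```

Truncating both sides of such an assertion to the same unit makes it independent of the JDK's clock resolution; whether the fix belongs in the test or in the encoder is for the linked PR to decide.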
[jira] [Commented] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17
[ https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699542#comment-17699542 ] Apache Spark commented on SPARK-42770: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40395 > SQLImplicitsTestSuite test failed with Java 17 > -- > > Key: SPARK-42770 > URL: https://issues.apache.org/jira/browse/SPARK-42770 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > > [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682] > {code:java} > [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 > milliseconds) > 4429[info] 2023-03-02T23:00:20.404434 did not equal > 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63) > 4430[info] org.scalatest.exceptions.TestFailedException: > 4431[info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > 4432[info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > 4433[info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > 4434[info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > 4435[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63) > 4436[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133) > 4437[info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 4438[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > 4439[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 4440[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > 4441[info] at org.scalatest.Transformer.apply(Transformer.scala:22) > 4442[info] at org.scalatest.Transformer.apply(Transformer.scala:20) > 4443[info] at > 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > 4445[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > 4446[info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > 4447[info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > 4448[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > 4449[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > 4450[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 4451[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > 4452[info] at > org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > 4453[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > 4454[info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > 4455[info] at scala.collection.immutable.List.foreach(List.scala:431) > 4456[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > 4457[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > 4458[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 4459[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > 4460[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > 4461[info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > 4462[info] at org.scalatest.Suite.run(Suite.scala:1114) > 4463[info] at org.scalatest.Suite.run$(Suite.scala:1096) > 4464[info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > 4465[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > 4466[info] at 
org.scalatest.SuperEngine.runImpl(Engine.scala:535) > 4467[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > 4468[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > 4469[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34) > 4470[info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > 4471[info] at > org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > 4472[info] at > org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > 4473[info] at > org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34) >
[jira] [Commented] (SPARK-42771) Refactor HiveGenericUDF
[ https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699527#comment-17699527 ] Apache Spark commented on SPARK-42771: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40394 > Refactor HiveGenericUDF > --- > > Key: SPARK-42771 > URL: https://issues.apache.org/jira/browse/SPARK-42771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42771) Refactor HiveGenericUDF
[ https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42771: Assignee: (was: Apache Spark) > Refactor HiveGenericUDF > --- > > Key: SPARK-42771 > URL: https://issues.apache.org/jira/browse/SPARK-42771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Assigned] (SPARK-42771) Refactor HiveGenericUDF
[ https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42771: Assignee: Apache Spark > Refactor HiveGenericUDF > --- > > Key: SPARK-42771 > URL: https://issues.apache.org/jira/browse/SPARK-42771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699516#comment-17699516 ] Apache Spark commented on SPARK-40082: -- User 'Stove-hust' has created a pull request for this issue: https://github.com/apache/spark/pull/40393 > DAGScheduler may not schedule new stages when push-based shuffle is enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Priority: Major > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > With push-based shuffle enabled and speculative tasks present, a shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the shuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but these speculative-task success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages. > Then the stage is resubmitted, but the speculative tasks have already registered their map output and there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false. > Then AQE submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages this stage is marked as missing and resubmitted, but the next stages are only added to waitingStages after this stage finishes, so they are never submitted even though the resubmission has completed. 
> I have only seen this a few times in my production environment and it is difficult to reproduce.
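The ordering hazard described above can be reduced to a small sketch (hypothetical names, not Spark's actual DAGScheduler code): a finalization trigger guarded by membership in runningStages silently drops any success event that arrives after the stage was removed for resubmission.

```java
import java.util.HashSet;
import java.util.Set;

public class MergeFinalizeRaceSketch {
    static Set<Integer> runningStages = new HashSet<>();
    static Set<Integer> mergeFinalized = new HashSet<>();

    // Mirrors the guard in the report: only stages still running may finalize the merge
    static void onSpeculativeTaskSuccess(int stageId) {
        if (runningStages.contains(stageId)) {
            mergeFinalized.add(stageId);
        }
    }

    public static void main(String[] args) {
        int stage = 3;
        runningStages.add(stage);
        runningStages.remove(stage);     // removed after fetchFailed, pending resubmission
        onSpeculativeTaskSuccess(stage); // late speculative-task success is silently dropped
        System.out.println(mergeFinalized.contains(stage)); // false: merge never finalized
    }
}
```

The resubmission then finds no missing tasks, so nothing re-arms the trigger, which matches the report's observation that the stage's merge state stays unfinalized forever.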
[jira] [Assigned] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40082: Assignee: (was: Apache Spark) > DAGScheduler may not schedule new stages when push-based shuffle is enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Priority: Major > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > With push-based shuffle enabled and speculative tasks present, a shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the shuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but these speculative-task success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages. > Then the stage is resubmitted, but the speculative tasks have already registered their map output and there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false. > Then AQE submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages this stage is marked as missing and resubmitted, but the next stages are only added to waitingStages after this stage finishes, so they are never submitted even though the resubmission has completed. > I have only seen this a few times in my production environment and it is difficult to reproduce.
[jira] [Assigned] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40082: Assignee: Apache Spark > DAGScheduler may not schedule new stages when push-based shuffle is enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Assignee: Apache Spark >Priority: Major > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > With push-based shuffle enabled and speculative tasks present, a shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the shuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but these speculative-task success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages. > Then the stage is resubmitted, but the speculative tasks have already registered their map output and there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false. > Then AQE submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages this stage is marked as missing and resubmitted, but the next stages are only added to waitingStages after this stage finishes, so they are never submitted even though the resubmission has completed. > I have only seen this a few times in my production environment and it is difficult to reproduce.
[jira] [Commented] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled
[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699515#comment-17699515 ] Apache Spark commented on SPARK-40082: -- User 'Stove-hust' has created a pull request for this issue: https://github.com/apache/spark/pull/40393 > DAGScheduler may not schedule new stages when push-based shuffle is enabled > -- > > Key: SPARK-40082 > URL: https://issues.apache.org/jira/browse/SPARK-40082 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.1.1 >Reporter: Penglei Shi >Priority: Major > Attachments: missParentStages.png, shuffleMergeFinalized.png, > submitMissingTasks.png > > > With push-based shuffle enabled and speculative tasks present, a shuffleMapStage is resubmitted once a fetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the shuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but these speculative-task success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages. > Then the stage is resubmitted, but the speculative tasks have already registered their map output and there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false. > Then AQE submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages this stage is marked as missing and resubmitted, but the next stages are only added to waitingStages after this stage finishes, so they are never submitted even though the resubmission has completed. 
> I have only seen this a few times in my production environment and it is difficult to reproduce.
[jira] [Commented] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods
[ https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699470#comment-17699470 ] Apache Spark commented on SPARK-42769: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40392 > Add ENV_DRIVER_POD_IP env variable to executor pods > --- > > Key: SPARK-42769 > URL: https://issues.apache.org/jira/browse/SPARK-42769 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods
[ https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42769: Assignee: Apache Spark > Add ENV_DRIVER_POD_IP env variable to executor pods > --- > > Key: SPARK-42769 > URL: https://issues.apache.org/jira/browse/SPARK-42769 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods
[ https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42769: Assignee: (was: Apache Spark) > Add ENV_DRIVER_POD_IP env variable to executor pods > --- > > Key: SPARK-42769 > URL: https://issues.apache.org/jira/browse/SPARK-42769 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers
[ https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699467#comment-17699467 ] Apache Spark commented on SPARK-42766: -- User 'wangshengjie123' has created a pull request for this issue: https://github.com/apache/spark/pull/40391 > YarnAllocator should filter excluded nodes when launching allocated containers > -- > > Key: SPARK-42766 > URL: https://issues.apache.org/jira/browse/SPARK-42766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.2 >Reporter: wangshengjie >Priority: Major > > In our production environment, we hit an issue like this: > If we request 10 containers from nodeA and nodeB, the first response from Yarn returns 5 containers from nodeA and nodeB; nodeA is then blacklisted, but a second response from Yarn may still return containers on nodeA, which are launched anyway. When those containers (executors) start up and send register requests to the driver, they are rejected, and each such failure counts toward > {code:java} > spark.yarn.max.executor.failures {code} > and can cause the application to fail with: > {code:java} > Max number of executor failures ($maxNumExecutorFailures) reached{code} >
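The fix direction suggested by the title can be sketched as a filter applied to newly allocated containers before launch (hypothetical types and names; YarnAllocator's real API differs):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExcludedNodeFilterSketch {
    // Hypothetical stand-in for a YARN container allocation
    record Container(String id, String host) {}

    // Drop allocations on nodes that were excluded after the request was made
    static List<Container> filterExcluded(List<Container> allocated, Set<String> excluded) {
        return allocated.stream()
                .filter(c -> !excluded.contains(c.host()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Container> allocated = List.of(
                new Container("c1", "nodeA"),  // nodeA was blacklisted after the request
                new Container("c2", "nodeB"));
        // Only the nodeB container should be launched
        System.out.println(filterExcluded(allocated, Set.of("nodeA")));
    }
}
```

Launching only the filtered containers means executors on excluded nodes never attempt to register, so their rejections cannot count toward spark.yarn.max.executor.failures.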
[jira] [Assigned] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers
[ https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42766: Assignee: (was: Apache Spark) > YarnAllocator should filter excluded nodes when launching allocated containers > -- > > Key: SPARK-42766 > URL: https://issues.apache.org/jira/browse/SPARK-42766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.2 >Reporter: wangshengjie >Priority: Major > > In our production environment, we hit an issue like this: > If we request 10 containers from nodeA and nodeB, the first response from Yarn returns 5 containers from nodeA and nodeB; nodeA is then blacklisted, but a second response from Yarn may still return containers on nodeA, which are launched anyway. When those containers (executors) start up and send register requests to the driver, they are rejected, and each such failure counts toward > {code:java} > spark.yarn.max.executor.failures {code} > and can cause the application to fail with: > {code:java} > Max number of executor failures ($maxNumExecutorFailures) reached{code} >
[jira] [Assigned] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers
[ https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42766: Assignee: Apache Spark > YarnAllocator should filter excluded nodes when launching allocated containers > -- > > Key: SPARK-42766 > URL: https://issues.apache.org/jira/browse/SPARK-42766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.2 >Reporter: wangshengjie >Assignee: Apache Spark >Priority: Major > > In our production environment, we hit an issue like this: > If we request 10 containers from nodeA and nodeB, the first response from Yarn returns 5 containers from nodeA and nodeB; nodeA is then blacklisted, but a second response from Yarn may still return containers on nodeA, which are launched anyway. When those containers (executors) start up and send register requests to the driver, they are rejected, and each such failure counts toward > {code:java} > spark.yarn.max.executor.failures {code} > and can cause the application to fail with: > {code:java} > Max number of executor failures ($maxNumExecutorFailures) reached{code} >
[jira] [Commented] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699445#comment-17699445 ] Apache Spark commented on SPARK-42768: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40390 > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major >
[jira] [Assigned] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42768: Assignee: (was: Apache Spark) > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major >
[jira] [Assigned] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42768: Assignee: Apache Spark > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42767) Add check condition to start connect server fallback with `in-memory` and auto-ignore some tests that strongly depend on Hive
[ https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42767: Assignee: Apache Spark > Add check condition to start connect server fallback with `in-memory` and > auto-ignore some tests that strongly depend on Hive > - > > Key: SPARK-42767 > URL: https://issues.apache.org/jira/browse/SPARK-42767 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42767) Add check condition to start connect server fallback with `in-memory` and auto-ignore some tests that strongly depend on Hive
[ https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42767: Assignee: (was: Apache Spark) > Add check condition to start connect server fallback with `in-memory` and > auto-ignore some tests that strongly depend on Hive > - > > Key: SPARK-42767 > URL: https://issues.apache.org/jira/browse/SPARK-42767 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Commented] (SPARK-42767) Add check condition to start connect server fallback with `in-memory` and auto-ignore some tests that strongly depend on Hive
[ https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699444#comment-17699444 ] Apache Spark commented on SPARK-42767: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40389 > Add check condition to start connect server fallback with `in-memory` and > auto-ignore some tests that strongly depend on Hive > - > > Key: SPARK-42767 > URL: https://issues.apache.org/jira/browse/SPARK-42767 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Commented] (SPARK-42767) Add check condition to start connect server fallback with `in-memory` and auto-ignore some tests that strongly depend on Hive
[ https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699443#comment-17699443 ] Apache Spark commented on SPARK-42767: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40389 > Add check condition to start connect server fallback with `in-memory` and > auto-ignore some tests that strongly depend on Hive > - > > Key: SPARK-42767 > URL: https://issues.apache.org/jira/browse/SPARK-42767 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major >
[jira] [Commented] (SPARK-42765) Regulate the import path of `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699425#comment-17699425 ] Apache Spark commented on SPARK-42765: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40388 > Regulate the import path of `pandas_udf` > > > Key: SPARK-42765 > URL: https://issues.apache.org/jira/browse/SPARK-42765 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Remove the outdated import path of `pandas_udf`
[jira] [Assigned] (SPARK-42765) Regulate the import path of `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42765: Assignee: Apache Spark > Regulate the import path of `pandas_udf` > > > Key: SPARK-42765 > URL: https://issues.apache.org/jira/browse/SPARK-42765 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Remove the outdated import path of `pandas_udf`
[jira] [Commented] (SPARK-42765) Regulate the import path of `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699426#comment-17699426 ] Apache Spark commented on SPARK-42765: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/40388 > Regulate the import path of `pandas_udf` > > > Key: SPARK-42765 > URL: https://issues.apache.org/jira/browse/SPARK-42765 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Remove the outdated import path of `pandas_udf` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42765) Regulate the import path of `pandas_udf`
[ https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42765: Assignee: (was: Apache Spark) > Regulate the import path of `pandas_udf` > > > Key: SPARK-42765 > URL: https://issues.apache.org/jira/browse/SPARK-42765 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Remove the outdated import path of `pandas_udf` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend
[ https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699417#comment-17699417 ] Apache Spark commented on SPARK-42764: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40387 > Parameterize the max number of attempts for driver props fetcher in > KubernetesExecutorBackend > - > > Key: SPARK-42764 > URL: https://issues.apache.org/jira/browse/SPARK-42764 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend
[ https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42764: Assignee: (was: Apache Spark) > Parameterize the max number of attempts for driver props fetcher in > KubernetesExecutorBackend > - > > Key: SPARK-42764 > URL: https://issues.apache.org/jira/browse/SPARK-42764 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend
[ https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699416#comment-17699416 ] Apache Spark commented on SPARK-42764: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40387 > Parameterize the max number of attempts for driver props fetcher in > KubernetesExecutorBackend > - > > Key: SPARK-42764 > URL: https://issues.apache.org/jira/browse/SPARK-42764 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend
[ https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42764: Assignee: Apache Spark > Parameterize the max number of attempts for driver props fetcher in > KubernetesExecutorBackend > - > > Key: SPARK-42764 > URL: https://issues.apache.org/jira/browse/SPARK-42764 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699388#comment-17699388 ] Apache Spark commented on SPARK-42753: -- User 'StevenChenDatabricks' has created a pull request for this issue: https://github.com/apache/spark/pull/40385 > ReusedExchange refers to non-existent node > -- > > Key: SPARK-42753 > URL: https://issues.apache.org/jira/browse/SPARK-42753 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Steven Chen >Priority: Major > > There is an AQE “issue“ where during AQE planning, the Exchange "that's > being" reused could be replaced in the plan tree. So, when we print the query > plan, the ReusedExchange will refer to an “unknown“ Exchange. An example > below: > > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown] > Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code} > > > Below is an example to demonstrate the root cause: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A > |-- SomeNode Y > |-- Exchange B > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C > |-- SomeNode N > |-- Exchange D > {code} > > > Step 1: Exchange B is materialized and the QueryStage is added to stage cache > Step 2: Exchange D reuses Exchange B > Step 3: Exchange C is materialized and the QueryStage is added to stage cache > Step 4: Exchange A reuses Exchange C > > Then the final plan looks like: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A -> ReusedExchange (reuses Exchange C) > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C -> PhotonShuffleMapStage > |-- SomeNode N > |-- Exchange D -> ReusedExchange (reuses Exchange B) > {code} > > > As a result, the ReusedExchange (reuses 
Exchange B) will refer to a non-existent > node. This *DOES NOT* affect query execution but will cause the query > visualization to malfunction in the following ways: > # The ReusedExchange child subtree will still appear in the Spark UI graph > but will contain no node IDs. > # The ReusedExchange node details in the Explain plan will refer to an > UNKNOWN node. Example below. > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown]{code} > # The child exchange and its subtree may be missing from the Explain text > completely. No node details or tree string shown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42753: Assignee: (was: Apache Spark) > ReusedExchange refers to non-existent node > -- > > Key: SPARK-42753 > URL: https://issues.apache.org/jira/browse/SPARK-42753 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Steven Chen >Priority: Major > > There is an AQE “issue“ where during AQE planning, the Exchange "that's > being" reused could be replaced in the plan tree. So, when we print the query > plan, the ReusedExchange will refer to an “unknown“ Exchange. An example > below: > > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown] > Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code} > > > Below is an example to demonstrate the root cause: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A > |-- SomeNode Y > |-- Exchange B > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C > |-- SomeNode N > |-- Exchange D > {code} > > > Step 1: Exchange B is materialized and the QueryStage is added to stage cache > Step 2: Exchange D reuses Exchange B > Step 3: Exchange C is materialized and the QueryStage is added to stage cache > Step 4: Exchange A reuses Exchange C > > Then the final plan looks like: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A -> ReusedExchange (reuses Exchange C) > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C -> PhotonShuffleMapStage > |-- SomeNode N > |-- Exchange D -> ReusedExchange (reuses Exchange B) > {code} > > > As a result, the ReusedExchange (reuses Exchange B) will refer to a non-exist > node. 
This *DOES NOT* affect query execution but will cause the query > visualization to malfunction in the following ways: > # The ReusedExchange child subtree will still appear in the Spark UI graph > but will contain no node IDs. > # The ReusedExchange node details in the Explain plan will refer to an > UNKNOWN node. Example below. > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown]{code} > # The child exchange and its subtree may be missing from the Explain text > completely. No node details or tree string shown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42753: Assignee: Apache Spark > ReusedExchange refers to non-existent node > -- > > Key: SPARK-42753 > URL: https://issues.apache.org/jira/browse/SPARK-42753 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Steven Chen >Assignee: Apache Spark >Priority: Major > > There is an AQE “issue“ where during AQE planning, the Exchange "that's > being" reused could be replaced in the plan tree. So, when we print the query > plan, the ReusedExchange will refer to an “unknown“ Exchange. An example > below: > > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown] > Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code} > > > Below is an example to demonstrate the root cause: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A > |-- SomeNode Y > |-- Exchange B > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C > |-- SomeNode N > |-- Exchange D > {code} > > > Step 1: Exchange B is materialized and the QueryStage is added to stage cache > Step 2: Exchange D reuses Exchange B > Step 3: Exchange C is materialized and the QueryStage is added to stage cache > Step 4: Exchange A reuses Exchange C > > Then the final plan looks like: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A -> ReusedExchange (reuses Exchange C) > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C -> PhotonShuffleMapStage > |-- SomeNode N > |-- Exchange D -> ReusedExchange (reuses Exchange B) > {code} > > > As a result, the ReusedExchange (reuses Exchange B) will refer to a non-exist > node. 
This *DOES NOT* affect query execution but will cause the query > visualization to malfunction in the following ways: > # The ReusedExchange child subtree will still appear in the Spark UI graph > but will contain no node IDs. > # The ReusedExchange node details in the Explain plan will refer to an > UNKNOWN node. Example below. > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown]{code} > # The child exchange and its subtree may be missing from the Explain text > completely. No node details or tree string shown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node
[ https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699389#comment-17699389 ] Apache Spark commented on SPARK-42753: -- User 'StevenChenDatabricks' has created a pull request for this issue: https://github.com/apache/spark/pull/40385 > ReusedExchange refers to non-existent node > -- > > Key: SPARK-42753 > URL: https://issues.apache.org/jira/browse/SPARK-42753 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Steven Chen >Priority: Major > > There is an AQE “issue“ where during AQE planning, the Exchange "that's > being" reused could be replaced in the plan tree. So, when we print the query > plan, the ReusedExchange will refer to an “unknown“ Exchange. An example > below: > > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown] > Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code} > > > Below is an example to demonstrate the root cause: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A > |-- SomeNode Y > |-- Exchange B > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C > |-- SomeNode N > |-- Exchange D > {code} > > > Step 1: Exchange B is materialized and the QueryStage is added to stage cache > Step 2: Exchange D reuses Exchange B > Step 3: Exchange C is materialized and the QueryStage is added to stage cache > Step 4: Exchange A reuses Exchange C > > Then the final plan looks like: > > {code:java} > AdaptiveSparkPlan > |-- SomeNode X (subquery xxx) > |-- Exchange A -> ReusedExchange (reuses Exchange C) > Subquery:Hosting operator = SomeNode Hosting Expression = xxx > dynamicpruning#388 > AdaptiveSparkPlan > |-- SomeNode M > |-- Exchange C -> PhotonShuffleMapStage > |-- SomeNode N > |-- Exchange D -> ReusedExchange (reuses Exchange B) > {code} > > > As a result, the ReusedExchange (reuses 
Exchange B) will refer to a non-existent > node. This *DOES NOT* affect query execution but will cause the query > visualization to malfunction in the following ways: > # The ReusedExchange child subtree will still appear in the Spark UI graph > but will contain no node IDs. > # The ReusedExchange node details in the Explain plan will refer to an > UNKNOWN node. Example below. > {code:java} > (2775) ReusedExchange [Reuses operator id: unknown]{code} > # The child exchange and its subtree may be missing from the Explain text > completely. No node details or tree string shown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
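The four-step reuse sequence described in SPARK-42753 can be sketched with a small, self-contained Python simulation. This is illustrative only: the class names, the stage cache, and the ID-assignment walk are invented stand-ins, not Spark's actual AQE classes or API. It shows how, once Exchange A is replaced by a ReusedExchange, Exchange B drops out of every final plan, so the ReusedExchange that D became points at a node the explain output can no longer number:

```python
# Illustrative model of the AQE stage-cache reuse ordering described above.
# All names here are invented for the sketch; this is not Spark's real API.

class Exchange:
    def __init__(self, name, key):
        self.name = name  # e.g. "A"
        self.key = key    # canonical (structural) form used for reuse lookup

class ReusedExchange:
    def __init__(self, target):
        self.target = target  # the Exchange node this one reuses

stage_cache = {}

def materialize(exchange):
    """An exchange finishes executing and its stage is cached by canonical key."""
    stage_cache[exchange.key] = exchange

def try_reuse(exchange):
    """A later exchange with the same canonical key is replaced by a reuse node."""
    cached = stage_cache.get(exchange.key)
    return ReusedExchange(cached) if cached is not None else exchange

# Main plan holds A and B; the subquery plan holds C and D.
# A and C are structurally equal (key "k1"); so are B and D (key "k2").
a, b = Exchange("A", key="k1"), Exchange("B", key="k2")
c, d = Exchange("C", key="k1"), Exchange("D", key="k2")

materialize(b)    # Step 1: Exchange B is materialized and cached
d = try_reuse(d)  # Step 2: Exchange D reuses Exchange B
materialize(c)    # Step 3: Exchange C is materialized and cached
a = try_reuse(a)  # Step 4: Exchange A reuses Exchange C,
                  #         so B's whole subtree drops out of the plan

# Explain walks only the *final* plans and numbers the nodes still present.
final_nodes = [a, c, d]  # B is no longer reachable from any plan
ids = {node: i for i, node in enumerate(final_nodes, start=1)}

# D still points at B, which has no operator id -> "unknown"
print(ids.get(d.target, "unknown"))  # -> prints "unknown"
```

The simulation mirrors why execution is unaffected (the cached stage for B still exists and can be read) while only the printed plan and the UI graph lose track of the reuse target.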
[jira] [Commented] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699298#comment-17699298 ] Apache Spark commented on SPARK-42763: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40384 > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42763 > URL: https://issues.apache.org/jira/browse/SPARK-42763 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42762: Assignee: (was: Apache Spark) > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42762: Assignee: Apache Spark > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699280#comment-17699280 ] Apache Spark commented on SPARK-42762: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/40383 > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request
[ https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699279#comment-17699279 ] Apache Spark commented on SPARK-42762: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/40383 > Improve Logging for disconnects during exec id request > -- > > Key: SPARK-42762 > URL: https://issues.apache.org/jira/browse/SPARK-42762 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.5.0 >Reporter: Holden Karau >Priority: Minor > > Improve Logging for disconnects during exec id request to simplify our > network logs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42679) createDataFrame doesn't work with non-nullable schema.
[ https://issues.apache.org/jira/browse/SPARK-42679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699271#comment-17699271 ] Apache Spark commented on SPARK-42679: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40382 > createDataFrame doesn't work with non-nullable schema. > -- > > Key: SPARK-42679 > URL: https://issues.apache.org/jira/browse/SPARK-42679 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > spark.createDataFrame won't work with non-nullable schema as below: > {code:java} > from pyspark.sql.types import * > schema_false = StructType([StructField("id", IntegerType(), False)]) > spark.createDataFrame([[1]], schema=schema_false) > Traceback (most recent call last): > ... > pyspark.errors.exceptions.connect.AnalysisException: > [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's > required to be non-nullable.{code} > whereas it works fine with nullable schema: > {code:java} > schema_true = StructType([StructField("id", IntegerType(), True)]) > spark.createDataFrame([[1]], schema=schema_true) > DataFrame[id: int]{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42761: Assignee: (was: Apache Spark) > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42761: Assignee: Apache Spark > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0
[ https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699269#comment-17699269 ] Apache Spark commented on SPARK-42761: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/40381 > kubernetes-client from 6.4.1 to 6.5.0 > - > > Key: SPARK-42761 > URL: https://issues.apache.org/jira/browse/SPARK-42761 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0 > [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699264#comment-17699264 ] Apache Spark commented on SPARK-42760: -- User '1511351836' has created a pull request for this issue: https://github.com/apache/spark/pull/40380 > The partition of result data frame of join is always 1 > -- > > Key: SPARK-42760 > URL: https://issues.apache.org/jira/browse/SPARK-42760 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.3.2 > Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, > local mode >Reporter: binyang >Priority: Major > > I am using pyspark. The partition of result data frame of join is always 1. > Here is my code from > https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join > > print(spark.version) > def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4): > spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions) > spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > df1 = spark.range(1, 1000).repartition(data_partitions) > df2 = spark.range(1, 2000).repartition(data_partitions) > df3 = spark.range(1, 3000).repartition(data_partitions) > print("Data partitions is: {}. Shuffle partitions is > {}".format(data_partitions, shuffle_partitions)) > print("Data partitions before join: > {}".format(df1.rdd.getNumPartitions())) > df = (df1.join(df2, df1.id == df2.id) > .join(df3, df1.id == df3.id)) > print("Data partitions after join : {}".format(df.rdd.getNumPartitions())) > example_shuffle_partitions() > > In Spark 3.0.3, it prints out: > 3.0.3 > Data partitions is: 10. Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 4 > However, it prints out the following in the latest 3.3.2 > 3.3.2 > Data partitions is: 10. 
Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42760: Assignee: Apache Spark > The partition of result data frame of join is always 1 > -- > > Key: SPARK-42760 > URL: https://issues.apache.org/jira/browse/SPARK-42760 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.3.2 > Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, > local mode >Reporter: binyang >Assignee: Apache Spark >Priority: Major > > I am using pyspark. The partition of result data frame of join is always 1. > Here is my code from > https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join > > print(spark.version) > def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4): > spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions) > spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > df1 = spark.range(1, 1000).repartition(data_partitions) > df2 = spark.range(1, 2000).repartition(data_partitions) > df3 = spark.range(1, 3000).repartition(data_partitions) > print("Data partitions is: {}. Shuffle partitions is > {}".format(data_partitions, shuffle_partitions)) > print("Data partitions before join: > {}".format(df1.rdd.getNumPartitions())) > df = (df1.join(df2, df1.id == df2.id) > .join(df3, df1.id == df3.id)) > print("Data partitions after join : {}".format(df.rdd.getNumPartitions())) > example_shuffle_partitions() > > In Spark 3.0.3, it prints out: > 3.0.3 > Data partitions is: 10. Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 4 > However, it prints out the following in the latest 3.3.2 > 3.3.2 > Data partitions is: 10. 
Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1
[ https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42760: Assignee: (was: Apache Spark) > The partition of result data frame of join is always 1 > -- > > Key: SPARK-42760 > URL: https://issues.apache.org/jira/browse/SPARK-42760 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.3.2 > Environment: standard spark 3.0.3/3.3.2, using in jupyter notebook, > local mode >Reporter: binyang >Priority: Major > > I am using pyspark. The partition of result data frame of join is always 1. > Here is my code from > https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join > > print(spark.version) > def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4): > spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions) > spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > df1 = spark.range(1, 1000).repartition(data_partitions) > df2 = spark.range(1, 2000).repartition(data_partitions) > df3 = spark.range(1, 3000).repartition(data_partitions) > print("Data partitions is: {}. Shuffle partitions is > {}".format(data_partitions, shuffle_partitions)) > print("Data partitions before join: > {}".format(df1.rdd.getNumPartitions())) > df = (df1.join(df2, df1.id == df2.id) > .join(df3, df1.id == df3.id)) > print("Data partitions after join : {}".format(df.rdd.getNumPartitions())) > example_shuffle_partitions() > > In Spark 3.0.3, it prints out: > 3.0.3 > Data partitions is: 10. Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 4 > However, it prints out the following in the latest 3.3.2 > 3.3.2 > Data partitions is: 10. 
Shuffle partitions is 4 > Data partitions before join: 10 > Data partitions after join : 1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
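A plausible explanation for the 3.3.2 behavior reported in SPARK-42760 (an inference, not something the report confirms): Adaptive Query Execution has been enabled by default since Spark 3.2.0, and its partition-coalescing feature can merge the post-shuffle partitions of a small join down to a single partition, overriding `spark.sql.shuffle.partitions`. If that is the cause, the 3.0.3 behavior can be approximated with standard Spark SQL configuration, e.g. in spark-defaults.conf:

```properties
# Assumes AQE partition coalescing is responsible for the single output partition.
spark.sql.adaptive.coalescePartitions.enabled   false
# Or disable AQE entirely (usually not recommended for production workloads):
# spark.sql.adaptive.enabled                    false
```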
[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42759: Assignee: (was: Apache Spark) > Avoid duplicated `build/apache-maven` install when target already exists > > > Key: SPARK-42759 > URL: https://issues.apache.org/jira/browse/SPARK-42759 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42759:

Assignee: Apache Spark

> Avoid duplicated `build/apache-maven` install when target already exists
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Yang Jie
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699229#comment-17699229 ]

Apache Spark commented on SPARK-42759:

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40379

> Avoid repeated downloads of maven.tar.gz
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Yang Jie
> Priority: Major
[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz
[ https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42759:

Assignee: (was: Apache Spark)

> Avoid repeated downloads of maven.tar.gz
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Yang Jie
> Priority: Major