[jira] [Commented] (SPARK-42784) Fix the problem of incomplete creation of subdirectories in push merged localDir

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700044#comment-17700044
 ] 

Apache Spark commented on SPARK-42784:
--

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/40412

> Fix the problem of incomplete creation of subdirectories in push merged 
> localDir
> 
>
> Key: SPARK-42784
> URL: https://issues.apache.org/jira/browse/SPARK-42784
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.2
>Reporter: Fencheng Mei
>Priority: Major
>
> After we enabled push-based shuffle at scale in our production environment, 
> we found warning messages appearing in the server-side logs, such as:
> ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to 
> BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) 
> failed.
> java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize 
> merged shuffle partition for appId application_1671244879475_44020960 
> shuffleId 3 shuffleMergeId 0 reduceId 935.
> After investigation, we identified the triggering mechanism of the bug.
> The driver requested two different containers on the same physical machine. 
> While the first container (container_1) was creating the 'push-merged' 
> directory, the mergeDir was created first, and then the subdirectories were 
> created based on the value of the "spark.diskStore.subDirectories" parameter. 
> However, container_1's resources were preempted while the subdirectories 
> were being created, so only some of them were created. Because the mergeDir 
> still existed, the second container (container_2) was unable to create the 
> remaining subdirectories, as it assumed that all directories had already 
> been created.
>  
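The failure mode above comes down to using the existence of the parent mergeDir as a proxy for "all subdirectories exist". A minimal Scala sketch of that racy pattern (illustrative only, not the actual Spark source; the directory name, layout, and helper are assumptions):

{code:scala}
import java.io.File

// Hypothetical sketch: container_2 skips subdirectory creation entirely
// because mergeDir already exists, even though container_1 was preempted
// partway through creating the subdirectories.
def createMergeDirs(localDir: File, subDirsPerLocalDir: Int): Unit = {
  val mergeDir = new File(localDir, "merge_manager")
  if (!mergeDir.exists()) { // existence of the parent is a bad proxy
    mergeDir.mkdirs()
    (0 until subDirsPerLocalDir).foreach { i =>
      // Preemption here leaves a partial tree behind.
      new File(mergeDir, f"$i%02x").mkdirs()
    }
  }
}
{code}

Since File.mkdirs() is a no-op when the directory already exists, a robust variant can simply attempt to create every subdirectory unconditionally instead of gating all of the creation on the parent's existence.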



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42784) Fix the problem of incomplete creation of subdirectories in push merged localDir

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42784:


Assignee: Apache Spark

> Fix the problem of incomplete creation of subdirectories in push merged 
> localDir
> 
>
> Key: SPARK-42784
> URL: https://issues.apache.org/jira/browse/SPARK-42784
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.3.2
>Reporter: Fencheng Mei
>Assignee: Apache Spark
>Priority: Major
>
> After we enabled push-based shuffle at scale in our production environment, 
> we found warning messages appearing in the server-side logs, such as:
> ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to 
> BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) 
> failed.
> java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize 
> merged shuffle partition for appId application_1671244879475_44020960 
> shuffleId 3 shuffleMergeId 0 reduceId 935.
> After investigation, we identified the triggering mechanism of the bug.
> The driver requested two different containers on the same physical machine. 
> While the first container (container_1) was creating the 'push-merged' 
> directory, the mergeDir was created first, and then the subdirectories were 
> created based on the value of the "spark.diskStore.subDirectories" parameter. 
> However, container_1's resources were preempted while the subdirectories 
> were being created, so only some of them were created. Because the mergeDir 
> still existed, the second container (container_2) was unable to create the 
> remaining subdirectories, as it assumed that all directories had already 
> been created.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42783) Infer window group limit should run as late as possible

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42783:


Assignee: (was: Apache Spark)

> Infer window group limit should run as late as possible
> ---
>
> Key: SPARK-42783
> URL: https://issues.apache.org/jira/browse/SPARK-42783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42781) provide one format for writing to kafka

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700041#comment-17700041
 ] 

Apache Spark commented on SPARK-42781:
--

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40411

> provide one format for writing to kafka
> ---
>
> Key: SPARK-42781
> URL: https://issues.apache.org/jira/browse/SPARK-42781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2
>Reporter: 董云鹏
>Priority: Minor
> Fix For: 3.2.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42783) Infer window group limit should run as late as possible

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700040#comment-17700040
 ] 

Apache Spark commented on SPARK-42783:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40410

> Infer window group limit should run as late as possible
> ---
>
> Key: SPARK-42783
> URL: https://issues.apache.org/jira/browse/SPARK-42783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42783) Infer window group limit should run as late as possible

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42783:


Assignee: Apache Spark

> Infer window group limit should run as late as possible
> ---
>
> Key: SPARK-42783
> URL: https://issues.apache.org/jira/browse/SPARK-42783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42782) Port TestUDFJson from Hive

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42782:


Assignee: Apache Spark

> Port TestUDFJson from Hive
> --
>
> Key: SPARK-42782
> URL: https://issues.apache.org/jira/browse/SPARK-42782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42782) Port TestUDFJson from Hive

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700038#comment-17700038
 ] 

Apache Spark commented on SPARK-42782:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40409

> Port TestUDFJson from Hive
> --
>
> Key: SPARK-42782
> URL: https://issues.apache.org/jira/browse/SPARK-42782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42782) Port TestUDFJson from Hive

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42782:


Assignee: (was: Apache Spark)

> Port TestUDFJson from Hive
> --
>
> Key: SPARK-42782
> URL: https://issues.apache.org/jira/browse/SPARK-42782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/hive/blob/ba0217ff17501fb849d8999e808d37579db7b4f1/ql/src/test/org/apache/hadoop/hive/ql/udf/TestUDFJson.java



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42780:


Assignee: (was: Apache Spark)

> Upgrade google Tink from 1.7.0 to 1.8.0
> ---
>
> Key: SPARK-42780
> URL: https://issues.apache.org/jira/browse/SPARK-42780
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284]
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42780:


Assignee: Apache Spark

> Upgrade google Tink from 1.7.0 to 1.8.0
> ---
>
> Key: SPARK-42780
> URL: https://issues.apache.org/jira/browse/SPARK-42780
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284]
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42780) Upgrade google Tink from 1.7.0 to 1.8.0

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700016#comment-17700016
 ] 

Apache Spark commented on SPARK-42780:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40408

> Upgrade google Tink from 1.7.0 to 1.8.0
> ---
>
> Key: SPARK-42780
> URL: https://issues.apache.org/jira/browse/SPARK-42780
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3040284|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3040284]
> [SNYK-JAVA-COMGOOGLEPROTOBUF-3167772|https://security.snyk.io/vuln/SNYK-JAVA-COMGOOGLEPROTOBUF-3167772]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42781) provide one format for writing to kafka

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42781:


Assignee: (was: Apache Spark)

> provide one format for writing to kafka
> ---
>
> Key: SPARK-42781
> URL: https://issues.apache.org/jira/browse/SPARK-42781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2
>Reporter: 董云鹏
>Priority: Minor
> Fix For: 3.2.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42781) provide one format for writing to kafka

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42781:


Assignee: Apache Spark

> provide one format for writing to kafka
> ---
>
> Key: SPARK-42781
> URL: https://issues.apache.org/jira/browse/SPARK-42781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2
>Reporter: 董云鹏
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42781) provide one format for writing to kafka

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700014#comment-17700014
 ] 

Apache Spark commented on SPARK-42781:
--

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380

> provide one format for writing to kafka
> ---
>
> Key: SPARK-42781
> URL: https://issues.apache.org/jira/browse/SPARK-42781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.2
>Reporter: 董云鹏
>Priority: Minor
> Fix For: 3.2.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42778:


Assignee: (was: Apache Spark)

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699974#comment-17699974
 ] 

Apache Spark commented on SPARK-42778:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40407

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699973#comment-17699973
 ] 

Apache Spark commented on SPARK-42778:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40407

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42778:


Assignee: Apache Spark

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699937#comment-17699937
 ] 

Apache Spark commented on SPARK-42101:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40406

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>
> The first access to a cached plan with AQE enabled is tricky: currently we 
> cannot preserve its output partitioning and ordering.
> The whole query plan also misses many optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage resolves all of these issues.
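The scenario is easy to set up with public APIs. A minimal sketch (the behavioral claim about partitioning is the one made above, not verified here):

{code:scala}
// Build a cached plan and access it for the first time with AQE enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val cached = spark.range(1000).repartition(8).cache()

// The first access materializes the InMemoryRelation. Wrapping the
// InMemoryTableScanExec in a query stage lets AQE observe the cached plan's
// runtime statistics, output partitioning, and ordering at this point.
cached.join(spark.range(100), "id").collect()
{code}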



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699924#comment-17699924
 ] 

Apache Spark commented on SPARK-42340:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40405

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699922#comment-17699922
 ] 

Apache Spark commented on SPARK-42340:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40405

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42340:


Assignee: Apache Spark

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42340:


Assignee: (was: Apache Spark)

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699874#comment-17699874
 ] 

Apache Spark commented on SPARK-21782:
--

User 'megaserg' has created a pull request for this issue:
https://github.com/apache/spark/pull/18990

> Repartition creates skews when numPartitions is a power of 2
> 
>
> Key: SPARK-21782
> URL: https://issues.apache.org/jira/browse/SPARK-21782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sergey Serebryakov
>Assignee: Sergey Serebryakov
>Priority: Major
>  Labels: repartition
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png
>
>
> *Problem:*
> When an RDD (particularly one with a low item-per-partition ratio) is 
> repartitioned to a {{numPartitions}} that is a power of 2, the resulting 
> partitions are very unevenly sized. This affects both {{repartition()}} and 
> {{coalesce(shuffle=true)}}.
> *Steps to reproduce:*
> {code}
> $ spark-shell
> scala> sc.parallelize(0 until 1000, 
> 250).repartition(64).glom().map(_.length).collect()
> res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
> {code}
> *Explanation:*
> Currently, the [algorithm for 
> repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
>  (shuffle-enabled coalesce) is as follows:
> - for each initial partition {{index}}, generate {{position}} as {{(new 
> Random(index)).nextInt(numPartitions)}}
> - then, for element number {{k}} in initial partition {{index}}, put it in 
> the new partition {{position + k}} (modulo {{numPartitions}}).
> So, essentially elements are smeared roughly equally over {{numPartitions}} 
> buckets - starting from the one with number {{position+1}}.
> Note that a new instance of {{Random}} is created for every initial partition 
> {{index}}, with a fixed seed {{index}}, and then discarded. So the 
> {{position}} is deterministic for every {{index}} for any RDD in the world. 
> Also, [{{nextInt(bound)}} 
> implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
>  has a special case when {{bound}} is a power of 2, which is basically taking 
> several highest bits from the initial seed, with only a minimal scrambling.
> Due to the deterministic seed, the generator being used only once, and the 
> lack of scrambling, the {{position}} values for a power-of-two 
> {{numPartitions}} always end up almost the same regardless of the {{index}}, 
> making some buckets much more popular than others. So {{repartition}} will in 
> fact produce skewed partitions even when the partitions were roughly equal in 
> size beforehand.
> The behavior seems to have been introduced in SPARK-1770 by 
> https://github.com/apache/spark/pull/727/
> {quote}
> The load balancing is not perfect: a given output partition
> can have up to N more elements than the average if there are N input
> partitions. However, some randomization is used to minimize the
> probability that this happens.
> {quote}
> Another related ticket: SPARK-17817 - 
> https://github.com/apache/spark/pull/15445
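The clustering of {{position}} values is easy to observe directly. A small sketch mirroring the seeding scheme described in the explanation above (plain {{java.util.Random}}, not the Spark source itself):

{code:scala}
import java.util.Random

// For a power-of-two bound, nextInt() takes the high bits of the seed with
// only minimal scrambling, so nearby seeds yield clustered starting buckets.
(0 until 8).foreach { index =>
  val position = new Random(index).nextInt(64) // fixed seed = partition index
  println(s"partition $index -> starting bucket $position")
}

// Compare with a non-power-of-two bound such as nextInt(63), where the
// resulting positions spread out far more evenly across seeds.
{code}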



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699875#comment-17699875
 ] 

Apache Spark commented on SPARK-21782:
--

User 'megaserg' has created a pull request for this issue:
https://github.com/apache/spark/pull/18990

> Repartition creates skews when numPartitions is a power of 2
> 
>
> Key: SPARK-21782
> URL: https://issues.apache.org/jira/browse/SPARK-21782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sergey Serebryakov
>Assignee: Sergey Serebryakov
>Priority: Major
>  Labels: repartition
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png
>
>
> *Problem:*
> When an RDD (particularly one with a low item-per-partition ratio) is 
> repartitioned to a {{numPartitions}} that is a power of 2, the resulting 
> partitions are very unevenly sized. This affects both {{repartition()}} and 
> {{coalesce(shuffle=true)}}.
> *Steps to reproduce:*
> {code}
> $ spark-shell
> scala> sc.parallelize(0 until 1000, 
> 250).repartition(64).glom().map(_.length).collect()
> res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
> {code}
> *Explanation:*
> Currently, the [algorithm for 
> repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
>  (shuffle-enabled coalesce) is as follows:
> - for each initial partition {{index}}, generate {{position}} as {{(new 
> Random(index)).nextInt(numPartitions)}}
> - then, for element number {{k}} in initial partition {{index}}, put it in 
> the new partition {{position + k}} (modulo {{numPartitions}}).
> So, essentially elements are smeared roughly equally over {{numPartitions}} 
> buckets - starting from the one with number {{position+1}}.
> Note that a new instance of {{Random}} is created for every initial partition 
> {{index}}, with a fixed seed {{index}}, and then discarded. So the 
> {{position}} is deterministic for every {{index}} for any RDD in the world. 
> Also, [{{nextInt(bound)}} 
> implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
>  has a special case when {{bound}} is a power of 2, which is basically taking 
> several highest bits from the initial seed, with only a minimal scrambling.
> Due to the deterministic seed, the generator being used only once, and the 
> lack of scrambling, the {{position}} values for a power-of-two 
> {{numPartitions}} always end up almost the same regardless of the {{index}}, 
> making some buckets much more popular than others. So {{repartition}} will in 
> fact produce skewed partitions even when the partitions were roughly equal in 
> size beforehand.
> The behavior seems to have been introduced in SPARK-1770 by 
> https://github.com/apache/spark/pull/727/
> {quote}
> The load balancing is not perfect: a given output partition
> can have up to N more elements than the average if there are N input
> partitions. However, some randomization is used to minimize the
> probability that this happens.
> {quote}
> Another related ticket: SPARK-17817 - 
> https://github.com/apache/spark/pull/15445



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699873#comment-17699873
 ] 

Apache Spark commented on SPARK-42777:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40404

> Support converting TimestampNTZ catalog stats to plan stats
> ---
>
> Key: SPARK-42777
> URL: https://issues.apache.org/jira/browse/SPARK-42777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42777:


Assignee: Gengliang Wang  (was: Apache Spark)

> Support converting TimestampNTZ catalog stats to plan stats
> ---
>
> Key: SPARK-42777
> URL: https://issues.apache.org/jira/browse/SPARK-42777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699872#comment-17699872
 ] 

Apache Spark commented on SPARK-42777:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40404

> Support converting TimestampNTZ catalog stats to plan stats
> ---
>
> Key: SPARK-42777
> URL: https://issues.apache.org/jira/browse/SPARK-42777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42777) Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42777:


Assignee: Apache Spark  (was: Gengliang Wang)

> Support converting TimestampNTZ catalog stats to plan stats
> ---
>
> Key: SPARK-42777
> URL: https://issues.apache.org/jira/browse/SPARK-42777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42754:


Assignee: Apache Spark

> Spark 3.4 history server's SQL tab incorrectly groups SQL executions when 
> replaying event logs from Spark 3.3 and earlier
> -
>
> Key: SPARK-42754
> URL: https://issues.apache.org/jira/browse/SPARK-42754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: example.png
>
>
> In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
> executions when replaying event logs generated by older Spark versions.
>  
> *Reproduction:*
> In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
> spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
> {code:java}
> sql("select * from range(10)").collect()
> sql("select * from range(20)").collect()
> sql("select * from range(30)").collect(){code}
> Exit the shell and use the Spark History Server to replay this application's 
> UI.
> In the SQL tab I expect to see three separate queries, but Spark 3.4's 
> history server incorrectly groups the second and third queries as nested 
> queries of the first (see attached screenshot).
>  
> *Root cause:*
> [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
> *non-optional* {{rootExecutionId: Long}} field to the 
> SparkListenerSQLExecutionStart case class.
> When JsonProtocol deserializes this event it uses the "ignore missing 
> properties" Jackson deserialization option, causing the 
> {{rootExecutionId}} field to be initialized with a default value of {{0}}.
> The value {{0}} is a legitimate execution ID, so in the deserialized event we 
> have no way to distinguish between the absence of a value and a case where 
> all queries have the first query as the root.
> *Proposed fix:*
> I think we should change this field to be of type {{Option[Long]}}. I 
> believe this is a release blocker for Spark 3.4.0 because we cannot change 
> the type of this new field in a future release without breaking binary 
> compatibility.
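The ambiguity described under the root cause can be reproduced with a plain Jackson round trip. A hedged Scala sketch (a simplified stand-in case class, not the real SparkListenerSQLExecutionStart; assumes jackson-module-scala is on the classpath):

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Simplified stand-ins for the event, before and after the proposed fix.
case class ExecutionStart(executionId: Long, rootExecutionId: Long)
case class ExecutionStartFixed(executionId: Long, rootExecutionId: Option[Long])

object RootIdDemo extends App {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // An event log written by Spark 3.3 has no "rootExecutionId" field, so the
  // primitive Long silently defaults to 0 -- a legitimate execution ID.
  val old = mapper.readValue("""{"executionId": 42}""", classOf[ExecutionStart])
  println(old.rootExecutionId) // 0: indistinguishable from "rooted at query 0"

  // With Option[Long], the absence survives deserialization as None.
  val fixed =
    mapper.readValue("""{"executionId": 42}""", classOf[ExecutionStartFixed])
  println(fixed.rootExecutionId) // None
}
{code}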



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699852#comment-17699852
 ] 

Apache Spark commented on SPARK-42754:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40403

> Spark 3.4 history server's SQL tab incorrectly groups SQL executions when 
> replaying event logs from Spark 3.3 and earlier
> -
>
> Key: SPARK-42754
> URL: https://issues.apache.org/jira/browse/SPARK-42754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: example.png
>
>
> In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
> executions when replaying event logs generated by older Spark versions.
>  
> *Reproduction:*
> In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
> spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
> {code:java}
> sql("select * from range(10)").collect()
> sql("select * from range(20)").collect()
> sql("select * from range(30)").collect(){code}
> Exit the shell and use the Spark History Server to replay this application's 
> UI.
> In the SQL tab I expect to see three separate queries, but Spark 3.4's 
> history server incorrectly groups the second and third queries as nested 
> queries of the first (see attached screenshot).
>  
> *Root cause:*
> [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
> *non-optional* {{rootExecutionId: Long}} field to the 
> SparkListenerSQLExecutionStart case class.
> When JsonProtocol deserializes this event it uses the "ignore missing 
> properties" Jackson deserialization option, causing the 
> {{rootExecutionId}} field to be initialized with a default value of {{0}}.
> The value {{0}} is a legitimate execution ID, so in the deserialized event we 
> have no way to distinguish between the absence of a value and a case where 
> all queries have the first query as the root.
> *Proposed fix:*
> I think we should change this field to be of type {{Option[Long]}}. I 
> believe this is a release blocker for Spark 3.4.0 because we cannot change 
> the type of this new field in a future release without breaking binary 
> compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42754) Spark 3.4 history server's SQL tab incorrectly groups SQL executions when replaying event logs from Spark 3.3 and earlier

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42754:


Assignee: (was: Apache Spark)

> Spark 3.4 history server's SQL tab incorrectly groups SQL executions when 
> replaying event logs from Spark 3.3 and earlier
> -
>
> Key: SPARK-42754
> URL: https://issues.apache.org/jira/browse/SPARK-42754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: example.png
>
>
> In Spark 3.4.0 RC4, the Spark History Server's SQL tab incorrectly groups SQL 
> executions when replaying event logs generated by older Spark versions.
>  
> *Reproduction:*
> In {{./bin/spark-shell --conf spark.eventLog.enabled=true --conf 
> spark.eventLog.dir=eventlogs}}, run three non-nested SQL queries:
> {code:java}
> sql("select * from range(10)").collect()
> sql("select * from range(20)").collect()
> sql("select * from range(30)").collect(){code}
> Exit the shell and use the Spark History Server to replay this application's 
> UI.
> In the SQL tab I expect to see three separate queries, but Spark 3.4's 
> history server incorrectly groups the second and third queries as nested 
> queries of the first (see attached screenshot).
>  
> *Root cause:*
> [https://github.com/apache/spark/pull/39268] / SPARK-41752 added a new 
> *non-optional* {{rootExecutionId: Long}} field to the 
> SparkListenerSQLExecutionStart case class.
> When JsonProtocol deserializes this event it uses the "ignore missing 
> properties" Jackson deserialization option, causing the 
> {{rootExecutionId}} field to be initialized with a default value of {{0}}.
> The value {{0}} is a legitimate execution ID, so in the deserialized event we 
> have no way to distinguish between the absence of a value and a case where 
> all queries have the first query as the root.
> *Proposed fix:*
> I think we should change this field to be of type {{Option[Long]}}. I 
> believe this is a release blocker for Spark 3.4.0 because we cannot change 
> the type of this new field in a future release without breaking binary 
> compatibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42020) createDataFrame with UDT

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42020:


Assignee: (was: Apache Spark)

> createDataFrame with UDT
> 
>
> Key: SPARK-42020
> URL: https://issues.apache.org/jira/browse/SPARK-42020
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_types.py:596 
> (TypesParityTests.test_apply_schema_with_udt)
> self = <TypesParityTests testMethod=test_apply_schema_with_udt>
> def test_apply_schema_with_udt(self):
> row = (1.0, ExamplePoint(1.0, 2.0))
> schema = StructType(
> [
> StructField("label", DoubleType(), False),
> StructField("point", ExamplePointUDT(), False),
> ]
> )
> >   df = self.spark.createDataFrame([row], schema)
> ../test_types.py:605: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert ExamplePoint(1.0,2.0) with 
> type ExamplePoint: did not recognize Python value type when inferring an 
> Arrow data type
> pyarrow/error.pxi:100: ArrowInvalid
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42020) createDataFrame with UDT

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42020:


Assignee: Apache Spark

> createDataFrame with UDT
> 
>
> Key: SPARK-42020
> URL: https://issues.apache.org/jira/browse/SPARK-42020
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_types.py:596 
> (TypesParityTests.test_apply_schema_with_udt)
> self = <TypesParityTests testMethod=test_apply_schema_with_udt>
> def test_apply_schema_with_udt(self):
> row = (1.0, ExamplePoint(1.0, 2.0))
> schema = StructType(
> [
> StructField("label", DoubleType(), False),
> StructField("point", ExamplePointUDT(), False),
> ]
> )
> >   df = self.spark.createDataFrame([row], schema)
> ../test_types.py:605: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert ExamplePoint(1.0,2.0) with 
> type ExamplePoint: did not recognize Python value type when inferring an 
> Arrow data type
> pyarrow/error.pxi:100: ArrowInvalid
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42020) createDataFrame with UDT

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699825#comment-17699825
 ] 

Apache Spark commented on SPARK-42020:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40402

> createDataFrame with UDT
> 
>
> Key: SPARK-42020
> URL: https://issues.apache.org/jira/browse/SPARK-42020
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> pyspark/sql/tests/test_types.py:596 
> (TypesParityTests.test_apply_schema_with_udt)
> self = <TypesParityTests testMethod=test_apply_schema_with_udt>
> def test_apply_schema_with_udt(self):
> row = (1.0, ExamplePoint(1.0, 2.0))
> schema = StructType(
> [
> StructField("label", DoubleType(), False),
> StructField("point", ExamplePointUDT(), False),
> ]
> )
> >   df = self.spark.createDataFrame([row], schema)
> ../test_types.py:605: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert ExamplePoint(1.0,2.0) with 
> type ExamplePoint: did not recognize Python value type when inferring an 
> Arrow data type
> pyarrow/error.pxi:100: ArrowInvalid
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42773:


Assignee: Apache Spark

> Minor grammatical change to "Supports Spark Connect" message
> 
>
> Key: SPARK-42773
> URL: https://issues.apache.org/jira/browse/SPARK-42773
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Assignee: Apache Spark
>Priority: Major
>
> Changing "Support Spark Connect" to "Supports Spark Connect" in the 3.4.0 
> version change message, which is also used in the documentation:
>  
> .. versionchanged:: 3.4.0
>      Supports Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42773:


Assignee: (was: Apache Spark)

> Minor grammatical change to "Supports Spark Connect" message
> 
>
> Key: SPARK-42773
> URL: https://issues.apache.org/jira/browse/SPARK-42773
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Priority: Major
>
> Changing "Support Spark Connect" to "Supports Spark Connect" in the 3.4.0 
> version change message, which is also used in the documentation:
>  
> .. versionchanged:: 3.4.0
>      Supports Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699781#comment-17699781
 ] 

Apache Spark commented on SPARK-42773:
--

User 'allanf-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40401

> Minor grammatical change to "Supports Spark Connect" message
> 
>
> Key: SPARK-42773
> URL: https://issues.apache.org/jira/browse/SPARK-42773
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Priority: Major
>
> Changing "Support Spark Connect" to "Supports Spark Connect" in the 3.4.0 
> version change message, which is also used in the documentation:
>  
> .. versionchanged:: 3.4.0
>      Supports Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41359:


Assignee: Apache Spark

> Use `PhysicalDataType` instead of DataType in UnsafeRow
> ---
>
> Key: SPARK-41359
> URL: https://issues.apache.org/jira/browse/SPARK-41359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699720#comment-17699720
 ] 

Apache Spark commented on SPARK-41359:
--

User 'ClownXC' has created a pull request for this issue:
https://github.com/apache/spark/pull/40400

> Use `PhysicalDataType` instead of DataType in UnsafeRow
> ---
>
> Key: SPARK-41359
> URL: https://issues.apache.org/jira/browse/SPARK-41359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41359) Use `PhysicalDataType` instead of DataType in UnsafeRow

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41359:


Assignee: (was: Apache Spark)

> Use `PhysicalDataType` instead of DataType in UnsafeRow
> ---
>
> Key: SPARK-41359
> URL: https://issues.apache.org/jira/browse/SPARK-41359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699610#comment-17699610
 ] 

Apache Spark commented on SPARK-42101:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/40399

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>
> The first access to a cached plan with AQE enabled is tricky: currently we 
> cannot preserve its output partitioning and ordering.
> The whole query plan also misses many optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage resolves all of these issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699611#comment-17699611
 ] 

Apache Spark commented on SPARK-42101:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/40399

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.5.0
>
>
> The first access to a cached plan with AQE enabled is tricky: currently we 
> cannot preserve its output partitioning and ordering.
> The whole query plan also misses many optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage resolves all of these issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42052) Codegen Support for HiveSimpleUDF

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699595#comment-17699595
 ] 

Apache Spark commented on SPARK-42052:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40397

> Codegen Support for HiveSimpleUDF
> -
>
> Key: SPARK-42052
> URL: https://issues.apache.org/jira/browse/SPARK-42052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42772) Change the default value of the JDBC push-down options to true

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42772:


Assignee: (was: Apache Spark)

> Change the default value of the JDBC push-down options to true
> 
>
> Key: SPARK-42772
> URL: https://issues.apache.org/jira/browse/SPARK-42772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
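The ticket body is empty, so as context only: a sketch of the JDBC reader options the title appears to target. The option keys below exist in the JDBC data source; treating them as the ones whose defaults flip to true is an inference from the title, the URL and table names are placeholders, and an active SparkSession named spark is assumed.

{code:scala}
// Today these push-down options must be switched on per read; after this
// change they would default to true unless a user opts out.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales") // placeholder
  .option("dbtable", "orders")                          // placeholder
  .option("pushDownAggregate", "true")
  .option("pushDownLimit", "true")
  .option("pushDownOffset", "true")
  .option("pushDownTableSample", "true")
  .load()
{code}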




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42772) Change the default value of the JDBC push-down options to true

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699571#comment-17699571
 ] 

Apache Spark commented on SPARK-42772:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40396

> Change the default value of the JDBC push-down options to true
> 
>
> Key: SPARK-42772
> URL: https://issues.apache.org/jira/browse/SPARK-42772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42772) Change the default value of the JDBC push-down options to true

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42772:


Assignee: Apache Spark

> Change the default value of the JDBC push-down options to true
> 
>
> Key: SPARK-42772
> URL: https://issues.apache.org/jira/browse/SPARK-42772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42770:


Assignee: (was: Apache Spark)

> SQLImplicitsTestSuite test failed with Java 17
> --
>
> Key: SPARK-42770
> URL: https://issues.apache.org/jira/browse/SPARK-42770
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682]
> {code:java}
> [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 milliseconds)
> [info]   2023-03-02T23:00:20.404434 did not equal 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133)
> [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> [info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> [info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Suite.run(Suite.scala:1114)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1096)
> [info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34)
> [info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34)
> [info]   at 
> 
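The assertion above compares an Instant carrying nanoseconds against one that lost its sub-microsecond digits. A minimal sketch of that mismatch (assuming, as the values suggest, that the higher-precision Java 17 clock is the trigger; the actual fix is whatever the linked PR does):

{code:scala}
import java.time.Instant
import java.time.temporal.ChronoUnit

// On Java 17 Instant.now() can carry nanosecond precision, while a value that
// round-trips through Spark's TimestampType keeps only microseconds, so naive
// equality between the two can fail.
val original     = Instant.now()                           // e.g. ...404434875
val roundTripped = original.truncatedTo(ChronoUnit.MICROS) // e.g. ...404434000

// Comparing both sides at microsecond precision is stable on every JDK:
assert(roundTripped == original.truncatedTo(ChronoUnit.MICROS))
{code}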

[jira] [Assigned] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42770:


Assignee: Apache Spark

> SQLImplicitsTestSuite test failed with Java 17
> --
>
> Key: SPARK-42770
> URL: https://issues.apache.org/jira/browse/SPARK-42770
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682]
> {code:java}
> [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 milliseconds)
> [info]   2023-03-02T23:00:20.404434 did not equal 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133)
> [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> [info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> [info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Suite.run(Suite.scala:1114)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1096)
> [info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34)
> [info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34)
> [info]   at 
> 

[jira] [Commented] (SPARK-42770) SQLImplicitsTestSuite test failed with Java 17

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699542#comment-17699542
 ] 

Apache Spark commented on SPARK-42770:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40395

> SQLImplicitsTestSuite test failed with Java 17
> --
>
> Key: SPARK-42770
> URL: https://issues.apache.org/jira/browse/SPARK-42770
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682]
> {code:java}
> [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 milliseconds)
> [info]   2023-03-02T23:00:20.404434 did not equal 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133)
> [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> [info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> [info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Suite.run(Suite.scala:1114)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1096)
> [info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34)
> [info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> [info]   at org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34)
> 

[jira] [Commented] (SPARK-42771) Refactor HiveGenericUDF

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699527#comment-17699527
 ] 

Apache Spark commented on SPARK-42771:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40394

> Refactor HiveGenericUDF
> ---
>
> Key: SPARK-42771
> URL: https://issues.apache.org/jira/browse/SPARK-42771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42771) Refactor HiveGenericUDF

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42771:


Assignee: (was: Apache Spark)

> Refactor HiveGenericUDF
> ---
>
> Key: SPARK-42771
> URL: https://issues.apache.org/jira/browse/SPARK-42771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42771) Refactor HiveGenericUDF

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42771:


Assignee: Apache Spark

> Refactor HiveGenericUDF
> ---
>
> Key: SPARK-42771
> URL: https://issues.apache.org/jira/browse/SPARK-42771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699516#comment-17699516
 ] 

Apache Spark commented on SPARK-40082:
--

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/40393

> DAGScheduler may not schedule new stages when push-based shuffle is enabled
> --
>
> Key: SPARK-40082
> URL: https://issues.apache.org/jira/browse/SPARK-40082
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.1
>Reporter: Penglei Shi
>Priority: Major
> Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> With push-based shuffle enabled and speculative tasks present, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs, and its parent 
> stages are resubmitted first, which takes some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but those task-success events cannot trigger 
> shuffleMergeFinalized because the stage has already been removed from 
> runningStages.
> When the stage is then resubmitted, the speculative tasks have already 
> registered their map output and there are no missing tasks to compute, so 
> the resubmission does not trigger shuffleMergeFinalized either. Eventually 
> the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that 
> hit the fetchFailed. In getMissingParentStages this stage is marked as 
> missing and resubmitted, but the next stages are only added to 
> waitingStages after this stage finishes, so they are never submitted even 
> after the resubmission completes.
> I have only hit this a few times in our production environment and it is 
> difficult to reproduce.
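Since no deterministic reproduction exists, the only concrete thing to show is the configuration under which the report applies, as a sketch (standard conf keys; enabling them does not by itself trigger the hang):

{code:scala}
import org.apache.spark.sql.SparkSession

// Both features from the report must be on: push-based shuffle (which also
// requires the external shuffle service, on YARN) and speculative execution.
// The hang additionally needs a fetch failure racing with late speculative
// task successes, which is timing-dependent.
val spark = SparkSession.builder()
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.shuffle.push.enabled", "true")
  .config("spark.speculation", "true")
  .getOrCreate()
{code}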



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40082:


Assignee: (was: Apache Spark)

> DAGScheduler may not schedule new stages when push-based shuffle is enabled
> --
>
> Key: SPARK-40082
> URL: https://issues.apache.org/jira/browse/SPARK-40082
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.1
>Reporter: Penglei Shi
>Priority: Major
> Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> With push-based shuffle enabled and speculative tasks present, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs, and its parent 
> stages are resubmitted first, which takes some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but those task-success events cannot trigger 
> shuffleMergeFinalized because the stage has already been removed from 
> runningStages.
> When the stage is then resubmitted, the speculative tasks have already 
> registered their map output and there are no missing tasks to compute, so 
> the resubmission does not trigger shuffleMergeFinalized either. Eventually 
> the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that 
> hit the fetchFailed. In getMissingParentStages this stage is marked as 
> missing and resubmitted, but the next stages are only added to 
> waitingStages after this stage finishes, so they are never submitted even 
> after the resubmission completes.
> I have only hit this a few times in our production environment and it is 
> difficult to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40082:


Assignee: Apache Spark

> DAGScheduler may not schedule new stages when push-based shuffle is enabled
> --
>
> Key: SPARK-40082
> URL: https://issues.apache.org/jira/browse/SPARK-40082
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.1
>Reporter: Penglei Shi
>Assignee: Apache Spark
>Priority: Major
> Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> With push-based shuffle enabled and speculative tasks present, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs, and its parent 
> stages are resubmitted first, which takes some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but those task-success events cannot trigger 
> shuffleMergeFinalized because the stage has already been removed from 
> runningStages.
> When the stage is then resubmitted, the speculative tasks have already 
> registered their map output and there are no missing tasks to compute, so 
> the resubmission does not trigger shuffleMergeFinalized either. Eventually 
> the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that 
> hit the fetchFailed. In getMissingParentStages this stage is marked as 
> missing and resubmitted, but the next stages are only added to 
> waitingStages after this stage finishes, so they are never submitted even 
> after the resubmission completes.
> I have only hit this a few times in our production environment and it is 
> difficult to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40082) DAGScheduler may not schedule new stages when push-based shuffle is enabled

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699515#comment-17699515
 ] 

Apache Spark commented on SPARK-40082:
--

User 'Stove-hust' has created a pull request for this issue:
https://github.com/apache/spark/pull/40393

> DAGScheduler may not schedule new stages when push-based shuffle is enabled
> --
>
> Key: SPARK-40082
> URL: https://issues.apache.org/jira/browse/SPARK-40082
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.1.1
>Reporter: Penglei Shi
>Priority: Major
> Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png
>
>
> With push-based shuffle enabled and speculative tasks present, a 
> shuffleMapStage is resubmitted once a fetchFailed occurs, and its parent 
> stages are resubmitted first, which takes some time to compute. Before the 
> shuffleMapStage is resubmitted, all of its speculative tasks succeed and 
> register their map output, but those task-success events cannot trigger 
> shuffleMergeFinalized because the stage has already been removed from 
> runningStages.
> When the stage is then resubmitted, the speculative tasks have already 
> registered their map output and there are no missing tasks to compute, so 
> the resubmission does not trigger shuffleMergeFinalized either. Eventually 
> the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that 
> hit the fetchFailed. In getMissingParentStages this stage is marked as 
> missing and resubmitted, but the next stages are only added to 
> waitingStages after this stage finishes, so they are never submitted even 
> after the resubmission completes.
> I have only hit this a few times in our production environment and it is 
> difficult to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699470#comment-17699470
 ] 

Apache Spark commented on SPARK-42769:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40392

> Add ENV_DRIVER_POD_IP env variable to executor pods
> ---
>
> Key: SPARK-42769
> URL: https://issues.apache.org/jira/browse/SPARK-42769
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42769:


Assignee: Apache Spark

> Add ENV_DRIVER_POD_IP env variable to executor pods
> ---
>
> Key: SPARK-42769
> URL: https://issues.apache.org/jira/browse/SPARK-42769
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42769) Add ENV_DRIVER_POD_IP env variable to executor pods

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42769:


Assignee: (was: Apache Spark)

> Add ENV_DRIVER_POD_IP env variable to executor pods
> ---
>
> Key: SPARK-42769
> URL: https://issues.apache.org/jira/browse/SPARK-42769
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers

2023-03-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699467#comment-17699467
 ] 

Apache Spark commented on SPARK-42766:
--

User 'wangshengjie123' has created a pull request for this issue:
https://github.com/apache/spark/pull/40391

> YarnAllocator should filter excluded nodes when launching allocated containers
> --
>
> Key: SPARK-42766
> URL: https://issues.apache.org/jira/browse/SPARK-42766
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.2
>Reporter: wangshengjie
>Priority: Major
>
> In our production environment we hit an issue like this:
> Suppose we request 10 containers from nodeA and nodeB. The first response 
> from YARN returns 5 containers from nodeA and nodeB, after which nodeA is 
> excluded. The second response from YARN may still return containers on 
> nodeA and launch them, but when those containers (executors) start up and 
> send their register request to the driver, they are rejected. Each 
> rejection is counted towards 
> {code:java}
> spark.yarn.max.executor.failures {code}
> and can cause the application to fail with:
> {code:java}
> Max number of executor failures ($maxNumExecutorFailures) reached{code}
>  
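A self-contained sketch of the guard the title calls for (the names below are hypothetical stand-ins; the real change belongs in YarnAllocator's container-launch path):

{code:scala}
// Hypothetical model of the allocator state: drop allocated containers whose
// host is currently excluded instead of launching executors there, so their
// doomed registration attempts never count towards
// spark.yarn.max.executor.failures.
case class Container(host: String, id: Int)

def filterLaunchable(allocated: Seq[Container], excludedNodes: Set[String]): Seq[Container] =
  allocated.filterNot(c => excludedNodes.contains(c.host))

val allocated  = Seq(Container("nodeA", 1), Container("nodeA", 2), Container("nodeB", 3))
val launchable = filterLaunchable(allocated, excludedNodes = Set("nodeA"))
// launchable == List(Container("nodeB", 3))
{code}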



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42766:


Assignee: (was: Apache Spark)

> YarnAllocator should filter excluded nodes when launching allocated containers
> --
>
> Key: SPARK-42766
> URL: https://issues.apache.org/jira/browse/SPARK-42766
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.2
>Reporter: wangshengjie
>Priority: Major
>
> In our production environment we hit an issue like this:
> Suppose we request 10 containers from nodeA and nodeB. The first response 
> from YARN returns 5 containers from nodeA and nodeB, after which nodeA is 
> excluded. The second response from YARN may still return containers on 
> nodeA and launch them, but when those containers (executors) start up and 
> send their register request to the driver, they are rejected. Each 
> rejection is counted towards 
> {code:java}
> spark.yarn.max.executor.failures {code}
> and can cause the application to fail with:
> {code:java}
> Max number of executor failures ($maxNumExecutorFailures) reached{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42766) YarnAllocator should filter excluded nodes when launching allocated containers

2023-03-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42766:


Assignee: Apache Spark

> YarnAllocator should filter excluded nodes when launching allocated containers
> --
>
> Key: SPARK-42766
> URL: https://issues.apache.org/jira/browse/SPARK-42766
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.2
>Reporter: wangshengjie
>Assignee: Apache Spark
>Priority: Major
>
> In our production environment we hit an issue like this:
> Suppose we request 10 containers from nodeA and nodeB. The first response 
> from YARN returns 5 containers from nodeA and nodeB, after which nodeA is 
> excluded. The second response from YARN may still return containers on 
> nodeA and launch them, but when those containers (executors) start up and 
> send their register request to the driver, they are rejected. Each 
> rejection is counted towards 
> {code:java}
> spark.yarn.max.executor.failures {code}
> and can cause the application to fail with:
> {code:java}
> Max number of executor failures ($maxNumExecutorFailures) reached{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42768) Enable cached plans to apply AQE by default

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699445#comment-17699445
 ] 

Apache Spark commented on SPARK-42768:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40390

> Enable cached plans to apply AQE by default
> ---
>
> Key: SPARK-42768
> URL: https://issues.apache.org/jira/browse/SPARK-42768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
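The ticket body is empty; as a hedged guess from the title, the knob involved is likely the one below (the conf key exists in recent Spark releases, but tying it to this ticket is an inference, and an active SparkSession named spark is assumed):

{code:scala}
// Opting in manually today; this ticket appears to propose making it the default.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")

val df = spark.range(0, 100000).toDF("id")
df.cache()
df.groupBy("id").count().explain() // the cached scan can now be re-planned by AQE
{code}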




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42768) Enable cached plans to apply AQE by default

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42768:


Assignee: (was: Apache Spark)

> Enable cached plans to apply AQE by default
> ---
>
> Key: SPARK-42768
> URL: https://issues.apache.org/jira/browse/SPARK-42768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42768) Enable cached plans to apply AQE by default

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42768:


Assignee: Apache Spark

> Enable cached plans to apply AQE by default
> ---
>
> Key: SPARK-42768
> URL: https://issues.apache.org/jira/browse/SPARK-42768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42767) Add a check to start the Connect server with an `in-memory` fallback and auto-ignore some tests that strongly depend on Hive

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42767:


Assignee: Apache Spark

> Add a check to start the Connect server with an `in-memory` fallback and 
> auto-ignore some tests that strongly depend on Hive
> -
>
> Key: SPARK-42767
> URL: https://issues.apache.org/jira/browse/SPARK-42767
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42767) Add a check to start the Connect server with an `in-memory` fallback and auto-ignore some tests that strongly depend on Hive

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42767:


Assignee: (was: Apache Spark)

> Add a check to start the Connect server with an `in-memory` fallback and 
> auto-ignore some tests that strongly depend on Hive
> -
>
> Key: SPARK-42767
> URL: https://issues.apache.org/jira/browse/SPARK-42767
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42767) Add a check to start the Connect server with an `in-memory` fallback and auto-ignore some tests that strongly depend on Hive

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699444#comment-17699444
 ] 

Apache Spark commented on SPARK-42767:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40389

> Add a check to start the Connect server with an `in-memory` fallback and 
> auto-ignore some tests that strongly depend on Hive
> -
>
> Key: SPARK-42767
> URL: https://issues.apache.org/jira/browse/SPARK-42767
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42767) Add a check to start the Connect server with an `in-memory` fallback and auto-ignore some tests that strongly depend on Hive

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699443#comment-17699443
 ] 

Apache Spark commented on SPARK-42767:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40389

> Add a check to start the Connect server with an `in-memory` fallback and 
> auto-ignore some tests that strongly depend on Hive
> -
>
> Key: SPARK-42767
> URL: https://issues.apache.org/jira/browse/SPARK-42767
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42765) Regulate the import path of `pandas_udf`

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699425#comment-17699425
 ] 

Apache Spark commented on SPARK-42765:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40388

> Regulate the import path of `pandas_udf`
> 
>
> Key: SPARK-42765
> URL: https://issues.apache.org/jira/browse/SPARK-42765
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Remove the outdated import path of `pandas_udf`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42765) Regulate the import path of `pandas_udf`

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42765:


Assignee: Apache Spark

> Regulate the import path of `pandas_udf`
> 
>
> Key: SPARK-42765
> URL: https://issues.apache.org/jira/browse/SPARK-42765
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Remove the outdated import path of `pandas_udf`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42765) Regulate the import path of `pandas_udf`

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699426#comment-17699426
 ] 

Apache Spark commented on SPARK-42765:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40388

> Regulate the import path of `pandas_udf`
> 
>
> Key: SPARK-42765
> URL: https://issues.apache.org/jira/browse/SPARK-42765
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Remove the outdated import path of `pandas_udf`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42765) Regulate the import path of `pandas_udf`

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42765:


Assignee: (was: Apache Spark)

> Regulate the import path of `pandas_udf`
> 
>
> Key: SPARK-42765
> URL: https://issues.apache.org/jira/browse/SPARK-42765
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Remove the outdated import path of `pandas_udf`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699417#comment-17699417
 ] 

Apache Spark commented on SPARK-42764:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40387

> Parameterize the max number of attempts for driver props fetcher in 
> KubernetesExecutorBackend
> -
>
> Key: SPARK-42764
> URL: https://issues.apache.org/jira/browse/SPARK-42764
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42764:


Assignee: (was: Apache Spark)

> Parameterize the max number of attempts for driver props fetcher in 
> KubernetesExecutorBackend
> -
>
> Key: SPARK-42764
> URL: https://issues.apache.org/jira/browse/SPARK-42764
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699416#comment-17699416
 ] 

Apache Spark commented on SPARK-42764:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40387

> Parameterize the max number of attempts for driver props fetcher in 
> KubernetesExecutorBackend
> -
>
> Key: SPARK-42764
> URL: https://issues.apache.org/jira/browse/SPARK-42764
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42764) Parameterize the max number of attempts for driver props fetcher in KubernetesExecutorBackend

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42764:


Assignee: Apache Spark

> Parameterize the max number of attempts for driver props fetcher in 
> KubernetesExecutorBackend
> -
>
> Key: SPARK-42764
> URL: https://issues.apache.org/jira/browse/SPARK-42764
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699388#comment-17699388
 ] 

Apache Spark commented on SPARK-42753:
--

User 'StevenChenDatabricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/40385

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is 
> being reused can be replaced in the plan tree. As a result, when we print 
> the query plan, the ReusedExchange refers to an "unknown" Exchange. An 
> example below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and the QueryStage is added to stage cache
> Step 2: Exchange D reuses Exchange B
> Step 3: Exchange C is materialized and the QueryStage is added to stage cache
> Step 4: Exchange A reuses Exchange C
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) refers to a 
> non-existent node. This *DOES NOT* affect query execution, but it breaks 
> the query visualization in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> UNKNOWN node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> completely. No node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42753:


Assignee: (was: Apache Spark)

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is 
> being reused can be replaced in the plan tree. As a result, when we print 
> the query plan, the ReusedExchange refers to an "unknown" Exchange. An 
> example below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and the QueryStage is added to stage cache
> Step 2: Exchange D reuses Exchange B
> Step 3: Exchange C is materialized and the QueryStage is added to stage cache
> Step 4: Exchange A reuses Exchange C
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) refers to a 
> non-existent node. This *DOES NOT* affect query execution, but it breaks 
> the query visualization in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> UNKNOWN node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> completely. No node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42753:


Assignee: Apache Spark

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Assignee: Apache Spark
>Priority: Major
>
> There is an AQE "issue" where, during AQE planning, the Exchange that is 
> being reused can be replaced in the plan tree. As a result, when we print 
> the query plan, the ReusedExchange refers to an "unknown" Exchange. An 
> example below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and its QueryStage is added to the stage cache.
> Step 2: Exchange D reuses Exchange B.
> Step 3: Exchange C is materialized and its QueryStage is added to the stage cache.
> Step 4: Exchange A reuses Exchange C.
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) will refer to a 
> non-existent node. This *DOES NOT* affect query execution, but it breaks the 
> query visualization in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> unknown node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> entirely, with no node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42753) ReusedExchange refers to non-existent node

2023-03-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699389#comment-17699389
 ] 

Apache Spark commented on SPARK-42753:
--

User 'StevenChenDatabricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/40385

> ReusedExchange refers to non-existent node
> --
>
> Key: SPARK-42753
> URL: https://issues.apache.org/jira/browse/SPARK-42753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Steven Chen
>Priority: Major
>
> There is an AQE issue where, during AQE planning, the Exchange that is being 
> reused can be replaced in the plan tree. As a result, when we print the query 
> plan, the ReusedExchange refers to an "unknown" Exchange. An example is shown 
> below:
>  
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]
>  Output [3]: [sr_customer_sk#271, sr_store_sk#275, sum#377L]{code}
>  
>  
> Below is an example to demonstrate the root cause:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A
>           |-- SomeNode Y
>               |-- Exchange B
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C
>           |-- SomeNode N
>               |-- Exchange D
> {code}
>  
>  
> Step 1: Exchange B is materialized and its QueryStage is added to the stage cache.
> Step 2: Exchange D reuses Exchange B.
> Step 3: Exchange C is materialized and its QueryStage is added to the stage cache.
> Step 4: Exchange A reuses Exchange C.
>  
> Then the final plan looks like:
>  
> {code:java}
> AdaptiveSparkPlan
>   |-- SomeNode X (subquery xxx)
>       |-- Exchange A -> ReusedExchange (reuses Exchange C)
> Subquery:Hosting operator = SomeNode Hosting Expression = xxx 
> dynamicpruning#388
> AdaptiveSparkPlan
>   |-- SomeNode M
>       |-- Exchange C -> PhotonShuffleMapStage 
>           |-- SomeNode N
>               |-- Exchange D -> ReusedExchange (reuses Exchange B)
> {code}
>  
>  
> As a result, the ReusedExchange (reuses Exchange B) will refer to a 
> non-existent node. This *DOES NOT* affect query execution, but it breaks the 
> query visualization in the following ways:
>  # The ReusedExchange child subtree will still appear in the Spark UI graph 
> but will contain no node IDs.
>  # The ReusedExchange node details in the Explain plan will refer to an 
> unknown node. Example below.
> {code:java}
> (2775) ReusedExchange [Reuses operator id: unknown]{code}
>  # The child exchange and its subtree may be missing from the Explain text 
> entirely, with no node details or tree string shown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42763) Upgrade ZooKeeper from 3.6.3 to 3.6.4

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699298#comment-17699298
 ] 

Apache Spark commented on SPARK-42763:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/40384

> Upgrade ZooKeeper from 3.6.3 to 3.6.4
> -
>
> Key: SPARK-42763
> URL: https://issues.apache.org/jira/browse/SPARK-42763
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42762:


Assignee: (was: Apache Spark)

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42762:


Assignee: Apache Spark

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699280#comment-17699280
 ] 

Apache Spark commented on SPARK-42762:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/40383

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42762) Improve Logging for disconnects during exec id request

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699279#comment-17699279
 ] 

Apache Spark commented on SPARK-42762:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/40383

> Improve Logging for disconnects during exec id request
> --
>
> Key: SPARK-42762
> URL: https://issues.apache.org/jira/browse/SPARK-42762
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.0
>Reporter: Holden Karau
>Priority: Minor
>
> Improve Logging for disconnects during exec id request to simplify our 
> network logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42679) createDataFrame doesn't work with non-nullable schema.

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699271#comment-17699271
 ] 

Apache Spark commented on SPARK-42679:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40382

> createDataFrame doesn't work with non-nullable schema.
> --
>
> Key: SPARK-42679
> URL: https://issues.apache.org/jira/browse/SPARK-42679
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> spark.createDataFrame won't work with a non-nullable schema, as shown below:
> {code:java}
> from pyspark.sql.types import *
> schema_false = StructType([StructField("id", IntegerType(), False)])
> spark.createDataFrame([[1]], schema=schema_false)
> Traceback (most recent call last):
> ...
> pyspark.errors.exceptions.connect.AnalysisException: 
> [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's 
> required to be non-nullable.{code}
> whereas it works fine with a nullable schema:
> {code:java}
> schema_true = StructType([StructField("id", IntegerType(), True)])
> spark.createDataFrame([[1]], schema=schema_true)
> DataFrame[id: int]{code}
>  
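
A hedged workaround sketch while the fix is pending, not taken from this 
ticket: build the frame with a nullable copy of the schema so createDataFrame 
succeeds over Connect, keeping the intended non-nullable definition around for 
validation.

{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType

intended = StructType([StructField("id", IntegerType(), False)])

# Relax every field to nullable so createDataFrame accepts the data.
relaxed = StructType(
    [StructField(f.name, f.dataType, True) for f in intended.fields]
)

df = spark.createDataFrame([[1]], schema=relaxed)

# Check by hand that no nulls appear where the intended schema forbids them.
for name in (f.name for f in intended.fields if not f.nullable):
    assert df.filter(df[name].isNull()).count() == 0
{code}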



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42761:


Assignee: (was: Apache Spark)

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42761:


Assignee: Apache Spark

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42761) kubernetes-client from 6.4.1 to 6.5.0

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699269#comment-17699269
 ] 

Apache Spark commented on SPARK-42761:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/40381

> kubernetes-client from 6.4.1 to 6.5.0
> -
>
> Key: SPARK-42761
> URL: https://issues.apache.org/jira/browse/SPARK-42761
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Upgrade fabric8:kubernetes-client from 6.4.1 to 6.5.0
> [CVE-2022-1471|https://www.cve.org/CVERecord?id=CVE-2022-1471]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699264#comment-17699264
 ] 

Apache Spark commented on SPARK-42760:
--

User '1511351836' has created a pull request for this issue:
https://github.com/apache/spark/pull/40380

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using PySpark. The number of partitions of the join result DataFrame is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> {code:python}
> print(spark.version)
> 
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> 
> example_shuffle_partitions()
> {code}
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, in the latest 3.3.2 it prints out the following:
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1
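
One plausible explanation, offered as an assumption rather than a confirmed 
diagnosis: AQE is enabled by default since Spark 3.2, and its partition 
coalescing can merge the small post-join shuffle partitions down to one. 
Re-running the repro with coalescing (or AQE) disabled, as sketched below, 
should show whether that is the cause.

{code:python}
# Keep AQE but stop it from merging small shuffle partitions.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

# Or switch AQE off entirely for a direct comparison with 3.0.3.
# spark.conf.set("spark.sql.adaptive.enabled", "false")

example_shuffle_partitions()  # re-run the repro defined above
{code}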



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42760:


Assignee: Apache Spark

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, 
> local mode
>Reporter: binyang
>Assignee: Apache Spark
>Priority: Major
>
> I am using PySpark. The number of partitions of the join result DataFrame is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> {code:python}
> print(spark.version)
> 
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> 
> example_shuffle_partitions()
> {code}
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, in the latest 3.3.2 it prints out the following:
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42760) The partition of result data frame of join is always 1

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42760:


Assignee: (was: Apache Spark)

> The partition of result data frame of join is always 1
> --
>
> Key: SPARK-42760
> URL: https://issues.apache.org/jira/browse/SPARK-42760
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.3.2
> Environment: standard Spark 3.0.3/3.3.2, used in a Jupyter notebook, 
> local mode
>Reporter: binyang
>Priority: Major
>
> I am using PySpark. The number of partitions of the join result DataFrame is always 1.
> Here is my code from 
> https://stackoverflow.com/questions/51876281/is-partitioning-retained-after-a-spark-sql-join
>  
> {code:python}
> print(spark.version)
> 
> def example_shuffle_partitions(data_partitions=10, shuffle_partitions=4):
>     spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
>     spark.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
>     df1 = spark.range(1, 1000).repartition(data_partitions)
>     df2 = spark.range(1, 2000).repartition(data_partitions)
>     df3 = spark.range(1, 3000).repartition(data_partitions)
>     print("Data partitions is: {}. Shuffle partitions is {}".format(data_partitions, shuffle_partitions))
>     print("Data partitions before join: {}".format(df1.rdd.getNumPartitions()))
>     df = (df1.join(df2, df1.id == df2.id)
>           .join(df3, df1.id == df3.id))
>     print("Data partitions after join : {}".format(df.rdd.getNumPartitions()))
> 
> example_shuffle_partitions()
> {code}
>  
> In Spark 3.0.3, it prints out:
> 3.0.3
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 4
> However, it prints out the following in the latest 3.3.2
> 3.3.2
> Data partitions is: 10. Shuffle partitions is 4
> Data partitions before join: 10
> Data partitions after join : 1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: (was: Apache Spark)

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42759) Avoid duplicated `build/apache-maven` install when target already exists

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: Apache Spark

> Avoid duplicated `build/apache-maven` install when target already exists
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699229#comment-17699229
 ] 

Apache Spark commented on SPARK-42759:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40379

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699228#comment-17699228
 ] 

Apache Spark commented on SPARK-42759:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40379

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42759) Avoid repeated downloads of maven.tar.gz

2023-03-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42759:


Assignee: (was: Apache Spark)

> Avoid repeated downloads of maven.tar.gz
> 
>
> Key: SPARK-42759
> URL: https://issues.apache.org/jira/browse/SPARK-42759
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


