[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Affects Version/s: 3.2.3
                       3.2.2
                       3.1.3

> Spark Repartition Task Field Retry Cause Data Duplication
> ---------------------------------------------------------
>
>                 Key: SPARK-48956
>                 URL: https://issues.apache.org/jira/browse/SPARK-48956
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.1.3, 3.2.1, 3.2.2, 3.2.3
>            Reporter: xuanzhiang
>            Priority: Major
>         Attachments: image-2024-07-21-18-21-33-888.png, image-2024-07-21-18-22-04-665.png, image-2024-07-22-10-00-45-793.png, image-2024-07-22-14-47-50-773.png
>
> This issue appears similar to [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207].

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956 ]

xuanzhiang deleted comment on SPARK-48956:
-------------------------------------------

was (Author: JIRAUSER295364):
Metric info error. The actual output was 35351985, but the job produced duplicate data. I will try to reproduce the problem and provide test cases.
[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668 ]

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:51 AM:
-------------------------------------------------------------
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run. I think a retried task then read a partition whose contents were not fixed, which caused the data duplication. I believe a change in parallelism caused the partition change.

was (Author: JIRAUSER295364):
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run. I think a retried task then read a partition whose contents were not fixed, which caused the data duplication.
[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668 ]

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:51 AM:
-------------------------------------------------------------
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run. I think a retried task then read a partition whose contents were not fixed, which caused the data duplication. I believe a change in parallelism caused the partition change. Should I turn off dynamic allocation?

was (Author: JIRAUSER295364):
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run. I think a retried task then read a partition whose contents were not fixed, which caused the data duplication. I believe a change in parallelism caused the partition change.
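A side note on the question above about turning off dynamic allocation: the SPARK-23207 fix introduced a deterministic local sort before round-robin repartition precisely so that retried tasks see the same row order. The fragment below is an illustrative spark-defaults.conf sketch, not a confirmed fix for SPARK-48956; the property names are standard Spark configuration keys, but defaults and availability vary by version, so verify against your release's documentation.

```properties
# Illustrative spark-defaults.conf fragment (not a confirmed fix for SPARK-48956).
# Keep the deterministic local sort before round-robin repartition
# (default true on releases that contain the SPARK-23207 fix):
spark.sql.execution.sortBeforeRepartition  true
# Disabling dynamic allocation removes executor churn (and thus shuffle
# fetch failures from lost executors) as a trigger, at the cost of a
# fixed resource footprint:
spark.dynamicAllocation.enabled            false
```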
[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668 ]

xuanzhiang edited comment on SPARK-48956 at 7/22/24 6:49 AM:
-------------------------------------------------------------
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run. I think a retried task then read a partition whose contents were not fixed, which caused the data duplication.

was (Author: JIRAUSER295364):
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run.

!image-2024-07-22-14-47-50-773.png!
[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867668#comment-17867668 ]

xuanzhiang commented on SPARK-48956:
------------------------------------
I found out that the job failed because of shuffle: a task failed with a shuffle data fetch failure, so the previous stage had to be re-run.

!image-2024-07-22-14-47-50-773.png!
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: image-2024-07-22-14-47-50-773.png
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: (was: image-2024-07-22-09-59-31-004.png)
[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867631#comment-17867631 ]

xuanzhiang commented on SPARK-48956:
------------------------------------
!image-2024-07-22-10-00-45-793.png!

Metric info error. The actual output was 35351985, but the job produced duplicate data. I will try to reproduce the problem and provide test cases.
[jira] [Comment Edited] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867631#comment-17867631 ]

xuanzhiang edited comment on SPARK-48956 at 7/22/24 2:02 AM:
-------------------------------------------------------------
Metric info error. The actual output was 35351985, but the job produced duplicate data. I will try to reproduce the problem and provide test cases.

was (Author: JIRAUSER295364):
!image-2024-07-22-10-00-45-793.png!

Metric info error. The actual output was 35351985, but the job produced duplicate data. I will try to reproduce the problem and provide test cases.
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: image-2024-07-22-10-00-45-793.png
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: image-2024-07-22-09-59-31-004.png
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: image-2024-07-21-18-22-04-665.png
[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956 ]

xuanzhiang deleted comment on SPARK-48956:
-------------------------------------------

was (Author: JIRAUSER295364):
!image-2024-07-21-18-22-04-665.png!
[jira] [Commented] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867595#comment-17867595 ]

xuanzhiang commented on SPARK-48956:
------------------------------------
!image-2024-07-21-18-22-04-665.png!
[jira] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956 ]

xuanzhiang deleted comment on SPARK-48956:
-------------------------------------------

was (Author: JIRAUSER295364):
!image-2024-07-21-18-21-33-888.png!
[jira] [Updated] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
     [ https://issues.apache.org/jira/browse/SPARK-48956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-48956:
-------------------------------
    Attachment: image-2024-07-21-18-21-33-888.png
[jira] [Created] (SPARK-48956) Spark Repartition Task Field Retry Cause Data Duplication
xuanzhiang created SPARK-48956:
----------------------------------

             Summary: Spark Repartition Task Field Retry Cause Data Duplication
                 Key: SPARK-48956
                 URL: https://issues.apache.org/jira/browse/SPARK-48956
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 3.2.1
            Reporter: xuanzhiang

This issue appears similar to [SPARK-23207|https://issues.apache.org/jira/browse/SPARK-23207].
[jira] [Comment Edited] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers
     [ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867593#comment-17867593 ]

xuanzhiang edited comment on SPARK-23207 at 7/21/24 10:01 AM:
--------------------------------------------------------------
[~igor.berman] So are we. We are running on YARN with dynamic allocation. Task retries generate duplicate data.

was (Author: JIRAUSER295364):
[~igor.berman] So are we. We are running on YARN with dynamic allocation.

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> -------------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>            Reporter: Xingbo Jiang
>            Assignee: Xingbo Jiang
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.1.4, 2.2.3, 2.3.0
>
> Currently shuffle repartition uses RoundRobinPartitioning; the generated result is nondeterministic because the order of the input rows is not determined.
> The bug can be triggered when a repartition call follows a shuffle (which leads to non-deterministic row ordering), as the pattern below shows:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition stage are retried and generate an inconsistent ordering, and some retried tasks of the result stage then generate different data.
> The following code returns 931532 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
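The mechanism described above can be simulated without a Spark cluster. Round-robin partitioning assigns rows by position, so if the input order changes between a task's first attempt and its retry, the retried partition contains a different set of rows, and the combined output gains duplicates while losing other rows. The sketch below is purely illustrative (plain Python, no Spark API); the partitioner is a hypothetical stand-in for RoundRobinPartitioning.

```python
import random

def round_robin_partition(rows, num_partitions):
    """Assign rows to partitions by position, like round-robin partitioning."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

rows = list(range(20))
first_run = round_robin_partition(rows, 4)

# A fetch failure forces partition 0 to be recomputed, but the upstream
# stage now delivers its rows in a different (non-deterministic) order.
reordered = rows[:]
random.seed(42)
random.shuffle(reordered)
retry_run = round_robin_partition(reordered, 4)

# The job's output: partitions 1-3 from the first attempt plus the
# retried partition 0 computed from the reordered input.
combined = retry_run[0] + first_run[1] + first_run[2] + first_run[3]
duplicates = len(combined) - len(set(combined))
print(duplicates)  # typically > 0: some rows appear twice, others are lost
```

Each attempt on its own partitions all 20 rows correctly; only the mix of attempts is inconsistent, which is why `res.distinct().count()` in the Scala repro comes out wrong.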
[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers
     [ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867593#comment-17867593 ]

xuanzhiang commented on SPARK-23207:
------------------------------------
[~igor.berman] So are we. We are running on YARN with dynamic allocation.
[jira] [Commented] (SPARK-42217) Support lateral column alias in queries with Window
     [ https://issues.apache.org/jira/browse/SPARK-42217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804201#comment-17804201 ]

xuanzhiang commented on SPARK-42217:
------------------------------------
Hello, does Spark 3.4.2 support lateral column alias (LCA) with Window?

> Support lateral column alias in queries with Window
> ---------------------------------------------------
>
>                 Key: SPARK-42217
>                 URL: https://issues.apache.org/jira/browse/SPARK-42217
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Xinyi Yu
>            Assignee: Xinyi Yu
>            Priority: Major
>             Fix For: 3.4.0
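For context, a lateral column alias lets a later item in a SELECT list reference an earlier alias from the same list; this sub-task concerns the case where the aliased expression is a window function. A hypothetical query of the kind being asked about (the table and column names are invented for illustration):

```sql
-- Hypothetical example: `rnk` is defined by a window function and then
-- referenced laterally later in the same SELECT list.
SELECT
  name,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk,
  rnk + 1 AS next_rnk   -- lateral reference to the window alias
FROM employees;
```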
[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower than percentile in spark3
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-42227:
-------------------------------
    Summary: Use approx_percentile function running slower than percentile in spark3  (was: Use approx_percentile function running slower in spark3 than spark2)

> Use approx_percentile function running slower than percentile in spark3
> ------------------------------------------------------------------------
>
>                 Key: SPARK-42227
>                 URL: https://issues.apache.org/jira/browse/SPARK-42227
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xuanzhiang
>            Priority: Major
>         Attachments: percentile+objectHashAggregateExec.png, percentile+objectHashAggregateExec_shuffle_task.png, percentile_approx+objectHashAggregateExec.png, percentile_approx+objectHashAggregateExec_shuffle_task.png
>
> approx_percentile(end_ts-start_ts,0.9) cost_p90
> In Spark 3 this runs through ObjectHashAggregate, but its shuffle is very slow. When I use percentile instead, it becomes fast. I don't know the reason; I would expect approx_percentile to be faster.
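Independent of Spark's ObjectHashAggregateExec internals (which the sketch below does not model), the exact-versus-approximate trade-off behind the reporter's expectation can be illustrated in plain Python: an exact percentile must see every value, while an approximation can work from a bounded summary. The fixed-size random sample used here is a crude, hypothetical stand-in for the quantile sketches that approx_percentile-style functions maintain, not Spark's actual algorithm.

```python
import random
import statistics

random.seed(0)
data = [random.uniform(0.0, 1000.0) for _ in range(100_000)]

# Exact 90th percentile: requires sorting/holding all 100k values.
exact_p90 = statistics.quantiles(data, n=10)[-1]

# Crude approximation from a bounded summary (a 1k-row sample) -- an
# illustrative stand-in for a real quantile sketch, NOT Spark's algorithm.
sample = random.sample(data, 1_000)
approx_p90 = statistics.quantiles(sample, n=10)[-1]

print(exact_p90, approx_p90)  # both should land near 900 for this data
```

The approximation trades a small, bounded error for a much smaller state to aggregate and shuffle, which is why approx_percentile is normally expected to be the faster of the two; the ticket reports the opposite, which is what makes it a bug report.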
[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889 ]

xuanzhiang edited comment on SPARK-42227 at 2/13/23 11:16 AM:
--------------------------------------------------------------
spark version: 3.2.1
hadoop version: 3.0.0
job info:
!percentile_approx+objectHashAggregateExec.png!
!percentile+objectHashAggregateExec.png!
shuffle read task info:
!percentile_approx+objectHashAggregateExec_shuffle_task.png!
!percentile+objectHashAggregateExec_shuffle_task.png!

was (Author: JIRAUSER295364):
spark version: 3.2.1
hadoop version: 3.0.0
job info:
!percentile_approx+objectHashAggregateExec.png!
shuffle read task info:
!percentile_approx+objectHashAggregateExec_shuffle_task.png!
[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889 ]

xuanzhiang edited comment on SPARK-42227 at 2/13/23 11:16 AM:
--------------------------------------------------------------
spark version: 3.2.1
hadoop version: 3.0.0
Here are the job info and the shuffle read task info.

was (Author: JIRAUSER295364):
spark version: 3.2.1
hadoop version: 3.0.0
job info:
!percentile_approx+objectHashAggregateExec.png!
!percentile+objectHashAggregateExec.png!
shuffle read task info:
!percentile_approx+objectHashAggregateExec_shuffle_task.png!
!percentile+objectHashAggregateExec_shuffle_task.png!
[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-42227:
-------------------------------
    Attachment: percentile_approx+objectHashAggregateExec_shuffle_task.png
                percentile+objectHashAggregateExec_shuffle_task.png
[jira] [Updated] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xuanzhiang updated SPARK-42227:
-------------------------------
    Attachment: percentile_approx+objectHashAggregateExec.png
                percentile+objectHashAggregateExec.png
[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
     [ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687889#comment-17687889 ]

xuanzhiang commented on SPARK-42227:
------------------------------------
spark version: 3.2.1
hadoop version: 3.0.0
job info:
!percentile_approx+objectHashAggregateExec.png!
shuffle read task info:
!percentile_approx+objectHashAggregateExec_shuffle_task.png!
[jira] [Comment Edited] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
[ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687852#comment-17687852 ] xuanzhiang edited comment on SPARK-42227 at 2/13/23 9:58 AM: - [~gurwls223] The percentile is thirty percent faster than approx_percentile, which doesn't make sense. I see approx_percentile leaves a long-running shuffle read task, but percentile is normal. I'll try to reproduce the problem later.
[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
[ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687852#comment-17687852 ] xuanzhiang commented on SPARK-42227: The percentile is thirty percent faster than approx_percentile, which doesn't make sense. I see approx_percentile leaves a long-running shuffle read task, but percentile is normal. I'll try to reproduce the problem later.
[jira] [Comment Edited] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683640#comment-17683640 ] xuanzhiang edited comment on SPARK-40499 at 2/3/23 3:05 AM: Now we simply use PERCENTILE; it uses HashAggregate and shuffles normally. was (Author: JIRAUSER295364): Now when we use PERCENTILE_APPROX we need to disable objectHashAggregate, or we use PERCENTILE instead, which uses HashAggregate and shuffles normally. > Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 > > > Key: SPARK-40499 > URL: https://issues.apache.org/jira/browse/SPARK-40499 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.2.1 > Environment: hadoop: 3.0.0 > spark: 2.4.0 / 3.2.1 > shuffle: spark 2.4.0 >Reporter: xuanzhiang >Priority: Major > Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png > > > spark.sql( > s""" > |SELECT > | Info , > | PERCENTILE_APPROX(cost,0.5) cost_p50, > | PERCENTILE_APPROX(cost,0.9) cost_p90, > | PERCENTILE_APPROX(cost,0.95) cost_p95, > | PERCENTILE_APPROX(cost,0.99) cost_p99, > | PERCENTILE_APPROX(cost,0.999) cost_p999 > |FROM > | textData > |""".stripMargin) > * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old shuffle, 140 MB of shuffle data takes 3 hours. > * If we upgrade the shuffle, will we get a performance regression?
[jira] [Commented] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683640#comment-17683640 ] xuanzhiang commented on SPARK-40499: Now when we use PERCENTILE_APPROX we need to disable objectHashAggregate, or we use PERCENTILE instead, which uses HashAggregate and shuffles normally.
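The workaround in the comment above can be sketched as a configuration change. This is an illustration, not code from the thread: it assumes a running `SparkSession` named `spark` and the `textData` table from the issue, and uses the standard Spark SQL flag `spark.sql.execution.useObjectHashAggregateExec`, which controls whether the planner chooses `ObjectHashAggregateExec`.

```scala
// Sketch of the reported workaround (assumes a SparkSession `spark`).

// Option 1: disable the object-hash aggregate path so percentile_approx
// falls back to the sort-based aggregate.
spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "false")
spark.sql("SELECT Info, PERCENTILE_APPROX(cost, 0.9) AS cost_p90 FROM textData GROUP BY Info")

// Option 2: use the exact PERCENTILE function, which the reporter observed
// to plan a HashAggregate and shuffle normally.
spark.sql("SELECT Info, PERCENTILE(cost, 0.9) AS cost_p90 FROM textData GROUP BY Info")
```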
[jira] [Resolved] (SPARK-42292) Spark SQL not use hive partition info
[ https://issues.apache.org/jira/browse/SPARK-42292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang resolved SPARK-42292. Resolution: Fixed When I set spark.sql.hive.convertMetastoreParquet=true, Spark 3 uses the built-in Parquet reader. > Spark SQL not use hive partition info > - > > Key: SPARK-42292 > URL: https://issues.apache.org/jira/browse/SPARK-42292 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xuanzhiang >Priority: Major > > I use Spark 3 to count partitions, like this: > table a is an external Parquet table with 3 partition columns (year, month, > day). > Query SQL: "select distinct month , day from a where year = '2022' " > I expected Spark to find the Hive metadata and use the partition info, but it loads all > "year = '2022'" partition data. > Spark 2.4 uses LocalTableScanExec, but Spark 3 uses HiveTableRelation and > scans the Hive Parquet files.
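The resolution above amounts to letting Spark replace the Hive SerDe relation with its native Parquet datasource, where the catalog's partition metadata drives pruning. A minimal sketch, assuming a Hive-enabled `SparkSession` named `spark` and the table `a` from the issue; `spark.sql.optimizer.metadataOnly` is an extra, optional knob (off by default and deprecated since 3.0), not something mentioned in this thread:

```scala
// Sketch (assumes spark = SparkSession.builder.enableHiveSupport().getOrCreate()).

// Convert Hive metastore Parquet tables to Spark's native Parquet source,
// so partition pruning uses the metastore's partition info.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

// Optionally answer partition-column-only queries from the metastore alone
// (deprecated in 3.0+; can give wrong results on empty partitions).
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

spark.sql("SELECT DISTINCT month, day FROM a WHERE year = '2022'").show()
```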
[jira] [Created] (SPARK-42292) Spark SQL not use hive partition info
xuanzhiang created SPARK-42292: -- Summary: Spark SQL not use hive partition info Key: SPARK-42292 URL: https://issues.apache.org/jira/browse/SPARK-42292 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xuanzhiang I use Spark 3 to count partitions, like this: table a is an external Parquet table with 3 partition columns (year, month, day). Query SQL: "select distinct month , day from a where year = '2022' " I expected Spark to find the Hive metadata and use the partition info, but it loads all "year = '2022'" partition data. Spark 2.4 uses LocalTableScanExec, but Spark 3 uses HiveTableRelation and scans the Hive Parquet files.
[jira] [Created] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
xuanzhiang created SPARK-42227: -- Summary: Use approx_percentile function running slower in spark3 than spark2 Key: SPARK-42227 URL: https://issues.apache.org/jira/browse/SPARK-42227 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xuanzhiang approx_percentile(end_ts-start_ts,0.9) cost_p90 In Spark 3 this query uses the objectHashAggregate method, but its shuffle is very slow. When I use percentile instead, it becomes fast. I don't know the reason; I would expect approx_percentile to be faster.
[jira] [Commented] (SPARK-41684) spark3 read the one partition data and write to anthor partition cause error
[ https://issues.apache.org/jira/browse/SPARK-41684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681657#comment-17681657 ] xuanzhiang commented on SPARK-41684: You can set spark.sql.hive.convertMetastoreParquet=false. > spark3 read the one partition data and write to anthor partition cause error > > > Key: SPARK-41684 > URL: https://issues.apache.org/jira/browse/SPARK-41684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: sinlang >Priority: Major > > Spark 3 reads data from one partition and writes it to another partition of the same table, causing an error: {code:java} > 1 create temporary view t1 : > select * from jt_ods.ods_ebi_stm_retail_settle_detail_full_di > where dt = '2022-12-21' > union all ( > select * from jt_ods.ods_ebi_stm_retail_settle_detail_full_df as i > where i.dt = '2022-12-20' > and not exists(select 1 from jt_ods.ods_ebi_stm_retail_settle_detail_full_di > as d where d.dt = '2022-12-21' and i.id = d.id)) > 2 insert data : > insert overwrite table > jt_ods.ods_ebi_stm_retail_settle_detail_full_df partition(dt = '2022-12-21') > select * from t1 distribute by rand() {code} {code:java} > 2022-12-22 16:29:48 Driver ERROR > org.apache.spark.deploy.yarn.ApplicationMaster > User class threw exception: org.apache.spark.sql.AnalysisException: Cannot > overwrite a path that is also being read from. > at > org.apache.spark.sql.errors.QueryCompilationErrors$.cannotOverwritePathBeingReadFromError(QueryCompilationErrors.scala:1834) > at > org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:980) > at > org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:221) > {code}
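The "Cannot overwrite a path that is also being read from" check fires because, on the native datasource path, the insert's target directory is also an input of the query. A sketch of the workaround from the comment above, plus a common alternative that is my own assumption (not from this thread): materialize the view into a scratch table first so the read and the overwrite no longer share a path. Assumes a `SparkSession` named `spark` and the view `t1` from the issue; `tmp_t1` is a hypothetical scratch table name.

```scala
// Workaround from the comment: keep the Hive SerDe read/write path,
// which does not run the read-path conflict check.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Alternative (assumption): break the read-write cycle by materializing
// t1 into a scratch table, then overwriting the target partition from it.
spark.sql("CREATE TABLE tmp_t1 AS SELECT * FROM t1")
spark.sql("""INSERT OVERWRITE TABLE jt_ods.ods_ebi_stm_retail_settle_detail_full_df
             PARTITION (dt = '2022-12-21')
             SELECT * FROM tmp_t1 DISTRIBUTE BY rand()""")
```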
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Blocker (was: Major)
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Major (was: Blocker)
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Priority: Blocker (was: Minor)
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Environment: hadoop: 3.0.0 spark: 2.4.0 / 3.2.1 shuffle: spark 2.4.0 was: hadoop 3.0.0 spark2.4.0 / spark3.2.1 shuffle: spark2.4.0
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Attachment: spark3.2-shuffle-data.png
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Description: spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old shuffle, 140 MB of shuffle data takes 3 hours. * If we upgrade the shuffle, will we get a performance regression? was: spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old shuffle, 140 MB of shuffle data takes 3 hours.
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Attachment: spark2.4-shuffle-data.png
[jira] [Updated] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuanzhiang updated SPARK-40499: --- Environment: hadoop 3.0.0 spark2.4.0 / spark3.2.1 shuffle: spark2.4.0 was:!image-2022-09-20-16-57-01-881.png!
[jira] [Created] (SPARK-40499) Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
xuanzhiang created SPARK-40499: -- Summary: Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0 Key: SPARK-40499 URL: https://issues.apache.org/jira/browse/SPARK-40499 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.2.1 Environment: !image-2022-09-20-16-57-01-881.png! Reporter: xuanzhiang spark.sql( s""" |SELECT | Info , | PERCENTILE_APPROX(cost,0.5) cost_p50, | PERCENTILE_APPROX(cost,0.9) cost_p90, | PERCENTILE_APPROX(cost,0.95) cost_p95, | PERCENTILE_APPROX(cost,0.99) cost_p99, | PERCENTILE_APPROX(cost,0.999) cost_p999 |FROM | textData |""".stripMargin) * When we used Spark 2.4.0, aggregation adopted objHashAggregator and stage 2 pulled shuffle data very quickly. But when we use Spark 3.2.1 with the old shuffle, 140 MB of shuffle data takes 3 hours.
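One knob worth knowing about for the query above: PERCENTILE_APPROX takes an optional third `accuracy` argument (default 10000). A larger value gives tighter error but a bigger per-group sketch that must be serialized and shuffled; a smaller one shrinks the shuffled state. The following is a hedged sketch of how the reported query could trade accuracy for shuffle size; it is an illustration under that assumption, not a fix from this thread, and assumes a `SparkSession` named `spark` and the `textData` table:

```scala
// Sketch: pass an explicit accuracy (default 10000) to PERCENTILE_APPROX
// to reduce the size of the per-group quantile sketch that gets shuffled.
spark.sql(
  s"""
     |SELECT
     |  Info,
     |  PERCENTILE_APPROX(cost, 0.5,   1000) AS cost_p50,
     |  PERCENTILE_APPROX(cost, 0.9,   1000) AS cost_p90,
     |  PERCENTILE_APPROX(cost, 0.99,  1000) AS cost_p99,
     |  PERCENTILE_APPROX(cost, 0.999, 1000) AS cost_p999
     |FROM textData
     |GROUP BY Info
     |""".stripMargin)
```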