[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2024-05-14 Thread gaoyajun02 (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846293#comment-17846293 ]

gaoyajun02 commented on SPARK-42694:


Have you enabled push-based shuffle?
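(For reference: push-based shuffle is controlled by spark.shuffle.push.enabled, 
which only exists upstream from Spark 3.2 onward, so a 3.1.1 deployment would 
only have it via a vendor backport. A quick way to check from a spark-sql 
session — SET with no value just prints the current setting:)

{code:sql}
-- Print the current values; "<undefined>" means the conf was never set.
SET spark.shuffle.push.enabled;
SET spark.shuffle.service.enabled;
{code}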

> Data duplication and loss occur after executing 'insert overwrite...' in 
> Spark 3.1.1
> 
>
> Key: SPARK-42694
> URL: https://issues.apache.org/jira/browse/SPARK-42694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Hadoop 3.2.1
> Hive 3.1.2
>Reporter: FengZhou
>Priority: Critical
>  Labels: shuffle, spark
> Attachments: image-2023-03-07-15-59-08-818.png, 
> image-2023-03-07-15-59-27-665.png
>
>
> We are currently using Spark 3.1.1 in our production environment. We have 
> noticed that occasionally, after executing 'insert overwrite ... select', 
> the resulting data is inconsistent, with some rows duplicated or lost. The 
> issue is intermittent and seems more prevalent on large tables with tens of 
> millions of records.
> We compared the execution plans for two runs of the same SQL and found them 
> identical. When the SQL ran correctly, the amount of data written and read 
> during the shuffle stage matched; when the problem occurred, the shuffle 
> write and read amounts differed. See the attached screenshots of the shuffle 
> write/read data.
>  
> Normal SQL:
> !image-2023-03-07-15-59-08-818.png!
> SQL with issues:
> !image-2023-03-07-15-59-27-665.png!
>  
> Is this problem caused by a bug in version 3.1.1, specifically (SPARK-34534): 
> 'New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss 
> or correctness'? Or is it caused by something else? What could be the root 
> cause of this problem?
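For context, the statement shape involved is roughly the following (table and 
column names here are hypothetical, not taken from the report), together with 
the kind of row-count cross-check that exposes the mismatch:

{code:sql}
-- Hypothetical shape of the failing job:
INSERT OVERWRITE TABLE db.target_table
SELECT key, COUNT(*) AS cnt
FROM db.source_table
GROUP BY key;

-- After the job reports success, re-count independently; if the two numbers
-- differ, rows were duplicated or lost during the shuffle:
SELECT
  (SELECT COUNT(*) FROM db.target_table)            AS written_rows,
  (SELECT COUNT(DISTINCT key) FROM db.source_table) AS expected_rows;
{code}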



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2024-03-06 Thread liukai (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824279#comment-17824279 ]

liukai commented on SPARK-42694:


I also encountered this issue on Spark 3.1.1. Setting 
spark.shuffle.useOldFetchProtocol to true works around it, but I have not 
found the root cause.
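For anyone wanting to try the same workaround: spark.shuffle.useOldFetchProtocol 
is a standard Spark 3.x conf (default false) that falls back to the pre-3.0 
shuffle fetch protocol. It is read at application start, so it belongs in 
spark-defaults.conf or in a --conf flag at submit time, not in a runtime SET. 
A minimal sketch:

{code}
# spark-defaults.conf: revert to the old shuffle fetch protocol
spark.shuffle.useOldFetchProtocol  true
{code}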




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-11-16 Thread FengZhou (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787074#comment-17787074 ]

FengZhou commented on SPARK-42694:
--

No. Everything looks fine; all tasks complete successfully.




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-08-16 Thread Shuaipeng Lee (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755329#comment-17755329 ]

Shuaipeng Lee commented on SPARK-42694:
---

Are there any exceptions when the data loss occurs?




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-03-07 Thread FengZhou (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697682#comment-17697682 ]

FengZhou commented on SPARK-42694:
--

[~yumwang] 
This version has been running in our production environment for over a year, 
so upgrading now would have a significant impact. Upgrading to 3.3.2 would 
require retesting and revalidating the associated Ranger and Spark permission 
plugins, so the only feasible short-term upgrade is to 3.1.3. However, since 
the root cause is unknown and the issue only occurs occasionally, it's unclear 
whether 3.1.3 would actually fix it, which is what confuses us.




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-03-07 Thread FengZhou (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697683#comment-17697683 ]

FengZhou commented on SPARK-42694:
--

[~bjornjorgensen] 
Since an upgrade would have a significant impact, is there a faster way to 
locate and fix the problem without upgrading?




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-03-07 Thread Bjørn Jørgensen (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697614#comment-17697614 ]

Bjørn Jørgensen commented on SPARK-42694:
-

Spark 3.1 [is EOL|https://github.com/apache/spark-website/commit/40f58f884bd258d6a332d583dc91c717b6b461f0]. 
Try Spark 3.3.2 or 3.2.3.




[jira] [Commented] (SPARK-42694) Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-03-07 Thread Yuming Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697316#comment-17697316 ]

Yuming Wang commented on SPARK-42694:
-

Could you upgrade to Spark 3.1.3 or Spark 3.3.2?
