[ 
https://issues.apache.org/jira/browse/SPARK-42069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarred Li updated SPARK-42069:
------------------------------
    Description: 
When writing a table with shuffled data and a non-deterministic function, data 
may be duplicated or lost when a task attempt is retried.

For example:
{quote}
insert overwrite table target_table partition(ds)
select ... from a join b join c...
distribute by ds, cast(rand()*10 as int)
{quote}

Because rand() is non-deterministic, the partitioning of the shuffle data may 
change when the task is retried. A row that is already present in another 
shuffle output might be distributed again to a new shuffle output (causing 
data duplication), or a row might receive no shuffle output at all because its 
designated shuffle output has already finished (causing data loss).
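The failure mode above can be shown with a minimal Python simulation (not Spark code; the row set, seeds, and the set of already-consumed partitions are illustrative assumptions): each attempt re-randomizes the row-to-partition assignment, and combining partitions consumed from the first attempt with partitions re-produced by the retry yields both duplicated and lost rows.

```python
import random

def assign_partitions(rows, num_partitions, seed):
    # Non-deterministic partitioning: like cast(rand()*10 as int), the
    # assignment depends on randomness that differs between task attempts.
    rng = random.Random(seed)
    return {row: rng.randrange(num_partitions) for row in rows}

def simulate_retry(rows, num_partitions, consumed_before_retry, seed1, seed2):
    """Combine shuffle outputs across a task retry.

    Partitions in `consumed_before_retry` were already fetched from the
    first attempt; the remaining partitions are fetched from the retried
    attempt, which re-randomizes the assignment.
    """
    first = assign_partitions(rows, num_partitions, seed1)
    second = assign_partitions(rows, num_partitions, seed2)
    output = []
    for row in rows:
        if first[row] in consumed_before_retry:
            output.append(row)   # delivered by the first attempt
        if second[row] not in consumed_before_retry:
            output.append(row)   # delivered (again, or at all) by the retry
    return output

rows = list(range(100))
out = simulate_retry(rows, 10, consumed_before_retry={0, 1, 2, 3, 4},
                     seed1=1, seed2=2)
dup = len(out) - len(set(out))    # rows emitted twice (duplication)
lost = len(rows) - len(set(out))  # rows never emitted (loss)
```

With re-randomized assignment, a row whose first-attempt partition was already consumed but whose retry partition is also consumed-side appears twice, while a row whose partitions fall the other way appears zero times; a deterministic distribute-by expression makes both attempts agree and avoids both outcomes.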


  was:
When writing a table with shuffled data and a non-deterministic function, data 
may be duplicated or lost when a task attempt is retried.

 

For example:
{quote}insert overwrite table target_table partition(ds)
select ... from a join b join c...
distribute by ds, cast(rand()*10 as int){quote}


> Data duplicate or data lost with non-deterministic function
> -----------------------------------------------------------
>
>                 Key: SPARK-42069
>                 URL: https://issues.apache.org/jira/browse/SPARK-42069
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.1.0, 3.2.3
>            Reporter: Jarred Li
>            Priority: Major
>
> When writing a table with shuffled data and a non-deterministic function, 
> data may be duplicated or lost when a task attempt is retried.
> For example:
> {quote}
> insert overwrite table target_table partition(ds)
> select ... from a join b join c...
> distribute by ds, cast(rand()*10 as int)
> {quote}
> Because rand() is non-deterministic, the partitioning of the shuffle data may 
> change when the task is retried. A row that is already present in another 
> shuffle output might be distributed again to a new shuffle output (causing 
> data duplication), or a row might receive no shuffle output at all because 
> its designated shuffle output has already finished (causing data loss).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
