Hi All, I ran into a data duplication issue when writing a table with shuffled 
data and a non-deterministic function.

For example:

insert overwrite table target_table partition(ds)
select ... from a join b join c...
distribute by ds, cast(rand()*10 as int)

Because rand() is non-deterministic, a retried task can assign rows to 
different shuffle partitions than the original attempt did. A row that is 
already present in another shuffle output might get distributed again to a new 
shuffle output (causing data duplication), while some other row might not get 
any shuffle output at all, because its designated shuffle output has already 
finished (causing data loss).
Does anyone have suggestions on how to avoid data duplication and data loss in 
this scenario?
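For what it's worth, one common workaround (assuming the salt only exists to spread rows across files, not to randomize data) is to derive it deterministically from row content, e.g. pmod(hash(some_column), 10) in Spark SQL instead of cast(rand()*10 as int). A minimal Python sketch of why this makes retries safe; the column names and the crc32-based hash here are just illustrative stand-ins:

```python
import random
import zlib

N_BUCKETS = 10

def salt_random(row, n=N_BUCKETS):
    # Analogue of cast(rand()*10 as int): the value can change every time
    # the row is recomputed, e.g. in a retried task.
    return random.randrange(n)

def salt_deterministic(row, n=N_BUCKETS):
    # Salt derived from row content (like pmod(hash(col), 10) in Spark SQL):
    # the same row always maps to the same shuffle bucket, even on retry.
    return zlib.crc32(repr(row).encode()) % n

rows = [("2024-01-01", i) for i in range(1000)]  # hypothetical (ds, id) rows

# Simulate an original attempt and a task retry recomputing the salts.
attempt1 = [salt_deterministic(r) for r in rows]
attempt2 = [salt_deterministic(r) for r in rows]
assert attempt1 == attempt2  # retry reproduces the identical bucket mapping
```

With the rand()-based salt, the two attempts would almost certainly produce different bucket mappings, which is exactly the mismatch that duplicates or drops rows.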
