Hi all, I've run into a data duplication issue when writing a table with shuffled data and a non-deterministic function.
For example:

    insert overwrite table target_table partition(ds)
    select ... from a join b join c ...
    distribute by ds, cast(rand()*10 as int)

Because rand() is non-deterministic, the order of rows feeding the shuffle can change when a task is retried. A row that is already present in one shuffle output might get distributed again to a new shuffle output (causing data duplication), or a row might not land in any shuffle output because its designated output has already finished (causing data loss). Is there any suggestion on how to avoid duplication and loss in this scenario?
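One common remedy is to make the distribute-by key a deterministic function of the row's own columns (e.g. a hash modulo the bucket count) instead of rand(), so a retried task assigns every row to the same shuffle output as the original attempt. Below is a minimal Python sketch of why this helps; it only models the partition assignment, not Spark/Hive itself, and the function names and seed values are purely illustrative:

```python
import random
import zlib

rows = [("2024-01-01", f"key{i}") for i in range(8)]

def partition_by_rand(rows, num_parts, seed):
    # Models cast(rand()*10 as int): the assignment depends on RNG state,
    # which can differ between the original task attempt and a retry.
    rng = random.Random(seed)
    return {row: rng.randrange(num_parts) for row in rows}

def partition_by_hash(rows, num_parts):
    # Deterministic alternative: derive the spread key from row content
    # (in SQL, something like pmod(hash(col1, col2), 10) instead of rand()).
    return {row: zlib.crc32(repr(row).encode()) % num_parts for row in rows}

# Different RNG state on retry: rows can move between shuffle outputs.
first = partition_by_rand(rows, 10, seed=1)
retry = partition_by_rand(rows, 10, seed=2)
print(first == retry)

# Content-based key: a retry places every row exactly where the first
# attempt did, so no row is duplicated or lost.
print(partition_by_hash(rows, 10) == partition_by_hash(rows, 10))
```

The trade-off is that a content hash may spread rows less evenly than rand() when key values are skewed, so you may need to salt the hash with additional columns.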