Hi all,

I am on Spark 2.4.4, using Mesos as the resource scheduler. For context: my job maps over multiple datasets; for each dataset it reads one dataframe from a parquet file at one HDFS path and another dataframe from a second HDFS path, unions them by name, then deduplicates by most recent date using a window function and rank (https://stackoverflow.com/questions/50269678/dropping-duplicate-records-based-using-window-function-in-spark-scala).
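For reference, the per-dataset logic is roughly equivalent to the sketch below; the paths, the key column "id", and the date column "updated_at" are simplified placeholders, not my real schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()

// Two dataframes for the same dataset, read from two HDFS paths (placeholder paths)
val dfA = spark.read.parquet("hdfs:///path/one/dataset")
val dfB = spark.read.parquet("hdfs:///path/two/dataset")

// Union by name, then keep only the most recent record per key using a
// window ordered by date descending and rank == 1
val w = Window.partitionBy("id").orderBy(col("updated_at").desc)

val deduped = dfA.unionByName(dfB)
  .withColumn("rnk", rank().over(w))
  .where(col("rnk") === 1)
  .drop("rnk")

deduped.write.mode("overwrite").parquet("hdfs:///path/output/dataset")

(I use rank() here to mirror the description above; rank() keeps ties on the date column, so row_number() would be the stricter choice if exactly one row per key is required.)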
I have a strange issue where my job sometimes fails with a shuffle error and Spark retries the stage/task. Unfortunately, after the retry succeeds the output has somehow lost data and contains duplicates. From what I have read, Spark should recompute lost partitions from lineage; my theory is that it is somehow not following the correct lineage and only regenerates part of the data, which would explain both the duplicates and the missing rows. I don't see how that could happen, though, since my data is not non-deterministic. Has anyone encountered something similar, or have an inkling of what might be going on? Thanks!

--
Cheers,
Ruijing Li