Hi all,

I am on Spark 2.4.4, using Mesos as the resource scheduler. For context: my job maps over multiple datasets; for each dataset it reads one dataframe from a parquet file at one HDFS path and another dataframe from a second HDFS path, unions them by name, then deduplicates by most recent date using a window function and rank (https://stackoverflow.com/questions/50269678/dropping-duplicate-records-based-using-window-function-in-spark-scala).
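For reference, the per-dataset logic is roughly equivalent to the sketch below; the paths, the key column "id", and the date column "updated_at" are simplified placeholders, not my real schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()

// Two dataframes for the same dataset, read from two HDFS paths (placeholder paths)
val dfA = spark.read.parquet("hdfs:///path/one/dataset")
val dfB = spark.read.parquet("hdfs:///path/two/dataset")

// Union by name, then keep only the most recent record per key using a
// window ordered by date descending and rank == 1
val w = Window.partitionBy("id").orderBy(col("updated_at").desc)

val deduped = dfA.unionByName(dfB)
  .withColumn("rnk", rank().over(w))
  .where(col("rnk") === 1)
  .drop("rnk")

deduped.write.mode("overwrite").parquet("hdfs:///path/output/dataset")

(I use rank() here to mirror the description above; rank() keeps ties on the date column, so row_number() would be the stricter choice if exactly one row per key is required.)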
I have a strange issue where my job sometimes fails with a shuffle error and Spark retries the stage/task. Unfortunately, after the retry succeeds the output has somehow lost data and contains duplicates. From what I have read, Spark should recompute lost partitions from lineage; my theory is that it is somehow not following the correct lineage and only regenerates part of the data, which would explain both the duplicates and the missing rows. I don't see how that could happen, though, since my data is not non-deterministic. Has anyone encountered something similar, or have an inkling of what might be going on? Thanks!

--
Cheers,
Ruijing Li