I agree, Sean, although it's strange since we aren't using any UDFs, only Spark-provided functions. If anyone in the community has seen such an issue before, I would be happy to learn more!
On Thu, Sep 10, 2020 at 6:01 AM Sean Owen <sro...@gmail.com> wrote:
> It's more likely a subtle issue with your code or data, but hard to
> say without knowing more. The lineage is fine and deterministic, but
> your data or operations might not be.
>
> On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li <liruijin...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I am on Spark 2.4.4 using Mesos as the task resource scheduler. The
> > context is that my job maps over multiple datasets; for each dataset it
> > takes one dataframe from a parquet file at one HDFS path and another
> > dataframe from a second HDFS path, unions them by name, then
> > deduplicates by most recent date using windowing and rank:
> > https://stackoverflow.com/questions/50269678/dropping-duplicate-records-based-using-window-function-in-spark-scala
> >
> > I have a strange issue where sometimes my job fails with a shuffle
> > error and retries the stage/task. Unfortunately, it somehow loses data
> > and generates duplicates after the retry succeeds. I have read that
> > Spark keeps a lineage; my theory is that Spark isn't keeping the
> > correct lineage and is regenerating only the successful data, so it
> > created duplicates but lost parts of the data. I'm totally unsure how
> > this would happen; I don't have nondeterministic data. Has anyone
> > encountered something similar, or have an inkling?
> >
> > Thanks!
> >
> > --
> > Cheers,
> > Ruijing Li

--
Cheers,
Ruijing Li
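For anyone following along, the union-then-dedup step described above can be sketched roughly as below (Scala; the HDFS paths and the `key`/`date` column names are hypothetical stand-ins, not the actual job's). The comment on the window is the point Sean raises: if the ordering admits ties, the operation itself is nondeterministic even when the input data is fixed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()

// Hypothetical paths for illustration.
val dfA = spark.read.parquet("hdfs:///data/source_one")
val dfB = spark.read.parquet("hdfs:///data/source_two")

// Union by column name rather than position.
val unioned = dfA.unionByName(dfB)

// Keep the most recent record per key. Caveat: if (key, date) is not
// unique, the row that wins this window is not deterministic -- a stage
// retry after a shuffle failure can pick a different "winner" than the
// original attempt, which downstream can look like lost rows plus
// duplicates.
val w = Window.partitionBy("key").orderBy(col("date").desc)

val deduped = unioned
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```

One common mitigation, assuming ties are possible: extend the `orderBy` with a tiebreaker that uniquely orders rows within a partition (e.g. another column or combination of columns), so every attempt of the task picks the same row.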