I agree, Sean, although it's strange since we aren't using any UDFs, only Spark-provided functions. If anyone in the community has seen such an issue before, I would be happy to learn more!
On Thu, Sep 10, 2020 at 6:01 AM Sean Owen <sro...@gmail.com> wrote:
> It's more likely a subtle issue with your code or data, but hard to
> say without knowing more. The lineage is fine and deterministic, but
> your data or operations might not be.
>
> On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li <liruijin...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I am on Spark 2.4.4 using Mesos as the task resource scheduler. The
> > context is that my job maps over multiple datasets; for each dataset it
> > takes one dataframe from a parquet file at one HDFS path and another
> > dataframe from a second HDFS path, unions them by name, then
> > deduplicates by most recent date using windowing and rank:
> > https://stackoverflow.com/questions/50269678/dropping-duplicate-records-based-using-window-function-in-spark-scala
> >
> > I have a strange issue where sometimes my job fails with a shuffle
> > error and retries the stage/task. Unfortunately, it somehow loses data
> > and generates duplicates after the retry succeeds. I have read that
> > Spark keeps a lineage; my theory is that Spark isn't keeping the
> > correct lineage and is regenerating only the successful data, so it
> > created duplicates but lost parts of the data. I'm totally unsure how
> > this would happen; I don't have nondeterministic data. Has anyone
> > encountered something similar, or have an inkling?
> >
> > Thanks!
> >
> > --
> > Cheers,
> > Ruijing Li

--
Cheers,
Ruijing Li
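For anyone following along, the union-then-dedup step described above can be sketched roughly as below (Scala; the HDFS paths and the `key`/`date` column names are hypothetical stand-ins, not the actual job's). The comment on the window is the point Sean raises: if the ordering admits ties, the operation itself is nondeterministic even when the input data is fixed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()

// Hypothetical paths for illustration.
val dfA = spark.read.parquet("hdfs:///data/source_one")
val dfB = spark.read.parquet("hdfs:///data/source_two")

// Union by column name rather than position.
val unioned = dfA.unionByName(dfB)

// Keep the most recent record per key. Caveat: if (key, date) is not
// unique, the row that wins this window is not deterministic -- a stage
// retry after a shuffle failure can pick a different "winner" than the
// original attempt, which downstream can look like lost rows plus
// duplicates.
val w = Window.partitionBy("key").orderBy(col("date").desc)

val deduped = unioned
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```

One common mitigation, assuming ties are possible: extend the `orderBy` with a tiebreaker that uniquely orders rows within a partition (e.g. another column or combination of columns), so every attempt of the task picks the same row.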