Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Ruijing Li
I agree, Sean, although it's strange, since we aren't using any UDFs but
sticking to Spark-provided functions. If anyone in the community has seen
such an issue before, I would be happy to learn more!

On Thu, Sep 10, 2020 at 6:01 AM Sean Owen  wrote:

> It's more likely a subtle issue with your code or data, but hard to
> say without knowing more. The lineage is fine and deterministic, but
> your data or operations might not be.
> --
Cheers,
Ruijing Li


Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Sean Owen
It's more likely a subtle issue with your code or data, but hard to
say without knowing more. The lineage is fine and deterministic, but
your data or operations might not be.
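A minimal plain-Python sketch (with hypothetical field names, not the poster's actual schema) of the kind of nondeterminism Sean is hinting at: if two rows tie on the window's ORDER BY key (e.g. the same date), which row ends up as "rank 1" depends on the order rows arrive in, and that order can change when a stage is recomputed after a shuffle failure.

```python
from datetime import date

# Two rows that tie on the date column; only "v" distinguishes them.
rows = [{"id": 1, "dt": date(2020, 9, 1), "v": "from_path_a"},
        {"id": 1, "dt": date(2020, 9, 1), "v": "from_path_b"}]

def pick_latest(rows):
    # Stable sort by date only; ties keep input order, so the "winner"
    # depends on arrival/partition order, which a retry can change.
    return sorted(rows, key=lambda r: r["dt"], reverse=True)[0]

print(pick_latest(rows)["v"])                   # from_path_a
print(pick_latest(list(reversed(rows)))["v"])   # from_path_b: same data,
                                                # different order, different row kept
```

The usual fix for this class of problem is to make the window ordering total, e.g. by adding a unique tiebreaker column to the ORDER BY, so recomputation always picks the same row.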

On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li  wrote:




Missing / Duplicate Data when Spark retries

2020-09-09 Thread Ruijing Li
Hi all,

I am on Spark 2.4.4, using Mesos as the task scheduler. My job maps over
multiple datasets; for each dataset it reads one dataframe from a Parquet
file at one HDFS path and another dataframe from a second HDFS path, unions
them by name, then deduplicates, keeping the most recent date, using a
window function and rank:
https://stackoverflow.com/questions/50269678/dropping-duplicate-records-based-using-window-function-in-spark-scala
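The pipeline described above can be sketched in plain Python (the field names "id" and "dt" are hypothetical stand-ins for the real key and date columns; in Spark this would be unionByName followed by a Window.partitionBy(key).orderBy(desc(date)) with rank, as in the linked Stack Overflow answer):

```python
from datetime import date

def dedupe_latest(rows_a, rows_b):
    """Union rows with the same schema and keep the newest record per 'id'."""
    latest = {}
    for row in rows_a + rows_b:           # union-by-name analogue
        key = row["id"]
        if key not in latest or row["dt"] > latest[key]["dt"]:
            latest[key] = row             # the "rank 1" row per partition
    return sorted(latest.values(), key=lambda r: r["id"])

old = [{"id": 1, "dt": date(2020, 1, 1), "v": "a"}]
new = [{"id": 1, "dt": date(2020, 9, 1), "v": "b"},
       {"id": 2, "dt": date(2020, 9, 1), "v": "c"}]
print(dedupe_latest(old, new))
# id 1 resolves to its 2020-09-01 record; id 2 passes through unchanged.
```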

I have a strange issue where my job sometimes fails with a shuffle error and
retries the stage/task. Unfortunately, after the retry succeeds, the output
is missing some records and contains duplicates. From what I have read,
Spark keeps a lineage and should recompute lost data deterministically; my
theory is that it is somehow not replaying the correct lineage and is
regenerating only part of the data, producing duplicates while dropping
other records. I am unsure how this could happen, though, since my data is
deterministic as far as I can tell. Has anyone encountered something
similar, or have an inkling?

Thanks!

-- 
Cheers,
Ruijing Li