Hi,
We are facing an issue when we convert an RDD to a Dataset, followed by a
repartition + write. We are using spot instances on k8s, which means executors
can die at any moment. When they die during this phase, we very often see data
duplication in the output.
Pseudo job code:
val rdd = data.map(…)
val ds = spark.createDataset(rdd)(classEncoder)
ds.repartition(N)
  .write
  .format("parquet")
  .mode("overwrite")
  .save(path)
If I kill an executor pod during the repartition stage, I can reproduce the
issue. If I instead move the repartition to the RDD rather than the Dataset, I
cannot reproduce it (see the sketch below).
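For reference, the variant that does not reproduce the duplication for us looks
roughly like this; it repartitions on the RDD before the conversion (mapFn is
just a placeholder for the real map function):
val rdd = data.map(mapFn).repartition(N)
val ds = spark.createDataset(rdd)(classEncoder)
ds.write
  .format("parquet")
  .mode("overwrite")
  .save(path)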
Is this a bug in Spark lineage when going from RDD -> Dataset/DataFrame ->
repartition and an executor drops? Before you ask: there is no randomness in
the map function on the RDD 😊
Thanks,
Erik