Hi,
We are facing an issue when we convert an RDD to a Dataset, followed by a
repartition + write. We are using spot instances on k8s, which means executors
can die at any moment. When they die during this phase, we very often see data
duplication in the output.
Pseudo job code:
val rdd = data.map(…)
val ds = spark.createDataset(rdd)(classEncoder)
ds.repartition(N)
  .write
  .format("parquet")
  .mode("overwrite")
  .save(path)
If I kill an executor pod during the repartition stage, I can reproduce the
issue. If I instead move the repartition to the RDD rather than the Dataset, I
cannot reproduce it (see the sketch below).
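For reference, the variant that does not reproduce the duplication for us looks
roughly like this; it repartitions on the RDD before the conversion (mapFn is
just a placeholder for the real map function):
val rdd = data.map(mapFn).repartition(N)
val ds = spark.createDataset(rdd)(classEncoder)
ds.write
  .format("parquet")
  .mode("overwrite")
  .save(path)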
Is this a bug in Spark lineage when going from RDD -> Dataset/DataFrame ->
repartition and an executor drops? Before you ask: there is no randomness in
the map function on the RDD 😊
Thanks,
Erik