Hi!
I use an RDD checkpoint before writing to MongoDB to avoid duplicate records in
the DB. Without it, Spark seems to write the same data twice when a task fails:
- data is calculated
- a Mongo _id is created for each record
- the Spark Mongo connector writes the data to Mongo
- a task crashes
- (BOOM!) Spark recomputes the partition, and the recomputed records get new Mongo _ids
- I end up with duplicate records in Mongo
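
Roughly what the write looks like (a simplified sketch, not my exact code;
computedRdd and the "value" field are placeholders, and it assumes the
connector's write config is set in the Spark conf):

    import com.mongodb.spark.MongoSpark
    import org.bson.Document
    import org.bson.types.ObjectId

    val docs = computedRdd.map { rec =>
      // _id is generated inside the transformation, so a recomputed
      // partition gets brand-new ObjectIds for the same records
      new Document("_id", new ObjectId()).append("value", rec.toString)
    }
    MongoSpark.save(docs) // task fails -> partition recomputed -> new _ids -> duplicates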

So I've added a checkpoint before writing to Mongo.
Now the checkpoint roughly doubles the execution runtime.
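
What I added, roughly (the checkpoint dir and the count() to force
materialization are my setup, sketched from memory):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // example path
    docs.checkpoint()     // mark the RDD for checkpointing
    docs.count()          // action to materialize the checkpoint; without a
                          // persist() beforehand the lineage is computed twice
                          // here, which matches the ~2x runtime I'm seeing
    MongoSpark.save(docs) // reads back from the checkpoint files, so a retry
                          // reuses the same _ids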
What is the right way to avoid this? I'm thinking of saving the data to HDFS
and then reading it back and writing it to Mongo, instead of using checkpoint...
Is that a viable idea?
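
Something like this (again a sketch; the staging path is a placeholder, and it
relies on org.bson.Document being Java-serializable for saveAsObjectFile):

    docs.saveAsObjectFile("hdfs:///staging/mongo-batch") // compute exactly once
    val staged = sc.objectFile[Document]("hdfs:///staging/mongo-batch")
    MongoSpark.save(staged) // a failed task re-reads the stable HDFS copy,
                            // so the _ids don't change on retry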
