Hi! I use an RDD checkpoint before writing to MongoDB to avoid duplicate records in the DB. It seems Spark writes the same data twice when a task fails:

1. the data is calculated
2. a Mongo `_id` is created for each record
3. the Spark MongoDB connector writes the data to Mongo
4. the task crashes
5. (BOOM!) Spark recomputes the partition and generates a new `_id` for the Mongo records
6. I get duplicate records in Mongo
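One way to break step 5 in the sequence above is to derive `_id` deterministically from the record's own fields instead of generating a fresh one, so a recomputed partition produces the same `_id` again and retried writes collide instead of duplicating. A minimal sketch (the field names `user_id` and `event_time` are hypothetical stand-ins for whatever uniquely identifies your records):

```python
import hashlib
import json

def deterministic_id(record: dict) -> str:
    """Derive a stable _id from the record's business fields, so a
    recomputed partition produces the same _id for the same record."""
    # Assumption: 'user_id' and 'event_time' uniquely identify a record.
    key = json.dumps(
        {k: record[k] for k in ("user_id", "event_time")},
        sort_keys=True,
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

rec = {"user_id": 42, "event_time": "2020-01-01T00:00:00", "value": 7.5}
# Recomputing yields an equal record, hence the same _id:
assert deterministic_id(rec) == deterministic_id(dict(rec))
```

With stable `_id`s, a retried task rewrites the same documents rather than inserting new ones, provided the write is a replace/upsert by `_id` rather than a plain insert.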
So I added a checkpoint before writing to Mongo. Now Spark's execution runtime has doubled because of the checkpoint. What is the right way to avoid the duplicates? I'm considering saving the data to HDFS and then reading it back and writing it to Mongo, instead of using a checkpoint. Is that a viable idea?
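An alternative to both the checkpoint and the HDFS round-trip is to make the Mongo write itself idempotent: with deterministic `_id`s, an upsert/replace-by-`_id` can be retried any number of times without creating duplicates (check the connector docs for the write mode that replaces documents by `_id` rather than inserting). A toy in-memory simulation of that property, not the real connector API:

```python
def upsert_all(collection: dict, records: list) -> None:
    """Toy stand-in for replace-with-upsert semantics: the store is
    keyed by _id, so re-running the same write changes nothing."""
    for rec in records:
        collection[rec["_id"]] = rec

db = {}
batch = [{"_id": "a", "v": 1}, {"_id": "b", "v": 2}]
upsert_all(db, batch)
upsert_all(db, batch)   # a retried task writes the same batch again
assert len(db) == 2     # still no duplicates
```

If the writes are idempotent like this, the checkpoint (and its doubled runtime) becomes unnecessary for correctness, because recomputation after a task failure is harmless.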