Hi,

I have an iterative process in which each iteration overwrites the results of 
the previous one.
On every iteration I need to write out a DataFrame with the results.
The problem is that simply overwriting the previous iteration's output is not 
fault tolerant: if the program crashes in the middle of an iteration, the data 
from previous iterations is lost, because overwrite first removes the existing 
data and only then starts writing.
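Concretely, the naive per-iteration write looks something like this (the output 
path, Parquet format and the `results` DataFrame name are just for illustration):

import org.apache.spark.sql.SaveMode

// `results` is the per-iteration DataFrame. Overwrite mode deletes the
// existing directory before the new data is written, so a crash mid-write
// loses the previous iteration as well.
results.write.mode(SaveMode.Overwrite).parquet("/data/iteration-results")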

Currently we simply write to a new directory and then rename it, but this is 
not ideal: it requires knowing the interfaces of the underlying file system 
(as well as some extra bookkeeping to track which directory is the latest one, 
etc.), roughly as in the sketch below.
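A rough sketch of the current workaround (the helper name, paths and Parquet 
format are just for illustration, and the swap relies on the Hadoop FileSystem 
API rather than anything Spark provides):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

def writeAndSwap(spark: SparkSession, results: DataFrame, finalDir: String): Unit = {
  val tmpDir = finalDir + ".tmp"
  // Write the new iteration's results to a temporary directory first.
  results.write.mode(SaveMode.Overwrite).parquet(tmpDir)

  // Only after the write succeeds, swap the directories. A crash before this
  // point leaves the previous iteration's data untouched, but the swap itself
  // is file-system specific and not atomic (delete + rename).
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val finalPath = new Path(finalDir)
  if (fs.exists(finalPath)) fs.delete(finalPath, true)
  fs.rename(new Path(tmpDir), finalPath)
}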
I know I could also use checkpoint (although I haven't fully tested that 
approach); however, checkpointing converts the result to an RDD, which takes 
both more time and more space.
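To clarify what I mean by the checkpoint route, roughly (the checkpoint 
directory here is just an example):

import org.apache.spark.sql.{DataFrame, SparkSession}

def checkpointResults(spark: SparkSession, results: DataFrame): Unit = {
  // Checkpointing goes through the RDD representation and materializes it
  // under the checkpoint directory, which is where the extra time and space
  // come from.
  spark.sparkContext.setCheckpointDir("/tmp/iteration-checkpoints")
  val rdd = results.rdd   // RDD[Row], losing the DataFrame optimizations
  rdd.checkpoint()
  rdd.count()             // run an action so the checkpoint is actually written
}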
I was wondering whether there is an efficient way to manage this from inside 
Spark.
Thanks,
                Assaf.
