Normally you can fetch the filesystem interface from the Hadoop configuration (I assume you mean the URI of the underlying file system). As for managing which result is the last iteration: I do not see the issue. You can name each output directory after the current timestamp, and at the end you simply select the directory with the highest number.
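To make the idea concrete, here is a minimal sketch of the "timestamped directory, highest number wins" pattern. It uses plain Python on a local filesystem purely for illustration; in Spark you would write each iteration with df.write.parquet(path) and list directories through the Hadoop FileSystem API instead. The directory layout, the .tmp suffix, and the helper names are all assumptions for the example.

```python
import os
import tempfile
import time

def write_iteration(base_dir, data):
    """Write one iteration's results to a fresh timestamped directory.

    The previous iteration's directory is never touched, so a crash
    mid-write leaves the last complete result intact.
    """
    out_dir = os.path.join(base_dir, str(int(time.time() * 1000)))
    tmp_dir = out_dir + ".tmp"          # write under a temporary name first
    os.makedirs(tmp_dir)
    with open(os.path.join(tmp_dir, "part-00000"), "w") as f:
        f.write(data)
    os.rename(tmp_dir, out_dir)         # publish: rename on a single FS
    return out_dir

def latest_result_dir(base_dir):
    """Pick the directory with the highest timestamp; ignore .tmp leftovers."""
    dirs = [d for d in os.listdir(base_dir)
            if d.isdigit() and os.path.isdir(os.path.join(base_dir, d))]
    return os.path.join(base_dir, max(dirs, key=int)) if dirs else None

base = tempfile.mkdtemp()
write_iteration(base, "iteration 1")
time.sleep(0.01)                        # ensure a strictly larger timestamp
d2 = write_iteration(base, "iteration 2")
assert latest_result_dir(base) == d2
```

A reader only ever sees directories whose rename completed, so an interrupted iteration leaves at worst a stale .tmp directory that the next listing skips.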
Regarding checkpointing, you can also use the KryoSerializer to avoid some of the space overhead. Aside from that, can you elaborate on the use case, i.e. why you need to write on every iteration?

> On 14 Feb 2017, at 11:22, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
> Hi,
>
> I have a case where I have an iterative process which overwrites the results
> of a previous iteration.
> Every iteration I need to write a dataframe with the results.
> The problem is that when I write, if I simply overwrite the results of the
> previous iteration, this is not fault tolerant. i.e. if the program crashes
> in the middle of an iteration, the data from previous ones is lost as
> overwrite first removes the previous data and then starts writing.
>
> Currently we simply write to a new directory and then rename but this is not
> the best way as it requires us to know the interfaces to the underlying file
> system (as well as requiring some extra work to manage which is the last one
> etc.)
> I know I can also use checkpoint (although I haven’t fully tested the process
> there), however, checkpointing converts the result to RDD which both takes
> more time and more space.
> I was wondering if there is any efficient method of managing this from inside
> spark.
> Thanks,
> Assaf.
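P.S. Regarding the KryoSerializer suggestion above: it is enabled through Spark configuration, not code. A minimal fragment (the property name and class are standard Spark settings; where you put them is up to your deployment):

```
# spark-defaults.conf, or pass via --conf on spark-submit
spark.serializer    org.apache.spark.serializer.KryoSerializer
```

Note that the serializer must be set before the SparkContext is created for it to take effect.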