Normally you can fetch the filesystem interface from the configuration (I assume you mean the URI).
As for getting the last iteration, I do not see the issue: use the current timestamp as the directory name for each iteration, and at the end simply select the directory with the highest (most recent) value.

Regarding checkpointing, you can also use the Kryo serializer to reduce the space overhead.
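For reference, Kryo is enabled through Spark's serializer setting, e.g. in spark-defaults.conf (a standard Spark configuration property, shown here as a sketch):

```
spark.serializer  org.apache.spark.serializer.KryoSerializer
```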

Aside from that, can you elaborate on the use case? Why do you need to write the results of every iteration?

> On 14 Feb 2017, at 11:22, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
> 
> Hi,
>  
> I have a case where I have an iterative process which overwrites the results 
> of a previous iteration.
> Every iteration I need to write a dataframe with the results.
> The problem is that when I write, if I simply overwrite the results of the 
> previous iteration, this is not fault tolerant. i.e. if the program crashes 
> in the middle of an iteration, the data from previous ones is lost as 
> overwrite first removes the previous data and then starts writing.
>  
> Currently we simply write to a new directory and then rename but this is not 
> the best way as it requires us to know the interfaces to the underlying file 
> system (as well as requiring some extra work to manage which is the last one 
> etc.)
> I know I can also use checkpoint (although I haven’t fully tested the process 
> there), however, checkpointing converts the result to RDD which both takes 
> more time and more space.
> I was wondering if there is any efficient method of managing this from inside 
> spark.
> Thanks,
>                 Assaf.
