I know how to get the filesystem; the problem is that this means using Hadoop 
directly, so if in the future we change to something else (e.g. S3) I would need 
to rewrite the code. This also relates to finding the last iteration: I would 
need to use the Hadoop filesystem API, which is not agnostic to the deployment.

Kryo serialization still costs much more than a plain dataframe write.

As for the use case, I am running a very large number of iterations. The idea 
is that every X iterations I want to save the results to disk, so that if 
something crashes I do not have to start again from the first iteration, only 
from the last saved one.
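
A rough sketch of the loop (compute(), initialDf, numIterations and the path 
are made up here, just for illustration):

import org.apache.spark.sql.{DataFrame, SaveMode}

var df: DataFrame = initialDf          // hypothetical starting dataframe
val checkpointEvery = 10               // the "X" above
for (i <- 1 to numIterations) {
  df = compute(df, i)                  // hypothetical per-iteration transformation
  if (i % checkpointEvery == 0) {
    // this overwrite is exactly the step that is not fault tolerant today
    df.write.mode(SaveMode.Overwrite).parquet("/data/state")
  }
}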

Basically, I would have liked to see something like a normal save where the 
original data is not removed until the new write has succeeded.
Assaf.

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, February 14, 2017 12:54 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: fault tolerant dataframe write with overwrite

Normally you can fetch the filesystem interface from the configuration (I 
assume you mean the URI).
As for finding the last iteration: I do not see the issue. You can name each 
output directory with the current timestamp and, at the end, simply select the 
directory with the highest number.
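
For example, something along these lines (just a sketch, the base path is made 
up):

import org.apache.hadoop.fs.{FileSystem, Path}

// write each iteration into its own timestamped directory
val base = "/data/results"                      // hypothetical base path
df.write.parquet(s"$base/${System.currentTimeMillis}")

// on restart, pick the directory with the highest timestamp
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val latest = fs.listStatus(new Path(base))
  .filter(_.isDirectory)
  .map(_.getPath)
  .maxBy(_.getName.toLong)
val restored = spark.read.parquet(latest.toString)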

Regarding checkpointing, you can also use the Kryo serializer to avoid some of 
the space overhead.
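
For example, enabling it when building the session (a sketch, the app name is 
made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iterative-job")   // hypothetical app name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()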

Aside from that, can you elaborate on the use case: why do you need to write on 
every iteration?

On 14 Feb 2017, at 11:22, Mendelson, Assaf 
<assaf.mendel...@rsa.com> wrote:
Hi,

I have a case where I have an iterative process which overwrites the results of 
a previous iteration.
Every iteration I need to write a dataframe with the results.
The problem is that simply overwriting the results of the previous iteration is 
not fault tolerant: if the program crashes in the middle of a write, the data 
from the previous iteration is lost, because overwrite first removes the 
previous data and only then starts writing.

Currently we simply write to a new directory and then rename it, but this is 
not ideal: it requires us to know the interfaces of the underlying file system 
(as well as some extra bookkeeping to track which directory is the latest, 
etc.).
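
Roughly like this (a sketch; the paths are made up):

import org.apache.hadoop.fs.{FileSystem, Path}

// write the new result next to the old one and only swap the
// directories once the write has fully succeeded
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val target = new Path("/data/results")      // hypothetical final location
val tmp = new Path("/data/results_new")     // hypothetical staging location

df.write.parquet(tmp.toString)              // may fail without touching target
if (fs.exists(target)) fs.delete(target, true)
fs.rename(tmp, target)                      // the Hadoop-specific part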
I know I can also use checkpointing (although I have not fully tested that 
path); however, checkpointing converts the result to an RDD, which takes both 
more time and more space.
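
What I mean by the checkpoint route, as a sketch (the checkpoint directory is 
made up; assumes a SparkSession `spark` and a DataFrame `df` are in scope):

spark.sparkContext.setCheckpointDir("/data/checkpoints")  // hypothetical path
val rdd = df.rdd              // DataFrame -> RDD[Row], the conversion I mean
rdd.checkpoint()
rdd.count()                   // force an action so the checkpoint is written
val restored = spark.createDataFrame(rdd, df.schema)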
I was wondering if there is an efficient way of managing this from inside 
Spark.
Thanks,
Assaf.
