Hi,

I am looking for a setup that would let me split a single Spark processing
pipeline into 2 jobs (due to operational constraints) without wasting too
much time persisting the data between the two jobs during the Spark
checkpoint/write step.
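
To make this concrete, here is a rough sketch of the two-job handoff I have
in mind (paths, app names and columns are placeholders, not my real
pipeline):

    import org.apache.spark.sql.SparkSession

    object JobOne {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("job-1").getOrCreate()
        // ... heavy first half of the processing ...
        val intermediate = spark.read.parquet("hdfs:///data/input")
        // Persist the intermediate result so the second job can pick it up.
        intermediate.write.mode("overwrite").parquet("hdfs:///staging/intermediate")
        spark.stop()
      }
    }

    object JobTwo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("job-2").getOrCreate()
        // Submitted separately, possibly landing on other nodes.
        val intermediate = spark.read.parquet("hdfs:///staging/intermediate")
        // ... second half of the processing ...
        intermediate.write.parquet("hdfs:///data/output")
        spark.stop()
      }
    }

The question is essentially how to make that intermediate write/read as
cheap as possible.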

My machines have a lot of RAM and I'm willing to configure a few hundred GB
of ramfs, but I cannot find any feedback on this kind of configuration...
and the Hadoop documentation, which says that "network replication negates
the benefits of writing to memory", does not inspire much confidence
regarding the performance improvement.
My HDFS is configured with replication factor 3, so if LAZY_PERSIST writes
still imply waiting for all three replicas to be written to the ramfs of
three different datanodes, I can understand that the performance
improvement would be really small.
In addition, if my second job is not provisioned on the same datanodes, the
ramfs might not be of any help.
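
If it helps, this is roughly how I was planning to route that staging
directory to memory. The programmatic call is an assumption on my part
(FileSystem.setStoragePolicy, available in recent Hadoop releases; the
hdfs storagepolicies CLI should be the equivalent), and it presumes the
datanodes already expose a tmpfs/ramfs volume tagged [RAM_DISK] in
dfs.datanode.data.dir, with dfs.datanode.max.locked.memory sized
accordingly:

    // Run e.g. from spark-shell, where `spark` is already defined.
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Equivalent CLI:
    //   hdfs storagepolicies -setStoragePolicy -path /staging/intermediate -policy LAZY_PERSIST
    fs.setStoragePolicy(new Path("/staging/intermediate"), "LAZY_PERSIST")
    // New files written under this directory should then target the RAM_DISK
    // tier first, with HDFS flushing them to disk lazily, which is exactly
    // where my questions about replication and locality come in.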

Any advice regarding the use of HDFS ramfs and LAZY_PERSIST with Spark
checkpoints? Is it a dead end? Would relying on a larger Linux page cache
be more helpful?

Regards,
JL
