Hi,

I am looking for a setup that would allow me to split a single Spark processing pipeline into two jobs (operational constraints) without wasting too much time persisting the data between the two jobs during the Spark checkpoint/writes.
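To make it concrete, here is roughly the handoff I have in mind between the two applications; the paths, application names, the sample aggregation and the replication-1 override are only placeholders I made up for the sketch, not something I have validated:

  // Job 1: compute the intermediate result and write it to the handoff directory.
  // The directory would be created beforehand and flagged LAZY_PERSIST, e.g.:
  //   hdfs storagepolicies -setStoragePolicy -path /tmp/spark-handoff -policy LAZY_PERSIST
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("job-1").getOrCreate()
  // Assumption on my side: one replica is acceptable for a short-lived intermediate
  // dataset that can be recomputed, so replication is lowered for the handoff files only.
  spark.sparkContext.hadoopConfiguration.set("dfs.replication", "1")

  val intermediate = spark.read.parquet("hdfs:///data/input")
    .groupBy("some_key")      // placeholder transformation
    .count()
  intermediate.write.mode("overwrite").parquet("hdfs:///tmp/spark-handoff")

  // Job 2 (separate application, launched later): picks the intermediate result back up.
  val spark2 = SparkSession.builder().appName("job-2").getOrCreate()
  spark2.read.parquet("hdfs:///tmp/spark-handoff")
    .write.parquet("hdfs:///data/output")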
I have a configuration with a lot of RAM and I am willing to dedicate a few hundred GB to ramfs, but I cannot find any feedback on this kind of configuration... and the Hadoop documentation, which tells me that "network replication negates the benefits of writing to memory", does not inspire much confidence regarding the performance improvement. My HDFS is configured with replication 3, so if LAZY_PERSIST writes still imply waiting for the three replicas to be written to the ramfs of three different datanodes, I can understand that the performance gain will be really small. In addition, if my second job is not provisioned on the same datanodes, the ramfs might not be of any help.

Any advice regarding the use of HDFS ramfs and LAZY_PERSIST with Spark checkpoints? Dead end? Would a larger Linux page cache be more helpful?

Regards,
JL