Can you increase the number of partitions and also the number of executors? This should improve parallelization, though at some point you may become disk-I/O bound.
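A minimal sketch of the repartition-before-save idea (the variable names, output format, and target partition count here are assumptions for illustration, not from the original message):

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Hypothetical RDD and path; substitute your own.
// With 45 executors x 5 cores = 225 task slots, a common rule of thumb
// is 2-3x that many partitions so every slot stays busy while writing.
val repartitioned = rawData.repartition(675)

repartitioned
  .map(record => (NullWritable.get(), new Text(record)))
  .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]](outputPath)
```

To raise the executor count itself on YARN, that is a submit-time setting rather than a code change, e.g. `--num-executors` and `--executor-cores` on spark-submit.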
On Nov 8, 2016, at 4:08 PM, Elf Of Lothlorein <redarro...@gmail.com> wrote:

Hi, I am trying to save an RDD to disk using saveAsNewAPIHadoopFile, and I am seeing that it takes almost 20 minutes for about 900 GB of data. Is there any parameter I can tune to make this save faster? I am running about 45 executors with 5 cores each on 5 Spark worker nodes, using Spark on YARN.

Thanks for your help.
C