My streaming job is creating files on S3. The problem is that those files end up 
very small if I just write them to S3 directly. This is why I use coalesce() to 
reduce the number of files and make them larger.
However, coalesce shuffles data, and my job's processing time ends up higher than 
sparkBatchIntervalMilliseconds.
I have observed that if I coalesce to a number of partitions equal to the number 
of cores in the cluster I get less shuffling, but that is unsubstantiated. Is 
there any dependency/rule between the number of executors, the number of cores, 
etc. that I can use to minimize shuffling while at the same time achieving the 
minimum number of output files per batch? What is the best practice?
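For what it's worth, the "one partition per core" heuristic above can be written down as a small helper. This is only a sketch of that heuristic, not an established Spark rule; the function name and parameters are illustrative assumptions:

```python
def target_partitions(num_executors, cores_per_executor, max_files_per_batch=None):
    """Heuristic sketch: coalesce to the total core count so every core
    writes one output file per batch, keeping all cores busy.
    If a hard cap on output files is given, take the smaller value."""
    total_cores = num_executors * cores_per_executor
    if max_files_per_batch is None:
        return total_cores
    return min(total_cores, max_files_per_batch)

# Example: 4 executors x 4 cores each -> coalesce(16) before the S3 write,
# e.g. df.coalesce(target_partitions(4, 4)).write...  (PySpark, illustrative)
```

The trade-off this encodes: fewer partitions means fewer (larger) files on S3, but also less write parallelism, so going below the core count tends to stretch batch processing time.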