My streaming job writes files to S3. The problem is that those files end up very small if I write them out directly, so I use coalesce() to reduce the number of files and make each one larger.
However, coalesce() still moves data between partitions (and triggers a full shuffle if called with shuffle = true), and my batch processing time ends up higher than sparkBatchIntervalMilliseconds.
I have observed that if I coalesce to a number of partitions equal to the number of cores in the cluster, I get less data movement, but that is unsubstantiated. Is there any dependency or rule between the number of executors, the number of cores, etc. that I can use to minimize shuffling while still producing the minimum number of output files per batch? What is the best practice?