My streaming job writes files to S3. The problem is that those files end up very small if I write them out directly, so I use coalesce() to reduce the number of files and make each one larger.
However, coalesce() still moves data between partitions (and triggers a full shuffle if called with shuffle = true), and my batch processing time ends up higher than sparkBatchIntervalMilliseconds.
I have observed that if I coalesce to a number of partitions equal to the number of cores in the cluster, I get less data movement, but that is unsubstantiated. Is there any dependency or rule between the number of executors, the number of cores, etc. that I can use to minimize shuffling while still producing the minimum number of output files per batch? What is the best practice?