Hi Team,

We are running into a performance issue and would appreciate your suggestions 
on how to improve it:

We have a dataset that we aggregate from other datasets and would like to 
write out as a single file (it is small enough).  After a series of 
transformations (GROUP BYs, FLATMAPs), we coalesce the final RDD to 1 
partition before writing it out. This coalesce degrades performance: not 
because the coalesce operation itself takes additional runtime, but because 
it somehow dictates the number of partitions used in the upstream 
transformations.
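
To make the pattern concrete, here is a simplified sketch of the shape of the 
job. It is an illustration only, not our actual code: the object name, the 
paths, the "key,value" record layout, and the body of the flatMap are all 
hypothetical stand-ins.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reproduction of the pattern described above.
object CoalesceRepro {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to be supplied by spark-submit.
    val conf = new SparkConf().setAppName("coalesce-repro")
    val sc = new SparkContext(conf)

    // Assumed input layout: one "key,value" record per line (placeholder path).
    val records = sc.textFile("hdfs:///path/to/input")
      .map(_.split(",", 2))
      .filter(_.length == 2)
      .map(fields => (fields(0), fields(1)))

    // A GROUP BY followed by a FLATMAP, standing in for the real transformations.
    val aggregated = records
      .groupByKey()
      .flatMap { case (key, values) =>
        values.toSeq.distinct.map(value => s"$key\t$value")
      }

    // Coalesce to one partition so that a single output file is written.
    // coalesce defaults to shuffle = false, so no stage boundary is added here:
    // the flatMap above ends up in the same stage as the coalesce, and that
    // stage runs as a single task.
    aggregated
      .coalesce(1)
      .saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}
```

In a job shaped like this, everything after the last shuffle (here, the 
flatMap) appears to run with one partition, which matches the slowdown we 
observe.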

We hope there is a simple and effective way to solve this kind of issue, which 
we believe is quite common.


