It could also be because that coalesce causes some upstream
transformations to run with a parallelism of 1 as well (coalesce does not
add a shuffle, so it can pull the steps feeding it down to its own
partition count). I think (?) an OK solution is to cache the result, then
coalesce and write. Or combine the files after the fact. Or do what Silvio
said.
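
For the "combine the files after the fact" option, here is a rough sketch
in plain Python (no Spark needed). It assumes the job wrote its output to a
local directory as headerless text files named part-*; the function name
merge_part_files and both paths are made up for illustration, not anything
from James's job.

```python
import glob
import os
import shutil


def merge_part_files(output_dir: str, merged_path: str) -> int:
    """Concatenate part-* files from output_dir into merged_path.

    Returns the number of part files that were merged.
    """
    # Sort so the merged file follows partition order
    # (part-00000, part-00001, ...). The "part-*" pattern skips
    # marker files such as _SUCCESS and hidden .crc checksum files.
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream the bytes across so large parts are not
                # loaded into memory all at once.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

This way the job keeps its normal number of partitions for the write, and
the merge runs as a cheap separate step. It only works as-is for formats
that can be concatenated byte for byte (plain text, CSV without headers).
If the output is on HDFS, `hadoop fs -getmerge` does roughly the same thing.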

On Wed, Feb 3, 2021 at 12:55 PM James Yu <ja...@ispot.tv> wrote:

> Hi Team,
>
> We are running into a performance issue and are seeking your
> suggestions on how to improve it:
>
> We have a particular dataset which we aggregate from other datasets and
> would like to write out to one single file (because it is small enough).
> After a series of transformations (GROUP BYs, FLATMAPs), we coalesce the
> final RDD to 1 partition before writing it out, and we found that this
> coalesce degrades performance. It is not that the additional coalesce
> operation takes extra runtime; rather, it somehow dictates the number of
> partitions used in the upstream transformations.
>
> We hope there is a simple and useful way to solve this kind of issue which
> we believe is quite common for many people.
>
>
> Thanks
>
> James
>
