As I suggested, use repartition(1) in place of coalesce(1).

That will give you a single output file without losing parallelism for the 
rest of the job.
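
A minimal sketch of that swap with the DataFrame API, in PySpark. The helper name write_single_file, the output path, and the toy aggregation in demo() are all made up for illustration; only repartition(1) itself comes from the advice above.

```python
def write_single_file(df, path, fmt="parquet", mode="overwrite"):
    """Write `df` as a single part file (hypothetical helper).

    repartition(1) always shuffles, so it starts a new stage: the
    transformations before it keep their own partitioning, and only
    the final write runs as one task.
    """
    df.repartition(1).write.mode(mode).format(fmt).save(path)


def demo():
    # Imported here so the helper above does not require a Spark install.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .master("local[4]")
             .appName("single-file-demo")
             .getOrCreate())

    # Toy stand-in for the real aggregation (assumed data, not from the thread).
    df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
    agg = df.groupBy("key").count()

    # Hypothetical output location; the directory will hold one part file.
    write_single_file(agg, "/tmp/single_file_demo")
    spark.stop()
```

Calling demo() needs a local Spark installation. The only change from the coalesce(1) version is the one call inside write_single_file.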

From: James Yu <ja...@ispot.tv>
Date: Wednesday, February 3, 2021 at 2:19 PM
To: Silvio Fiorito <silvio.fior...@granturing.com>, user <user@spark.apache.org>
Subject: Re: Poor performance caused by coalesce to 1

Hi Silvio,

The result file is less than 50 MB in size, so I think it is small enough 
for one task to write.

Your suggestion sounds interesting. Could you guide us further on how to easily 
"add a stage boundary"?

Thanks
________________________________
From: Silvio Fiorito <silvio.fior...@granturing.com>
Sent: Wednesday, February 3, 2021 11:05 AM
To: James Yu <ja...@ispot.tv>; user <user@spark.apache.org>
Subject: Re: Poor performance caused by coalesce to 1


Coalesce reduces the parallelism of your last stage, in your case to 1 task, 
so poor performance is to be expected, especially with large data. If you 
absolutely need a single output file, you can instead add a stage boundary by 
using repartition(1). This gives your query full parallelism during 
processing, with a single task at the end that writes the data out. Note that 
if the file is large (e.g. 1 GB or more) you’ll probably still notice slowness 
while writing. You may want to reconsider the 1-file requirement for larger 
datasets.
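
Since the original question mentions RDDs, here is a hedged sketch of the same idea with the RDD API in PySpark. The helper name to_single_partition, the sample data, and the output path are assumptions for illustration. In the RDD API, repartition(n) is coalesce(n, shuffle=True), and the shuffle is what creates the stage boundary.

```python
def to_single_partition(rdd):
    """Collapse `rdd` to one partition behind a shuffle (hypothetical helper).

    Plain coalesce(1) is a narrow dependency, so every transformation in
    the same stage as the coalesce also runs as a single task.
    shuffle=True inserts a stage boundary instead, which is what
    rdd.repartition(1) does.
    """
    return rdd.coalesce(1, shuffle=True)


def demo():
    # Imported here so the helper above does not require a Spark install.
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "single-partition-demo")

    # Toy pipeline standing in for the real GROUP BY / flatMap steps.
    counts = (sc.parallelize(range(1000), 8)
                .map(lambda x: (x % 10, 1))
                .reduceByKey(lambda a, b: a + b))

    single = to_single_partition(counts)
    print(single.getNumPartitions())

    # Hypothetical path; saveAsTextFile fails if the directory already exists.
    single.saveAsTextFile("/tmp/single_partition_demo")
    sc.stop()
```

Comparing the stages in the Spark UI for coalesce(1) and for this version should show the difference: the work before the shuffle keeps its original task count, and only the final write stage has one task.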



From: James Yu <ja...@ispot.tv>
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user <user@spark.apache.org>
Subject: Poor performance caused by coalesce to 1



Hi Team,



We are running into a performance problem and are seeking your suggestions on 
how to improve it:



We have a particular dataset which we aggregate from other datasets and would 
like to write out to one single file (because it is small enough). After a 
series of transformations (GROUP BYs, FLATMAPs), we coalesce the final RDD to 
1 partition before writing it out, and we found that this coalesce degrades 
performance: not because the coalesce operation itself takes additional 
runtime, but because it somehow dictates the number of partitions used in the 
upstream transformations.



We hope there is a simple way to solve this kind of issue, which we believe is 
quite common.





Thanks



James
