No, I didn't try repartition. How exactly does it impact the parallelism? In my
understanding coalesce simply "unions" multiple partitions located on the same
executor, one on top of the other, while repartition does a hash-based shuffle
that decreases the number of output partitions. So how exactly does this affect
the parallelism, and at which stage of the job?
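Something like this is my mental model (made-up path and numbers, just to illustrate what I mean):

val df = spark.read.orc("s3://bucket/input")    // say it comes in as 200 partitions

val merged = df.coalesce(10)        // merges existing partitions locally, no shuffle
val reshuffled = df.repartition(10) // full shuffle into 10 partitions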
Thanks,
Andrii
 

    On Tuesday, May 23, 2017 10:20 PM, Michael Armbrust 
<mich...@databricks.com> wrote:
 

coalesce is nice because it does not shuffle, but the consequence of avoiding a
shuffle is that it will also reduce the parallelism of the preceding computation.
Have you tried using repartition instead?
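Roughly the difference at the write step (sketch only; joined and the S3 path stand in for your actual DataFrame and output location):

// coalesce(10): no shuffle, but the 10-way split propagates back through the
// stage, so the preceding join itself runs with only 10 tasks
joined.coalesce(10).write.orc("s3://bucket/output")

// repartition(10): adds a shuffle, so the join keeps its full parallelism
// (spark.sql.shuffle.partitions tasks) and only the final write runs with 10
// tasks, still producing 10 output files
joined.repartition(10).write.orc("s3://bucket/output")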
On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi 
<andrii.bilets...@yahoo.com.invalid> wrote:

Hi all,
I'm trying to understand the impact of the coalesce operation on Spark job
performance.
As a side note: we are using EMRFS (i.e. AWS S3) as both the source and the target
for the job.
Omitting unnecessary details, the job can be described as: join a 200M-record
DataFrame stored in ORC format on EMRFS with another 200M-record cached
DataFrame, and write the result of the join back to EMRFS. The first DF is a set
of wide rows (Spark UI shows 300 GB) and the second is relatively small (Spark
shows 20 GB).
I have enough resources in my cluster to perform the job, but I don't like the
fact that the output data source contains 200 part ORC files (as
spark.sql.shuffle.partitions defaults to 200), so before saving the ORC to EMRFS
I'm doing .coalesce(10). From the documentation, coalesce looks like a quite
harmless operation: no repartitioning etc.
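Simplified, the pipeline looks roughly like this (paths and the join key are made up, just to illustrate):

// big, wide DF stored as ORC on EMRFS (~300 GB per Spark UI)
val big = spark.read.orc("s3://my-bucket/big-input")

// smaller DF (~20 GB), kept cached
val small = spark.read.orc("s3://my-bucket/small-input").cache()

// the join produces 200 partitions (spark.sql.shuffle.partitions default),
// which I then collapse to 10 before writing back to EMRFS
big.join(small, Seq("id"))
  .coalesce(10)
  .write.orc("s3://my-bucket/output")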
But with such a setup my job fails to write the dataset on the last stage. Right
now the error is OOM: GC overhead. When I change .coalesce(10) to .coalesce(100)
the job runs much faster and finishes without errors.
So what's the impact of .coalesce in this case? And how can I do an in-place
concatenation of files (not involving Hive) to end up with a smaller number of
bigger files? With .coalesce(100) the job generates 100 snappy-encoded ORC files
of ~300MB each.
Thanks,
Andrii



   
