coalesce is nice because it does not shuffle, but the consequence of avoiding a shuffle is that it also reduces the parallelism of the preceding computation. Have you tried using repartition instead?
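To make the difference concrete, here is a rough sketch of what I mean (the paths, app name and join key below are placeholders, not taken from your job):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

  val wide  = spark.read.orc("s3://bucket/wide")          // the ~300 GB DF
  val small = spark.read.orc("s3://bucket/small").cache() // the ~20 GB DF

  val joined = wide.join(small, Seq("id"))

  // coalesce is a narrow dependency, so it is folded into the final stage:
  // the shuffle-read side of the join now runs with only 10 tasks.
  joined.coalesce(10).write.orc("s3://bucket/out-coalesce")

  // repartition adds one extra shuffle, but the join stage keeps its
  // spark.sql.shuffle.partitions tasks; only the last stage uses 10 tasks
  // and therefore writes 10 files.
  joined.repartition(10).write.orc("s3://bucket/out-repartition")

The extra shuffle that repartition introduces is usually much cheaper than squeezing the whole final stage into 10 tasks.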
On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi
<andrii.bilets...@yahoo.com.invalid> wrote:

> Hi all,
>
> I'm trying to understand the impact of the coalesce operation on Spark job
> performance.
>
> As a side note: we are using emrfs (i.e. AWS S3) as the source and target
> for the job.
>
> Omitting unnecessary details, the job can be described as: join a
> 200M-record DataFrame stored in ORC format on emrfs with another 200M-record
> cached DataFrame, and write the result of the join back to emrfs. The first
> DF is a set of wide rows (Spark UI shows 300 GB) and the second is
> relatively small (Spark shows 20 GB).
>
> I have enough resources in my cluster to perform the job, but I don't like
> the fact that the output datasource contains 200 ORC part files (as
> spark.sql.shuffle.partitions defaults to 200), so before saving the ORC to
> emrfs I'm doing .coalesce(10). From the documentation, coalesce looks like
> a quite harmless operation: no repartitioning etc.
>
> But with such a setup my job fails to write the dataset on the last stage.
> Right now the error is OOM: GC overhead. When I change .coalesce(10) to
> .coalesce(100) the job runs much faster and finishes without errors.
>
> So what's the impact of .coalesce in this case? And how can I do in-place
> concatenation of files (not involving Hive) to end up with a smaller number
> of bigger files? With .coalesce(100) the job generates 100 ORC
> snappy-encoded files of ~300 MB each.
>
> Thanks,
> Andrii
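On the second question (ending up with fewer, larger files without involving Hive): one option is a separate compaction pass after the main job finishes. Again just a sketch, with placeholder paths:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("orc-compaction").getOrCreate()

  // Read the many small ORC part files back and rewrite them as 10 larger
  // ones. Running this as its own job keeps the join's parallelism untouched
  // and isolates the cost of the merge step.
  val out = spark.read.orc("s3://bucket/job-output")
  out.repartition(10)
     .write.mode("overwrite")
     .orc("s3://bucket/job-output-compacted")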