In short: df.coalesce(1).write seems to make all of the earlier
calculations on the dataframe run in a single task (rather than running
them on multiple tasks and then sending just the final dataframe through
a single worker). Any idea how we can force the job to run in parallel?
In more detail:
We have a job whose result we want to write out as a single CSV file.
We have two approaches (code below):
df = ....(filtering, calculations)
df.coalesce(num).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)
Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all the data
and then write it back again (a possible HDFS-side alternative is
sketched just after this list)
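
(As an aside - and this is only a sketch, not something we've tried -
the merge could perhaps stay on HDFS by calling Hadoop's
FileUtil.copyMerge from the driver rather than round-tripping through
local disk. Assumes Hadoop 2.x, where copyMerge still exists, and that
sc is the SparkContext; the "-merged" output name is just illustrative.
Concatenated gzip members should still decompress as one stream.)

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// merge the part-* files from the num=100 run into one file, on HDFS
val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path(output_path),                    // directory of part files
  fs, new Path(output_path + "-merged.csv.gz"), // single merged file
  false,                                        // keep the source parts
  conf, null)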
Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see the filters
running as just a single task)
What I'm not sure about is why Spark (1.6.1) decides to run just a
single task for the calculation, and what we can do about it. We really
want the df to be calculated in parallel and only _then_ coalesced
before being written. (It may be that the -getmerge approach will still
be faster.)
df.coalesce(100).coalesce(1).write..... doesn't look very likely to help!
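
For concreteness, would something along these lines keep the earlier
stages parallel? (Just a sketch of what I'm imagining, not something
we've verified: repartition(1) forces a shuffle, so my hope is that the
filtering/calculations would run with their normal parallelism and only
the final write would collapse to a single task.)

df.repartition(1).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)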
Adrian
--
*Adrian Bridgett*