In short: df.coalesce(1).write seems to make all of the earlier calculations on the dataframe run in a single task (rather than running in multiple tasks and only the final dataframe being funnelled through a single worker). Any idea how we can force the job to run in parallel?

In more detail:

We have a job whose output we want to write out as a single CSV file. We have two approaches (code below):

df = ...  # (filtering, calculations)
(df.coalesce(num)
   .write
   .format("com.databricks.spark.csv")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .save(output_path))

Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all data and then write it back again.
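
The merge step in option A is roughly the following (the local filename and the final destination are just placeholders):

  hdfs dfs -getmerge <output_path> merged.csv.gz
  hdfs dfs -put merged.csv.gz <final destination>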

Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see filters getting just a single task)

What I'm not sure about is why Spark (1.6.1) decides to run just a single task for the calculation - and what we can do about it. We really want the df to be calculated in parallel and only _then_ coalesced before being written. (It may be that the -getmerge approach will still be faster.)

df.coalesce(100).coalesce(1).write.....  doesn't look very likely to help!
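
One thing I'm wondering about instead (untested - this assumes that repartition(1), being a shuffled coalesce, puts a stage boundary between the calculation and the write, so the filters keep their parallelism and only the final write runs as a single task):

# as option B, but with a shuffle before the single-partition write
(df.repartition(1)
   .write
   .format("com.databricks.spark.csv")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .save(output_path))

The shuffle obviously isn't free, so even if that works it might still lose to the -getmerge approach.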

Adrian

--
Adrian Bridgett
