In short: df.coalesce(1).write seems to make all of the earlier
calculations on the dataframe run in a single task (rather than running
them on multiple tasks and then sending just the final dataframe through
a single worker). Any idea how we can force the job to run in parallel?
In more detail:
We have a job whose result we want to write out as a single CSV file.
We have two approaches (code below):
df = ....(filtering, calculations)
df.coalesce(num).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)
Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all the data
and then write it back again (a possible HDFS-side alternative is
sketched just after this list)
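
(As an aside - and this is only a sketch, not something we've tried -
the merge could perhaps stay on HDFS by calling Hadoop's
FileUtil.copyMerge from the driver rather than round-tripping through
local disk. Assumes Hadoop 2.x, where copyMerge still exists, and that
sc is the SparkContext; the "-merged" output name is just illustrative.
Concatenated gzip members should still decompress as one stream.)

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// merge the part-* files from the num=100 run into one file, on HDFS
val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path(output_path),                    // directory of part files
  fs, new Path(output_path + "-merged.csv.gz"), // single merged file
  false,                                        // keep the source parts
  conf, null)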
Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see the filters
running as just a single task)
What I'm not sure about is why Spark (1.6.1) decides to run just a
single task for the calculation, and what we can do about it. We really
want the df to be calculated in parallel and only _then_ coalesced
before being written. (It may be that the -getmerge approach will still
be faster.)
df.coalesce(100).coalesce(1).write..... doesn't look very likely to help!
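
For concreteness, would something along these lines keep the earlier
stages parallel? (Just a sketch of what I'm imagining, not something
we've verified: repartition(1) forces a shuffle, so my hope is that the
filtering/calculations would run with their normal parallelism and only
the final write would collapse to a single task.)

df.repartition(1).write.
  format("com.databricks.spark.csv").
  option("codec", "org.apache.hadoop.io.compress.GzipCodec").
  save(output_path)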
Adrian
--
*Adrian Bridgett*