How about running a count step to force Spark to materialise the DataFrame, and
then repartitioning to 1?
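Roughly like this (a sketch only, reusing `df` and `output_path` from your post; the cache before the count is my addition, so the computed partitions aren't thrown away and recomputed by the write):

```scala
df.cache()   // keep the materialised partitions around
df.count()   // action: runs the filters/calculations with full parallelism

df.repartition(1)  // shuffle the already-computed partitions down to one file
  .write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(output_path)
```

Note that `repartition(1)`, unlike `coalesce(1)`, inserts a shuffle boundary, so the upstream stage keeps its parallelism even on its own; the cache/count step just avoids paying for the computation twice.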
On 9 Aug 2016 17:11, "Adrian Bridgett" wrote:
> In short: df.coalesce(1).write seems to force all of the earlier calculations
> on the DataFrame through a single task (rather than running them on multiple
> tasks and only sending the final DataFrame through a single worker). Any idea
> how we can force the job to run in parallel?
>
> In more detail:
>
> We have a job that we wish to write out as a single CSV file. We have two
> approaches (code below):
>
> df = (filtering, calculations)
> df.coalesce(num).write.
>   format("com.databricks.spark.csv").
>   option("codec", "org.apache.hadoop.io.compress.GzipCodec").
>   save(output_path)
> Option A: (num=100)
> - dataframe calculated in parallel
> - upload in parallel
> - gzip in parallel
> - but we then have to run "hdfs dfs -getmerge" to download all data and
> then write it back again.
>
> Option B: (num=1)
> - single gzip (but gzip is pretty quick)
> - uploads go through a single machine
> - no HDFS commands
> - dataframe is _not_ calculated in parallel (we can see filters getting
> just a single task)
>
> What I'm not sure about is why Spark (1.6.1) decides to run just a single
> task for the calculation, and what we can do about it. We really want the
> df to be calculated in parallel and only _then_ coalesced before being
> written. (It may be that the -getmerge approach will still be faster.)
>
> df.coalesce(100).coalesce(1).write. doesn't look very likely to help!
>
> Adrian
> --
> *Adrian Bridgett*
>