Re: coalesce serialising earlier work

2016-08-09 Thread ayan guha
How about running a count step to force Spark to materialise the data frame,
and then repartitioning to 1?
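
A minimal sketch of that idea, assuming PySpark; the cache() step and the
variable names are illustrative rather than from the thread (a count on its
own would simply be recomputed by the write):

df = ...  # filtering, calculations, as in the original post
df = df.cache()    # keep the materialised result so the write can reuse it
df.count()         # action: computes df with its normal parallelism
(df.repartition(1)     # single output partition, so a single part file
   .write
   .format("com.databricks.spark.csv")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .save(output_path))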
On 9 Aug 2016 17:11, "Adrian Bridgett"  wrote:

> In short: df.coalesce(1).write seems to force all of the earlier
> calculations on the dataframe through a single task (rather than running
> them on multiple tasks and only sending the final dataframe through a
> single worker).  Any idea how we can force the job to run in parallel?


coalesce serialising earlier work

2016-08-09 Thread Adrian Bridgett
In short: df.coalesce(1).write seems to force all of the earlier
calculations on the dataframe through a single task (rather than running
them on multiple tasks and only sending the final dataframe through a
single worker).  Any idea how we can force the job to run in parallel?


In more detail:

We have a job whose output we want to write as a single CSV file.  We have
two approaches (code below):


df = ...  # filtering, calculations
(df.coalesce(num)
   .write
   .format("com.databricks.spark.csv")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .save(output_path))

Option A: (num=100)
- dataframe calculated in parallel
- upload in parallel
- gzip in parallel
- but we then have to run "hdfs dfs -getmerge" to download all data and 
then write it back again.


Option B: (num=1)
- single gzip (but gzip is pretty quick)
- uploads go through a single machine
- no HDFS commands
- dataframe is _not_ calculated in parallel (we can see filters getting 
just a single task)


What I'm not sure about is why Spark (1.6.1) decides to run just a single 
task for the calculation, and what we can do about it.  We really want the 
df to be calculated in parallel and only _then_ coalesced before being 
written.  (It may be that the -getmerge approach will still be faster.)


df.coalesce(100).coalesce(1).write.  doesn't look very likely to help!
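
A hedged note in line with the reply above (the details are my assumption,
not something stated in this thread): coalesce(1) avoids a shuffle, so Spark
pushes the single partition up into the parent stage and the filtering and
calculations end up running as one task, whereas repartition(1) inserts a
shuffle boundary and leaves the upstream parallelism alone.  A quick way to
see the difference, assuming PySpark (exact plan node names vary by version):

df.coalesce(1).explain()      # no exchange node: the whole stage runs with one partition
df.repartition(1).explain()   # exchange (shuffle) node: upstream stages keep their partitioning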

Adrian

--
*Adrian Bridgett*