Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone,

I'm running into more and more cases where too many files are opened when
spark.shuffle.consolidateFiles is turned off.
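For reference, this is the flag we toggle (a sketch of the standard
spark-defaults.conf mechanism; the value shown is the non-default one we
enable, and it can equally be passed with --conf on spark-submit):

```
# conf/spark-defaults.conf -- illustrative
spark.shuffle.consolidateFiles  true
```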

I was wondering if this is a common scenario among the rest of the
community, and if so, if it is worth considering the setting to be turned on
by default. From the documentation, it seems like the performance could be
hurt on ext3 file systems. However, what are the concrete numbers for the
performance degradation typically seen? A 2x slowdown in the average job?
3x? Also, what causes the performance degradation on ext3 file systems
specifically?

Thanks,

-Matt Cheah








Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
better performance while creating fewer files. So I'd suggest trying that too.
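A minimal way to try it (illustrative config sketch; the property can also
be passed with --conf on spark-submit or set on a SparkConf):

```
# conf/spark-defaults.conf -- illustrative
spark.shuffle.manager  sort
```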

Matei

 On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:
 
 Hey Matt,
 
 There's some prior work that compares consolidation performance on some
 medium-scale workload:
 http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
 
 There we noticed about 2x performance degradation in the reduce phase on
 ext3. I am not aware of any other concrete numbers. Maybe others have more
 experiences to add.
 
 -Andrew
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW, the sort-based shuffle had a bug with negative hash codes in 1.1.0, so 
you should try branch-1.1 for it.)

Matei



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei,

Thanks for responding.

For some more context, we were running into "Too many open files" errors
immediately after the collect phase (about 30 seconds into a run) on a
decently sized dataset (14 million rows). The ulimit set in spark-env was
256,000, which we believe should have been enough, but even at that number
we were still seeing issues.
Can you comment on what a good ulimit would be in these cases?

We suspect that some process got orphaned without cleaning up its open file
handles, but other than anecdotal evidence and some speculation, we don't
have much to back that up.

We were wondering if we could get some more information about how many
files get opened during a shuffle.
We discussed that it is going to be around N x M, where N is the number of
map tasks and M is the number of reducers.
Does this sound about right?
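As a back-of-envelope sketch of that estimate (purely illustrative Python;
the helper name and the cluster numbers are made up, and the consolidated
bound assumes files are pooled per concurrently running core):

```python
def shuffle_file_counts(map_tasks, reducers, cores):
    """Rough shuffle-file counts for the hash-based shuffle.

    Without consolidation, every map task writes one file per reducer,
    so the total is map_tasks * reducers. With
    spark.shuffle.consolidateFiles=true, map tasks that run in the same
    core slot append to a shared file group, so the count is bounded by
    roughly cores * reducers instead.
    """
    unconsolidated = map_tasks * reducers
    consolidated = cores * reducers
    return unconsolidated, consolidated

# e.g. 1000 map tasks, 1000 reducers, 8 concurrent cores per executor
plain, pooled = shuffle_file_counts(1000, 1000, 8)
print(plain, pooled)  # 1000000 8000
```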


Are there any other considerations we should be aware of when setting
consolidateFiles to True?

Thanks, 
Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com | 650.226.6338






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org