Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei, thanks for responding. For some more context, we were running into "Too many open files" issues, which we saw happen immediately after the collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in the spark-env was 256,0…
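
For readers following along, a rough back-of-the-envelope sketch of why the limit gets hit so quickly: with the hash-based shuffle and consolidation off, each map task writes one intermediate file per reduce partition. This is general Spark 1.x behavior, not a quote from the thread, and the task counts below are made up:

    // With the hash-based shuffle and consolidation off, each of M map tasks
    // writes one intermediate file per reduce partition, so files ~ M * R.
    val mapTasks = 1000          // illustrative value
    val reducePartitions = 1000  // illustrative value
    val shuffleFiles = mapTasks * reducePartitions  // ~1,000,000 files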

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0, so you should try branch-1.1 for it.)

Matei

> On Nov 3, 2014, at 6:28 PM, Matei Zaharia wrote:
>
> In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
> better performance while creating fewer files. So I'd suggest…

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too.

Matei

> On Nov 3, 2014, at 6:12 PM, Andrew Or wrote:
>
> Hey Matt,
>
> There's some prior work that compares consolidation performance on…
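
For reference, a minimal sketch of applying the suggested setting when constructing a context. The property name and value are quoted from the message; the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch from the hash-based shuffle to the sort-based shuffle added in
    // Spark 1.1, which writes far fewer intermediate files per map task.
    val conf = new SparkConf()
      .setAppName("sort-shuffle-example") // illustrative name
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)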

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Andrew Or
Hey Matt,

There's some prior work that compares consolidation performance on some medium-scale workload:
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
There we noticed about a 2x performance degradation in the reduce phase on ext3. I am not aware of a…

Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone,

I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, whether it is worth turning the setting on by default. From the…
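
For context, a minimal sketch of what turning the setting on looks like per application, rather than by default. The property name is taken from the message above; everything else is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // With consolidation on, map tasks running on the same core append to a
    // shared pool of shuffle files instead of each opening one file per reducer.
    val conf = new SparkConf()
      .setAppName("consolidate-files-example") // illustrative name
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)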