Spark shuffle consolidateFiles performance degradation numbers
Hi everyone,

I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering whether this is a common scenario among the rest of the community, and if so, whether the setting is worth turning on by default.

From the documentation, it seems like performance could be hurt on ext3 file systems. However, what concrete performance degradation is typically seen? A 2x slowdown in the average job? 3x? Also, what causes the performance degradation on ext3 file systems specifically?

Thanks,
-Matt Cheah
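For context on why the file count explodes, Spark's hash-based shuffle writes one file per (map task, reducer) pair when consolidation is off; with spark.shuffle.consolidateFiles=true, map tasks that run on the same core reuse a shared file group, so the count scales with cores rather than tasks. A rough back-of-the-envelope sketch (the helper function and the example numbers are illustrative, not from this thread or Spark's API):

```python
# Rough estimate of shuffle file counts for Spark's hash-based shuffle.
# Hypothetical helper for illustration; not part of Spark's API.

def hash_shuffle_files(map_tasks, reducers, cores_per_executor, executors,
                       consolidate=False):
    """Approximate number of shuffle files created on disk for one stage."""
    if consolidate:
        # With spark.shuffle.consolidateFiles=true, map tasks running on the
        # same core append to a shared file group: roughly one file per
        # (core, reducer) pair instead of per (task, reducer) pair.
        return executors * cores_per_executor * reducers
    # Without consolidation: every map task writes one file per reducer.
    return map_tasks * reducers

# Example: 1000 map tasks, 1000 reducers, 10 executors with 8 cores each.
print(hash_shuffle_files(1000, 1000, 8, 10))                    # 1000000
print(hash_shuffle_files(1000, 1000, 8, 10, consolidate=True))  # 80000
```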
Re: Spark shuffle consolidateFiles performance degradation numbers
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files, so I'd suggest trying that too.

Matei

On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:

Hey Matt,

There's some prior work that compares consolidation performance on a medium-scale workload: http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

There we noticed about 2x performance degradation in the reduce phase on ext3. I am not aware of any other concrete numbers. Maybe others have more experiences to add.

-Andrew

2014-11-03 17:26 GMT-08:00 Matt Cheah mch...@palantir.com:
[snip]
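The reason the sort-based shuffle creates far fewer files: instead of one file per reducer, each map task writes a single sorted output file plus an index file recording the per-reducer offsets. A sketch of the difference in file counts (illustrative helpers, assuming the behavior described above):

```python
# Rough comparison of per-stage file counts: hash shuffle (no consolidation)
# vs. sort-based shuffle. Illustrative only; exact behavior depends on the
# Spark version.

def hash_shuffle_files(map_tasks, reducers):
    # Hash-based shuffle without consolidation:
    # one file per (map task, reducer) pair.
    return map_tasks * reducers

def sort_shuffle_files(map_tasks):
    # Sort-based shuffle: each map task writes one sorted data file plus
    # one index file with per-reducer offsets, independent of reducer count.
    return map_tasks * 2

print(hash_shuffle_files(1000, 1000))  # 1000000
print(sort_shuffle_files(1000))        # 2000
```

Note the sort-shuffle count no longer depends on the number of reducers at all, which is why it sidesteps the open-files problem as the reduce side scales.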
Re: Spark shuffle consolidateFiles performance degradation numbers
(BTW, this had a bug with negative hash codes in 1.1.0, so you should try branch-1.1 for it.)

Matei

On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files, so I'd suggest trying that too.

[snip]
Re: Spark shuffle consolidateFiles performance degradation numbers
Hey Andrew, Matei,

Thanks for responding.

For some more context, we were running into "Too many open files" issues, where we were seeing this happen immediately after the collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in the spark-env was 256,000, which we believed should have been enough, but even with it set at that number we were still seeing issues. Can you comment on what a good ulimit should be in these cases?

We believe what might have caused this is some process getting orphaned without cleaning up its open file handles. However, other than anecdotal evidence and some speculation, we don't have much to expand on this further.

We were also wondering if we could get more information about how many files get opened during a shuffle. We discussed that it is going to be around N x M, where N is the number of tasks and M is the number of reducers. Does this sound about right? Are there any other considerations we should be aware of when setting consolidateFiles to true?

Thanks,
Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com | 650.226.6338

On 11/3/14 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files, so I'd suggest trying that too.

[snip]
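One distinction worth drawing on the N x M question: that figure is the total number of files *written* over a stage, while the ulimit constrains files open *simultaneously* per process. On one executor, each concurrently running map task keeps one open writer per reducer, so the peak is closer to (concurrent tasks per executor) x (reducers). A hypothetical back-of-the-envelope check (the helper and the overhead fudge factor are assumptions for illustration):

```python
# Hypothetical estimate of peak simultaneously-open shuffle files on a
# single executor, for sizing the file-descriptor ulimit. Not a Spark API.

def peak_open_files_per_executor(cores, reducers, overhead=1000):
    """Estimate peak open file descriptors on one executor.

    Each of the `cores` concurrently running map tasks holds one open
    writer per reducer; `overhead` is a fudge factor for sockets, JARs,
    and other handles the JVM keeps open.
    """
    return cores * reducers + overhead

# Example: 8 cores per executor, 2000 reducers.
needed = peak_open_files_per_executor(8, 2000)
print(needed)           # 17000
print(needed < 256000)  # True -- a 256k ulimit covers this configuration
```

If a 256k ulimit is still being exhausted under estimates like this, that would be consistent with the orphaned-process theory (leaked descriptors) rather than legitimate shuffle demand.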