Re: Shuffle file consolidation

2014-05-29 Thread Nathan Kronenfeld
Thanks, I missed that. One thing is still unclear to me, even after reading it: does this parameter have to be set on each worker when the cluster starts up, or can an individual client job set it? On Fri, May 23, 2014 at 10:13 AM, Han JU ju.han.fe...@gmail.com wrote:

Re: Shuffle file consolidation

2014-05-29 Thread Matei Zaharia
It can be set in an individual application. Consolidation had some issues on ext3, as mentioned there, though we might enable it by default in the future because other optimizations have now made it perform on par with the non-consolidated version. It also had some bugs in 0.9.0 so I’d suggest at
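
For readers of the archive, here is a minimal sketch of what setting the property from an individual application might look like. It is not taken from the thread. It assumes Spark 0.9 or later on the classpath (where `SparkConf` is available); the object name, application name, and `local[2]` master are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the original thread.
object ConsolidateFilesExample {
  def main(args: Array[String]): Unit = {
    // Build the configuration inside the application itself, so the
    // setting applies to this job only and not to the whole cluster.
    val conf = new SparkConf()
      .setAppName("consolidate-files-example") // placeholder name
      .setMaster("local[2]")                   // assumption: local test run
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)
    try {
      // Read the value back to confirm what this application will use.
      println(conf.get("spark.shuffle.consolidateFiles"))
    } finally {
      sc.stop()
    }
  }
}
```

Because the value lives in the application's own `SparkConf`, other jobs on the same cluster keep whatever default they were started with.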

Shuffle file consolidation

2014-05-23 Thread Nathan Kronenfeld
In trying to sort some largish datasets, we came across the spark.shuffle.consolidateFiles property, and I found in the source code that it is set, by default, to false, with a note to default it to true when the feature is stable. Does anyone know what is unstable about this? If we set it true,

Re: Shuffle file consolidation

2014-05-23 Thread Han JU
Hi Nathan, There's some explanation in the Spark configuration section: ``` If set to true, consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to true