Re: Hadoop map reduce merge algorithm

2012-01-13 Thread Bai Shen
As far as I can tell, the amount of ram available has no effect on the merge. Regardless of what you set the io.sort.mb and io.sort.factor to, it will eventually end up attempting to bring the entire output into memory to merge. If the mb and factor are set low, it will simply require more passes

Re: Hadoop map reduce merge algorithm

2012-01-12 Thread Ravi Gummadi
Yes. Spills of map output get merged to single file. The spills are triggered by the buffer size set using the configuration property io.sort.mb. Obviously bigger value for io.sort.mb is preferred for better performance --- but the limit is to be set based on the amount of RAM available. Also, t

Re: Hadoop map reduce merge algorithm

2012-01-12 Thread Bai Shen
That's my understanding as well. I can't seem to find any settings that govern the step where the output is merged into a single file. io.sort.factor modifies the number of passes that is done, but it eventually ends up doing the same thing no matter how many spill files there are. They're simply

Re: Hadoop map reduce merge algorithm

2012-01-12 Thread Wellington Chevreuil
Intermediate data from the map phase is written to disk by the Mapper. After that, the data will be sent to Reducer(s) and it will perform 3 steps: - shuffle: where all output data from mappers are sorted as input to the Reducer(s); - sort: output data from mappers are grouped by key. This is d

Re: Hadoop map reduce merge algorithm

2012-01-12 Thread Robert Evans
My understanding is that the mapper will cache the output in memory until its memory buffer fills up, at which point it will sort the data and spill it to disk. Once a given number of spill files are created they will be merged together into a larger spill file. Once the mapper finishes then t