As far as I can tell, the amount of ram available has no effect on the
merge. Regardless of what you set the io.sort.mb and io.sort.factor to, it
will eventually end up attempting to bring the entire output into memory to
merge. If the mb and factor are set low, it will simply require more
passes
Yes. Spills of map output get merged to single file. The spills are triggered
by the buffer size set using the configuration property io.sort.mb. Obviously
bigger value for io.sort.mb is preferred for better performance --- but the
limit is to be set based on the amount of RAM available.
Also, t
That's my understanding as well. I can't seem to find any settings that
govern the step where the output is merged into a single file.
io.sort.factor modifies the number of passes that is done, but it
eventually ends up doing the same thing no matter how many spill files
there are. They're simply
Intermediate data from the map phase is written to disk by the Mapper.
After that, the data will be sent to Reducer(s) and it will perform 3
steps:
- shuffle: where all output data from mappers are sorted as input to
the Reducer(s);
- sort: output data from mappers are grouped by key. This is d
My understanding is that the mapper will cache the output in memory until its
memory buffer fills up, at which point it will sort the data and spill it to
disk. Once a given number of spill files are created they will be merged
together into a larger spill file. Once the mapper finishes then t