Hi all, I am trying to understand the Collect, Spill and Merge process on the map side. I've read some of the documentation but still have a few questions.
Here is my understanding of the spill phase in map:

1. The Collect function adds a record to the buffer.
2. If the buffer exceeds a threshold (determined by parameters like io.sort.mb), the spill phase begins.
3. The spill phase includes 3 actions: sort, combine and compression.
4. Spill may be performed multiple times, so several spill files may be generated.
5. If there is more than one spill file, the Merge phase begins and merges these files into one big file.

If I have misunderstood any of these phases, please correct me, thanks!

And my questions are:

1. Where is the partition calculated (in Collect or in Spill)? Does Collect simply append a record to the buffer and check whether the buffer should be spilled?
2. In the Merge phase, since the spill files are compressed, does it need to decompress these files and compress them again? Since Merge may be performed for more than one round, does it compress the intermediate files?
3. Is the Merge phase on the map side and on the reduce side almost the same (external merge sort combined with a min-heap)?
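To make question 3 concrete, this is the kind of merge I have in mind: a minimal Java sketch of a k-way merge over already-sorted runs using a min-heap (java.util.PriorityQueue). It is only an illustration and not Hadoop's actual code; the class name KWayMergeSketch and the in-memory lists of string keys standing in for spill files are my own assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; this is not Hadoop's merge implementation.
final class KWayMergeSketch {

    // One heap entry: the current head key of a sorted run plus the rest of that run.
    private static final class Head implements Comparable<Head> {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return key.compareTo(other.key);
        }
    }

    // Merges already-sorted runs (stand-ins for spill files) into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>();

        // Seed the heap with the first key of every non-empty run.
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest key across all runs is always at the top of the heap.
            Head smallest = heap.poll();
            merged.add(smallest.key);

            // Refill the heap from the run that the emitted key came from.
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }
}
```

For example, calling KWayMergeSketch.merge on the three runs [a, d, g], [b, e] and [c, f, h] returns the keys a through h in sorted order, and the heap never holds more than one entry per run.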