Dave, you are right: collect() is called every time a [K,V] pair is inserted into kvbuffer. What I meant here is the point at which all [K,V] pairs have been created and the last collect() has finished :).
But I think that if the map phase produces more output than it reads as input, we need a different procedure.

On Tue, May 3, 2011 at 10:29 PM, Dave Shine <[email protected]> wrote:

> I'm a relative newbie to Hadoop, but your assumption below is not correct
> in my organization. It is common for us to call output.collect() more than
> once in a map() function.
>
> Dave Shine
>
> -----Original Message-----
> From: elton sky [mailto:[email protected]]
> Sent: Tuesday, May 03, 2011 4:49 AM
> To: [email protected]
> Subject: Re: Why mergeParts() is not parallel with collect() on map?
>
> Pls correct me if I am wrong. One of the important assumptions of hadoop
> map reduce is: map's output should be smaller than input. So the workload
> on reduce should be smaller than map phase. That's why we put sort, spill
> and merge all on map side. Reduce just merge sorted output.
>
> > However, typically, the map's merge is much less intensive than the
> > reduce's merge. As a result, this might just bloat the code for little
> > gain, except in the most extreme cases.
>
> In some cases, if the output of map is bigger than input, there might be
> many spill files to be merged.
>
> On Tue, May 3, 2011 at 5:52 PM, Arun C Murthy <[email protected]> wrote:
>
> > Elton,
> >
> > On May 2, 2011, at 11:30 PM, elton sky wrote:
> >
> > > In shuffle phase, reduce copies output from map. In parallel, there are
> > > InMemoryMerger and OnDiskMerger merge copied files if too many. But on
> > > map, the mergeParts() happens only after collect() finished. Why don't
> > > we parallel spills merging with collect()/sort&spill on map?
> >
> > Certainly feasible, please feel free to open a jira for the enhancement.
> >
> > However, typically, the map's merge is much less intensive than the
> > reduce's merge. As a result, this might just bloat the code for little
> > gain, except in the most extreme cases.
> >
> > Arun
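
To make the timing discussed in this thread concrete, here is a minimal toy sketch in plain Java. It is not Hadoop's MapTask code: the class ToySpillBuffer, its capacity parameter, and the in-memory lists standing in for kvbuffer and the spill files are all assumptions made for illustration, and only the method names collect(), sortAndSpill() and mergeParts() echo the names used above. It shows collect() being called many times for one input, each full buffer becoming a sorted spill, and the merge running only after the last collect() has returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model only: names are hypothetical and merely echo those used in this
// thread. This is not Hadoop's MapTask implementation.
class ToySpillBuffer {
    private final int capacity;                                  // records held before a spill
    private final List<String> buffer = new ArrayList<>();       // stands in for kvbuffer
    private final List<List<String>> spills = new ArrayList<>(); // sorted runs ("spill files")

    ToySpillBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Called once per emitted record; a single map() call may invoke it many times.
    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= capacity) {
            sortAndSpill();
        }
    }

    // Sort the buffered records and store them as one sorted run.
    private void sortAndSpill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    int spillCount() {
        return spills.size();
    }

    // Runs only after the last collect(): flush the remainder, then k-way merge all runs.
    List<String> mergeParts() {
        sortAndSpill();
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> spills.get(a[0]).get(a[1]).compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = spills.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ToySpillBuffer out = new ToySpillBuffer(3);
        // One input line, several collect() calls: map output exceeds map input.
        for (String word : "the quick brown fox jumps over the lazy dog".split(" ")) {
            out.collect(word);
        }
        System.out.println("spills before merge: " + out.spillCount());
        System.out.println("merged: " + out.mergeParts());
    }
}
```

In this toy model the number of sorted runs grows with the number of collect() calls, so a map that emits far more than it reads leaves many runs for the final merge, which is the case raised above.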
