There has to be a simpler way :)
On Tue, Apr 28, 2009 at 9:22 PM, jason hadoop <jason.had...@gmail.com> wrote:

> It may be simpler to just have a post-processing step that uses something
> like multi-file input to aggregate the results.
>
> As a complete sideways-thinking solution: I suspect you have far more map
> tasks than you have physical machines. Instead of writing your output via
> output.collect, your tasks could open a 'side-effect file' and append to
> it; since these are in the local file system, you actually have the
> ability to append to them. You will need to play some interesting games
> with the OutputCommitter, though.
>
> An alternative would be to write N output records, where N is the number
> of reduces, each of the N keys is guaranteed to go to a unique reduce
> task, and the value of the record is the local file name and the host
> name. The side-effect files would need to be written into the job working
> area or some public area on the node, rather than the task output area;
> or the output committer could place them in the proper place (that way
> failed tasks are handled correctly).
>
> Each reduce then reads the keys it has, opens and concatenates whatever
> files are on its machine, and very, very little sorting happens.
>
> 2009/4/28 Dmitry Pushkarev <u...@stanford.edu>
>
>> Hi,
>>
>> I'm writing streaming-based tasks that involve running thousands of
>> mappers. After that, I want to put all these outputs into a small number
>> (say 30) of output files, mainly so that disk space will be used more
>> efficiently. The way I'm doing it right now is using /bin/cat as the
>> reducer and setting the number of reducers as desired. This involves two
>> steps that are highly inefficient for the task - sorting and fetching.
>> Is there a way to get around that?
>>
>> Ideally I'd want all mapper outputs to be written to one file, one
>> record per line.
>>
>> Thanks.
>>
>> ---
>> Dmitry Pushkarev
>> +1-650-644-8988
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
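The "N output records, each guaranteed to go to a unique reduce task" suggestion above hinges on one detail: Hadoop's default HashPartitioner routes a key to partition `hash(key) % numReduceTasks`, so you need N keys whose hashes cover all N partitions. A minimal sketch of that search, in Python for illustration only (a real job would use a Java `Partitioner`; the function name and key format here are hypothetical, and Python's `hash()` stands in for Java's `hashCode()`):

```python
def keys_for_partitions(num_reducers):
    """Greedily find one string key per partition under a
    hash(key) % num_reducers partitioner (stand-in for Hadoop's
    default HashPartitioner)."""
    chosen = {}  # partition index -> key that lands on it
    candidate = 0
    while len(chosen) < num_reducers:
        key = "k%d" % candidate          # hypothetical key naming scheme
        part = hash(key) % num_reducers  # mimic HashPartitioner routing
        if part not in chosen:
            chosen[part] = key
        candidate += 1
    # one key per partition, ordered by partition index
    return [chosen[p] for p in range(num_reducers)]

if __name__ == "__main__":
    keys = keys_for_partitions(4)
    # each of the 4 keys maps to a different reducer
    print(sorted(hash(k) % 4 for k in keys))
```

Each map task would then emit its (local file name, host name) record under the key for some fixed partition, and each reducer receives exactly the records aimed at it, with almost nothing to sort.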