There has to be a simpler way :)
On Tue, Apr 28, 2009 at 9:22 PM, jason hadoop <jason.had...@gmail.com> wrote:

> It may be simpler to just have a post-processing step that uses something
> like multi-file input to aggregate the results.
>
> As a complete sideways-thinking solution: I suspect you have far more map
> tasks than you have physical machines. Instead of writing your output via
> output.collect, your tasks could open a 'side-effect file' and append to
> it; since these are in the local file system, you actually have the
> ability to append to them. You will need to play some interesting games
> with the OutputCommitter, though.
>
> An alternative would be to write N output records, where N is the number
> of reduces, each of the N keys is guaranteed to go to a unique reduce
> task, and the value of the record is the local file name and the host
> name. The side-effect files would need to be written into the job working
> area or some public area on the node, rather than the task output area;
> or the output committer could place them in the proper place (that way
> failed tasks are handled correctly).
>
> Each reduce then reads the keys it has, opens and concatenates whatever
> files are on its machine, and very, very little sorting happens.
>
> 2009/4/28 Dmitry Pushkarev <u...@stanford.edu>
>
>> Hi,
>>
>> I'm writing streaming-based tasks that involve running thousands of
>> mappers. After that, I want to put all these outputs into a small number
>> (say 30) of output files, mainly so that disk space will be used more
>> efficiently. The way I'm doing it right now is using /bin/cat as the
>> reducer and setting the number of reducers as desired. This involves two
>> steps that are highly inefficient for the task - sorting and fetching.
>> Is there a way to get around that?
>>
>> Ideally I'd want all mapper outputs to be written to one file, one
>> record per line.
>>
>> Thanks.
>>
>> ---
>> Dmitry Pushkarev
>> +1-650-644-8988
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
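The "N output records, each guaranteed to go to a unique reduce task" suggestion above hinges on one detail: Hadoop's default HashPartitioner routes a key to partition `hash(key) % numReduceTasks`, so you need N keys whose hashes cover all N partitions. A minimal sketch of that search, in Python for illustration only (a real job would use a Java `Partitioner`; the function name and key format here are hypothetical, and Python's `hash()` stands in for Java's `hashCode()`):

```python
def keys_for_partitions(num_reducers):
    """Greedily find one string key per partition under a
    hash(key) % num_reducers partitioner (stand-in for Hadoop's
    default HashPartitioner)."""
    chosen = {}  # partition index -> key that lands on it
    candidate = 0
    while len(chosen) < num_reducers:
        key = "k%d" % candidate          # hypothetical key naming scheme
        part = hash(key) % num_reducers  # mimic HashPartitioner routing
        if part not in chosen:
            chosen[part] = key
        candidate += 1
    # one key per partition, ordered by partition index
    return [chosen[p] for p in range(num_reducers)]

if __name__ == "__main__":
    keys = keys_for_partitions(4)
    # each of the 4 keys maps to a different reducer
    print(sorted(hash(k) % 4 for k in keys))
```

Each map task would then emit its (local file name, host name) record under the key for some fixed partition, and each reducer receives exactly the records aimed at it, with almost nothing to sort.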