Re: Merging reducer outputs into a single part-00000 file

Rasit OZDAS Wed, 14 Jan 2009 00:46:36 -0800

Jim,

As far as I know, no further processing is done after the Reducer.
At first glance, this looks like every task is ending up with the same key.
That can happen in one of the following cases:
- the input format reads the same keys for every task.
- the mapper collects every incoming key-value pair under the same key.
- the reducer does the same.
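
To illustrate why identical keys would leave all the output in one file, here
is a minimal sketch in plain Java with no Hadoop dependency. The class and
method names (PartitionSketch, partitionFor, countPerPartition) are made up for
this example. The formula is, to my knowledge, the one the default
HashPartitioner uses to pick a reduce task for a key, but treat that as an
assumption and check it against your Hadoop version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Assumed to mirror the default hash partitioning: the key's hash code,
    // made non-negative, modulo the number of reduce tasks.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records each reduce task (partition) would receive.
    public static Map<Integer, Integer> countPerPartition(List<String> keys,
                                                          int numReduceTasks) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numReduceTasks), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int reducers = 4; // hypothetical number of reduce tasks

        // Every record carries the same key: only one partition gets data,
        // so only one reducer has anything to write.
        System.out.println(
            countPerPartition(Arrays.asList("k", "k", "k", "k"), reducers));

        // Distinct keys: the records are spread over several partitions.
        System.out.println(
            countPerPartition(Arrays.asList("a", "b", "c", "d"), reducers));
    }
}
```

If the job is configured with a single reduce task, every key lands in
partition 0 regardless of its value, which would also produce just one output
file.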


But if you have some experience, you probably know all this already.
An ordered list means one final file, or am I missing something?

Hope this helps,
Rasit


2009/1/11 Jim Twensky <jim.twen...@gmail.com>:
> Hello,
>
> The original map-reduce paper states: "After successful completion, the
> output of the map-reduce execution is available in the R output files (one
> per reduce task, with file names as specified by the user)." However, when
> using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
> single file called part-00000. I was wondering how and when this merging
> process is done. When the reducer calls output.collect(key,value), is this
> record written to a local temporary output file in the reducer's disk and
> then these local files (a total of R) are later merged into one single file
> with a final thread or is it directly written to the final output file
> (part-00000)? I am asking this because I'd like to get an ordered sample of
> the final output data, i.e. one record per every 1000 records or something
> similar and I don't want to run a serial process that iterates on the final
> output file.
>
> Thanks,
> Jim
>



-- 
M. Raşit ÖZDAŞ
