Re: MultipleOutputs or Partitioner

Alex Kozlov Mon, 10 May 2010 09:34:12 -0700

Hi Alan,

On Mon, May 10, 2010 at 5:08 AM, Some Body <[email protected]> wrote:


> Hi,
>
> I'm trying to understand how to generate multiple outputs in my reducer
> (using 0.20.2+228).
> Do I need MultipleOutput or should I partition my output in the mapper?
>
>
The deciding factor is scalability.  If you are OK with running only 2 (or N)
reducers, one for "morning" and one for "afternoon", and the groups are
roughly the same size, implement a custom partitioner.  That approach does
not scale, though, because the number of reducers is fixed by the number of
groups.
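
As a rough illustration of the partitioner idea, here is a minimal sketch of
the routing logic only, written without Hadoop dependencies so it can be run
on its own.  The class name, the method name and the fixed list of periods
are my assumptions based on the sample keys in this thread.  In a real job
this logic would sit in the getPartition(key, value, numPartitions) method of
a Partitioner subclass, and the job would be configured with one reducer per
period.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: picks a partition from the trailing "_<period>" of a key. */
final class PeriodPartitioning {
    // Assumed period names, taken from the sample keys (e.g. "hostA_VarX_2010-05-01_morning").
    static final List<String> PERIODS = Arrays.asList("morning", "afternoon");

    static int partitionFor(String key, int numPartitions) {
        // Everything after the last underscore is treated as the period.
        String period = key.substring(key.lastIndexOf('_') + 1);
        int index = PERIODS.indexOf(period);
        if (index < 0) {
            // Unknown period: fall back to a non-negative hash so the record is still routed.
            index = period.hashCode() & Integer.MAX_VALUE;
        }
        return index % numPartitions;
    }
}
```

With 2 reducers, every "morning" key goes to partition 0 and every
"afternoon" key to partition 1, regardless of host, variable or date.  Note
that this yields one file per period, not one per date and period.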

A better approach is to leave the number of reducers flexible and merge the
resulting part files afterwards, either with 'hadoop fs -getmerge' or with
custom Java code.
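
For the custom Java route, a minimal sketch of the merge step might look like
the following.  It is only a local-filesystem analogue of what getmerge does
(concatenating the part files in name order) and uses java.nio rather than
the Hadoop FileSystem API; the class name, method name and the "part-*" glob
are my assumptions.  The target file should live outside the directory being
merged.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: concatenates reducer part files into a single local file. */
final class PartFileMerger {
    static void merge(Path outputDir, Path target) throws IOException {
        // Collect the part files; the glob is an assumption about the file naming.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        // Sort by name so the merged file follows reducer order.
        Collections.sort(parts);
        // Append each part file's bytes to the target, creating or overwriting it.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

Calling PartFileMerger.merge(dir, file) on a local copy of the job output
directory produces one combined file; splitting it further per date and
period would still need an extra pass over the keys.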

Alex K


> My reducer currently gets key/val input pairs like these, which all end up
> in my part-r-00000 file.
>
>    hostA_VarX_2010-05-01_morning    <FLOATVAL>
>    hostA_VarY_2010-05-01_morning    <FLOATVAL>
>    hostA_VarX_2010-05-01_afternoon    <FLOATVAL>
>    hostA_VarY_2010-05-01_afternoon    <FLOATVAL>
>    .....
>    hostB_VarX_2010-05-01_morning    <FLOATVAL>
>    hostB_VarY_2010-05-01_morning    <FLOATVAL>
>    hostB_VarX_2010-05-01_afternoon    <FLOATVAL>
>    hostB_VarY_2010-05-01_afternoon    <FLOATVAL>
>    .....
>    hostA_VarX_2010-05-02_morning    <FLOATVAL>
>    hostA_VarY_2010-05-02_morning    <FLOATVAL>
>    hostA_VarX_2010-05-02_afternoon    <FLOATVAL>
>    hostA_VarY_2010-05-02_afternoon    <FLOATVAL>
>    .....
>    hostB_VarX_2010-05-02_morning    <FLOATVAL>
>    hostB_VarY_2010-05-02_morning    <FLOATVAL>
>    hostB_VarX_2010-05-02_afternoon    <FLOATVAL>
>    hostB_VarY_2010-05-02_afternoon    <FLOATVAL>
>    .....
>
> But instead of 1 output file I want one output file per day/group. e.g.
>    2010-05-01_morning.txt
>    2010-05-01_afternoon.txt
>
> Each <date>_<time>.txt file would contain all keys/vals for all hosts &
> VarNames
>
> Thanks,
> Alan
