Re: Map-Reduce: How to make MR output one file an hour?

Simon Dong Sat, 01 Mar 2014 21:16:18 -0800

Fengyun,

Is there any particular reason you have to have exactly 1 file per hour? As
you probably know already, each reducer will output 1 file, or if you use
MultipleOutputs as I suggested, a set of files. If you have to fit the
number of reducers to the number of hours in the input, and
generate the number of files accordingly, it will most likely be at the
expense of cluster efficiency and performance. The worst case, of
course, is a bunch of data all within the same hour: then you
have to settle for 1 reducer without any parallelization at all.


A workaround is to use MultipleOutputs to generate a set of files for each
hour, with the hour being the base name. Or, if you so choose, a
sub-directory for each hour. For example, if you use mmddhh as the base
name, you will have a set of files for each hour like:

030119-r-00000
...
030119-r-0000n
030120-r-00000
...
030120-r-0000n

Or in a sub-directory:

030119/part-r-00000
...
030119/part-r-0000n
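
As a rough sketch of how those base names could be derived, the helper below formats a record timestamp as mmddhh. The class and method names (HourBaseName, flat, subDir) are made up for illustration, and UTC is an assumed time zone. The idea is that inside a reducer the resulting string would be passed as the baseOutputPath argument of MultipleOutputs.write(key, value, baseOutputPath); that call is only mentioned in a comment so the snippet runs without Hadoop on the classpath.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hadoop: builds per-hour base names.
final class HourBaseName {
    // "MMddHH" gives the mmddhh form; UTC is an assumption for this sketch.
    private static final DateTimeFormatter MMDDHH =
            DateTimeFormatter.ofPattern("MMddHH").withZone(ZoneOffset.UTC);

    // Flat layout: the hour itself is the base name, e.g. 030119-r-00000.
    static String flat(long epochMillis) {
        return MMDDHH.format(Instant.ofEpochMilli(epochMillis));
    }

    // Sub-directory layout: e.g. 030119/part-r-00000.
    static String subDir(long epochMillis) {
        return flat(epochMillis) + "/part";
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2014-03-01T19:16:18Z").toEpochMilli();
        // In a reducer, one of these strings would be handed to
        // MultipleOutputs.write(key, value, baseOutputPath).
        System.out.println(flat(ts));   // 030119
        System.out.println(subDir(ts)); // 030119/part
    }
}
```

The framework appends the -r-nnnnn suffix itself, so each reducer that sees records for a given hour adds one file to that hour's set.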

You can then use a wildcard to glob the output, either for manual processing
or as the input path for subsequent jobs.
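
To illustrate how a wildcard picks out one hour's set of files, here is a small sketch using the JDK's own glob matcher as a stand-in. Hadoop has its own glob handling for input paths, so treat this only as a demonstration of the pattern shape; the class name HourGlob and the sample file names are made up.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: filters file names with a glob pattern.
final class HourGlob {
    // Returns the names that match the glob, in their original order.
    static List<String> select(String glob, List<String> names) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> output = Arrays.asList(
                "030119-r-00000", "030119-r-00001", "030120-r-00000");
        // All files for hour 19 on March 1.
        System.out.println(select("030119-r-*", output));
        // All files for any hour on March 1.
        System.out.println(select("0301??-r-*", output));
    }
}
```

The first pattern selects only the two 030119 files; the second selects all three.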

-Simon



On Sat, Mar 1, 2014 at 7:37 PM, Fengyun RAO <raofeng...@gmail.com> wrote:

> Thanks Devin. We don't just want one file. It's complicated.
>
> If the input folder contains data from X hours, we want X files;
> if Y hours, we want Y files.
>
> Obviously, X or Y is unknown at compile time.
>
> 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX <dsui...@rdx.com>:
>
>> If you only want one file, then you need to set the number of reducers to
>> 1.
>>
>> If the size of the data makes it impractical for the original MR job to
>> use a single reducer, you can run a second job on the output of the first,
>> with the default mapper and reducer, which are the identity ones, and set
>> numReducers = 1 for that job.
>>
>> Or use the hadoop fs -getmerge command to collate the results into one file.
>> On Mar 1, 2014 4:59 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>
>>> Thanks, but how do I set the number of reducers to X? X depends on the
>>> input (run time), which is unknown at job configuration (compile time).
>>>
>>>
>>> 2014-03-01 17:44 GMT+08:00 AnilKumar B <akumarb2...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Write a custom partitioner on <timestamp> and, as you mentioned, set
>>>> #reducers to X.
>>>>
>>>>
>>>>
>>>
>
