Fengyun,

Is there any particular reason you have to have exactly 1 file per hour? As you probably know already, each reducer will output 1 file, or if you use MultipleOutputs as I suggested, a set of files. If you have to fit the number of reducers to the number of hours you have in the input, and generate the number of files accordingly, it will most likely be at the expense of cluster efficiency and performance. The worst case, of course, is when a bunch of data all falls within the same hour: then you have to settle for 1 reducer without any parallelization at all.

A workaround is to use MultipleOutputs to generate a set of files for each hour, with the hour as the base name, or, if you so choose, a sub-directory for each hour. For example, if you use mmddhh as the base name, you will have a set of files for each hour like:

030119-r-00000
...
030119-r-0000n
030120-r-00000
...
030120-r-0000n

Or in a sub-directory:

030119/part-r-00000
...
030119/part-r-0000n

You can then use a wildcard to glob the output, either for manual processing or as the input path for subsequent jobs.

-Simon

On Sat, Mar 1, 2014 at 7:37 PM, Fengyun RAO <raofeng...@gmail.com> wrote:

> Thanks Devin. We don't just want one file. It's complicated.
>
> if the input folder contains data in X hours, we want X files,
> if Y hours, we want Y files.
>
> obviously, X or Y is unknown on compile time.
>
> 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX <dsui...@rdx.com>:
>
>> If you only want one file, then you need to set the number of reducers to
>> 1.
>>
>> If the size of the data makes the original MR job impractical to use a
>> single reducer, you run a second job on the output of the first, with the
>> default mapper and reducer, which are the Identity- ones, and set that
>> numReducers = 1.
>>
>> Or use hdfs getmerge function to collate the results to one file.
>> On Mar 1, 2014 4:59 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>
>>> Thanks, but how to set reducer number to X? X is dependent on input
>>> (run-time), which is unknown on job configuration (compile time).
>>>
>>>
>>> 2014-03-01 17:44 GMT+08:00 AnilKumar B <akumarb2...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Write the custom partitioner on <timestamp> and as you mentioned set
>>>> #reducers to X.
>>>>
>>>>
>>>>
>>>
>
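
To make the mmddhh naming scheme above a bit more concrete, here is a rough sketch of how the per-hour base name could be derived from a record's timestamp. The class HourlyOutputNames and its two methods are hypothetical names of my own, not part of Hadoop, and I am assuming the timestamp has already been parsed into a LocalDateTime. It is plain Java so it can be tried without a cluster.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Hadoop): builds the per-hour base name
// that would be handed to MultipleOutputs as the base output path.
final class HourlyOutputNames {
    // "MMddHH" corresponds to the mmddhh example: month, day, hour of day.
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("MMddHH");

    private HourlyOutputNames() {}

    // Flat layout: files would come out as <mmddhh>-r-00000, <mmddhh>-r-00001, ...
    static String flatBaseName(LocalDateTime recordTime) {
        return HOUR_FORMAT.format(recordTime);
    }

    // Sub-directory layout: files would come out as <mmddhh>/part-r-00000, ...
    static String subDirBaseName(LocalDateTime recordTime) {
        return HOUR_FORMAT.format(recordTime) + "/part";
    }

    public static void main(String[] args) {
        // Assumed sample timestamp: March 1, 2014, 19:37.
        LocalDateTime t = LocalDateTime.of(2014, 3, 1, 19, 37);
        System.out.println(flatBaseName(t));
        System.out.println(subDirBaseName(t));
    }
}
```

In the reducer, the returned string would be the third argument of MultipleOutputs.write(key, value, baseOutputPath), with the MultipleOutputs instance created in setup() and closed in cleanup(). Since every reducer can write to any hour, the number of reducers no longer has to match the number of hours.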