Re: Map-Reduce: How to make MR output one file an hour?

2014-03-02 Thread Fengyun RAO
Thanks, Shekhar. I'm unfamiliar with Flume, but I will look into it later. 2014-03-02 15:36 GMT+08:00 Shekhar Sharma shekhar2...@gmail.com: Don't you think using Flume would be easier? Use the HDFS sink and set a property to roll the log file every hour. This way you use a single

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread AnilKumar B
Hi, Write a custom partitioner on the timestamp and, as you mentioned, set #reducers to X.
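A rough sketch of the arithmetic such a partitioner might use, kept free of Hadoop dependencies so it runs on its own. The class and method names (`HourPartitionSketch`, `hourBucket`, `partitionFor`) and the `firstHourBucket` parameter are hypothetical, not from the thread; in a real job this logic would presumably live inside `getPartition(key, value, numPartitions)` of a class extending `org.apache.hadoop.mapreduce.Partitioner`. It also assumes timestamps are epoch milliseconds and not negative.

```java
import java.util.concurrent.TimeUnit;

public class HourPartitionSketch {

    /** Whole hours since the Unix epoch for a timestamp in milliseconds (assumed non-negative). */
    static long hourBucket(long epochMillis) {
        return TimeUnit.MILLISECONDS.toHours(epochMillis);
    }

    /**
     * Maps a record's timestamp to a partition index in [0, numPartitions).
     * firstHourBucket is the hour bucket of the earliest hour in the input
     * (an assumption: the driver would have to supply it). When numPartitions
     * equals the number of distinct consecutive hours X, each hour lands in
     * its own partition.
     */
    static int partitionFor(long epochMillis, long firstHourBucket, int numPartitions) {
        long offset = hourBucket(epochMillis) - firstHourBucket;
        // floorMod keeps the result non-negative even if offset is negative.
        return (int) Math.floorMod(offset, (long) numPartitions);
    }

    public static void main(String[] args) {
        long oneHourMillis = TimeUnit.HOURS.toMillis(1);
        long firstHour = 1L; // hypothetical earliest hour bucket in the input
        for (int h = 1; h <= 4; h++) {
            System.out.println("hour bucket " + h + " -> partition "
                    + partitionFor(h * oneHourMillis, firstHour, 4));
        }
    }
}
```

If the number of reducers is smaller than the number of hours, several hours share one partition, so this only gives one hour per reducer when the reducer count matches X.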

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
Thanks, but how do we set the number of reducers to X? X depends on the input (run time), which is unknown at job configuration (compile time). 2014-03-01 17:44 GMT+08:00 AnilKumar B akumarb2...@gmail.com: Hi, Write a custom partitioner on the timestamp and, as you mentioned, set #reducers to X.

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Simon Dong
You can use MultipleOutputs and construct the custom file name based on timestamp. http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html On Fri, Feb 28, 2014 at 11:44 PM, Fengyun RAO raofeng...@gmail.com wrote: It's a common web log analysis
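As a hedged illustration of building a file name from a timestamp, the sketch below formats an epoch-millisecond timestamp into an hourly name. The class name `HourlyOutputName`, the method `baseOutputPath`, the `yyyyMMddHH` pattern, and the choice of UTC are all assumptions for illustration and are not taken from the thread. It uses only the JDK so it can run without Hadoop.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyOutputName {

    // Assumed naming scheme: year, month, day, hour in UTC, e.g. 2014022815.
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    /** Returns the hourly base name for a record timestamp given in epoch milliseconds. */
    static String baseOutputPath(long epochMillis) {
        return HOUR_FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 2014-02-28 15:00:00 UTC, expressed in epoch milliseconds.
        long sample = 1393545600000L + 15L * 3_600_000L;
        System.out.println(baseOutputPath(sample));
    }
}
```

In a reducer, a string like this could be passed as the `baseOutputPath` argument of `MultipleOutputs.write(key, value, baseOutputPath)`. Every reducer that receives records for a given hour writes its own file under that name, so getting exactly one file per hour still depends on how the keys are partitioned.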

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
Thanks Devin. We don't just want one file. It's complicated. If the input folder contains data spanning X hours, we want X files; if Y hours, we want Y files. Obviously, X or Y is unknown at compile time. 2014-03-01 20:48 GMT+08:00 Devin Suiter RDX dsui...@rdx.com: If you only want one file, then

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Simon Dong
Fengyun, Is there any particular reason you have to have exactly 1 file per hour? As you probably know already, each reducer will output 1 file, or if you use MultipleOutputs as I suggested, a set of files. If you have to fit the number of reducers to the number of hours you have from the input, and

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Fengyun RAO
Thanks, Simon. That's very clear. 2014-03-02 14:53 GMT+08:00 Simon Dong simond...@gmail.com: Reading data for each hour shouldn't be a problem, as for Hadoop or shell you can pretty much do everything with mmddhh* as you can do with mmddhh. But if you need the data for the hour all sorted

Re: Map-Reduce: How to make MR output one file an hour?

2014-03-01 Thread Shekhar Sharma
Don't you think using Flume would be easier? Use the HDFS sink and set a property to roll the log file every hour. This way you use a single Flume agent to receive logs as they are generated, and you dump them directly to HDFS. If you want to remove unwanted logs you can write

Map-Reduce: How to make MR output one file an hour?

2014-02-28 Thread Fengyun RAO
It's a common web log analysis situation. The original weblog is saved every hour on multiple servers. Now we would like the parsed log results to be saved as one file per hour. How can we do that? In our MR job, the input is a directory with many files covering many hours, let's say 4X files in X hours. if