Thanks, Shekhar. I'm unfamiliar with Flume, but I will look into it later.
Thanks, but how do I set the number of reducers to X? X depends on the input
(run time), which is unknown at job configuration (compile time).
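For illustration, a rough driver sketch (the class name and file layout are assumptions, not from this thread): the driver runs at job submission time, when the input directory is already known, so X can be counted there and passed to setNumReduceTasks().

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyLogDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);

    // Count the distinct hours present in the input directory.
    // Assumption: each file name starts with its hour, e.g. "2014030112-access.log".
    FileSystem fs = input.getFileSystem(conf);
    Set<String> hours = new HashSet<String>();
    for (FileStatus f : fs.listStatus(input)) {
      hours.add(f.getPath().getName().substring(0, 10));
    }

    Job job = new Job(conf, "hourly weblog parsing");
    job.setJarByClass(HourlyLogDriver.class);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    // ... set mapper, reducer, key and value classes here ...
    job.setNumReduceTasks(hours.size());  // X reducers for X distinct hours
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}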
2014-03-01 17:44 GMT+08:00 AnilKumar B akumarb2...@gmail.com:
Hi,
Write a custom partitioner on the timestamp and, as you mentioned, set
#reducers to X.
You can use MultipleOutputs and construct a custom file name based on the
timestamp.
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
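To make that concrete, a rough sketch of the reduce side (two separate source files; class names are made up, and it assumes the map output key is the hour string, e.g. "2014030112"): the partitioner keeps each hour on one reducer, and MultipleOutputs names the output file after the hour.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends all records of one hour to the same reducer.
public class HourPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text hour, Text value, int numReduceTasks) {
    return (hour.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes each hour's records to a file named after the hour.
public class HourlyReducer extends Reducer<Text, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(Text hour, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text v : values) {
      // baseOutputPath yields files like "2014030112-r-00000".
      out.write(NullWritable.get(), v, hour.toString());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
  }
}

With this write(key, value, baseOutputPath) overload each hour gets its own file named after the hour, so the number of reducers does not even have to match the number of hours exactly.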
Thanks, Devin. We don't want just one file; it's more complicated than that.
If the input folder contains data from X hours, we want X files; if it
contains Y hours, we want Y files.
Obviously, X (or Y) is unknown at compile time.
2014-03-01 20:48 GMT+08:00 Devin Suiter RDX dsui...@rdx.com:
If you only want one file, then
Fengyun,
Is there any particular reason you have to have exactly one file per hour? As
you probably know already, each reducer will output one file, or, if you use
MultipleOutputs as I suggested, a set of files. If you have to fit the
number of reducers to the number of hours you have in the input, and
Thanks, Simon. That's very clear.
2014-03-02 14:53 GMT+08:00 Simon Dong simond...@gmail.com:
Reading the data for each hour shouldn't be a problem, since with Hadoop or the
shell you can do pretty much everything with mmddhh* that you can do with mmddhh.
But if you need the data for the hour all sorted
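For the glob part, a small sketch (the path is made up) that lists one hour's files; the same pattern can be passed to FileInputFormat.addInputPath() or to hadoop fs -ls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListOneHour {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Glob patterns are expanded by the FileSystem, much like shell wildcards.
    FileStatus[] matches = fs.globStatus(new Path("/logs/2014030112*"));
    if (matches != null) {
      for (FileStatus f : matches) {
        System.out.println(f.getPath());
      }
    }
  }
}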
Don't you think using Flume would be easier? Use the HDFS sink and set a
property to roll the log file every hour.
This way you use a single Flume agent to receive the logs as they are
generated, and you will be dumping them directly to HDFS.
If you want to remove unwanted logs, you can write
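Roughly, such an agent configuration might look like this (agent name, source, and paths are made up; the exact property names should be checked against the Flume User Guide):

# One Flume agent per web server: tail the access log and write
# time-bucketed files to HDFS, rolling a new file every hour.
agent1.sources = logsrc
agent1.channels = mem
agent1.sinks = hdfssink

agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
agent1.sources.logsrc.channels = mem

agent1.channels.mem.type = memory

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.channel = mem
agent1.sinks.hdfssink.hdfs.path = /logs/%Y%m%d%H
agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
# roll by time only, not by size or event count
agent1.sinks.hdfssink.hdfs.rollInterval = 3600
agent1.sinks.hdfssink.hdfs.rollSize = 0
agent1.sinks.hdfssink.hdfs.rollCount = 0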
It's a common web log analysis situation. The original web logs are saved
every hour on multiple servers.
Now we would like the parsed log results to be saved as one file per hour. How
can we do that?
In our MR job, the input is a directory with many files spanning many hours,
let's say 4X files across X hours.