We have the same scenario as you described. The following is our solution, just 
FYI:

We installed a local scribe agent on every node of our cluster and run several 
central scribe servers. We extended log4j to write logs to the local scribe 
agent; the local scribe agents forward the logs to the central scribe servers, 
and finally the central scribe servers write the logs to a dedicated HDFS 
cluster used for offline processing.
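For illustration only, a minimal sketch of what such a log4j 1.x appender could look like. It assumes a Thrift-generated scribe client (package name, the default agent port 1463, and the "hadoop" category are assumptions, not our exact code):

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import scribe.thrift.LogEntry;   // generated from scribe.thrift; adjust to your namespace
import scribe.thrift.scribe;

import java.util.Collections;

public class ScribeAppender extends AppenderSkeleton {
    private scribe.Client client;
    private TFramedTransport transport;
    private String category = "hadoop";   // scribe category (assumption)

    @Override
    public void activateOptions() {
        try {
            // The scribe agent runs on every node, so we only talk to localhost.
            transport = new TFramedTransport(new TSocket("127.0.0.1", 1463));
            transport.open();
            client = new scribe.Client(new TBinaryProtocol(transport, false, false));
        } catch (Exception e) {
            errorHandler.error("Cannot connect to local scribe agent", e, 0);
        }
    }

    @Override
    protected void append(LoggingEvent event) {
        if (client == null) return;
        try {
            // Format the event with the configured layout and hand it to scribe.
            String line = getLayout().format(event);
            client.Log(Collections.singletonList(new LogEntry(category, line)));
        } catch (Exception e) {
            errorHandler.error("Failed to send log event to scribe", e, 0);
        }
    }

    @Override
    public void close() {
        if (transport != null) transport.close();
    }

    @Override
    public boolean requiresLayout() {
        return true;
    }
}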

Then we use Hive/Impala to analyse the collected logs.
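As an example of the analysis step, the collected logs can be queried programmatically through Hive's JDBC driver (HiveServer2); the host, table and column names below are assumptions, not the actual schema, and the same query can be pointed at Impala by swapping the JDBC URL/driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LogQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Count ERROR lines per host for one day of collected logs;
             // table "cluster_logs" over the scribe HDFS output is assumed.
             ResultSet rs = stmt.executeQuery(
                 "SELECT host, COUNT(*) FROM cluster_logs " +
                 "WHERE dt = '2013-08-06' AND level = 'ERROR' GROUP BY host")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}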

From: Public Network Services <publicnetworkservi...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, August 6, 2013 1:58 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Large-scale collection of logs from multiple Hadoop nodes

Hi...

I am facing a large-scale log-collection scenario on a Hadoop cluster and am 
examining how it should be implemented.

More specifically, imagine a cluster with hundreds of nodes, each of which 
constantly produces syslog events that need to be gathered and analyzed at 
another point. The total volume of logs could be tens of gigabytes per day, if 
not more, and the reception rate on the order of thousands of events per 
second, if not more.

One solution is to send those events over the network (e.g., using Flume) and 
collect them in a small number (fewer than 5) of nodes in the cluster, or in 
another location, where the logs would be processed either by a continuously 
running MapReduce job or by non-Hadoop servers running some log processing 
application.

Another approach could be to deposit all these events into a queuing system 
like ActiveMQ or RabbitMQ.

In all cases, the main objective is to be able to do real-time log analysis.

What would be the best way of implementing the above scenario?

Thanks!

PNS
