Ari, thanks for your note.

I'd like to understand more about how Chukwa groups log entries. Suppose I have 
appA running on machines X and Y, and appB running on machines Y and Z, each of 
them calling the Chukwa log API.

Do all entries go into the same HDFS file, or into four separate HDFS files 
based on the app/machine combination?

If the answer to the first question is "yes", what happens if appA and appB 
have different log-entry formats?
If the answer to the second question is "yes", are all these HDFS files cut at 
the same time boundary?

It looks like in Chukwa, the application first logs to a local daemon, which 
buffer-writes the log entries into a local file.  A separate process then 
ships this data to a remote collector daemon, which issues the actual HDFS 
writes.  I see two sources of overhead ...

1) The extra write to local disk, plus the cost of shipping the data over to 
the collector.  If HDFS supported append, the application could write directly 
to HDFS (see the sketch after this list).

2) The centralized collector introduces a bottleneck into an otherwise 
perfectly parallel HDFS architecture.
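
To make point 1 concrete, here is a minimal sketch of what direct logging 
could look like if append worked; it assumes a Hadoop version where 
FileSystem.append() is actually usable, and the hostname, port, and paths are 
invented:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DirectHdfsLogger {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // "namenode" is a placeholder for the real NameNode host.
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
          // One file per app/machine combination, e.g. /logs/appA/machineX.log
          Path log = new Path("/logs/appA/machineX.log");

          // Append if the file already exists, otherwise create it.
          FSDataOutputStream out = fs.exists(log) ? fs.append(log) : fs.create(log);
          out.writeBytes("2009-04-10T00:06:00 appA some-event\n");
          out.close();
      }
  }

Without a working append, two processes cannot safely share one file, which is 
exactly the gap the agent/collector pipeline fills.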

Am I missing something here?

Rgds,
Ricky

-----Original Message-----
From: Ariel Rabkin [mailto:asrab...@gmail.com] 
Sent: Monday, April 13, 2009 7:38 AM
To: core-user@hadoop.apache.org
Subject: Re: HDFS as a logfile ??

Chukwa is a Hadoop subproject aiming to do something similar, though
particularly for the case of Hadoop logs.  You may find it useful.

Hadoop unfortunately does not support concurrent appends.  As a
result, the Chukwa project found itself creating a whole new daemon,
the Chukwa collector, precisely to merge the event streams and write
them out just once. We're set to do a release within the next week or
two, but in the meantime you can check it out from SVN at
https://svn.apache.org/repos/asf/hadoop/chukwa/trunk
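
The single-writer pattern the collector relies on can be sketched generically: 
many receiver threads enqueue incoming chunks, and one thread serializes them 
into a single HDFS stream, so HDFS never sees concurrent writers. This is only 
an illustration of the pattern, not Chukwa's actual collector code (all names 
are invented):

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class MergingCollector implements Runnable {
      // Receiver threads put chunks here; the single writer thread drains it.
      private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>();
      private final FSDataOutputStream sink;

      public MergingCollector(FileSystem fs, Path out) throws java.io.IOException {
          this.sink = fs.create(out);  // one create, one writer: no appends required
      }

      public void offer(byte[] chunk) throws InterruptedException {
          queue.put(chunk);            // called from per-connection receiver threads
      }

      public void run() {
          try {
              while (true) {
                  sink.write(queue.take());  // all event streams merged into one file
              }
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
          } catch (java.io.IOException e) {
              throw new RuntimeException(e);
          }
      }
  }

The fan-in this creates is the bottleneck Ricky raises; running several 
collectors in parallel is the usual way to spread that load.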

--Ari

On Fri, Apr 10, 2009 at 12:06 AM, Ricky Ho <r...@adobe.com> wrote:
> I want to analyze the traffic pattern and statistics of a distributed 
> application.  I am thinking of having the application write the events as log 
> entries into HDFS, and then later use a Map/Reduce task to do the analysis in 
> parallel.  Is this a good approach?
>
> In this case, does HDFS support concurrent writes (appends) to a file?
> Another question: is the write API thread-safe?
>
> Rgds,
> Ricky
>
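
For the analysis half of the question, once the entries are in HDFS a 
Map/Reduce pass over them is straightforward. A minimal mapper sketch using 
the old mapred API, assuming one event per line with the event name as the 
second whitespace-separated field (the log format here is invented); pair it 
with a standard summing reducer:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Counts occurrences of each event type across all log files.
  public class EventCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
              throws IOException {
          String[] fields = line.toString().split("\\s+");
          if (fields.length >= 2) {
              out.collect(new Text(fields[1]), ONE);  // field 1 = event name (assumed)
          }
      }
  }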



-- 
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department
