Re: Combining AVRO files efficiently within HDFS

Frank Grimes Fri, 06 Jan 2012 09:22:31 -0800

Hi Bobby,

Actually, the problem we're trying to solve is one of completeness.

Say we have 3 machines generating log events and writing them to HDFS on an 
hourly basis, e.g.:
2012-01/01/00/machine1.log.avro
2012-01/01/00/machine2.log.avro
2012-01/01/00/machine3.log.avro

Sometime after the hour, we would have a scheduled job verify that all the 
expected machines' log files are present and complete in HDFS.
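
A minimal sketch of that presence check, in Python. It assumes the path layout from the example above and that a listing of existing HDFS paths has already been fetched by some other means (the listing step is not shown). The names missing_logs and hour_dir are made up for illustration. Note that this only checks presence; whether a file is complete (fully written) would need a separate check.

```python
from datetime import datetime


def hour_dir(hour: datetime) -> str:
    """Directory for one hour, following the layout in the example (2012-01/01/00)."""
    return hour.strftime("%Y-%m/%d/%H")


def missing_logs(hour, machines, present_paths):
    """Return the expected per-machine log paths that are absent from the listing."""
    present = set(present_paths)
    expected = [f"{hour_dir(hour)}/{m}.log.avro" for m in machines]
    return [p for p in expected if p not in present]


if __name__ == "__main__":
    # Hypothetical listing: machine2 has not delivered its file yet.
    listing = [
        "2012-01/01/00/machine1.log.avro",
        "2012-01/01/00/machine3.log.avro",
    ]
    machines = ["machine1", "machine2", "machine3"]
    print(missing_logs(datetime(2012, 1, 1, 0), machines, listing))
```

An empty result would mean every expected machine has a file for that hour.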

Before launching MapReduce jobs for a given date range, we want to verify that 
the job will run over complete data. 
If not, the query would error out. 

We want our query/MapReduce layer to not need to be aware of logs at the 
machine level, only of whether an hour's worth of logs is present.

We were thinking that after verifying all of the individual log files for an hour, 
they could be combined into 2012-01/01/00/log.avro.
The presence of that combined file would then be all that needs to be verified.
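
Roughly, the pre-query check over a date range could then look like the sketch below. It assumes the combined file is named log.avro inside each hour's directory (as in 2012-01/01/00/log.avro) and, again, that a listing of existing paths is available from elsewhere. The name incomplete_hours is hypothetical.

```python
from datetime import datetime, timedelta


def incomplete_hours(start, end, present_paths):
    """Return every hour in [start, end) that has no combined log.avro file."""
    present = set(present_paths)
    missing = []
    # Align to the top of the hour so the loop steps through whole hours.
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        if hour.strftime("%Y-%m/%d/%H/log.avro") not in present:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


if __name__ == "__main__":
    # Hypothetical listing: hour 01 has not been combined yet.
    listing = ["2012-01/01/00/log.avro", "2012-01/01/02/log.avro"]
    gaps = incomplete_hours(datetime(2012, 1, 1, 0), datetime(2012, 1, 1, 3), listing)
    if gaps:
        print("refusing to run, incomplete hours:", gaps)
```

If the returned list is non-empty, the query would error out instead of launching the MapReduce job.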

However, we're new to both Avro and Hadoop so not sure of the most efficient 
(and reliable) way to accomplish this.

Thanks,

Frank Grimes

On 2012-01-06, at 11:46 AM, Robert Evans wrote:

> Frank,
> 
> That depends on what you mean by combining. It sounds like you are trying to 
> aggregate data from several days, which may involve doing a join so I would 
> say a MapReduce job is your best bet.  If you are not going to do any 
> processing at all then why are you trying to combine them?  Is there 
> something that requires them all to be part of a single file?  MapReduce 
> processing should be able to handle reading in multiple files just as well as 
> reading in a single file.
> 
> --Bobby Evans
> 
> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
> 
> Hi All,
> 
> I was wondering if there was an easy way to combine multiple .avro files 
> efficiently.
> e.g. combining multiple hours of logs into a daily aggregate
> 
> Note that our Avro schema might evolve to have new (nullable) fields added 
> but no fields will be removed.
> 
> I'd like to avoid needing to pull the data down for combining and subsequent 
> "hadoop dfs -put".
> 
> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that 
> automatically?
> FYI, the following seems to indicate that Avro files might be easily 
> combinable: https://issues.apache.org/jira/browse/AVRO-127
> 
> Or is an M/R job the best way to go for this?
> 
> Thanks,
> 
> Frank Grimes
>
