I was exploring HAR-based Hadoop archive files for a similar small-file
scenario. I have millions of log files, each less than 64MB, which I want
to put into HDFS and run analysis on. I'm still exploring whether HDFS is
a good option; traditionally, what I have learnt is that HDFS isn't good
for small files.

-Steve
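P.S. In case it helps, a HAR archive is built with the "hadoop archive"
tool and read back through the har:// filesystem. The paths below are
made up purely for illustration:

    hadoop archive -archiveName logs.har -p /logs/2012-01-01 /archives
    hadoop fs -ls har:///archives/logs.har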
On Fri, Jan 6, 2012 at 12:05 PM, Dave Shine
<dave.sh...@channelintelligence.com> wrote:

> Frank,
>
> We have a very serious small-file problem. I created an M/R job that
> combines files, as it seemed best to use all the resources of the
> cluster rather than opening a stream and combining files
> single-threaded, or trying to do something via the command line.
>
> Dave
>
>
> -----Original Message-----
> From: Frank Grimes [mailto:frankgrime...@gmail.com]
> Sent: Friday, January 06, 2012 2:56 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: Combining AVRO files efficiently within HDFS
>
> Hi Joey,
>
> That's a very good suggestion and might suit us just fine.
>
> However, many of the files will be much smaller than the HDFS block
> size. That could affect the performance of the MapReduce jobs, correct?
> Also, from my understanding, it would put more burden on the name node
> (memory usage) than is necessary.
>
> Assuming we did want to combine the actual files... how would you
> suggest we go about it?
>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>
> > I would do it by staging the machine data into a temporary directory
> > and then renaming the directory when it's been verified. So, data
> > would be written into directories like this:
> >
> > 2012-01/02/00/stage/machine1.log.avro
> > 2012-01/02/00/stage/machine2.log.avro
> > 2012-01/02/00/stage/machine3.log.avro
> >
> > After verification, you'd rename the 2012-01/02/00/stage directory to
> > 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
> > operation, you get the guarantee that you're looking for without
> > having to do extra IO. There shouldn't be a benefit to merging the
> > individual files unless they're too small.
> >
> > -Joey
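For what it's worth, the staging-and-rename pattern above comes down to a
single FileSystem.rename() call. A minimal sketch follows; the class name,
paths, and the verification step are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishHour {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path stage = new Path("2012-01/02/00/stage"); // writers land files here
        Path done  = new Path("2012-01/02/00/done");  // readers only look here

        // ... verify all expected machineN.log.avro files are present
        // and complete before publishing ...

        // HDFS renames are atomic, so "done" appears fully formed or
        // not at all
        if (!fs.rename(stage, done)) {
          throw new IllegalStateException(
              "rename failed: " + stage + " -> " + done);
        }
      }
    }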
> >
> > On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes
> > <frankgrime...@gmail.com> wrote:
> >> Hi Bobby,
> >>
> >> Actually, the problem we're trying to solve is one of completeness.
> >>
> >> Say we have 3 machines generating log events and putting them into
> >> HDFS on an hourly basis, e.g.
> >> 2012-01/01/00/machine1.log.avro
> >> 2012-01/01/00/machine2.log.avro
> >> 2012-01/01/00/machine3.log.avro
> >>
> >> Sometime after the hour, we would have a scheduled job verify that
> >> all the expected machines' log files are present and complete in
> >> HDFS.
> >>
> >> Before launching MapReduce jobs for a given date range, we want to
> >> verify that the job will run over complete data.
> >> If not, the query would error out.
> >>
> >> We want our query/MapReduce layer to not need to be aware of logs at
> >> the machine level, only the presence or absence of an hour's worth
> >> of logs.
> >>
> >> We were thinking that after verifying all the individual log files
> >> for an hour, they could be combined into 2012-01/01/00/log.avro.
> >> The presence of 2012-01-01-00.log.avro would be all that needs to be
> >> verified.
> >>
> >> However, we're new to both Avro and Hadoop, so we're not sure of the
> >> most efficient (and reliable) way to accomplish this.
> >>
> >> Thanks,
> >>
> >> Frank Grimes
> >>
> >>
> >> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
> >>
> >> Frank,
> >>
> >> That depends on what you mean by combining. It sounds like you are
> >> trying to aggregate data from several days, which may involve doing
> >> a join, so I would say a MapReduce job is your best bet. If you are
> >> not going to do any processing at all, then why are you trying to
> >> combine them? Is there something that requires them all to be part
> >> of a single file? MapReduce processing should be able to handle
> >> reading in multiple files just as well as reading in a single file.
> >>
> >> --Bobby Evans
> >>
> >> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I was wondering if there was an easy way to combine multiple .avro
> >> files efficiently, e.g. combining multiple hours of logs into a
> >> daily aggregate.
> >>
> >> Note that our Avro schema might evolve to have new (nullable) fields
> >> added, but no fields will be removed.
> >>
> >> I'd like to avoid needing to pull the data down for combining and a
> >> subsequent "hadoop dfs -put".
> >>
> >> Would https://issues.apache.org/jira/browse/HDFS-222 be able to
> >> handle that automatically?
> >> FYI, the following seems to indicate that Avro files might be easily
> >> combinable: https://issues.apache.org/jira/browse/AVRO-127
> >>
> >> Or is an M/R job the best way to go for this?
> >>
> >> Thanks,
> >>
> >> Frank Grimes
> >
> >
> > --
> > Joseph Echeverria
> > Cloudera, Inc.
> > 443.305.9434
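And on the original question of merging the .avro files in place:
AVRO-127 points at block-level concatenation, and a rough sketch of that
idea using DataFileWriter.appendAllFrom follows, assuming an Avro release
that provides it. The class name and paths are made up, and every input
is assumed to share the output's exact schema and codec:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConcatAvro {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> inputs = Arrays.asList(
            new Path("2012-01/01/00/machine1.log.avro"),
            new Path("2012-01/01/00/machine2.log.avro"),
            new Path("2012-01/01/00/machine3.log.avro"));
        Path output = new Path("2012-01/01/00/log.avro");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>());
        boolean created = false;
        for (Path in : inputs) {
          DataFileStream<GenericRecord> reader =
              new DataFileStream<GenericRecord>(
                  fs.open(in), new GenericDatumReader<GenericRecord>());
          if (!created) {
            // adopt the first input's schema for the combined file
            writer.create(reader.getSchema(), fs.create(output));
            created = true;
          }
          // copy the input's compressed blocks verbatim, without
          // deserializing individual records
          writer.appendAllFrom(reader, false);
          reader.close();
        }
        writer.close();
      }
    }

Because appendAllFrom insists on an identical schema, files already
written with an evolved (extra nullable field) schema would still have to
be rewritten record by record, e.g. via an M/R job.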