I was exploring HAR-based Hadoop archive files for a similar small-file
scenario. I have millions of log files, each less than 64MB, which I want
to put into HDFS and run analysis on. I'm still exploring whether HDFS is
a good option; traditionally, what I have learnt is that HDFS isn't good
for small files.

-Steve
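P.S. In case it helps, a HAR archive is built with the "hadoop archive"
tool and read back through the har:// filesystem. The paths below are
made up purely for illustration:

    hadoop archive -archiveName logs.har -p /logs/2012-01-01 /archives
    hadoop fs -ls har:///archives/logs.har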
On Fri, Jan 6, 2012 at 12:05 PM, Dave Shine
<dave.sh...@channelintelligence.com> wrote:

> Frank,
>
> We have a very serious small-file problem. I created an M/R job that
> combines files, as it seemed best to use all the resources of the
> cluster rather than opening a stream and combining files
> single-threaded, or trying to do something via the command line.
>
> Dave
>
>
> -----Original Message-----
> From: Frank Grimes [mailto:frankgrime...@gmail.com]
> Sent: Friday, January 06, 2012 2:56 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: Combining AVRO files efficiently within HDFS
>
> Hi Joey,
>
> That's a very good suggestion and might suit us just fine.
>
> However, many of the files will be much smaller than the HDFS block
> size. That could affect the performance of the MapReduce jobs, correct?
> Also, from my understanding, it would put more burden on the name node
> (memory usage) than is necessary.
>
> Assuming we did want to combine the actual files... how would you
> suggest we go about it?
>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>
> > I would do it by staging the machine data into a temporary directory
> > and then renaming the directory when it's been verified. So, data
> > would be written into directories like this:
> >
> > 2012-01/02/00/stage/machine1.log.avro
> > 2012-01/02/00/stage/machine2.log.avro
> > 2012-01/02/00/stage/machine3.log.avro
> >
> > After verification, you'd rename the 2012-01/02/00/stage directory to
> > 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
> > operation, you get the guarantee that you're looking for without
> > having to do extra IO. There shouldn't be a benefit to merging the
> > individual files unless they're too small.
> >
> > -Joey
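For what it's worth, the staging-and-rename pattern above comes down to a
single FileSystem.rename() call. A minimal sketch follows; the class name,
paths, and the verification step are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishHour {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path stage = new Path("2012-01/02/00/stage"); // writers land files here
        Path done  = new Path("2012-01/02/00/done");  // readers only look here

        // ... verify all expected machineN.log.avro files are present
        // and complete before publishing ...

        // HDFS renames are atomic, so "done" appears fully formed or
        // not at all
        if (!fs.rename(stage, done)) {
          throw new IllegalStateException(
              "rename failed: " + stage + " -> " + done);
        }
      }
    }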
> >
> > On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes
> > <frankgrime...@gmail.com> wrote:
> >> Hi Bobby,
> >>
> >> Actually, the problem we're trying to solve is one of completeness.
> >>
> >> Say we have 3 machines generating log events and putting them into
> >> HDFS on an hourly basis, e.g.
> >> 2012-01/01/00/machine1.log.avro
> >> 2012-01/01/00/machine2.log.avro
> >> 2012-01/01/00/machine3.log.avro
> >>
> >> Sometime after the hour, we would have a scheduled job verify that
> >> all the expected machines' log files are present and complete in
> >> HDFS.
> >>
> >> Before launching MapReduce jobs for a given date range, we want to
> >> verify that the job will run over complete data.
> >> If not, the query would error out.
> >>
> >> We want our query/MapReduce layer to not need to be aware of logs at
> >> the machine level, only the presence or absence of an hour's worth
> >> of logs.
> >>
> >> We were thinking that after verifying all the individual log files
> >> for an hour, they could be combined into 2012-01/01/00/log.avro.
> >> The presence of 2012-01-01-00.log.avro would be all that needs to be
> >> verified.
> >>
> >> However, we're new to both Avro and Hadoop, so we're not sure of the
> >> most efficient (and reliable) way to accomplish this.
> >>
> >> Thanks,
> >>
> >> Frank Grimes
> >>
> >>
> >> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
> >>
> >> Frank,
> >>
> >> That depends on what you mean by combining. It sounds like you are
> >> trying to aggregate data from several days, which may involve doing
> >> a join, so I would say a MapReduce job is your best bet. If you are
> >> not going to do any processing at all, then why are you trying to
> >> combine them? Is there something that requires them all to be part
> >> of a single file? MapReduce processing should be able to handle
> >> reading in multiple files just as well as reading in a single file.
> >>
> >> --Bobby Evans
> >>
> >> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I was wondering if there was an easy way to combine multiple .avro
> >> files efficiently, e.g. combining multiple hours of logs into a
> >> daily aggregate.
> >>
> >> Note that our Avro schema might evolve to have new (nullable) fields
> >> added, but no fields will be removed.
> >>
> >> I'd like to avoid needing to pull the data down for combining and a
> >> subsequent "hadoop dfs -put".
> >>
> >> Would https://issues.apache.org/jira/browse/HDFS-222 be able to
> >> handle that automatically?
> >> FYI, the following seems to indicate that Avro files might be easily
> >> combinable: https://issues.apache.org/jira/browse/AVRO-127
> >>
> >> Or is an M/R job the best way to go for this?
> >>
> >> Thanks,
> >>
> >> Frank Grimes
> >
> >
> > --
> > Joseph Echeverria
> > Cloudera, Inc.
> > 443.305.9434
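And on the original question of merging the .avro files in place:
AVRO-127 points at block-level concatenation, and a rough sketch of that
idea using DataFileWriter.appendAllFrom follows, assuming an Avro release
that provides it. The class name and paths are made up, and every input
is assumed to share the output's exact schema and codec:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConcatAvro {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> inputs = Arrays.asList(
            new Path("2012-01/01/00/machine1.log.avro"),
            new Path("2012-01/01/00/machine2.log.avro"),
            new Path("2012-01/01/00/machine3.log.avro"));
        Path output = new Path("2012-01/01/00/log.avro");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>());
        boolean created = false;
        for (Path in : inputs) {
          DataFileStream<GenericRecord> reader =
              new DataFileStream<GenericRecord>(
                  fs.open(in), new GenericDatumReader<GenericRecord>());
          if (!created) {
            // adopt the first input's schema for the combined file
            writer.create(reader.getSchema(), fs.create(output));
            created = true;
          }
          // copy the input's compressed blocks verbatim, without
          // deserializing individual records
          writer.appendAllFrom(reader, false);
          reader.close();
        }
        writer.close();
      }
    }

Because appendAllFrom insists on an identical schema, files already
written with an evolved (extra nullable field) schema would still have to
be rewritten record by record, e.g. via an M/R job.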