Ok, so I wrote a MapReduce job to merge the files and it appears to be working with a limited input set. Thanks again, BTW.
However, if I increase the amount of input data I start getting the following types of errors:

  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out/file.out

or

  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_0.out

Are there any logs I should be looking at to determine the exact cause of these errors?
Are there any settings I could/should be increasing?

Note that in order to avoid unnecessary sorting overhead, I made each key a constant (1L) so that the logs are combined but ordering isn't necessarily preserved, i.e.:

  // Emits every DeliveryLogEvent under the constant key 1L so that all events
  // end up in a single reduce group.
  public static class AvroReachMapper
      extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {
    public void map(DeliveryLogEvent levent,
                    AvroCollector<Pair<Long, DeliveryLogEvent>> collector,
                    Reporter reporter) throws IOException {
      collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
    }
  }

  // Passes each grouped event straight through to the merged output.
  public static class Reduce
      extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {
    @Override
    public void reduce(Long key, Iterable<DeliveryLogEvent> values,
                       AvroCollector<DeliveryLogEvent> collector,
                       Reporter reporter) throws IOException {
      for (DeliveryLogEvent event : values) {
        collector.collect(event);
      }
    }
  }

I've also noticed that /tmp/mapred seems to fill up and doesn't automatically get cleaned out.
Is Hadoop itself supposed to clean up those old temporary work files, or do we need a cron job for that?

Thanks,

Frank Grimes
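(For reference, a minimal driver sketch for a merge job shaped like the classes above, using the old org.apache.avro.mapred API. The class name, job name, and argument handling are placeholders, and it assumes DeliveryLogEvent is a generated Avro specific record so that DeliveryLogEvent.SCHEMA$ exists; treat it as an illustration, not the exact job in use. One thing it makes explicit: with a constant 1L key, every record is shuffled to a single reduce task, so that one task's mapred.local.dir must hold the entire merge, which would be consistent with /tmp/mapred filling up and the DiskChecker errors appearing as the input grows.)

  import org.apache.avro.Schema;
  import org.apache.avro.mapred.AvroJob;
  import org.apache.avro.mapred.Pair;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical driver; AvroReachMapper and Reduce are the classes shown
  // above and are assumed to be visible from here.
  public class MergeDeliveryLogs {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MergeDeliveryLogs.class);
      conf.setJobName("merge-delivery-logs");

      // Assumes DeliveryLogEvent is an Avro specific record with SCHEMA$.
      Schema event = DeliveryLogEvent.SCHEMA$;
      AvroJob.setInputSchema(conf, event);
      AvroJob.setMapOutputSchema(conf,
          Pair.getPairSchema(Schema.create(Schema.Type.LONG), event));
      AvroJob.setOutputSchema(conf, event);
      AvroJob.setMapperClass(conf, AvroReachMapper.class);
      AvroJob.setReducerClass(conf, Reduce.class);

      // The constant 1L key sends every record to the same partition, so a
      // single reduce task performs the whole merge on its local disks.
      conf.setNumReduceTasks(1);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input hour dir(s)
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // merged output dir
      JobClient.runJob(conf);
    }
  }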
On 2012-01-06, at 3:56 PM, Joey Echeverria wrote:

> I would use a MapReduce job to merge them.
>
> -Joey
>
> On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes <frankgrime...@gmail.com> wrote:
>> Hi Joey,
>>
>> That's a very good suggestion and might suit us just fine.
>>
>> However, many of the files will be much smaller than the HDFS block size.
>> That could affect the performance of the MapReduce jobs, correct?
>> Also, from my understanding it would put more burden on the name node
>> (memory usage) than is necessary.
>>
>> Assuming we did want to combine the actual files... how would you suggest
>> we might go about it?
>>
>> Thanks,
>>
>> Frank Grimes
>>
>>
>> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>>
>>> I would do it by staging the machine data into a temporary directory
>>> and then renaming the directory when it's been verified. So, data
>>> would be written into directories like this:
>>>
>>> 2012-01/02/00/stage/machine1.log.avro
>>> 2012-01/02/00/stage/machine2.log.avro
>>> 2012-01/02/00/stage/machine3.log.avro
>>>
>>> After verification, you'd rename the 2012-01/02/00/stage directory to
>>> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
>>> operation, you get the guarantee that you're looking for without having
>>> to do extra IO. There shouldn't be a benefit to merging the individual
>>> files unless they're too small.
>>>
>>> -Joey
>>>
>>> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrime...@gmail.com> wrote:
>>>> Hi Bobby,
>>>>
>>>> Actually, the problem we're trying to solve is one of completeness.
>>>>
>>>> Say we have 3 machines generating log events and putting them to HDFS
>>>> on an hourly basis, e.g.
>>>>
>>>> 2012-01/01/00/machine1.log.avro
>>>> 2012-01/01/00/machine2.log.avro
>>>> 2012-01/01/00/machine3.log.avro
>>>>
>>>> Sometime after the hour, we would have a scheduled job verify that all
>>>> the expected machines' log files are present and complete in HDFS.
>>>>
>>>> Before launching MapReduce jobs for a given date range, we want to
>>>> verify that the job will run over complete data.
>>>> If not, the query would error out.
>>>>
>>>> We want our query/MapReduce layer to not need to be aware of logs at
>>>> the machine level, only the presence or not of an hour's worth of logs.
>>>>
>>>> We were thinking that after verifying all the individual log files for
>>>> an hour, they could be combined into 2012-01/01/00/log.avro.
>>>> The presence of 2012-01-01-00.log.avro would be all that needs to be
>>>> verified.
>>>>
>>>> However, we're new to both Avro and Hadoop, so we're not sure of the
>>>> most efficient (and reliable) way to accomplish this.
>>>>
>>>> Thanks,
>>>>
>>>> Frank Grimes
>>>>
>>>>
>>>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>>>>
>>>> Frank,
>>>>
>>>> That depends on what you mean by combining. It sounds like you are
>>>> trying to aggregate data from several days, which may involve doing a
>>>> join, so I would say a MapReduce job is your best bet. If you are not
>>>> going to do any processing at all, then why are you trying to combine
>>>> them? Is there something that requires them all to be part of a single
>>>> file? MapReduce processing should be able to handle reading in multiple
>>>> files just as well as reading in a single file.
>>>>
>>>> --Bobby Evans
>>>>
>>>> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I was wondering if there was an easy way to combine multiple .avro files
>>>> efficiently, e.g. combining multiple hours of logs into a daily aggregate.
>>>>
>>>> Note that our Avro schema might evolve to have new (nullable) fields
>>>> added, but no fields will be removed.
>>>>
>>>> I'd like to avoid needing to pull the data down for combining and a
>>>> subsequent "hadoop dfs -put".
>>>>
>>>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle
>>>> that automatically?
>>>> FYI, the following seems to indicate that Avro files might be easily
>>>> combinable: https://issues.apache.org/jira/browse/AVRO-127
>>>>
>>>> Or is an M/R job the best way to go for this?
>>>>
>>>> Thanks,
>>>>
>>>> Frank Grimes
>>>>
>>>>
>>>
>>>
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
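(A minimal sketch of the verify-then-rename step Joey describes above, using the HDFS FileSystem API: stage an hour's files, check them, and atomically rename stage/ to done/ so downstream jobs only ever see complete hours. The directory layout follows his example; the class name, method, and the simple "expected file count" check are placeholders for whatever verification logic is actually used.)

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical "publish the hour" step: verify the staged files, then
  // atomically rename stage/ to done/ so readers see either no published
  // hour or a fully verified one.
  public class PublishHour {
    public static boolean publish(Configuration conf, String hourDir, int expectedFiles)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      Path stage = new Path(hourDir, "stage");   // e.g. 2012-01/02/00/stage
      Path done  = new Path(hourDir, "done");    // e.g. 2012-01/02/00/done

      // Stand-in verification: all expected machine logs have arrived.
      FileStatus[] staged = fs.listStatus(stage);
      if (staged == null || staged.length != expectedFiles) {
        return false;  // not all machines have reported yet
      }

      // Directory rename is atomic in HDFS.
      return fs.rename(stage, done);
    }
  }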