As it turns out, this is due to our /tmp partition being too small. We'll either need to increase it or put hadoop.tmp.dir on a bigger partition.
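For reference, the DiskChecker errors below come from the map output spill directories (mapred.local.dir), which default to a path under hadoop.tmp.dir, which in turn defaults to /tmp/hadoop-${user.name}. A minimal sketch of the core-site.xml change, assuming /data/hadoop-tmp is a hypothetical larger partition (the TaskTrackers need a restart to pick it up):

    <!-- core-site.xml -->
    <configuration>
      <property>
        <!-- Scratch space for HDFS and MapReduce; with MRv1,
             mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local. -->
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop-tmp</value>
      </property>
    </configuration>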
On 2012-01-11, at 4:29 PM, Frank Grimes wrote:

> Ok, so I wrote a MapReduce job to merge the files and it appears to be
> working with a limited input set.
> Thanks again, BTW.
>
> However, if I increase the amount of input data I start getting the
> following types of errors:
>
>     org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>     any valid local directory for output/file.out/file.out
>
> or
>
>     org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>     any valid local directory for output/map_0.out
>
> Are there any logs I should be looking at to determine the exact cause of
> these errors?
> Are there any settings I could/should be increasing?
>
> Note that in order to avoid unnecessary sorting overhead, I made each key a
> constant (1L) so that the logs are combined but ordering isn't necessarily
> preserved.
> i.e.
>
>     public static class AvroReachMapper
>             extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {
>
>         public void map(DeliveryLogEvent event,
>                 AvroCollector<Pair<Long, DeliveryLogEvent>> collector,
>                 Reporter reporter) throws IOException {
>             // Constant key: every record lands in the same group, so no
>             // meaningful sort order is imposed.
>             collector.collect(new Pair<Long, DeliveryLogEvent>(1L, event));
>         }
>     }
>
>     public static class Reduce
>             extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {
>
>         @Override
>         public void reduce(Long key, Iterable<DeliveryLogEvent> values,
>                 AvroCollector<DeliveryLogEvent> collector,
>                 Reporter reporter) throws IOException {
>             // Pass every event straight through to the merged output.
>             for (DeliveryLogEvent event : values) {
>                 collector.collect(event);
>             }
>         }
>     }
>
> I've also noticed that /tmp/mapred seems to fill up and doesn't
> automatically get cleaned out.
> Is Hadoop itself supposed to clean up those old temporary work files or do
> we need a cron job for that?
>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-06, at 3:56 PM, Joey Echeverria wrote:
>
>> I would use a MapReduce job to merge them.
>>
>> -Joey
>>
>> On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes <frankgrime...@gmail.com>
>> wrote:
>>> Hi Joey,
>>>
>>> That's a very good suggestion and might suit us just fine.
>>>
>>> However, many of the files will be much smaller than the HDFS block size.
>>> That could affect the performance of the MapReduce jobs, correct?
>>> Also, from my understanding it would put more burden on the name node
>>> (memory usage) than is necessary.
>>>
>>> Assuming we did want to combine the actual files... how would you suggest
>>> we might go about it?
>>>
>>> Thanks,
>>>
>>> Frank Grimes
>>>
>>>
>>> On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
>>>
>>>> I would do it by staging the machine data into a temporary directory
>>>> and then renaming the directory when it's been verified. So, data
>>>> would be written into directories like this:
>>>>
>>>>     2012-01/02/00/stage/machine1.log.avro
>>>>     2012-01/02/00/stage/machine2.log.avro
>>>>     2012-01/02/00/stage/machine3.log.avro
>>>>
>>>> After verification, you'd rename the 2012-01/02/00/stage directory to
>>>> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
>>>> operation, you get the guarantee that you're looking for without having
>>>> to do extra I/O. There shouldn't be a benefit to merging the individual
>>>> files unless they're too small.
>>>>
>>>> -Joey
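Joey's stage-then-rename pattern maps directly onto the HDFS FileSystem API. A minimal sketch of the promotion step (the class name, paths, and verification hook are illustrative, not from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PromoteStagedHour {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical hour partition: writers land files under .../stage.
            Path stage = new Path("2012-01/02/00/stage");
            Path done = new Path("2012-01/02/00/done");

            // ... verify here that every expected machineN.log.avro is
            // present and complete before promoting ...

            // A directory rename in HDFS is a single atomic metadata
            // operation: readers see either the staged layout or the done
            // layout, never a half-promoted hour.
            if (!fs.rename(stage, done)) {
                throw new IllegalStateException("Failed to promote " + stage);
            }
        }
    }

Downstream jobs then only need to test for the existence of the done directory before running.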
>>>> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrime...@gmail.com>
>>>> wrote:
>>>>> Hi Bobby,
>>>>>
>>>>> Actually, the problem we're trying to solve is one of completeness.
>>>>>
>>>>> Say we have 3 machines generating log events and putting them into HDFS
>>>>> on an hourly basis.
>>>>> e.g.
>>>>>     2012-01/01/00/machine1.log.avro
>>>>>     2012-01/01/00/machine2.log.avro
>>>>>     2012-01/01/00/machine3.log.avro
>>>>>
>>>>> Sometime after the hour, we would have a scheduled job verify that all
>>>>> the expected machines' log files are present and complete in HDFS.
>>>>>
>>>>> Before launching MapReduce jobs for a given date range, we want to
>>>>> verify that the job will run over complete data.
>>>>> If not, the query would error out.
>>>>>
>>>>> We want our query/MapReduce layer not to need to be aware of logs at
>>>>> the machine level, only whether or not an hour's worth of logs is
>>>>> present.
>>>>>
>>>>> We were thinking that after verifying all the individual log files for
>>>>> an hour, they could be combined into 2012-01/01/00/log.avro.
>>>>> The presence of 2012-01/01/00/log.avro would be all that needs to be
>>>>> verified.
>>>>>
>>>>> However, we're new to both Avro and Hadoop, so we're not sure of the
>>>>> most efficient (and reliable) way to accomplish this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Frank Grimes
>>>>>
>>>>>
>>>>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>>>>>
>>>>> Frank,
>>>>>
>>>>> That depends on what you mean by combining. It sounds like you are
>>>>> trying to aggregate data from several days, which may involve doing a
>>>>> join, so I would say a MapReduce job is your best bet. If you are not
>>>>> going to do any processing at all then why are you trying to combine
>>>>> them? Is there something that requires them all to be part of a single
>>>>> file? MapReduce processing should be able to handle reading in
>>>>> multiple files just as well as reading in a single file.
>>>>>
>>>>> --Bobby Evans
>>>>>
>>>>> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I was wondering if there was an easy way to combine multiple .avro
>>>>> files efficiently.
>>>>> e.g. combining multiple hours of logs into a daily aggregate
>>>>>
>>>>> Note that our Avro schema might evolve to have new (nullable) fields
>>>>> added, but no fields will be removed.
>>>>>
>>>>> I'd like to avoid needing to pull the data down for combining and a
>>>>> subsequent "hadoop dfs -put".
>>>>>
>>>>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle
>>>>> that automatically?
>>>>> FYI, the following seems to indicate that Avro files might be easily
>>>>> combinable: https://issues.apache.org/jira/browse/AVRO-127
>>>>>
>>>>> Or is an M/R job the best way to go for this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Frank Grimes
>>>>>
>>>>
>>>> --
>>>> Joseph Echeverria
>>>> Cloudera, Inc.
>>>> 443.305.9434
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
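For reference on the original question: Avro data files whose writer schemas are identical can be concatenated block-by-block, without deserializing any records, which is what makes a non-MapReduce merge cheap. A minimal sketch, assuming an Avro release that provides DataFileWriter.appendAllFrom and the hypothetical hourly layout from the thread:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ConcatAvroFiles {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path hourDir = new Path("2012-01/01/00");         // hypothetical layout
            Path merged = new Path("2012-01/01/00/log.avro");

            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
            OutputStream out = fs.create(merged);
            boolean first = true;

            FileStatus[] parts =
                fs.globStatus(new Path(hourDir, "machine*.log.avro"));
            for (FileStatus part : parts) {
                InputStream in = fs.open(part.getPath());
                DataFileStream<GenericRecord> reader =
                    new DataFileStream<GenericRecord>(
                        in, new GenericDatumReader<GenericRecord>());
                if (first) {
                    // Adopt the first file's schema and start the container.
                    writer.create(reader.getSchema(), out);
                    first = false;
                }
                // Copy the compressed data blocks directly, record-decoding
                // skipped entirely (recompress = false).
                writer.appendAllFrom(reader, false);
                reader.close();
            }
            writer.close();
        }
    }

Note the block-level copy only works while the writer schemas match exactly; once the schema evolves (new nullable fields), the files would have to be merged record-by-record or with the MapReduce job discussed above.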