Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi All, I was wondering if there was an easy way to combine multiple .avro files efficiently, e.g. combining multiple hours of logs into a daily aggregate. Note that our Avro schema might evolve to have new (nullable) fields added but no fields will be removed. I'd like to avoid needing to pull
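
One straightforward non-MapReduce option is to stream each hourly file out of HDFS and append its records to a single daily writer, using the newest schema as the reader schema so that records written before the nullable fields were added simply resolve to null. A minimal, single-threaded sketch with the stock Avro and Hadoop Java APIs (the AvroConcat class name and method signature are illustrative, not from the thread):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroConcat {
  /** Appends every record from the hourly files into one daily Avro container file. */
  public static void merge(FileSystem fs, Path[] hourlyFiles, Path dailyFile,
                           Schema dailySchema) throws IOException {
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(dailySchema));
    FSDataOutputStream out = fs.create(dailyFile);
    // Note: the merged file is written uncompressed unless a codec is set on the writer.
    writer.create(dailySchema, out);
    try {
      for (Path hourly : hourlyFiles) {
        FSDataInputStream in = fs.open(hourly);
        // Each file's own (possibly older) writer schema is resolved against
        // dailySchema, so newly added nullable fields read back as null.
        DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
            in, new GenericDatumReader<GenericRecord>(dailySchema));
        try {
          for (GenericRecord record : reader) {
            writer.append(record);
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      writer.close();  // flushes and closes the underlying HDFS stream
    }
  }
}

The obvious trade-off, raised later in the thread, is that this copies all the data single threaded through one process rather than as a parallel job.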

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Robert Evans
Frank, That depends on what you mean by combining. It sounds like you are trying to aggregate data from several days, which may involve doing a join, so I would say a MapReduce job is your best bet. If you are not going to do any processing at all, then why are you trying to combine them? Is th

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi Bobby, Actually, the problem we're trying to solve is one of completeness. Say we have 3 machines generating log events and writing them to HDFS on an hourly basis, e.g. 2012-01/01/00/machine1.log.avro 2012-01/01/00/machine2.log.avro 2012-01/01/00/machine3.log.avro Sometime after the hour,

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Joey Echeverria
I would do it by staging the machine data into a temporary directory and then renaming the directory when it's been verified. So, data would be written into directories like this: 2012-01/02/00/stage/machine1.log.avro 2012-01/02/00/stage/machine2.log.avro 2012-01/02/00/stage/machine3.log.avro Aft
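
A minimal sketch of this staging-and-rename pattern with the Hadoop FileSystem API (the verification step is left as a placeholder, and the "ready" directory name is an assumption):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PromoteHour {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stage = new Path("2012-01/02/00/stage");     // writers land hourly files here
    Path published = new Path("2012-01/02/00/ready"); // downstream jobs only read from here

    // ... verify that machine1..machineN have each delivered their hourly file ...

    // rename() is a cheap metadata-only operation in HDFS, so the whole hour
    // becomes visible to readers in one step once it has been verified.
    if (!fs.rename(stage, published)) {
      throw new IOException("Failed to promote " + stage + " to " + published);
    }
  }
}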

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi Joey, That's a very good suggestion and might suit us just fine. However, many of the files will be much smaller than the HDFS block size. That could affect the performance of the MapReduce jobs, correct? Also, from my understanding it would put more burden on the name node (memory usage) tha

RE: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Dave Shine
- From: Frank Grimes [mailto:frankgrime...@gmail.com] Sent: Friday, January 06, 2012 2:56 PM To: hdfs-user@hadoop.apache.org Subject: Re: Combining AVRO files efficiently within HDFS Hi Joey, That's a very good suggestion and might suit us just fine. However, many of the files will be much sm

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Steve Edison
> rather than opening a stream and combining files single threaded or trying > to do something via command line. > > Dave > > > -----Original Message----- > From: Frank Grimes [mailto:frankgrime...@gmail.com] > Sent: Friday, January 06, 2012 2:56 PM > To: hdfs-user@

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Joey Echeverria
I would use a MapReduce job to merge them. -Joey On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes wrote: > Hi Joey, > > That's a very good suggestion and might suit us just fine. > > However, many of the files will be much smaller than the HDFS block size. > That could affect the performance of the
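
The job itself isn't shown in the thread; one possible shape for such a merge, sketched here against the newer org.apache.avro.mapreduce API (Avro 1.7+) rather than the org.apache.avro.mapred classes current at the time, relies on the default identity map and reduce and forces a single reducer so that one merged container file comes out (the schema file and paths are placeholders):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvroMergeJob {
  public static void main(String[] args) throws Exception {
    // The newest log-event schema; older input files are resolved against it,
    // so added nullable fields read back as null.
    Schema schema = new Schema.Parser().parse(new File("LogEvent.avsc"));

    Job job = Job.getInstance(new Configuration(), "merge hourly avro logs");
    job.setJarByClass(AvroMergeJob.class);

    // Avro container files in, one Avro container file out.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setInputKeySchema(job, schema);
    AvroJob.setMapOutputKeySchema(job, schema);
    AvroJob.setOutputKeySchema(job, schema);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);

    // The default (identity) Mapper and Reducer are enough: the job only exists to
    // funnel every record through a single reducer, which yields one output file.
    job.setNumReduceTasks(1);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an hourly glob like 2012-01/01/*
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // daily aggregate directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A single reducer does mean all the data funnels through one task; for large days the reducer count can be raised, at the cost of getting several output files instead of one.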

Re: Combining AVRO files efficiently within HDFS

2012-01-11 Thread Frank Grimes
Ok, so I wrote a MapReduce job to merge the files and it appears to be working with a limited input set. Thanks again, BTW. However, if I increase the amount of input data, I start getting the following types of errors: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any va

Re: Combining AVRO files efficiently within HDFS

2012-01-12 Thread Frank Grimes
As it turns out, this is due to our /tmp partition being too small. We'll either need to increase it or put hadoop.tmp.dir on a bigger partition. On 2012-01-11, at 4:29 PM, Frank Grimes wrote: > Ok, so I wrote a MapReduce job to merge the files and it appears to be > working with a limited inpu