Hi All,
I was wondering if there was an easy way to combine multiple .avro files
efficiently.
e.g. combining multiple hours of logs into a daily aggregate
Note that our Avro schema might evolve to have new (nullable) fields added but
no fields will be removed.
I'd like to avoid needing to pull the files out of HDFS, merge them locally,
and write the result back.
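For reference, the kind of single-threaded merge I'd rather avoid running looks
roughly like this (a sketch only, with a hypothetical class name; it streams
each hourly file and rewrites the records using the newest schema as the reader
schema, which only works if the added nullable fields declare "default": null):

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NaiveAvroMerge {
      // args[0] = newest .avsc schema file, args[1] = output, args[2..] = inputs
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Schema newest = new Schema.Parser().parse(fs.open(new Path(args[0])));
        DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(newest));
        out.create(newest, fs.create(new Path(args[1])));
        for (int i = 2; i < args.length; i++) {
          // The writer schema comes from each file's header; records resolve
          // to the newest schema, with the new fields defaulting to null.
          DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
              fs.open(new Path(args[i])),
              new GenericDatumReader<GenericRecord>(null, newest));
          for (GenericRecord record : in) {
            out.append(record);
          }
          in.close();
        }
        out.close();
      }
    }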
Frank,
That depends on what you mean by combining. It sounds like you are trying to
aggregate data from several days, which may involve doing a join, so I would
say a MapReduce job is your best bet. If you are not going to do any processing
at all, then why are you trying to combine them?
Hi Bobby,
Actually, the problem we're trying to solve is one of completeness.
Say we have 3 machines generating log events and writing them to HDFS on an
hourly basis.
e.g.
2012-01/01/00/machine1.log.avro
2012-01/01/00/machine2.log.avro
2012-01/01/00/machine3.log.avro
Sometime after the hour, we need to know that every machine's file for that
hour has arrived, so that the data can be treated as complete.
I would do it by staging the machine data into a temporary directory
and then renaming the directory when it's been verified. So, data
would be written into directories like this:
2012-01/02/00/stage/machine1.log.avro
2012-01/02/00/stage/machine2.log.avro
2012-01/02/00/stage/machine3.log.avro
After the data has been verified, the stage directory would be renamed so that
the hour's files become visible in their final location.
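The publish step can then be a cheap, metadata-only rename, along these lines
(a sketch; the verification check and the "final" directory name are just
placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishHour {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path stage = new Path("2012-01/02/00/stage");
        Path finalDir = new Path("2012-01/02/00/final"); // placeholder name
        // Verification is application-specific; here we just check that all
        // three machines delivered a file for the hour.
        boolean complete = fs.exists(new Path(stage, "machine1.log.avro"))
            && fs.exists(new Path(stage, "machine2.log.avro"))
            && fs.exists(new Path(stage, "machine3.log.avro"));
        // An HDFS rename is an atomic namenode operation, so readers never
        // see a partially published hour.
        if (complete && fs.rename(stage, finalDir)) {
          System.out.println("Published " + finalDir);
        }
      }
    }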
Hi Joey,
That's a very good suggestion and might suit us just fine.
However, many of the files will be much smaller than the HDFS block size.
That could affect the performance of the MapReduce jobs, correct?
Also, from my understanding it would put more burden on the name node (memory
usage) than fewer, larger files would, since every file and block is tracked in
the name node's heap (roughly 150 bytes per object, as a rule of thumb).
-----Original Message-----
From: Frank Grimes [mailto:frankgrime...@gmail.com]
Sent: Friday, January 06, 2012 2:56 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Combining AVRO files efficiently within HDFS
Hi Joey,
That's a very good suggestion and might suit us just fine.
However, many of the files will be much smaller than the HDFS block size.
That could affect the performance of the MapReduce jobs, correct?
> …rather than opening a stream and combining files single-threaded, or trying
> to do something via the command line.
>
> Dave
>
>
> -Original Message-
> From: Frank Grimes [mailto:frankgrime...@gmail.com]
> Sent: Friday, January 06, 2012 2:56 PM
> To: hdfs-user@hadoop.apache.org
I would use a MapReduce job to merge them.
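For example (a sketch against the org.apache.avro.mapreduce API; the schema
file and class names are placeholders): an identity mapper plus a single
reducer funnels every record into one merged Avro output file.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeAvro {
      public static class IdentityMapper extends
          Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {
        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context ctx)
            throws java.io.IOException, InterruptedException {
          ctx.write(key, NullWritable.get());
        }
      }

      public static class IdentityReducer extends
          Reducer<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {
        @Override
        protected void reduce(AvroKey<GenericRecord> key, Iterable<NullWritable> values, Context ctx)
            throws java.io.IOException, InterruptedException {
          for (NullWritable v : values) {
            ctx.write(key, v); // emit once per occurrence, keeping duplicates
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser()
            .parse(MergeAvro.class.getResourceAsStream("/LogEvent.avsc")); // placeholder
        Job job = Job.getInstance(new Configuration(), "merge-avro");
        job.setJarByClass(MergeAvro.class);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setNumReduceTasks(1); // one reducer => one merged output file
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        AvroJob.setInputKeySchema(job, schema);
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The single reducer trades write parallelism for one output file; the maps still
read the hourly files in parallel.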
-Joey
On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes wrote:
> Hi Joey,
>
> That's a very good suggestion and might suit us just fine.
>
> However, many of the files will be much smaller than the HDFS block size.
> That could affect the performance of the MapReduce jobs, correct?
Ok, so I wrote a MapReduce job to merge the files and it appears to be working
with a limited input set.
Thanks again, BTW.
However, if I increase the amount of input data I start getting the following
types of errors:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for …
As it turns out, this is due to our /tmp partition being too small.
We'll either need to increase it or put hadoop.tmp.dir on a bigger partition.
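i.e. something like the following in core-site.xml (the path is only an
example); the shuffle spill files land under mapred.local.dir, which defaults
to ${hadoop.tmp.dir}/mapred/local:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop/tmp</value>
    </property>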
On 2012-01-11, at 4:29 PM, Frank Grimes wrote:
> Ok, so I wrote a MapReduce job to merge the files and it appears to be
> working with a limited input set.