Hadoop Serialization: Avro

2011-11-26 Thread Leonardo Urbina
Hey everyone,

First time posting to the list. I'm currently writing a Hadoop job that
will run daily and whose output will be part of the next day's input. Also,
the output will potentially be read by other programs for later analysis.

Since my program's output is used as part of the next day's input, it would
be nice if it were stored in some binary format that is easy to read the
next time around. But this format also needs to be readable by outside
programs, not necessarily written in Java. After some searching, it seems
that Avro is what I want to be using. However, I can't seem to find a
single example of how to use Avro within a Hadoop job.

It seems that in order to use Avro I need to change the io.serializations
value, but I don't know which value should be specified. Furthermore, I
found that there are classes Avro{Input,Output}Format, but these seem to
need other Avro classes such as AvroWrapper, AvroKey, AvroValue, and as far
as I can tell Avro* (with * replaced by pretty much any Hadoop class name).
It seems, however, that these exist so that the Avro format is used
throughout the Hadoop process to pass objects around.

I just want to use Avro to save my output and read it again as input next
time around. So far I have been using SequenceFile{Input,Output}Format and
have implemented the Writable interface in the relevant classes, but this
is not portable to other languages. Is there a way to use Avro without a
substantial rewrite (using Avro* classes) of my Hadoop job? Thanks in
advance,

Best,
-Leo

-- 
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
lurb...@mit.edu
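
For reference, io.serializations is a comma-separated list of serialization
classes that Hadoop consults when it has to serialize keys and values, for
example intermediate map output. Below is a minimal sketch of a value that
keeps Writable support and adds Avro. It assumes a Hadoop release whose
org.apache.hadoop.io.serializer.avro package ships the two Avro classes
named here; older releases may not include them.

```xml
<!-- Sketch only: place in core-site.xml or in the job's own configuration. -->
<!-- Assumes the Avro serialization classes are present in this Hadoop release. -->
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization</value>
</property>
```

This setting does not by itself change the on-disk format of the job's
final output. That format is determined by the job's OutputFormat, which is
where a class such as AvroOutputFormat comes in.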


Re: Hadoop Serialization: Avro

2011-11-26 Thread Brock Noland
Hi,

Depending on the response you get here, you might also post the
question separately on avro-user.



Re: Hadoop Serialization: Avro

2011-11-26 Thread Leonardo Urbina
Thanks, I will send the question to that list as well.

Best,
-Leo
