Re: Using avro with hadoop streaming

R. Tyler Ballance Wed, 21 Apr 2010 20:20:23 -0700

On Wed, 21 Apr 2010, Doug Cutting wrote:

> R. Tyler Ballance wrote:
> >Is hadoop streaming support actually /working/ in trunk?
> 
> Hadoop Streaming access to Avro data?  No.  Hadoop Streaming is
> primarily intended for textual, CSV-style data.
> 
> To better integrate Avro data into Perl, Python and Ruby
> mapreduce programs, we hope to build something like Hadoop Pipes.
> 
>   https://issues.apache.org/jira/browse/AVRO-512
> 
> I hope to work on this in the coming weeks.


Ah, that makes things clearer. Mind you, I'm a hadidiot; I'm more
into generating the Avro data (and the RPC!).

I'll follow the ticket, looking forward to seeing that going in.

> 
> AVRO-493 only provides Avro data to Java mapreduce programs.  The
> best documentation for it currently is its unit test source code.
> 
> http://tinyurl.com/yz8bd22
> http://tinyurl.com/2a3xbu8

Handy links. I don't think we're going to invest any time in writing anything
other than Python code for the time being. Until you have the chance to crank
through AVRO-512, our interim solution has been to pre-process Avro logs,
pulling the schema out into a separate file and dumping the records to a
textual JSON file suitable for streaming into Hadoop.
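
For anyone curious, here is a rough sketch of that kind of pre-processing
step, not our actual script. It uses only the Python standard library and
leaves the Avro decoding out: the records are assumed to arrive as plain
dicts (for example from iterating over an Avro data file reader) and the
schema as an already-parsed JSON object. The names dump_for_streaming and
map_lines, the LogEntry schema and the "host" field are all made up for
illustration.

```python
import io
import json


def dump_for_streaming(schema, records, schema_out, records_out):
    """Write the schema to its own file, then one JSON record per line.

    `schema` is a JSON-compatible object and `records` is any iterable of
    dicts; both are assumed to have been decoded from the Avro log already.
    Returns the number of records written.
    """
    json.dump(schema, schema_out, sort_keys=True)
    schema_out.write("\n")
    count = 0
    for record in records:
        # json.dumps escapes embedded newlines, so every record stays on a
        # single line, which is what a line-oriented streaming job expects.
        line = json.dumps(record, sort_keys=True, separators=(",", ":"))
        records_out.write(line + "\n")
        count += 1
    return count


def map_lines(lines, key_field):
    """Streaming-style mapper: yield 'key<TAB>1' for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        yield "%s\t1" % record[key_field]


if __name__ == "__main__":
    # Hypothetical schema and records, standing in for a decoded Avro log.
    schema = {"type": "record", "name": "LogEntry",
              "fields": [{"name": "host", "type": "string"}]}
    records = [{"host": "a"}, {"host": "b"}, {"host": "a"}]
    schema_out, records_out = io.StringIO(), io.StringIO()
    dump_for_streaming(schema, records, schema_out, records_out)
    for pair in map_lines(records_out.getvalue().splitlines(), "host"):
        print(pair)
```

In a real streaming job the mapper half would read sys.stdin and print each
pair to stdout; keeping it as a generator here just makes it easy to try
without a cluster.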

Cheers,
-R. Tyler Ballance
--------------------------------------
  Jabber: [email protected]
  GitHub: http://github.com/rtyler
Identica: http://identi.ca/dero
 Twitter: http://twitter.com/agentdero
    Blog: http://unethicalblogger.com
