Hi all,
We're looking at creating a Cascading Scheme for Avro, and have a few
questions below. These are very general, as we're still in the scoping
phase (as in: are we crazy to try this?), so apologies in advance for
the lack of detail.
For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple, which
corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping
between Cascading-land and the storage format.
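To make that concrete, here's roughly how we'd expect an Avro scheme
to plug in (AvroScheme is the class we're proposing - it doesn't exist
yet; Hfs, Tap, and Fields are stock Cascading classes):

  // Hypothetical usage - AvroScheme is what we want to build.
  Fields fields = new Fields("name", "count");
  Class[] types = new Class[] { String.class, Long.class };
  Tap source = new Hfs(new AvroScheme(fields, types), "path/to/input");
  Tap sink = new Hfs(new AvroScheme(fields, types), "path/to/output");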
So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.
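The skeleton so far looks roughly like this (Cascading 1.x Scheme
signatures from memory; constructor and field plumbing omitted - the
placeholder bodies are exactly what questions 1 and 4 below are about):

  import java.io.IOException;
  import org.apache.avro.mapred.AvroInputFormat;
  import org.apache.avro.mapred.AvroOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.OutputCollector;
  import cascading.scheme.Scheme;
  import cascading.tap.Tap;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;

  public class AvroScheme extends Scheme {

    @Override
    public void sourceInit(Tap tap, JobConf conf) throws IOException {
      // wire the tap to the new Avro input format
      conf.setInputFormat(AvroInputFormat.class);
    }

    @Override
    public void sinkInit(Tap tap, JobConf conf) throws IOException {
      conf.setOutputFormat(AvroOutputFormat.class);
    }

    @Override
    public Tuple source(Object key, Object value) {
      // key arrives as an AvroWrapper<T>; converting its datum to a
      // Tuple is question 4 below
      return null; // placeholder
    }

    @Override
    public void sink(TupleEntry entry, OutputCollector collector)
      throws IOException {
      // Tuple -> Avro record conversion, also question 4 below
    }
  }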
1. What's the best approach if we want to dynamically define the Avro
schema based on a list of field names and types (classes)?
This assumes it's possible to dynamically define & use a schema, of
course.
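Assuming it is, we're picturing something like the sketch below - we
haven't verified these Schema API calls, and TupleSchemaBuilder plus
its naive Class-to-type mapping are just names we made up to show the
idea:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.avro.Schema;

  public class TupleSchemaBuilder {

    // Build a record schema from parallel arrays of names and classes.
    public static Schema toSchema(String[] names, Class[] types) {
      List<Schema.Field> fields = new ArrayList<Schema.Field>();
      for (int i = 0; i < names.length; i++) {
        Schema fieldSchema = Schema.create(toAvroType(types[i]));
        // no doc string, no default value
        fields.add(new Schema.Field(names[i], fieldSchema, null, null));
      }
      Schema record = Schema.createRecord("CascadingTuple", null,
        "cascading.avro", false);
      record.setFields(fields);
      return record;
    }

    // Naive Class -> Avro primitive mapping, just for illustration.
    private static Schema.Type toAvroType(Class type) {
      if (type == String.class) return Schema.Type.STRING;
      if (type == Integer.class) return Schema.Type.INT;
      if (type == Long.class) return Schema.Type.LONG;
      if (type == Float.class) return Schema.Type.FLOAT;
      if (type == Double.class) return Schema.Type.DOUBLE;
      if (type == Boolean.class) return Schema.Type.BOOLEAN;
      throw new IllegalArgumentException("no mapping for " + type);
    }
  }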
2. How much has the new Hadoop map-reduce support code been tested?
3. Will there be issues with running on Hadoop 0.18.3, 0.19.2, etc.?
I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
and that causing problems. Anything else?
4. The key integration point, besides the fields+classes-to-schema
issue above, is mapping between Cascading tuples and AvroWrapper<T>.
If we're using (I assume) the generic format, any input on how we'd do
this two-way conversion?
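For the generic case we had something like the following in mind -
again just a sketch, with AvroTupleConverter a made-up name, and we'd
presumably also need to handle Avro's Utf8 vs. java.lang.String for
string fields:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;

  public class AvroTupleConverter {

    // sink() direction: Cascading TupleEntry -> generic record.
    public static GenericData.Record toRecord(TupleEntry entry,
      Schema schema) {
      GenericData.Record record = new GenericData.Record(schema);
      for (Schema.Field field : schema.getFields()) {
        record.put(field.name(), entry.get(field.name()));
      }
      return record;
    }

    // source() direction: generic record -> Cascading Tuple.
    public static Tuple toTuple(GenericData.Record record,
      Schema schema) {
      Tuple tuple = new Tuple();
      for (Schema.Field field : schema.getFields()) {
        // Avro hands back Utf8 for strings; may need toString() here
        tuple.add((Comparable) record.get(field.name()));
      }
      return tuple;
    }
  }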
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g