Hi all,
We're looking at creating a Cascading Scheme for Avro, and have a few
questions below. These are very general, as we're still in the scoping
phase (as in: are we crazy to try this?), so apologies in advance for
the lack of detail.
For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple, which
corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping
between Cascading-land and the storage format.
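To make that concrete, here's roughly how we'd expect an Avro scheme
to plug in (AvroScheme is the class we're proposing - it doesn't exist
yet; Hfs, Tap, and Fields are stock Cascading classes):

  // Hypothetical usage - AvroScheme is what we want to build.
  Fields fields = new Fields("name", "count");
  Class[] types = new Class[] { String.class, Long.class };
  Tap source = new Hfs(new AvroScheme(fields, types), "path/to/input");
  Tap sink = new Hfs(new AvroScheme(fields, types), "path/to/output");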
So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.
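The skeleton so far looks roughly like this (Cascading 1.x Scheme
signatures from memory; constructor and field plumbing omitted - the
placeholder bodies are exactly what questions 1 and 4 below are about):

  import java.io.IOException;
  import org.apache.avro.mapred.AvroInputFormat;
  import org.apache.avro.mapred.AvroOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.OutputCollector;
  import cascading.scheme.Scheme;
  import cascading.tap.Tap;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;

  public class AvroScheme extends Scheme {

    @Override
    public void sourceInit(Tap tap, JobConf conf) throws IOException {
      // wire the tap to the new Avro input format
      conf.setInputFormat(AvroInputFormat.class);
    }

    @Override
    public void sinkInit(Tap tap, JobConf conf) throws IOException {
      conf.setOutputFormat(AvroOutputFormat.class);
    }

    @Override
    public Tuple source(Object key, Object value) {
      // key arrives as an AvroWrapper<T>; converting its datum to a
      // Tuple is question 4 below
      return null; // placeholder
    }

    @Override
    public void sink(TupleEntry entry, OutputCollector collector)
      throws IOException {
      // Tuple -> Avro record conversion, also question 4 below
    }
  }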
1. What's the best approach if we want to dynamically define the Avro
schema based on a list of field names and types (classes)?
This assumes it's possible to dynamically define & use a schema, of
course.
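Assuming it is, we're picturing something like the sketch below - we
haven't verified these Schema API calls, and TupleSchemaBuilder plus
its naive Class-to-type mapping are just names we made up to show the
idea:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.avro.Schema;

  public class TupleSchemaBuilder {

    // Build a record schema from parallel arrays of names and classes.
    public static Schema toSchema(String[] names, Class[] types) {
      List<Schema.Field> fields = new ArrayList<Schema.Field>();
      for (int i = 0; i < names.length; i++) {
        Schema fieldSchema = Schema.create(toAvroType(types[i]));
        // no doc string, no default value
        fields.add(new Schema.Field(names[i], fieldSchema, null, null));
      }
      Schema record = Schema.createRecord("CascadingTuple", null,
        "cascading.avro", false);
      record.setFields(fields);
      return record;
    }

    // Naive Class -> Avro primitive mapping, just for illustration.
    private static Schema.Type toAvroType(Class type) {
      if (type == String.class) return Schema.Type.STRING;
      if (type == Integer.class) return Schema.Type.INT;
      if (type == Long.class) return Schema.Type.LONG;
      if (type == Float.class) return Schema.Type.FLOAT;
      if (type == Double.class) return Schema.Type.DOUBLE;
      if (type == Boolean.class) return Schema.Type.BOOLEAN;
      throw new IllegalArgumentException("no mapping for " + type);
    }
  }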
2. How much has the new Hadoop map-reduce support code been tested?
3. Will there be issues with running on Hadoop 0.18.3, 0.19.2, etc.?
I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
and that causing problems. Anything else?
4. The key integration point, besides the fields+classes-to-schema
issue above, is mapping between Cascading tuples and AvroWrapper<T>.
If we're using (I assume) the generic format, any input on how we'd do
this two-way conversion?
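For the generic case we had something like the following in mind -
again just a sketch, with AvroTupleConverter a made-up name, and we'd
presumably also need to handle Avro's Utf8 vs. java.lang.String for
string fields:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;

  public class AvroTupleConverter {

    // sink() direction: Cascading TupleEntry -> generic record.
    public static GenericData.Record toRecord(TupleEntry entry,
      Schema schema) {
      GenericData.Record record = new GenericData.Record(schema);
      for (Schema.Field field : schema.getFields()) {
        record.put(field.name(), entry.get(field.name()));
      }
      return record;
    }

    // source() direction: generic record -> Cascading Tuple.
    public static Tuple toTuple(GenericData.Record record,
      Schema schema) {
      Tuple tuple = new Tuple();
      for (Schema.Field field : schema.getFields()) {
        // Avro hands back Utf8 for strings; may need toString() here
        tuple.add((Comparable) record.get(field.name()));
      }
      return tuple;
    }
  }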
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g