On Apr 16, 2010, at 11:20 AM, Ken Krugler wrote:

> Hi Scott,
>
> Thanks for the response. See below for my comments...
>
>> Correct me if I'm wrong, but its notion of a record is very simple
>> -- there are no arrays or maps -- just a list of fields.
>> This maps to avro easily.
>
> Correct - currently Cascading doesn't have built-in support for
> arrays, maps or unions - though I believe arrays & maps are on the list.
>

It would be great if Cascading, Pig, and Hive (along with Avro) could get to some good common ground on all of these data types.

>> Creating an Avro schema programmatically is fairly straightforward
>> -- especially without arrays, maps, or unions. If the code has
>> access to the Cascading record definition, transforming that into an
>> Avro schema dynamically should be straightforward. Schema has
>> various constructors and static methods from which you can get the
>> JSON schema representation or just pass around Schema objects.
>
> We're currently using the string rep, since a Schema isn't
> serializable, and Cascading needs that to save the defined workflow in
> the job conf.
>

That should work well. The JSON string representation is the canonical, cross-language, serialization of an Avro schema.

>
> So far one issue is that we need to translate between Cascading
> Strings and Avro Utf8 types, but most everything else works just fine.
>

Let us know about the difficulties here and any suggestions or requests for enhancement. I am interested in making the String <> Utf8 situation more efficient and easier to use.

>> One can go farther and use AvroWrapper and o.a.avro.mapred define
>> the M/R jobs enabling a lot of other possibilities. I can't
>> confidently state what all the requirements are here outside of
>> doing the Cascading record <> Avro schema translation and changing
>> all the touch points that Cascading has on the K/V types.
>
> It's pretty much four routines in the scheme:
>
> - sinkInit (setting up the conf properly, for which we're using the
>   AvroJob support)
> - sourceInit (same thing)
>
> - sink (mapping from Tuple to o.a.avro.Generic.GenericData)
> - source (mapping from o.a.avro.Generic.GenericData to Tuple)
>
> The above is all based on the Avro mapred support, so we just have to
> do the translation work for Fields <-> Schema and Tuple <-> GenericData.
>
> It looks pretty doable, thanks for the help!
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
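
To make the Fields <-> Schema and Tuple <-> record translation discussed above concrete, here is a rough, language-neutral sketch in Python using only the standard library. It is not the Cascading or Avro API: the function names, the `AVRO_PRIMITIVES` table, and the record name are all hypothetical, and it assumes a flat list of (field name, primitive type) pairs with no arrays, maps, or unions. It builds the JSON string form of a record schema (the form that can be stored in a job conf) and maps positional tuples to and from name-keyed records. The String <> Utf8 conversion is not modeled here.

```python
import json

# Hypothetical table of supported field types; these are Avro primitive type names.
AVRO_PRIMITIVES = {"string", "int", "long", "float", "double", "boolean", "bytes"}


def fields_to_avro_schema_json(record_name, fields):
    """Build the JSON string for an Avro record schema from (name, type) pairs.

    Only flat records of primitive types are handled (an assumption of this sketch).
    """
    avro_fields = []
    for name, type_name in fields:
        if type_name not in AVRO_PRIMITIVES:
            raise ValueError("unsupported field type: %r" % (type_name,))
        avro_fields.append({"name": name, "type": type_name})
    schema = {"type": "record", "name": record_name, "fields": avro_fields}
    return json.dumps(schema)


def tuple_to_record(field_names, values):
    """Map a positional tuple to a name-keyed record (the 'sink' direction)."""
    if len(field_names) != len(values):
        raise ValueError("tuple length does not match field list")
    return dict(zip(field_names, values))


def record_to_tuple(field_names, record):
    """Map a name-keyed record back to a positional tuple (the 'source' direction)."""
    return tuple(record[name] for name in field_names)
```

The schema string produced this way can be passed around as plain text and parsed back into a schema object on the other side, which sidesteps the serializability issue mentioned above.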