I'm working on some prototypes for an org.apache.avro.pig package for Java that
contains Avro <-> Pig storage adapters.
This is going fairly well. I already have org.apache.avro.mapreduce done for
InputFormat and OutputFormat (Pig requires the 0.20 API); once I get testing
working I'll submit a patch and JIRA for that. We also need to package these
libraries in a separate jar from the core Avro content.
However, there are some difficulties that I could use a little help with. All
Pig datatypes map to Avro easily except for the Pig MAP datatype.
A Pig map, like an Avro map, must have a string as its key. However, its value
type is essentially Object and can be any Pig type. A single Pig map might
have the contents:
"city" -> "San Francisco"
"elevation" -> 25
Avro map values must all have the same type, but a Pig schema does not define
what type is inside the map; other Pig serializations handle this dynamically.
In Avro, this seems straightforward at first -- the value of the map must be a
union of all possible Pig types:
[ null, boolean, int, long, float, double, chararray, bytearray, tuple, bag,
map ]
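For the simple branches, that union is easy to build with the Avro Schema API.
A rough sketch (the chararray-to-string and bytearray-to-bytes mapping is my own
assumption, and the class/method names are just placeholders):

import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

public class PigMapSchemaSketch {
  // Map whose values are a union of the "simple" pig types only.
  // chararray is mapped to string and bytearray to bytes; the tuple,
  // bag and map branches are the ones that cannot be added -- see below.
  public static Schema simpleValuedMap() {
    Schema union = Schema.createUnion(Arrays.asList(
        Schema.create(Type.NULL),
        Schema.create(Type.BOOLEAN),
        Schema.create(Type.INT),
        Schema.create(Type.LONG),
        Schema.create(Type.FLOAT),
        Schema.create(Type.DOUBLE),
        Schema.create(Type.STRING),    // pig chararray
        Schema.create(Type.BYTES)));   // pig bytearray
    return Schema.createMap(union);
  }
}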
The problem comes in with the last three. I'm fairly sure there is no valid
Avro schema to represent this situation. The tuple in the union can be a tuple
of any possible composition -- Avro requires defining its fields in advance.
Likewise, the bag can contain tuples of any possible composition. The map has
to self-reference, and there's a bit of a chicken-and-egg problem there: in
order to create the map schema, I have to have the union containing it already
created.
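For what it's worth, the only self-reference Avro allows is recursion through a
named record, since maps and unions are anonymous and cannot be referenced or
modified after creation. The sketch below (purely my own illustration, not a
fix) wraps the value in a record so the map can point back at it -- but it
still has no branch for an arbitrary tuple or bag, which is the real blocker:

import org.apache.avro.Schema;

public class RecursiveValueSketch {
  // A named record may refer to itself, so a map of "PigValue" can sit
  // inside PigValue's own union.  Note the tuple and bag branches are
  // still missing because their fields would have to be declared here.
  static final Schema PIG_VALUE = Schema.parse(
      "{\"type\": \"record\", \"name\": \"PigValue\", \"fields\": ["
      + "  {\"name\": \"value\", \"type\": ["
      + "    \"null\", \"boolean\", \"int\", \"long\", \"float\", \"double\","
      + "    \"string\", \"bytes\","
      + "    {\"type\": \"map\", \"values\": \"PigValue\"}"
      + "  ]}"
      + "]}");
}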
If I support only maps that contain simple value types, then Pig can only
detect the failure at runtime, when the output is written and a complex map
value is encountered during serialization.
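In other words, the best the store function could do is a check at write time,
roughly like this (a sketch only; the helper is hypothetical, the Pig data
classes are real):

import java.io.IOException;
import java.util.Map;

import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MapValueCheck {
  // With a simple-types-only schema, a complex map value can only be
  // rejected here, long after the job has started running.
  public static Object checkSimple(Object value) throws IOException {
    if (value instanceof Tuple || value instanceof DataBag
        || value instanceof Map) {
      throw new IOException("cannot store map value of type "
          + value.getClass().getName() + " with a simple-typed Avro schema");
    }
    return value;
  }
}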
I can support these arbitrary types by serializing them to byte[] via their
Writable API and storing the result as an Avro bytes value. That is a hack I'd
rather avoid, but it looks to be the only way out.
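That fallback would look something like the sketch below, using only the Hadoop
Writable contract and Avro's ByteBuffer representation of bytes (whether a
given pig value actually exposes Writable depends on the type and Pig version;
the helper name is mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.Writable;

public class WritableToBytes {
  // Serialize a Writable pig value and wrap it for an Avro "bytes" field.
  public static ByteBuffer toAvroBytes(Writable value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    value.write(out);
    out.flush();
    return ByteBuffer.wrap(buffer.toByteArray());
  }
}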
Unless I'm missing something, we can't serialize Pig maps in pure Avro unless
either:
* Avro adds some sort of 'dynamic' typing for records that have unknown fields
at schema creation time. For example, {"name": "unknown", "type":
"dynamic-record"} could signify an untyped collection of fields; in the binary
encoding each field could be prefixed by a type byte, with the field names
auto-generated (perhaps "$0", "$1", etc.).
* Pig makes its map values strictly typed, like Avro and Hive. I'd also like
to see Pig and Avro support maps with integer and long keys, as Hive does, but
that is a separate concern.
On the other side -- reading an Avro container file into a Pig schema -- there
are a few limitations. An Avro schema cannot be translated cleanly to Pig if
there is a union of anything more than NULL and one other type, unless that
union is a map value.
There are some hacks around the above -- the union can be 'expanded'. A field
in Avro:
{"name": "foo", "type": ["int", "string"]}
becomes two fields in Pig, one of which will be null:
(fooInt: int, fooString: chararray)
The above is not a general-purpose solution, but it can be useful. It would be
nice if Pig supported unions. In many ways it already does at a lower level,
but this is not exposed in the user-facing type system.
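For the record, that expansion can be derived entirely from the Avro side. A
sketch (the name-suffix convention and the helper are my own; only the Avro
Schema calls are real API):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class UnionExpansionSketch {
  // Turn a union-typed Avro field into one pig field per branch,
  // e.g. foo: ["int", "string"] -> {fooInt, fooString}.  At read time
  // only the branch that was actually written gets a non-null value.
  public static Map<String, Schema> expand(Field field) {
    Map<String, Schema> pigFields = new LinkedHashMap<String, Schema>();
    for (Schema branch : field.schema().getTypes()) {
      if (branch.getType() == Schema.Type.NULL) {
        continue;  // the null branch needs no field of its own
      }
      String suffix = branch.getType().getName();   // "int", "string", ...
      pigFields.put(field.name()
          + Character.toUpperCase(suffix.charAt(0)) + suffix.substring(1),
          branch);
    }
    return pigFields;
  }
}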
JIRAs to come when the code is in better shape.
-Scott