I'm working on some prototypes for an org.apache.avro.pig package for Java that
contains Avro <-> Pig storage adapters.
This is going fairly well. I already have org.apache.avro.mapreduce done for
InputFormat and OutputFormat (Pig requires the 0.20 API); once I get testing
working I'll submit a patch and JIRA for that. We also need to package these
libraries in a separate jar from the core Avro content.
However, there are some difficulties that I could use a little help with. All
Pig datatypes map to Avro easily except for the Pig MAP datatype.
A Pig map, like an Avro map, must have a string as its key. However, its value
type is essentially Object and can be any Pig type. A single Pig map might
have the contents:
"city" -> "San Francisco"
"elevation" -> 25
Avro map values must all have the same type, but a Pig schema does not define
what type is inside the map; other Pig serializations handle this dynamically.
In Avro, this seems straightforward at first -- the value of the map must be a
union of all possible Pig types:
[ null, boolean, int, long, float, double, chararray, bytearray, tuple, bag,
map ]
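For the simple branches, that union is easy to build with the Avro Schema API.
A rough sketch (the chararray-to-string and bytearray-to-bytes mapping is my own
assumption, and the class/method names are just placeholders):

import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

public class PigMapSchemaSketch {
  // Map whose values are a union of the "simple" pig types only.
  // chararray is mapped to string and bytearray to bytes; the tuple,
  // bag and map branches are the ones that cannot be added -- see below.
  public static Schema simpleValuedMap() {
    Schema union = Schema.createUnion(Arrays.asList(
        Schema.create(Type.NULL),
        Schema.create(Type.BOOLEAN),
        Schema.create(Type.INT),
        Schema.create(Type.LONG),
        Schema.create(Type.FLOAT),
        Schema.create(Type.DOUBLE),
        Schema.create(Type.STRING),    // pig chararray
        Schema.create(Type.BYTES)));   // pig bytearray
    return Schema.createMap(union);
  }
}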
The problem comes in with the last three. I'm fairly sure there is no valid
Avro schema to represent this situation. The tuple in the union can be a tuple
of any possible composition -- Avro requires defining its fields in advance.
Likewise, the bag can contain tuples of any possible composition. The map has
to self-reference, and there's a bit of a chicken-and-egg problem there: in
order to create the map schema, I have to have the union containing it already
created.
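For what it's worth, the only self-reference Avro allows is recursion through a
named record, since maps and unions are anonymous and cannot be referenced or
modified after creation. The sketch below (purely my own illustration, not a
fix) wraps the value in a record so the map can point back at it -- but it
still has no branch for an arbitrary tuple or bag, which is the real blocker:

import org.apache.avro.Schema;

public class RecursiveValueSketch {
  // A named record may refer to itself, so a map of "PigValue" can sit
  // inside PigValue's own union.  Note the tuple and bag branches are
  // still missing because their fields would have to be declared here.
  static final Schema PIG_VALUE = Schema.parse(
      "{\"type\": \"record\", \"name\": \"PigValue\", \"fields\": ["
      + "  {\"name\": \"value\", \"type\": ["
      + "    \"null\", \"boolean\", \"int\", \"long\", \"float\", \"double\","
      + "    \"string\", \"bytes\","
      + "    {\"type\": \"map\", \"values\": \"PigValue\"}"
      + "  ]}"
      + "]}");
}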
If I support only maps that contain simple value types, then Pig can only
detect the failure at runtime, when the output is written and a complex map
value is encountered during serialization.
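In other words, the best the store function could do is a check at write time,
roughly like this (a sketch only; the helper is hypothetical, the Pig data
classes are real):

import java.io.IOException;
import java.util.Map;

import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MapValueCheck {
  // With a simple-types-only schema, a complex map value can only be
  // rejected here, long after the job has started running.
  public static Object checkSimple(Object value) throws IOException {
    if (value instanceof Tuple || value instanceof DataBag
        || value instanceof Map) {
      throw new IOException("cannot store map value of type "
          + value.getClass().getName() + " with a simple-typed Avro schema");
    }
    return value;
  }
}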
I can support these arbitrary types by serializing them to byte[] via their
Writable API and storing the result as an Avro bytes value. That is a hack I'd
rather avoid, but it looks to be the only way out.
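That fallback would look something like the sketch below, using only the Hadoop
Writable contract and Avro's ByteBuffer representation of bytes (whether a
given pig value actually exposes Writable depends on the type and Pig version;
the helper name is mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.Writable;

public class WritableToBytes {
  // Serialize a Writable pig value and wrap it for an Avro "bytes" field.
  public static ByteBuffer toAvroBytes(Writable value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    value.write(out);
    out.flush();
    return ByteBuffer.wrap(buffer.toByteArray());
  }
}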
Unless I'm missing something, we can't serialize Pig maps in pure Avro unless
either:
* Avro adds some sort of 'dynamic' typing for records that have unknown fields
at schema creation time. For example, {"name": "unknown", "type":
"dynamic-record"} could signify an untyped collection of fields; in the binary
encoding each field could be prefixed by a type byte, with the field names
auto-generated (perhaps "$0", "$1", etc.).
* Pig makes its map values strictly typed, like Avro and Hive. I'd also like
to see Pig and Avro support maps with integer and long keys, as Hive does, but
that is a separate concern.
On the other side -- reading an Avro container file into a Pig schema -- there
are a few limitations. An Avro schema cannot be translated cleanly to Pig if
there is a union of anything more than NULL and one other type, unless that
union is a map value.
There are some hacks around the above -- the union can be 'expanded'. A field
in Avro:
{"name": "foo", "type": ["int", "string"]}
becomes two fields in Pig, one of which will be null:
(fooInt: int, fooString: chararray)
The above is not a general-purpose solution, but it can be useful. It would be
nice if Pig supported unions. In many ways it already does at a lower level,
but this is not exposed in the user-facing type system.
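For the record, that expansion can be derived entirely from the Avro side. A
sketch (the name-suffix convention and the helper are my own; only the Avro
Schema calls are real API):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class UnionExpansionSketch {
  // Turn a union-typed Avro field into one pig field per branch,
  // e.g. foo: ["int", "string"] -> {fooInt, fooString}.  At read time
  // only the branch that was actually written gets a non-null value.
  public static Map<String, Schema> expand(Field field) {
    Map<String, Schema> pigFields = new LinkedHashMap<String, Schema>();
    for (Schema branch : field.schema().getTypes()) {
      if (branch.getType() == Schema.Type.NULL) {
        continue;  // the null branch needs no field of its own
      }
      String suffix = branch.getType().getName();   // "int", "string", ...
      pigFields.put(field.name()
          + Character.toUpperCase(suffix.charAt(0)) + suffix.substring(1),
          branch);
    }
    return pigFields;
  }
}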
JIRAs to come when the code is in better shape.
-Scott