I have been doing some research into how to read (and later write) Avro 
container files in Pig and Hive.

This has brought up some interesting challenges.  Below are some of my thoughts 
on the situation so far.  I'm sure some Avro JIRA tickets will result 
eventually. 


* PIG
From my preliminary work, mapping Pig to Avro should be relatively easy since 
the main data types map to each other fairly cleanly.  Both have maps and 
arrays/bags, for example, and both require string keys for their maps.
Making an arbitrary reader/writer will be a bit more of a challenge, but the 
API in 0.7 should be better (http://issues.apache.org/jira/browse/PIG-966 
http://wiki.apache.org/pig/LoadStoreRedesignProposal).
I wish I had time to make sure their new proposal was sufficient to handle Avro 
files as cleanly and efficiently as possible before it gets into an official 
release.
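The type correspondence I have in mind looks roughly like this (a sketch of my 
own reading of the two type systems, not anything official):

```python
# Rough Pig -> Avro type mapping (my reading; not an official table).
PIG_TO_AVRO = {
    "int":       "int",
    "long":      "long",
    "float":     "float",
    "double":    "double",
    "chararray": "string",
    "bytearray": "bytes",
    "map":       "map",     # both sides require string keys
    "bag":       "array",
    "tuple":     "record",
}
```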

Pig may require a lot of 'hidden' unions with null in the schemas if it is used 
to write generically.  The use case best matches the Generic API now, but 
something else down the road may be better.
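Concretely, since every Pig field is nullable, a generic writer would probably 
have to emit a union with null for each field, something like (field name 
hypothetical):

```json
{"name": "myField", "type": ["null", "string"]}
```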

* HIVE
The Hive type system maps to Avro almost completely.  Hive supports arrays, 
maps, and structs.  Its maps, however, can have any primitive type as a key 
(int, long, string, float, double), whereas Avro maps are keyed by strings 
only.  Other than that, arrays are arrays and structs are records.  Avro files 
should perform better and be more compact than sequence files.

** Unions are a challenge
Unions are a challenge in both.  Currently I am using Pig with a custom 
LoadFunc, and for each union I generate a field for each non-null branch plus 
a field that records which branch is in use.  This is ... not a good long-term 
solution.  For example, a field declared as {"name":"myField", 
"type":["string", "bytes"]} would generate three Pig fields: myFieldString, 
myFieldBytes, and myFieldType.
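Roughly, that flattening hack behaves like this sketch (plain Python standing 
in for the actual LoadFunc code; the field names are the hypothetical ones 
above):

```python
def flatten_union(name, value):
    """Flatten a string/bytes union value into three flat fields,
    mirroring the myFieldString/myFieldBytes/myFieldType hack."""
    fields = {name + "String": None, name + "Bytes": None}
    if isinstance(value, str):
        fields[name + "String"] = value
        fields[name + "Type"] = "string"
    elif isinstance(value, (bytes, bytearray)):
        fields[name + "Bytes"] = bytes(value)
        fields[name + "Type"] = "bytes"
    else:  # the null branch
        fields[name + "Type"] = "null"
    return fields
```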
In Hive, that hack could work and be equally ugly, or possibly a "table family" 
could be created for certain union types with a table per branch.  In other 
cases a custom operation is needed.
   
Example 1, small 'leaf' union:  I have a field that is a union of a string and 
a fixed 16-byte value.  In my custom Pig script I just convert the bytes to 
hex and always use a string, generating one field.  I could also create a 
single variable-length bytes field and use the UTF-8 bytes of the string.  In 
my case the string is always more than 32 characters, while 16 bytes render as 
exactly 32 hex characters, so there are no collisions between the branches 
with either approach.  These custom field mappings cannot be done with a 
generic "read any Avro file in Pig/Hive" class.
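The hex approach is essentially this (a sketch, not the real script):

```python
def collapse_to_string(value):
    """Collapse a string|fixed(16) union into one string field:
    bytes become 32 hex characters, strings pass through unchanged."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).hex()  # 16 bytes -> exactly 32 hex chars
    return value
```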

Example 2, large 'branch' union:  Some unions are unions of many larger, more 
complicated records.  In Pig this can map to a SPLIT (several record streams 
from one source), or in Hive to a 'table family', but neither can currently be 
done naturally or automatically -- a fully custom reader/writer is necessary 
for each schema that contains such a 'branch union'.


Getting some sort of union-type feature added to both would be beneficial, 
even if it is restricted in scope and only covers the more common use cases.

** Avro enhancements
Both the Specific and Generic APIs lead to extra object overhead here.  For 
example, in Pig one creates the Avro object, then reads its fields and copies 
them into a Pig Tuple.  Lower-level readers would be better -- ideally the Pig 
reader gets callbacks for each field it is interested in, in the order it 
expects (reader schema order), and fills out its own object.  I think some of 
our Decoders can operate that way.  A Pig feature that makes it easier to 
construct tuples out of order (writer schema order) would be useful too.
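To illustrate the callback idea with something concrete -- a toy fixed layout, 
not Avro's actual variable-length encoding, and invented field names:

```python
import io
import struct

def read_record(buf, handlers):
    """Decode a toy layout (int32 id, float64 score) and hand each
    field straight to the consumer's callback, in order -- no
    intermediate generic record object is ever allocated."""
    stream = io.BytesIO(buf)
    (rec_id,) = struct.unpack("<i", stream.read(4))
    handlers["id"](rec_id)
    (score,) = struct.unpack("<d", stream.read(8))
    handlers["score"](score)
```

The point is only the shape of the API: the decoder drives callbacks in the 
order the reader expects, and the reader builds its own tuple directly.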

Hive has a lot of projection features that could be served well by slightly 
different file formats.  For example, the ability to skip variable-length 
fields faster -- perhaps via a per-record map of field sizes -- could be 
useful.

Neither will support recursive schemas.  Is there a quick way to check if a 
schema is recursive?  In general, some features in Avro to make it easier to 
'categorize' a schema would be beneficial.
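As a sketch of such a check (assuming the schema is available as parsed JSON; 
namespaces and cross-branch named-type references are ignored for brevity, so 
this is not a full resolver):

```python
def is_recursive(schema, path=None):
    """Return True if a parsed-JSON Avro schema refers to a named
    type that is one of its own ancestors."""
    if path is None:
        path = set()
    if isinstance(schema, str):      # a type reference by name
        return schema in path
    if isinstance(schema, list):     # a union: check every branch
        return any(is_recursive(b, path) for b in schema)
    if isinstance(schema, dict):
        t = schema.get("type")
        if t == "record":
            sub = path | {schema["name"]}
            return any(is_recursive(f["type"], sub)
                       for f in schema.get("fields", []))
        if t == "array":
            return is_recursive(schema["items"], path)
        if t == "map":
            return is_recursive(schema["values"], path)
        return is_recursive(t, path)  # e.g. {"type": "string"}
    return False
```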
