[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904680#action_12904680
 ] 

Scott Carey commented on PIG-794:
---------------------------------

So a summary of the differences I can see quickly are:

h5. Schema usage:
This creates a 'generic' Avro schema that can be used for any pig data.  Each 
field in a Tuple is a Union of all possible pig types, and each Tuple is a list 
of fields.  It does not preserve the field names or types -- these are not 
important for intermediate data anyway.

AVRO-592 translates the Pig schema into a specific Avro schema that persists 
the field names and types, so that:
STORE foo INTO 'file' USING AvroStorage();
Will create a file that
foo2 = LOAD 'file' USING AvroStorage(); 
will be able to re-create the exact schema for use in a script.

h5. Serialization and Deserialization:
This uses the same style as Avro's GenericRecord, which traverses the schema on 
the fly and writes fields for each record.

AVRO-592 constructs a state machine for each specific schema to optimally 
traverse a Tuple to serialize a record or create a Tuple when deserializing.  
This should be faster but the code is definitely harder to read (but easy to 
unit test -- AVRO-592 has 98% unit test code coverage on that portion).


Integrating these should not be too hard.  I'll try and put my latest version 
of AVRO-592 up there late today or tomorrow.




> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to