[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700
 ] 

Alan Gates commented on PIG-794:
--------------------------------

I agree with Doug's comments that it's better to use an API to build the schema 
that will give us compile-time checking.  I think it will also (hopefully) be 
easier to figure out the schema when reading the code, as it will avoid the 
need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's 
BinStorage to use Avro, including the writing of indicator bytes for types.  I 
do not have a deep knowledge of Avro.  But I had assumed that since it was a 
de/serialization framework with types, part of what it would provide was type 
recognition.  That is, can't this code rely on Avro to set the type for it?  Do 
we need to be writing those indicator bytes ourselves?  Perhaps this is the 
same comment that Doug is making about using GenericDatumReader and addField.

In response to Hong's comment, the sync marks are vulnerable as you point out.  
But the loader needs some way to find a proper starting place when it's handed 
any block but the initial block of a file.  I wonder if we could create a new 
sync type.  It would always consist of a 100 byte marker (say the first 25 
prime numbers, or the first 25 digits of pi or something).  We could then write 
a tuple with that sync type every 1000 records in the data.  Loaders that don't 
start at position 0 could then seek to the first sync type they find before 
they begin reading.  All loaders would read past the end of their block until 
they saw a sync type.
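
To make the sync proposal concrete, here is a minimal sketch in plain Java.  It 
is not taken from the attached patch and does not use Avro.  The class and 
method names (SyncMarkerSketch, write, findSync, readSplit) and the 
length-prefixed record framing are assumptions made only for illustration.  The 
marker is the first 25 primes written as 4-byte integers, which gives 100 
bytes.  The sketch does not handle a payload that happens to contain the marker 
bytes, which is the vulnerability Hong raised.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed sync scheme; not Pig or Avro code.
class SyncMarkerSketch {
    // 100-byte marker: the first 25 primes as 4-byte big-endian ints.
    static final byte[] SYNC = buildSync();

    static byte[] buildSync() {
        ByteBuffer buf = ByteBuffer.allocate(100);
        int found = 0;
        for (int n = 2; found < 25; n++) {
            boolean prime = true;
            for (int d = 2; d * d <= n; d++) {
                if (n % d == 0) { prime = false; break; }
            }
            if (prime) { buf.putInt(n); found++; }
        }
        return buf.array();
    }

    // Writes length-prefixed records, inserting the marker after every
    // `interval` records (1000 in the proposal above).
    static byte[] write(List<String> records, int interval) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.size(); i++) {
            if (i > 0 && i % interval == 0) out.write(SYNC, 0, SYNC.length);
            byte[] payload = records.get(i).getBytes(StandardCharsets.UTF_8);
            out.write(ByteBuffer.allocate(4).putInt(payload.length).array(), 0, 4);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    static boolean isSyncAt(byte[] data, int pos) {
        if (pos + SYNC.length > data.length) return false;
        for (int i = 0; i < SYNC.length; i++) {
            if (data[pos + i] != SYNC[i]) return false;
        }
        return true;
    }

    // Offset of the first marker starting at or after `from`, or -1 if none.
    static int findSync(byte[] data, int from) {
        for (int pos = Math.max(from, 0); pos + SYNC.length <= data.length; pos++) {
            if (isSyncAt(data, pos)) return pos;
        }
        return -1;
    }

    // Reads the records owned by the block [start, end).  A loader that does
    // not start at 0 first seeks to the next marker; every loader keeps
    // reading past `end` until it reaches a marker (or the end of the data).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        if (start != 0) {
            int sync = findSync(data, start);
            // No marker inside this block: an earlier loader covers the data.
            if (sync < 0 || sync >= end) return result;
            pos = sync + SYNC.length;
        }
        while (pos < data.length) {
            if (isSyncAt(data, pos)) {
                if (pos >= end) break;   // marker belongs to the next loader
                pos += SYNC.length;      // marker inside our block: skip it
                continue;
            }
            int len = ByteBuffer.wrap(data, pos, 4).getInt();
            pos += 4;
            result.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos += len;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 25; i++) records.add("rec" + i);
        byte[] data = write(records, 5);
        int mid = data.length / 2;
        System.out.println(readSplit(data, 0, mid).size()
                + readSplit(data, mid, data.length).size());
    }
}
```

Under these assumptions, each record is read by exactly one loader no matter 
where the block boundaries fall: a loader stops at the first marker at or after 
the end of its block, and the next loader starts just after that same marker.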

As for this being compatible with non-Pig apps, that isn't the purpose of 
this AvroStorage function.  This is for Pig to pass data between MR jobs for 
itself.  Having a tool-independent storage format is a bigger project, as it 
requires agreeing on things like sync marks, how to represent different Avro 
objects, etc.

> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>             Fix For: 0.2.0
>
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
