[ 
https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985958#action_12985958
 ] 

Scott Carey commented on PIG-1748:
----------------------------------

@Jacob
Of course projects can do what they wish.  I'm simply hoping many can 
collaborate on this general problem category.

{quote}This seems like an odd approach to me, essentially inverting the domain 
knowledge of each application to Avro, rather than the application itself where 
its developers frolic and work. Is there something I'm missing here?
{quote}
Writing a Pig storage adapter requires Avro domain knowledge and Pig domain 
knowledge.  I found that it required more knowledge of Avro than Pig to do 
well.  If all you ever want to achieve is:

Pig -> Avro file -> Pig, then maybe it doesn't matter who hosts it. 

But what if you want to do:
Pig -> Avro file -> Cascading -> Avro file -> Hive -> Avro file -> Pig ?

Now which project should host the code that defines how all those data models 
can interact through a common schema system?  Pig contrib?  Hive contrib?  Howl? 
Cascading (GPL ...)?

In the longer term, the common elements needed by all of the above can 
crystallize out into an Avro module general to all, and individual modules 
hosted by each project can use that.  What that might look like won't be 
apparent, however, until there are enough example use cases.

> Add load/store function AvroStorage for avro data
> -------------------------------------------------
>
>                 Key: PIG-1748
>                 URL: https://issues.apache.org/jira/browse/PIG-1748
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: lin guo
>            Assignee: Jakob Homan
>         Attachments: avro_storage.patch, avro_test_files.tar.gz, 
> PIG-1748-2.patch
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro 
> files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc. 
> Due to discrepancies between the Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined records 
> because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable unions like ["null", 
> "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume the Avro data files 
> in it have the same schema;
> If the input directory contains sub-directories, then we assume the Avro data 
> files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for 
> "debug [debug-level]"). 
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they 
> don't, the Avro schema of the output data is derived from its Pig schema.
> Detailed documentation can be found in 
> http://snaprojects.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data
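
The two data-model limits listed in the issue description (no recursively 
defined records, only nullable unions) can be illustrated with a small schema 
check. This is only a sketch in Python using the standard json module; it is 
not the AvroStorage code from the attached patches. The names 
is_nullable_union and find_problems are hypothetical, "null" is assumed to be 
allowed in either position of the union, and namespaces are ignored for 
simplicity.

```python
import json


def is_nullable_union(union):
    """True for a two-branch union in which exactly one branch is "null".

    Assumption: "null" may come first or second; the issue text only
    shows the ["null", "some-type"] ordering.
    """
    return isinstance(union, list) and len(union) == 2 and union.count("null") == 1


def find_problems(schema, enclosing=()):
    """Walk a parsed Avro schema and list constructs outside the two limits.

    `enclosing` holds the names of the records we are currently inside, so a
    type reference to one of them means the record is recursively defined.
    """
    problems = []
    if isinstance(schema, str):
        # A bare string is a primitive name or a reference to a named type.
        if schema in enclosing:
            problems.append(f"recursive reference to record '{schema}'")
    elif isinstance(schema, list):
        # A JSON array in an Avro schema is a union.
        if not is_nullable_union(schema):
            problems.append(f"unsupported union {json.dumps(schema)}")
        for branch in schema:
            problems.extend(find_problems(branch, enclosing))
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            inner = enclosing + (schema["name"],)
            for field in schema.get("fields", []):
                problems.extend(find_problems(field["type"], inner))
        elif kind == "array":
            problems.extend(find_problems(schema["items"], enclosing))
        elif kind == "map":
            problems.extend(find_problems(schema["values"], enclosing))
        elif kind is not None:
            # Covers wrappers such as {"type": "string"} or a nested schema.
            problems.extend(find_problems(kind, enclosing))
    return problems


# A linked-list style record: the "next" field refers back to "Node".
LINKED_LIST = json.loads("""
{"type": "record", "name": "Node", "fields": [
  {"name": "value", "type": "int"},
  {"name": "next",  "type": ["null", "Node"]}
]}
""")

# A flat record with one nullable union and one general union.
USER = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "name",  "type": "string"},
  {"name": "email", "type": ["null", "string"]},
  {"name": "id",    "type": ["int", "string"]}
]}
""")

if __name__ == "__main__":
    for label, schema in (("Node", LINKED_LIST), ("User", USER)):
        print(label, find_problems(schema))
```

The Node schema shows why recursion is hard to map onto a Pig schema: the 
nesting depth, and so the number of fields, depends on the data. The User 
schema shows a union that is not of the nullable form and would be rejected 
under the stated limits.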

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.