[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487954#comment-13487954 ]

Joseph Adler commented on PIG-3015:
-----------------------------------

Before addressing the questions, I wanted to propose a naming schema for the 
load and store functions. To be consistent with other Pig UDFs, I think it 
makes more sense to use different function names rather than passing different 
types of arguments to the UDF. Can I propose something like this:

LoadFuncs:

- AvroStorage. May be instantiated with zero, one, or two arguments. If called 
with no arguments, the function will load the schema from the most recent data 
file found in the specified path and use that schema. If called with one 
argument, the argument will be a String that specifies the input schema. The 
String may contain the schema definition itself, a URI that refers to a schema 
file, or a URI for an example data file from which to read the schema. If two 
arguments are specified, the first argument names the record type to load, and 
the second argument may be either a JSON string, a URI for a schema definition 
file, or a URI for an example file that contains the definition of that type.

This function does not check schema compatibility of input files and does not 
allow recursive schema definitions; it fails when corrupted files are 
encountered.
- AvroStorage.AllowRecursive. Same as above, except this function allows 
recursive schema definitions. Recursively defined records are represented as 
schemaless tuples in the Pig schema.
- AvroStorage.IgnoreCorrupted. Same as above, except this function does not 
fail on corrupted input files (recursive schema definitions are still 
disallowed).
- AvroStorage.AllowRecursiveAndIgnoreCorrupted. Same as above, except this 
function allows recursive schema definitions and does not fail on corrupted 
input files.
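Under the proposed naming, a Pig script might look like the following sketch 
(paths, function names, and schema strings are illustrative only, since the 
API is still under discussion):

```pig
-- zero arguments: schema is read from the most recent data file in the path
records = LOAD 'data/events' USING AvroStorage();

-- one argument: an explicit schema (could also be a URI to a schema file
-- or to an example data file)
records = LOAD 'data/events'
    USING AvroStorage('{"type":"record","name":"event","fields":[...]}');

-- variant that tolerates recursive schema definitions
trees = LOAD 'data/trees' USING AvroStorage.AllowRecursive();
```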


StoreFunc:

- AvroStorage. May be instantiated with zero, one, or two arguments; the 
meaning of the arguments can be inferred from how they are specified. If called 
with no arguments, the function will translate the pig schema to an Avro 
schema, use a default name for the record types, and not assign a namespace to 
the records. If called with one argument, the argument will be a String that 
specifies either the output schema or the record name for the output records. 
A schema argument may contain the schema definition itself, a URI that refers 
to the location of the output schema in a file, or a URI for an example data 
file from which to reuse the schema. If two arguments are specified, they may 
refer to the name and namespace for the output records. Alternately, the first 
argument may name the type of the output records (the name of the schema), and 
the second argument may be either a JSON string, a URI for a schema definition 
file, or a URI for an example file that contains the definition of that type.
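The proposed StoreFunc arguments might be used like this (a sketch; paths, 
record names, and namespaces are hypothetical):

```pig
-- zero arguments: the Pig schema is translated to an Avro schema
-- automatically, with a default record name and no namespace
STORE records INTO 'output' USING AvroStorage();

-- two arguments interpreted as record name and namespace
STORE records INTO 'output' USING AvroStorage('Event', 'com.example.events');

-- two arguments interpreted as record name plus an explicit schema
-- (a JSON string, or a URI to a schema file or example file)
STORE records INTO 'output'
    USING AvroStorage('Event', 'hdfs:///schemas/event.avsc');
```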


Answers to questions:

LoadFunc 1a: Yes, the storage function will convert Avro schemas to Pig 
schemas, and vice versa. 

I haven't tried to convert multiple "compatible but different" schemas to one 
pig schema. I believe that if you manually supply a schema to the function that 
is a superset of all the schemas in the input data, the underlying Avro 
libraries will take care of this for you... though this brings up another 
question: what does "compatible" mean in this case? Personally, I do not think 
that the core Pig library should attempt to resolve this problem for users; I 
think it is best for users to load files with different load functions, cast 
and rename fields as appropriate in pig code, then take a union of the values. 
It's possible to miss real (and important) errors if Pig does a lot of type 
conversions and manipulations under the covers.
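The load-cast-union approach described above might look like this in Pig Latin 
(paths, schemas, and field names are hypothetical):

```pig
-- load each schema variant separately, supplying its own schema
old = LOAD 'data/v1' USING AvroStorage('{...v1 schema...}');
new = LOAD 'data/v2' USING AvroStorage('{...v2 schema...}');

-- align the relations explicitly: cast and rename fields as needed
old_aligned = FOREACH old GENERATE id, (long) ts AS timestamp, name;
new_aligned = FOREACH new GENERATE id, timestamp, name;

-- take the union only once the schemas match
combined = UNION old_aligned, new_aligned;
```

Keeping the casts and renames visible in the script, rather than hidden inside 
the load function, makes any type mismatch an explicit error the user can see.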

LoadFunc 2: I think this is necessary for a few reasons: It's faster to supply 
a schema manually (the Pig runtime doesn't have to read files from HDFS at 
planning time to detect the schema). By specifying the schema, you can also 
specify a subset of fields to de-serialize, reducing the size of the input 
data. Finally, by specifying a schema manually, you can read a set of files 
with compatible but different schemas.
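For example, supplying a schema that lists only the needed fields would let 
the reader skip deserializing the rest (the path and field names here are 
hypothetical):

```pig
-- the stored records may have many fields; this reader schema names only
-- two, so only those columns are deserialized
events = LOAD 'data/events' USING AvroStorage(
    '{"type":"record","name":"event","fields":[
       {"name":"id","type":"long"},
       {"name":"timestamp","type":"long"}]}');
```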

I think PIG-2875 is a design mistake. If I had been involved in the project, I 
would have argued hard against this. You can't specify a recursive schema in 
Pig, so why allow users to load files with recursive schemas in Pig? It is 
possible to load recursively defined records into Pig, but that seems like a 
recipe for confusion and errors. By default, recursive schema definitions 
should result in an error, or at least a warning message. I'd propose that this 
be allowed only as an option.

Storefunc 2a:

I don't think it's hard to specify those three options. It's probably OK for 
the StoreFunc to allow the user to specify either a schema, a URI that refers 
to a schema file, or a URI that refers to an example file, then for the 
function to figure out what the argument means and do the right thing. 

Can you explain the use case for multiple stores with different output schemas? 
I'm having a hard time understanding why it makes sense to do something 
complicated like that.


> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
