[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

Joseph Adler (JIRA) Tue, 13 Nov 2012 09:48:14 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496363#comment-13496363
 ]


Joseph Adler commented on PIG-3015:
-----------------------------------

Progress update: I merged in the code, and am now working on test cases. I plan 
to submit the patches for review later this week.

Right now, I am working on unit tests for AvroStorage. Because AvroStorage is 
so complicated, I am trying to find ways to make the test cases easier to 
manage. I don't like seeing a single test file with dozens of distinct test 
cases, or dozens of test data files in one directory: it's too hard to 
understand what is and isn't being tested, and too hard to maintain the 
tests. I think it's worth changing the test strategy to be more methodical 
and rigorous. Here's what I'm proposing:

(1) Test files will be kept in separate directories by file type: schema 
(AVSC) files, raw text input files, JSON-formatted input files, uncompressed 
Avro files, deflate-compressed Avro files, Snappy-compressed Avro files, 
uncompressed Avro output files, deflate-compressed Avro output files, and 
Snappy-compressed output files.
(2) Test Pig scripts will be kept in discrete files, with the file names they 
read and write passed in as parameters. I'll modify the test runner to set 
the runtime parameters correctly. (I think this increases the readability of 
the test cases and also helps with debugging; you can always type "java -cp 
pig.jar org.apache.pig.Main -x local -f test_file" to run the files outside 
the test harness and see what happens.)
(3) I'm thinking about modifying the build process to compile human-readable 
files (in JSON format) into Avro files before running the tests.
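
To make item (2) concrete, here is a minimal sketch of how a test runner might 
assemble the arguments for org.apache.pig.Main from a script file and a set of 
parameters. It relies only on Pig's -x, -param, and -f command-line options. 
The class name PigScriptArgs, the parameter names INPUT and OUTPUT, and the 
file paths are hypothetical and chosen for illustration; none of them comes 
from the actual patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Pig or the AvroStorage patch.
final class PigScriptArgs {

    // Builds the argument list for org.apache.pig.Main: local mode first,
    // then one "-param name=value" pair per entry, then "-f" and the script.
    static List<String> build(String scriptPath, Map<String, String> params) {
        List<String> args = new ArrayList<>();
        args.add("-x");
        args.add("local");
        for (Map.Entry<String, String> e : params.entrySet()) {
            args.add("-param");
            args.add(e.getKey() + "=" + e.getValue());
        }
        args.add("-f");
        args.add(scriptPath);
        return args;
    }

    public static void main(String[] argv) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("INPUT", "avro/uncompressed/records.avro");   // assumed path
        params.put("OUTPUT", "out/records");                     // assumed path
        // Print the arguments so they can be pasted after
        // "java -cp pig.jar org.apache.pig.Main" when debugging by hand.
        System.out.println(String.join(" ", build("test_file.pig", params)));
    }
}
```

Inside the script, the parameters would be referenced as $INPUT and $OUTPUT, 
so the same script file can be run against any of the per-type test 
directories without editing it.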

What do you guys think?
                
> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>            Assignee: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
