[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488201#comment-13488201 ]
Cheolsoo Park commented on PIG-3015:
------------------------------------

Hi Joseph,

1) Using different functions sounds OK to me, but couldn't we handle them via arguments using CommandLineParser? IMHO, this is simpler and more scalable. Another advantage of CommandLineParser is that we don't have to infer the meaning of arguments from how many there are. Other built-in storages (e.g. HBaseStorage) use CommandLineParser, so why don't we do the same and give users a uniform syntax across the project? Thoughts?

2) Multiple schema support

{quote}
this brings up another question: what does "compatible" mean in this case?
{quote}

Please refer to the rules listed in [PIG-2579|https://issues.apache.org/jira/browse/PIG-2579?focusedCommentId=13446546&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13446546]. I did this because several people asked for it. The use case is that people define Avro schemas, but the schemas evolve over time. Since AvroStorage used to assume that all input files have exactly the same schema, users couldn't load files whose schemas differed. PIG-2579 was trying to address that inconvenience. Do you think we should include similar functionality as an option in the new storage?

3) Recursive record support

{quote}
You can't specify a recursive schema in Pig, so why allow users to load files with recursive schemas in Pig? By default, recursive schema definitions should result in an error, or at least a warning message. I'd propose that this be allowed only as an option.
{quote}

Agreed (and guilty :-)). In fact, this was a feature request from one of my customers. The rationale was that people couldn't change their already-defined recursive schemas, but they wanted to do some processing on the non-recursive parts of the data. Providing it as an option sounds good to me.

4) Multiple store support

{quote}
Can you explain the use case for multiple stores with different output schemas?
I'm having a hard time understanding why it makes sense to do something complicated like that.
{quote}

I think that I wasn't clear. All I wanted to say is that if we have more than one relation to store in a script, we should be able to do it.

{code}
set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
{code}

The current storage supports multiple stores via the 'index' option. In fact, this is very hacky, and we should get rid of it. Nevertheless, I wanted to know whether multiple stores will still be supported. On second thought, I think that your proposal already implies multiple store support because:

- The output schema will be derived from the Pig schema per store, or
- The user will specify the output schema per store.

So I don't see any problem.

Thanks!

> Rewrite of AvroStorage
> ----------------------
>
>                 Key: PIG-3015
>                 URL: https://issues.apache.org/jira/browse/PIG-3015
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joseph Adler
>
> The current AvroStorage implementation has a lot of issues: it requires old
> versions of Avro, it copies data much more than needed, and it's verbose and
> complicated. (One pet peeve of mine is that old versions of Avro don't
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the
> new implementation is significantly faster, and the code is a lot simpler.
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best
> way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.