[ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250831#comment-13250831 ]
Scott Carey commented on PIG-2632:
----------------------------------

{quote}
should "a:int,b:int" really generate a different class than "x:int,y:int"? maybe!
{quote}

They could be the same class, but both support getField(String) if the constructor took a reference to the Schema that has the field name details, or a data structure that maps field names to indexes. That isn't trivial though, and again it's something that Avro could give you for free.

I think ironing out the kinks here before going for extra features is the way to go; we can leverage other tools for advanced features in a later version. I will review more late this week or next week.

{code}
import org.apache.mahout.math.Varint;
{code}

I am not a fan of requiring another library on my already crowded Hadoop classpath that might have interesting version conflicts. Is Mahout already required by Pig? Doesn't Hadoop have a variable-length int already for some of its file formats?

> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing
> the Schema on the frontend, we can code generate Tuples which can be used for
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better,
> and it's ~15% smaller serialized (heavily heavily depends on the data,
> though). Need to do get/set tests, but assuming that it's on par with (or even
> faster than) Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema
> given to UDFs.
> The next step is to make a SchemaBag, where I think the
> serialization savings will be really huge.
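
On the getField(String) point in the comment: a minimal sketch of how one generated class could serve both "a:int,b:int" and "x:int,y:int" if each instance holds a reference to a name-to-index map. All names here (NamedFieldTupleSketch, FieldIndex, IntIntTuple) are hypothetical and not taken from the patch, and FieldIndex is only a stand-in for whatever field-name details the real Schema would provide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamedFieldTupleSketch {

    /** Hypothetical stand-in for the field-name details a Schema carries. */
    static final class FieldIndex {
        private final Map<String, Integer> positions = new HashMap<>();

        FieldIndex(List<String> fieldNames) {
            // Record each field name's position once, up front.
            for (int i = 0; i < fieldNames.size(); i++) {
                positions.put(fieldNames.get(i), i);
            }
        }

        int positionOf(String name) {
            Integer pos = positions.get(name);
            if (pos == null) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            return pos;
        }
    }

    /** Hypothetical shape of a single generated class for any (int, int) schema. */
    static final class IntIntTuple {
        private final FieldIndex index; // shared per schema, not copied per tuple
        private final int f0;
        private final int f1;

        IntIntTuple(FieldIndex index, int f0, int f1) {
            this.index = index;
            this.f0 = f0;
            this.f1 = f1;
        }

        // Positional access works on the primitive fields directly.
        int getField(int position) {
            switch (position) {
                case 0: return f0;
                case 1: return f1;
                default: throw new IndexOutOfBoundsException("No field at " + position);
            }
        }

        // Name-based access resolves the name to a position, then delegates.
        int getField(String name) {
            return getField(index.positionOf(name));
        }
    }

    public static void main(String[] args) {
        FieldIndex ab = new FieldIndex(Arrays.asList("a", "b"));
        FieldIndex xy = new FieldIndex(Arrays.asList("x", "y"));

        IntIntTuple t1 = new IntIntTuple(ab, 1, 2);
        IntIntTuple t2 = new IntIntTuple(xy, 3, 4);

        System.out.println(t1.getField("b")); // prints 2
        System.out.println(t2.getField("x")); // prints 3
    }
}
```

The cost in this sketch is one extra reference per tuple plus a map lookup on each name-based access; positional access is unaffected.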