[ https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Coveney updated PIG-2632: ---------------------------------- Attachment: schematuple benchmarking.pdf Scott, I attached a pdf. Note: I think the serializing times (if they're still in there) aren't representative, as I didn't use the FSDataOutputStream. I haven't bothered to redo because I know the gain is significant, and would rather focus on shipping. Would love your thoughts. > Create a SchemaTuple which generates efficient Tuples via code gen > ------------------------------------------------------------------ > > Key: PIG-2632 > URL: https://issues.apache.org/jira/browse/PIG-2632 > Project: Pig > Issue Type: Improvement > Reporter: Jonathan Coveney > Assignee: Jonathan Coveney > Fix For: 0.11 > > Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch, > PIG-2632-4.patch, PIG-2632-5.patch, PIG-2632-6.patch, PIG-2632-7.patch, > PIG-2632-8.patch, schematuple benchmarking.pdf, schematuple benchmarking.pptx > > > This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing > the Schema on the frontend, we can code generate Tuples which can be used for > fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, > and it's ~15% smaller serialized (heavily heavily depends on the data, > though). Need to do get/set tests, but assuming that it's on par (or even > faster) than Tuple, the memory gain is huge. > Need to clean up the code and add tests. > Right now, it generates a SchemaTuple for every inputSchema and outputSchema > given to UDF's. The next step is to make a SchemaBag, where I think the > serialization savings will be really huge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira