[ 
https://issues.apache.org/jira/browse/PIG-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2632:
----------------------------------

    Attachment: schematuple benchmarking.pptx

PowerPoint?! What is this, Wall Street? I know, I know. But we thought it would 
be instructive to benchmark SchemaTuple against the existing Tuple 
implementations (namely Tuple and PrimitiveTuple) to see what sort of gains are 
possible.

Where the schema type isn't mentioned, it's a long. Almost everything was done 
with longs (for ease), except serialization, where the long/int distinction 
made a pretty big difference.

Some results:
- It takes a depressingly long time to make a Tuple of a given size. Well, 
"depressing" on a pretty small scale, but the cause is that constructing a new 
Tuple of a given size nulls out each of its values. We could alleviate this, 
but I don't know if the added code complexity and slightly increased memory 
footprint would be worth it.
- In general, the PrimitiveTuple performance is poor (though it does have a 
decreased memory footprint). There are some enhancements that would make it 
faster, but I think that SchemaTuple will end up making it completely obsolete.
- The values that were set or serialized started at 0 and went up depending on 
how many values Google Caliper gave it. This was especially important for the 
size on disk of serialized values: SchemaTuple uses Varint, so smaller values 
serialize more compactly. More notable, though, is that Tuple storage for 
longs is really, really large. We can optimize it (I have a patch that does, 
but I need to test the speed implications). After SchemaTuple will probably 
come SchemaBag, and after that some changes to serialization, at the 
suggestion of Scott Carey, that could be really huge.
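
To make the Varint point above concrete, here is a minimal sketch of a 
variable-length encoding for longs compared with a fixed 8-byte long. It 
assumes a generic unsigned, 7-bits-per-byte varint (the style used by Protocol 
Buffers); the message does not say which Varint format SchemaTuple uses, and 
the class and method names are hypothetical, not taken from the patch.

```java
import java.io.ByteArrayOutputStream;

public class VarintSketch {
    // Encodes a long as an unsigned varint: 7 payload bits per byte, with the
    // high bit set on every byte except the last. (Assumed format, see above.)
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Reverses encode(): reassembles the 7-bit groups, lowest group first.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= ((long) (b & 0x7F)) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 127L, 128L, 1_000_000L, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + ": " + encode(v).length
                    + " byte(s) as varint vs " + Long.BYTES + " fixed");
        }
    }
}
```

With this scheme, values below 128 take a single byte, while very large 
values take more than the fixed 8 bytes, which is why benchmark values that 
start at 0 and count upward favor a varint so strongly.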

But basically, SchemaTuple is better than Tuple in every way (wherever it 
applies). The current patch uses it where possible with UDFs, but the first 
patch will probably just introduce SchemaTuples (and perhaps use them with 
UDFs where the Schema is known), and then I'll incrementally add them 
elsewhere (the first candidates would be joins or anything internal to Pig; 
the memory and speed gains there should be substantial).
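
As a rough illustration of where the memory and speed gains could come from, 
the sketch below contrasts a generic tuple that stores boxed Objects in a list 
with a hand-written stand-in for what generated code might look like for a 
schema of two longs. All class, field, and method names here are hypothetical; 
the actual generated classes in the patch may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaTupleSketch {
    // Generic tuple: every field is a boxed Object held in a list.
    static class GenericTuple {
        private final List<Object> fields;

        GenericTuple(int size) {
            fields = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                fields.add(null); // null out each value up front
            }
        }

        Object get(int i) { return fields.get(i); }
        void set(int i, Object v) { fields.set(i, v); }
    }

    // Hypothetical generated tuple for a schema of two longs: primitive
    // fields plus null flags, so no boxing and no backing list.
    static class LongLongTuple {
        private long f0, f1;
        private boolean f0Null = true, f1Null = true;

        long getLong(int i) {
            switch (i) {
                case 0: return f0;
                case 1: return f1;
                default: throw new IndexOutOfBoundsException("field " + i);
            }
        }

        void setLong(int i, long v) {
            switch (i) {
                case 0: f0 = v; f0Null = false; break;
                case 1: f1 = v; f1Null = false; break;
                default: throw new IndexOutOfBoundsException("field " + i);
            }
        }

        // Generic accessor kept for callers that only know the Tuple-style API.
        Object get(int i) {
            boolean isNull = (i == 0) ? f0Null : f1Null;
            if (isNull) {
                return null;
            }
            return Long.valueOf(getLong(i));
        }
    }

    public static void main(String[] args) {
        GenericTuple generic = new GenericTuple(2);
        generic.set(0, 42L); // boxes the long into a Long object

        LongLongTuple generated = new LongLongTuple();
        generated.setLong(0, 42L); // stored as a primitive, no allocation

        System.out.println(generic.get(0) + " " + generated.getLong(0));
    }
}
```

The generic version pays for a list, an object reference per field, and a 
boxed Long per value; the schema-specific version holds the same data in two 
primitive fields and two flags.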
                
> Create a SchemaTuple which generates efficient Tuples via code gen
> ------------------------------------------------------------------
>
>                 Key: PIG-2632
>                 URL: https://issues.apache.org/jira/browse/PIG-2632
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11
>
>         Attachments: PIG-2632-0.patch, PIG-2632-1.patch, PIG-2632-3.patch, 
> schematuple benchmarking.pptx
>
>
> This work builds on Dmitriy's PrimitiveTuple work. The idea is that, knowing 
> the Schema on the frontend, we can code generate Tuples which can be used for 
> fun and profit. In rudimentary tests, the memory efficiency is 2-4x better, 
> and it's ~15% smaller serialized (heavily heavily depends on the data, 
> though). Need to do get/set tests, but assuming that it's on par (or even 
> faster) than Tuple, the memory gain is huge.
> Need to clean up the code and add tests.
> Right now, it generates a SchemaTuple for every inputSchema and outputSchema 
> given to UDF's. The next step is to make a SchemaBag, where I think the 
> serialization savings will be really huge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

