[
https://issues.apache.org/jira/browse/PIG-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286758#comment-13286758
]
Jonathan Coveney commented on PIG-2638:
---------------------------------------
Ashutosh,
Given the way that we currently serialize values, there is actually no gain to
using varint, because we are, in all cases, writing a byte value that specifies
what is being serialized. In fact, we can do better than varint: instead of
needing a bit flag at the head of every byte, we can just have type tags like
the following:
INT_1BYTE,
INT_2BYTE,
INT_3BYTE,
INT_4BYTE
and the analogous set for longs. Given that there is currently no way NOT to
write that type identification byte, varint/varlong offers no gain, since you
can do it more compactly (given what we do) anyway. However, as I work to
remove that need (in SchemaTuple, for example), varint/varlong begins to make
a lot more sense.
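To make the comparison concrete, here is a minimal sketch of the
width-in-the-tag idea next to a tag-plus-varint size count. The tag values,
class name, and method names are made up for illustration and are not Pig's
actual BinInterSedes constants or API; it works on byte arrays rather than a
DataOutput to keep it short.

```java
public class TaggedIntSketch {
    // Hypothetical tag values for illustration only.
    static final byte INT_1BYTE = 10;
    static final byte INT_2BYTE = 11;
    static final byte INT_3BYTE = 12;
    static final byte INT_4BYTE = 13;

    /** Bytes needed to hold v as a signed two's-complement value. */
    static int bytesNeeded(int v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return 1;
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return 2;
        if (v >= -(1 << 23) && v < (1 << 23)) return 3;
        return 4;
    }

    /** One tag byte that names the width, then that many big-endian bytes. */
    static byte[] encode(int v) {
        int n = bytesNeeded(v);
        byte[] out = new byte[1 + n];
        out[0] = (byte) (INT_1BYTE + (n - 1));
        for (int i = 0; i < n; i++) {
            out[1 + i] = (byte) (v >> (8 * (n - 1 - i)));
        }
        return out;
    }

    /** Inverse of encode: the first payload byte carries the sign. */
    static int decode(byte[] in) {
        int n = in[0] - INT_1BYTE + 1;
        int v = in[1]; // sign-extends the most significant payload byte
        for (int i = 1; i < n; i++) {
            v = (v << 8) | (in[1 + i] & 0xFF);
        }
        return v;
    }

    /** Size of a type tag followed by an unsigned 7-bits-per-byte varint. */
    static int tagPlusVarintSize(int v) {
        int size = 1; // the type tag we have to write anyway
        do {
            size++;
            v >>>= 7;
        } while (v != 0);
        return size;
    }
}
```

Under these assumptions, 20000 takes 3 bytes with the width tag (tag plus 2
payload bytes) but 4 bytes as tag plus varint, because the varint spends one
bit per byte on its continuation flag.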
I think this patch is some really easy low-hanging fruit, and I have some more
sweeping ideas for greatly improving serialization performance in the future.
Would love your thoughts.
> Optimize BinInterSedes treatment of longs
> -----------------------------------------
>
> Key: PIG-2638
> URL: https://issues.apache.org/jira/browse/PIG-2638
> Project: Pig
> Issue Type: Improvement
> Reporter: Jonathan Coveney
> Assignee: Jonathan Coveney
> Fix For: 0.11, 0.10.1
>
> Attachments: PIG-2638-0.patch, PIG-2638-1.patch
>
>
> During adventures in BinInterSedes, I noticed that Integers are written in an
> optimized fashion, but longs are not. Given that, in the general case, we
> have to write type information anyway, we might as well do the same
> optimization for Longs. That is to say, given that most longs won't have 8
> bytes of information in them, why should we waste the space of serializing 8
> bytes?
> This patch takes its inspiration from varint encoding per these two sources:
> http://javasourcecode.org/html/open-source/mahout/mahout-0.5/org/apache/mahout/math/Varint.java.html
> https://developers.google.com/protocol-buffers/docs/encoding
> Though, nicely enough, we don't actually have to use varints. Since we HAVE
> to write an 8-bit type header, we might as well include the number of bytes
> we had to write. I use zigzag encoding so that negative numbers of small
> magnitude see the benefit as well.
> This should decrease the amount of serialized long data by a good bit.
> Patch incoming. It passes test-commit in 0.11.
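The approach described above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the code in the attached patches:
the LONG_BASE tag value, the class name, and the method names are
hypothetical, and the byte count is folded into the tag as LONG_BASE plus the
number of payload bytes.

```java
public class ZigZagLongSketch {
    // Hypothetical base tag; the written tag is LONG_BASE + payload byte count (0..8).
    static final byte LONG_BASE = 20;

    /** Maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small magnitudes stay small. */
    static long zigZagEncode(long v) {
        return (v << 1) ^ (v >> 63);
    }

    static long zigZagDecode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    /** Number of bytes up to and including the highest non-zero byte (0 for zero). */
    static int payloadBytes(long z) {
        return (64 - Long.numberOfLeadingZeros(z) + 7) / 8;
    }

    /** Tag byte carrying the payload length, then the zigzag value big-endian. */
    static byte[] encode(long v) {
        long z = zigZagEncode(v);
        int n = payloadBytes(z);
        byte[] out = new byte[1 + n];
        out[0] = (byte) (LONG_BASE + n);
        for (int i = 0; i < n; i++) {
            out[1 + i] = (byte) (z >>> (8 * (n - 1 - i)));
        }
        return out;
    }

    static long decode(byte[] in) {
        int n = in[0] - LONG_BASE;
        long z = 0;
        for (int i = 0; i < n; i++) {
            z = (z << 8) | (in[1 + i] & 0xFFL);
        }
        return zigZagDecode(z);
    }
}
```

With this layout, -1 costs 2 bytes (tag plus 1 payload byte) instead of 9,
and only values that genuinely need all 64 bits pay the full 9 bytes.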