[ 
https://issues.apache.org/jira/browse/PIG-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4656:
------------------------------------
    Summary: Improve String serialization and comparator performance in 
BinInterSedes  (was: Improve serialization and comparator performance in 
BinInterSedes)

After some more verification, it looks like the current TINY/SMALL approach is
better than WritableUtils.writeVInt. The current code uses the unsigned maximum
values for byte and short (255 and 65535), so it represents values up to 65535
in fewer bytes than WritableUtils.writeVInt, and most of the data falls into
that range. WritableUtils.writeVInt needs 2 bytes for values from 128 to 255
and 3 bytes for values from 256 to 65535, because one byte is spent on the
length. For example, 32767 takes 3 bytes, not 2. So it is better to keep the
current approach.

One thing that might be advantageous, though, is using
WritableUtils.writeVLong to serialize LONG instead of out.writeLong(). Values
of Math.pow(2, 56) and above take 9 bytes, but values between Math.pow(2, 32)
and Math.pow(2, 48) take 6 to 7 bytes, which is good. Timestamps, the most
commonly used longs, take 7 bytes instead of 8. Apart from the byte savings,
the time taken to serialize and deserialize needs to be measured to see whether
it is really advantageous. So I will deal with that in a separate jira, and in
this jira just fix the String serialization and comparison performance, which
is really bad.
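
The byte counts discussed above can be checked with a small sketch. It does
not call Hadoop; it recomputes the size of the zero-compressed encoding used by
WritableUtils.writeVInt/writeVLong from its documented layout (one byte for
-112..127, otherwise a length byte plus 1 to 8 data bytes). The class and
method names are hypothetical, and tinySmallSize assumes the TINY/SMALL scheme
is "unsigned byte up to 255, unsigned short up to 65535, int otherwise" and
that the type marker byte is needed in either scheme, so it is not counted.

```java
// Sketch only: hypothetical helper, not part of Pig or Hadoop.
final class LengthEncodingSizes {

    // Bytes used by Hadoop's zero-compressed long/int encoding:
    // 1 byte for -112..127, otherwise 1 length byte + 1..8 data bytes.
    static int vlongSize(long v) {
        if (v >= -112 && v <= 127) {
            return 1;
        }
        if (v < 0) {
            v = ~v; // negative values are stored as their one's complement
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(v);
        return (dataBits + 7) / 8 + 1;
    }

    // Bytes used for the size field under the assumed TINY/SMALL scheme
    // (type marker byte not counted).
    static int tinySmallSize(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length");
        }
        if (length <= 255) {
            return 1; // unsigned byte
        }
        if (length <= 65535) {
            return 2; // unsigned short
        }
        return 4;     // int
    }

    public static void main(String[] args) {
        long[] samples = {127, 255, 32767, 65535, 65536,
                          1L << 32, 1_440_000_000_000L, 1L << 56};
        for (long v : samples) {
            System.out.println(v + " -> vlong bytes: " + vlongSize(v));
        }
    }
}
```

The sample 1_440_000_000_000L stands in for a millisecond timestamp; it is an
illustrative value, not one taken from the issue.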

> Improve String serialization and comparator performance in BinInterSedes
> ------------------------------------------------------------------------
>
>                 Key: PIG-4656
>                 URL: https://issues.apache.org/jira/browse/PIG-4656
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>
> Two major optimizations can be done:
>   -  PIG-1472 added multiple data types to store different sizes (byte, 
> short, int). This can be simplified using WritableUtils.writeVInt. There is 
> no difference for byte and short compared to the current approach. But with 
> int, it could be beneficial, since a lot of numbers could be written with 
> 3 bytes instead of 4. For example, 32768 is written using 3 bytes with 
> WritableUtils.writeVInt, whereas currently 4 bytes (int) are used. 
>   -  String comparison in BinInterSedesTupleRawComparator constructs a 
> String for each comparison. It should instead compare the bytes directly, 
> like Text.Comparator.
> {code}
> str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
> str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
> {code}
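
A minimal sketch of the byte-wise alternative to the String construction
quoted above, assuming both values are UTF-8 encoded. The class and method
names are hypothetical and this is not the patch itself; it only illustrates
comparing the two byte ranges as unsigned bytes without allocating Strings.

```java
// Sketch only: hypothetical helper, not the actual Pig change.
final class Utf8ByteCompare {

    // Lexicographic comparison of two byte ranges, treating each byte as
    // unsigned. Returns a negative, zero, or positive value.
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // mask to get the unsigned value
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal: the shorter range sorts first.
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] x = "apple".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] y = "apricot".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(compareBytes(x, 0, x.length, y, 0, y.length) < 0);
    }
}
```

With the variables from the quoted snippet, a call might look like
compareBytes(bb1.array(), bb1.position(), casz1, bb2.array(), bb2.position(),
casz2). One design point to keep in mind: unsigned byte order on UTF-8 is
Unicode code point order, which can differ from String.compareTo (UTF-16 code
unit order) when supplementary characters are involved.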



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
