[ https://issues.apache.org/jira/browse/PIG-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700087#comment-14700087 ]
Rohini Palaniswamy commented on PIG-4656:
-----------------------------------------
out.writeUTF takes more bytes than Text.encode plus a written-out length:
{code}
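import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import org.apache.hadoop.io.Text;
public class WriteUTFSize {
  public static void main(String[] args) throws Exception {
    String s = "test";
    // DataOutputStream.writeUTF: 2-byte length prefix, then the encoded bytes
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    DataOutputStream os = new DataOutputStream(bout);
    os.writeUTF(s);
    byte[] b = bout.toByteArray();
    System.out.println(b.length);
    System.out.println(Arrays.toString(b));
    // Plain UTF-8 bytes, no length prefix
    System.out.println(s.getBytes("UTF-8").length);
    System.out.println(Arrays.toString(s.getBytes("UTF-8")));
    // Hadoop Text.encode: UTF-8 bytes in a ByteBuffer, no length prefix
    ByteBuffer bb = Text.encode(s);
    System.out.println(bb.array().length);
    System.out.println(bb.limit());
    System.out.println(Arrays.toString(bb.array()));
  }
}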
{code}
Output:
{code}
6
[0, 4, 116, 101, 115, 116]
4
[116, 101, 115, 116]
4
4
[116, 101, 115, 116]
{code}
With Text.encode we have to write out the length ourselves, so the total
should be 5 bytes (1 byte for the length, whose value is 4, plus 4 bytes of
data), whereas writeUTF takes 6 bytes. bb.array().length is sometimes greater
than bb.limit(), probably due to array expansion, but we only write out up to
bb.limit().
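The 6-versus-5 byte difference can be reproduced without Hadoop on the
classpath. The sketch below is only an illustration: lengthPrefixed is a
hypothetical helper (not part of the patch) that writes one length byte
followed by the plain UTF-8 bytes, and it assumes the payload is shorter than
128 bytes so that the length fits in a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8SizeSketch {
    // DataOutputStream.writeUTF: 2-byte length prefix + encoded payload.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bout)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bout.toByteArray();
    }

    // Hypothetical stand-in for "length + Text.encode bytes":
    // one length byte + plain UTF-8 payload.
    // Assumption of this sketch: payload is at most 127 bytes.
    static byte[] lengthPrefixed(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch handles payloads < 128 bytes only");
        }
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        bout.write(utf8.length);            // single length byte
        bout.write(utf8, 0, utf8.length);   // payload
        return bout.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(viaWriteUTF("test").length);     // 6
        System.out.println(lengthPrefixed("test").length);  // 5
    }
}
```

For short strings the saving is one byte per value, which adds up when many
small chararrays are serialized.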
WritableComparator.compareBytes compares binary data in lexicographic order.
When we don't use tuples and just do an order by on a primitive chararray
type, the chararray is serialized using TextWritable and the comparison logic
used in PigTextRawComparator is Text.Comparator. This patch mimics the same
behavior in BinInterSedes.
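A minimal sketch of lexicographic byte comparison, using only the JDK.
compareBytesSketch is a hypothetical stand-in written for illustration, not the
Hadoop implementation; it treats each byte as unsigned and falls back to the
length when one range is a prefix of the other.

```java
import java.nio.charset.StandardCharsets;

final class ByteCompareSketch {
    // Compare two byte ranges lexicographically, each byte as unsigned 0..255.
    static int compareBytesSketch(byte[] b1, int s1, int l1,
                                  byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // Common prefix is equal: the shorter range sorts first.
        return l1 - l2;
    }

    // Compare two strings by their UTF-8 bytes. A raw comparator would read
    // the already-serialized bytes instead of encoding again as done here.
    static int compareUtf8(String x, String y) {
        byte[] bx = x.getBytes(StandardCharsets.UTF_8);
        byte[] by = y.getBytes(StandardCharsets.UTF_8);
        return compareBytesSketch(bx, 0, bx.length, by, 0, by.length);
    }

    public static void main(String[] args) {
        System.out.println(compareUtf8("apple", "banana") < 0);  // true
        System.out.println(compareUtf8("test", "test") == 0);    // true
        System.out.println(compareUtf8("test", "tests") < 0);    // true
        System.out.println(compareUtf8("\u00e9", "z") > 0);      // true
    }
}
```

The unsigned mask matters: the first UTF-8 byte of "\u00e9" is 0xC3, which is
negative as a signed Java byte and would otherwise sort before "z". Note also
that UTF-8 byte order follows code point order, which can differ from
String.compareTo for characters outside the Basic Multilingual Plane.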
> Improve String serialization and comparator performance in BinInterSedes
> ------------------------------------------------------------------------
>
> Key: PIG-4656
> URL: https://issues.apache.org/jira/browse/PIG-4656
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4656-1.patch
>
>
> Two major optimizations can be done:
> - PIG-1472 added multiple data types to store different sizes (byte,
> short, int). This can be simplified using WritableUtils.writeVInt. There is
> no difference for byte and short compared to the current approach, but for
> int it could be beneficial where a lot of numbers can be written with 3
> bytes instead of 4. For example, 32768 is written using 3 bytes with
> WritableUtils.writeVInt, whereas currently 4 bytes (int) are used.
> - String comparison in BinInterSedesTupleRawComparator initializes Strings
> for comparison. It should instead compare bytes, like Text.Comparator.
> {code}
> str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
> str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
> {code}
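The 3-byte figure for 32768 in the description above can be checked with a
small size calculator. This is a sketch under a stated assumption about the
vint layout: values in [-112, 127] occupy one byte, and anything else takes
one marker byte plus the minimal number of value bytes. vIntSize is a
hypothetical helper written for this illustration, not a call into Hadoop.

```java
final class VIntSizeSketch {
    // Bytes needed under the assumed vint layout described in the lead-in.
    static int vIntSize(long i) {
        if (i >= -112 && i <= 127) {
            return 1;                      // small values fit in one byte
        }
        if (i < 0) {
            i = ~i;                        // negatives use the one's complement
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(i);
        return (dataBits + 7) / 8 + 1;     // value bytes + one marker byte
    }

    public static void main(String[] args) {
        int[] samples = {4, 127, 128, 32767, 32768, 65536, 16777216};
        for (int v : samples) {
            System.out.println(v + " -> " + vIntSize(v)
                + " byte(s) as vint, " + Integer.BYTES + " as fixed int");
        }
    }
}
```

Under this layout, values of 2^24 and above need 5 bytes, one more than a
fixed int, so the net gain depends on how the values are distributed.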