[ https://issues.apache.org/jira/browse/PIG-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698434#comment-14698434 ]
Daniel Dai commented on PIG-4656:
---------------------------------

The chararray serde change looks fine. Just curious: why does Text.encode perform better than out.writeUTF?

For the BinInterSedesTupleRawComparator change, shouldn't it compare using String semantics rather than byte semantics? BinInterSedesTupleRawComparator is mostly used in secondary sort, and a typical example is nested sort. In that case we should sort in alphabetical order, for which we have to deserialize the String, don't we?

> Improve String serialization and comparator performance in BinInterSedes
> ------------------------------------------------------------------------
>
>                 Key: PIG-4656
>                 URL: https://issues.apache.org/jira/browse/PIG-4656
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4656-1.patch
>
>
> Two major optimizations can be done:
>    - PIG-1472 added multiple data types to store different sizes (byte,
> short, int). It can be simplified using WritableUtils.writeVInt. There is no
> difference for byte and short compared to the current approach. But with int, it
> could be beneficial where a lot of numbers could be written with 3 bytes
> instead of 4. For example, 32768 is written using 3 bytes with
> WritableUtils.writeVInt, whereas currently 4 bytes (int) are used.
>    - String comparison in BinInterSedesTupleRawComparator initializes a String
> for each comparison. It should instead compare bytes like Text.Comparator.
> {code}
> str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
> str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
> {code}
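
On the String-versus-byte semantics question, the following stand-alone sketch may help frame the discussion. It is not taken from the PIG-4656 patch; the class name Utf8OrderSketch and its helper methods are hypothetical and use only the JDK. It relies on two general facts: String.compareTo orders by UTF-16 code unit, while an unsigned byte-wise comparison of UTF-8 orders by Unicode code point. The two orders agree unless a supplementary character (encoded as a surrogate pair in UTF-16) is compared against a BMP character in the range U+E000 to U+FFFF. Neither order is locale-aware alphabetical order; for instance, both place "Zebra" before "apple".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from Pig or from the attached patch.
class Utf8OrderSketch {

    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return a.length - b.length;
    }

    // Compare two strings by their UTF-8 encodings (byte semantics).
    static int compareAsUtf8(String s1, String s2) {
        return compareUnsignedBytes(
                s1.getBytes(StandardCharsets.UTF_8),
                s2.getBytes(StandardCharsets.UTF_8));
    }

    // True when String.compareTo and the byte-wise comparison give the same sign.
    static boolean sameOrder(String s1, String s2) {
        return Integer.signum(s1.compareTo(s2))
                == Integer.signum(compareAsUtf8(s1, s2));
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"apple", "banana"},          // plain ASCII
            {"Zebra", "apple"},           // uppercase sorts before lowercase in both orders
            {"caf\u00e9", "cafz"},        // non-ASCII BMP character
            {"\uFF21", "\uD83D\uDE00"},   // U+FF21 versus supplementary U+1F600
        };
        for (String[] p : pairs) {
            System.out.println(
                    "String.compareTo sign=" + Integer.signum(p[0].compareTo(p[1]))
                    + ", UTF-8 byte sign=" + Integer.signum(compareAsUtf8(p[0], p[1]))
                    + ", same order=" + sameOrder(p[0], p[1]));
        }
    }
}
```

Only the last pair is ordered differently by the two comparisons. Whether that difference matters for nested sort depends on whether the ordering has to match String.compareTo exactly, which is the open question in this comment.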