[ https://issues.apache.org/jira/browse/PIG-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-2975:
------------------------------

    Attachment: pig-2975-trunk_v03-unionapproach.txt

bq. 2. We could just make a custom WritableComparator for this case

My impression is that using the BytesWritable compare directly would be the fastest.

bq. (right now it is going to be TUPLE_1 / {TINYBYTEARRAY, SMALLBYTEARRAY, BYTEARRAY} / SIZE/ and so on.

If the header size is different, I would need a switch somewhere. So I thought of this lame approach.

{noformat}
/*
 * This class tries to optimize for the most common input, DataByteArray.
 * In order to preserve the alphabetical ordering for DataByteArray,
 * we skip the first 4 bytes when comparing.
 * For non-DataByteArray, an empty 4 bytes is added so that content is not
 * skipped by the above offset. Order for non-DataByteArray would look
 * random since it includes all the headers for comparisons.
 *
 * Bytes comparison is done by pair (isByteArray, mValue) to avoid any
 * potential collision among DataByteArray and non-DataByteArray.
 *
 * //Serialization structure
 * struct {
 *   byte mNull;
 *   int size; (empty for non-DataByteArray)
 *   byte isByteArray;
 *   union {
 *     byte [size];      //for DataType.BYTEARRAY
 *     Tuple.serialized  //for all others
 *   } mValue;
 *   byte mIndex;
 * }
 *
 */
{noformat}

This sacrifices space for performance.

* For DataType.BYTEARRAY, it adds 2 more bytes for a small record (<256).
  size(4 bytes) + isByteArray(1 byte) = 5 bytes
  Before, it was TUPLE_1(1 byte) + TINYBYTEARRAY(1 byte) + size(1 byte) = 3 bytes
* For non-BYTEARRAY, it adds 5 bytes: empty 4 bytes + 1 byte boolean. This is in addition to whatever Tuple adds when serialized.
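To illustrate why the comparison skips the 4-byte size field, here is a minimal sketch in plain Java. It is not the code in the attached patch: the class name UnionKeySketch and its methods are hypothetical, the mNull and mIndex bytes are left out for brevity, and a hand-written unsigned byte loop stands in for the BytesWritable comparison. It only models the (size, isByteArray, mValue) part of the layout described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "union" key layout; not the actual patch.
class UnionKeySketch {
    // Width of the size field that the comparison skips.
    static final int SIZE_BYTES = 4;

    // Simplified layout: [size:int][isByteArray:byte][payload].
    // For non-bytearray values the size field is left as four zero bytes.
    static byte[] encode(boolean isByteArray, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE_BYTES + 1 + payload.length);
        buf.putInt(isByteArray ? payload.length : 0);
        buf.put((byte) (isByteArray ? 1 : 0));
        buf.put(payload);
        return buf.array();
    }

    // Unsigned lexicographic comparison starting at the given offset;
    // when one key is a prefix of the other, the shorter key sorts first.
    static int compareFrom(byte[] a, byte[] b, int offset) {
        int n = Math.min(a.length, b.length);
        for (int i = offset; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Compare on the pair (isByteArray, payload), skipping the size field.
    static int compare(byte[] a, byte[] b) {
        return compareFrom(a, b, SIZE_BYTES);
    }

    public static void main(String[] args) {
        byte[] aa = encode(true, "aa".getBytes(StandardCharsets.UTF_8));
        byte[] b = encode(true, "b".getBytes(StandardCharsets.UTF_8));
        // Skipping the size field keeps "aa" before "b".
        System.out.println("skip size:    " + Integer.signum(compare(aa, b)));
        // Including the size field lets the length decide the order instead.
        System.out.println("include size: " + Integer.signum(compareFrom(aa, b, 0)));
    }
}
```

In this sketch, comparing from offset 0 would order keys by length first, which is the reason for the skip. The isByteArray byte is compared before the payload, so a bytearray key and a non-bytearray key never compare as equal even if their remaining bytes match.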
> TestTypedMap.testOrderBy failing with incorrect result
> -------------------------------------------------------
>
>                 Key: PIG-2975
>                 URL: https://issues.apache.org/jira/browse/PIG-2975
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: 0.11
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Blocker
>             Fix For: 0.11
>
>         Attachments: PIG-2975-0_jco.patch, PIG-2975-0_jco-v2.patch, pig-2975-trunk_v01.txt, pig-2975-trunk_v02-broken.txt, pig-2975-trunk_v03-unionapproach.txt
>
>
> Looked at
> {noformat}
> junit.framework.AssertionFailedError
>     at org.apache.pig.test.TestTypedMap.testOrderBy(TestTypedMap.java:352)
> {noformat}
> This looks like a valid test case failing with incorrect result.
> {noformat}
> % cat test/orderby.txt
> [key#1,key9#23]
> [key#3,key3#2]
> [key#22]
> % cat test/orderby.pig
> a = load 'test/orderby.txt' as (m:[]);
> b = foreach a generate m#'key' as b0;
> dump b;
> c = order b by b0;
> dump c;
> % java ... org.apache.pig.Main -x local test/orderby.pig
> [dump b]
> (1)
> (3)
> (22)
> ...
> [dump c]
> (1)
> (1)
> (22)
> %
> where did the '(3)' go?
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira