[ 
https://issues.apache.org/jira/browse/PIG-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-2975:
------------------------------

    Attachment: pig-2975-trunk_v03-unionapproach.txt

bq. 2. We could just make a custom WritableComparator for this case

My impression is that using the BytesWritable compare directly would be the 
fastest.


bq. (right now it is going to be TUPLE_1 / {TINYBYTEARRAY, SMALLBYTEARRAY, 
BYTEARRAY} / SIZE/ and so on. 

If the header size is different, I would need a switch somewhere. So I thought 
of this lame approach.

{noformat}
/*
 * This class tries to optimize for the most common input, DataByteArray.
 * In order to preserve the alphabetical ordering for DataByteArray,
 * we skip the first 4 bytes when comparing.
 * For non-DataByteArray, 4 empty bytes are added so that content is not
 * skipped by the above offset.  Order for non-DataByteArray would look
 * random since it includes all the headers in the comparison.
 *
 * Byte comparison is done on the pair (isByteArray, mValue) to avoid any
 * potential collision between DataByteArray and non-DataByteArray.
 *
 * Serialization structure:
 * struct {
 *   byte mNull;
 *   int size;         //empty for non-DataByteArray
 *   byte isByteArray;
 *   union {
 *    byte [size];      //for DataType.BYTEARRAY
 *    Tuple.serialized  //for all others
 *   } mValue;
 *   byte mIndex;
 * }
 *
 */
{noformat}

This sacrifices space for performance.
* For DataType.BYTEARRAY, it adds 2 more bytes for small records (<256 bytes):
size (4 bytes) + isByteArray (1 byte) = 5 bytes.
Before, it was TUPLE_1 (1 byte) + TINYBYTEARRAY (1 byte) + size (1 byte) = 3 bytes.

* For non-BYTEARRAY, it adds 5 bytes: 4 empty bytes + a 1-byte boolean. This is in 
addition to whatever Tuple adds when serialized.
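
To make the layout concrete, here is a rough, self-contained sketch of the idea. It is not the code in the attached patch. The class name UnionKeySketch and its methods are made up for illustration, mNull and mIndex are left out for brevity, and the comparison is a plain unsigned byte-by-byte loop standing in for whatever raw comparator is actually used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the union layout; not taken from the patch.
final class UnionKeySketch {
    static final int SIZE_FIELD = 4; // the "int size" field of the struct

    // DataByteArray case: size (4 bytes) + isByteArray = 1 + raw bytes.
    static byte[] encodeByteArray(byte[] value) {
        return ByteBuffer.allocate(SIZE_FIELD + 1 + value.length)
                .putInt(value.length).put((byte) 1).put(value).array();
    }

    // All other types: 4 empty bytes + isByteArray = 0 + serialized tuple.
    static byte[] encodeOther(byte[] serializedTuple) {
        return ByteBuffer.allocate(SIZE_FIELD + 1 + serializedTuple.length)
                .putInt(0).put((byte) 0).put(serializedTuple).array();
    }

    // Unsigned lexicographic compare that skips the 4-byte size field,
    // so the pair (isByteArray, mValue) decides the order.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = SIZE_FIELD; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] one = encodeByteArray("1".getBytes(StandardCharsets.UTF_8));
        byte[] three = encodeByteArray("3".getBytes(StandardCharsets.UTF_8));
        byte[] twentyTwo = encodeByteArray("22".getBytes(StandardCharsets.UTF_8));
        System.out.println(Integer.signum(compare(one, three)));
        System.out.println(Integer.signum(compare(twentyTwo, three)));
        System.out.println(one.length - 1); // header overhead in bytes
    }
}
```

Because the size field is skipped, a longer value such as "22" still sorts before "3", which is the alphabetical order expected for bytearrays. The isByteArray byte is compared first, so a bytearray and a serialized tuple with the same payload bytes never compare as equal.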

                
> TestTypedMap.testOrderBy failing with incorrect result 
> -------------------------------------------------------
>
>                 Key: PIG-2975
>                 URL: https://issues.apache.org/jira/browse/PIG-2975
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: 0.11
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Blocker
>             Fix For: 0.11
>
>         Attachments: PIG-2975-0_jco.patch, PIG-2975-0_jco-v2.patch, 
> pig-2975-trunk_v01.txt, pig-2975-trunk_v02-broken.txt, 
> pig-2975-trunk_v03-unionapproach.txt
>
>
> Looked at 
> {noformat}
> junit.framework.AssertionFailedError
>     at org.apache.pig.test.TestTypedMap.testOrderBy(TestTypedMap.java:352)
> {noformat}
> This looks like a valid test case failing with incorrect result.
> {noformat}
> % cat test/orderby.txt
> [key#1,key9#23]
> [key#3,key3#2]
> [key#22]
> % cat test/orderby.pig
> a = load 'test/orderby.txt' as (m:[]);
> b = foreach a generate m#'key' as b0;
> dump b;
> c = order b by b0;
> dump c;
> % java ... org.apache.pig.Main    -x local test/orderby.pig 
> [dump b]
> (1)
> (3)
> (22)
> ...
> [dump c]
> (1)
> (1)
> (22)
> %
> where did the '(3)' go?
> {noformat}
