[jira] Updated: (PIG-1295) Binary comparator for secondary sort

Gianmarco De Francisci Morales (JIRA) Wed, 04 Aug 2010 11:12:45 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------

    Attachment: PIG-1295_0.12.patch

Ok, first working integration.
Modified PigTupleRawComparatorNew to use the raw comparators via TupleFactory.
Created a new class PigSecondaryKeyComparatorNew that should substitute the old 
one. This one uses the raw comparators.
Modified JobControlCompiler to use the new comparators.

Moved the null/index semantic outside the raw comparators and inside the 
wrappers.

Modified BinSedesTupleComparator to correctly handle sort order. The sort order 
is applied to the first call to compare tuples. In case we are doing a 
secondary sort, the sort orders are propagated 1 level more (because we have a 
nested tuple with the keys, and we need to apply the sort orders to the content 
of the outermost tuple).
The code is not the cleanest possible but TestPigTupleRawComparator and 
TestSecondarySort pass.

TODO:
Implement the logic for PIG-927.
I plan to create a new interface (TupleRawComparator) and add a method to check 
if during the comparison a field of type NULL was encountered. This interface 
will be used instead of the simple RawComparator to hold the reference to our 
raw comparators.

Write speed test.
Is there something already made that can be used to test the speed improvement? 
The inputs for the unit test are of course too small.

> Binary comparator for secondary sort
> ------------------------------------
>
>                 Key: PIG-1295
>                 URL: https://issues.apache.org/jira/browse/PIG-1295
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>             Fix For: 0.8.0
>
>         Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
> PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.2.patch, 
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
> PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of 
> comparator if available. The benefit of binary comparator is we do not need 
> to instantiate the object before we compare. We see a ~30% speedup after we 
> switch to binary comparator. Currently, Pig use binary comparator in 
> following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need 
> to do a sort in order to filter out duplicate values; however, we do not care 
> how comparator sort keys. Groupby also share this character. In this case, we 
> rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we 
> have implementation for simple types, such as integer, long, float, 
> chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> use secondary sort. In secondary sort, we convert the inner sort of nested 
> foreach into the secondary key and rely on hadoop to sorting on both main key 
> and secondary key. The sorting key will become a two items tuple. Since the 
> secondary key the sorting key of the nested foreach, so the sorting semantics 
> matters. It turns out we do not have binary comparator once we use secondary 
> sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on most common use cases 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. Semantics of the first key does not matter but semantics 
> of secondary key matters. We need to identify the boundary of main key and 
> secondary key in the binary tuple buffer without instantiate tuple itself. 
> Then if the first key equals, we use a binary comparator to compare secondary 
> key. Secondary key can also be a complex data type, but for the first step, 
> we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010" 
> program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

Reply via email to