[jira] Commented: (PIG-1295) Binary comparator for secondary sort

Gianmarco De Francisci Morales (JIRA) Tue, 03 Aug 2010 11:02:13 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894961#action_12894961
 ]


Gianmarco De Francisci Morales commented on PIG-1295:
-----------------------------------------------------

Thanks for the suggestion Daniel.

My idea to integrate the RawTupleComparator is to modify 
PigSecondaryKeyComparator to delegate the comparison to it. I cannot mimic the 
current behavior of making 2 calls (one for the main and one for the secondary 
key) because I do not know the boundaries between the keys. So I need to make a 
single call to compare the compound key. There are some issues though.

1) There are some inconsistencies between the behavior of PigTupleRawComparator 
(the original one) and PigSecondaryKeyComparator.
Specifically, when tuples are null the first one returns 0 while the second one 
compares the indexes.
Furthermore, indexes are compared also when one of the fields in the tuples is 
null, in order not to join them (if I understood correctly PIG-927). This is 
done in PigSecondaryKeyComparator but not in PigTupleRawComparator.
Is it designed to be like this or is it a bug?
I suppose the behaviors of the two comparators should be more or less the same.

2) In PigSecondaryKeyComparator the key is assumed to be a 2-field tuple where 
the 0th field is the main key and the 1st field is the secondary key.
We can directly feed the binary representation of this tuple inside our new raw 
comparator, but we need to consolidate the sort orders. Right now there are 2 
different and independent sort orders serialized in the jobConf (pig.sortOrder 
and pig.secondarySortOrder). In the simple case, when sort orders for all the 
columns are specified, we can just concatenate them together (sort of). There 
are some problems when we have WholeTuple sort orders as they might differ. 
I would like to keep all of this out of the tuple comparator and define some 
clean interface to pass the sort orders.  One problem I see is that I probably 
need the tuple sizes (recursively) to do this, and this is not known at 
configuration time. I also need to fix this in the current comparator, in order 
to take into account the recursion inside nested tuples.

3) Should I keep all the mIndex/mNull handling outside the RawTupleComparator 
and write wrappers that deal with them?
That is, should the RawTupleComparator know how to deal with a NullableTuple or 
should it just know its kind of Tuple (BinInterSedes or Default)

> Binary comparator for secondary sort
> ------------------------------------
>
>                 Key: PIG-1295
>                 URL: https://issues.apache.org/jira/browse/PIG-1295
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>             Fix For: 0.8.0
>
>         Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
> PIG-1295_0.11.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, 
> PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, 
> PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of 
> comparator if available. The benefit of binary comparator is we do not need 
> to instantiate the object before we compare. We see a ~30% speedup after we 
> switch to binary comparator. Currently, Pig use binary comparator in 
> following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need 
> to do a sort in order to filter out duplicate values; however, we do not care 
> how comparator sort keys. Groupby also share this character. In this case, we 
> rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we 
> have implementation for simple types, such as integer, long, float, 
> chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> use secondary sort. In secondary sort, we convert the inner sort of nested 
> foreach into the secondary key and rely on hadoop to sorting on both main key 
> and secondary key. The sorting key will become a two items tuple. Since the 
> secondary key the sorting key of the nested foreach, so the sorting semantics 
> matters. It turns out we do not have binary comparator once we use secondary 
> sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on most common use cases 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. Semantics of the first key does not matter but semantics 
> of secondary key matters. We need to identify the boundary of main key and 
> secondary key in the binary tuple buffer without instantiate tuple itself. 
> Then if the first key equals, we use a binary comparator to compare secondary 
> key. Secondary key can also be a complex data type, but for the first step, 
> we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010" 
> program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1295) Binary comparator for secondary sort

Reply via email to