[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------
Attachment: PIG-1295_0.8.patch
I added the code for "if the user does not use DefaultTuple we fall back to the
default deserialization case". I assume the user defined tuple will have a
different DataType byte from DataType.TUPLE. If this is not the case, I see no
way of discerning DefaultTuple from any other Tuple implementation.
Anyway, I think this issue needs to be properly addressed in the context of
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472].
I added support for BIGCHARARRAY.
UTF-8 decoding is quite convoluted. It is a variable length encoding, so we
cannot avoid using a String. [UTF-8|http://en.wikipedia.org/wiki/UTF-8]
Before tackling the integration with
[PIG-1472|https://issues.apache.org/jira/browse/PIG-1472] I need to familiarize
with the code in the patch. I will write a proposal for the integration in the
next days.
I also made some changes to DataByteArray in order to encapsulate the logic for
comparison in a publicly accessible method. This way the raw comparison is
consistent with the behavior of the class, in a way similar to the other cases
where I delegate comparison to the class.
> Binary comparator for secondary sort
> ------------------------------------
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Daniel Dai
> Assignee: Gianmarco De Francisci Morales
> Fix For: 0.8.0
>
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch,
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch,
> PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of
> comparator if available. The benefit of binary comparator is we do not need
> to instantiate the object before we compare. We see a ~30% speedup after we
> switch to binary comparator. Currently, Pig use binary comparator in
> following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need
> to do a sort in order to filter out duplicate values; however, we do not care
> how comparator sort keys. Groupby also share this character. In this case, we
> rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we
> have implementation for simple types, such as integer, long, float,
> chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have
> a binary comparator implementation. This especially matters when we switch to
> use secondary sort. In secondary sort, we convert the inner sort of nested
> foreach into the secondary key and rely on hadoop to sorting on both main key
> and secondary key. The sorting key will become a two items tuple. Since the
> secondary key the sorting key of the nested foreach, so the sorting semantics
> matters. It turns out we do not have binary comparator once we use secondary
> sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary
> structure of the serialized tuple. We can focus on most common use cases
> first, which is "group by" followed by a nested sort. In this case, we will
> use secondary sort. Semantics of the first key does not matter but semantics
> of secondary key matters. We need to identify the boundary of main key and
> secondary key in the binary tuple buffer without instantiate tuple itself.
> Then if the first key equals, we use a binary comparator to compare secondary
> key. Secondary key can also be a complex data type, but for the first step,
> we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010"
> program.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.