[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883486#action_12883486 ]
Hadoop QA commented on PIG-1295: -------------------------------- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448251/PIG-1295_0.6.patch against trunk revision 958666. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 150 javac compiler warnings (more than the trunk's current 145 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 402 release audit warnings (more than the trunk's current 399 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/console This message is automatically generated. > Binary comparator for secondary sort > ------------------------------------ > > Key: PIG-1295 > URL: https://issues.apache.org/jira/browse/PIG-1295 > Project: Pig > Issue Type: Improvement > Components: impl > Affects Versions: 0.7.0 > Reporter: Daniel Dai > Assignee: Gianmarco De Francisci Morales > Fix For: 0.8.0 > > Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, > PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch > > > When hadoop framework doing the sorting, it will try to use binary version of > comparator if available. The benefit of binary comparator is we do not need > to instantiate the object before we compare. We see a ~30% speedup after we > switch to binary comparator. Currently, Pig use binary comparator in > following case: > 1. When semantics of order doesn't matter. For example, in distinct, we need > to do a sort in order to filter out duplicate values; however, we do not care > how comparator sort keys. Groupby also share this character. In this case, we > rely on hadoop's default binary comparator > 2. Semantics of order matter, but the key is of simple type. In this case, we > have implementation for simple types, such as integer, long, float, > chararray, databytearray, string > However, if the key is a tuple and the sort semantics matters, we do not have > a binary comparator implementation. This especially matters when we switch to > use secondary sort. In secondary sort, we convert the inner sort of nested > foreach into the secondary key and rely on hadoop to sorting on both main key > and secondary key. The sorting key will become a two items tuple. Since the > secondary key the sorting key of the nested foreach, so the sorting semantics > matters. It turns out we do not have binary comparator once we use secondary > sort, and we see a significant slow down. > Binary comparator for tuple should be doable once we understand the binary > structure of the serialized tuple. We can focus on most common use cases > first, which is "group by" followed by a nested sort. In this case, we will > use secondary sort. Semantics of the first key does not matter but semantics > of secondary key matters. We need to identify the boundary of main key and > secondary key in the binary tuple buffer without instantiate tuple itself. > Then if the first key equals, we use a binary comparator to compare secondary > key. Secondary key can also be a complex data type, but for the first step, > we focus on simple secondary key, which is the most common use case. > We mark this issue to be a candidate project for "Google summer of code 2010" > program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.