[jira] Commented: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce

Doug Cutting (JIRA) Wed, 18 Apr 2007 10:10:36 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489818 ]


Doug Cutting commented on HADOOP-485:
-------------------------------------

First, I don't think we can assume that the key comparator will also work for 
values.  We should permit specification of an additional comparator for values 
(an exception should be thrown if values are not WritableComparables).

Second, this shouldn't be triggered for all jobs, but only those that specify a 
value comparator.

Third, this implementation seems inconsistent.  You've changed some code paths, 
but not considered other code paths.  For example, SequenceFile compares keys 
in lots of places.  If we want to permit SequenceFile's sorting tools to 
consider values, then we should consistently modify all of SequenceFile's uses 
of comparators.  Similarly, you've changed value sorting in one step of the 
sort, but there are many other places where sorting is done that I'm not sure 
are addressed by this patch.

I wonder if instead this might be more simply implemented at reduce time.  
ReduceTask could be modified to buffer and sort values.  If there are more 
values for a key than fit in memory, then the values could be spilled to disk 
as a SequenceFile and sorted.  Might that work?
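
To make the reduce-time idea concrete, here is a rough sketch in plain Java. 
It is not Hadoop code and not part of the patch: ValueSorter, maxInMemory and 
the in-memory "runs" list are hypothetical names, and the runs stand in for 
sorted spills that would really be written to disk (for example as 
SequenceFiles).  Values for one key are buffered, each full buffer is sorted 
with the user's value comparator and set aside as a run, and the runs are 
merged when the values are handed to reduce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper (not a Hadoop class): sorts the values of one key at reduce time. */
class ValueSorter<V> {
    private final Comparator<V> comparator;   // user-specified value comparator
    private final int maxInMemory;            // assumed stand-in for a memory limit
    private final List<V> buffer = new ArrayList<>();
    // Stand-in for sorted runs spilled to disk; kept in memory to stay self-contained.
    private final List<List<V>> runs = new ArrayList<>();

    ValueSorter(Comparator<V> comparator, int maxInMemory) {
        this.comparator = comparator;
        this.maxInMemory = maxInMemory;
    }

    /** Buffers one value and spills a sorted run once the buffer is full. */
    void add(V value) {
        buffer.add(value);
        if (buffer.size() >= maxInMemory) {
            spill();
        }
    }

    /** Sorts the current buffer and stores it as one run. */
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<V> run = new ArrayList<>(buffer);
        run.sort(comparator);
        runs.add(run);
        buffer.clear();
    }

    /** Merges all sorted runs into a single sorted list of values. */
    List<V> sorted() {
        spill();
        PriorityQueue<Cursor<V>> heap =
            new PriorityQueue<>((a, b) -> comparator.compare(a.head, b.head));
        for (List<V> run : runs) {
            heap.add(new Cursor<>(run.iterator()));
        }
        List<V> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<V> smallest = heap.poll();
            out.add(smallest.head);
            if (smallest.advance()) {
                heap.add(smallest);
            }
        }
        return out;
    }

    /** Position within one sorted run; runs are never empty. */
    private static final class Cursor<V> {
        private final Iterator<V> it;
        private V head;

        Cursor(Iterator<V> it) {
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            head = it.next();
            return true;
        }
    }
}
```

A real version would stream the merged values to reduce through an iterator 
rather than materializing a list, and would only be used for jobs that 
specify a value comparator.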

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a 
> particular order, but extending the key with the additional fields causes  
> them to be handled by different calls to reduce. (The user then collects the 
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator 
> that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); 
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end 
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of 
> the key is just for sorting because the reduce will only see the first 
> representative of each equivalence class.
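
The grouping behaviour described in the quoted issue can be sketched in plain 
Java as follows.  This is an illustration only, not the framework's code: 
GroupingSketch and group are hypothetical names, and the input is assumed to 
be already sorted by the full key.  A new reduce call starts whenever the 
grouping comparator reports a difference, and each call keeps the first key 
of its group as the representative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (not a Hadoop class) of a separate grouping comparator. */
class GroupingSketch {
    /**
     * Turns key-sorted input into reduce calls. Each returned entry is one call:
     * the first key seen for the group, plus every value grouped under it.
     */
    static <K, V> List<Map.Entry<K, List<V>>> group(
            List<Map.Entry<K, V>> sortedInput, Comparator<K> grouping) {
        List<Map.Entry<K, List<V>>> calls = new ArrayList<>();
        K representative = null;
        List<V> values = null;
        for (Map.Entry<K, V> record : sortedInput) {
            // Start a new group on the first record or when the grouping key changes.
            if (values == null || grouping.compare(representative, record.getKey()) != 0) {
                representative = record.getKey();
                values = new ArrayList<>();
                calls.add(new AbstractMap.SimpleEntry<>(representative, values));
            }
            values.add(record.getValue());
        }
        return calls;
    }
}
```

With the example input above and a grouping comparator that compares only the 
leading letter of the key, this yields two calls, one for A1 and one for B1.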

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
