[jira] Commented: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce

Doug Cutting (JIRA) Wed, 18 Apr 2007 10:36:36 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489829 ]


Doug Cutting commented on HADOOP-485:
-------------------------------------

Sorry for spamming this issue, but I finally remembered the discussion that 
spawned this issue last summer.  The idea was that, when a compound key is 
used, one should be able to specify two comparators: one for sorting, and one 
for deciding how to group calls to reduce.  The original request was *not* 
about considering values in sort.  In the description above, the A1,V1 pairs 
are assumed to be compound keys, not key,value pairs.  Normally, since A1,V1 
does not equal A1,V2, these would be passed in separate calls to reduce.  The 
goal of this issue is to permit one to specify a reduce key comparator that 
*would* make A1,V1 equal to A1,V2 and hence result in only a single call to reduce.  
This would *only* be used when deciding how to break the stream of sorted keys 
into calls to reduce().
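
To make the two-comparator idea concrete, here is a minimal sketch in plain 
Python.  It is only a simulation of the behaviour discussed here, not Hadoop 
code and not the attached patch.  The names records, sort_cmp, group_cmp and 
run_reduce are made up for illustration, and the keys follow the A1..B2 
example from the issue description, modelled as (letter, number) tuples.

```python
from functools import cmp_to_key

# Hypothetical compound keys (letter, number), each paired with a value.
records = [
    (("B", 2), "V5"),
    (("A", 2), "V2"),
    (("A", 1), "V1"),
    (("B", 1), "V4"),
    (("A", 3), "V3"),
]


def sort_cmp(k1, k2):
    """Sort comparator: orders by the full compound key."""
    return (k1 > k2) - (k1 < k2)


def group_cmp(k1, k2):
    """Grouping comparator: compares only the letter part of the key."""
    return (k1[0] > k2[0]) - (k1[0] < k2[0])


def run_reduce(records, sort_cmp, group_cmp):
    """Sort with sort_cmp, then split the sorted stream into reduce calls.

    A new call starts whenever group_cmp reports that the next key differs
    from the first key of the current group.  Each call is returned as
    (first key of the group, list of values in sorted order).
    """
    ordered = sorted(records, key=lambda kv: cmp_to_key(sort_cmp)(kv[0]))
    calls = []
    for key, value in ordered:
        if calls and group_cmp(calls[-1][0], key) == 0:
            calls[-1][1].append(value)    # same group: extend the current call
        else:
            calls.append((key, [value]))  # key change: start a new call
    return calls


for key, values in run_reduce(records, sort_cmp, group_cmp):
    print(key, values)
```

Passing sort_cmp as the grouping comparator as well gives the current 
behaviour, one call per distinct compound key.  Passing group_cmp collapses 
all keys with the same letter into one call, and that call sees only the 
first key of its group.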

Note that this feature is required to implement something like the user-code 
ValueSorting described in the above comment.


> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a 
> particular order, but extending the key with the additional fields causes  
> them to be handled by different calls to reduce. (The user then collects the 
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator 
> that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); 
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end 
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of 
> the key is just for sorting because the reduce will only see the first 
> representative of each equivalence class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
