[jira] Commented: (MAPREDUCE-1639) Grouping using hashing instead of sorting

Joydeep Sen Sarma (JIRA) Sat, 27 Mar 2010 10:34:49 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850556#action_12850556
 ]


Joydeep Sen Sarma commented on MAPREDUCE-1639:
----------------------------------------------

realize this is non-trivial.

> 2. A much easier way to achieve their result is to include the hash in the 
> key as the first 4 bytes of the serialization, then have a comparator that 
> uses the first 4 bytes as memcmp before it uses the dispatched raw compare 
> method.

not sure about this. it doesn't change the fact that all keys are being 
compared against each other. just because we compare fewer bytes doesn't mean 
that the number of comparisons goes down. it's still n.logn. the cost of each 
comparison is lower though.

if we did a radix sort (first) on the leading 4 (hashed) bytes - then yeah - it 
changes things. but at that point we are taking (pretty much) the same approach 
as the paper and talking about a different data path.

> Grouping using hashing instead of sorting
> -----------------------------------------
>
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
>
> most applications of map-reduce care about grouping and not sorting. Sorting 
> is a (relatively expensive) way to achieve grouping. In order to achieve just 
> grouping - one can:
> - replace the sort on the Mappers with a HashTable - and maintain lists of 
> key-values against each hash-bucket.
> - key-value tuples inside each hash bucket are sorted - before spilling or 
> sending to Reducer. Anytime this is done - Combiner can be invoked.
> - HashTable is serialized by hash-bucketid. So merges (of either spills or 
> Map Outputs) works similar to today (at least there's no change in overall 
> compute complexity of merge)
> Of course this hashtable has nothing to do with partitioning. it's just a 
> replacement for map-side sort.
> --
> this is (pretty much) straight from the MARS project paper: 
> http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf. They report a 45% 
> speedup in inverted index calculation using hashing instead of sorting 
> (reference implementation is NOT against Hadoop though).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1639) Grouping using hashing instead of sorting

Reply via email to