Grouping using hashing instead of sorting
-----------------------------------------

                 Key: MAPREDUCE-1639
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Joydeep Sen Sarma


most applications of map-reduce care about grouping and not sorting. Sorting is 
a (relatively expensive) way to achieve grouping. In order to achieve just 
grouping - one can:

- replace the sort on the Mappers with a HashTable - and maintain lists of 
key-values against each hash-bucket.
- key-value tuples inside each hash bucket are sorted - before spilling or 
sending to Reducer. Anytime this is done - Combiner can be invoked.
- HashTable is serialized by hash-bucketid. So merges (of either spills or Map 
Outputs) works similar to today (at least there's no change in overall compute 
complexity of merge)

Of course this hashtable has nothing to do with partitioning. it's just a 
replacement for map-side sort.

--

this is (pretty much) straight from the MARS project paper: 
http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf. They report a 45% speedup 
in inverted index calculation using hashing instead of sorting (reference 
implementation is NOT against Hadoop though).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to