Hi all,

Recently I've been looking around (using Google ;) ) to see what applications of the map-reduce paradigm have been described in published sources, and what classes of problems people have tried to solve with map-reduce.

To my surprise, I found very few examples. Apart from the two well-known papers by Google (one describing MapReduce, the other describing Sawzall), there seems to be very little information in this area.

Why is that? Is it because people don't know about it, because other models for tackling out-of-core tasks are more popular, or because it's simply not applicable to most problems out there? I'm not sure. From my experience I know that it's often not obvious (and sometimes impossible?) how to decompose an existing algorithm so that it fits the map-reduce paradigm.

Do people use Hadoop for tasks outside the well-known class of web-related problems?

I will share two examples of how I use Hadoop - one is simple, the other less so.

I'm using Hadoop to build co-occurrence vectors for phrases in a large corpus of documents from a specific area. Phrases come from a pre-defined vocabulary and consist of 1-10 words. The map-reduce decomposition of the problem is straightforward: map() creates co-occurrences for the current document (or actually, for each sentence), and reduce() aggregates them to build a global co-occurrence table.
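
To make that concrete, here is a minimal sketch of the mapper and reducer (class names are made up for this post; the vocabulary lookup and phrase extraction are left out, and every token is treated as a phrase):

// Sketch only: assumes one sentence per input record and treats every
// token as a phrase; the real job looks phrases up in the vocabulary.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Cooccurrence {

  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable offset, Text sentence, Context context)
        throws IOException, InterruptedException {
      String[] phrases = sentence.toString().split("\\s+");
      // emit a count of 1 for every ordered pair of phrases in the sentence
      for (int i = 0; i < phrases.length; i++) {
        for (int j = 0; j < phrases.length; j++) {
          if (i != j) {
            context.write(new Text(phrases[i] + "\t" + phrases[j]), ONE);
          }
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text pair, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      // aggregate per-sentence counts into the global co-occurrence table
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      context.write(pair, new LongWritable(sum));
    }
  }
}

The same reducer can also be plugged in as a combiner to cut down the amount of intermediate data.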

Another example: I'm working on an implementation of a minimal perfect hash function (MPHF) for large key collections (currently testing with 100 million keys). map() partitions the input keys into buckets of at most 256 keys using a universal hash function, and reduce() then calculates an MPHF over each bucket.
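
In outline it looks something like the sketch below; buildMPHF() is just a placeholder for the actual per-bucket MPHF construction, and the hash is simplified:

// Sketch: map() spreads keys over enough buckets that each stays well
// under 256 keys, reduce() builds a per-bucket MPHF.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MphfBuckets {

  // ~100 million keys / 2^20 buckets gives roughly 100 keys per bucket
  private static final int NUM_BUCKETS = 1 << 20;

  public static class BucketMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text key, Context context)
        throws IOException, InterruptedException {
      // stand-in for a proper universal hash function
      int bucket = (key.toString().hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
      context.write(new IntWritable(bucket), key);
    }
  }

  public static class MphfReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable bucket, Iterable<Text> keys, Context context)
        throws IOException, InterruptedException {
      List<String> bucketKeys = new ArrayList<String>();
      for (Text k : keys) {
        bucketKeys.add(k.toString());
      }
      // build and serialize the MPHF for this bucket (placeholder)
      context.write(bucket, new Text(buildMPHF(bucketKeys)));
    }

    private String buildMPHF(List<String> keys) {
      return "mphf over " + keys.size() + " keys";
    }
  }
}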

Yet another example (ok, that's three, not two ;) ): web graph compression using a "brute force" method. This is somewhat more involved...

Let's assume that we have an existing representation of a webgraph in the form of adjacency lists, i.e. a mapping of sourceUrl -> (targetUrl1, targetUrl2, ...). URLs are vertices in the graph, and links are edges.

First, in a map-reduce job I collect all unique URLs - this is simple, so I'll skip the explanation. Then I assign sequential integer vertex IDs to them (this is not a map-reduce job, just a sweep through a MapFile containing all URLs).
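
The ID-assignment sweep is just a small driver program along these lines (a sketch; the paths and the assumption of Text keys with NullWritable values in the MapFile are mine):

// Sketch of the non-map-reduce sweep: read the MapFile of unique URLs in
// sorted order and append sequential integer IDs to a SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AssignVertexIds {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader urls = new MapFile.Reader(fs, "unique-urls", conf);
    SequenceFile.Writer ids = SequenceFile.createWriter(
        fs, conf, new Path("url-ids"), Text.class, IntWritable.class);
    Text url = new Text();
    NullWritable none = NullWritable.get();
    int id = 1;
    // MapFile keys come back in sorted order, so the numbering is stable
    while (urls.next(url, none)) {
      ids.append(url, new IntWritable(id++));
    }
    ids.close();
    urls.close();
  }
}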

In the next step I split the webgraph in a way that assigns a unique identifier to each adjacency list, and then records whether a particular URL is a source or a target on that list:

v1 -> (v2,v3,v4)   =>     v1 -> L1:s   =>     v1 -> (L1:s,L2:t)
v2 -> (v1,v5,v6)   map    v2 -> L1:t  reduce  v2 -> (L1:t,L2:s)
...                       v3 -> L1:t          v3 -> (L1:t)
                          v4 -> L1:t          ...
                          v2 -> L2:s
                          v1 -> L2:t
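
A rough sketch of this job, assuming one adjacency list per line (sourceUrl, a tab, then space-separated target URLs), with the list id derived from the input file name and record offset - one way to make it unique per list; the real scheme may differ:

// Sketch of the split step: tag every URL with the id of each adjacency
// list it appears on, plus its role (s = source, t = target).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitGraph {

  public static class SplitMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text value, Context context)
        throws IOException, InterruptedException {
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      String listId = file + "-" + offset.get();
      String[] parts = value.toString().split("\t", 2);
      // the source URL of this adjacency list
      context.write(new Text(parts[0]), new Text(listId + ":s"));
      if (parts.length > 1) {
        // every target URL on the list
        for (String target : parts[1].split(" ")) {
          context.write(new Text(target), new Text(listId + ":t"));
        }
      }
    }
  }

  public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text url, Iterable<Text> entries, Context context)
        throws IOException, InterruptedException {
      // url -> (L1:s,L2:t,...)
      StringBuilder sb = new StringBuilder();
      for (Text e : entries) {
        if (sb.length() > 0) sb.append(',');
        sb.append(e.toString());
      }
      context.write(url, new Text(sb.toString()));
    }
  }
}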

In the next step I perform a join with the list of vertex IDs prepared before:

v1 -> 1         v1 -> (L1:s,L2:t)           1 -> (L1:s,L2:t)
v2 -> 2    +    v2 -> (L1:t,L2:s)   =>      2 -> (L1:t,L2:s)
v3 -> 3   map   v3 -> (L1:t)       reduce   3 -> (L1:t)
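
This is a plain reduce-side join: two tagging mappers (one per input, omitted here) emit the URL as key and either "ID" + tab + the integer id, or "LISTS" + tab + the (listId:role,...) string, and a reducer along these lines matches the two sides up:

// Sketch of the reduce-side join between url->id and url->lists.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text url, Iterable<Text> tagged, Context context)
      throws IOException, InterruptedException {
    String id = null;
    String lists = null;
    for (Text t : tagged) {
      String[] parts = t.toString().split("\t", 2);
      if ("ID".equals(parts[0])) {
        id = parts[1];
      } else {
        lists = parts[1];
      }
    }
    // every URL should end up with both an id and a list of entries
    if (id != null && lists != null) {
      context.write(new Text(id), new Text(lists));
    }
  }
}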

And finally, I invert this result to get the original webgraph back, but this time with vertices numbered with integer IDs (by the way, this operation is equivalent to calculating a minimal perfect hash and using it to renumber the graph):

1 -> (L1:s,L2:t)          L1:s -> 1          L1:(1:s,2:t,3:t,4:t) == 1 -> (2,3,4)
2 -> (L1:t,L2:s)   =>     L2:t -> 1    =>    L2:(2:s,1:t,5:t,6:t) == 2 -> (1,5,6)
3 -> (L1:t) ...    map    L1:t -> 2  reduce
                          L2:s -> 2
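
And a sketch of that last job, assuming the input records look like the output of the join above (vertex id, a tab, then the comma-separated listId:role entries):

// Sketch of the inversion: regroup by list id, pick out the source vertex,
// and emit the renumbered adjacency list.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertGraph {

  public static class InvertMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String vertexId = parts[0];
      for (String entry : parts[1].split(",")) {
        // entry is "listId:role", role is "s" or "t"
        String[] e = entry.split(":");
        context.write(new Text(e[0]), new Text(vertexId + ":" + e[1]));
      }
    }
  }

  public static class InvertReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text listId, Iterable<Text> entries, Context context)
        throws IOException, InterruptedException {
      String source = null;
      List<String> targets = new ArrayList<String>();
      for (Text e : entries) {
        String[] parts = e.toString().split(":");
        if ("s".equals(parts[1])) {
          source = parts[0];        // the vertex this list belongs to
        } else {
          targets.add(parts[0]);    // a target of that vertex
        }
      }
      if (source != null) {
        // emit the renumbered adjacency list, e.g. 1 -> [2, 3, 4]
        context.write(new Text(source), new Text(targets.toString()));
      }
    }
  }
}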


So, I'm curious: what are you guys using Hadoop for? Do you have interesting examples of how Hadoop solved a particular task that was difficult or impossible to do otherwise?

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

