Hi all,
Recently I've been looking around (using Google ;) ) to see what
applications of the map-reduce paradigm have been described in published
sources, and what classes of problems people have tried to solve with
map-reduce.
To my surprise, I found very few examples. Apart from the two well-known
Google papers (one describing MapReduce, the other Sawzall), there seems
to be very little information in this area.
Why is that? Is it because people don't know about it, because other
models for tackling out-of-core tasks are more popular, or because it's
simply not applicable to most problems out there? I'm not sure. From my
own experience I know that it's often not obvious (and sometimes
impossible?) how to decompose an existing algorithm so that it fits the
map-reduce paradigm.
Do people use Hadoop for tasks outside the well-known class of
web-related problems?
I will share two examples of how I use Hadoop - one is simple, the other
less so.
I'm using Hadoop to build co-occurrence vectors for phrases in a large
corpus of documents from a specific area. Phrases come from a
pre-defined vocabulary and consist of 1-10 words. The map-reduce
decomposition of the problem is straightforward: map() creates
co-occurrences for the current document (or actually, for each sentence),
and reduce() aggregates them to build a global co-occurrence table.
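A minimal sketch of this decomposition in the Hadoop Java API might look
like the following. It's illustrative only: it assumes each input line is
a sentence already reduced to a tab-separated list of vocabulary phrases,
so the phrase matching itself is left out.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Cooccurrence {

  // Sketch only: each input line is assumed to be a sentence already turned
  // into a tab-separated list of vocabulary phrases.
  public static class CoocMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text sentence, Context ctx)
        throws IOException, InterruptedException {
      String[] phrases = sentence.toString().split("\t");
      // Emit a count for every ordered pair of distinct phrases in the sentence.
      for (String a : phrases) {
        for (String b : phrases) {
          if (!a.equals(b)) {
            ctx.write(new Text(a + "|" + b), ONE);
          }
        }
      }
    }
  }

  public static class CoocReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      // One cell of the global co-occurrence table.
      ctx.write(pair, new IntWritable(sum));
    }
  }
}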
Another example: I'm working on an implementation of a minimal perfect
hash function (MPHF) for large key collections (currently testing with
100 million keys). Here map() partitions the input keys into buckets of
at most 256 keys each, using a universal hash function, and reduce() then
calculates an MPHF over each bucket.
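In Hadoop terms, the bucketing half of this looks roughly like the sketch
below. The bucket count and the use of Text's built-in hash are my
illustrative assumptions, and the actual MPHF construction inside
reduce() is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the bucketing step only; reduce() would compute a small MPHF
// over each bucket's keys.
public class MphfBucketMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  // Assumed parameter: chosen so that buckets stay at <= 256 keys with
  // high probability for the given key count.
  private static final int NUM_BUCKETS = 1 << 20;

  @Override
  protected void map(LongWritable offset, Text inputKey, Context ctx)
      throws IOException, InterruptedException {
    // Any well-mixed hash of the key bytes will do here.
    int bucket = (inputKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
    // All keys of a bucket meet in the same reduce() call.
    ctx.write(new IntWritable(bucket), inputKey);
  }
}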
Yet another example (ok, that's three, not two ;) ): web graph compression
using a "brute force" method. This one is somewhat more involved...
Let's assume that we have an existing representation of a webgraph in
the form of adjacency lists, where we have the mapping of sourceUrl ->
(targetUrl1, targetUrl2, ...). URLs are vertices in the graph, and links
are edges.
First, in a map-reduce job I collect all unique URLs - this is simple,
so I'll skip the explanation. Then I assign integer vertex IDs to them
sequentially (this is not a map-reduce job, just a sweep through a
MapFile containing all URLs).
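For illustration, the sweep itself could be as simple as the sketch below.
It assumes the URLs were stored as Text keys with NullWritable values
(my assumption here), and the actual writing of the resulting (url, id)
pairs is left as a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class AssignVertexIds {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: directory of the MapFile holding all unique URLs as keys.
    MapFile.Reader urls = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    NullWritable none = NullWritable.get();
    int id = 0;
    // Keys come back in sorted order; assign consecutive integer IDs.
    while (urls.next(url, none)) {
      // Placeholder: here the (url, id) pair would be written out,
      // e.g. to another MapFile, so the later join can use it.
      System.out.println(url + "\t" + id);
      id++;
    }
    urls.close();
  }
}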
In the next step I split the webgraph: each adjacency list gets a unique
identifier, and for every URL I record whether it appears as the source
or as a target of that list:
v1 -> (v2,v3,v4)   =>   v1 -> L1:s    =>    v1 -> (L1:s,L2:t)
v2 -> (v1,v5,v6)  map   v2 -> L1:t  reduce  v2 -> (L1:t,L2:s)
...                     v3 -> L1:t          v3 -> (L1:t)
                        v4 -> L1:t          ...
                        v2 -> L2:s
                        v1 -> L2:t
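A rough sketch of this job is below. The tab-separated input format and
the offset-based list identifier are just illustrative assumptions - any
scheme that yields a unique label per adjacency list would do.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitWebgraph {

  public static class SplitMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed input line: sourceUrl <tab> targetUrl1 <tab> targetUrl2 ...
      String[] urls = line.toString().split("\t");
      // Illustrative list identifier; assumes a single input file so that
      // byte offsets are unique per adjacency list.
      String listId = "L" + offset.get();
      ctx.write(new Text(urls[0]), new Text(listId + ":s"));
      for (int i = 1; i < urls.length; i++) {
        ctx.write(new Text(urls[i]), new Text(listId + ":t"));
      }
    }
  }

  public static class SplitReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> labels, Context ctx)
        throws IOException, InterruptedException {
      // Collect all listId:role labels this URL participates in.
      StringBuilder sb = new StringBuilder();
      for (Text label : labels) {
        if (sb.length() > 0) sb.append(',');
        sb.append(label.toString());
      }
      ctx.write(url, new Text(sb.toString()));
    }
  }
}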
In the next step I perform a join with the list of vertex IDs prepared
before:
v1 -> 1         v1 -> (L1:s,L2:t)           1 -> (L1:s,L2:t)
v2 -> 2    +    v2 -> (L1:t,L2:s)    =>     2 -> (L1:t,L2:s)
v3 -> 3   map   v3 -> (L1:t)        reduce  3 -> (L1:t)
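This is an ordinary reduce-side join: both inputs are mapped with the URL
as the key, and the reducer swaps the URL for its integer ID. A sketch of
the reducer follows; the "id#"/"lab#" prefixes are just an illustrative
convention for telling the two inputs apart.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: for each URL we see its integer ID from one input and
// its label list from the other, and emit (id, labels).
public class JoinVertexIds extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text url, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    String id = null;
    String labels = null;
    // Values were tagged by their respective mappers: "id#<n>" or "lab#<L1:s,...>".
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("id#")) {
        id = s.substring(3);
      } else if (s.startsWith("lab#")) {
        labels = s.substring(4);
      }
    }
    if (id != null && labels != null) {
      ctx.write(new Text(id), new Text(labels));
    }
  }
}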
And finally, I invert this result to get the original webgraph back, but
this time with vertices numbered with integer IDs (by the way, this
operation is equivalent to calculating a minimal perfect hash and using
it to renumber the graph):
1 -> (L1:s,L2:t)        L1:s -> 1           L1:(1:s,2:t,3:t,4:t) == 1 -> (2,3,4)
2 -> (L1:t,L2:s)   =>   L2:t -> 1    =>     L2:(2:s,1:t,5:t,6:t) == 2 -> (1,5,6)
3 -> (L1:t)       map   L1:t -> 2  reduce   ...
...                     L2:s -> 2
                        ...
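A sketch of that last job, assuming the previous step's output is read
back with Text keys and values (e.g. from a SequenceFile):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertWebgraph {

  public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text id, Text labels, Context ctx)
        throws IOException, InterruptedException {
      // labels look like "L1:s,L2:t,..." - route the vertex ID (with its role)
      // to every adjacency list it belongs to.
      for (String label : labels.toString().split(",")) {
        String[] parts = label.split(":");  // parts[0] = listId, parts[1] = role
        ctx.write(new Text(parts[0]), new Text(id + ":" + parts[1]));
      }
    }
  }

  public static class InvertReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text listId, Iterable<Text> members, Context ctx)
        throws IOException, InterruptedException {
      String source = null;
      List<String> targets = new ArrayList<String>();
      for (Text m : members) {
        String[] parts = m.toString().split(":");  // parts[0] = id, parts[1] = role
        if ("s".equals(parts[1])) {
          source = parts[0];
        } else {
          targets.add(parts[0]);
        }
      }
      // One adjacency list of the renumbered graph: sourceId -> (targetIds).
      if (source != null) {
        ctx.write(new Text(source), new Text(targets.toString()));
      }
    }
  }
}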
So, I'm curious: what are you guys using Hadoop for? Do you have some
interesting examples of how Hadoop solves a particular task that was
difficult or impossible to do otherwise?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com