Hi all,
Recently I've been looking around (using Google ;) ) to see what
applications of the map-reduce paradigm have been described in published
sources, and what classes of problems people have tried to solve with
map-reduce.
To my surprise, I found very few examples. Apart from the two well-known
Google papers (one describing MapReduce, the other Sawzall), there seems
to be very little information in this area.
Why is that? Is it because people don't know about it, because other
models for tackling out-of-core tasks are more popular, or because it's
simply not applicable to most problems out there? I'm not sure. From my
own experience I know that it's often not obvious (and sometimes
impossible?) how to decompose an existing algorithm so that it fits the
map-reduce paradigm.
Do people use Hadoop for tasks outside the well-known class of
web-related problems?
I will share two examples of how I use Hadoop - one is simple, the other
less so.
I'm using Hadoop to build co-occurrence vectors for phrases in a large
corpus of documents from a specific area. Phrases come from a
pre-defined vocabulary and consist of 1-10 words. The map-reduce
decomposition of the problem is straightforward: map() creates
co-occurrences for the current document (or actually, for each sentence),
and reduce() aggregates them to build a global co-occurrence table.
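A minimal sketch of this decomposition in the Hadoop Java API might look
like the following. It's illustrative only: it assumes each input line is
a sentence already reduced to a tab-separated list of vocabulary phrases,
so the phrase matching itself is left out.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Cooccurrence {

  // Sketch only: each input line is assumed to be a sentence already turned
  // into a tab-separated list of vocabulary phrases.
  public static class CoocMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text sentence, Context ctx)
        throws IOException, InterruptedException {
      String[] phrases = sentence.toString().split("\t");
      // Emit a count for every ordered pair of distinct phrases in the sentence.
      for (String a : phrases) {
        for (String b : phrases) {
          if (!a.equals(b)) {
            ctx.write(new Text(a + "|" + b), ONE);
          }
        }
      }
    }
  }

  public static class CoocReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      // One cell of the global co-occurrence table.
      ctx.write(pair, new IntWritable(sum));
    }
  }
}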
Another example: I'm working on an implementation of a minimal perfect
hash function (MPHF) for large key collections (currently testing with
100 million keys). Here map() partitions the input keys into buckets of
at most 256 keys each, using a universal hash function, and reduce() then
calculates an MPHF over each bucket.
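In Hadoop terms, the bucketing half of this looks roughly like the sketch
below. The bucket count and the use of Text's built-in hash are my
illustrative assumptions, and the actual MPHF construction inside
reduce() is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the bucketing step only; reduce() would compute a small MPHF
// over each bucket's keys.
public class MphfBucketMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  // Assumed parameter: chosen so that buckets stay at <= 256 keys with
  // high probability for the given key count.
  private static final int NUM_BUCKETS = 1 << 20;

  @Override
  protected void map(LongWritable offset, Text inputKey, Context ctx)
      throws IOException, InterruptedException {
    // Any well-mixed hash of the key bytes will do here.
    int bucket = (inputKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
    // All keys of a bucket meet in the same reduce() call.
    ctx.write(new IntWritable(bucket), inputKey);
  }
}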
Yet another example (ok, that's three, not two ;) ): web graph compression
using a "brute force" method. This one is somewhat more involved...
Let's assume that we have an existing representation of a webgraph in
the form of adjacency lists, where we have the mapping of sourceUrl ->
(targetUrl1, targetUrl2, ...). URLs are vertices in the graph, and links
are edges.
First, in a map-reduce job I collect all unique URLs - this is simple,
so I'll skip the explanation. Then I assign integer vertex IDs to them
sequentially (this is not a map-reduce job, just a sweep through a
MapFile containing all URLs).
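For illustration, the sweep itself could be as simple as the sketch below.
It assumes the URLs were stored as Text keys with NullWritable values
(my assumption here), and the actual writing of the resulting (url, id)
pairs is left as a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class AssignVertexIds {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: directory of the MapFile holding all unique URLs as keys.
    MapFile.Reader urls = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    NullWritable none = NullWritable.get();
    int id = 0;
    // Keys come back in sorted order; assign consecutive integer IDs.
    while (urls.next(url, none)) {
      // Placeholder: here the (url, id) pair would be written out,
      // e.g. to another MapFile, so the later join can use it.
      System.out.println(url + "\t" + id);
      id++;
    }
    urls.close();
  }
}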
In the next step I split the webgraph: each adjacency list gets a unique
identifier, and for every URL I record whether it appears as the source
or as a target of that list:
v1 -> (v2,v3,v4)   =>   v1 -> L1:s    =>    v1 -> (L1:s,L2:t)
v2 -> (v1,v5,v6)  map   v2 -> L1:t  reduce  v2 -> (L1:t,L2:s)
...                     v3 -> L1:t          v3 -> (L1:t)
                        v4 -> L1:t          ...
                        v2 -> L2:s
                        v1 -> L2:t
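A rough sketch of this job is below. The tab-separated input format and
the offset-based list identifier are just illustrative assumptions - any
scheme that yields a unique label per adjacency list would do.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitWebgraph {

  public static class SplitMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed input line: sourceUrl <tab> targetUrl1 <tab> targetUrl2 ...
      String[] urls = line.toString().split("\t");
      // Illustrative list identifier; assumes a single input file so that
      // byte offsets are unique per adjacency list.
      String listId = "L" + offset.get();
      ctx.write(new Text(urls[0]), new Text(listId + ":s"));
      for (int i = 1; i < urls.length; i++) {
        ctx.write(new Text(urls[i]), new Text(listId + ":t"));
      }
    }
  }

  public static class SplitReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> labels, Context ctx)
        throws IOException, InterruptedException {
      // Collect all listId:role labels this URL participates in.
      StringBuilder sb = new StringBuilder();
      for (Text label : labels) {
        if (sb.length() > 0) sb.append(',');
        sb.append(label.toString());
      }
      ctx.write(url, new Text(sb.toString()));
    }
  }
}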
In the next step I perform a join with the list of vertex IDs prepared
before:
v1 -> 1         v1 -> (L1:s,L2:t)           1 -> (L1:s,L2:t)
v2 -> 2    +    v2 -> (L1:t,L2:s)    =>     2 -> (L1:t,L2:s)
v3 -> 3   map   v3 -> (L1:t)        reduce  3 -> (L1:t)
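This is an ordinary reduce-side join: both inputs are mapped with the URL
as the key, and the reducer swaps the URL for its integer ID. A sketch of
the reducer follows; the "id#"/"lab#" prefixes are just an illustrative
convention for telling the two inputs apart.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: for each URL we see its integer ID from one input and
// its label list from the other, and emit (id, labels).
public class JoinVertexIds extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text url, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    String id = null;
    String labels = null;
    // Values were tagged by their respective mappers: "id#<n>" or "lab#<L1:s,...>".
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("id#")) {
        id = s.substring(3);
      } else if (s.startsWith("lab#")) {
        labels = s.substring(4);
      }
    }
    if (id != null && labels != null) {
      ctx.write(new Text(id), new Text(labels));
    }
  }
}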
And finally, I invert this result to get the original webgraph back, but
this time with vertices numbered with integer IDs (by the way, this
operation is equivalent to calculating a minimal perfect hash and using
it to renumber the graph):
1 -> (L1:s,L2:t)        L1:s -> 1           L1:(1:s,2:t,3:t,4:t) == 1 -> (2,3,4)
2 -> (L1:t,L2:s)   =>   L2:t -> 1    =>     L2:(2:s,1:t,5:t,6:t) == 2 -> (1,5,6)
3 -> (L1:t)       map   L1:t -> 2  reduce   ...
...                     L2:s -> 2
                        ...
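A sketch of that last job, assuming the previous step's output is read
back with Text keys and values (e.g. from a SequenceFile):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertWebgraph {

  public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text id, Text labels, Context ctx)
        throws IOException, InterruptedException {
      // labels look like "L1:s,L2:t,..." - route the vertex ID (with its role)
      // to every adjacency list it belongs to.
      for (String label : labels.toString().split(",")) {
        String[] parts = label.split(":");  // parts[0] = listId, parts[1] = role
        ctx.write(new Text(parts[0]), new Text(id + ":" + parts[1]));
      }
    }
  }

  public static class InvertReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text listId, Iterable<Text> members, Context ctx)
        throws IOException, InterruptedException {
      String source = null;
      List<String> targets = new ArrayList<String>();
      for (Text m : members) {
        String[] parts = m.toString().split(":");  // parts[0] = id, parts[1] = role
        if ("s".equals(parts[1])) {
          source = parts[0];
        } else {
          targets.add(parts[0]);
        }
      }
      // One adjacency list of the renumbered graph: sourceId -> (targetIds).
      if (source != null) {
        ctx.write(new Text(source), new Text(targets.toString()));
      }
    }
  }
}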
So, I'm curious: what are you guys using Hadoop for? Do you have some
interesting examples of how Hadoop solves a particular task that was
difficult or impossible to do otherwise?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com