I understand that one of the ground rules of MapReduce is that there's no
shared global state; nevertheless, I have a problem that requires shared
global information, and I'm trying to get a sense of what mechanisms are
available to address it.

I have a bunch of *sets* built on a vocabulary of *elements*.  In the
following, for example:

X = {a, b, c}
Y = {b, c}
Z = {a, c, d}

X, Y, and Z are the sets and a, b, c, and d are the elements. (Maybe the
sets are web pages and the elements are keywords; it doesn't matter.)

Given all the sets, I need to build a reverse lookup table from elements to
sets:

a = {X, Z}
b = {X, Y}
c = {X, Y, Z}
d = {Z}

Then, given a set, I need to emit a proper subset of it whose composition is
a function of the values of its elements in the lookup table. So given Y =
{b, c}, I need to look up b = {X, Y} and c = {X, Y, Z} in order to generate
the output I need.
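
To make that concrete, here is roughly what step two would look like on a
single machine, if everything fit in memory; properSubset() and keep() are
just placeholder names standing in for my real logic:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Single-machine version of step two, assuming everything fits in memory.
public class SubsetSketch {

    // Emits the subset of `elements` that passes the per-element test.
    static Set<String> properSubset(String setId,
                                    Set<String> elements,
                                    Map<String, Set<String>> lookup) {
        Set<String> out = new HashSet<String>();
        for (String e : elements) {
            Set<String> containingSets = lookup.get(e); // e.g. b -> {X, Y}
            if (keep(setId, e, containingSets)) {
                out.add(e);
            }
        }
        return out;
    }

    // Stand-in predicate; the real one is a function of the lookup values.
    static boolean keep(String setId, String element,
                        Set<String> containingSets) {
        return containingSets.size() > 1; // purely illustrative
    }
}

The whole difficulty is that `lookup` is far too big to hand to every worker.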

I have a large number (on the order of 10^9) of both sets and elements, so I
need to use a scalable system like MapReduce.

Building the lookup table from the sets is easy--that's exactly what
MapReduce is designed to do (sketch below). However, I'm not sure how to
make use of this table in a subsequent MapReduce job, because it is shared
global information.
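
For completeness, the first step is just the classic inverted-index job. A
minimal sketch with the org.apache.hadoop.mapreduce API, assuming one set
per input line in the form "X a b c" (class names are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map: for an input line "X a b c", emit (a, X), (b, X), (c, X).
    public static class InvertMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split("\\s+");
            Text setId = new Text(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                ctx.write(new Text(tokens[i]), setId);
            }
        }
    }

    // Reduce: gather all set ids per element, e.g. b -> "X Y".
    public static class InvertReducer
            extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text element, Iterable<Text> setIds, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text id : setIds) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(id.toString());
            }
            ctx.write(element, new Text(sb.toString()));
        }
    }
}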

Are there general approaches to solving this problem? I've been trying to
come up with some clever secondary-sort-style key arithmetic, but I don't
think that's the answer. Should I just build a BDB database and pack it into
the distributed cache? (I'm reluctant to, because that database is going to
be really big; it seems like it should live in a single location on HDFS.)
Are the higher-level tools like Hive or Cascading the right solution here?
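
For what it's worth, the kind of key arithmetic I've been sketching is a
reduce-side join keyed by element: re-emit each set as (element, setId)
pairs, join those against the lookup-table rows, then regroup by set in a
second job. Roughly, assuming upstream mappers tag each value with its
source ("L" for a lookup row, "S" for a membership record):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Joins lookup rows with membership records on the element key.
// Values arrive tagged: "L<tab>X Y" (the lookup row for this element)
// or "S<tab>Y" (this element appears in set Y).
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text element, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String lookupRow = null;
        List<String> memberOf = new ArrayList<String>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("L\t")) {
                lookupRow = s.substring(2);   // e.g. "X Y" for element b
            } else {
                memberOf.add(s.substring(2)); // strip the "S<tab>" tag
            }
        }
        // Re-key by set, carrying the lookup value along; a second job
        // grouping on the set id can then compute each proper subset.
        for (String setId : memberOf) {
            ctx.write(new Text(setId), new Text(element + "\t" + lookupRow));
        }
    }
}

A secondary sort could force the L record to arrive first so nothing has to
be buffered, but the lookup row still gets replicated once per containing
set, which is part of why I don't think this is the answer.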
