I understand that one of the ground rules of MapReduce is that there is no shared global state; nevertheless I have a problem that requires shared global information, and I'm trying to get a sense of what mechanisms are available to address it.
I have a bunch of *sets* built on a vocabulary of *elements*. For example:

    X {a, b, c}
    Y {b, c}
    Z {a, c, d}

Here X, Y, and Z are the sets and a, b, c, and d are the elements. (Maybe the sets are web pages and the elements are keywords; it doesn't matter.) Given all the sets, I need to build a reverse lookup table from elements to sets:

    a {X, Z}
    b {X, Y}
    c {X, Y, Z}
    d {Z}

Then, given a set, I need to emit a proper subset of it whose composition is a function of its elements' values in the lookup table. So given Y = {b, c}, I need to look up b = {X, Y} and c = {X, Y, Z} in order to generate the output I need.

I have a large number (on the order of 10^9) of both sets and elements, so I need to use a scalable system like MapReduce. Building the lookup table from the sets is easy--that's what MapReduce is designed to do. However, I'm not sure how to make use of this table in a subsequent MapReduce process, because it is shared global information. Are there general approaches to solving this problem? I've been trying to come up with some clever secondary-sort style key arithmetic, but I don't think that's the answer. Should I just build a BDB database and pack it into the distributed cache? (I'm reluctant to, because that database is going to be really big. It seems like it should live in a single location on HDFS.) Are higher-level tools like Hive or Cascading the right solution here?
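To make the first (easy) step concrete, here is a toy sketch of the inverted-index job in Python, using the example sets above. The in-memory grouping stands in for Hadoop's shuffle/sort phase, and all names here are illustrative, not real Hadoop API:

    from collections import defaultdict

    # The example input: set name -> elements.
    sets = {
        "X": {"a", "b", "c"},
        "Y": {"b", "c"},
        "Z": {"a", "c", "d"},
    }

    def mapper(set_name, elements):
        """Emit one (element, set_name) pair per element of the set."""
        for element in elements:
            yield element, set_name

    def reducer(element, set_names):
        """Collect all sets that contain this element."""
        return element, set(set_names)

    # Simulate the shuffle/sort phase: group mapper output by key.
    grouped = defaultdict(list)
    for set_name, elements in sets.items():
        for key, value in mapper(set_name, elements):
            grouped[key].append(value)

    # The reduce phase yields the reverse lookup table.
    lookup = dict(reducer(k, v) for k, v in grouped.items())
    print(sorted(lookup["a"]))  # ['X', 'Z']

My question is about what comes after this: the second job needs `lookup` in its entirety, which is exactly the shared global state MapReduce disallows.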