Thanks for your answer. After mulling over this problem for a few days, I believe there might be a clearer way for me to phrase the question, so let me try that before diving into the specifics of the linear algebra analysis you gave.
I need to share an inverted index of elements to sets as described above. And crucially this index is *immutable*: after it has been created it only has to be read from, never written to. So a clearer way to phrase this question is: how do I share a large read-only inverted index among multiple MapReduce jobs?

I can think of two approaches.

1. Treat it as a database JOIN on elements operation between the original table of sets and the inverted index. This is the tack that Ted was suggesting in his response.
2. Put the inverted index into a MapFile<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html>. The individual jobs load the inverted index at setup() time and do random access reads from it as needed.

A few questions:

1. Do others agree that these are the two big classes of solution?
2. Do people have a sense of what the pros and cons of each might be? (BTW quadratic runtime in the density of the set membership rows is probably not a problem; the sets I am dealing with are small relative to the vocabulary size and relatively disjoint.)
3. Is Pig or Hive a good tool to use for solution (1)? (I have a feeling the answer might be a 10-line Pig script, but I don't have enough SQL experience to just knock one out.)
4. For solution (2), will MapFile scale to a map with 10^9 entries? (Assuming I use the io.map.index.skip property to make the right search-speed/memory tradeoff for my configuration.)
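
To make approach (1) concrete, here is a rough sketch of the join I have in mind, written as plain in-memory Python rather than actual Hadoop, Pig, or Hive code. The table contents, the function names (build_inverted_index, join_on_element), and the output row layout are all made up for illustration; a real job would stream these records through map and reduce tasks instead of holding them in dictionaries.

```python
from collections import defaultdict


def build_inverted_index(sets_table):
    """Build element -> set-of-set-ids from set-id -> elements."""
    index = defaultdict(set)
    for set_id, elements in sets_table.items():
        for element in elements:
            index[element].add(set_id)
    return dict(index)


def join_on_element(sets_table, inverted_index):
    """Simulate a reduce-side join keyed on element.

    Returns rows of (set_id, element, sorted ids of all sets
    containing that element).
    """
    # "Map" phase: group records from both inputs by the join key.
    grouped = defaultdict(lambda: {"sets": [], "postings": set()})
    for set_id, elements in sets_table.items():
        for element in elements:
            grouped[element]["sets"].append(set_id)
    for element, posting in inverted_index.items():
        grouped[element]["postings"] |= posting

    # "Reduce" phase: for each key, pair every set row with the postings.
    rows = []
    for element in sorted(grouped):
        group = grouped[element]
        for set_id in sorted(group["sets"]):
            rows.append((set_id, element, sorted(group["postings"])))
    return rows


# Hypothetical toy data: two small, overlapping sets.
sets_table = {"A": {"x", "y"}, "B": {"y", "z"}}
inverted_index = build_inverted_index(sets_table)
rows = join_on_element(sets_table, inverted_index)

for row in rows:
    print(row)
```

Each output row pairs a set with one of its elements and every set that shares that element, which is the information a job would otherwise fetch by a random-access read in approach (2).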