Thanks for your answer. After mulling over this problem for a few days, I
believe there is a clearer way for me to phrase the question, so let me
try that before diving into the specifics of the linear algebra analysis you
gave.

I need to share an inverted index from elements to the sets that contain
them, as described above. Crucially, this index is *immutable*: once it has
been created it is only read from, never written to. So a clearer way to
phrase the question is: how do I share a large read-only inverted index
among multiple MapReduce jobs?

I can think of two approaches.

   1. Treat it as a database JOIN, keyed on element, between the original
   table of sets and the inverted index. This is the tack that Ted was
   suggesting in his response.
   2. Put the inverted index into a MapFile
   <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html>.
   The individual jobs load the inverted index at setup() time and do random
   access reads from it as needed.
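
To make approach (1) concrete, here is a minimal in-memory sketch of the join
in Python. It is only an illustration of the data flow, not MapReduce code:
the names build_inverted_index and join_on_element and the toy data are
hypothetical, and a real job would do the grouping in the shuffle rather than
in a dict.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map each element to the sorted list of set ids that contain it."""
    index = defaultdict(list)
    for set_id in sorted(sets):
        for element in sets[set_id]:
            index[element].append(set_id)
    return dict(index)


def join_on_element(sets, index):
    """Yield (set_id, element, other_set_id) for each pair of distinct sets
    that share an element. This is the JOIN keyed on element."""
    for set_id in sorted(sets):
        for element in sorted(sets[set_id]):
            for other_id in index.get(element, []):
                if other_id != set_id:
                    yield (set_id, element, other_id)


# Toy data (hypothetical): three small sets over a four-word vocabulary.
sets = {"A": {"x", "y"}, "B": {"y", "z"}, "C": {"w"}}
index = build_inverted_index(sets)
rows = list(join_on_element(sets, index))
```

An element that appears in k sets contributes k(k-1) rows to the output, which
is where the quadratic cost of this approach comes from.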

A few questions:

   1. Do others agree that these are the two big classes of solution?
   2. Do people have a sense of what the pros and cons of each might be?
   (BTW quadratic runtime in the density of the set membership rows is probably
   not a problem; the sets I am dealing with are small relative to the
   vocabulary size and relatively disjoint.)
   3. Is Pig or Hive a good tool to use for solution (1)? (I have a feeling
   the answer might be a 10-line Pig script, but I don't have enough SQL
   experience to just knock one out.)
   4. For solution (2), will MapFile scale to a map with 10^9 entries?
   (Assuming I use the io.map.index.skip property to make the right
   search-speed/memory tradeoff for my configuration.)
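
To make the tradeoff in question 4 concrete, here is a rough Python sketch of
how I understand a MapFile-style lookup to work: a sparse sample of the sorted
keys is held in memory, a binary search picks the nearest sampled key, and the
reader scans forward from there. This is a toy model, not Hadoop code. The
class SparseIndexReader and the function in_memory_index_entries are
hypothetical, and the figures assume an index interval of 128 (which I believe
is the io.map.index.interval default) and that a skip of s keeps every
(s+1)-th index entry.

```python
import bisect


def in_memory_index_entries(num_entries, index_interval=128, index_skip=0):
    """Estimate how many index keys a reader holds in memory.

    Assumes one index entry per index_interval records on disk, of which
    every (index_skip + 1)-th is kept when the index is read.
    """
    on_disk = -(-num_entries // index_interval)   # ceiling division
    return -(-on_disk // (index_skip + 1))


class SparseIndexReader:
    """Toy stand-in for a sorted on-disk map with a sparse in-memory index."""

    def __init__(self, sorted_items, stride):
        self.items = sorted_items      # stands in for the sorted data file
        self.stride = stride           # records between sampled keys
        self.sample_keys = [sorted_items[i][0]
                            for i in range(0, len(sorted_items), stride)]

    def get(self, key):
        # Binary search for the last sampled key <= key.
        slot = bisect.bisect_right(self.sample_keys, key) - 1
        if slot < 0:
            return None
        start = slot * self.stride
        # Linear scan of at most `stride` records, as a seek-and-scan would do.
        for k, v in self.items[start:start + self.stride]:
            if k == key:
                return v
            if k > key:
                break
        return None


reader = SparseIndexReader([(i, i * i) for i in range(1000)], stride=16)
default_entries = in_memory_index_entries(10**9)
skipped_entries = in_memory_index_entries(10**9, index_skip=7)
```

Under those assumptions, 10^9 entries would mean about 7.8 million index keys
in memory per reader by default, or just under a million with a skip of 7, at
the price of scanning up to 8 times as many records per lookup.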
