Re: Using global reverse lookup tables

W.P. McNeill Fri, 15 Apr 2011 11:46:09 -0700

Thanks for your answer. After mulling over this problem for a few days, I
believe there is a clearer way for me to phrase the question, so let me
try that before diving into the specifics of the linear algebra analysis you
gave.

I need to share an inverted index from elements to sets, as described above.
Crucially, this index is *immutable*: after it has been created it only
has to be read from, never written to. So a clearer way to phrase this
question is: how do I share a large read-only inverted index among multiple
MapReduce jobs?

I can think of two approaches.

1. Treat it as a database JOIN, keyed on element, between the original
table of sets and the inverted index. This is the tack that Ted was
suggesting in his response.
2. Put the inverted index into a
MapFile<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html>.
The individual jobs load the inverted index at setup() time and do random
access reads from it as needed.
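
To make the two approaches concrete, here is a minimal in-memory sketch in
plain Python (no Hadoop involved). The toy data, the function names, and the
task itself (for each set, find the other sets that share an element) are
assumptions made for illustration only; the actual downstream computation is
not specified in this message. The dict passed to neighbors_by_lookup stands
in for a MapFile reader opened in setup().

```python
from collections import defaultdict

# Hypothetical toy data: set id -> elements.
SETS = {
    "s1": {"a", "b"},
    "s2": {"b", "c"},
    "s3": {"d"},
}


def build_inverted_index(sets):
    """Return element -> list of ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id in sorted(sets):
        for element in sets[set_id]:
            index[element].append(set_id)
    return dict(index)


def neighbors_by_join(sets, index):
    """Approach 1: join the table of sets with the index on the element key."""
    # "Map" phase: key every record by element and tag it with its source.
    grouped = defaultdict(lambda: {"sets": [], "index": []})
    for set_id, elements in sets.items():
        for element in elements:
            grouped[element]["sets"].append(set_id)
    for element, set_ids in index.items():
        grouped[element]["index"].extend(set_ids)
    # "Reduce" phase: for each element, pair up the two sides of the join.
    result = defaultdict(set)
    for sides in grouped.values():
        for set_id in sides["sets"]:
            result[set_id].update(s for s in sides["index"] if s != set_id)
    return {set_id: result[set_id] for set_id in sets}


def neighbors_by_lookup(sets, index):
    """Approach 2: each task holds the index and does random-access reads."""
    result = {}
    for set_id, elements in sets.items():
        found = set()
        for element in elements:
            found.update(index.get(element, []))
        found.discard(set_id)
        result[set_id] = found
    return result
```

On this toy input both functions return the same mapping. The difference in
a real job would be where the work happens: the join shuffles both tables by
element, while the lookup version reads the index on demand from each task.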

A few questions:

1. Do others agree that these are the two big classes of solution?
2. Do people have a sense of what the pros and cons of each might be?
(BTW quadratic runtime in the density of the set membership rows is probably
not a problem; the sets I am dealing with are small relative to the
vocabulary size and relatively disjoint.)
3. Is Pig or Hive a good tool to use for solution (1)? (I have a feeling
the answer might be a 10-line Pig script, but I don't have enough SQL
experience to just knock one out.)
4. For solution (2), will MapFile scale to a map with 10^9 entries?
(Assuming I use the io.map.index.skip property to make the right
search-speed/memory tradeoff for my configuration.)
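
The search-speed/memory tradeoff mentioned in question 4 can be sketched with
a simplified sparse index over sorted records. This is not MapFile's
implementation, only the general idea it relies on: keep every Nth key in
memory, binary-search those keys, then scan forward in the data. The class
and function names below are hypothetical, and the single `interval`
parameter is an assumption that lumps together the effect of the index
interval chosen at write time and of io.map.index.skip at read time.

```python
import bisect


class SparseIndexedFile:
    """Sorted key/value records plus an in-memory index of every Nth key."""

    def __init__(self, records, interval):
        # Stand-in for the sorted on-disk data file; keys are assumed unique.
        self.records = sorted(records)
        self.interval = interval
        self.index_keys = [key for key, _ in self.records[::interval]]

    def get(self, key):
        # Binary search the sparse index for the last indexed key <= key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        start = slot * self.interval
        # Scan at most `interval` records, like a seek plus a sequential read.
        for k, v in self.records[start:start + self.interval]:
            if k == key:
                return v
            if k > key:
                break
        return None


def index_entries(total_records, interval):
    """Number of keys held in memory for a given sampling interval."""
    return -(-total_records // interval)  # ceiling division
```

A larger interval shrinks the in-memory index but lengthens the scan per
lookup. For 10^9 records, index_entries(10**9, 128) is about 7.8 million
keys, and an interval of 1024 brings that under one million at the cost of
scanning up to 1024 records per read.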
