Well, there is considerable redundancy in the list of words, which could
allow massive compression.  This is roughly what Lucene is doing.
 Storing each string as its own object incurs substantial overhead that
dwarfs even the original size of the strings (overhead is about 50 bytes per
string; the average word length in running English text is about 6
characters, and a dictionary should average somewhat longer).  Since we are
using UTF-16 internally to store what are essentially ASCII strings for
Wikipedia, we have more than 10:1 overhead here.
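
For a rough back-of-envelope (the 50 byte figure is approximate and JVM dependent):

  raw ASCII word:                         ~6 bytes
  same word as a Java String (UTF-16):    ~12 bytes of character data
  object and hash-entry overhead:         ~50 bytes
  total per in-memory dictionary entry:   ~60+ bytes, i.e. roughly 10x the raw text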

The classic compression for string collections is to sort and use prefix
compression.  Accelerated ordered-list searching (e.g. interpolation search)
works in log log N time to search such a list, or a separate array of
offsets can allow direct addressing of the strings.

One particular trick to help with such an arrangement is to store, with each
entry, an offset back to the entry where the full prefix can be found.
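
As a rough illustration (a sketch only, not Mahout code; the class and method
names here are invented), a front-coded block can keep an offset array for
direct addressing while each entry records how many leading characters it
shares with its predecessor:

  // Sketch: encode a *sorted* term list as (sharedPrefixLen, suffix) pairs.
  import java.io.ByteArrayOutputStream;
  import java.nio.charset.Charset;
  import java.util.List;

  class FrontCodedDictionary {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // offsets[i] receives the byte position of entry i, so entries can be
    // addressed directly without scanning from the start of the block.
    static byte[] encode(List<String> sortedTerms, int[] offsets) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      String prev = "";
      for (int i = 0; i < sortedTerms.size(); i++) {
        String term = sortedTerms.get(i);
        int shared = sharedPrefixLength(prev, term);
        offsets[i] = out.size();
        out.write(shared);                        // assumes prefix lengths < 128
        byte[] suffix = term.substring(shared).getBytes(UTF8);
        out.write(suffix.length);                 // assumes suffix lengths < 128
        out.write(suffix, 0, suffix.length);
        prev = term;
      }
      return out.toByteArray();
    }

    static int sharedPrefixLength(String a, String b) {
      int n = Math.min(a.length(), b.length());
      int i = 0;
      while (i < n && a.charAt(i) == b.charAt(i)) {
        i++;
      }
      return i;
    }
  }

In practice you would also store a full (uncompressed) entry every k terms, or
use the back-reference trick just described, so that decoding entry i does not
require walking back through the whole block.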

The Lucene compression is a bit stronger than simple prefix elimination in
that it constructs a DFA that recognizes words from the dictionary.  This
gives you both compression and the function of the hash table rolled into
one system.  By the time you have walked through the bytes of a word, you
have also looked it up without ever needing to construct the string
containing the word.
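
To make that concrete, building and probing such a structure with Lucene 4's
FST classes looks roughly like the sketch below.  This is from memory, so
treat it as an approximation; exact signatures (for instance whether
PositiveIntOutputs.getSingleton takes a boolean argument) vary across 4.x
releases:

  // Sketch only -- assumes Lucene 4.x (org.apache.lucene.util.fst.*).
  import java.io.IOException;
  import java.util.List;
  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRef;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.PositiveIntOutputs;
  import org.apache.lucene.util.fst.Util;

  class FstDictionarySketch {
    static FST<Long> build(List<String> sortedTerms) throws IOException {
      PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
      Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
      IntsRef scratch = new IntsRef();
      long termId = 0;
      for (String term : sortedTerms) {     // terms must be added in sorted order
        builder.add(Util.toIntsRef(new BytesRef(term), scratch), termId++);
      }
      return builder.finish();
    }

    // Lookup walks the automaton byte by byte; no String for the term is built.
    static Long lookup(FST<Long> fst, String term) throws IOException {
      return Util.get(fst, new BytesRef(term));   // null if the term is absent
    }
  }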

My suggestion is that we have a show of hands to see who cares (and how
much) about keeping Lucene 3 compatibility.  If no, or very few, hands go
up, moving to a modern representation is probably a good idea.

On Wed, Nov 7, 2012 at 7:58 AM, Sean Owen <[email protected]> wrote:

> This just means "out of memory": the dictionary is too big. It's nothing in
> particular to do with the number, size, or rate of objects allocated. I
> don't know if a different implementation is going to be appreciably
> different in terms of size -- these are already primitive-based specialized
> implementations.
>
> My dumb question is, is this not just a consequence of the implementation
> trying to store an unbounded number (well, up to trillions) of entries in a
> map in memory?
>
> Beyond this I don't know anything about the implementation.
>
>
> On Wed, Nov 7, 2012 at 3:17 PM, Grant Ingersoll <[email protected]>
> wrote:
>
> > Hi,
> >
> > We're hitting OOMs while running vectorization during dictionary loading
> > in TFPartialVectorReducer.  We have the dictionary chunk size set to
> > 100 (the minimum) and have about 11M items in the dictionary (bigrams are
> > on) and our heap size is set to 12 GB.   We haven't debugged deeply yet,
> > but the OOM routinely occurs in the rehash method:
> > 2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
> >         at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
> >         at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
> >         at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
> >         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
> >         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:416)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> > I'm also guessing (haven't turned on GC logs yet) that the GC simply
> > can't keep up w/ the allocations, but perhaps there is also a bug
> > somewhere in the dictionary code and the dictionary is corrupt and it's
> > misreading the size.  I can share the dictionary privately if anyone
> > wants to look at it, but I can't share it publicly.
> >
> > Has anyone else seen this?  Is my understanding correct?
> >
> > I can see a couple of remedies:
> > 1.  Pass in an initial capacity and see if we can better control the size
> > of the allocation
> > 2. Switch to Lucene's FST for dictionaries:  The tradeoff would be a much
> > smaller dictionary (10 GB of wikipedia in Lucene is roughly a 250K
> > dictionary size) and very little deserialization (the dictionary is all
> > byte arrays) at the cost of lookups in a given mapper.  However, that
> > latter cost would likely be more than made up for by the fact that in
> > most situations, one would only need 1 dictionary chunk, thereby
> > eliminating several MapReduce iterations.  The other downside/upside is
> > that we would need to go to Lucene 4.
> >
> > Thoughts?
> >
> > -Grant
>
