On Nov 7, 2012, at 11:50 AM, Sean Owen wrote:

> It's a trie?
It's quite cool what Lucene does: linear-time build (if your terms are sorted) and a bunch of other stuff:

http://www.slideshare.net/LucidImagination/weiss-dawid-finite-state-automata-in-lucene
http://blog.mikemccandless.com/2012/05/finite-state-automata-in-lucene.html

> Yeah, that could be a big win. It gets tricky with Unicode, but I
> imagine there is a lot of gain even so.
> "Bigrams over 11M terms" jumped out too as a place to start.
> (I don't see any particular backwards compatibility issue with Lucene 3 to
> even worry about.)

Yeah, we are pruning more aggressively, but something still isn't right: with 12 GB of heap and a dictionary chunk of 100 MB, you'd think it would fit. All I can think is that there is some fairly rapid allocation pattern where the GC can't keep up with the rehash allocation (though that doesn't quite fit either, because you'd think the GC would just fall back to a full collection). I'll see if we can debug further. There could also be a bug in the dictionary writing where it sets the size of something incorrectly, or the dictionary is corrupted for some reason, and so we get a large overallocation.

-Grant
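[The linear-time build mentioned above works because sorted input means each term shares its longest common prefix with the previous term, so only the new suffix allocates nodes. A minimal plain-Java trie sketch of that idea — this is an illustration, not Lucene's FST API, and unlike a real FST it does not also share suffixes via minimization:]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: inserting sorted terms only allocates nodes for the
// unshared suffix of each term, which is why a single sorted pass is linear.
public class TrieSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    final Node root = new Node();
    int nodeCount = 1; // count the root

    // Walk the shared prefix, then allocate nodes only for the new suffix.
    void add(String term) {
        Node cur = root;
        for (char c : term.toCharArray()) {
            Node next = cur.children.get(c);
            if (next == null) {
                next = new Node();
                cur.children.put(c, next);
                nodeCount++;
            }
            cur = next;
        }
        cur.terminal = true;
    }

    public static void main(String[] args) {
        TrieSketch t = new TrieSketch();
        String[] sorted = {"car", "card", "care", "cat"};
        int totalChars = 0;
        for (String s : sorted) { t.add(s); totalChars += s.length(); }
        // Prefix sharing: 7 nodes cover 14 characters of input.
        System.out.println("nodes=" + t.nodeCount + " chars=" + totalChars);
    }
}
```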

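[A back-of-envelope sketch of the rehash-allocation spike hypothesized above: during a hash-table rehash, the old and new tables are both live until the copy finishes, so with doubling growth the transient footprint briefly reaches about 3x the old table. This is a hypothetical model for illustration, not the actual dictionary code:]

```java
// Models peak table-slot count while growing a doubling hash table to hold
// 'entries' entries at the given load factor; during each rehash the old
// table (cap) and new table (2*cap) are live at the same time.
public class RehashSpike {
    static long peakSlots(long entries, double loadFactor) {
        long cap = 16; // typical initial capacity (assumption)
        long peak = cap;
        while (cap * loadFactor < entries) {
            peak = Math.max(peak, cap + 2 * cap); // old + new live together
            cap *= 2;
        }
        return peak;
    }

    public static void main(String[] args) {
        // e.g. the 11M bigram terms mentioned above, at a 0.75 load factor:
        // the final rehash transiently holds ~25M slots for ~12.6M usable ones.
        System.out.println(peakSlots(11_000_000L, 0.75));
    }
}
```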