On Nov 7, 2012, at 11:50 AM, Sean Owen wrote:

> It's a trie?
It's quite cool what Lucene does: linear-time build (if your terms are sorted) and a bunch of other stuff:

http://www.slideshare.net/LucidImagination/weiss-dawid-finite-state-automata-in-lucene
http://blog.mikemccandless.com/2012/05/finite-state-automata-in-lucene.html

> Yeah, that could be a big win. It gets tricky with Unicode, but I
> imagine there is a lot of gain even so.
> "Bigrams over 11M terms" jumped out too as a place to start.
> (I don't see any particular backwards compatibility issue with Lucene 3 to
> even worry about.)

Yeah, we are pruning more aggressively, but something still isn't right: with 12 GB of heap and a dictionary chunk of 100 MB, you'd think it would fit. All I can think is that there is some fairly rapid allocation pattern where the GC can't keep up with the rehash allocation (though that doesn't quite fit either, because you'd think the GC would just fall back to a full collection). I'll see if we can debug further. There could also be a bug in the dictionary writing where it sets the size of something incorrectly, or the dictionary is corrupted for some reason, and so we get a large overallocation.

-Grant
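[The linear-time build mentioned above works because sorted input means each term shares its longest common prefix with the previous term, so only the new suffix allocates nodes. A minimal plain-Java trie sketch of that idea — this is an illustration, not Lucene's FST API, and unlike a real FST it does not also share suffixes via minimization:]

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: inserting sorted terms only allocates nodes for the
// unshared suffix of each term, which is why a single sorted pass is linear.
public class TrieSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    final Node root = new Node();
    int nodeCount = 1; // count the root

    // Walk the shared prefix, then allocate nodes only for the new suffix.
    void add(String term) {
        Node cur = root;
        for (char c : term.toCharArray()) {
            Node next = cur.children.get(c);
            if (next == null) {
                next = new Node();
                cur.children.put(c, next);
                nodeCount++;
            }
            cur = next;
        }
        cur.terminal = true;
    }

    public static void main(String[] args) {
        TrieSketch t = new TrieSketch();
        String[] sorted = {"car", "card", "care", "cat"};
        int totalChars = 0;
        for (String s : sorted) { t.add(s); totalChars += s.length(); }
        // Prefix sharing: 7 nodes cover 14 characters of input.
        System.out.println("nodes=" + t.nodeCount + " chars=" + totalChars);
    }
}
```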

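[A back-of-envelope sketch of the rehash-allocation spike hypothesized above: during a hash-table rehash, the old and new tables are both live until the copy finishes, so with doubling growth the transient footprint briefly reaches about 3x the old table. This is a hypothetical model for illustration, not the actual dictionary code:]

```java
// Models peak table-slot count while growing a doubling hash table to hold
// 'entries' entries at the given load factor; during each rehash the old
// table (cap) and new table (2*cap) are live at the same time.
public class RehashSpike {
    static long peakSlots(long entries, double loadFactor) {
        long cap = 16; // typical initial capacity (assumption)
        long peak = cap;
        while (cap * loadFactor < entries) {
            peak = Math.max(peak, cap + 2 * cap); // old + new live together
            cap *= 2;
        }
        return peak;
    }

    public static void main(String[] args) {
        // e.g. the 11M bigram terms mentioned above, at a 0.75 load factor:
        // the final rehash transiently holds ~25M slots for ~12.6M usable ones.
        System.out.println(peakSlots(11_000_000L, 0.75));
    }
}
```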