I see the same type of exception later on in the KMeans driver https://gist.github.com/15c918acd2583e4ac54f
This is using the same large dataset that Grant mentioned. I should clarify that it's not 11M terms, but 11M bigrams after pruning.

242,646 docs
172,502,741 tokens

Cheers
-David

On Nov 7, 2012, at 12:06 PM, Grant Ingersoll wrote:

> It's in throwing it in the config of the Reducer, so not likely the vector,
> but it could be.
>
> Once we went back to unigrams, the OOM in that spot went away.
>
> On Nov 7, 2012, at 12:00 PM, Robin Anil wrote:
>
>> Not seen the code in a while but AFAIR the reducer is not loading any
>> dictionary. We chunk the dictionary to create partial vectors. I think you
>> just have a huge vector.
>>
>> On Nov 7, 2012 10:50 AM, "Sean Owen" <sro...@gmail.com> wrote:
>>
>>> It's a trie? Yeah that could be a big win. It gets tricky with Unicode, but
>>> imagine there is a lot of gain even so.
>>> "Bigrams over 11M terms" jumped out too as a place to start.
>>> (I don't see any particular backwards compatibility issue with Lucene 3 to
>>> even worry about.)
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidworks.com
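For what it's worth, Robin's "you just have a huge vector" point is easy to see with rough arithmetic. The sketch below is purely illustrative and not from the thread: the 8- and 12-byte per-entry sizes and the 5,000 non-zeros-per-document figure are my assumptions, picked only to show the scale difference between a dense vector over the full 11M-bigram term space and a sparse one.

```java
public class VectorMemoryEstimate {

    // A dense vector pays 8 bytes (one double) for every term in the
    // dictionary, zero or not.
    static long denseBytes(long terms) {
        return terms * 8L;
    }

    // A sparse vector pays only for non-zero entries; assume roughly
    // 12 bytes each (4-byte int index + 8-byte double value), ignoring
    // object and array overhead.
    static long sparseBytes(long nonZeros) {
        return nonZeros * 12L;
    }

    public static void main(String[] args) {
        long terms = 11_000_000L;   // ~11M bigrams after pruning (from the thread)
        long nonZeros = 5_000L;     // assumed distinct terms in a typical doc

        System.out.println("dense:  ~" + denseBytes(terms) / (1024 * 1024) + " MB per vector");
        System.out.println("sparse: ~" + sparseBytes(nonZeros) / 1024 + " KB per vector");
    }
}
```

Even one dense vector over 11M terms is on the order of 80+ MB, so a handful of them buffered in a reducer would blow a default-sized heap, while the sparse representation stays in the tens of KB.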