I see the same type of exception later on in the KMeans driver

https://gist.github.com/15c918acd2583e4ac54f

This is using the same large dataset that Grant mentioned. I should clarify 
that it's not 11M terms, but 11M bigrams after pruning.

242,646 docs
172,502,741 tokens

Cheers
-David

On Nov 7, 2012, at 12:06 PM, Grant Ingersoll wrote:

> It's throwing it in the config of the Reducer, so it's likely not the vector, 
> but it could be.
> 
> Once we went back to unigrams, the OOM in that spot went away.
> 
> On Nov 7, 2012, at 12:00 PM, Robin Anil wrote:
> 
>> Haven't seen the code in a while, but AFAIR the reducer is not loading any
>> dictionary. We chunk the dictionary to create partial vectors. I think you
>> just have a huge vector.
>> On Nov 7, 2012 10:50 AM, "Sean Owen" <sro...@gmail.com> wrote:
>> 
>>> It's a trie? Yeah, that could be a big win. It gets tricky with Unicode, but
>>> I imagine there is a lot of gain even so.
>>> "Bigrams over 11M terms" jumped out too as a place to start.
>>> (I don't see any particular backwards compatibility issue with Lucene 3 to
>>> even worry about.)
>>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidworks.com
