Oh, 11M bigrams. Well, I can't see how that would come anywhere near filling
12GB of heap, or even half of it.
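A rough back-of-envelope supports that (the ~100 bytes per entry is a guess
covering String, char[], and hash-entry overhead, not a measurement):

```java
// Toy estimate: 11M dictionary entries at a guessed ~100 bytes each.
public class HeapEstimate {
    public static void main(String[] args) {
        long entries = 11_000_000L;   // bigrams after pruning
        long bytesPerEntry = 100L;    // assumed per-entry overhead (a guess)
        long mb = entries * bytesPerEntry / (1024 * 1024);
        System.out.println(mb + " MB"); // prints "1049 MB" -- about 1GB
    }
}
```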
Are you guys sure that the child workers are actually being allowed to use a
12GB heap? There are lots of places to put the "mapred.child.java.opts"
parameter that don't actually do anything, which I have learned by making
that mistake about three times, every which way.
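For the record, here is a sketch of one place the setting does take effect,
assuming the Hadoop 1.x-era mapred API this thread is using: it has to land
in the job's Configuration before the job is submitted (setting it on the
client JVM alone, or after submission, does nothing for the workers).

```java
// Sketch, not a full driver: the property must be in the submitted job conf.
Configuration conf = new Configuration();
conf.set("mapred.child.java.opts", "-Xmx12g");
JobConf jobConf = new JobConf(conf);
// Or, if the driver goes through ToolRunner/GenericOptionsParser:
//   hadoop jar job.jar some.Driver -Dmapred.child.java.opts=-Xmx12g ...
```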


On Wed, Nov 7, 2012 at 7:04 PM, David Arthur <mum...@gmail.com> wrote:

> I see the same type of exception later on in the KMeans driver
>
> https://gist.github.com/15c918acd2583e4ac54f
>
> This is using the same large dataset that Grant mentioned. I should
> clarify that it's not 11M terms, but 11M bigrams after pruning.
>
> 242,646 docs
> 172,502,741 tokens
>
> Cheers
> -David
>
> On Nov 7, 2012, at 12:06 PM, Grant Ingersoll wrote:
>
> > It's thrown in the config of the Reducer, so it's not likely the
> > vector, but it could be.
> >
> > Once we went back to unigrams, the OOM in that spot went away.
> >
> > On Nov 7, 2012, at 12:00 PM, Robin Anil wrote:
> >
> >> I haven't seen the code in a while, but AFAIR the reducer is not loading
> >> any dictionary. We chunk the dictionary to create partial vectors. I
> >> think you just have a huge vector.
> >> On Nov 7, 2012 10:50 AM, "Sean Owen" <sro...@gmail.com> wrote:
> >>
> >>> It's a trie? Yeah, that could be a big win. It gets tricky with
> >>> Unicode, but I imagine there is a lot of gain even so.
> >>> "Bigrams over 11M terms" jumped out too as a place to start.
> >>> (I don't see any particular backwards-compatibility issue with Lucene 3
> >>> to even worry about.)
> >>>
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidworks.com
> >
> >
> >
> >
>
>
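Sean's trie suggestion above can be sketched in miniature (a toy
illustration, not Mahout's actual dictionary code): bigrams that share a
leading word share nodes, so common prefixes are stored once instead of
once per term.

```java
import java.util.HashMap;
import java.util.Map;

// Toy trie over characters: terms with a common prefix share nodes.
public class Trie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isTerm;
    }

    final Node root = new Node();
    int nodeCount = 1; // counts the root

    void add(String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            Node next = n.children.get(c);
            if (next == null) {
                next = new Node();
                n.children.put(c, next);
                nodeCount++;
            }
            n = next;
        }
        n.isTerm = true;
    }

    boolean contains(String term) {
        Node n = root;
        for (char c : term.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return false;
        }
        return n.isTerm;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        String[] bigrams = {"new york", "new jersey", "new deal"};
        int totalChars = 0;
        for (String b : bigrams) { t.add(b); totalChars += b.length(); }
        // The shared prefix "new " is stored once.
        System.out.println((t.nodeCount - 1) + " nodes vs " + totalChars
            + " chars"); // prints "18 nodes vs 26 chars"
    }
}
```

For real terms one would likely key on whole tokens or Unicode code points
rather than Java chars, which is where Sean's Unicode caveat comes in.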
