I tuned LDA to about a 30% speed boost by replacing hashmaps with arrays
(the size is bounded) and by replacing the repeated dense matrix allocation
with a single allocation followed by a clear. I removed (commented out) the
asserts that David had put in, and replaced get() with getQuick(), since too
many bounds checks were happening.
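
For reference, here is a minimal sketch of the reuse-and-clear pattern
described above; the class and method names are illustrative, not the actual
Mahout LDA code. With a bounded key space a plain array replaces the hashmap,
and one buffer allocated up front replaces a fresh allocation per document.
The getQuick() change is the same idea applied to Mahout's Matrix, whose
get() validates indices on every call while getQuick() skips the checks.

```java
import java.util.Arrays;

// Illustrative sketch, not the actual Mahout LDA code: a bounded key space
// (numTopics is known up front) lets a plain double[] stand in for a
// HashMap<Integer, Double>, avoiding boxing and hashing entirely.
public class TopicScores {
  private final double[] scores;   // allocated once, reused per document

  public TopicScores(int numTopics) {
    this.scores = new double[numTopics];
  }

  public void clear() {
    Arrays.fill(scores, 0.0);      // cheap clear instead of re-allocating
  }

  public void add(int topic, double value) {
    scores[topic] += value;        // direct indexed access, no boxing
  }

  public double get(int topic) {
    return scores[topic];
  }
}
```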

Then I profiled it again; this is what I saw:
Total time: 1227 sec

  Time in mapper: 630 sec
    Time in LDAInference: 470 sec
      ClearMatrix: 266 sec
      LDA logic: 200 sec
    TaskInputOutputContext.write: 143 sec

  Time in MergeThread: 146 sec
    Time spent in reduce logic: ~10 sec

It seems LDA could be pushed further. I don't know how much more we can
squeeze out of the ClearMatrix function (I even tried System.arraycopy).
The root of all evil is IntPairWritable, which for some reason is causing
long write times in the Hadoop serializer. I will have to see why (it's
just two ints?); maybe we need to write longer byte runs for efficiency?
Any clues would be helpful.
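
In case it helps the discussion, here is a sketch of what a stripped-down
two-int key could look like; RawIntPair is a hypothetical name, not an
existing Mahout class. It writes exactly eight bytes per key with two raw
writeInt() calls, so there is no per-field object or encoding overhead on
the write path.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical lean replacement for a two-int key (sketch only; the class
// name RawIntPair is illustrative). Serializes as exactly 8 bytes.
public class RawIntPair implements WritableComparable<RawIntPair> {
  private int first;
  private int second;

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);   // 4 bytes, no varint or object overhead
    out.writeInt(second);  // 4 bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(RawIntPair other) {
    if (first != other.first) {
      return first < other.first ? -1 : 1;
    }
    return second == other.second ? 0 : (second < other.second ? -1 : 1);
  }

  @Override
  public int hashCode() {
    return 31 * first + second;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof RawIntPair
        && ((RawIntPair) o).first == first
        && ((RawIntPair) o).second == second;
  }
}
```

If the sort phase shows up in the profile too, registering a
WritableComparator that compares the eight key bytes directly would let
Hadoop order keys without deserializing them at all.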

Robin

On Tue, Mar 2, 2010 at 6:51 PM, Sean Owen <sro...@gmail.com> wrote:

> I'll have a look there. May be worth piling in one more little thing
> like this in the 'code freeze'.
>
> Incidentally Hadoop announced version 0.20.2 a few days ago -- still
> looking for it on Maven but I will be starting up our release process
> again as soon as I see it.
>
> So really-truly let's get those last items for 0.3 in or mark them for
> 0.4 today, or else I might have to push them forward.
>
> On Tue, Mar 2, 2010 at 9:45 AM, Robin Anil <robin.a...@gmail.com> wrote:
> > Another issue along the same topic
> >
> > We have Bigram in co-occurrence, IntPairWritable in LDA, and gramkey in
> > the dictionary vectorizer; they store the same thing: two integers. Can't
> > we start a crackdown on these and others like them?
> >
> > Robin
> >
>
