I tuned LDA and got about a 30% speedup by replacing HashMaps with arrays (the size is bounded) and by replacing the repeated dense-matrix allocation with a single allocation followed by a clear. I removed (commented out) the asserts that David had put in, and I also replaced get() with getQuick(), since too many bounds checks were happening.
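For reference, here is a minimal sketch of the reuse pattern, assuming Mahout's DenseMatrix API. The class and field names (InferenceScratch, phi, numTopics, numTerms) are made up for illustration; the actual LDAInference code is organized differently.

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;

    // Hypothetical scratch buffer for per-document inference: allocate the
    // matrix once (its dimensions are bounded) and clear it between documents
    // instead of allocating a fresh DenseMatrix each time.
    public class InferenceScratch {
      private final Matrix phi;

      InferenceScratch(int numTopics, int numTerms) {
        phi = new DenseMatrix(numTopics, numTerms);
      }

      void startDocument() {
        phi.assign(0.0); // clear in place; no new allocation, no GC pressure
      }

      double read(int topic, int term) {
        return phi.getQuick(topic, term); // getQuick() skips get()'s bounds checks
      }

      void accumulate(int topic, int term, double delta) {
        phi.setQuick(topic, term, phi.getQuick(topic, term) + delta);
      }
    }

The trade-off with getQuick()/setQuick() is that an out-of-range index fails deep inside the matrix instead of failing fast, which is why it only makes sense on hot, already-validated paths.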
Then I profiled it again; this is what I saw:

    Total time: 1227s
      Time in mapper: 630s
        Time in LDAInference: 470s
          ClearMatrix: 266s
          LDA logic: 200s
        TaskInputOutputContext.write: 143s
      Time in MergeThread: 146s
      Time spent in reduce logic: ~10s

It seems LDA could be pushed further. I don't know how much more we can squeeze out of the ClearMatrix function (I even tried System.arraycopy). The root of all evil is IntPairWritable, which is for some reason creating longer write times in the Hadoop serializer. I will have to see why (it's just 2 ints?). Maybe we need to write longer byte runs for efficiency? Any clues would be helpful (see the sketch at the end of this mail).

Robin

On Tue, Mar 2, 2010 at 6:51 PM, Sean Owen <sro...@gmail.com> wrote:
> I'll have a look there. May be worth piling in one more little thing
> like this in the 'code freeze'.
>
> Incidentally, Hadoop announced version 0.20.2 a few days ago -- still
> looking for it on Maven, but I will be starting up our release process
> again as soon as I see it.
>
> So really-truly, let's get those last items for 0.3 in or mark them for
> 0.4 today, or else I might have to push them forward.
>
> On Tue, Mar 2, 2010 at 9:45 AM, Robin Anil <robin.a...@gmail.com> wrote:
> > Another issue along the same topic:
> >
> > We have Bigram in co-occurrence, IntPairWritable in LDA, and GramKey in
> > the dictionary vectorizer -- they all store the same thing: two integers.
> > Can't we start a crackdown on these and others like them?
> >
> > Robin
> >
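PS: to make the serialization question concrete, here is a rough sketch of a two-int key that writes one packed long instead of two separate ints -- the "write longer bytes" idea above. PackedIntPair is a made-up name, and this is not how IntPairWritable is currently implemented; it is just an illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Illustrative two-int key: serializes both ints with a single 8-byte
    // write rather than two 4-byte writes.
    public class PackedIntPair implements WritableComparable<PackedIntPair> {
      private int first;
      private int second;

      public void set(int first, int second) {
        this.first = first;
        this.second = second;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(((long) first << 32) | (second & 0xFFFFFFFFL));
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        long packed = in.readLong();
        first = (int) (packed >>> 32);
        second = (int) packed;
      }

      @Override
      public int compareTo(PackedIntPair o) {
        if (first != o.first) {
          return first < o.first ? -1 : 1;
        }
        if (second != o.second) {
          return second < o.second ? -1 : 1;
        }
        return 0;
      }

      @Override
      public int hashCode() {
        return 31 * first + second; // keys need a stable hashCode for partitioning
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof PackedIntPair)) {
          return false;
        }
        PackedIntPair p = (PackedIntPair) other;
        return first == p.first && second == p.second;
      }
    }

Whether one writeLong() actually beats two writeInt() calls depends on how the output stream buffers, so this needs measuring rather than assuming. My guess is the bigger win for sort-heavy jobs would be registering a raw-bytes WritableComparator so the shuffle can compare keys without deserializing them, but that is a guess.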