I've created a small benchmark to play around with the way the dot product is computed. It tries to mimic Josh's use case: multiplying 2.5M dense item vectors of size 20 by a dense user vector of size 20.
https://gist.github.com/sscdotopen/5108988

I compared using the .dot() method from the Vector class to simply
computing the dot product like this:

static double dot(Vector one, Vector two) {
  double sum = 0;
  for (int n = 0; n < NUM_FEATURES; n++) {
    sum += one.getQuick(n) * two.getQuick(n);
  }
  return sum;
}

The results indicate that using Vector.dot() incurs a 6x to 7x overhead
compared to the simple method.
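For anyone who wants to follow along without opening the gist, this is
roughly what the benchmark does (a simplified sketch, not the exact gist
code; Mahout's DenseVector is assumed, and the class name, seed and
random data are illustrative):

import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DotBenchmarkSketch {

  static final int NUM_ITEMS = 2500000;
  static final int NUM_FEATURES = 20;

  static double dot(Vector one, Vector two) {
    double sum = 0;
    for (int n = 0; n < NUM_FEATURES; n++) {
      sum += one.getQuick(n) * two.getQuick(n);
    }
    return sum;
  }

  static Vector randomVector(Random random) {
    double[] values = new double[NUM_FEATURES];
    for (int n = 0; n < NUM_FEATURES; n++) {
      values[n] = random.nextDouble();
    }
    return new DenseVector(values);
  }

  public static void main(String[] args) {
    // note: holding 2.5M dense vectors of 20 doubles needs a large heap
    Random random = new Random(42);
    Vector[] itemVectors = new Vector[NUM_ITEMS];
    for (int i = 0; i < NUM_ITEMS; i++) {
      itemVectors[i] = randomVector(random);
    }
    Vector userVector = randomVector(random);

    // time Vector.dot()
    long start = System.currentTimeMillis();
    double sum = 0;
    for (Vector itemVector : itemVectors) {
      sum += itemVector.dot(userVector);
    }
    System.out.println("Vector.dot(): "
        + (System.currentTimeMillis() - start) + " ms (checksum " + sum + ")");

    // time the simple loop
    start = System.currentTimeMillis();
    sum = 0;
    for (Vector itemVector : itemVectors) {
      sum += dot(itemVector, userVector);
    }
    System.out.println("simple loop:  "
        + (System.currentTimeMillis() - start) + " ms (checksum " + sum + ")");
  }
}

The checksums are printed only to keep the JIT from optimizing the loops
away.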
On 07.03.2013 16:00, Josh Devins wrote:
> I ran from what's in trunk as of this morning. I didn't dig in further
> to see where that extra time was coming from, but can do so when I get
> some time soon.
>
>
> On 7 March 2013 15:56, Sebastian Schelter <s...@apache.org> wrote:
>
>> Hi Josh,
>>
>> Did you run the patch from the jira issue or did you run the trunk? I
>> made some follow-up changes after uploading the patch. I can't imagine
>> why those small changes would lead to an increase of 50% in the runtime.
>>
>> /s
>>
>>
>> On 07.03.2013 15:02, Josh Devins wrote:
>>> So the good news is that the patch runs ;) The bad news is that it's
>>> slower, going from 1600-1800ms to ~2500ms to calculate a single user's
>>> topK recommendations. For kicks, I ran a couple of other experiments,
>>> progressively removing code to isolate the problem area. Results are
>>> detailed here: https://gist.github.com/joshdevins/5106930
>>>
>>> Conclusions thus far:
>>> * the patch is not helpful (for performance) and should be reverted or
>>>   fixed again (sorry Sebastian)
>>> * the dot product operation in `Vector` is not efficient enough for
>>>   large vectors/matrices when used as it is in the ALS `RecommenderJob`,
>>>   inside a loop over `M`
>>>
>>> I've tried a few other experiments with Colt (for example), but there
>>> was no noticeable gain. Parallelizing inside the map task (manually or
>>> with Parallel Colt) is possible but obviously not ideal in an
>>> environment like Hadoop -- this would save memory, since you only need
>>> a few map tasks loading the matrices, but it isn't playing very nicely
>>> within a shared cluster :)
>>>
>>> The next step at this point is to look at either reducing the number
>>> of items to recommend over, LSH, or a third secret plan that "the PhDs"
>>> are thinking about. Paper forthcoming, no doubt :D
>>>
>>> @Sebastian, happy to run any patches on our cluster/dataset before
>>> making more commits.
>>>
>>>
>>> On 6 March 2013 20:58, Josh Devins <h...@joshdevins.com> wrote:
>>>
>>>> Got sidetracked today, but I'll run Sebastian's version in trunk
>>>> tomorrow and report back.
>>>>
>>>>
>>>> On 6 March 2013 17:07, Sebastian Schelter <s...@apache.org> wrote:
>>>>
>>>>> I already committed a fix in that direction. I modified our
>>>>> FixedSizePriorityQueue to allow inspection of its head for direct
>>>>> comparison. This obviates the need to instantiate a Comparable and
>>>>> offer it to the queue.
>>>>>
>>>>> /s
>>>>>
>>>>>
>>>>> On 06.03.2013 17:01, Ted Dunning wrote:
>>>>>> I would recommend against a mutable object on maintenance grounds.
>>>>>>
>>>>>> Better is to keep the threshold that a new score must meet and only
>>>>>> construct the object when needed. That cuts the allocation down to
>>>>>> negligible levels.
>>>>>>
>>>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> OK, that's reasonable on 35 machines. (You can probably turn up to
>>>>>>> 70 reducers, as most machines can handle 2 reducers at once.)
>>>>>>> I think the recommendation step loads one whole matrix into memory.
>>>>>>> You're not running out of memory, but if you're turning up the heap
>>>>>>> size to accommodate it, you might be hitting swapping, yes. I think
>>>>>>> (?) the conventional wisdom is to turn off swap for Hadoop.
>>>>>>>
>>>>>>> Sebastian, yes, that is probably a good optimization; I've had good
>>>>>>> results reusing a mutable object in this context.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <h...@joshdevins.com> wrote:
>>>>>>>
>>>>>>>> The factorization at 2 hours is kind of a non-issue (certainly
>>>>>>>> fast enough). It was run with (if I recall correctly) 30 reducers
>>>>>>>> across a 35-node cluster, with 10 iterations.
>>>>>>>>
>>>>>>>> I was a bit shocked at how long the recommendation step took and
>>>>>>>> will throw some timing debug in to see where the problem lies
>>>>>>>> exactly. There were no other jobs running on the cluster during
>>>>>>>> these attempts, but it's certainly possible that something is
>>>>>>>> swapping or the like. I'll be looking more closely today before I
>>>>>>>> start to consider other options for calculating the
>>>>>>>> recommendations.
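PS: To make the queue discussion above concrete, this is roughly the
shape of the "inspect the head before allocating" optimization (a sketch
with illustrative names, not the actual FixedSizePriorityQueue code):

import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of a fixed-size top-k collector that checks the current
// threshold before allocating a candidate object. Names are
// illustrative, not the real Mahout API.
class TopKCollector {

  private final int k;
  // min-heap: the head is the weakest of the current top-k scores
  private final PriorityQueue<ScoredItem> queue;

  TopKCollector(int k) {
    this.k = k;
    this.queue = new PriorityQueue<ScoredItem>(k, new Comparator<ScoredItem>() {
      @Override
      public int compare(ScoredItem a, ScoredItem b) {
        return Double.compare(a.score, b.score);
      }
    });
  }

  void offer(long itemID, double score) {
    if (queue.size() < k) {
      queue.add(new ScoredItem(itemID, score));
    } else if (score > queue.peek().score) {
      // only candidates that beat the current head cause an allocation
      queue.poll();
      queue.add(new ScoredItem(itemID, score));
    }
  }

  static final class ScoredItem {
    final long itemID;
    final double score;
    ScoredItem(long itemID, double score) {
      this.itemID = itemID;
      this.score = score;
    }
  }
}

The point is that a candidate which does not beat the current head is
rejected with a single primitive comparison and no allocation, which is
what matters when you offer 2.5M candidates per user.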