I've created a small benchmark to play around with the way the dot
product is computed. It tries to mimic Josh's use case: multiplying 2.5M
dense item vectors of size 20 by a dense user vector of size 20.

https://gist.github.com/sscdotopen/5108988

I compared using the .dot() method from the Vector class to simply
computing the dot product like this:

  static double dot(Vector one, Vector two) {
    double sum = 0;
    for (int n = 0; n < NUM_FEATURES; n++) {
      sum += one.getQuick(n) * two.getQuick(n);
    }
    return sum;
  }

The results indicate that using Vector.dot() incurs a 6x to 7x overhead
compared to the simple method.
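
For anyone who wants to see the effect without Mahout on the classpath, here is a self-contained sketch of the same direct loop over plain double arrays (this is an illustration, not the actual benchmark from the gist; the class name and the synthetic data are mine):

```java
// Direct dot product over plain arrays -- what the getQuick() loop above
// effectively reduces to for dense vectors of a known, fixed size.
public class DotSketch {

    static final int NUM_FEATURES = 20;

    static double dot(double[] one, double[] two) {
        double sum = 0;
        for (int n = 0; n < NUM_FEATURES; n++) {
            sum += one[n] * two[n];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Synthetic user and item vectors, analogous to the benchmark setup.
        double[] user = new double[NUM_FEATURES];
        double[] item = new double[NUM_FEATURES];
        for (int n = 0; n < NUM_FEATURES; n++) {
            user[n] = 1.0;
            item[n] = 2.0;
        }
        System.out.println(dot(user, item));
    }
}
```

The JIT can fully unroll and vectorize a loop like this, whereas a general-purpose Vector.dot() pays for virtual dispatch and genericity on every element, which is presumably where the 6x-7x gap comes from.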

On 07.03.2013 16:00, Josh Devins wrote:
> I ran from what's in trunk as of this morning. I didn't dig in further to
> see where that extra time was coming from but can do so when I get some
> time soon.
> 
> 
> On 7 March 2013 15:56, Sebastian Schelter <s...@apache.org> wrote:
> 
>> Hi Josh,
>>
>> Did you run the patch from the jira issue or did you run the trunk? I
>> made some follow up changes after uploading the patch. I can't imagine
>> why those small changes would lead to an increase of 50% in the runtime.
>>
>> /s
>>
>>
>>
>> On 07.03.2013 15:02, Josh Devins wrote:
>>> So the good news is that the patch runs ;)  The bad news is that it's
>>> slower, going from 1600-1800ms to ~2500ms to calculate a single user's
>>> top-K recommendations. For kicks, I ran a couple other experiments,
>>> progressively removing code to isolate the problem area. Results are
>>> detailed here: https://gist.github.com/joshdevins/5106930
>>>
>>> Conclusions thus far:
>>>  * the patch is not helpful (for performance) and should be reverted or
>>>    fixed again (sorry Sebastian)
>>>  * the dot product operation in `Vector` is not efficient enough for
>>>    large vectors/matrices, when used as it is in the ALS
>>>    `RecommenderJob`, inside a loop over `M`
>>>
>>> I've tried a few other experiments with Colt (for example) but there was
>>> no noticeable gain. Parallelizing inside the map task (manually or with
>>> Parallel Colt) is possible but obviously is not ideal in an environment
>>> like Hadoop -- this would save memory since you only need a few map
>>> tasks loading the matrices, but isn't playing very nicely within a
>>> shared cluster :)
>>>
>>> Next step at this point is to look at either reducing the number of
>>> items to recommend over, LSH or a third secret plan that "the PhD's"
>>> are thinking about. Paper forthcoming, no doubt :D
>>>
>>> @Sebastian, happy to run any patches on our cluster/dataset before making
>>> more commits.
>>>
>>>
>>>
>>> On 6 March 2013 20:58, Josh Devins <h...@joshdevins.com> wrote:
>>>
>>>> Got sidetracked today but I'll run Sebastian's version in trunk tomorrow
>>>> and report back.
>>>>
>>>>
>>>> On 6 March 2013 17:07, Sebastian Schelter <s...@apache.org> wrote:
>>>>
>>>>> I already committed a fix in that direction. I modified our
>>>>> FixedSizePriorityQueue to allow inspection of its head for direct
>>>>> comparison. This obviates the need to instantiate a Comparable and
>>>>> offer it to the queue.
>>>>>
>>>>> /s
>>>>>
>>>>>
>>>>> On 06.03.2013 17:01, Ted Dunning wrote:
>>>>>> I would recommend against a mutable object on maintenance grounds.
>>>>>>
>>>>>> Better is to keep the threshold that a new score must meet and only
>>>>>> construct the object on need. That cuts the allocation down to
>>>>>> negligible levels.
>>>>>>
>>>>>> On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> OK, that's reasonable on 35 machines. (You can turn up to 70
>>>>>>> reducers, probably, as most machines can handle 2 reducers at once.)
>>>>>>> I think the recommendation step loads one whole matrix into memory.
>>>>>>> You're not running out of memory but if you're turning up the heap
>>>>>>> size to accommodate, you might be hitting swapping, yes. I think (?)
>>>>>>> the conventional wisdom is to turn off swap for Hadoop.
>>>>>>>
>>>>>>> Sebastian yes that is probably a good optimization; I've had good
>>>>>>> results reusing a mutable object in this context.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 6, 2013 at 10:54 AM, Josh Devins <h...@joshdevins.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The factorization at 2 hours is kind of a non-issue (certainly
>>>>>>>> fast enough). It was run with (if I recall correctly) 30 reducers
>>>>>>>> across a 35-node cluster, with 10 iterations.
>>>>>>>>
>>>>>>>> I was a bit shocked at how long the recommendation step took and
>>>>>>>> will throw some timing debug in to see where the problem lies
>>>>>>>> exactly. There were no other jobs running on the cluster during
>>>>>>>> these attempts, but it's certainly possible that something is
>>>>>>>> swapping or the like. I'll be looking more closely today before I
>>>>>>>> start to consider other options for calculating the
>>>>>>>> recommendations.
>>>>>>>>
>>>>>>>>
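
As an aside, the threshold trick Ted describes above (only construct and offer an object once a candidate score beats the queue's current head) can be sketched like this. This is my own minimal illustration, not Mahout's FixedSizePriorityQueue; a real implementation would track a primitive threshold to avoid the Double boxing shown here:

```java
import java.util.PriorityQueue;

// Fixed-size top-K tracker: a min-heap whose head is the K-th best score
// seen so far. Candidates below the head are rejected before any eviction
// or insertion work happens.
public class TopK {

    private final int k;
    private final PriorityQueue<Double> heap; // min-heap: head = smallest retained score

    public TopK(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k);
    }

    // Returns true if the score was admitted into the top K.
    public boolean offer(double score) {
        if (heap.size() < k) {
            heap.add(score);   // queue not yet full: always admit
            return true;
        }
        if (score <= heap.peek()) {
            return false;      // cheap rejection against the current threshold
        }
        heap.poll();           // evict the current minimum
        heap.add(score);
        return true;
    }

    // The score a new candidate must beat to get in.
    public double threshold() {
        return heap.isEmpty() ? Double.NEGATIVE_INFINITY : heap.peek();
    }
}
```

In the ALS recommendation loop over all items, most candidates fall below the threshold once the queue warms up, so the early return dominates and per-candidate allocation becomes negligible, which matches Ted's point.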
