Re: mahout collections updates

Sebastian Schelter Tue, 12 Mar 2013 15:42:44 -0700

Using this code in DenseVector gives me similar performance:

  @Override
  public double dot(Vector x) {
    if (!x.isDense()) {
      return super.dot(x);
    } else {
      double sum = 0;
      final int size = size();
      for (int n = 0; n < size; n++) {
        sum += values[n] * x.getQuick(n);
      }
      return sum;
    }
  }


On 12.03.2013 23:18, Jake Mannix wrote:
> Ok now that's even weirder than I thought!  You're still calling vector
> methods, you're not doing direct array access or anything...
> 
> On Tuesday, March 12, 2013, Sebastian Schelter wrote:
> 
>> I wrote a small benchmark as we investigated this issue recently for
>> computing recommendations from an ALS factorization.
>>
>> https://gist.github.com/sscdotopen/5147521
>>
>> Here's some results on my notebook, indicating a 5x to 6x performance
>> improvement:
>>
>> dot() 570ms, direct 98ms
>> dot() 568ms, direct 97ms
>> dot() 559ms, direct 97ms
>> dot() 566ms, direct 98ms
>> dot() 574ms, direct 98ms
>> dot() 582ms, direct 101ms
>> dot() 581ms, direct 98ms
>> dot() 576ms, direct 98ms
>> dot() 574ms, direct 97ms
>> dot() 574ms, direct 99ms
>> dot() 564ms, direct 98ms
>> dot() 576ms, direct 97ms
>> dot() 580ms, direct 100ms
>>
>>
>>
>> On 12.03.2013 23:00, Sean Owen wrote:
>>> It's almost certainly the overhead of the iterator creation and iterator
>>> methods.
>>> DenseVector.dot() is not specialized and the simple dot product method
>> here
>>> could as well be placed there. Then the call to DenseVector.dot() would
>> be
>>> equally unsurprisingly fast.
>>>
>>>
>>> On Tue, Mar 12, 2013 at 9:56 PM, Jake Mannix 
>>> <[email protected]<javascript:;>>
>> wrote:
>>>
>>>> But then where does it slow down?  It just wraps a double[]
>>>>
>>>> On Tuesday, March 12, 2013, Sebastian Schelter wrote:
>>>>
>>>>> I looked into DenseVector and it doesn't use any primitive collections,
>>>>> so ignore my last mail :)
>>>>>
>>>>> On 12.03.2013 22:16, Sebastian Schelter wrote:
>>>>>> As a sidenote: I was kinda shocked recently, that switching from
>>>>>> DenseVector's dot() method to a direct dot product computation gave a
>>>> 3x
>>>>>> increase in performance in
>>>>>> org.apache.mahout.cf.taste.hadoop.als.RecommenderJob.
>>>>>>
>>>>>> It seems like we really have a performance problem for some usecases.
>>>>>>
>>>>>> On 12.03.2013 22:04, Dawid Weiss wrote:
>>>>>>>> The primary use case for mahout collections is directly *inside* of
>>>>>>>> our Vector interface.  Which is to say, it's not directly exposed to
>>>>>>>> most users, and we don't really expose the ability to do guava
>>>>> collections
>>>>>>>> stuff on them at all: We Do Math. :)  So in particular, we don't
>>>> expose
>>>>>>>
>>>>>>> Fair enough. But you might want to expose some of it at some point
>> and
>>>>>>> if this happens it
>>>>>>> may just be ready for you.
>>>>>>>
>>>>>>>> Question is whether there's anything to be gained by just swapping
>>>>>>>> our own collections *out* for something else, like HPPC or fastutil.
>>>>>>>
>>>>>>> Depends. Speed optimizations may be one reason -- you'd need to check
>>>>>>> if the code gains anything by using these libraries compared to
>> Mahout
>>>>>>> collections. While microbenchmarks may show large differences my bet
>>>>>>> is that overall results, taking into account
>>>>>>> computations and, God forbid, I/O, will be within noise range unless
>>>>>>> you're really using these data structures a *lot* in tight loops. The
>>>>>>> only practical benefit I see is getting rid of a chunk of code you
>>>>>>> don't wish to
>>>>>>> maintain (like you said: missing features, unit tests, etc.). But I
>>>>>>> don't negate there is some entertainment value in going back to such
>>>>>>> fundamental data structures and trying to squeeze the last bit of
>>>>>>> performance out of them. :)
>>>>>>>
>>>>>>> Dawid
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>>   -jake
>>>>
>>>
>>
>>
>

Re: mahout collections updates

Reply via email to