Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Ted Dunning
Colt did a nice job of this.  Basically, their idea was to take various
general functional patterns and allow the functions to be plugged in.

Common patterns that are reasonable to include in such a framework are:

a) dot product as the aggregation of a pairwise function application
(normal dot product is this with aggregation = + and pairwise function = *)

b) element-wise transformation of all elements of a matrix or vector

c) aggregation of an elementwise transformation of a vector or matrix (sum,
frobenius norm, euclidean squared distance are examples of this)

d) row-wise or column-wise transformation of a matrix resulting in a list of
objects

e) row-wise or column-wise aggregation of a matrix resulting in a vector
(row sum or column sum is a good example here)

f) row-wise or column-wise aggregated outer products (normal matrix
multiplication is an example where the product is dot and aggregation is +)

Combining these operations with various view operators such as transpose,
sub-matrix, and diagonal allows various other interesting operators to be
implemented.  Trace, for example, becomes
m.diagonal().aggregate(Functions.plus).

The result is a very powerful API that has enormous expressivity.

It is also, unfortunately, essentially opaque to most users, so lots of
shortcut and convenience functions are important.
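
To make the shape of that API concrete, here is a minimal sketch of patterns
(a)-(c) with pluggable functions.  It is illustrative only; the interfaces and
class names below are hypothetical, not Colt's or Mahout's actual code.

interface DoubleFunction {
  double apply(double x);                 // element-wise transform, pattern (b)
}

interface DoubleDoubleFunction {
  double apply(double a, double b);       // pairwise function or aggregator
}

final class Fns {
  static final DoubleDoubleFunction PLUS = (a, b) -> a + b;
  static final DoubleDoubleFunction MULT = (a, b) -> a * b;
  static final DoubleFunction SQUARE = x -> x * x;
  private Fns() {}
}

final class VectorSketch {
  private final double[] values;

  VectorSketch(double... values) {
    this.values = values;
  }

  // Pattern (c): aggregate an element-wise transform (e.g. sum of squares).
  double aggregate(DoubleDoubleFunction aggregator, DoubleFunction map) {
    double result = map.apply(values[0]);
    for (int i = 1; i < values.length; i++) {
      result = aggregator.apply(result, map.apply(values[i]));
    }
    return result;
  }

  // Pattern (a): dot product = aggregation (+) of a pairwise function (*).
  double aggregate(VectorSketch other, DoubleDoubleFunction aggregator,
                   DoubleDoubleFunction pairwise) {
    double result = pairwise.apply(values[0], other.values[0]);
    for (int i = 1; i < values.length; i++) {
      result = aggregator.apply(result, pairwise.apply(values[i], other.values[i]));
    }
    return result;
  }
}

With that in place, v.aggregate(w, Fns.PLUS, Fns.MULT) is the ordinary dot
product, and v.aggregate(Fns.PLUS, Fns.SQUARE) is the squared Euclidean norm.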

On Thu, Oct 1, 2009 at 10:53 AM, Jake Mannix  wrote:

> Of course, to do this right in Mahout, we'd probably need to have some way
> of
> telling the Vector instance what to use for its norm (so it knows whether
> to cache its L^2 norm, L^p norm, or some other inner product applied to
> itself).
>
> Which gets me thinking: maybe we should have an InnerProduct interface,
> similar to DistanceMeasure, which defined how to compute dot().  As it is,
> we basically assume in our API that while you may want norm(int p) for
> various p, and you may want different DistanceMeasure impls of various
> types, we say that dot() is always the Euclidean dot().
>
> Should that be pluggable as well?
>



-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Jake Mannix
On Thu, Oct 1, 2009 at 10:10 AM, Ted Dunning  wrote:

> Btw... the other thing that the HashVector does better is inserts.  The
> sorted vector could do much better on average if it deferred sorting until
> an access or iteration was done.  Even iteration doesn't necessarily need
> sorting, but it could be seen as part of the contract.
>

Iteration doesn't *need* sorting, but I agree that it should be part of the
contract, because doing fast dot products of two OrderedIntDoublePair
instances really needs ordering, so you can just walk them both in parallel.
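
(For reference, the parallel walk looks roughly like this; the raw
parallel-array representation below is just an illustration, not the actual
class.)

// Merge-style dot product of two sparse vectors whose (index, value) pairs
// are stored in ascending index order.  Illustrative sketch only.
static double sparseDot(int[] aIdx, double[] aVal, int[] bIdx, double[] bVal) {
  double dot = 0.0;
  int i = 0;
  int j = 0;
  while (i < aIdx.length && j < bIdx.length) {
    if (aIdx[i] == bIdx[j]) {
      dot += aVal[i] * bVal[j];
      i++;
      j++;
    } else if (aIdx[i] < bIdx[j]) {
      i++;      // index present only in the first vector; contributes nothing
    } else {
      j++;      // index present only in the second vector; contributes nothing
    }
  }
  return dot;
}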

In decomposer, I ended up having an ImmutableSparseVector subclass which
did the sorting in the constructor (if it wasn't already sorted) and then
didn't allow further inserts/deletes/modifications.

Speaking of other little niceties: caching the norm is something I found
pretty useful for a vector, along with keeping track of a boolean "isDirty"
flag which lets you know whether you've invalidated it and will need to
recalculate on the next call to norm().
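
(Roughly like the following sketch; the class and field names are made up for
illustration, not the actual decomposer or Mahout code.)

final class NormCachingVectorSketch {
  private final double[] values;
  private double cachedNorm;
  private boolean isDirty = true;     // the "isDirty" flag discussed above

  NormCachingVectorSketch(int size) {
    values = new double[size];
  }

  void set(int index, double value) {
    values[index] = value;
    isDirty = true;                   // any mutation invalidates the cached norm
  }

  double norm() {                     // L^2 here; which norm to cache could be pluggable
    if (isDirty) {
      double sumOfSquares = 0.0;
      for (double v : values) {
        sumOfSquares += v * v;
      }
      cachedNorm = Math.sqrt(sumOfSquares);
      isDirty = false;
    }
    return cachedNorm;
  }
}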

Of course, to do this right in Mahout, we'd probably need to have some way
of telling the Vector instance what to use for its norm (so it knows whether
to cache its L^2 norm, L^p norm, or some other inner product applied to
itself).

Which gets me thinking: maybe we should have an InnerProduct interface,
similar to DistanceMeasure, which defined how to compute dot().  As it is,
we basically assume in our API that while you may want norm(int p) for
various p, and you may want different DistanceMeasure impls of various
types, we say that dot() is always the Euclidean dot().

Should that be pluggable as well?
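
For concreteness, a hypothetical sketch of what that could look like, by
analogy with DistanceMeasure (the interface and class names below are made
up, not existing Mahout APIs):

// Hypothetical pluggable inner product, analogous to DistanceMeasure.
// Assumes Mahout's Vector interface is in scope; names are illustrative only.
interface InnerProduct {
  double dot(Vector a, Vector b);
}

// The current behavior would then be just one implementation among several:
class EuclideanInnerProduct implements InnerProduct {
  public double dot(Vector a, Vector b) {
    return a.dot(b);    // delegate to the existing Euclidean dot()
  }
}

A Vector configured with an InnerProduct could then also cache the norm that
product induces, which ties back to the isDirty idea above.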


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Ted Dunning
Btw... the other thing that the HashVector does better is inserts.  The
sorted vector could do much better on average if it deferred sorting until
an access or iteration was done.  Even iteration doesn't necessarily need
sorting, but it could be seen as part of the contract.
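
(A minimal illustration of that deferred-sort idea; a hypothetical class, not
the actual Mahout code, and it ignores duplicate indices and array growth.)

final class LazySortedPairs {
  private final int[] indices;
  private final double[] values;
  private int size = 0;
  private boolean sorted = true;

  LazySortedPairs(int capacity) {
    indices = new int[capacity];
    values = new double[capacity];
  }

  void add(int index, double value) {
    indices[size] = index;      // plain append: no search, no shifting
    values[size] = value;
    size++;
    sorted = false;             // the sort is deferred to the next read
  }

  int indexAt(int position) {
    ensureSorted();             // any by-index access or iteration sorts first
    return indices[position];
  }

  double valueAt(int position) {
    ensureSorted();
    return values[position];
  }

  private void ensureSorted() {
    if (sorted) {
      return;
    }
    // Sort both parallel arrays in tandem by index (insertion sort for brevity).
    for (int i = 1; i < size; i++) {
      int idx = indices[i];
      double val = values[i];
      int j = i - 1;
      while (j >= 0 && indices[j] > idx) {
        indices[j + 1] = indices[j];
        values[j + 1] = values[j];
        j--;
      }
      indices[j + 1] = idx;
      values[j + 1] = val;
    }
    sorted = true;
  }
}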

On Thu, Oct 1, 2009 at 9:36 AM, Jake Mannix  wrote:

> Yeah, I added the "trying to find..." part of the debug output because I
> couldn't figure out what IntDoubleHash was "impossibly confused" about.
>
> Unfortunately, seeing what it was confused about only confused *me* about
> why it was impossible.
>
> On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning  wrote:
>
> > It indicates a bug in the code or in the writer's head.
> >
> > You are correct about the intent.  The default value (which should
> probably
> > just be 0) should be returned if the value is missing.
> >
> > > On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote:
> >
> > > I don't think this is "impossible confusion", it's just simply supposed
> > to
> > > return a 0 when it can't find the index, right?
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Jake Mannix
Yeah, I added the "trying to find..." part of the debug output because I
couldn't figure out what IntDoubleHash was "impossibly confused" about.

Unfortunately, seeing what it was confused about only confused *me* about
why it was impossible.

On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning  wrote:

> It indicates a bug in the code or in the writer's head.
>
> You are correct about the intent.  The default value (which should probably
> just be 0) should be returned if the value is missing.
>
> On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote:
>
> > I don't think this is "impossible confusion", it's just simply supposed
> to
> > return a 0 when it can't find the index, right?
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Ted Dunning
It indicates a bug in the code or in the writer's head.

You are correct about the intent.  The default value (which should probably
just be 0) should be returned if the value is missing.
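
(Sketching that contract: a minimal open-addressed int-to-double map whose
get() falls back to 0 for missing keys.  Hypothetical code, not the actual
IntDoubleHash implementation, and it assumes non-negative keys and spare
capacity.)

final class IntDoubleHashSketch {
  private static final int EMPTY = -1;   // sentinel; vector indices are >= 0
  private final int[] keys;
  private final double[] values;

  IntDoubleHashSketch(int capacity) {
    keys = new int[capacity];
    values = new double[capacity];
    java.util.Arrays.fill(keys, EMPTY);
  }

  void put(int key, double value) {
    int slot = key % keys.length;
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) % keys.length;   // linear probing
    }
    keys[slot] = key;
    values[slot] = value;
  }

  double get(int key) {
    int slot = key % keys.length;
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return values[slot];
      }
      slot = (slot + 1) % keys.length;
    }
    return 0.0;                          // missing key: default value, not an error
  }
}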

On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote:

> I don't think this is "impossible confusion", it's just simply supposed to
> return a 0 when it can't find the index, right?




-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Sean Owen
(PS yeah that was my fault for misreading the original message.)

On Thu, Oct 1, 2009 at 11:39 AM, Grant Ingersoll  wrote:
>
> On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:
>
>> I didn't say that equals() should ignore name, I said the opposite - equals
>> and hashCode() should *only* take into account the contents and the name,
>> and not implementation (which means that hashCode() needs to stay in one
>> place and not get monkeyed with in subclasses).
>>
>
> Right, that is my understanding


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-01 Thread Grant Ingersoll


On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:

I didn't say that equals() should ignore name, I said the opposite - equals
and hashCode() should *only* take into account the contents and the name,
and not implementation (which means that hashCode() needs to stay in one
place and not get monkeyed with in subclasses).



Right, that is my understanding


On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen  wrote:


No I don't hear anyone wanting to make equals() ignore the name.
(Otherwise, hashCode() would have to ignore it as well.)

JIRA also seems pretty laggy to me.

On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix wrote:

If we are not going to break the contract between equals() and hashCode(),
and we're having equals() *only* take into account the mathematical
contents and the name, then I'd say what we need to do is implement
hashCode() in a top level class like AbstractVector.

(Is something funny going on with JIRA?  Seems broken...)






Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-09-30 Thread Jake Mannix
I didn't say that equals() should ignore name, I said the opposite - equals
and hashCode() should *only* take into account the contents and the name,
and not implementation (which means that hashCode() needs to stay in one
place and not get monkeyed with in subclasses).
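
As a rough sketch of the shape of that (hypothetical names, not the actual
AbstractVector code):

public abstract class AbstractVectorSketch {

  public abstract String getName();
  public abstract int size();
  public abstract double getQuick(int index);   // value at index, 0.0 if unset

  // Defined once and final, so subclasses can't monkey with it: equality is
  // the name plus the mathematical contents, never the implementation class.
  @Override
  public final boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof AbstractVectorSketch)) return false;
    AbstractVectorSketch that = (AbstractVectorSketch) o;
    if (size() != that.size()) return false;
    String name = getName();
    String thatName = that.getName();
    if (name == null ? thatName != null : !name.equals(thatName)) return false;
    for (int i = 0; i < size(); i++) {
      if (getQuick(i) != that.getQuick(i)) return false;
    }
    return true;
  }

  // Consistent with equals(): depends only on the name and non-zero contents.
  @Override
  public final int hashCode() {
    int result = getName() == null ? 0 : getName().hashCode();
    result = 31 * result + size();
    for (int i = 0; i < size(); i++) {
      double v = getQuick(i);
      if (v != 0.0) {
        long bits = Double.doubleToLongBits(v);
        result = 31 * result + i + (int) (bits ^ (bits >>> 32));
      }
    }
    return result;
  }
}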

On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen  wrote:

> No I don't hear anyone wanting to make equals() ignore the name.
> (Otherwise, hashCode() would have to ignore it as well.)
>
> JIRA also seems pretty laggy to me.
>
> On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix 
> wrote:
> > If we are not going to break the contract between equals() and
> hashCode(),
> > and we're having equals() *only* take into account the mathematical
> contents
> > and the name, then I'd say what we need to do is implement hashCode() in
> a
> > top level class like AbstractVector.
> >
> > (Is something funny going on with JIRA?  Seems broken...)
>


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-09-30 Thread Jake Mannix
On Wed, Sep 30, 2009 at 1:16 PM, Grant Ingersoll wrote:

>
> On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:
>
>> Regarding having equals() effectively delegate to
>> getName().equals(other.getName()) && equivalent(other) means that we need
>> to be extra special careful about implementations of hashCode():
>>
>> If we are not going to break the contract between equals() and hashCode(),
>> and we're having equals() *only* take into account the mathematical
>> contents
>> and the name, then I'd say what we need to do is implement hashCode() in a
>> top level class like AbstractVector.
>>
>
> That is what is happening.


It is on trunk, but not in Ted's patch, which is what I'm currently looking
at, and I want to make sure I'm adhering to convention as I play with Ted's
impls.

  -jake


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-09-30 Thread Sean Owen
No I don't hear anyone wanting to make equals() ignore the name.
(Otherwise, hashCode() would have to ignore it as well.)

JIRA also seems pretty laggy to me.

On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix  wrote:
> If we are not going to break the contract between equals() and hashCode(),
> and we're having equals() *only* take into account the mathematical contents
> and the name, then I'd say what we need to do is implement hashCode() in a
> top level class like AbstractVector.
>
> (Is something funny going on with JIRA?  Seems broken...)


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-09-30 Thread Grant Ingersoll


On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:


Regarding having equals() effectively delegate to
getName().equals(other.getName()) && equivalent(other) means that we need
to be extra special careful about implementations of hashCode():

If we are not going to break the contract between equals() and hashCode(),
and we're having equals() *only* take into account the mathematical
contents and the name, then I'd say what we need to do is implement
hashCode() in a top level class like AbstractVector.


That is what is happening.



(Is something funny going on with JIRA?  Seems broken...)


Yes, there is something wrong.  Infra is aware of it.




 -jake

On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA) wrote:




  [
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956]

Sean Owen commented on MAHOUT-165:
--

Are my conclusions sound then:

We agree that equals() should be 'pretty strict'. The conventional Java
wisdom is that equals(), in fact, ought not return true for instances of
differing classes, unless you really know what you're doing. I guess we do.
:)

If the idea behind equals() is "do class-specific stuff, otherwise, check
names, and use equivalent() then", then we don't need strictEquivalence() --
where's it used?

(If I represented the logic correctly above -- is that as simple as we can
make it? seems a touch complex)

I am not sure anything is 'broken' in practice here but I sense it could be
simpler.



Using better primitives hash for sparse vector for performance gains

                 Key: MAHOUT-165
                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
    Affects Versions: 0.2
            Reporter: Shashikant Kore
            Assignee: Grant Ingersoll
             Fix For: 0.2

         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch,
                      mahout-165.patch

In SparseVector, we need a primitives hash map for index and values. The
present implementation of this hash map is not as efficient as some of the
other implementations in non-Apache projects.
In an experiment, I found that, for get/set operations, the primitive
hash of Colt performs an order of magnitude better than
OrderedIntDoubleMapping. For iteration it is 2x slower, though.
Using Colt in SparseVector improved performance of canopy generation. For
an experimental dataset, the current implementation takes 50 minutes. Using
Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
delay.





--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search



Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-09-30 Thread Jake Mannix
Regarding having equals() effectively delegate to
getName().equals(other.getName()) && equivalent(other) means that we need to
be extra special careful about implementations of hashCode() :

If we are not going to break the contract between equals() and hashCode(),
and we're having equals() *only* take into account the mathematical contents
and the name, then I'd say what we need to do is implement hashCode() in a
top level class like AbstractVector.

(Is something funny going on with JIRA?  Seems broken...)

  -jake

On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA)  wrote:

>
>[
> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956]
>
> Sean Owen commented on MAHOUT-165:
> --
>
> Are my conclusions sound then:
>
> We agree that equals() should be 'pretty strict'. The conventional Java
> wisdom is that equals(), in fact, ought not return true for instances of
> differing classes, unless you really know what you're doing. I guess we do.
> :)
>
> If the idea behind equals() is "do class-specific stuff, otherwise, check
> names, and use equivalent() then", then we don't need strictEquivalence() --
> where's it used?
>
> (If I represented the logic correctly above -- is that as simple as we can
> make it? seems a touch complex)
>
> I am not sure anything is 'broken' in practice here but I sense it could be
> simpler.
>
>
> > Using better primitives hash for sparse vector for performance gains
> > 
> >
> > Key: MAHOUT-165
> > URL: https://issues.apache.org/jira/browse/MAHOUT-165
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Matrix
> >Affects Versions: 0.2
> >Reporter: Shashikant Kore
> >Assignee: Grant Ingersoll
> > Fix For: 0.2
> >
> > Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch,
> mahout-165.patch
> >
> >
> > In SparseVector, we need a primitives hash map for index and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> > In an experiment, I found that, for get/set operations, the primitive
> hash of Colt performs an order of magnitude better than
> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> > Using Colt in SparseVector improved performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> delay.
>
>
>


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-08-29 Thread Ted Dunning
I have just written a replacement.  I will post a patch as soon as I get
some solid testing done.

On Sat, Aug 29, 2009 at 2:29 PM, Grant Ingersoll wrote:

> Right, Colt likely could be used depending on the package it comes from and
> as long as it doesn't have deps on the other packages.
>
> -Grant
>
>
> On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:
>
>> Trove is LGPL so we can't lift code.  Even linking can be tricky.
>>
>> On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:
>>
>>
>>>  [
>>> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904]
>>>
>>> Shashikant Kore commented on MAHOUT-165:
>>> 
>>>
>>> I'm fine with copying relevant classes from Colt or Trove.
>>>
>>> Please let me know your library of choice. I will create the patch and
>>> upload.
>>>
>>>
>>>
>>> Using better primitives hash for sparse vector for performance gains
>>>
>>>                 Key: MAHOUT-165
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Matrix
>>>    Affects Versions: 0.2
>>>            Reporter: Shashikant Kore
>>>             Fix For: 0.2
>>>
>>>         Attachments: mahout-165.patch
>>>
>>> In SparseVector, we need a primitives hash map for index and values. The
>>> present implementation of this hash map is not as efficient as some of
>>> the other implementations in non-Apache projects.
>>> In an experiment, I found that, for get/set operations, the primitive
>>> hash of Colt performs an order of magnitude better than
>>> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
>>> Using Colt in SparseVector improved performance of canopy generation. For
>>> an experimental dataset, the current implementation takes 50 minutes.
>>> Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction
>>> in the delay.
>>>
>>>
>>>
>>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-08-29 Thread Grant Ingersoll
Right, Colt likely could be used depending on the package it comes from and
as long as it doesn't have deps on the other packages.


-Grant

On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:


Trove is LGPL so we can't lift code.  Even linking can be tricky.

On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:




   [
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904]


Shashikant Kore commented on MAHOUT-165:


I'm fine with copying relevant classes from Colt or Trove.

Please let me know your library of choice. I will create the patch and
upload.




Using better primitives hash for sparse vector for performance gains

                Key: MAHOUT-165
                URL: https://issues.apache.org/jira/browse/MAHOUT-165
            Project: Mahout
         Issue Type: Improvement
         Components: Matrix
   Affects Versions: 0.2
           Reporter: Shashikant Kore
            Fix For: 0.2

        Attachments: mahout-165.patch

In SparseVector, we need a primitives hash map for index and values. The
present implementation of this hash map is not as efficient as some of the
other implementations in non-Apache projects.
In an experiment, I found that, for get/set operations, the primitive
hash of Colt performs an order of magnitude better than
OrderedIntDoubleMapping. For iteration it is 2x slower, though.
Using Colt in SparseVector improved performance of canopy generation. For
an experimental dataset, the current implementation takes 50 minutes. Using
Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
delay.






--
Ted Dunning, CTO
DeepDyve


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search



Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-08-29 Thread Ted Dunning
Trove is LGPL so we can't lift code.  Even linking can be tricky.

On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:

>
>[
> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904]
>
> Shashikant Kore commented on MAHOUT-165:
> 
>
> I'm fine with copying relevant classes from Colt or Trove.
>
> Please let me know your library of choice. I will create the patch and
> upload.
>
>
>
> > Using better primitives hash for sparse vector for performance gains
> > 
> >
> > Key: MAHOUT-165
> > URL: https://issues.apache.org/jira/browse/MAHOUT-165
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Matrix
> >Affects Versions: 0.2
> >Reporter: Shashikant Kore
> > Fix For: 0.2
> >
> > Attachments: mahout-165.patch
> >
> >
> > In SparseVector, we need a primitives hash map for index and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> > In an experiment, I found that, for get/set operations, the primitive
> hash of Colt performs an order of magnitude better than
> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> > Using Colt in SparseVector improved performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> delay.
>
>
>


-- 
Ted Dunning, CTO
DeepDyve