Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Colt did a nice job of this. Basically, their idea was to take various general functional patterns and allow the functions to be plugged in. Common patterns that are reasonable to include in such a framework include:

a) dot product as the aggregation of a pairwise function application (the normal dot product is this with aggregation = + and pairwise function = *)
b) element-wise transformation of all elements of a matrix or vector
c) aggregation of an element-wise transformation of a vector or matrix (sum, Frobenius norm, and squared Euclidean distance are examples of this)
d) row-wise or column-wise transformation of a matrix resulting in a list of objects
e) row-wise or column-wise aggregation of a matrix resulting in a vector (row sum or column sum is a good example here)
f) row-wise or column-wise aggregated outer products (normal matrix multiplication is an example where the product is dot and the aggregation is +)

Combining these operations with various view operators such as transpose, sub-matrix and diagonal allows various other interesting operators to be implemented. Trace, for example, becomes m.diagonal().aggregate(Functions.plus). The result is a very powerful API that has enormous expressivity. It is also, unfortunately, essentially opaque to most users, so lots of short-cut and convenience functions are important.

On Thu, Oct 1, 2009 at 10:53 AM, Jake Mannix wrote:
> Of course, to do this right in Mahout, we'd probably need to have some way
> of telling the Vector instance what to use for its norm (so it knows
> whether to cache its L^2 norm, L^p norm, or some other inner product
> applied to itself).
>
> Which gets me thinking: maybe we should have an InnerProduct interface,
> similar to DistanceMeasure, which defined how to compute dot(). As it is,
> we basically assume in our API that while you may want norm(int p) for
> various p, and you may want different DistanceMeasure impls of various
> types, we say that dot() is always the Euclidean dot().
>
> Should that be pluggable as well?

--
Ted Dunning, CTO
DeepDyve
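The plug-in-function pattern described here can be sketched in a few lines. The names below (DenseVec, assign, aggregate) are illustrative stand-ins, not Colt's or Mahout's actual classes; Colt's real API uses types like DoubleDoubleFunction for the same idea:

```java
// Toy sketch of the "plug in the function" pattern: one aggregate()
// loop expresses dot product, squared distance, etc., depending on
// which functions are supplied. Names here are hypothetical.
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

class DenseVec {
  final double[] v;

  DenseVec(double... v) { this.v = v; }

  // pattern (b): element-wise transformation of all elements
  DenseVec assign(DoubleUnaryOperator f) {
    for (int i = 0; i < v.length; i++) v[i] = f.applyAsDouble(v[i]);
    return this;
  }

  // pattern (a): aggregation of a pairwise function application;
  // dot product is aggregate(+, *), squared Euclidean distance is
  // aggregate(+, (x, y) -> (x - y) * (x - y))
  double aggregate(DoubleBinaryOperator combine, DoubleBinaryOperator map,
                   DenseVec other) {
    double acc = map.applyAsDouble(v[0], other.v[0]);
    for (int i = 1; i < v.length; i++)
      acc = combine.applyAsDouble(acc, map.applyAsDouble(v[i], other.v[i]));
    return acc;
  }

  public static void main(String[] args) {
    DenseVec a = new DenseVec(1, 2, 3);
    DenseVec b = new DenseVec(4, 5, 6);
    // dot product via the generic aggregate: 1*4 + 2*5 + 3*6
    System.out.println(a.aggregate((x, y) -> x + y, (x, y) -> x * y, b)); // 32.0
  }
}
```

As the message says, such an API is powerful but opaque, which is why convenience methods like dot() would still wrap these generic loops.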
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
On Thu, Oct 1, 2009 at 10:10 AM, Ted Dunning wrote:
> Btw... the other thing that the HashVector does better is inserts. The
> sorted vector could do much better on average if it deferred sorting until
> an access or iteration was done. Even iteration doesn't necessarily need
> sorting, but it could be seen as part of the contract.

Iteration doesn't *need* sorting, but I agree that it should be part of the contract, because doing fast dot products of two OrderedIntDoublePair instances really needs ordering, so you can just walk them both in parallel.

In decomposer, I ended up having an ImmutableSparseVector subclass which did the sorting in the constructor (if it wasn't already sorted) and then didn't allow further inserts/deletes/modifications.

Speaking of other little niceties: caching the norm is something I found pretty useful for a vector, along with keeping track of a boolean "isDirty" which lets you know whether you've invalidated it and will need to recalculate on the next call to norm().

Of course, to do this right in Mahout, we'd probably need to have some way of telling the Vector instance what to use for its norm (so it knows whether to cache its L^2 norm, L^p norm, or some other inner product applied to itself).

Which gets me thinking: maybe we should have an InnerProduct interface, similar to DistanceMeasure, which defined how to compute dot(). As it is, we basically assume in our API that while you may want norm(int p) for various p, and you may want different DistanceMeasure impls of various types, we say that dot() is always the Euclidean dot(). Should that be pluggable as well?
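The cached-norm/isDirty idea, combined with the pluggable InnerProduct interface floated here, might look roughly like this (a sketch under assumed names, not Mahout's actual API):

```java
// Sketch of a vector that caches its norm and tracks an "isDirty"
// flag, with the norm's inner product supplied as a pluggable
// strategy. All names are hypothetical.
class CachingVector {
  interface InnerProduct {
    double dot(double[] a, double[] b);
  }

  static final InnerProduct EUCLIDEAN = (a, b) -> {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  };

  private final double[] values;
  private final InnerProduct ip;
  private double cachedNorm;
  private boolean isDirty = true;   // no valid cached norm yet

  CachingVector(InnerProduct ip, double... values) {
    this.ip = ip;
    this.values = values;
  }

  void set(int i, double v) {
    values[i] = v;
    isDirty = true;                 // any mutation invalidates the cache
  }

  double norm() {
    if (isDirty) {                  // recompute only when invalidated
      cachedNorm = Math.sqrt(ip.dot(values, values));
      isDirty = false;
    }
    return cachedNorm;
  }

  public static void main(String[] args) {
    CachingVector v = new CachingVector(EUCLIDEAN, 3.0, 4.0);
    System.out.println(v.norm());   // 5.0, computed once and cached
    v.set(1, 0.0);
    System.out.println(v.norm());   // 3.0, recomputed after the dirty set
  }
}
```

Swapping in a different InnerProduct changes what norm() caches, which is exactly the pluggability being asked about.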
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Btw... the other thing that the HashVector does better is inserts. The sorted vector could do much better on average if it deferred sorting until an access or iteration was done. Even iteration doesn't necessarily need sorting, but it could be seen as part of the contract.

On Thu, Oct 1, 2009 at 9:36 AM, Jake Mannix wrote:
> Yeah, I added the "trying to find..." part of the debug output because I
> couldn't figure out what IntDoubleHash was "impossibly confused" about.
>
> Unfortunately, seeing what it was confused about only confused *me* about
> why it was impossible.
>
> On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning wrote:
> > It indicates a bug in the code or in the writer's head.
> >
> > You are correct about the intent. The default value (which should
> > probably just be 0) should be returned if the value is missing.
> >
> > On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote:
> > > I don't think this is "impossible confusion", it's just simply supposed
> > > to return a 0 when it can't find the index, right?

--
Ted Dunning, CTO
DeepDyve
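The deferred-sorting idea might look something like the toy sketch below: set() is a cheap append, and the parallel index/value arrays are only sorted when an ordered operation (like the merge-walk dot product mentioned above) actually happens. This is illustrative only, and assumes no index is set twice:

```java
// Hypothetical sparse vector with lazy sorting: inserts are O(1)
// appends; the sort is deferred until an ordered access is needed.
import java.util.Arrays;

class LazySortedSparseVector {
  private int[] indices = new int[8];
  private double[] values = new double[8];
  private int size = 0;
  private boolean sorted = true;

  void set(int index, double value) {
    if (size == indices.length) {
      indices = Arrays.copyOf(indices, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    indices[size] = index;
    values[size] = value;
    size++;
    sorted = false;                 // defer the sort
  }

  private void ensureSorted() {
    if (sorted) return;
    // insertion sort of the parallel arrays; a real implementation
    // would sort (index, value) pairs with something faster
    for (int i = 1; i < size; i++) {
      int idx = indices[i];
      double val = values[i];
      int j = i - 1;
      while (j >= 0 && indices[j] > idx) {
        indices[j + 1] = indices[j];
        values[j + 1] = values[j];
        j--;
      }
      indices[j + 1] = idx;
      values[j + 1] = val;
    }
    sorted = true;
  }

  // the fast dot product that makes ordering part of the contract:
  // walk both sorted index lists in parallel
  double dot(LazySortedSparseVector other) {
    ensureSorted();
    other.ensureSorted();
    double sum = 0;
    int i = 0, j = 0;
    while (i < size && j < other.size) {
      if (indices[i] == other.indices[j]) sum += values[i++] * other.values[j++];
      else if (indices[i] < other.indices[j]) i++;
      else j++;
    }
    return sum;
  }

  public static void main(String[] args) {
    LazySortedSparseVector a = new LazySortedSparseVector();
    a.set(3, 2.0);
    a.set(1, 1.0);                  // out of order; sort is deferred
    LazySortedSparseVector b = new LazySortedSparseVector();
    b.set(1, 4.0);
    b.set(3, 0.5);
    System.out.println(a.dot(b));   // 1*4 + 2*0.5 = 5.0
  }
}
```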
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Yeah, I added the "trying to find..." part of the debug output because I couldn't figure out what IntDoubleHash was "impossibly confused" about.

Unfortunately, seeing what it was confused about only confused *me* about why it was impossible.

On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning wrote:
> It indicates a bug in the code or in the writer's head.
>
> You are correct about the intent. The default value (which should probably
> just be 0) should be returned if the value is missing.
>
> On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote:
> > I don't think this is "impossible confusion", it's just simply supposed
> > to return a 0 when it can't find the index, right?
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
It indicates a bug in the code or in the writer's head. You are correct about the intent. The default value (which should probably just be 0) should be returned if the value is missing. On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) wrote: > I don't think this is "impossible confusion", it's just simply supposed to > return a 0 when it can't find the index, right? -- Ted Dunning, CTO DeepDyve
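The intended contract being agreed on here -- get() on a missing index simply returns the default value 0 rather than being "impossibly confused" -- is easy to see in a toy open-addressing hash. This is an illustrative sketch, not the actual IntDoubleHash; it assumes non-negative keys and does no resizing:

```java
// Toy open-addressing int -> double hash with linear probing.
// get() on a missing key falls off the probe chain at a FREE slot
// and returns the default value 0.0 -- never an error state.
class IntDoubleOpenHash {
  private static final int FREE = -1;   // sentinel; assumes keys >= 0
  private final int[] keys;
  private final double[] vals;

  IntDoubleOpenHash(int capacity) {     // assumes capacity > entry count
    keys = new int[capacity];
    vals = new double[capacity];
    java.util.Arrays.fill(keys, FREE);
  }

  void put(int key, double value) {
    int slot = key % keys.length;
    while (keys[slot] != FREE && keys[slot] != key)
      slot = (slot + 1) % keys.length;  // linear probing
    keys[slot] = key;
    vals[slot] = value;
  }

  double get(int key) {
    int slot = key % keys.length;
    while (keys[slot] != FREE) {
      if (keys[slot] == key) return vals[slot];
      slot = (slot + 1) % keys.length;
    }
    return 0.0;                         // missing index -> default 0
  }

  public static void main(String[] args) {
    IntDoubleOpenHash h = new IntDoubleOpenHash(16);
    h.put(7, 3.5);
    System.out.println(h.get(7));       // 3.5
    System.out.println(h.get(42));      // 0.0, not an exception
  }
}
```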
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
(PS yeah that was my fault for misreading the original message.)

On Thu, Oct 1, 2009 at 11:39 AM, Grant Ingersoll wrote:
>
> On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:
>
>> I didn't say that equals() should ignore name, I said the opposite -
>> equals and hashCode() should *only* take into account the contents and
>> the name, and not implementation (which means that hashCode() needs to
>> stay in one place and not get monkeyed with in subclasses).
>
> Right, that is my understanding
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:
> I didn't say that equals() should ignore name, I said the opposite - equals
> and hashCode() should *only* take into account the contents and the name,
> and not implementation (which means that hashCode() needs to stay in one
> place and not get monkeyed with in subclasses).

Right, that is my understanding

On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen wrote:
> No I don't hear anyone wanting to make equals() ignore the name.
> (Otherwise, hashCode() would have to ignore it as well.)
>
> JIRA also seems pretty laggy to me.
>
> On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix wrote:
> > If we are not going to break the contract between equals() and
> > hashCode(), and we're having equals() *only* take into account the
> > mathematical contents and the name, then I'd say what we need to do is
> > implement hashCode() in a top level class like AbstractVector.
> >
> > (Is something funny going on with JIRA? Seems broken...)
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
I didn't say that equals() should ignore name, I said the opposite - equals and hashCode() should *only* take into account the contents and the name, and not implementation (which means that hashCode() needs to stay in one place and not get monkeyed with in subclasses).

On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen wrote:
> No I don't hear anyone wanting to make equals() ignore the name.
> (Otherwise, hashCode() would have to ignore it as well.)
>
> JIRA also seems pretty laggy to me.
>
> On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix wrote:
> > If we are not going to break the contract between equals() and
> > hashCode(), and we're having equals() *only* take into account the
> > mathematical contents and the name, then I'd say what we need to do is
> > implement hashCode() in a top level class like AbstractVector.
> >
> > (Is something funny going on with JIRA? Seems broken...)
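The convention being argued for -- hashCode() implemented once, final, in the abstract base, depending only on the name and the (index, value) contents, never on the concrete implementation class -- might be sketched like this (illustrative names, not Mahout's actual AbstractVector):

```java
// Sketch: hashCode() lives in one place and cannot be "monkeyed with"
// in subclasses, so a sparse and a dense vector with the same name and
// contents hash identically. Names here are hypothetical.
abstract class AbstractVectorSketch {
  abstract String getName();
  abstract int size();
  abstract double getQuick(int index);

  @Override
  public final int hashCode() {       // final: one implementation for all
    int result = getName() == null ? 0 : getName().hashCode();
    for (int i = 0; i < size(); i++) {
      double v = getQuick(i);
      if (v != 0.0) {                 // skip zeros so sparse and dense agree
        long bits = Double.doubleToLongBits(v);
        result = 31 * result + i;
        result = 31 * result + (int) (bits ^ (bits >>> 32));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // two different "implementations" of the same named vector [1, 0, 2]
    AbstractVectorSketch dense = new AbstractVectorSketch() {
      private final double[] v = {1, 0, 2};
      String getName() { return "v"; }
      int size() { return v.length; }
      double getQuick(int i) { return v[i]; }
    };
    AbstractVectorSketch sparse = new AbstractVectorSketch() {
      String getName() { return "v"; }
      int size() { return 3; }
      double getQuick(int i) { return i == 0 ? 1 : i == 2 ? 2 : 0; }
    };
    System.out.println(dense.hashCode() == sparse.hashCode());   // true
  }
}
```

An equals() written against the same name + contents would then keep the equals/hashCode contract regardless of which Vector subclass is on either side.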
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
On Wed, Sep 30, 2009 at 1:16 PM, Grant Ingersoll wrote:
>
> On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:
>
>> Regarding having equals() effectively delegate to
>> getName().equals(other.getName()) && equivalent(other) means that we need
>> to be extra special careful about implementations of hashCode():
>>
>> If we are not going to break the contract between equals() and
>> hashCode(), and we're having equals() *only* take into account the
>> mathematical contents and the name, then I'd say what we need to do is
>> implement hashCode() in a top level class like AbstractVector.
>
> That is what is happening.

It is on trunk, but not in Ted's patch, which is what I'm currently looking at, and I want to make sure I'm adhering to convention as I play with Ted's impls.

-jake
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
No I don't hear anyone wanting to make equals() ignore the name. (Otherwise, hashCode() would have to ignore it as well.) JIRA also seems pretty laggy to me. On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix wrote: > If we are not going to break the contract between equals() and hashCode(), > and we're having equals() *only* take into account the mathematical contents > and the name, then I'd say what we need to do is implement hashCode() in a > top level class like AbstractVector. > > (Is something funny going on with JIRA? Seems broken...)
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:
> Regarding having equals() effectively delegate to
> getName().equals(other.getName()) && equivalent(other) means that we need
> to be extra special careful about implementations of hashCode():
>
> If we are not going to break the contract between equals() and hashCode(),
> and we're having equals() *only* take into account the mathematical
> contents and the name, then I'd say what we need to do is implement
> hashCode() in a top level class like AbstractVector.

That is what is happening.

> (Is something funny going on with JIRA? Seems broken...)

Yes, there is something wrong. Infra is aware of it.

> -jake
>
> On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956 ]
>>
>> Sean Owen commented on MAHOUT-165:
>> --
>>
>> Are my conclusions sound then:
>>
>> We agree that equals() should be 'pretty strict'. The conventional Java
>> wisdom is that equals(), in fact, ought not return true for instances of
>> differing classes, unless you really know what you're doing. I guess we
>> do. :)
>>
>> If the idea behind equals() is "do class-specific stuff, otherwise, check
>> names, and use equivalent() then", then we don't need strictEquivalence()
>> -- where's it used?
>>
>> (If I represented the logic correctly above -- is that as simple as we
>> can make it? seems a touch complex)
>>
>> I am not sure anything is 'broken' in practice here but I sense it could
>> be simpler.
>>
>>> Using better primitives hash for sparse vector for performance gains
>>>
>>> Key: MAHOUT-165
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-165
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Components: Matrix
>>> Affects Versions: 0.2
>>> Reporter: Shashikant Kore
>>> Assignee: Grant Ingersoll
>>> Fix For: 0.2
>>>
>>> Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>>>
>>> In SparseVector, we need a primitives hash map for the indices and
>>> values. The present implementation of this hash map is not as efficient
>>> as some of the other implementations in non-Apache projects.
>>>
>>> In an experiment, I found that, for get/set operations, the primitive
>>> hash of Colt performs an order of magnitude better than
>>> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
>>>
>>> Using Colt in SparseVector improved the performance of canopy
>>> generation. For an experimental dataset, the current implementation
>>> takes 50 minutes. Using Colt reduces this duration to 19-20 minutes.
>>> That's a 60% reduction in the delay.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Regarding having equals() effectively delegate to getName().equals(other.getName()) && equivalent(other) means that we need to be extra special careful about implementations of hashCode():

If we are not going to break the contract between equals() and hashCode(), and we're having equals() *only* take into account the mathematical contents and the name, then I'd say what we need to do is implement hashCode() in a top level class like AbstractVector.

(Is something funny going on with JIRA? Seems broken...)

-jake

On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956 ]
>
> Sean Owen commented on MAHOUT-165:
>
> Are my conclusions sound then:
>
> We agree that equals() should be 'pretty strict'. The conventional Java
> wisdom is that equals(), in fact, ought not return true for instances of
> differing classes, unless you really know what you're doing. I guess we
> do. :)
>
> If the idea behind equals() is "do class-specific stuff, otherwise, check
> names, and use equivalent() then", then we don't need strictEquivalence()
> -- where's it used?
>
> (If I represented the logic correctly above -- is that as simple as we can
> make it? seems a touch complex)
>
> I am not sure anything is 'broken' in practice here but I sense it could
> be simpler.
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
I have just written a replacement. I will post a patch as soon as I get some solid testing done.

On Sat, Aug 29, 2009 at 2:29 PM, Grant Ingersoll wrote:
> Right, Colt likely could be used depending on the package it comes from
> and as long as it doesn't have deps on the other packages.
>
> -Grant
>
> On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:
>
>> Trove is LGPL so we can't lift code. Even linking can be tricky.
>>
>> On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:
>>> [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904 ]
>>>
>>> Shashikant Kore commented on MAHOUT-165:
>>>
>>> I'm fine with copying relevant classes from Colt or Trove.
>>>
>>> Please let me know your library of choice. I will create the patch and
>>> upload.

--
Ted Dunning, CTO
DeepDyve
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Right, Colt likely could be used depending on the package it comes from and as long as it doesn't have deps on the other packages.

-Grant

On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:
> Trove is LGPL so we can't lift code. Even linking can be tricky.
>
> On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904 ]
>>
>> Shashikant Kore commented on MAHOUT-165:
>>
>> I'm fine with copying relevant classes from Colt or Trove.
>>
>> Please let me know your library of choice. I will create the patch and
>> upload.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Trove is LGPL so we can't lift code. Even linking can be tricky.

On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904 ]
>
> Shashikant Kore commented on MAHOUT-165:
>
> I'm fine with copying relevant classes from Colt or Trove.
>
> Please let me know your library of choice. I will create the patch and
> upload.

--
Ted Dunning, CTO
DeepDyve