Re: eclipse codestyle.xml?
Great -- this works now. Thanks! On Tue, Nov 24, 2009 at 10:20 AM, Grant Ingersoll wrote: > Actually, the Mahout wiki links are out of date. I'll update. >
[jira] Updated: (MAHOUT-206) Separate and clearly label different SparseVector implementations
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-206: --- Attachment: MAHOUT-206.patch This adds back SparseVector now as the parent of both sparse impls, and extending AbstractVector. There are certainly optimizations which can still be done, and again, no work has been done to dig through all the places where RandomAccessSparseVectors are created and choose whether it would be better to have SequentialAccessSparseVector. This patch also has the fix for MAHOUT-207. > Separate and clearly label different SparseVector implementations > - > > Key: MAHOUT-206 > URL: https://issues.apache.org/jira/browse/MAHOUT-206 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Affects Versions: 0.2 > Environment: all >Reporter: Jake Mannix >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-206.patch, MAHOUT-206.patch > > > Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array > impl of SparseVector for an OpenIntDoubleMap (hash-based) one. We actually > need both, as I think I've mentioned a gazillion times. > There was a patch, long ago, on MAHOUT-165, in which Ted had > OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that > effect), and neither of them are called SparseVector. I like this, because > it forces people to choose what kind of SparseVector they want (and they > should: sparse is an optimization, and the client should make a conscious > decision what they're optimizing for). > We could call them RandomAccessSparseVector and SequentialAccessSparseVector, > to be really obvious. > But really, the important part is we have both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
True, and that advantage rarely comes up. However, declaring abstract methods in an abstract class has exactly the same problem that adding interface methods does -- and I take it this is the heart of the problem -- all implementors get broken immediately. If you intend to add methods this way, neither one is any different; both have the same problem. Abstract classes afford the possibility of adding methods plus implementation without breaking anybody, so yes, I'm on board with abstract classes. But then that's no argument against an abstract class + interface, which would add a small bit of flexibility too. I don't feel strongly about it, but I do think the interface + abstract approach is more conventional than declaring APIs through abstract classes. On Tue, Nov 24, 2009 at 8:49 PM, Yonik Seeley wrote: > The only advantage an interface has over an abstract class is multiple > inheritance. > You can use abstract classes like interfaces: make it possible to > override all methods and avoid state unless absolutely needed for > back compat changes.
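The back-compat difference being debated here can be shown with a toy example (all names below are hypothetical, not Mahout's actual API): adding an abstract method to a published interface breaks every external implementor at compile time, while an abstract class can grow a method *with* an implementation that existing subclasses inherit for free.

```java
// Hypothetical sketch of the compatibility argument, not Mahout code.

// Version 1 shipped this interface; a user implemented it.
// If version 2 adds "double norm();" here, UserVec stops compiling.
interface Vec {
    double get(int i);
}

class UserVec implements Vec {
    public double get(int i) { return 0.0; }
}

// With an abstract base class, version 2 can add norm() together with
// a default implementation, and existing subclasses keep compiling:
abstract class AbstractVec {
    public abstract double get(int i);
    public abstract int size();

    // added in "version 2" -- subclasses inherit it unchanged
    public double norm() {
        double sum = 0;
        for (int i = 0; i < size(); i++) {
            sum += get(i) * get(i);
        }
        return Math.sqrt(sum);
    }
}

class UserVec2 extends AbstractVec {
    public double get(int i) { return i == 0 ? 3 : (i == 1 ? 4 : 0); }
    public int size() { return 2; }
}

public class CompatDemo {
    public static void main(String[] args) {
        System.out.println(new UserVec2().norm()); // 5.0
    }
}
```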
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
On Tue, Nov 24, 2009 at 3:30 PM, Sean Owen wrote: > I'm willing to be convinced but what is the theoretical argument for this? Rather the opposite - it's a practical argument gained through experience. > I am all for interfaces *and* abstract classes. You write the API in > terms of interfaces for maximum flexibility. You provide abstract > partial implementations for convenience. Everyone is happy. The only advantage an interface has over an abstract class is multiple inheritance. You can use abstract classes like interfaces: make it possible to override all methods and avoid state unless absolutely needed for back compat changes. -Yonik
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
On Nov 24, 2009, at 3:30 PM, Sean Owen wrote: > I'm willing to be convinced but what is the theoretical argument for this? See the Lucene archives: http://search.lucidimagination.com. There has been a lot of discussion on it. And I mean a lot. And then some. :-) Search for anything on interfaces, abstract classes or back compatibility. > > I am all for interfaces *and* abstract classes. You write the API in > terms of interfaces for maximum flexibility. You provide abstract > partial implementations for convenience. Everyone is happy. I agree. > > The best argument I've seen against it is that it can be overkill. In > super-performance-critical situations the dynamic dispatch overhead is > perhaps worth thinking about, but that's rare. What else? > > I take the point about interfaces changing, but this is significant > when you expect a lot of third-party implementers of your interfaces. > I don't think that is true here. I disagree here. In open source, you never know where the next good idea is coming from. We should just always assume they will change. > > On Tue, Nov 24, 2009 at 8:08 PM, Ted Dunning wrote: >> Yes. Interfaces are the problem that commons math have boxed themselves in >> with. The Hadoop crew (especially Doug C) are adamant about using as few >> interfaces as possible except as mixin signals and only in cases where the >> interface really is going to be very, very stable. >> >> Our vector interfaces are definitely not going to be that stable for quite a >> while. >> >> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix wrote: >> >>> Well we do use AbstractVector. Are you suggesting that we *not* have a >>> Vector interface >>> at all, and *only* have an abstract base class? Similarly for Matrix? >>> >>> -jake >>> >>> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning >>> wrote: >>> We should use abstract classes almost everywhere instead of interfaces to ease backward compatibility issues with user written extensions to >>> Vectors and Matrices. 
On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) wrote: > It seems like there is still some commonality between the two > implementations (size, cardinality, etc.) that I think it would be > worthwhile to keep SparseVector as an abstract class which the other >>> two > extend. > -- Ted Dunning, CTO DeepDyve >>> >> >> >> >> -- >> Ted Dunning, CTO >> DeepDyve >> -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
Yes, I have lived this pain for a long time with Lucene. Personally, though, a lot of the pain comes from a fairly strict back compatibility policy that to me isn't always well founded given the release cycle Lucene usually operates under. I've always wished there was a @introducing annotation for interfaces, such that you could tell people what is coming down the pike. I also often feel the right answer is a combination of both. New methods could be added on a new interface that is then applied to an Abstract class, thus it can be inherited by downstream implementors. People who don't inherit from the Abstract can choose to add the new interface if they see fit. For now, we don't have any back compat commitments. I think once we get to 0.9, we can decide on that. On Nov 24, 2009, at 3:21 PM, Jake Mannix wrote: > Oof. > > So you're arguing this as a temporary thing, until our interfaces stabilize? > It makes > unit testing much harder this way, but I guess I see the rationale. > > If we do this, we need to leave a lot out of that base class - there may be > some really > big differences in implementation of these classes (for example: distributed > / hdfs > backed matrices vs locally memory-resident ones), so very very little should > be > assumed in the base impl. I guess more can be done in the vector case, > however. > > -jake > > On Tue, Nov 24, 2009 at 12:08 PM, Ted Dunning wrote: > >> Yes. Interfaces are the problem that commons math have boxed themselves in >> with. The Hadoop crew (especially Doug C) are adamant about using as few >> interfaces as possible except as mixin signals and only in cases where the >> interface really is going to be very, very stable. >> >> Our vector interfaces are definitely not going to be that stable for quite >> a >> while. >> >> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix >> wrote: >> >>> Well we do use AbstractVector. 
Are you suggesting that we *not* have a >>> Vector interface >>> at all, and *only* have an abstract base class? Similarly for Matrix? >>> >>> -jake >>> >>> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning >>> wrote: >>> We should use abstract classes almost everywhere instead of interfaces >> to ease backward compatibility issues with user written extensions to >>> Vectors and Matrices. On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) < >> j...@apache.org > wrote: > It seems like there is still some commonality between the two > implementations (size, cardinality, etc.) that I think it would be > worthwhile to keep SparseVector as an abstract class which the other >>> two > extend. > -- Ted Dunning, CTO DeepDyve >>> >> >> >> >> -- >> Ted Dunning, CTO >> DeepDyve >> -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
I'm willing to be convinced but what is the theoretical argument for this? I am all for interfaces *and* abstract classes. You write the API in terms of interfaces for maximum flexibility. You provide abstract partial implementations for convenience. Everyone is happy. The best argument I've seen against it is that it can be overkill. In super-performance-critical situations the dynamic dispatch overhead is perhaps worth thinking about, but that's rare. What else? I take the point about interfaces changing, but this is significant when you expect a lot of third-party implementers of your interfaces. I don't think that is true here. On Tue, Nov 24, 2009 at 8:08 PM, Ted Dunning wrote: > Yes. Interfaces are the problem that commons math have boxed themselves in > with. The Hadoop crew (especially Doug C) are adamant about using as few > interfaces as possible except as mixin signals and only in cases where the > interface really is going to be very, very stable. > > Our vector interfaces are definitely not going to be that stable for quite a > while. > > On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix wrote: > >> Well we do use AbstractVector. Are you suggesting that we *not* have a >> Vector interface >> at all, and *only* have an abstract base class? Similarly for Matrix? >> >> -jake >> >> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning >> wrote: >> >> > We should use abstract classes almost everywhere instead of interfaces to >> > ease backward compatibility issues with user written extensions to >> Vectors >> > and Matrices. >> > >> > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) > > >wrote: >> > >> > > It seems like there is still some commonality between the two >> > > implementations (size, cardinality, etc.) that I think it would be >> > > worthwhile to keep SparseVector as an abstract class which the other >> two >> > > extend. >> > > >> > >> > >> > >> > -- >> > Ted Dunning, CTO >> > DeepDyve >> > >> > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
Oof. So you're arguing this as a temporary thing, until our interfaces stabilize? It makes unit testing much harder this way, but I guess I see the rationale. If we do this, we need to leave a lot out of that base class - there may be some really big differences in implementation of these classes (for example: distributed / hdfs backed matrices vs locally memory-resident ones), so very very little should be assumed in the base impl. I guess more can be done in the vector case, however. -jake On Tue, Nov 24, 2009 at 12:08 PM, Ted Dunning wrote: > Yes. Interfaces are the problem that commons math have boxed themselves in > with. The Hadoop crew (especially Doug C) are adamant about using as few > interfaces as possible except as mixin signals and only in cases where the > interface really is going to be very, very stable. > > Our vector interfaces are definitely not going to be that stable for quite > a > while. > > On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix > wrote: > > > Well we do use AbstractVector. Are you suggesting that we *not* have a > > Vector interface > > at all, and *only* have an abstract base class? Similarly for Matrix? > > > > -jake > > > > On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning > > wrote: > > > > > We should use abstract classes almost everywhere instead of interfaces > to > > > ease backward compatibility issues with user written extensions to > > Vectors > > > and Matrices. > > > > > > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) < > j...@apache.org > > > >wrote: > > > > > > > It seems like there is still some commonality between the two > > > > implementations (size, cardinality, etc.) that I think it would be > > > > worthwhile to keep SparseVector as an abstract class which the other > > two > > > > extend. > > > > > > > > > > > > > > > > -- > > > Ted Dunning, CTO > > > DeepDyve > > > > > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
Yes. Interfaces are the problem that commons math have boxed themselves in with. The Hadoop crew (especially Doug C) are adamant about using as few interfaces as possible except as mixin signals and only in cases where the interface really is going to be very, very stable. Our vector interfaces are definitely not going to be that stable for quite a while. On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix wrote: > Well we do use AbstractVector. Are you suggesting that we *not* have a > Vector interface > at all, and *only* have an abstract base class? Similarly for Matrix? > > -jake > > On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning > wrote: > > > We should use abstract classes almost everywhere instead of interfaces to > > ease backward compatibility issues with user written extensions to > Vectors > > and Matrices. > > > > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) > >wrote: > > > > > It seems like there is still some commonality between the two > > > implementations (size, cardinality, etc.) that I think it would be > > > worthwhile to keep SparseVector as an abstract class which the other > two > > > extend. > > > > > > > > > > > -- > > Ted Dunning, CTO > > DeepDyve > > > -- Ted Dunning, CTO DeepDyve
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
Well we do use AbstractVector. Are you suggesting that we *not* have a Vector interface at all, and *only* have an abstract base class? Similarly for Matrix? -jake On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning wrote: > We should use abstract classes almost everywhere instead of interfaces to > ease backward compatibility issues with user written extensions to Vectors > and Matrices. > > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) >wrote: > > > It seems like there is still some commonality between the two > > implementations (size, cardinality, etc.) that I think it would be > > worthwhile to keep SparseVector as an abstract class which the other two > > extend. > > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
We should use abstract classes almost everywhere instead of interfaces to ease backward compatibility issues with user written extensions to Vectors and Matrices. On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) wrote: > It seems like there is still some commonality between the two > implementations (size, cardinality, etc.) that I think it would be > worthwhile to keep SparseVector as an abstract class which the other two > extend. > -- Ted Dunning, CTO DeepDyve
[jira] Created: (MAHOUT-209) Add aggregate() methods for Vector
Add aggregate() methods for Vector -- Key: MAHOUT-209 URL: https://issues.apache.org/jira/browse/MAHOUT-209 Project: Mahout Issue Type: Improvement Components: Matrix Environment: all Reporter: Jake Mannix Priority: Minor Fix For: 0.3 As discussed in MAHOUT-165 at some point, Vector (and Matrix, but let's put that on a separate ticket) could do with a nice exposure of methods like the following: {code} // this can get optimized, of course public double aggregate(Vector other, BinaryFunction aggregator, BinaryFunction combiner) { double result = 0; for (int i = 0; i < size(); i++) { result = aggregator.apply(result, combiner.apply(getQuick(i), other.getQuick(i))); } return result; } {code}
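A self-contained sketch of what such an aggregate() enables, using plain arrays in place of Vector and a hypothetical BinaryFunction interface (names are illustrative, not a committed API): with plus as the aggregator and times as the combiner, aggregate() computes a dot product.

```java
// Sketch of the proposed aggregate() idea over plain double arrays.
// BinaryFunction here stands in for whatever function interface is adopted.
interface BinaryFunction {
    double apply(double a, double b);
}

class AggregateDemo {
    static double aggregate(double[] self, double[] other,
                            BinaryFunction aggregator, BinaryFunction combiner) {
        double result = 0;
        for (int i = 0; i < self.length; i++) {
            // combine the paired elements, then fold into the running result
            result = aggregator.apply(result, combiner.apply(self[i], other[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {4, 5, 6};
        // dot product = aggregate with plus as aggregator, times as combiner
        double dot = aggregate(x, y, (a, b) -> a + b, (a, b) -> a * b);
        System.out.println(dot); // 32.0
    }
}
```

Other aggregations (L1 distance, max element-wise difference, etc.) fall out by swapping the two functions, which is the appeal of the method.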
[jira] Created: (MAHOUT-208) Vector.getLengthSquared() is dangerously optimized
Vector.getLengthSquared() is dangerously optimized -- Key: MAHOUT-208 URL: https://issues.apache.org/jira/browse/MAHOUT-208 Project: Mahout Issue Type: Bug Components: Matrix Affects Versions: 0.1 Environment: all Reporter: Jake Mannix Fix For: 0.3 SparseVector and DenseVector both cache the value of lengthSquared, so that subsequent calls to it get the cached value. Great, except the cache is never cleared - calls to set/setQuick or assign or anything all leave the cached value unchanged. Mutating method calls should reset the cached lengthSquared to -1 so that it gets recomputed. This could be a really nasty bug if hit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
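A minimal sketch of the bug and the proposed fix, with illustrative field and class names (not Mahout's actual ones): every mutator resets the cached value so the next getLengthSquared() recomputes it.

```java
// Sketch: cache lengthSquared, and invalidate it on every mutation.
// Names are illustrative; Mahout's real classes differ.
class CachedNormVector {
    private final double[] values;
    private double lengthSquared = -1; // -1 means "not cached"

    CachedNormVector(int size) { values = new double[size]; }

    public void set(int i, double v) {
        values[i] = v;
        lengthSquared = -1; // the fix: any mutation clears the cache
    }

    public double getLengthSquared() {
        if (lengthSquared >= 0) {
            return lengthSquared; // cached value is still valid
        }
        double sum = 0;
        for (double v : values) {
            sum += v * v;
        }
        lengthSquared = sum;
        return sum;
    }

    public static void main(String[] args) {
        CachedNormVector v = new CachedNormVector(2);
        v.set(0, 3);
        System.out.println(v.getLengthSquared()); // 9.0
        v.set(0, 4); // without the cache clear, this would still report 9.0
        System.out.println(v.getLengthSquared()); // 16.0
    }
}
```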
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782080#action_12782080 ] Grant Ingersoll commented on MAHOUT-207: All makes sense. Per the refactoring in MAHOUT-206, I think this argues even more for an abstract SparseVector implementation that can handle some of the common code. > AbstractVector.hashCode() should not care about the order of iteration over > elements > > > Key: MAHOUT-207 > URL: https://issues.apache.org/jira/browse/MAHOUT-207 > Project: Mahout > Issue Type: Bug > Components: Matrix >Affects Versions: 0.2 > Environment: all >Reporter: Jake Mannix >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-207.patch > > > As was discussed in MAHOUT-165, hashCode can be implemented simply like this: > {code} > public int hashCode() { > final int prime = 31; > int result = prime + ((name == null) ? 0 : name.hashCode()); > result = prime * result + size(); > Iterator iter = iterateNonZero(); > while (iter.hasNext()) { > Element ele = iter.next(); > long v = Double.doubleToLongBits(ele.get()); > result += (ele.index() * (int)(v^(v>>32))); > } > return result; > } > {code} > which obviates the need to sort the elements in the case of a random access > hash-based implementation. Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when > v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse > vectors which have zero elements returned from the iterateNonZero() iterator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782077#action_12782077 ] Jake Mannix commented on MAHOUT-207: We definitely should include the optimization that set(i, 0) works fast, but it's not trivial to do. Hash-based impls can, yes, remove the previous entry. But what do array-based sparse vectors do, exactly? A vector represented as { indices: int[] { 1, 3, 5 }, values: double[] { 1.1, 2.2, 3.3 } } gets a call to setQuick(3, 0), so that in the current implementation it becomes { indices: int[] { 1, 3, 5 }, values: double[] { 1.1, 0, 3.3 } }. What would you suggest be done to "remove" the entry efficiently? What would be useful is for all sparse vector impls to have a compact() method which removes all zeroes and finds a small-space representation. But either way, we should not *require* in the contract for sparse vectors that there be no zero values. Regarding the call to equivalent, you're right, that should not be done as a static method - if equivalent were non-static, it could be overridden and done smartly by subclasses (i.e. if one of the two being compared is a DenseVector or RandomAccessSparseVector, iterate over the other, checking the dense one via getQuick(); and if both are SequentialAccessSparseVectors, the two iterators can be walked in parallel). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
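The compact() idea suggested above can be sketched for a parallel-array representation like this (a hypothetical illustration, not Mahout code): one pass counts the surviving entries, a second pass copies them into right-sized arrays.

```java
import java.util.Arrays;

// Sketch of a compact() that strips explicit zeros from a
// parallel-array sparse representation. Illustrative only.
class CompactDemo {
    int[] indices;
    double[] values;

    CompactDemo(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    void compact() {
        // first pass: count nonzero entries
        int n = 0;
        for (double v : values) {
            if (v != 0.0) n++;
        }
        // second pass: copy survivors into right-sized arrays
        int[] newIdx = new int[n];
        double[] newVal = new double[n];
        int j = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                newIdx[j] = indices[i];
                newVal[j] = values[i];
                j++;
            }
        }
        indices = newIdx;
        values = newVal;
    }

    public static void main(String[] args) {
        // the example from the comment: setQuick(3, 0) left a zero behind
        CompactDemo v = new CompactDemo(new int[]{1, 3, 5}, new double[]{1.1, 0, 3.3});
        v.compact();
        System.out.println(Arrays.toString(v.indices)); // [1, 5]
        System.out.println(Arrays.toString(v.values));  // [1.1, 3.3]
    }
}
```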
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782064#action_12782064 ] Grant Ingersoll commented on MAHOUT-207: Aren't we losing some of the benefits of SparseVector with this explicit set-to-zero stuff (by having to call equivalent)? I've wondered in the past how a sparse implementation should handle something like setQuick(i, 0). One approach is to set it; the other is to ignore it and possibly remove any previous nonzero entry, right? Seems like there are tradeoffs with both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782054#action_12782054 ] Grant Ingersoll commented on MAHOUT-206: Jake, there's something weird in this patch in regards to SparseVector. It didn't delete the file, but instead left it empty. It seems like there is still some commonality between the two implementations (size, cardinality, etc.), so I think it would be worthwhile to keep SparseVector as an abstract class which the other two extend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782049#action_12782049 ] Jake Mannix commented on MAHOUT-207: It looks like the work done on MAHOUT-159 did not use addition as the combiner on the hashCode() for Elements of the vector, so the answer was iteration-order dependent. Unit tests also didn't check what happens if a sparse vector has explicitly zero values set on it, which should not affect the hashCode() or equals() computation (the latter was fine; the former was not!). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
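Both properties from the issue description can be checked in a small sketch: summing per-element hashes is commutative, so iteration order cannot matter; and an explicit 0.0 element contributes nothing, because Double.doubleToLongBits(0d) is 0L.

```java
// Sketch of the order-independent, zero-ignoring element hash from
// the MAHOUT-207 description, applied to parallel index/value arrays.
class HashDemo {
    static int elementHash(int index, double value) {
        long v = Double.doubleToLongBits(value);
        return index * (int) (v ^ (v >> 32));
    }

    static int hash(int[] indices, double[] values) {
        int result = 0;
        for (int i = 0; i < indices.length; i++) {
            result += elementHash(indices[i], values[i]); // addition: commutative
        }
        return result;
    }

    public static void main(String[] args) {
        int h1 = hash(new int[]{1, 5}, new double[]{2.5, -1.0});
        int h2 = hash(new int[]{5, 1}, new double[]{-1.0, 2.5}); // reversed order
        int h3 = hash(new int[]{1, 3, 5}, new double[]{2.5, 0.0, -1.0}); // explicit zero
        System.out.println(h1 == h2); // true
        System.out.println(h1 == h3); // true
    }
}
```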
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782051#action_12782051 ] Ted Dunning commented on MAHOUT-207: I think that 159 is superseded by this work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-206) Separate and clearly label different SparseVector implementations
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-206: -- Assignee: Grant Ingersoll -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-207: -- Assignee: Grant Ingersoll -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements
[ https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782041#action_12782041 ] Grant Ingersoll commented on MAHOUT-207: How does this all relate to https://issues.apache.org/jira/browse/MAHOUT-159? > AbstractVector.hashCode() should not care about the order of iteration over > elements > > > Key: MAHOUT-207 > URL: https://issues.apache.org/jira/browse/MAHOUT-207 > Project: Mahout > Issue Type: Bug > Components: Matrix >Affects Versions: 0.2 > Environment: all >Reporter: Jake Mannix > Fix For: 0.3 > > Attachments: MAHOUT-207.patch > > > As was discussed in MAHOUT-165, hashCode can be implemented simply like this: > {code} > public int hashCode() { > final int prime = 31; > int result = prime + ((name == null) ? 0 : name.hashCode()); > result = prime * result + size(); > Iterator<Element> iter = iterateNonZero(); > while (iter.hasNext()) { > Element ele = iter.next(); > long v = Double.doubleToLongBits(ele.get()); > result += (ele.index() * (int)(v^(v>>32))); > } > return result; > } > {code} > which obviates the need to sort the elements in the case of a random access > hash-based implementation. Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when > v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse > vectors which have zero elements returned from the iterateNonZero() iterator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow
[ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix resolved MAHOUT-201. Resolution: Duplicate The patch for this is currently included in the patch for MAHOUT-206 (which would have badly broken this patch anyways). > OrderedIntDoubleMapping / SparseVector is unnecessarily slow > > > Key: MAHOUT-201 > URL: https://issues.apache.org/jira/browse/MAHOUT-201 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Affects Versions: 0.2 >Reporter: Jake Mannix > Fix For: 0.3 > > Attachments: MAHOUT-201.patch > > > In the work on MAHOUT-165, I find that while Colt's sparse vector > implementation is great from a hashing standpoint (it's memory efficient and > fast for random-access), they don't provide anything like the > OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for > random access, or out-of-order modification, but is minimally sized > memory-wise and blazingly fast for doing read-only dot-products and vector > sums (where the latter is read-only on inputs, and is creating new output) > with each other, and with DenseVectors. > This line of thinking got me looking back at the current SparseVector > implementation we have in Mahout, because it *is* based on an int[] and a > double[]. Unfortunately, it's not at all optimized for the cases where it > can outperform all other sparse impls: > * it should override dot(Vector) and plus(Vector) to check whether the input > is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap > implementation of SparseVector, that case as well), and do specialized > operations here. 
> * even when those particular methods aren't being used, the AllIterator and > NonZeroIterator inner classes are very inefficient: > ** minor things like caching the values.numMappings() and values.getIndices > in final instance variables in the Iterators > ** the huge performance drain of Element.get() : {code} public double get() { > return values.get(ind); } {code}, which is implemented as a binary search > on index values array (the index of which was already known!) followed by the > array lookup > This last point is probably the entire reason why we've seen performance > problems with the SparseVector, as it's in both the NonZeroIterator and the > AllIterator, and so turned any O(numNonZeroElements) operations into > O(numNonZeroElements * log(numNonZeroElements)) (with some additional > constant factors for too much indirection thrown in for good measure). > Unless there is another JIRA ticket which has a patch fixing this which I > didn't notice, I can whip up a patch (I've got a similar implementation over > in decomposer I can pull stuff out of, although mine is simpler because it is > immutable, so it's not just a copy and paste). > We don't have any benchmarking code anywhere yet, do we? Is there a JIRA > ticket open for that already? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
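The inefficiency described above, and its fix, can be pictured with a bare-bones parallel-array sketch (hypothetical names, not the actual OrderedIntDoubleMapping API): when the loop already knows its offset into the arrays, each element read is a direct lookup, so a sweep stays O(numNonZeroElements) with no per-element binary search.

```java
public class ParallelArraySketch {
    private final int[] indices;   // sorted non-zero indices
    private final double[] values; // parallel values array

    ParallelArraySketch(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    // Dot product with a dense array: the offset i is carried by the loop
    // itself, so values[i] is read directly -- no binary search to
    // rediscover an offset that is already known.
    double dot(double[] dense) {
        final int numMappings = indices.length; // cached once, not re-queried
        double sum = 0.0;
        for (int i = 0; i < numMappings; i++) {
            sum += values[i] * dense[indices[i]];
        }
        return sum;
    }

    public static void main(String[] args) {
        ParallelArraySketch v =
            new ParallelArraySketch(new int[]{0, 3}, new double[]{2.0, 4.0});
        System.out.println(v.dot(new double[]{1.0, 0.0, 0.0, 0.5})); // 4.0
    }
}
```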
Re: Moving ahead to Hadoop 0.22
Well, it is wrong at some level and it will become more and more wrong. I have heard from Chris Wenzel that the cost of moving post 19 was pretty high. It would be good to do that when we can do it whole-heartedly. (what is the situation with 21?) On Tue, Nov 24, 2009 at 1:23 AM, Sean Owen wrote: > My alternative is to go back to 0.19, which isn't the end of the world, > but feels wrong. > -- Ted Dunning, CTO DeepDyve
[jira] Updated: (MAHOUT-206) Separate and clearly label different SparseVector implementations
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-206: --- Attachment: MAHOUT-206.patch Patch renames the hash-based SparseVector to RandomAccessSparseVector, and adds back the old form of SparseVector (backed by OrderedIntDoubleMapping), now named SequentialAccessSparseVector. Unit tests were added covering both forms of sparse vector. BUT, I have not gone through all the places where we use SparseVector instances and deliberately picked SequentialAccess or RandomAccess for the sparse impl. All usage is done as RandomAccessSparseVector (i.e. by the refactor-rename on trunk of SparseVector to RandomAccessSparseVector), and we will need to intentionally swap in SequentialAccessSparseVector in the primarily immutable/read-only case. > Separate and clearly label different SparseVector implementations > - > > Key: MAHOUT-206 > URL: https://issues.apache.org/jira/browse/MAHOUT-206 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Affects Versions: 0.2 > Environment: all >Reporter: Jake Mannix > Fix For: 0.3 > > Attachments: MAHOUT-206.patch > > > Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array > impl of SparseVector for an OpenIntDoubleMap (hash-based) one. We actually > need both, as I think I've mentioned a gazillion times. > There was a patch, long ago, on MAHOUT-165, in which Ted had > OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that > effect), and neither of them are called SparseVector. I like this, because > it forces people to choose what kind of SparseVector they want (and they > should: sparse is an optimization, and the client should make a conscious > decision what they're optimizing for). > We could call them RandomAccessSparseVector and SequentialAccessSparseVector, > to be really obvious. > But really, the important part is we have both. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
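The access-pattern tradeoff behind the two names can be sketched with plain JDK stand-ins (the Mahout classes themselves are not used here): hash-based storage gives cheap random get/set, while sorted parallel arrays give fast read-only sequential sweeps.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseAccessPatterns {
    // Read-only sequential sweep over a parallel values array: the access
    // pattern SequentialAccessSparseVector is meant to optimize.
    static double sumOfSquares(double[] values) {
        double sum = 0.0;
        for (double v : values) {   // cache-friendly linear scan
            sum += v * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Random access / out-of-order mutation: hash-based storage, the
        // access pattern RandomAccessSparseVector is meant to optimize.
        Map<Integer, Double> hashBacked = new HashMap<>();
        hashBacked.put(123456, 2.5);                      // O(1) expected
        double x = hashBacked.getOrDefault(123456, 0.0);  // O(1) expected

        // Sequential form: sorted indices with a parallel values array.
        int[] indices = {3, 17, 123456};
        double[] values = {1.0, -2.0, 2.5};
        System.out.println(x);                     // 2.5
        System.out.println(sumOfSquares(values));  // 11.25
    }
}
```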
Re: eclipse codestyle.xml?
Actually, the Mahout wiki links are out of date. I'll update. On Nov 24, 2009, at 10:19 AM, Grant Ingersoll wrote: > Hmm, weird. They must have gotten lost when the ASF upgraded MoinMoin. They > are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute > > On Nov 24, 2009, at 10:11 AM, Drew Farris wrote: > >> Hi All, >> >> On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The >> link at the bottom of the page to the eclipse codestyle.xml for >> Mahout's coding conventions seems to be broken. >> >> Does anyone have a codestyle.xml for eclipse available? >> >> Thanks, >> >> Drew > >
Re: eclipse codestyle.xml?
We updated the lucene ones during apache con - this should work though! On Tue, Nov 24, 2009 at 4:19 PM, Grant Ingersoll wrote: > Hmm, weird. They must have gotten lost when the ASF upgraded MoinMoin. They > are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute > > On Nov 24, 2009, at 10:11 AM, Drew Farris wrote: > >> Hi All, >> >> On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The >> link at the bottom of the page to the eclipse codestyle.xml for >> Mahout's coding conventions seems to be broken. >> >> Does anyone have a codestyle.xml for eclipse available? >> >> Thanks, >> >> Drew > > >
Re: eclipse codestyle.xml?
Hmm, weird. They must have gotten lost when the ASF upgraded MoinMoin. They are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute On Nov 24, 2009, at 10:11 AM, Drew Farris wrote: > Hi All, > > On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The > link at the bottom of the page to the eclipse codestyle.xml for > Mahout's coding conventions seems to be broken. > > Does anyone have a codestyle.xml for eclipse available? > > Thanks, > > Drew
eclipse codestyle.xml?
Hi All, On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, the link at the bottom of the page to the eclipse codestyle.xml for Mahout's coding conventions seems to be broken. Does anyone have a codestyle.xml for eclipse available? Thanks, Drew
[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions
[ https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781896#action_12781896 ] Grant Ingersoll commented on MAHOUT-204: Yeah, go ahead and submit the patch, then do the formatting. > Better integration of Mahout matrix capabilities with Colt Matrix additions > --- > > Key: MAHOUT-204 > URL: https://issues.apache.org/jira/browse/MAHOUT-204 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.3 >Reporter: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-204-author-cleanup.patch > > > Per MAHOUT-165, we need to refactor the matrix package structures a bit to be > more coherent and clean. For instance, there are two levels of matrix > packages now, so those should be rectified. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781858#action_12781858 ] Sean Owen commented on MAHOUT-103: -- Yes, this is basically item-based recommendation. With some superficial changes, it would exactly fit that model. Co-occurrence here is like a similarity metric, which is ultimately used as a weighting. Canonically this value would be in [-1,1], and you can easily map [1,...) into that range of course. Next you're sort of estimating preferences when you add up co-occurrence values. Canonically, you'd be doing a weighted average over M1 - M3. This is the same thing -- you're just not dividing by 3. The result is conceptually the same, though different approaches would yield slightly different results. I'm not necessarily suggesting you change the algorithm. At the same time I am also about to implement this very same thing -- the more 'canonical' form, to go hand-in-hand with the existing GenericItemBasedRecommender. I'd rather avoid duplication, and would like to make the Hadoop-based implementation as analogous to the existing code as possible. All I'd say is, go ahead, and maybe we look at generalizing it or shifting these concepts towards the canonical setup later. Look at GenericIRStatsEvaluator and subclass for precision-recall approaches. > Co-occurence based nearest neighbourhood > > > Key: MAHOUT-103 > URL: https://issues.apache.org/jira/browse/MAHOUT-103 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Reporter: Ankur >Assignee: Ankur > Attachments: jira-103.patch, mahout-103.patch.v1 > > > Nearest neighborhood type queries for users/items can be answered efficiently > and effectively by analyzing the co-occurrence model of a user/item w.r.t > another. This patch aims at providing an implementation for answering such > queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
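The "canonical" weighted-average form Sean describes can be sketched as follows. The particular map from a count in [1,...) into [-1,1] is an arbitrary illustrative choice (the thread only asserts that such a map is easy), and all names here are hypothetical.

```java
public class WeightedEstimate {
    // One possible monotone map from a co-occurrence count in [1, inf)
    // into [-1, 1]: count 1 -> 0.0, large counts -> ~1.0. Illustrative only.
    static double similarity(long cooccurrenceCount) {
        return 1.0 - 2.0 / (cooccurrenceCount + 1);
    }

    // Estimate a preference as a similarity-weighted average of known
    // preferences; dividing by the weight sum is the step the raw-sum
    // approach skips.
    static double estimate(double[] prefs, long[] counts) {
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < prefs.length; i++) {
            double w = similarity(counts[i]);
            num += w * prefs[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // The count-9 item dominates; the count-1 item gets weight 0.
        double est = estimate(new double[]{4.0, 2.0}, new long[]{9, 1});
        System.out.println(est); // 4.0
    }
}
```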
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781838#action_12781838 ] Ankur commented on MAHOUT-103: -- For this co-occurrence based recommender I am planning to write a set of map-reduce jobs that compute recommendations for users as follows: 1. Take the user's item history. 2. For each item in the history, fetch the top-N similar items (similarity based on co-occurrence). 3. Add the co-occurrence scores if an item appears more than once (NOT a weighted avg). Consider, e.g., a user history { M1, M2, M3 } and the top-3 similar movies for each of these along with co-occurrence scores M1 -> (A, 5), (B, 4), (C, 2) M2 -> (D, 6), (E, 3), (F, 2) M3 -> (G, 8), (C, 5), (B, 2) So the final scores in decreasing order will look like (G, 8) (C, 7) (B, 6) (D, 6) (A, 5) (E, 3) (F, 2) The idea I want to capture is that a candidate item gets a higher score if it's similar to more items in the user's click history. Do you see any issue with this approach? Any other better approach that you can think of? As for the precision-recall test, I am still trying to see how to divide the data into 'train' and 'test' for a fair evaluation. How do we do it in the existing code? > Co-occurence based nearest neighbourhood > > > Key: MAHOUT-103 > URL: https://issues.apache.org/jira/browse/MAHOUT-103 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Reporter: Ankur >Assignee: Ankur > Attachments: jira-103.patch, mahout-103.patch.v1 > > > Nearest neighborhood type queries for users/items can be answered efficiently > and effectively by analyzing the co-occurrence model of a user/item w.r.t > another. This patch aims at providing an implementation for answering such > queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
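The scoring step in Ankur's proposal (summing co-occurrence scores, not averaging them) can be sketched with JDK collections, using the exact numbers from the example:

```java
import java.util.Map;
import java.util.TreeMap;

public class CooccurrenceSum {
    // Sum co-occurrence scores per candidate item across the user's history;
    // an item similar to several history items accumulates a higher score.
    static Map<String, Double> score(String[][] similarWithScores) {
        Map<String, Double> scores = new TreeMap<>();
        for (String[] s : similarWithScores) {
            scores.merge(s[0], Double.parseDouble(s[1]), Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        String[][] similar = {
            {"A", "5"}, {"B", "4"}, {"C", "2"},  // top-3 for M1
            {"D", "6"}, {"E", "3"}, {"F", "2"},  // top-3 for M2
            {"G", "8"}, {"C", "5"}, {"B", "2"}   // top-3 for M3
        };
        // C appears for both M1 and M3 (2 + 5 = 7), B for M1 and M3 (4 + 2 = 6),
        // reproducing the ranking G=8, C=7, B=6, D=6, A=5, E=3, F=2 above.
        System.out.println(score(similar));
        // {A=5.0, B=6.0, C=7.0, D=6.0, E=3.0, F=2.0, G=8.0}
    }
}
```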
[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions
[ https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781834#action_12781834 ] Sean Owen commented on MAHOUT-204: -- I am happy to do this, but in order to avoid a massive merge conflict, go ahead and submit the above patch? > Better integration of Mahout matrix capabilities with Colt Matrix additions > --- > > Key: MAHOUT-204 > URL: https://issues.apache.org/jira/browse/MAHOUT-204 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.3 >Reporter: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-204-author-cleanup.patch > > > Per MAHOUT-165, we need to refactor the matrix package structures a bit to be > more coherent and clean. For instance, there are two levels of matrix > packages now, so those should be rectified. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Moving ahead to Hadoop 0.22
Early report from my testing is it's going to break a lot of our code, so, perhaps a bridge too far now. There's one reason I'm keen to move forward and it's not merely wanting to be on the bleeding edge, far from it. It's that 0.20.x does not work at all for my jobs. It runs into bugs that 0.20.1 does not appear to fix. So I'm kind of between a rock and hard place. My alternative is to go back to 0.19, which isn't the end of the world, but feels wrong. On Tue, Nov 24, 2009 at 8:25 AM, Robin Anil wrote: > 0.22 is supposed to stabilize the new mapreduce package and remove > support for the old mapred package. So I am guessing the reason for > moving to 0.22 would go side by side with conversion of all our > existing mapred programs to mapreduce ones. And I believe I read > somewhere that this is the api that is going to used as they move > closer to hadoop 1.0 > > Robin > > > > On Tue, Nov 24, 2009 at 1:10 PM, Ted Dunning wrote: >> I hate it when people on the commons list start whining about dropping >> support for java version -17 (joke, exaggeration warning), but here I am >> about to do it. >> >> I still run 0.19. A fair number of production clusters run 0.18.3. Many >> run 0.20 >> >> Hopefully when 1.0 comes out there will be a large move to that, but is it >> important yet to gauge where our largest audience is or even will be in, >> say, 6-12 months? >> >> I don't see a large benefit to moving to the latest just because it is the >> latest. I do see a benefit in moving to a future-proof API sooner rather >> than later, but is 0.22 realistic yet? >> >> On Mon, Nov 23, 2009 at 5:22 PM, Sean Owen wrote: >> >>> In my own client, I'm forging ahead to Hadoop 0.22 to see if it works >>> for me. If it's not too much change to update what we've got to 0.22 >>> (and the change is nonzero) and it works better for me, maybe we can >>> jump ahead to depend on it. >>> >> >> >> >> -- >> Ted Dunning, CTO >> DeepDyve >> >
Re: Moving ahead to Hadoop 0.22
0.22 is supposed to stabilize the new mapreduce package and remove support for the old mapred package. So I am guessing the reason for moving to 0.22 would go side by side with conversion of all our existing mapred programs to mapreduce ones. And I believe I read somewhere that this is the API that is going to be used as they move closer to Hadoop 1.0. Robin On Tue, Nov 24, 2009 at 1:10 PM, Ted Dunning wrote: > I hate it when people on the commons list start whining about dropping > support for java version -17 (joke, exaggeration warning), but here I am > about to do it. > > I still run 0.19. A fair number of production clusters run 0.18.3. Many > run 0.20 > > Hopefully when 1.0 comes out there will be a large move to that, but is it > important yet to gauge where our largest audience is or even will be in, > say, 6-12 months? > > I don't see a large benefit to moving to the latest just because it is the > latest. I do see a benefit in moving to a future-proof API sooner rather > than later, but is 0.22 realistic yet? > > On Mon, Nov 23, 2009 at 5:22 PM, Sean Owen wrote: > >> In my own client, I'm forging ahead to Hadoop 0.22 to see if it works >> for me. If it's not too much change to update what we've got to 0.22 >> (and the change is nonzero) and it works better for me, maybe we can >> jump ahead to depend on it. >> > > > > -- > Ted Dunning, CTO > DeepDyve >