[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776921#action_12776921 ]

Sean Owen commented on MAHOUT-103:
----------------------------------

Re-post an updated patch and I'm happy to give my comments on it. The more the merrier. If it's basically sound I'd like to mention it in the forthcoming book, which I'm writing now.

I use the GroupLens, Jester, and Netflix data sets regularly. Indeed, just drop the rating. The framework can do this automatically too, if you like, in the DataModel.

Co-occurence based nearest neighbourhood
Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: jira-103.patch

Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t. another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
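For readers who want to see what "simple co-occurrence counts" could look like in code, here is a minimal in-memory sketch. It is not taken from jira-103.patch, which is not shown in this thread; the class name `CooccurrenceSketch`, the method names, and the use of `long` item IDs are all assumptions made for illustration. Ratings are dropped entirely, as suggested in the comment above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the MAHOUT-103 patch. */
final class CooccurrenceSketch {

  /** For every ordered item pair (a, b) with a != b, counts the users who have both items. */
  static Map<Long, Map<Long, Integer>> countPairs(List<Set<Long>> itemsPerUser) {
    Map<Long, Map<Long, Integer>> counts = new HashMap<>();
    for (Set<Long> items : itemsPerUser) {
      for (Long a : items) {
        for (Long b : items) {
          if (!a.equals(b)) {
            counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
          }
        }
      }
    }
    return counts;
  }

  /** Returns up to n items that co-occur most often with the given item, most frequent first. */
  static List<Long> nearest(Map<Long, Map<Long, Integer>> counts, long item, int n) {
    Map<Long, Integer> row = counts.getOrDefault(item, Map.of());
    List<Map.Entry<Long, Integer>> entries = new ArrayList<>(row.entrySet());
    // Sort by count, descending; break ties by item ID so the result is deterministic.
    entries.sort((x, y) -> {
      int byCount = Integer.compare(y.getValue(), x.getValue());
      return byCount != 0 ? byCount : Long.compare(x.getKey(), y.getKey());
    });
    List<Long> result = new ArrayList<>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

The nested loop is quadratic in the number of items per user, so a real implementation over large data would presumably do the counting as a map/reduce job; this sketch only shows the counting idea.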
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776924#action_12776924 ]

Sean Owen commented on MAHOUT-165:
----------------------------------

IntDoubleHash, right? We could look at that, but I thought the status here was that Colt worked just fine and was fast. Perhaps I'm missing something, but I don't see a remaining issue with using (part of) Colt. I strongly suspect we will benefit from not reinventing the wheel here, and that whatever we need can be done with Colt, plus perhaps some contributed changes, plus a custom implementation here and there.

+1 for Whatever Is Needed To Use Colt?

Using better primitives hash for sparse vector for performance gains
Key: MAHOUT-165
URL: https://issues.apache.org/jira/browse/MAHOUT-165
Project: Mahout
Issue Type: Improvement
Components: Matrix
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
Fix For: 0.3
Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch

In SparseVector, we need a primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.

Using Colt in SparseVector improved the performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this to 19-20 minutes, a reduction of roughly 60% in running time.
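To make the "primitives hash map for index and values" idea concrete, here is a small sketch of an open-addressing int-to-double map with linear probing. It is neither Colt's implementation nor Mahout's OrderedIntDoubleMapping; the class name `IntDoubleMapSketch`, the initial capacity, the hash mixing constant, and the 50% load limit are assumptions chosen for brevity. Removal is left out on purpose.

```java
import java.util.Arrays;

/** Hypothetical illustration; not Colt's hash map and not Mahout's OrderedIntDoubleMapping. */
final class IntDoubleMapSketch {
  private static final int FREE = -1; // sentinel: valid keys (vector indices) are >= 0
  private int[] keys = new int[16];   // capacity is always a power of two
  private double[] values = new double[16];
  private int size;

  IntDoubleMapSketch() {
    Arrays.fill(keys, FREE);
  }

  /** Finds the slot holding the key, or the free slot where it would be inserted. */
  private int slot(int key) {
    int mask = keys.length - 1;
    int h = key * 0x9E3779B9;          // cheap multiplicative scrambling of the index
    int i = (h ^ (h >>> 16)) & mask;
    while (keys[i] != FREE && keys[i] != key) {
      i = (i + 1) & mask;              // linear probing; terminates because load <= 50%
    }
    return i;
  }

  /** Returns the stored value, or 0.0 for absent entries, as a sparse vector would. */
  double get(int key) {
    if (key < 0) {
      return 0.0;
    }
    int i = slot(key);
    return keys[i] == key ? values[i] : 0.0;
  }

  void set(int key, double value) {
    if (key < 0) {
      throw new IllegalArgumentException("negative index: " + key);
    }
    int i = slot(key);
    if (keys[i] == FREE) {
      if (2 * (size + 1) > keys.length) {
        grow();
        i = slot(key);
      }
      keys[i] = key;
      size++;
    }
    values[i] = value;
  }

  int size() {
    return size;
  }

  /** Doubles the table and re-inserts every existing entry. */
  private void grow() {
    int[] oldKeys = keys;
    double[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new double[keys.length];
    Arrays.fill(keys, FREE);
    for (int j = 0; j < oldKeys.length; j++) {
      if (oldKeys[j] != FREE) {
        int i = slot(oldKeys[j]);
        keys[i] = oldKeys[j];
        values[i] = oldValues[j];
      }
    }
  }
}
```

Storing keys and values in parallel primitive arrays avoids boxing, which is the usual reason such maps beat general-purpose collections on get/set. Iterating a hash table has to skip free slots, which is one plausible reason an ordered mapping could iterate faster, though the thread does not say why iteration was slower in the experiment.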
Re: 0.2 status
Please use Decision Forests instead of Random Forests.

On Thu, Nov 12, 2009 at 9:01 AM, Robin Anil robin.a...@gmail.com wrote:

> Please edit/add stuff.
>
> Robin
>
> ==
>
> Apache Mahout 0.2 has been released and is now available for public download.
>
> Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license.
>
> Mahout is a machine learning library meant to scale to the size of data we manage today. Built on top of the powerful map/reduce paradigm of Apache Hadoop project, Mahout lets you run popular machine learning methods like clustering, collaborative filtering, classification over Terabytes of data over thousands of computers.
>
> The complete changelist can be found here: http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278
>
> New Mahout 0.2 features include
> - Major performance enhancements in Collaborative Filtering, Classification and Clustering
> - New: Latent Dirichlet Allocation (LDA) implementation for topic modelling
> - New: Frequent Itemset Mining for mining top-k patterns from a list of transactions
> - New: Random Forests implementation for Decision Tree classification (In Memory Partial Data)
> - New: HBase storage support for Naive Bayes model building and classification
> - New: Generation of vectors from Text documents for use with Mahout Algorithms
> - Performance improvements in various Vector implementations
> - Tons of bug fixes and code cleanup
>
> On Thu, Nov 12, 2009 at 9:06 AM, Grant Ingersoll gsing...@apache.org wrote:
>
>> Anyone care to write up a release announcement? Here's Solr's: http://lucene.grantingersoll.com/2009/11/10/apache-solr-1-4-0-offically-released/
>>
>> I've cleaned up the build quite a bit and am now testing preparing the artifacts w/ the much simpler build (no more installing third party libs, they are all up under o.a.mahout in the Maven repo). I'd like to have everything ready to go once the artifacts are put up for a vote.
>>
>> Thanks,
>> Grant
Re: 0.2 status
Adding and revising a little:

Apache Mahout 0.2 has been released and is now available for public download at http://www.apache.org/dyn/closer.cgi/lucene/mahout

Up-to-date Maven artifacts can be found in the Apache repository at https://repository.apache.org/content/repositories/releases/org/apache/mahout/

Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. http://www.apache.org/licenses/LICENSE-2.0

Mahout is a machine learning library meant to scale to the size of data we manage today. Built on top of the powerful map/reduce paradigm of Apache Hadoop project, Mahout lets you run popular machine learning methods like clustering, collaborative filtering, classification over Terabytes of data over thousands of computers.

- We may want to emphasize that using Mahout also makes sense for people who do not have clusters with thousands of nodes?

Mahout is a machine learning library meant to scale:
Scale in terms of community, to support anyone interested in using machine learning.
Scale in terms of business, by providing the library under a commercially friendly, free software license.
Scale in terms of computation, to the size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problems like clustering, collaborative filtering and classification over Terabytes of data over thousands of computers.

Implemented with scalability in mind, the latest release brings many performance optimizations so that the library performs well even in a single-node setup.

- As mentioned earlier by Grant, we do need performance benchmarks, at least for the next release, to prove that.
The complete changelist can be found here: http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

New Mahout 0.2 features include
- Major performance enhancements in Collaborative Filtering, Classification and Clustering
- New: Latent Dirichlet Allocation (LDA) implementation for topic modelling
- New: Frequent Itemset Mining for mining top-k patterns from a list of transactions
- New: Decision Forests implementation for Decision Tree classification (In Memory Partial Data)
- New: HBase storage support for Naive Bayes model building and classification
- New: Generation of vectors from Text documents for use with Mahout Algorithms
- Performance improvements in various Vector implementations
- Tons of bug fixes and code cleanup

Getting started: New to Mahout?
1) Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
2) Check out the Quick start: http://cwiki.apache.org/MAHOUT/quickstart.html
3) Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
4) Join the community by subscribing to mahout-u...@lucene.apache.org
5) Give back: http://www.apache.org/foundation/getinvolved.html
6) Consider adding yourself to the powered by Wiki page: http://cwiki.apache.org/MAHOUT/poweredby.html

For more information on Apache Mahout, see http://lucene.apache.org/mahout

Additional comment: I suppose I will copy this over to my personal blog once the release is out. I would like to invite those interested in or using Mahout to do so as well.
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776945#action_12776945 ]

Jake Mannix commented on MAHOUT-165:
------------------------------------

Well, I've always had good luck with Colt, but Ted at least seemed to feel that Colt was no longer state of the art. Maybe he can chime in and elaborate.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776939#action_12776939 ]

Ankur commented on MAHOUT-103:
------------------------------

> Re-post an updated patch

Sure, I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it

+10. The more people know about it, the better chances it has of being used :-)

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of keeping only those preferences whose ratings reach a threshold of 2.5 (or 2, to be more liberal).
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776951#action_12776951 ]

Sean Owen commented on MAHOUT-103:
----------------------------------

That last point is interesting. Another school of thought is that rating something, even negatively, suggests you have a closer association to that thing than to the millions of other things you've never heard of.

Let's say you rate Bach a 5, Brahms a 4 and Mendelssohn a 1.5. Would you rather recommend a Mendelssohn recording to this person, or death metal?

This is my understanding of the intuition I've gotten from Ted, and it seems to be borne out somewhat in practice: ratings carry a lot less information than one would think.

Well, it's obviously something one can evaluate within the framework, with the evaluator code, to decide for sure.
Re: 0.2 status
It all sounds fine to me. On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote: Adding and revising a little:
Re: Dependencies outside Maven central (Was: Oh joy)
That's weird; that is the default Maven repository, so I wouldn't think you would need to add it.

On Nov 12, 2009, at 5:30 AM, Isabel Drost wrote:

> On Wed, 11 Nov 2009 18:23:50 -0500 Grant Ingersoll gsing...@apache.org wrote:
>
>> https://issues.apache.org/jira/browse/MAHOUT-198 tracks this. I am committing now. Please check out, delete your ~/.m2/repository/org directory and try mvn clean install!
>
> Trashed my local maven repo, checked out and built again - it did not find the lucene-2.9.1 release. Adding the following repository to the pom fixed that problem for me:
>
> Index: maven/pom.xml
> ===================================================================
> --- maven/pom.xml (revision 835320)
> +++ maven/pom.xml (working copy)
> @@ -76,6 +76,12 @@
>        <layout>default</layout>
>      </repository>
>      <repository>
> +      <id>maven2-repository.maven.org</id>
> +      <name>Maven.org Repository for Maven</name>
> +      <url>http://repo1.maven.org/maven2</url>
> +      <layout>default</layout>
> +    </repository>
> +    <repository>
>        <id>Apache snapshots</id>
>        <url>http://people.apache.org/maven-snapshot-repository</url>
>        <snapshots>
>
> After that the build runs smoothly for me. Big thanks to you, Grant, for resolving all those issues.
>
> Isabel

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776962#action_12776962 ]

Sean Owen commented on MAHOUT-198:
----------------------------------

The only glitch I see is that the Java Mail 1.4 pom is invalid. It's a doc like this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://download.java.net/maven/1/javax.mail/poms/mail-1.4.pom">here</a>.</p>
<hr>
<address>Apache Server at maven-repository.dev.java.net Port 443</address>
</body></html>

Obviously that's not within our control. I tried manually copying the updated pom file, though I then still have problems with other dependencies. Anyone seeing this?

I wonder if there is a way to see why Java Mail is included as a dependency? We shouldn't have anything to do with it directly.

Cleanup pom, remove lib dependencies, etc.
Key: MAHOUT-198
URL: https://issues.apache.org/jira/browse/MAHOUT-198
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 0.2
Attachments: mahout-198.core-lib.patch

This patch cleans up the poms to not do install. It removes the core/lib directory. I have published the necessary artifacts to our Mahout Maven repo already, so they should be publicly available. See http://cwiki.apache.org/confluence/display/MAHOUT/ThirdPartyDependencies and http://www.lucidimagination.com/search/document/6f026182150f0f50/dependencies_outside_maven_central_was_oh_joy
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776966#action_12776966 ]

Ankur commented on MAHOUT-103:
------------------------------

In that case dropping ratings might not be such a good idea and may lead to bad results. Consider the following movies that a user might have seen, with their scores:

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the two sequels, the sequels will still get high similarity scores w.r.t. Matrix going purely by co-occurrence.

IMHO, that leaves us with the following 2 alternatives:
1. Add the ratings when counting co-occurrence and hope that better ones will stand out even if they co-occur less.
2. Apply a Re-scorer that re-ranks the similar items for a given item based on their average scores.

Point 1 is something I am thinking of trying out.
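One way to read alternative 1 above is to add up ratings instead of adding 1 for each co-occurrence. The sketch below shows that reading only; it is not the patch's code, and the class name `WeightedCooccurrenceSketch`, the choice to sum the rating of the co-occurring item, and the use of title strings as item IDs are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of rating-weighted co-occurrence; one possible reading of alternative 1. */
final class WeightedCooccurrenceSketch {

  /**
   * For each ordered pair (a, b) with a != b, sums the rating each user gave to b
   * over all users who rated both a and b.
   */
  static Map<String, Map<String, Double>> score(List<Map<String, Double>> ratingsPerUser) {
    Map<String, Map<String, Double>> scores = new HashMap<>();
    for (Map<String, Double> ratings : ratingsPerUser) {
      for (String a : ratings.keySet()) {
        for (Map.Entry<String, Double> b : ratings.entrySet()) {
          if (!a.equals(b.getKey())) {
            scores.computeIfAbsent(a, k -> new HashMap<>())
                .merge(b.getKey(), b.getValue(), Double::sum);
          }
        }
      }
    }
    return scores;
  }
}
```

As a made-up illustration: if two users rate Matrix 4.5, Matrix Reloaded 2.5 and Matrix Revolutions 2, and a third rates Matrix 5 and some other film 5, plain counts relative to Matrix are 2, 2 and 1, while the weighted scores are 5.0, 4.0 and 5.0. The well-liked film catches up despite co-occurring less, which is the effect hoped for in point 1. Whether this helps on real data is exactly what the thread leaves open.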
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776986#action_12776986 ]

Sean Owen commented on MAHOUT-103:
----------------------------------

What's the problem in this example? Two people that have both seen all three Matrix films are probably similar. All the more so if they've rated the first one highly and the other two poorly. You'd correctly identify them as similar with or without ratings here.

The issue, I suppose, comes up when you encounter someone who didn't like the first one and liked the other two (strange, I know). Without pref values, we'd draw the same conclusion -- they have some similarity. With pref values, most metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered to watch all three says much more about their similarities than the variance in ratings says about their differences. I'd still guess they're sorta-similar, and metrics without pref values would tend to draw the more correct conclusion.

Of course there's no one right answer, and we can easily construct situations where throwing out pref values indeed hurts the result. I'm only asserting that it's entirely possible, in real data sets, for ratings to *hurt* on the whole.

Let's start by adding the basic approach and then keep going to look at variations. I at least have some global knowledge of how the framework is set up and could help design in these variations in a way that's consistent with the framework.
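As a concrete example of a "metric without pref values", the Tanimoto (Jaccard) coefficient compares two users purely by the sets of items they have seen. The sketch below is an illustration under assumptions: the class name `TanimotoSketch` is made up, and the thread does not say which metric the framework would use here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a similarity that ignores preference values entirely. */
final class TanimotoSketch {

  /**
   * Size of the intersection divided by size of the union of the two item sets.
   * Returns 0.0 when both sets are empty, which is an arbitrary choice.
   */
  static double similarity(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    Set<Long> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    int union = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / union;
  }
}
```

Two users who have both seen exactly the three Matrix films get a similarity of 1.0 under this measure regardless of how they rated them, which matches the argument above that having watched all three says more than the ratings do.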
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776993#action_12776993 ]

Sean Owen commented on MAHOUT-198:
----------------------------------

OK, after another wipe of my .m2 directory, this went away. Then it complained about many missing artifacts, some quite basic-looking ones. I also added that repository stanza -- think we should check that in, cool? And ran back into the mail .pom issue.

After another wipe of .m2, I was back to just missing the Mahout 0.3 SNAPSHOT artifact. OK, that's normal, right? So I still have to mvn install rather than mvn compile. That works.
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777003#action_12777003 ]

Grant Ingersoll commented on MAHOUT-198:
----------------------------------------

Yep, we still require mvn install.

As for Java Mail, do we know which 3rd party lib has the dependency on Java Mail? We may need to put in an exclusion.
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777004#action_12777004 ]

Drew Farris commented on MAHOUT-198:
------------------------------------

From the output of 'mvn dependency:tree' it appears that the culprit is log4j 1.2.15.
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777005#action_12777005 ]

Grant Ingersoll commented on MAHOUT-198:
----------------------------------------

OK, I have a fix for the mail thing. Checking in shortly.
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777023#action_12777023 ]

Drew Farris commented on MAHOUT-198:
------------------------------------

Works fine for me.
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777050#action_12777050 ]

Ted Dunning commented on MAHOUT-165:
------------------------------------

My issues with Colt (which I used for quite some time) were probably either remediable or irrelevant.

The remediable problem was that the API was opaque for newcomers and very difficult to extend with new matrix implementations. If we take Colt as a starting point and fix some of the extension and opacity issues, then this problem goes away.

My second issue is that more modern libraries like MTJ can achieve about 4x the raw performance of Colt. As Grant rightly points out, that probably doesn't matter to us right away, since the goal here is scaling rather than raw hot-iron performance on a single box. Moreover, as Grant also points out, we will have a pluggable interface which should allow us to switch if the commons math guys ever come around.
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777082#action_12777082 ]

Jake Mannix commented on MAHOUT-165:
------------------------------------

Ok then, let's try out Colt, unless we have a more permissive policy in here about MTJ than the c-math guys have: they didn't want MTJ because using it required including a jar file of the output of f2j translations of some Fortran code... which is ok for us as long as it's apache-compatible, since we don't have the hard "no external dependencies" requirement that they have.

What Shashi wrote before was this, when he attached the modified colt jar:

> Jar for Colt after removing the LGPL code of hep.aida and the dependent classes. The classes in colt.matrix.* are removed as they require hep.aida.

I actually stripped the hep.aida.* dependencies out of even the colt.matrix.* classes in Colt on my local git repo, which keeps pretty much all of the functionality intact. I can make an updated patch which has the full source code for that, so that we can include it instead of just having a jar.

Do we want to try comparing both MTJ and Colt?

Also: do we think our linear API is complete enough to solidify on as a wrapper for whatever is plugged in underneath? Some of the changes which have been discussed in other tickets and on the list are

* pulling Writable off of the interface, so that not every impl is hooked into such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass to add that functionality

* the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and double aggregate(Vector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map) methods for abstracting away inner products and norms. Not necessary, but very easily implemented in AbstractVector so that nobody needs to worry about these methods if they don't like programming that way.
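To illustrate how the two proposed aggregate methods could abstract away inner products and norms, here is a sketch over plain double arrays. It uses the JDK's `DoubleBinaryOperator` and `DoubleUnaryOperator` as stand-ins for the BinaryDoubleFunction and UnaryFunction types named in the comment, whose actual definitions are not shown in this thread; the class name `AggregateSketch` and the treatment of empty vectors are likewise assumptions.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the aggregate idea over plain arrays; not Mahout's AbstractVector. */
final class AggregateSketch {

  /** Applies map to every element and folds the results with aggregator. */
  static double aggregate(double[] v, DoubleBinaryOperator aggregator, DoubleUnaryOperator map) {
    if (v.length == 0) {
      return 0.0; // assumption: an empty vector aggregates to 0
    }
    double result = map.applyAsDouble(v[0]);
    for (int i = 1; i < v.length; i++) {
      result = aggregator.applyAsDouble(result, map.applyAsDouble(v[i]));
    }
    return result;
  }

  /** Applies map to each pair of corresponding elements and folds the results with aggregator. */
  static double aggregate(double[] v, double[] other,
                          DoubleBinaryOperator aggregator, DoubleBinaryOperator map) {
    if (v.length != other.length) {
      throw new IllegalArgumentException("size mismatch");
    }
    if (v.length == 0) {
      return 0.0; // same assumption as above
    }
    double result = map.applyAsDouble(v[0], other[0]);
    for (int i = 1; i < v.length; i++) {
      result = aggregator.applyAsDouble(result, map.applyAsDouble(v[i], other[i]));
    }
    return result;
  }

  /** Inner product: aggregate with plus, map with times. */
  static double dot(double[] a, double[] b) {
    return aggregate(a, b, Double::sum, (x, y) -> x * y);
  }

  /** Euclidean norm: square root of the sum of squares. */
  static double norm2(double[] v) {
    return Math.sqrt(aggregate(v, Double::sum, x -> x * x));
  }
}
```

Other reductions fall out of the same two methods, for example the largest absolute value via `aggregate(v, Math::max, Math::abs)`.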
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777089#action_12777089 ] Ted Dunning commented on MAHOUT-165: bq. pulling Writable off of the interface, so that not every impl is hooked into such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass to add that functionality +1 Same thing should be done with row and column labels. Not sure how to handle matrices of indefinite dimension which are probably important for some of what we do. Perhaps just declare them as very, very large in a wrapper. bq. the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and double aggregate(Vector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map) methods for abstracting away inner products and norms. Not necessary, but very easily implemented in AbstractVector so that nobody needs to worry about these methods if they don't like programming that way. These are very handy function. Row and/or column aggregator functions are also important. Colt gets a big boost in speed by testing in the implementation for special combinations of these functional constructs. That lets it implement dot and sum with bespoke code and avoid the function call overhead (with associated risk of the JVM not in-lining enough). Another big change is that Colt makes extensive use of view semantics. I think that this is a really good idea, but it does differ a bit from what we have done so far. Using better primitives hash for sparse vector for performance gains Key: MAHOUT-165 URL: https://issues.apache.org/jira/browse/MAHOUT-165 Project: Mahout Issue Type: Improvement Components: Matrix Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.3 Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch In SparseVector, we need primitives hash map for index and values. 
The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. Using Colt in SparseVector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the delay. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
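To make the aggregate(...) idea from the comment above concrete, here is a minimal sketch of how an inner product and a 2-norm fall out of two generic aggregate methods. It is an illustration only: it uses plain double[] arrays and the java.util.function interfaces as stand-ins, and the class and method names (AggregateSketch, dot, norm2) are hypothetical, not Mahout's Vector, BinaryDoubleFunction or UnaryFunction types.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch; double[] stands in for a vector type.
final class AggregateSketch {

    // Map every element, then fold the mapped values with the aggregator.
    // Returns 0.0 for an empty vector (an assumption of this sketch).
    static double aggregate(double[] v, DoubleBinaryOperator aggregator, DoubleUnaryOperator map) {
        if (v.length == 0) {
            return 0.0;
        }
        double result = map.applyAsDouble(v[0]);
        for (int i = 1; i < v.length; i++) {
            result = aggregator.applyAsDouble(result, map.applyAsDouble(v[i]));
        }
        return result;
    }

    // Combine the two vectors element by element, then fold the combined values.
    static double aggregate(double[] a, double[] b,
                            DoubleBinaryOperator aggregator, DoubleBinaryOperator combiner) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        if (a.length == 0) {
            return 0.0;
        }
        double result = combiner.applyAsDouble(a[0], b[0]);
        for (int i = 1; i < a.length; i++) {
            result = aggregator.applyAsDouble(result, combiner.applyAsDouble(a[i], b[i]));
        }
        return result;
    }

    // Inner product: sum of element-wise products.
    static double dot(double[] a, double[] b) {
        return aggregate(a, b, Double::sum, (x, y) -> x * y);
    }

    // Euclidean norm: square root of the sum of squares.
    static double norm2(double[] v) {
        return Math.sqrt(aggregate(v, Double::sum, x -> x * x));
    }

    public static void main(String[] args) {
        System.out.println(dot(new double[] {1, 2, 3}, new double[] {4, 5, 6})); // 32.0
        System.out.println(norm2(new double[] {3, 4}));                          // 5.0
    }
}
```

The generic path pays one functional call per element. The special-casing described above for Colt would amount to recognising the sum/product combination inside aggregate and running a hand-written loop instead; that shortcut is not shown here.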
Re: 0.2 status
OK, I think the java mail thing is resolved. Let me try building the artifacts again. On Nov 12, 2009, at 6:22 AM, Sean Owen wrote: It all sounds fine to me. On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote: Adding and revising a little:
[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777173#action_12777173 ] Sean Owen commented on MAHOUT-198: -- I ran into one more unit test failure. It's due to directly comparing two double values and for some reason it fails only on the command line on my computer. I added an 'epsilon' param to the unit test and it's fine now. We should probably make that common practice wherever the test compares doubles, like I've done in my own unit tests, but it's pretty small. Shall I commit that change to name the standard repo with a repository tag? I needed that too. I'm still getting some weird build errors but guessing they are my own environment's fault. Cleanup pom, remove lib dependencies, etc. -- Key: MAHOUT-198 URL: https://issues.apache.org/jira/browse/MAHOUT-198 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-198.core-lib.patch This patch cleans up the poms to not do install. It removes the core/lib directory. I have published the necessary artifacts to our Mahout Maven repo already, so they should be publicly available. See http://cwiki.apache.org/confluence/display/MAHOUT/ThirdPartyDependencies and http://www.lucidimagination.com/search/document/6f026182150f0f50/dependencies_outside_maven_central_was_oh_joy
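For readers who have not hit this kind of failure, a small sketch of why exact comparison of doubles is fragile and what an 'epsilon' comparison looks like. The helper name approxEquals and the tolerance 1e-9 are assumptions for illustration; the failing Mahout test itself is not identified in the comment above. In JUnit the same idea is available as assertEquals(double expected, double actual, double delta).

```java
// Hypothetical sketch of an epsilon-based comparison for doubles.
final class DoubleCompareSketch {

    // True when the two values differ by no more than epsilon.
    // Comparisons involving NaN are always false.
    static boolean approxEquals(double expected, double actual, double epsilon) {
        return Math.abs(expected - actual) <= epsilon;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;

        // Exact equality fails because 0.1 and 0.2 have no exact binary representation.
        System.out.println(sum == 0.3);                    // false

        // A tolerance absorbs the rounding error.
        System.out.println(approxEquals(0.3, sum, 1e-9));  // true
    }
}
```

The tolerance has to be chosen per test: too tight and the test stays flaky, too loose and it stops catching real regressions.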
[VOTE] Release 0.2
Please vote on releasing the artifacts at: https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/ KEYS file is in the Mahout root trunk. Things to do before voting: 1. Download and verify signatures on all the artifacts. 2. Try out the tests, examples, etc. 3. Try it out in any apps that you have. 4. See the Apache pages on releases and see what else I'm missing. 5. Others?
Re: [VOTE] Release 0.2
Hmm, I'm on a Mac and running on the command line. What version of OS X and what JVM? On Nov 12, 2009, at 5:49 PM, Sean Owen wrote: I still see that test failure I mentioned, but only happens on the command line (and perhaps only on a Mac). It is to do with a double value being compared for exact equality. I fixed it but it's hardly a blocker. Otherwise +1 On Thu, Nov 12, 2009 at 9:58 PM, Grant Ingersoll gsing...@apache.org wrote: Please vote on releasing the artifacts at: https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/ KEYS file is in the Mahout root trunk. Things to do before voting: 1. Download and verify signatures on all the artifacts. 2. Try out the tests, examples, etc. 3. Try it out in any apps that you have. 4. See the Apache pages on releases and see what else I'm missing. 5. Others?
Re: [VOTE] Release 0.2
And which mvn? On Thu, Nov 12, 2009 at 5:07 PM, Grant Ingersoll gsing...@apache.org wrote: Hmm, I'm on a Mac and running on the command line. What version of OS X and what JVM? On Nov 12, 2009, at 5:49 PM, Sean Owen wrote: I still see that test failure I mentioned, but only happens on the command line (and perhaps only on a Mac). It is to do with a double value being compared for exact equality. I fixed it but it's hardly a blocker. Otherwise +1 On Thu, Nov 12, 2009 at 9:58 PM, Grant Ingersoll gsing...@apache.org wrote: Please vote on releasing the artifacts at: https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/ KEYS file is in the Mahout root trunk. Things to do before voting: 1. Download and verify signatures on all the artifacts. 2. Try out the tests, examples, etc. 3. Try it out in any apps that you have. 4. See the Apache pages on releases and see what else I'm missing. 5. Others? -- Ted Dunning, CTO DeepDyve
[jira] Updated: (MAHOUT-199) Parent POM missing in public maven repository
[ https://issues.apache.org/jira/browse/MAHOUT-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Friedrich updated MAHOUT-199: -- Description: I wanted to play with Taste and thus created a Maven project that referenced the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't build my project. I had to download the 0.1 release and install the parent POM in my local repository (cd mahout-0.1/maven && mvn install). Steps to reproduce: $ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo $ cd MahoutDemo $ vi MahoutDemo add dependencies listed below $ mvn package The dependency section I added: <dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-core</artifactId> <version>0.1</version> </dependency> Could you please deploy the parent POM? It would make it a lot easier to play with Mahout/Taste. Thanks in advance for your help! was: I wanted to play with Taste and thus created a Maven project that referenced the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't build my project. I had to download the 0.1 release and install the parent POM in my local repository (cd mahout-0.1/maven && mvn install). Steps to reproduce: $ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo $ cd MahoutDemo $ vi MahoutDemo # add dependencies listed below $ mvn package The dependency section I added: <dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-core</artifactId> <version>0.1</version> </dependency> Could you please deploy the parent POM? It would make it a lot easier to play with Mahout/Taste. Thanks in advance for your help! Parent POM missing in public maven repository - Key: MAHOUT-199 URL: https://issues.apache.org/jira/browse/MAHOUT-199 Project: Mahout Issue Type: Wish Affects Versions: 0.1 Environment: Maven 2.0.9 Reporter: Matthias Friedrich Priority: Minor I wanted to play with Taste and thus created a Maven project that referenced the mahout-core artifact. 
But since mahout-parent isn't deployed, I couldn't build my project. I had to download the 0.1 release and install the parent POM in my local repository (cd mahout-0.1/maven && mvn install). Steps to reproduce: $ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo $ cd MahoutDemo $ vi MahoutDemo add dependencies listed below $ mvn package The dependency section I added: <dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-core</artifactId> <version>0.1</version> </dependency> Could you please deploy the parent POM? It would make it a lot easier to play with Mahout/Taste. Thanks in advance for your help!