Re: Hadoop upgrade
D'oh! Thanks! On Mar 18, 2009, at 4:32 AM, Sean Owen wrote: (I upgraded to 0.19.1 last week.) On Tue, Mar 17, 2009 at 10:41 PM, Grant Ingersoll gsing...@apache.org wrote: OK, pending MAHOUT-110, I think we are good to go on the release. Not sure who volunteered to upgrade Hadoop, so go for it now, or it will wait until after 0.1.
[jira] Commented: (MAHOUT-110) Ant script for building Taste web app
[ https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682980#action_12682980 ] Sean Owen commented on MAHOUT-110: -- I say go for it. I will merge with my patch locally and see if there is anything left and take care of that tonight. I'm glad this works out, many thanks. Ant script for building Taste web app - Key: MAHOUT-110 URL: https://issues.apache.org/jira/browse/MAHOUT-110 Project: Mahout Issue Type: Task Components: Collaborative Filtering Affects Versions: 0.1 Reporter: Sean Owen Assignee: Sean Owen Fix For: 0.1 Attachments: AntScript.patch, MAHOUT-110-docs.patch, MAHOUT-110.patch, MAHOUT-110.patch Will attach patch after creating. This is a follow-up from a thread on mahout-dev. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-110) Ant script for building Taste web app
[ https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682983#action_12682983 ] Grant Ingersoll commented on MAHOUT-110: Will do. Thanks for the kick in the pants to get going on it. I like the mvn jetty:run-war a lot. Now I can demo Taste next week! Ant script for building Taste web app - Key: MAHOUT-110 URL: https://issues.apache.org/jira/browse/MAHOUT-110 Project: Mahout Issue Type: Task Components: Collaborative Filtering Affects Versions: 0.1 Reporter: Sean Owen Assignee: Sean Owen Fix For: 0.1 Attachments: AntScript.patch, MAHOUT-110-docs.patch, MAHOUT-110.patch, MAHOUT-110.patch Will attach patch after creating. This is a follow-up from a thread on mahout-dev. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-99. --- Resolution: Fixed Fix Version/s: 0.1 Committed revision 755548. Thanks! Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
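A note on the combiner point in the MAHOUT-99 description: the sketch below is not the committed patch, only a minimal illustration of why an aggregation keyed by cluster ID can be made safe when the combiner runs zero, one, or many times. The class name PartialCentroid and its methods are hypothetical. The idea is that every stage passes along a (sum, count) pair and the division happens once, in the reducer.

```java
import java.util.Arrays;

public class PartialCentroid {
  private final double[] sum;
  private long count;

  public PartialCentroid(int dimensions) {
    this.sum = new double[dimensions];
  }

  /** Mapper side: fold one point into the partial for its cluster ID. */
  public void addPoint(double[] point) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += point[i];
    }
    count++;
  }

  /**
   * Combiner or reducer side: adding sums and counts is associative and
   * commutative (up to floating-point rounding), so this step may run any
   * number of times, including zero, without changing the final mean.
   */
  public void merge(PartialCentroid other) {
    for (int i = 0; i < sum.length; i++) {
      sum[i] += other.sum[i];
    }
    count += other.count;
  }

  /** Reducer side only: divide once, after every partial has been merged. */
  public double[] centroid() {
    if (count == 0) {
      throw new IllegalStateException("no points were assigned to this cluster");
    }
    double[] mean = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      mean[i] = sum[i] / count;
    }
    return mean;
  }

  public static void main(String[] args) {
    double[][] points = {{1, 2}, {3, 4}, {5, 6}};

    // Path 1: no combiner, the reducer merges one partial per point.
    PartialCentroid direct = new PartialCentroid(2);
    for (double[] p : points) {
      PartialCentroid single = new PartialCentroid(2);
      single.addPoint(p);
      direct.merge(single);
    }

    // Path 2: a combiner pre-merges the first two points before the reducer.
    PartialCentroid combined = new PartialCentroid(2);
    combined.addPoint(points[0]);
    combined.addPoint(points[1]);
    PartialCentroid last = new PartialCentroid(2);
    last.addPoint(points[2]);
    combined.merge(last);

    System.out.println(Arrays.toString(direct.centroid()));
    System.out.println(Arrays.toString(combined.centroid()));
  }
}
```

Both paths in main yield the same centroid. A design that averaged inside the combiner would only be correct if the combiner ran exactly once per key, which Hadoop does not guarantee.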
[jira] Updated: (MAHOUT-111) Redirect Test output to file
[ https://issues.apache.org/jira/browse/MAHOUT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-111: --- Affects Version/s: (was: 0.2) 0.1 Redirect Test output to file Key: MAHOUT-111 URL: https://issues.apache.org/jira/browse/MAHOUT-111 Project: Mahout Issue Type: Improvement Affects Versions: 0.1 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial The tests are really verbose to std out. Have them direct their output to a file and only report pass/fail on std out. This should be a simple setting on the test plugin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-111) Redirect Test output to file
[ https://issues.apache.org/jira/browse/MAHOUT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-111. Resolution: Fixed Fix Version/s: 0.1 Fixed Redirect Test output to file Key: MAHOUT-111 URL: https://issues.apache.org/jira/browse/MAHOUT-111 Project: Mahout Issue Type: Improvement Affects Versions: 0.1 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Trivial Fix For: 0.1 The tests are really verbose to std out. Have them direct their output to a file and only report pass/fail on std out. This should be a simple setting on the test plugin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Concerns about Maven
On Mar 17, 2009, at 9:06 AM, Enis Soztutar wrote: -Grant Knowing nothing about the mahout build script(s), I think that having both ant and maven scripts might prove to be problematic. However keeping one module (taste) in ant will work. As a side note, we have discussed this same thing in Hadoop and we opted for ant+ivy. The build process is very complex for Hadoop, and there are some things that simply cannot be done with maven. ant+ivy works pretty well for us and we can generate pom files for deployment. Thanks, Enis. I did notice that Hadoop had started using Ivy. Do you know when Hadoop is going to start publishing its artifacts to the Maven repo? We are doing it now for Mahout (i.e. publishing the Hadoop artifacts, see http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/) but would really rather rely on Hadoop doing it. I would think Ivy would allow for this and it would make Hadoop adoption even easier. -Grant
Thoughts on ...
http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ -Grant
[jira] Resolved: (MAHOUT-110) Ant script for building Taste web app
[ https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-110. Resolution: Fixed Committed Ant script for building Taste web app - Key: MAHOUT-110 URL: https://issues.apache.org/jira/browse/MAHOUT-110 Project: Mahout Issue Type: Task Components: Collaborative Filtering Affects Versions: 0.1 Reporter: Sean Owen Assignee: Sean Owen Fix For: 0.1 Attachments: AntScript.patch, MAHOUT-110-docs.patch, MAHOUT-110.patch, MAHOUT-110.patch Will attach patch after creating. This is a follow-up from a thread on mahout-dev. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Concerns about Maven
Grant Ingersoll wrote: On Mar 17, 2009, at 9:06 AM, Enis Soztutar wrote: -Grant Knowing nothing about the mahout build script(s), I think that having both ant and maven scripts might prove to be problematic. However keeping one module (taste) in ant will work. As a side note, we have discussed this same thing in Hadoop and we opted for ant+ivy. The build process is very complex for Hadoop, and there are some things that simply cannot be done with maven. ant+ivy works pretty well for us and we can generate pom files for deployment. Thanks, Enis. I did notice that Hadoop had started using Ivy. Do you know when Hadoop is going to start publishing its artifacts to the Maven repo? We are doing it now for Mahout (i.e. publishing the Hadoop artifacts, see http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/) but would really rather rely on Hadoop doing it. I would think Ivy would allow for this and it would make Hadoop adoption even easier. -Grant The issue has been open for a while, but I'm afraid nobody has stepped up to add Maven deployment to the release procedure. Until the deployment is complete, you can use local deployment with ant maven-artifacts ; mvn install:install .
Dirichlet Job example
Hey Jeff, Is it appropriate to have a Job example like we do for k-means and some of the other clustering algorithms for dirichlet? I see you do have some type of UI in there, right? Are there directions somewhere for running the example? http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html just seems to show the output. -Grant
Re: Thoughts on ...
Interesting optimization. We can incorporate it by adding a centroid^2 argument to the DistanceMeasure interface and adjusting the affected clustering algorithms. All would benefit from this optimization. I will build a test to assess its impact and report. Jeff Grant Ingersoll wrote: http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ -Grant
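For readers who have not followed the link, the algebra behind a "centroid^2 argument" can be sketched as follows. This is an illustration only, not Mahout's DistanceMeasure API: the class CachedNormDistance and its method names are hypothetical, and the sparse point is modelled as a plain index-to-value map. It relies on the identity ||x - c||^2 = ||x||^2 - 2 (x . c) + ||c||^2, so with the squared centroid norm cached, each distance only touches the non-zero entries of the sparse point.

```java
import java.util.HashMap;
import java.util.Map;

public class CachedNormDistance {

  /** Squared length of a dense vector; computed once per centroid per iteration. */
  public static double squaredNorm(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += x * x;
    }
    return sum;
  }

  /** Squared length of a sparse vector stored as index -> value. */
  public static double squaredNorm(Map<Integer, Double> sparse) {
    double sum = 0.0;
    for (double x : sparse.values()) {
      sum += x * x;
    }
    return sum;
  }

  /**
   * Squared Euclidean distance using cached squared norms.
   * Only the non-zero entries of the point are visited for the dot product.
   */
  public static double squaredDistance(Map<Integer, Double> point, double pointNormSq,
                                       double[] centroid, double centroidNormSq) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : point.entrySet()) {
      dot += e.getValue() * centroid[e.getKey()];
    }
    return pointNormSq - 2.0 * dot + centroidNormSq;
  }

  public static void main(String[] args) {
    // Sparse point (1, 0, 0, 2) and dense centroid (0, 1, 0, 4).
    Map<Integer, Double> point = new HashMap<>();
    point.put(0, 1.0);
    point.put(3, 2.0);
    double[] centroid = {0.0, 1.0, 0.0, 4.0};

    double centroidNormSq = squaredNorm(centroid); // cached across all points
    double d = squaredDistance(point, squaredNorm(point), centroid, centroidNormSq);
    System.out.println(d);
  }
}
```

Because the result is computed as a difference of larger terms, it can lose precision or come out slightly negative for points very close to a centroid, so an implementation may want to clamp it at zero before taking a square root.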
Re: [jira] Reopened: (MAHOUT-99) Improving speed of KMeans
Did you reopen this issue because of this error? I just ran the example and it ran without error. Jeff
Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reopened MAHOUT-99: --- Hi Pallavi, I'm getting:
09/03/18 11:13:56 WARN mapred.LocalJobRunner: job_local_0001
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1938)
    at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
    at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:80)
    at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
when running http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.
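An aside on the shape of that trace: an index of -1 reaching String.substring usually means an indexOf lookup did not find the delimiter it expected, which would be consistent with the decoder being handed a string in a different format than it was written for. The sketch below is a guess at the general pattern, not the actual Cluster.decodeCluster code; the class DelimiterParsing and the "<id>:<payload>" format are hypothetical.

```java
public class DelimiterParsing {

  /**
   * Fragile: if ':' is missing, indexOf returns -1 and substring(0, -1)
   * throws StringIndexOutOfBoundsException with no hint about the input.
   */
  public static String idUnchecked(String encoded) {
    return encoded.substring(0, encoded.indexOf(':'));
  }

  /** Defensive: check the index and report the malformed input itself. */
  public static String idChecked(String encoded) {
    int sep = encoded.indexOf(':');
    if (sep < 0) {
      throw new IllegalArgumentException(
          "Expected '<id>:<payload>' but got: " + encoded);
    }
    return encoded.substring(0, sep);
  }

  public static void main(String[] args) {
    System.out.println(idChecked("C1:[1.0, 2.0]"));
    try {
      idUnchecked("a line in some other format");
    } catch (StringIndexOutOfBoundsException e) {
      // The exact message text differs between Java versions.
      System.out.println("unchecked parse failed: " + e.getMessage());
    }
  }
}
```

The checked variant fails just as early, but its message contains the offending input, which makes a format mismatch much quicker to spot in a job log.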
Re: Dirichlet Job example
Not only appropriate but essential. I will add a README file in the code and instructions in the wiki today. Jeff Grant Ingersoll wrote: Hey Jeff, Is it appropriate to have a Job example like we do for k-means and some of the other clustering algorithms for dirichlet? I see you do have some type of UI in there, right? Are there directions somewhere for running the example? http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html just seems to show the output. -Grant
Re: Dirichlet Job example
Yeah, I was wondering about that simple, but nice cluster-showing UI... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Grant Ingersoll gsing...@apache.org To: mahout-dev@lucene.apache.org Sent: Wednesday, March 18, 2009 12:01:28 PM Subject: Dirichlet Job example Hey Jeff, Is it appropriate to have a Job example like we do for k-means and some of the other clustering algorithms for dirichlet? I see you do have some type of UI in there, right? Are there directions somewhere for running the example? http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html just seems to show the output. -Grant
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
I'm running the example in Eclipse using the stand-alone mode in the hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in Eclipse. I cannot, however, get any hadoop stuff to work from the command line. Even though my JAVA_HOME environment is set to /Library/Java/Home and java -version yields: Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode) ... the hadoop build script and the start-all.sh commands all complain about class version errors. Can any other Mac users help me out? Jeff Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683077#action_12683077 ] Grant Ingersoll commented on MAHOUT-99: --- Yeah, what version of Hadoop are you running? I got it w/ 0.19.1, but maybe I didn't set something up right. {code} bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {code} Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.
mvn package tar file issue
Hi, Am I the only person getting the following after mvn package? [INFO] [ERROR] BUILD ERROR [INFO] [INFO] Failed to create assembly: Error creating assembly archive project: A tar file cannot include itself. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
On my Mac, I have: $ echo $JAVA_HOME /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home -Grant On Mar 18, 2009, at 2:10 PM, Jeff Eastman wrote: I'm running the example in Eclipse using the stand-alone mode in the hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in Eclipse. I cannot, however, get any hadoop stuff to work from the command line. Even though my JAVA_HOME environment is set to /Library/Java/Home and java -version yields: Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode) ... the hadoop build script and the start-all.sh commands all complain about class version errors. Can any other Mac users help me out? Jeff Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683077#action_12683077 ] Grant Ingersoll commented on MAHOUT-99: --- Yeah, what version of Hadoop are you running? I got it w/ 0.19.1, but maybe I didn't set something up right. {code} bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {code} Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683140#action_12683140 ] Grant Ingersoll commented on MAHOUT-99: --- I seem to recall hitting something similar before, let me poke around... Seems somewhat similar to the problems we were having on http://www.lucidimagination.com/search/document/31bd6ab8d94bb3e5/problems_with_kmeans_clustering#31bd6ab8d94bb3e5, but I'm not sure. Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: mvn package tar file issue
Yes, at the top. Bad? Doing it from core worked. How come it doesn't work from root and should it, at least for 0.2? Would be more intuitive, no? Otis - Original Message From: Grant Ingersoll gsing...@apache.org To: mahout-dev@lucene.apache.org Sent: Wednesday, March 18, 2009 2:19:29 PM Subject: Re: mvn package tar file issue Where are you running it? The top? On Mar 18, 2009, at 2:15 PM, Otis Gospodnetic wrote: Hi, Am I the only person getting the following after mvn package? [INFO] [ERROR] BUILD ERROR [INFO] [INFO] Failed to create assembly: Error creating assembly archive project: A tar file cannot include itself. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Taste: user's neighbours and their similarity
Hi, Is there a way to get a collection of neighbours for a given user? I'm referring to the same neighbour collection that recommendations are derived from. I didn't see a way, so I simply made NearestNUserNeighborhood.Estimator public (diff below), so I could do something like this:

  public Collection<SimilarUser> getHood(Object userID) throws TasteException {
    User theUser = recommender.getDataModel().getUser(userID);
    TopItems.Estimator<User> estimator = new NearestNUserNeighborhood.Estimator(similarity, theUser, minSimilarity);
    Collection<User> neighbors = hood.getUserNeighborhood(userID);
    Collection<SimilarUser> similarHood = new ArrayList<SimilarUser>(neighbors.size());
    System.out.println("Neighbors for user: " + userID + ": " + neighbors.size());
    for (User user : neighbors) {
      SimilarUser su = new SimilarUser(user, estimator.estimate(user));
      similarHood.add(su);
    }
    return similarHood;
  }

This gives me the needed collection: [SimilarUser[user:User[id:U2], similarity:0.7084]]

$ svn diff core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
Index: core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
===================================================================
--- core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java (revision 755664)
+++ core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java (working copy)
@@ -109,12 +109,12 @@
     return "NearestNUserNeighborhood";
   }

-  private static class Estimator implements TopItems.Estimator<User> {
+  public static class Estimator implements TopItems.Estimator<User> {
     private final UserSimilarity userSimilarityImpl;
     private final User theUser;
     private final double minSim;

-    private Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
+    public Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
       this.userSimilarityImpl = userSimilarityImpl;
       this.theUser = theUser;
       this.minSim = minSim;

Is there an existing way to get the neighbours + similarity information? If not, is the above change OK? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683170#action_12683170 ] Sean Owen commented on MAHOUT-103: --
1. How do you feel about, therefore, changing to use more abstract objects rather than, say, Click? These objects could be the existing ones, or modified or new ones. I think as you say the existing objects are about what is needed. That way the solution is that much more reusable. Same with the job -- the more it uses abstract/standard classes, the more reusable I think it looks.
2. Yeah the two interfaces are nearly identical: provide a method that takes two items as input and a numerical score as output. I suppose it just makes sense to use the existing ItemSimilarity interface in this section of the code.
3. Good question, here is my brief digression: The code was originally written with an on-line model in mind -- recommendations happen in real-time. Over time that has proved inefficient or impractical for large data sets, though it remains quite nice for small- to medium-size data sets. Hence I have attempted to preserve the real-time model at the core, and build a batch-oriented extension around it using Hadoop. The two are a bit separate, and that is fine. So in this section of the code, I don't mind attaching Hadoop-related jobs that are not intimately connected to the core code. I am trying to keep them as consistent as possible so that the original on-line and newer off-line models don't evolve into two separate worlds within this part of the code. To be specific... well I don't know, I don't have a problem with adding this job actually. Ideally we build a bit more around it: it takes as input the standard preference-file format as used by FileDataModel, and outputs a file format that can be read by a new ItemSimilarity implementation that would read and cache all these results. That would be a nice step towards integrating with the core code.
This is something I have been remiss in - I wrote a job to do the pre-computation of item-item diffs for slope one but never wrote an implementation of DiffStorage that would read this output and operate based on those results. This would close the loop. How about we make #3 my part of this issue, to complete the connection between this job and the core code a bit more? Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
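To make the "two items in, numerical score out" shape from point 2 concrete, here is a small in-memory sketch of co-occurrence counting. It is not the MAHOUT-103 patch and does not implement Taste's ItemSimilarity interface; the class CooccurrenceCounts, its methods, and the use of plain String item IDs are assumptions made for illustration. A batch job would produce the same counts offline, and an implementation like the one Sean describes would load and cache them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class CooccurrenceCounts {
  // counts.get(a).get(b) = number of users who have both item a and item b
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  /** Record every unordered pair of distinct items seen for one user. */
  public void addUserItems(Collection<String> items) {
    List<String> distinct = new ArrayList<>(new LinkedHashSet<>(items));
    for (int i = 0; i < distinct.size(); i++) {
      for (int j = i + 1; j < distinct.size(); j++) {
        increment(distinct.get(i), distinct.get(j));
        increment(distinct.get(j), distinct.get(i));
      }
    }
  }

  private void increment(String a, String b) {
    counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
  }

  /** Two items in, a numerical score out; 0 if the pair was never seen together. */
  public int cooccurrence(String a, String b) {
    Map<String, Integer> row = counts.get(a);
    if (row == null) {
      return 0;
    }
    Integer c = row.get(b);
    return c == null ? 0 : c;
  }

  public static void main(String[] args) {
    CooccurrenceCounts model = new CooccurrenceCounts();
    model.addUserItems(Arrays.asList("A", "B", "C"));
    model.addUserItems(Arrays.asList("A", "B"));
    model.addUserItems(Arrays.asList("B", "C"));
    System.out.println(model.cooccurrence("A", "B"));
    System.out.println(model.cooccurrence("A", "C"));
  }
}
```

Raw counts favour popular items, so a real similarity would probably normalise them in some way; that choice is outside what this thread specifies.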
Re: Taste: user's neighbours and their similarity
How about the method UserBasedRecommender.mostSimilarUsers()? or a bit more directly, UserNeighborhood.getUserNeighborhood()? (They are arguably kind of redundant but it's 'for historical reasons' and low on my list of design sins.) These in turn largely use TopItems.getTopUsers() and you apparently already see all this so: I suppose you are interested in the latter since it reports some measure of similarity as well as the users themselves. You want to just refactor getTopUsers() there so a version is also provided that gives you the SimilarUser objects instead of just the Users? OK by me and perhaps a bit more general than putting code in NearestNUserNeighborhood.

On Wed, Mar 18, 2009 at 9:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Is there a way to get a collection of neighbours for a given user? I'm referring to the same neighbour collection that recommendations are derived from. I didn't see a way, so I simply made NearestNUserNeighborhood.Estimator public (diff below), so I could do something like this:

  public Collection<SimilarUser> getHood(Object userID) throws TasteException {
    User theUser = recommender.getDataModel().getUser(userID);
    TopItems.Estimator<User> estimator = new NearestNUserNeighborhood.Estimator(similarity, theUser, minSimilarity);
    Collection<User> neighbors = hood.getUserNeighborhood(userID);
    Collection<SimilarUser> similarHood = new ArrayList<SimilarUser>(neighbors.size());
    System.out.println("Neighbors for user: " + userID + ": " + neighbors.size());
    for (User user : neighbors) {
      SimilarUser su = new SimilarUser(user, estimator.estimate(user));
      similarHood.add(su);
    }
    return similarHood;
  }

This gives me the needed collection: [SimilarUser[user:User[id:U2], similarity:0.7084]]

$ svn diff core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
Index: core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
===================================================================
--- core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java (revision 755664)
+++ core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java (working copy)
@@ -109,12 +109,12 @@
     return "NearestNUserNeighborhood";
   }

-  private static class Estimator implements TopItems.Estimator<User> {
+  public static class Estimator implements TopItems.Estimator<User> {
     private final UserSimilarity userSimilarityImpl;
     private final User theUser;
     private final double minSim;

-    private Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
+    public Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
       this.userSimilarityImpl = userSimilarityImpl;
       this.theUser = theUser;
       this.minSim = minSim;

Is there an existing way to get the neighbours + similarity information? If not, is the above change OK? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
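The refactoring floated in the reply, a top-N selection that returns users paired with their similarity rather than bare users, might look roughly like the sketch below. It does not use Taste's real TopItems, User, or SimilarUser classes: TopSimilarUsers, Scored, and the precomputed similarity map are hypothetical stand-ins so the example stays self-contained. It assumes Java 8 or later for the lambda syntax.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopSimilarUsers {

  /** Hypothetical pairing of a user ID with its similarity to the target user. */
  public static final class Scored {
    public final String userId;
    public final double similarity;

    public Scored(String userId, double similarity) {
      this.userId = userId;
      this.similarity = similarity;
    }

    @Override
    public String toString() {
      return userId + ":" + similarity;
    }
  }

  /**
   * Returns up to n users with similarity >= minSimilarity, most similar first.
   * NaN similarities (no overlap between users) are skipped.
   */
  public static List<Scored> topN(Map<String, Double> similarities, int n, double minSimilarity) {
    // Min-heap on similarity: the weakest kept neighbour is evicted first.
    PriorityQueue<Scored> heap =
        new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.similarity));
    for (Map.Entry<String, Double> e : similarities.entrySet()) {
      double sim = e.getValue();
      if (Double.isNaN(sim) || sim < minSimilarity) {
        continue;
      }
      heap.add(new Scored(e.getKey(), sim));
      if (heap.size() > n) {
        heap.poll();
      }
    }
    List<Scored> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingDouble((Scored s) -> s.similarity).reversed());
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> sims = new LinkedHashMap<>();
    sims.put("U2", 0.7084);
    sims.put("U3", 0.2);
    sims.put("U4", 0.9);
    sims.put("U5", Double.NaN);
    System.out.println(topN(sims, 2, 0.5));
  }
}
```

Keeping the score alongside the user during selection avoids recomputing each similarity afterwards, which is what the getHood() workaround in the quoted message has to do.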
Re: Packaging step taking forever... is this right?
Took me ~15 minutes the first time, 5 minutes subsequent times. Yeah it still seems long, and does seem like something is amiss, but if it works it seems OK for now. On Wed, Mar 18, 2009 at 9:52 PM, Jeff Eastman j...@windwardsolutions.com wrote: [WARNING] Entry: mahout-0.2-SNAPSHOT/Users/jeff/Documents/workspace/Mahout/target/mahout-0.1-SNAPSHOT-project.tar.bz2 longer than 100 characters. No movement in the system transcript for many, many minutes. Jeff
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683232#action_12683232 ] Ted Dunning commented on MAHOUT-103: 1. How do you feel about, therefore, changing to use more abstract objects rather than, say, Click? How is click more or less abstract than the term user? Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683238#action_12683238 ] Sean Owen commented on MAHOUT-103: -- The comparison would be to Item. You could say that's as domain-specific as Click; I'd suggest that User/Item are the 'abstract' concepts in this context since collaborative filtering is invariably explained in terms of users and items, though of course your user or item can be whatever you like. At least, there is no need to have both Click and Item -- unless this particular context requires one to store more information about a click as an item, in which case it should at least implement Item. But I don't think that's the case. The good news is that this work doesn't seem to only apply to processing click logs, so, I'm suggesting it might be even more useful to express it in terms of the 'abstract' concepts in this context. Co-occurence based nearest neighbourhood Key: MAHOUT-103 URL: https://issues.apache.org/jira/browse/MAHOUT-103 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Ankur Assignee: Ankur Attachments: jira-103.patch Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets
[ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683254#action_12683254 ] Richard Tomsett commented on MAHOUT-59: --- Ugh, I had an example almost done but managed to over-write it by having folders with too-similar names. That'll teach me :-\ anyway, looking at the K-Means issue [MAHOUT-99] at the moment but will hopefully post a bag of words example relatively soon...! Create some examples of clustering well-known datasets -- Key: MAHOUT-59 URL: https://issues.apache.org/jira/browse/MAHOUT-59 Project: Mahout Issue Type: New Feature Components: Clustering Reporter: Jeff Eastman Attachments: MAHOUT-59.patch The existing unit tests for clustering need to be augmented with examples from the literature which illustrate its correct operation on datasets which have known clusters present. See http://archive.ics.uci.edu/ml/ for some candidate datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683297#action_12683297 ] Pallavi Palleti commented on MAHOUT-99: --- Yup. That must be the issue. But I am wondering how the test case succeeded? Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
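The combiner point in the issue description can be sketched outside Hadoop. The following is a minimal illustration in plain Java, not the code in the MAHOUT-99 patches: the class name PartialSum and its methods are hypothetical. It assumes that the value aggregated for each cluster is a (count, vector sum) pair, so that merging is associative and the final centroid comes out the same whether a combiner runs zero times, once, or several times.

```java
import java.util.List;

// Hypothetical illustration only; these are not Mahout's actual classes.
final class PartialSum {
    final long count;     // number of points folded into this partial result
    final double[] sum;   // component-wise sum of those points

    PartialSum(long count, double[] sum) {
        this.count = count;
        this.sum = sum.clone();
    }

    // What a mapper would emit for a single point assigned to a cluster.
    static PartialSum ofPoint(double[] point) {
        return new PartialSum(1, point);
    }

    // Usable both as the combiner step and as the reducer step: it only adds
    // counts and sums, so applying it to its own output changes nothing.
    static PartialSum merge(List<PartialSum> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("no partial sums to merge");
        }
        double[] total = new double[parts.get(0).sum.length];
        long n = 0;
        for (PartialSum p : parts) {
            n += p.count;
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
        }
        return new PartialSum(n, total);
    }

    // The division happens only once, at the very end, in the reducer.
    double[] centroid() {
        double[] c = new double[sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = sum[i] / count;
        }
        return c;
    }
}
```

Under these assumptions, merging three points directly should give the same centroid as merging two of them first (as a combiner might), merging that result again, and then folding in the third point. Averaging inside the combiner instead would break this, since an average of averages is not the overall average in general.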
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312 ] Pallavi Palleti commented on MAHOUT-99: --- I have used KeyValueLineRecordReader internally for my code and forgot to revert to SequenceFileReader. Would it be sufficient to add another patch on the latest code and modify only KMeansDriver to use SequenceFileReader? Kindly let me know. Thanks Pallavi Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and the Canopy format was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters. Jeff Richard Tomsett (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683252#action_12683252 ] Richard Tomsett commented on MAHOUT-99: --- Yup, just downloaded the latest trunk and ran with Hadoop 0.19.1 and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play. Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once.
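The mismatch described here can be sketched in plain Java, without Hadoop. This is an illustration under stated assumptions, not the Mahout code: writeBinary is a hypothetical stand-in for a job that writes binary records (as the canopy job does with a SequenceFile), and readTextCentre is a hypothetical stand-in for a reader that expects one key<TAB>value line per record, which is the layout KeyValueLineRecordReader splits on by default. The bytes written here are not the real SequenceFile layout. Both sides only agree on byte[], so nothing fails at compile time; the problem shows up when the reader meets data in the other format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names come from Mahout or Hadoop.
final class FormatMismatchSketch {

    // Producer A: binary record (stand-in for the SequenceFile-writing job).
    static byte[] writeBinary(String key, double[] centre) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(key);
            out.writeInt(centre.length);
            for (double v : centre) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Producer B: one text line of the form key<TAB>v1,v2,...
    static byte[] writeText(String key, double[] centre) {
        StringBuilder line = new StringBuilder(key).append('\t');
        for (int i = 0; i < centre.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(centre[i]);
        }
        return line.append('\n').toString().getBytes(StandardCharsets.UTF_8);
    }

    // Consumer: expects the text layout. Returns null when the first line
    // cannot be parsed as key<TAB>v1,v2,...
    static double[] readTextCentre(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        int end = text.indexOf('\n');
        String line = end < 0 ? text : text.substring(0, end);
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return null; // no key/value separator: wrong format
        }
        String[] fields = line.substring(tab + 1).split(",");
        double[] centre = new double[fields.length];
        try {
            for (int i = 0; i < fields.length; i++) {
                centre[i] = Double.parseDouble(fields[i]);
            }
        } catch (NumberFormatException e) {
            return null; // separator found, but the values are not numbers
        }
        return centre;
    }
}
```

In this sketch, readTextCentre(writeText(...)) recovers the centre, while readTextCentre(writeBinary(...)) cannot find a separator and reports a failure. A unit test that feeds the reader hand-written text input would pass either way, which is one possible reason a mismatch like this could go unnoticed until the two jobs are chained.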
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Yeah. But I am wondering how the test cases succeeded? I ran them using the mvn clean install command. Thanks Pallavi -Original Message- From: Jeff Eastman [mailto:j...@windwardsolutions.com] Sent: Thursday, March 19, 2009 9:56 AM To: mahout-dev@lucene.apache.org Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and the Canopy format was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters. Jeff Richard Tomsett (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683252#action_12683252 ] Richard Tomsett commented on MAHOUT-99: --- Yup, just downloaded the latest trunk and ran with Hadoop 0.19.1 and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play. Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once.
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Sure, why don't you go ahead and post a patch? Pallavi Palleti (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312 ] Pallavi Palleti commented on MAHOUT-99: --- I have used KeyValueLineRecordReader internally for my code and forgot to revert to SequenceFileReader. Would it be sufficient to add another patch on the latest code and modify only KMeansDriver to use SequenceFileReader? Kindly let me know. Thanks Pallavi Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once.
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
It depends on the kind of output. If we are outputting only numeric values, then a SequenceFile is preferred, as the data is written as binary. If not, it is preferable to write simple text. A text file is human-readable, whereas binary is not. As we treat the data as text in the reducers of both Canopy and KMeans, I don't see any performance improvement in using SequenceFile. So I used TextInputFormat, which is easier to read. Thanks Pallavi -Original Message- From: Jeff Eastman [mailto:j...@windwardsolutions.com] Sent: Thursday, March 19, 2009 10:19 AM To: mahout-dev@lucene.apache.org Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans Also why not consider just converting canopy? Which reader is better? Jeff Eastman wrote: * PGP Signed: 03/18/09 at 21:37:36 Sure, why don't you go ahead and post a patch? Pallavi Palleti (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312 ] Pallavi Palleti commented on MAHOUT-99: --- I have used KeyValueLineRecordReader internally for my code and forgot to revert to SequenceFileReader. Would it be sufficient to add another patch on the latest code and modify only KMeansDriver to use SequenceFileReader? Kindly let me know. Thanks Pallavi Improving speed of KMeans - Key: MAHOUT-99 URL: https://issues.apache.org/jira/browse/MAHOUT-99 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Assignee: Grant Ingersoll Fix For: 0.1 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch Improved the speed of KMeans by passing only cluster ID from mapper to reducer. Previously, whole Cluster Info as formatted string was being sent. Also removed the implicit assumption of Combiner runs only once approach and the code is modified accordingly so that it won't create a bug when combiner runs zero or more than once.