Re: Hadoop upgrade

2009-03-18 Thread Grant Ingersoll

D'oh!

Thanks!

On Mar 18, 2009, at 4:32 AM, Sean Owen wrote:


(I upgraded to 0.19.1 last week.)

On Tue, Mar 17, 2009 at 10:41 PM, Grant Ingersoll gsing...@apache.org wrote:
OK, pending MAHOUT-110, I think we are good to go on the release. Not sure
who volunteered to upgrade Hadoop, so go for it now, or it will wait until
after 0.1.




[jira] Commented: (MAHOUT-110) Ant script for building Taste web app

2009-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682980#action_12682980
 ] 

Sean Owen commented on MAHOUT-110:
--

I say go for it. I will merge with my patch locally and see if there is 
anything left and take care of that tonight. I'm glad this works out, many 
thanks.

 Ant script for building Taste web app
 -

 Key: MAHOUT-110
 URL: https://issues.apache.org/jira/browse/MAHOUT-110
 Project: Mahout
  Issue Type: Task
  Components: Collaborative Filtering
Affects Versions: 0.1
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 0.1

 Attachments: AntScript.patch, MAHOUT-110-docs.patch, 
 MAHOUT-110.patch, MAHOUT-110.patch


 Will attach patch after creating. This is a follow-up from a thread on 
 mahout-dev.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-110) Ant script for building Taste web app

2009-03-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682983#action_12682983
 ] 

Grant Ingersoll commented on MAHOUT-110:


Will do.  Thanks for the kick in the pants to get going on it.  I like the mvn 
jetty:run-war a lot.  Now I can demo Taste next week!

 Ant script for building Taste web app
 -

 Key: MAHOUT-110
 URL: https://issues.apache.org/jira/browse/MAHOUT-110
 Project: Mahout
  Issue Type: Task
  Components: Collaborative Filtering
Affects Versions: 0.1
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 0.1

 Attachments: AntScript.patch, MAHOUT-110-docs.patch, 
 MAHOUT-110.patch, MAHOUT-110.patch


 Will attach patch after creating. This is a follow-up from a thread on 
 mahout-dev.




[jira] Resolved: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-99.
---

   Resolution: Fixed
Fix Version/s: 0.1

Committed revision 755548.

Thanks!

 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


 Improved the speed of KMeans by passing only cluster ID from mapper to 
 reducer. Previously, the whole Cluster Info was being sent as a formatted string.
 Also removed the implicit assumption of Combiner runs only once approach and 
 the code is modified accordingly so that it won't create a bug when combiner 
 runs zero or more than once.
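
 One way to picture the combiner point above: Hadoop may run a combiner zero,
 one, or many times, so whatever is emitted per cluster ID has to be mergeable
 in any grouping. The following is only a minimal sketch under that assumption,
 not the MAHOUT-99 patch itself; PartialCentroid and its methods are
 hypothetical names made up for illustration.

```java
/** Hypothetical partial aggregate for one cluster ID: point count plus vector sum. */
final class PartialCentroid {
  final long count;
  final double[] sum;

  PartialCentroid(long count, double[] sum) {
    this.count = count;
    this.sum = sum.clone();
  }

  /** What a mapper would emit for a single point assigned to a cluster. */
  static PartialCentroid ofPoint(double[] point) {
    return new PartialCentroid(1, point);
  }

  /** Merging is associative and commutative, so it is safe in a combiner that runs 0..n times. */
  PartialCentroid merge(PartialCentroid other) {
    double[] merged = sum.clone();
    for (int i = 0; i < merged.length; i++) {
      merged[i] += other.sum[i];
    }
    return new PartialCentroid(count + other.count, merged);
  }

  /** Only the final (reducer) step divides by the count to obtain the new centroid. */
  double[] centroid() {
    double[] c = new double[sum.length];
    for (int i = 0; i < c.length; i++) {
      c[i] = sum[i] / count;
    }
    return c;
  }
}
```

 Because the division happens only once at the end, it should not matter
 whether partial results were pre-merged by a combiner or not.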




[jira] Updated: (MAHOUT-111) Redirect Test output to file

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-111:
---

Affects Version/s: (was: 0.2)
   0.1

 Redirect Test output to file
 

 Key: MAHOUT-111
 URL: https://issues.apache.org/jira/browse/MAHOUT-111
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.1
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial

 The tests are really verbose to std out.  Have them direct their output to a 
 file and only report pass/fail on std out.  This should be a simple setting 
 on the test plugin.




[jira] Resolved: (MAHOUT-111) Redirect Test output to file

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-111.


   Resolution: Fixed
Fix Version/s: 0.1

Fixed

 Redirect Test output to file
 

 Key: MAHOUT-111
 URL: https://issues.apache.org/jira/browse/MAHOUT-111
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.1
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Fix For: 0.1


 The tests are really verbose to std out.  Have them direct their output to a 
 file and only report pass/fail on std out.  This should be a simple setting 
 on the test plugin.




Re: Concerns about Maven

2009-03-18 Thread Grant Ingersoll


On Mar 17, 2009, at 9:06 AM, Enis Soztutar wrote:


-Grant
Knowing nothing about the mahout build script(s), I think that  
having both ant and maven scripts might prove to be problematic.  
However keeping one module(taste) in ant will work. As a side note,  
we have discussed this same thing in Hadoop and we opted for ant 
+ivy. The build process is very complex for Hadoop, and there are  
some things that simply cannot be done with maven. ant+ivy works  
pretty well for us and we can generate pom files for deployment.



Thanks, Enis.  I did notice that Hadoop had started using Ivy.  Do you  
know when Hadoop is going to start publishing its artifacts to the  
Maven repo?  We are doing it now for Mahout (i.e. publishing the  
Hadoop artifacts, see http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/) 
 but would really rather rely on Hadoop doing it.


I would think Ivy would allow for this and it would make Hadoop  
adoption even easier.


-Grant


Thoughts on ...

2009-03-18 Thread Grant Ingersoll

http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

-Grant


[jira] Resolved: (MAHOUT-110) Ant script for building Taste web app

2009-03-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-110.


Resolution: Fixed

Committed

 Ant script for building Taste web app
 -

 Key: MAHOUT-110
 URL: https://issues.apache.org/jira/browse/MAHOUT-110
 Project: Mahout
  Issue Type: Task
  Components: Collaborative Filtering
Affects Versions: 0.1
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 0.1

 Attachments: AntScript.patch, MAHOUT-110-docs.patch, 
 MAHOUT-110.patch, MAHOUT-110.patch


 Will attach patch after creating. This is a follow-up from a thread on 
 mahout-dev.




Re: Concerns about Maven

2009-03-18 Thread Enis Soztutar

Grant Ingersoll wrote:


On Mar 17, 2009, at 9:06 AM, Enis Soztutar wrote:


-Grant
Knowing nothing about the mahout build script(s), I think that having 
both ant and maven scripts might prove to be problematic. However 
keeping one module(taste) in ant will work. As a side note, we have 
discussed this same thing in Hadoop and we opted for ant+ivy. The 
build process is very complex for Hadoop, and there are some things 
that simply cannot be done with maven. ant+ivy works pretty well for 
us and we can generate pom files for deployment.



Thanks, Enis.  I did notice that Hadoop had started using Ivy.  Do you 
know when Hadoop is going to start publishing its artifacts to the 
Maven repo?  We are doing it now for Mahout (i.e. publishing the 
Hadoop artifacts, see 
http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/) but 
would really rather rely on Hadoop doing it.


I would think Ivy would allow for this and it would make Hadoop 
adoption even easier.


-Grant
The issue has been open for a while, but I'm afraid nobody has stepped up to add 
Maven deployment to the release procedure. Until the deployment is 
complete, you can use local deployment with ant maven-artifacts ; mvn 
install:install .


Dirichlet Job example

2009-03-18 Thread Grant Ingersoll

Hey Jeff,

Is it appropriate to have a Job example like we do for k-means and  
some of the other clustering algorithms for dirichlet?  I see you do  
have some type of UI in there, right? Are there directions  
somewhere for running the example? http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html 
 just seems to show the output.


-Grant




Re: Thoughts on ...

2009-03-18 Thread Jeff Eastman
Interesting optimization. We can incorporate it by adding a centroid^2 
argument to DistanceMeasure interface and adjusting the affected 
clustering algorithms. All would benefit from this optimization. I will 
build a test to assess its impact and report.
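
As far as I read the linked post, the optimization rests on the identity
|x - c|^2 = |x|^2 - 2 x.c + |c|^2: if |c|^2 is cached per centroid, each
distance only needs a dot product over the non-zero entries of a sparse point.
Below is a rough sketch of that idea only. SquaredDistance, its method names,
and the Map-based sparse vector are assumptions for illustration and are not
the Mahout DistanceMeasure interface.

```java
import java.util.Map;

/** Hypothetical helper illustrating squared Euclidean distance with cached squared lengths. */
final class SquaredDistance {

  /** Squared length of a dense centroid; would be computed once per centroid per iteration. */
  static double lengthSquared(double[] dense) {
    double s = 0.0;
    for (double v : dense) {
      s += v * v;
    }
    return s;
  }

  /** Squared length of a sparse point (index -> value); would be computed once per point. */
  static double lengthSquared(Map<Integer, Double> sparse) {
    double s = 0.0;
    for (double v : sparse.values()) {
      s += v * v;
    }
    return s;
  }

  /** |x|^2 - 2 x.c + |c|^2, touching only the non-zero entries of the point. */
  static double distanceSquared(Map<Integer, Double> point, double pointLengthSquared,
                                double[] centroid, double centroidLengthSquared) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : point.entrySet()) {
      dot += e.getValue() * centroid[e.getKey()];
    }
    return pointLengthSquared - 2.0 * dot + centroidLengthSquared;
  }
}
```

The extra argument mentioned above would correspond to centroidLengthSquared
in this sketch.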


Jeff

Grant Ingersoll wrote:
http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ 



-Grant








Re: [jira] Reopened: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Jeff Eastman
Did you reopen this issue because of this error? I just ran the example 
and it ran without error.

Jeff

Grant Ingersoll (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reopened MAHOUT-99:
---


Hi Pallavi,

I'm getting:
09/03/18 11:13:56 WARN mapred.LocalJobRunner: job_local_0001
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1938)
    at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
    at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:80)
    at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

when running http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
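
The trace only shows that String.substring received -1 somewhere inside
Cluster.decodeCluster; the parsing code itself is not quoted in this thread.
One common way to end up there is passing an indexOf result straight to
substring when the expected delimiter is missing. The sketch below is purely
illustrative: DecodeSketch, decodeId, and the ':' delimiter are hypothetical
and are not taken from the Mahout source.

```java
/** Hypothetical illustration of how a missing delimiter turns into index -1. */
final class DecodeSketch {

  /** Unchecked variant: indexOf returns -1 when ':' is absent, and substring then throws. */
  static String decodeIdUnchecked(String formatted) {
    return formatted.substring(0, formatted.indexOf(':'));
  }

  /** Checked variant: fails with a message that names the offending input. */
  static String decodeId(String formatted) {
    int end = formatted.indexOf(':');
    if (end < 0) {
      throw new IllegalArgumentException("Not a formatted cluster string: " + formatted);
    }
    return formatted.substring(0, end);
  }
}
```

If something like this is the cause, it would suggest the mapper is being
handed cluster input in a format it does not expect.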

  

Improving speed of KMeans
-

Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
 Issue Type: Improvement
 Components: Clustering
   Reporter: Pallavi Palleti
   Assignee: Grant Ingersoll
Fix For: 0.1

Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


Improved the speed of KMeans by passing only cluster ID from mapper to reducer. 
Previously, the whole Cluster Info was being sent as a formatted string.
Also removed the implicit assumption of Combiner runs only once approach and 
the code is modified accordingly so that it won't create a bug when combiner 
runs zero or more than once.



  






Re: Dirichlet Job example

2009-03-18 Thread Jeff Eastman
Not only appropriate but essential. I will add a README file in the code 
and instructions in the wiki today.


Jeff


Grant Ingersoll wrote:

Hey Jeff,

Is it appropriate to have a Job example like we do for k-means and 
some of the other clustering algorithms for dirichlet?  I see you do 
have some type of UI in there, right? Are there directions 
somewhere for running the example? 
http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html just 
seems to show the output.


-Grant










Re: Dirichlet Job example

2009-03-18 Thread Otis Gospodnetic

Yeah, I was wondering about that simple, but nice cluster-showing UI...

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: mahout-dev@lucene.apache.org
 Sent: Wednesday, March 18, 2009 12:01:28 PM
 Subject: Dirichlet Job example
 
 Hey Jeff,
 
 Is it appropriate to have a Job example like we do for k-means and some of 
 the 
 other clustering algorithms for dirichlet?  I see you do have some type of UI 
 in 
 there, right? Are there directions somewhere for running the example? 
 http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html just seems 
 to 
 show the output.
 
 -Grant



Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Jeff Eastman
I'm running the example in Eclipse using the stand-alone mode in the 
hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in 
Eclipse. I cannot, however, get any hadoop stuff to work from the 
command line. Even though my JAVA_HOME environment is set to 
/Library/Java/Home and java -version yields:


Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

... the hadoop build script and the start-all.sh commands all complain 
about class version errors. Can any other Mac users help me out?
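
Class version errors usually mean that classes compiled for a newer Java
release are being loaded by an older JVM, so the scripts may be launching a
different java than the one the shell reports. A small sketch that could help
narrow this down is below; it only prints standard system properties, and the
class name WhichJvm is just a placeholder. The idea would be to run it with
the same java binary the hadoop scripts end up using.

```java
/** Prints which JVM is actually executing, using standard system properties. */
final class WhichJvm {

  static String describe() {
    return "java.home=" + System.getProperty("java.home")
        + "\njava.version=" + System.getProperty("java.version")
        // Highest class file version this JVM accepts, e.g. 50.0 for Java 6.
        + "\njava.class.version=" + System.getProperty("java.class.version");
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```

If java.home here differs from $JAVA_HOME, that would point at the launcher
scripts rather than at Hadoop or Mahout.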


Jeff


Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683077#action_12683077 ] 


Grant Ingersoll commented on MAHOUT-99:
---

Yeah, what version of Hadoop are you running?  I got it w/ 0.19.1, but maybe I 
didn't set something up right.

{code}
bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
{code}

  

Improving speed of KMeans
-

Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
 Issue Type: Improvement
 Components: Clustering
   Reporter: Pallavi Palleti
   Assignee: Grant Ingersoll
Fix For: 0.1

Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


Improved the speed of KMeans by passing only cluster ID from mapper to reducer. 
Previously, the whole Cluster Info was being sent as a formatted string.
Also removed the implicit assumption of Combiner runs only once approach and 
the code is modified accordingly so that it won't create a bug when combiner 
runs zero or more than once.



  






mvn package tar file issue

2009-03-18 Thread Otis Gospodnetic

Hi,

Am I the only person getting the following after mvn package?

[INFO] 
[ERROR] BUILD ERROR
[INFO] 
[INFO] Failed to create assembly: Error creating assembly archive project: A 
tar file cannot include itself.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Grant Ingersoll

On my Mac, I have:
$ echo $JAVA_HOME
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

-Grant

On Mar 18, 2009, at 2:10 PM, Jeff Eastman wrote:

I'm running the example in Eclipse using the stand-alone mode in the  
hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in  
Eclipse. I cannot, however, get any hadoop stuff to work from the  
command line. Even though my JAVA_HOME environment is set to  
/Library/Java/Home and java -version yields:


Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

... the hadoop build script and the start-all.sh commands all  
complain about class version errors. Can any other Mac users help me  
out?


Jeff


Grant Ingersoll (JIRA) wrote:
   [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683077#action_12683077 ]

Grant Ingersoll commented on MAHOUT-99:
---

Yeah, what version of Hadoop are you running?  I got it w/ 0.19.1,  
but maybe I didn't set something up right.


{code}
bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

{code}



Improving speed of KMeans
-

   Key: MAHOUT-99
   URL: https://issues.apache.org/jira/browse/MAHOUT-99
   Project: Mahout
Issue Type: Improvement
Components: Clustering
  Reporter: Pallavi Palleti
  Assignee: Grant Ingersoll
   Fix For: 0.1

   Attachments: MAHOUT-99-1.patch, Mahout-99.patch,  
MAHOUT-99.patch



Improved the speed of KMeans by passing only cluster ID from  
mapper to reducer. Previously, the whole Cluster Info was being sent  
as a formatted string.
Also removed the implicit assumption of Combiner runs only once  
approach and the code is modified accordingly so that it won't  
create a bug when combiner runs zero or more than once.











[jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683140#action_12683140
 ] 

Grant Ingersoll commented on MAHOUT-99:
---

I seem to recall hitting something similar before, let me poke around...

Seems somewhat similar to the problems we were having on 
http://www.lucidimagination.com/search/document/31bd6ab8d94bb3e5/problems_with_kmeans_clustering#31bd6ab8d94bb3e5,
 but I'm not sure

 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


 Improved the speed of KMeans by passing only cluster ID from mapper to 
 reducer. Previously, the whole Cluster Info was being sent as a formatted string.
 Also removed the implicit assumption of Combiner runs only once approach and 
 the code is modified accordingly so that it won't create a bug when combiner 
 runs zero or more than once.




Re: mvn package tar file issue

2009-03-18 Thread Otis Gospodnetic

Yes, at the top.  Bad?
Doing it from core worked.
How come it doesn't work from root and should it, at least for 0.2?  Would be 
more intuitive, no?

Otis



- Original Message 
 From: Grant Ingersoll gsing...@apache.org
 To: mahout-dev@lucene.apache.org
 Sent: Wednesday, March 18, 2009 2:19:29 PM
 Subject: Re: mvn package tar file issue
 
 Where are you running it?  The top?
 
 On Mar 18, 2009, at 2:15 PM, Otis Gospodnetic wrote:
 
  
  Hi,
  
  Am I the only person getting the following after mvn package?
  
  [INFO] 
 
  [ERROR] BUILD ERROR
  [INFO] 
 
  [INFO] Failed to create assembly: Error creating assembly archive project: 
  A 
 tar file cannot include itself.
  
  Thanks,
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  



Taste: user's neighbours and their similarity

2009-03-18 Thread Otis Gospodnetic

Hi,

Is there a way to get a collection of neighbours for a given user?  I'm 
referring to the same neighbour collection that recommendations are derived 
from.  I didn't see a way, so I simply made NearestNUserNeighborhood.Estimator 
public (diff below), so I could do something like this:

  public Collection<SimilarUser> getHood(Object userID) throws TasteException {
    User theUser = recommender.getDataModel().getUser(userID);
    TopItems.Estimator<User> estimator =
        new NearestNUserNeighborhood.Estimator(similarity, theUser, minSimilarity);
    Collection<User> neighbors = hood.getUserNeighborhood(userID);
    Collection<SimilarUser> similarHood = new ArrayList<SimilarUser>(neighbors.size());
    System.out.println("Neighbors for user: " + userID + ": " + neighbors.size());
    for (User user : neighbors) {
      SimilarUser su = new SimilarUser(user, estimator.estimate(user));
      similarHood.add(su);
    }
    return similarHood;
  }

This gives me the needed collection:

[SimilarUser[user:User[id:U2], similarity:0.7084]]


$ svn diff core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
Index: core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
===================================================================
--- core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java   (revision 755664)
+++ core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java   (working copy)
@@ -109,12 +109,12 @@
     return "NearestNUserNeighborhood";
   }
 
-  private static class Estimator implements TopItems.Estimator<User> {
+  public static class Estimator implements TopItems.Estimator<User> {
     private final UserSimilarity userSimilarityImpl;
     private final User theUser;
     private final double minSim;
 
-    private Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
+    public Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
       this.userSimilarityImpl = userSimilarityImpl;
       this.theUser = theUser;
       this.minSim = minSim;



Is there an existing way to get the neighbours + similarity information?  If 
not, is the above change OK?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683170#action_12683170
 ] 

Sean Owen commented on MAHOUT-103:
--

1. How do you feel about, therefore, changing to use more abstract objects 
rather than, say, Click? These objects could be the existing ones, or 
modified or new ones. I think as you say the existing objects are about what is 
needed. That way the solution is that much more reusable. Same with the job -- 
the more it uses abstract/standard classes, the more reusable I think it looks.

2. Yeah the two interfaces are nearly identical: provide a method that takes 
two items as input and a numerical score as output. I suppose it just makes 
sense to use the existing ItemSimilarity interface in this section of the code.

3. Good question, here is my brief digression:

The code was originally written with an on-line model in mind -- 
recommendations happen in real-time. Over time that has proved inefficient or 
impractical for large data sets, though it remains quite nice for small- to 
medium-size data sets. Hence I have attempted to preserve the real-time model 
at the core, and build a batch-oriented extension around it using Hadoop.

The two are a bit separate, and that is fine. So in this section of the code, I 
don't mind attaching Hadoop-related jobs that are not intimately connected to 
the core code. I am trying to keep them as consistent as possible so that the 
original on-line and newer off-line models don't evolve into two separate 
worlds within this part of the code.

To be specific... well I don't know, I don't have a problem with adding this 
job actually. Ideally we build a bit more around it: takes as input the 
standard preference-file format as used by FileDataModel, and outputs a file 
format that can be read by a new ItemSimilarity implementation that would 
read and cache all these results. That would be a nice step towards integrating 
with the core code.

This is something I have been remiss in - I wrote a job to do the 
pre-computation of item-item diffs for slope one but never wrote an 
implementation of DiffStorage that would read this output and operate based on 
those results. This would close the loop. 

How about we make #3 my part of this issue, to complete the connection between 
this job and the core code a bit more?
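
To make the "close the loop" idea in #3 a little more concrete, here is a
rough sketch of an in-memory lookup over precomputed job output. Everything in
it is an assumption for illustration: the "itemA,itemB,score" line format, the
class name PrecomputedItemSimilarity, and the use of plain String IDs. It does
not implement the actual Taste ItemSimilarity interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache of precomputed item-item scores, loaded once and then read-only. */
final class PrecomputedItemSimilarity {

  private final Map<String, Double> scores = new HashMap<String, Double>();

  /** Each line is assumed to look like "itemA,itemB,score"; other lines are skipped. */
  PrecomputedItemSimilarity(Iterable<String> jobOutputLines) {
    for (String line : jobOutputLines) {
      String[] parts = line.trim().split(",");
      if (parts.length != 3) {
        continue;
      }
      scores.put(key(parts[0], parts[1]), Double.parseDouble(parts[2]));
    }
  }

  /** Orders the pair so that (a, b) and (b, a) share one entry. */
  private static String key(String a, String b) {
    return a.compareTo(b) <= 0 ? a + '\t' + b : b + '\t' + a;
  }

  /** Returns the stored score, or NaN when the pair was never computed. */
  double itemSimilarity(String a, String b) {
    Double s = scores.get(key(a, b));
    return s == null ? Double.NaN : s;
  }
}
```

A DiffStorage-style reader for the slope-one output could presumably follow
the same load-once, look-up-many shape.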

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.
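
 For readers who have not seen the patch, the following is a minimal sketch of
 what "simple co-occurrence counts" could look like for items. It is not the
 code in jira-103.patch; the class name, the String item IDs, and the
 assumption that each user's item list contains no duplicates are all
 illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory co-occurrence counter: how often two items share a user. */
final class CooccurrenceCounts {

  /** Input: one list of distinct item IDs per user. Output: item -> (other item -> count). */
  static Map<String, Map<String, Integer>> count(Iterable<? extends List<String>> itemsPerUser) {
    Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
    for (List<String> items : itemsPerUser) {
      for (String a : items) {
        for (String b : items) {
          if (a.equals(b)) {
            continue; // an item does not co-occur with itself
          }
          Map<String, Integer> row = counts.get(a);
          if (row == null) {
            row = new HashMap<String, Integer>();
            counts.put(a, row);
          }
          Integer n = row.get(b);
          row.put(b, n == null ? 1 : n + 1);
        }
      }
    }
    return counts;
  }
}
```

 The nearest neighbours of an item would then be the entries in its row with
 the highest counts.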




Re: Taste: user's neighbours and their similarity

2009-03-18 Thread Sean Owen
How about the method UserBasedRecommender.mostSimilarUsers()? or a bit
more directly, UserNeighborhood.getUserNeighborhood()? (They are
arguably kind of redundant but it's 'for historical reasons' and low
on my list of design sins.) These in turn largely use
TopItems.getTopUsers() and you apparently already see all this so:

I suppose you are interested in the latter since it reports some
measure of similarity as well as the users themselves.

You want to just refactor getTopUsers() there so a version is also
provided that gives you the SimilarUser objects instead of just the
Users? OK by me and perhaps a bit more general than putting code in
NearestNUserNeighborhood.
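
Roughly what I have in mind, as a standalone sketch rather than the real
TopItems code: a top-N selection that keeps the similarity next to each user
instead of throwing it away. The names here (TopUsersSketch, Scorer,
ScoredUser) and the String user IDs are stand-ins for illustration, not the
Taste classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical top-N selection that returns users together with their similarity. */
final class TopUsersSketch {

  /** Stand-in for an estimator: similarity of a candidate to the target user. */
  interface Scorer {
    double score(String userId);
  }

  /** Stand-in for a (user, similarity) pair. */
  static final class ScoredUser {
    final String userId;
    final double similarity;

    ScoredUser(String userId, double similarity) {
      this.userId = userId;
      this.similarity = similarity;
    }
  }

  /** Scores every candidate, drops NaN scores, and keeps the howMany most similar. */
  static List<ScoredUser> topUsers(int howMany, Iterable<String> candidates, Scorer scorer) {
    List<ScoredUser> scored = new ArrayList<ScoredUser>();
    for (String id : candidates) {
      double s = scorer.score(id);
      if (!Double.isNaN(s)) {
        scored.add(new ScoredUser(id, s));
      }
    }
    Collections.sort(scored, new Comparator<ScoredUser>() {
      public int compare(ScoredUser a, ScoredUser b) {
        return Double.compare(b.similarity, a.similarity); // descending
      }
    });
    return new ArrayList<ScoredUser>(scored.subList(0, Math.min(howMany, scored.size())));
  }
}
```

The existing Users-only method could then just strip the scores from this
result.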

On Wed, Mar 18, 2009 at 9:04 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Hi,

 Is there a way to get a collection of neighbours for a given user?  I'm 
 referring to the same neighbour collection that recommendations are derived 
 from.  I didn't see a way, so I simply made 
 NearestNUserNeighborhood.Estimator public (diff below), so I could do 
 something like this:

   public Collection<SimilarUser> getHood(Object userID) throws TasteException {
     User theUser = recommender.getDataModel().getUser(userID);
     TopItems.Estimator<User> estimator =
         new NearestNUserNeighborhood.Estimator(similarity, theUser, minSimilarity);
     Collection<User> neighbors = hood.getUserNeighborhood(userID);
     Collection<SimilarUser> similarHood = new ArrayList<SimilarUser>(neighbors.size());
     System.out.println("Neighbors for user: " + userID + ": " + neighbors.size());
     for (User user : neighbors) {
       SimilarUser su = new SimilarUser(user, estimator.estimate(user));
       similarHood.add(su);
     }
     return similarHood;
   }

 This gives me the needed collection:

 [SimilarUser[user:User[id:U2], similarity:0.7084]]


 $ svn diff core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
 Index: core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java
 ===================================================================
 --- core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java   (revision 755664)
 +++ core/src/main/java/org/apache/mahout/cf/taste/impl/neighborhood/NearestNUserNeighborhood.java   (working copy)
 @@ -109,12 +109,12 @@
      return "NearestNUserNeighborhood";
    }
  
 -  private static class Estimator implements TopItems.Estimator<User> {
 +  public static class Estimator implements TopItems.Estimator<User> {
      private final UserSimilarity userSimilarityImpl;
      private final User theUser;
      private final double minSim;
  
 -    private Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
 +    public Estimator(UserSimilarity userSimilarityImpl, User theUser, double minSim) {
        this.userSimilarityImpl = userSimilarityImpl;
        this.theUser = theUser;
        this.minSim = minSim;



 Is there an existing way to get the neighbours + similarity information?  If 
 not, is the above change OK?

 Thanks,
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Packaging step taking forever... is this right?

2009-03-18 Thread Sean Owen
Took me ~15 minutes the first time, 5 minutes subsequent times. Yeah
it still seems long, and does seem like something is amiss, but if it
works it seems OK for now.

On Wed, Mar 18, 2009 at 9:52 PM, Jeff Eastman
j...@windwardsolutions.com wrote:
 [WARNING] Entry:
 mahout-0.2-SNAPSHOT/Users/jeff/Documents/workspace/Mahout/target/mahout-0.1-SNAPSHOT-project.tar.bz2
 longer than 100 characters.

 No movement in the system transcript for many, many minutes.
 Jeff



[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683232#action_12683232
 ] 

Ted Dunning commented on MAHOUT-103:


  1. How do you feel about, therefore, changing to use more abstract objects 
  rather than, say, Click? 

How is click more or less abstract than the term user?



 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683238#action_12683238
 ] 

Sean Owen commented on MAHOUT-103:
--

The comparison would be to Item. You could say that's as domain-specific as 
Click; I'd suggest that User/Item are the 'abstract' concepts in this context 
since collaborative filtering is invariably explained in terms of users and 
items, though of course your user or item can be whatever you like.

At least, there is no need to have both Click and Item -- unless this 
particular context requires one to store more information about a click as an 
item, in which case it should at least implement Item. But I don't think that's 
the case.

The good news is that this work doesn't seem to only apply to processing click 
logs, so, I'm suggesting it might be even more useful to express it in terms of 
the 'abstract' concepts in this context.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.




[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

2009-03-18 Thread Richard Tomsett (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683254#action_12683254
 ] 

Richard Tomsett commented on MAHOUT-59:
---

Ugh, I had an example almost done but managed to over-write it by having 
folders with too-similar names. That'll teach me :-\ anyway, looking at the 
K-Means issue [MAHOUT-99] at the moment but will hopefully post a bag of words 
example relatively soon...!

 Create some examples of clustering well-known datasets
 --

 Key: MAHOUT-59
 URL: https://issues.apache.org/jira/browse/MAHOUT-59
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: Jeff Eastman
 Attachments: MAHOUT-59.patch


 The existing unit tests for clustering need to be augmented with examples 
 from the literature which illustrate its correct operation on datasets which 
 have known clusters present. See http://archive.ics.uci.edu/ml/ for some 
 candidate datasets.




[jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Pallavi Palleti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683297#action_12683297
 ] 

Pallavi Palleti commented on MAHOUT-99:
---

Yup. That must be the issue. But I am wondering how the test case succeeded?

 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper to 
 reducer. Previously, the whole cluster info was being sent as a formatted string.
 Also removed the implicit assumption that the combiner runs exactly once; 
 the code is modified accordingly so that it won't create a bug when the combiner 
 runs zero times or more than once.
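
 For readers unfamiliar with the combiner issue: one common way to make a
 KMeans job safe when the combiner runs zero, one, or many times is to pass
 partial sums and counts and to divide only in the reducer. The sketch below
 illustrates that technique in plain Java. It is an assumption about the
 general approach, not the code in the attached patches, and PartialCentroid
 is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical partial result for one cluster: a running vector sum and a point count. */
final class PartialCentroid {
  final double[] sum;
  final long count;

  PartialCentroid(double[] sum, long count) {
    this.sum = sum.clone();
    this.count = count;
  }

  /** What a mapper would emit for a single point assigned to the cluster. */
  static PartialCentroid ofPoint(double[] point) {
    return new PartialCentroid(point, 1);
  }

  /**
   * Merges two partials. Addition is associative and commutative (up to
   * floating-point rounding), so this can run in a combiner any number of times.
   */
  PartialCentroid merge(PartialCentroid other) {
    double[] merged = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      merged[i] = sum[i] + other.sum[i];
    }
    return new PartialCentroid(merged, count + other.count);
  }

  /** Only the final step (the reducer) divides, after all partials are merged. */
  double[] centroid() {
    double[] c = new double[sum.length];
    for (int i = 0; i < sum.length; i++) {
      c[i] = sum[i] / count;
    }
    return c;
  }

  public static void main(String[] args) {
    PartialCentroid p1 = ofPoint(new double[] {1, 2});
    PartialCentroid p2 = ofPoint(new double[] {3, 4});
    PartialCentroid p3 = ofPoint(new double[] {5, 6});
    // No combiner: the reducer merges the raw mapper outputs.
    System.out.println(Arrays.toString(p1.merge(p2).merge(p3).centroid()));
    // Combiner ran once on p2 and p3 before the reducer saw them.
    System.out.println(Arrays.toString(p1.merge(p2.merge(p3)).centroid()));
  }
}
```

 Averaging inside the combiner instead would give a wrong result as soon as
 the combiner ran more than once, because an average of averages ignores how
 many points each partial represents.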




[jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Pallavi Palleti (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312
 ] 

Pallavi Palleti commented on MAHOUT-99:
---

I used KeyValueLineRecordReader internally in my code and forgot to 
revert to SequenceFileReader. Would it be sufficient to add another patch 
against the latest code that modifies only KMeansDriver to use SequenceFileReader? 
Kindly let me know.

Thanks
Pallavi

 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper to 
 reducer. Previously, the whole cluster info was being sent as a formatted string.
 Also removed the implicit assumption that the combiner runs exactly once; 
 the code is modified accordingly so that it won't create a bug when the combiner 
 runs zero times or more than once.




Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Jeff Eastman
The Synthetic Control kMeans job calls the Canopy job to build its 
initial clusters as is commonly done. If the kMeans record format was 
changed and the Canopy not changed accordingly, then everything would 
still compile but there would be a mismatch when the kMeans mapper tried 
to read in the clusters.
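
To make the failure mode concrete, here is a small stand-in that uses only the
JDK, not Hadoop's SequenceFile or KeyValueLineRecordReader. The record layouts
and all names are hypothetical. Both writers return a byte[], so a reader for
one layout compiles against the other, and the mismatch shows up only when the
data is actually parsed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a writer/reader format mismatch; not Hadoop code. */
final class FormatMismatchSketch {

  /** Binary record: 2-byte key length, key bytes, then each value as an 8-byte double. */
  static byte[] writeBinary(String key, double[] values) {
    byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(2 + keyBytes.length + 8 * values.length);
    buf.putShort((short) keyBytes.length).put(keyBytes);
    for (double v : values) {
      buf.putDouble(v);
    }
    return buf.array();
  }

  /** Text record: key, a tab, then comma-separated values. */
  static byte[] writeText(String key, double[] values) {
    StringBuilder sb = new StringBuilder(key).append('\t');
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(values[i]);
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  /** Text reader: accepts any byte[], but only understands the text layout. */
  static double[] readText(byte[] record) {
    String line = new String(record, StandardCharsets.UTF_8);
    int tab = line.indexOf('\t');
    if (tab < 0) {
      throw new IllegalArgumentException("no tab separator: not a text record");
    }
    String[] parts = line.substring(tab + 1).split(",");
    double[] values = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      values[i] = Double.parseDouble(parts[i]);
    }
    return values;
  }

  public static void main(String[] args) {
    double[] centre = {1.5, 2.5};
    // Matching formats: the values round-trip.
    System.out.println(readText(writeText("C0", centre)).length);
    // Mismatched formats: compiles, but fails when the record is read.
    try {
      readText(writeBinary("C0", centre));
    } catch (IllegalArgumentException e) {
      System.out.println("read failed: " + e.getMessage());
    }
  }
}
```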


Jeff


Richard Tomsett (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683252#action_12683252 ] 


Richard Tomsett commented on MAHOUT-99:
---

Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the 
same error on the Synthetic Control example. It seems to be because the new 
KMeans code uses a KeyValueLineRecordReader object to read the input cluster 
centres from the canopy clustering output, but the canopy clustering job 
outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the 
cluster centres). I think that's the problem at least; I'll have a quick play.

  

Improving speed of KMeans
-

Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
 Issue Type: Improvement
 Components: Clustering
   Reporter: Pallavi Palleti
   Assignee: Grant Ingersoll
Fix For: 0.1

Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. 
Previously, the whole cluster info was being sent as a formatted string.
Also removed the implicit assumption that the combiner runs exactly once; 
the code is modified accordingly so that it won't create a bug when the combiner 
runs zero times or more than once.



  






RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Palleti, Pallavi
Yeah. But I am wondering how the test cases succeeded? I ran them using the 
mvn clean install command.

Thanks
Pallavi

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The Synthetic Control kMeans job calls the Canopy job to build its initial 
clusters as is commonly done. If the kMeans record format was changed and the 
Canopy not changed accordingly, then everything would still compile but there 
would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff


Richard Tomsett (JIRA) wrote:
 [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683252#action_12683252 ]

 Richard Tomsett commented on MAHOUT-99:
 ---

 Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get 
 the same error on the Synthetic Control example. It seems to be because the 
 new KMeans code uses a KeyValueLineRecordReader object to read the input 
 cluster centres from the canopy clustering output, but the canopy clustering 
 job outputs a SequenceFile (and the old KMeans code read in a SequenceFile 
 for the cluster centres). I think that's the problem at least; I'll have a 
 quick play.

   
 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
 MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper to 
 reducer. Previously, the whole cluster info was being sent as a formatted string.
 Also removed the implicit assumption that the combiner runs exactly once; 
 the code is modified accordingly so that it won't create a bug when the combiner 
 runs zero times or more than once.
 

   



Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Jeff Eastman

Sure, why don't you go ahead and post a patch?


Pallavi Palleti (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312 ] 


Pallavi Palleti commented on MAHOUT-99:
---

I used KeyValueLineRecordReader internally in my code and forgot to 
revert to SequenceFileReader. Would it be sufficient to add another patch 
against the latest code that modifies only KMeansDriver to use SequenceFileReader? 
Kindly let me know.

Thanks
Pallavi

  

Improving speed of KMeans
-

Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
 Issue Type: Improvement
 Components: Clustering
   Reporter: Pallavi Palleti
   Assignee: Grant Ingersoll
Fix For: 0.1

Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. 
Previously, the whole cluster info was being sent as a formatted string.
Also removed the implicit assumption that the combiner runs exactly once; 
the code is modified accordingly so that it won't create a bug when the combiner 
runs zero times or more than once.



  






RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

2009-03-18 Thread Palleti, Pallavi
It depends on the kind of output. If we are outputting only numeric values, 
then SequenceFile is preferable, since the data is written as binary. If not, 
it is preferable to write simple text. A text file is human-readable, whereas a 
binary file is not. 

As we treat the data as text in the reducers of both Canopy and KMeans, I don't 
see any performance improvement in using SequenceFile. So I used 
TextInputFormat, which keeps the data human-readable.
 
Thanks
Pallavi

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also why not consider just converting canopy? Which reader is better?


Jeff Eastman wrote:

 Sure, why don't you go ahead and post a patch?


 Pallavi Palleti (JIRA) wrote:
 [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683312#action_12683312 ]

 Pallavi Palleti commented on MAHOUT-99:
 ---

 I used KeyValueLineRecordReader internally in my code and forgot to 
 revert to SequenceFileReader. Would it be sufficient to add another 
 patch against the latest code that modifies only KMeansDriver to use 
 SequenceFileReader? Kindly let me know.

 Thanks
 Pallavi

  
 Improving speed of KMeans
 -

 Key: MAHOUT-99
 URL: https://issues.apache.org/jira/browse/MAHOUT-99
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
 Fix For: 0.1

 Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
 MAHOUT-99.patch


 Improved the speed of KMeans by passing only the cluster ID from mapper 
 to reducer. Previously, the whole cluster info was being sent as a 
 formatted string.
 Also removed the implicit assumption that the combiner runs exactly 
 once; the code is modified accordingly so that it won't create a bug 
 when the combiner runs zero times or more than once.
 

   

