Writing java program for performing kmeans clustering on reuters dataset instead of ./mahout seqdirectory | seq2sparse | kmeans| clusterdump ,Steps to Follow

2012-01-03 Thread rahul raghavendhra
I am new to Mahout. I have checked out the trunk with svn and installed it using mvn. Now I wish to write a Java program (instead of the shell scripts build-reuters.sh/cluster-reuters.sh) that performs kmeans clustering by calling the methods of, or by creating an instance (if possible) of, the classes which convert the

Re: Writing java program for performing kmeans clustering on reuters dataset instead of ./mahout seqdirectory | seq2sparse | kmeans| clusterdump ,Steps to Follow

2012-01-03 Thread Paritosh Ranjan
I think mahout-core (and its internal dependencies) can do most of what you need. You will have to create your vectors yourself and write them to HDFS. Then use KMeansDriver's run method to do the clustering. Then use ClusterOutputPostProcessor to separate out vectors belonging to different
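
For orientation, the clustering step in the reply above is the usual k-means (Lloyd) iteration: assign each vector to its nearest centroid, then move each centroid to the mean of its members. The sketch below is a minimal in-memory illustration in plain Java. It does not use Mahout's KMeansDriver, HDFS, or ClusterOutputPostProcessor; the class name KMeansSketch, its method names, and the sample points are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical in-memory k-means sketch; this is NOT Mahout's KMeansDriver. */
public class KMeansSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs Lloyd's iterations until no assignment changes or maxIterations is reached.
     * The centroids array is updated in place; the returned array holds one cluster
     * index per input point.
     */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int i = 0; i < sum.length; i++) {
                            sum[i] += points[p][i];
                        }
                        count++;
                    }
                }
                if (count > 0) { // leave an empty cluster's centroid where it is
                    for (int i = 0; i < sum.length; i++) {
                        centroids[c][i] = sum[i] / count;
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up sample data: two obvious groups in two dimensions.
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
    }
}
```

In a real Mahout job the vectors would come from the vectorization step and the iteration would run as MapReduce; the sketch only shows the logic being distributed.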

Re: Writing java program for performing kmeans clustering on reuters dataset instead of ./mahout seqdirectory | seq2sparse | kmeans| clusterdump ,Steps to Follow

2012-01-03 Thread praveenesh kumar
Have you tried this link? http://shuyo.wordpress.com/2011/02/14/mahout-development-environment-with-maven-and-eclipse-2/ It explains how to import the Mahout in Action examples into Eclipse. Just add the Hadoop and Mahout dependencies to pom.xml, and there is a small Mahout in Action example to run

Re: Compile failure: log has private access in org.apache.mahout.clustering.kmeans.KMeansClusterer

2012-01-03 Thread Sean Owen
I think you've got some old code lying around; this class doesn't exist anymore. On Tue, Jan 3, 2012 at 2:32 PM, Andrea Leistra andrea.leis...@concur.com wrote: This morning I checked out the mahout trunk. When attempting mvn install I get the following error: [ERROR] COMPILATION ERROR :

Re: Purchase prediction

2012-01-03 Thread Ted Dunning
The recent data is usually just the user history, not the off-line item-item relationship build. For brand new items, there is the cold start problem, but this is often handled by putting these items on a New Arrivals page so that you can expose them to users until you get enough data to include

Re: SGD and memory

2012-01-03 Thread Ted Dunning
Your math is correct. When you say you have 105 features, what do you mean? Are these textual features? Or what? On Tue, Jan 3, 2012 at 2:53 PM, Grant Ingersoll gsing...@apache.org wrote: I'm trying to run the full ASF email SGD classifier problem and am facing heap size issues. My current

Re: SGD and memory

2012-01-03 Thread Lance Norskog
Do these algorithms have good locality? For doing giant online computations it might be worth storing these in memory-mapped files. Or, give up and get the M/R SGD code in. On Tue, Jan 3, 2012 at 2:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: Your math is correct. When you say you have 105

Re: SGD and memory

2012-01-03 Thread Ted Dunning
No. They don't have particularly good locality. They would have moderate hotspots, but these would be scattered all over. The hotspots might allow L2 cache to help, but would not allow disk-based data to work. The major opportunity for improvement here is to incorporate some of the advances that

Item Based Recommendation Evaluation based on Number of Preferences

2012-01-03 Thread Nick Jordan
Hi All, I'm currently running an item-based recommendation using KnnItemBasedRecommender. My data set isn't very large at approximately 30k preferences over 10k items. When running an AverageAbsoluteDifferenceRecommenderEvaluator evaluation on a 0.9 training set the result is ~0.80 (on a
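
As background for the score quoted above: an average-absolute-difference evaluation reports the mean of |estimated preference - actual preference| over the held-out preferences, so lower is better. The sketch below is a minimal plain-Java illustration of that metric only. It is not Mahout's evaluator, and the class name MaeSketch, the method name, and the sample values are hypothetical.

```java
/** Hypothetical sketch of the mean-absolute-difference metric; not Mahout code. */
public class MaeSketch {

    /** Mean of |actual[i] - predicted[i]| over all held-out preferences. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - predicted[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        // Made-up held-out preferences and the values a recommender estimated for them.
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.5, 3.0, 4.0};
        System.out.println(meanAbsoluteError(actual, predicted));
    }
}
```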

Re: SGD and memory

2012-01-03 Thread Grant Ingersoll
On Jan 3, 2012, at 5:59 PM, Ted Dunning wrote: Your math is correct. When you say you have 105 features, what do you mean? Sorry, that should have been 105 categories/labels. I'm trying to do the ASF email equivalent of 20 news groups, but in this case it's 105 ASF projects. The basic

Re: Purchase prediction

2012-01-03 Thread Lance Norskog
If you can use an SVD-based recommender, here is a way to update an SVD in constant time that is much much smaller than the original decomposition. http://www.merl.com/papers/docs/TR2006-059.pdf On Tue, Jan 3, 2012 at 1:44 PM, Ted Dunning ted.dunn...@gmail.com wrote: The recent data is usually

Re: how to create classifier out of kmeans cluster

2012-01-03 Thread Paritosh Ranjan
This topic has been discussed earlier. Check out this thread. This might answer your question. http://comments.gmane.org/gmane.comp.apache.mahout.user/10988 On 03-01-2012 22:18, prasenjit mukherjee wrote: After I use mahout kmeans to create the clusters, does Mahout have any tools/utilities

Re: SGD and memory

2012-01-03 Thread Ted Dunning
Ahh... of course. I should have understood that from the multiplication you did since 104 = 105-1. On Tue, Jan 3, 2012 at 7:58 PM, Grant Ingersoll gsing...@apache.org wrote: On Jan 3, 2012, at 5:59 PM, Ted Dunning wrote: Your math is correct. When you say you have 105 features, what do
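
The "104 = 105-1" remark suggests a heap estimate of the form (categories - 1) x features x bytes per weight. The sketch below works that multiplication through, assuming a dense weight matrix of 8-byte doubles with one row per category except a reference category. The feature count is not given in the snippet, so the 1,000,000 used here is purely a placeholder, and the class and method names are hypothetical.

```java
/** Hypothetical heap estimate for a dense multinomial model; not Mahout code. */
public class SgdMemoryEstimate {

    /**
     * Bytes needed for a dense (numCategories - 1) x numFeatures matrix of doubles.
     * One category is treated as the reference, hence the "- 1".
     */
    static long denseModelBytes(int numCategories, long numFeatures) {
        return (numCategories - 1L) * numFeatures * Double.BYTES;
    }

    public static void main(String[] args) {
        int numCategories = 105;        // from the thread: 105 ASF projects
        long numFeatures = 1_000_000L;  // ASSUMPTION: placeholder, not from the thread
        long bytes = denseModelBytes(numCategories, numFeatures);
        System.out.println(bytes + " bytes per copy of the weight matrix");
    }
}
```

The estimate covers a single copy of the weights; any additional per-model state or multiple models held at once would scale it accordingly.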

Re: Item Based Recommendation Evaluation based on Number of Preferences

2012-01-03 Thread Sean Owen
That is the opposite of what you'd expect, and I think you've identified a possible explanation, but it still seems unlikely to me. Something else may be wrong. Is this repeatable, and not just a fluke of the random number generator? What are the exact args you're using, just to make sure