Re: Problem compiling mahout

2011-11-17 Thread Sean Owen
This isn't anything to do with chmod, as far as I know: Hadoop uses Java to set readable permission, and this is not implemented in Windows. chmod is already on the Cygwin path anyway. It seems pretty normal that Hadoop might want to make its output directory writable! On Thu, Nov 17, 2011 at

Re: Understanding the SVD recommender

2011-11-17 Thread Sean Owen
One more question. OK, so I use Lanczos to find V_k by finding the top k eigenvectors of A^T * A. A is sparse. But isn't A^T * A dense, then? Is that just how it is? This will be my last basic question for the week: I understand that A ~= U_k * S_k * V_k^T. Let's call the product on the right A_k.

Re: Understanding the SVD recommender

2011-11-17 Thread Jake Mannix
On Thu, Nov 17, 2011 at 5:26 AM, Sean Owen sro...@gmail.com wrote: One more question. OK, so I use Lanczos to find V_k by finding the top k eigenvectors of A^T * A. A is sparse. But isn't A^T * A dense, then? Is that just how it is? A'A and AA' are both dense, yes, but you never compute them.
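
The point about never computing A'A can be sketched as follows. A Lanczos-style solver only needs the product (A'A)v for a given vector v, and that can be evaluated as A'(Av) in two sparse passes. This is a minimal pure-Python illustration, not Mahout's implementation; the row-dictionary sparse format and the function names `times`, `transpose_times`, and `ata_times` are assumptions made for the example.

```python
def times(rows, v):
    """Compute A v, where rows[i] is a dict {column: value} for row i of A."""
    return [sum(val * v[j] for j, val in row.items()) for row in rows]


def transpose_times(rows, u, num_cols):
    """Compute A' u without ever building A' explicitly."""
    result = [0.0] * num_cols
    for i, row in enumerate(rows):
        for j, val in row.items():
            result[j] += val * u[i]
    return result


def ata_times(rows, v):
    """Compute (A'A) v as A'(A v); work is proportional to the non-zeros of A."""
    return transpose_times(rows, times(rows, v), len(v))


# A small 3x4 sparse matrix (hypothetical data for illustration).
A = [
    {0: 1.0, 2: 2.0},
    {1: 3.0},
    {0: 4.0, 3: 5.0},
]
v = [1.0, 1.0, 1.0, 1.0]
print(ata_times(A, v))
```

Only A itself is stored, so memory stays proportional to its non-zero entries even though A'A would be dense.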

Re: Understanding the SVD recommender

2011-11-17 Thread Sean Owen
Ah-ha. That's clicked now. Especially as I read the comments and see it already says exactly this. And I understand that you just compute extra eigenvectors then throw out near-duplicates, or those that are too un-eigenvector -- are there good pointers on the alternatives for that, or are

Weighting Preferences for Particular Items in Mahout?

2011-11-17 Thread Jamey Wood
Is there some way to weight particular preferences within Mahout? For example, suppose you were creating some kind of literature recommender that uses a 5-star preference scale. If you wanted to give double the weighting to preferences for novels versus preferences for short stories, what would

Re: Weighting Preferences for Particular Items in Mahout?

2011-11-17 Thread Sean Owen
Not directly, but you could modify an item-based recommender to do so. Where it uses an item-item similarity as a weight in a weighted average, you could modify the weight however you like depending on the types of the two items. On Thu, Nov 17, 2011 at 5:16 PM, Jamey Wood jamey.w...@gmail.com
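
The suggestion above can be sketched roughly like this: an item-based estimate is a similarity-weighted average of the user's ratings, and the weight can be scaled by a per-type multiplier. This is a hedged illustration in plain Python, not Mahout code; `estimate_preference`, the `item_type` and `type_weight` dictionaries, and the constant similarity function are all hypothetical names and data invented for the example.

```python
def estimate_preference(user_prefs, target_item, similarity, item_type, type_weight):
    """Similarity-weighted average of a user's ratings, scaled by item type.

    user_prefs:  {item: rating} for one user
    similarity:  function (item_a, item_b) -> similarity score
    item_type:   {item: type name}                      (assumed metadata)
    type_weight: {type name: multiplier}, default 1.0   (assumed tuning knob)
    """
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_prefs.items():
        if item == target_item:
            continue
        weight = similarity(target_item, item) * type_weight.get(item_type[item], 1.0)
        if weight <= 0.0:
            continue  # this sketch simply ignores non-positive weights
        numerator += weight * rating
        denominator += weight
    return numerator / denominator if denominator > 0.0 else None


# Hypothetical data: one novel rated 5 stars, one short story rated 1 star.
prefs = {"novel_a": 5.0, "story_b": 1.0}
types = {"novel_a": "novel", "story_b": "short_story"}
constant_similarity = lambda a, b: 0.5

plain = estimate_preference(prefs, "novel_c", constant_similarity, types, {})
weighted = estimate_preference(prefs, "novel_c", constant_similarity, types, {"novel": 2.0})
print(plain, weighted)
```

With equal similarities the unweighted estimate is the plain mean of the two ratings, while doubling the novel's weight pulls the estimate toward the novel's rating.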

Re: Understanding the SVD recommender

2011-11-17 Thread Ted Dunning
On Thu, Nov 17, 2011 at 7:21 AM, Jake Mannix jake.man...@gmail.com wrote: On Thu, Nov 17, 2011 at 5:26 AM, Sean Owen sro...@gmail.com wrote: One more question. OK, so I use Lanczos to find V_k by finding the top k eigenvectors of AT * A. A is sparse. But isn't AT * A dense, then? Is that

Re: NewsKMeansClustering does not find any clusters!

2011-11-17 Thread Ahmad Ammari
Hi Grant, I am running the NewsKMeansClustering Class from NetBeans (Run - Run File). I did not change anything in the class code except the name of the input directory, so the class can see the dataset that I want to cluster. So, I changed the statement: String inputDir = inputDir; to: String

Re: Understanding the SVD recommender

2011-11-17 Thread Ted Dunning
Yeah... a good alternative is to use the random projection stuff. On Thu, Nov 17, 2011 at 9:12 AM, Sean Owen sro...@gmail.com wrote: Ah-ha. That's clicked now. Especially as I read the comments and see it already says exactly this. And I understand that you just compute extra eigenvectors

Re: NewsKMeansClustering does not find any clusters!

2011-11-17 Thread Ahmad Ammari
Hi Jeff, Can you please elaborate on what is meant by the -c path? I am running the class NewsKMeansClustering normally from NetBeans (not from a command-line shell, nor from the mahout launcher script). So, I am not including any options with the run. Thanks, Ahmad On Wed, Nov 16, 2011 at 5:22 PM,

Re: Weighting Preferences for Particular Items in Mahout?

2011-11-17 Thread Jamey Wood
Thanks, Sean. We'll look into that. For user-based recommenders (or even just calculating UserSimilarity), would it have the desired effect if we added multiple virtual preference data points for the real items that we wished to more heavily weight? For example, if our real preference data

Austin Hacker Dojo - Big Data Machine Learning

2011-11-17 Thread David Boney
I am interested in starting a hacker dojo in Austin for big data machine learning. We would meet one evening a week to work on coding up Hadoop based machine learning and statistical analysis problems for big data systems. This would be a hacker dojo where the focus is on coding. I can teach

RE: OutofMemoryError when running kmeans or fuzzykmeans cluster method

2011-11-17 Thread Jeff Eastman
How did you set the heap sizes? If you are running on a cluster you need to add properties to your mapred-site.xml. Something like this: <property> <name>mapred.map.child.java.opts</name> <value>-Xmx1500m</value> <description>Java opts for the map tasks. MapR: Default heapsize(-Xmx) is

Re: Weighting Preferences for Particular Items in Mahout?

2011-11-17 Thread Sean Owen
Well I think you could fit it inside some of the user-user similarities, yes. For a Pearson correlation, you could count important items twice or something, yes. I wouldn't do that by literally adding more items to the model as it creates other problems. It's possible; it may or may not have the
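
One way to read "count important items twice" is as a weighted Pearson correlation, where each co-rated item carries a weight instead of being duplicated in the data model. The following is a minimal sketch under that assumption; `weighted_pearson` is a hypothetical helper written for this example, not a Mahout API.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of ratings x and y with per-item weights w.

    An integer weight of 2 gives the same result as listing that item twice.
    Returns None when either user has no rating variance.
    """
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0.0 or var_y == 0.0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings by two users on three co-rated items; the first is "important".
user_a = [5.0, 3.0, 4.0]
user_b = [4.0, 1.0, 5.0]
by_weight = weighted_pearson(user_a, user_b, [2.0, 1.0, 1.0])

# The same computation with the important item literally duplicated.
by_duplication = weighted_pearson(
    [5.0, 5.0, 3.0, 4.0], [4.0, 4.0, 1.0, 5.0], [1.0, 1.0, 1.0, 1.0]
)
print(by_weight, by_duplication)
```

Keeping the weight inside the similarity computation leaves the preference data untouched, which avoids the side effects of adding virtual items to the model.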

Re: Understanding the SVD recommender

2011-11-17 Thread Ted Dunning
Agree. On Thu, Nov 17, 2011 at 11:30 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: However, it would seem to me that QR as a completely isolated job would have little value in machine learning applications.

Re: Understanding the SVD recommender

2011-11-17 Thread Dmitriy Lyubimov
On Thu, Nov 17, 2011 at 11:30 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: I will finish adding an option with the Cholesky decomposition route to SSVD some time early in Q1 2012. PPS I already put some jobs in (they are in the trunk) for the Cholesky route. I thought it would be an easy mod but then

Re: Understanding the SVD recommender

2011-11-17 Thread Sebastian Schelter
I think Dmitriy's description of the SGD and ALS-WR approach hits the nail on the head. However, there is a third way to factorize the rating matrix which we haven't talked about yet. It's described in Yehuda Koren's Collaborative Filtering for Implicit Feedback Datasets

Re: Understanding the SVD recommender

2011-11-17 Thread Dmitriy Lyubimov
Yes. This is even one more step away from straightforward SVD, i.e. explicitly analyzing implicit feedback (pun intended). On Thu, Nov 17, 2011 at 12:38 PM, Sebastian Schelter s...@apache.org wrote: I think Dmitriy's description of the SGD and ALS-WR approach hits the nail on the head.

Re: lsi

2011-11-17 Thread Grant Ingersoll
I've never implemented LSI. Is there a way to incrementally build the model (by simply indexing documents) or is it something that one only runs after the fact once one has built up the much bigger matrix? If it's the former, I bet it wouldn't be that hard to just implement the appropriate

Re: lsi

2011-11-17 Thread Ted Dunning
It is possible to index/vectorize new documents in an existing projection. Building the projection is pretty much a from-scratch operation. Rebuilding the projection can be done pretty infrequently. On Thu, Nov 17, 2011 at 1:47 PM, Grant Ingersoll gsing...@apache.org wrote: I've never

Re: lsi

2011-11-17 Thread Dmitriy Lyubimov
The only way I know of to build the model incrementally is to do a 'fold in' of new observations. However, folding in (which is just a multiplication of a new vector by the matrices, as Ted explained somewhere else) is just a projection into the already trained space of factors, but not a repetition
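
The fold-in described above can be sketched as a single projection. Assuming a term-by-document matrix factored as A ~= U_k * S_k * V_k^T, a new document vector d is commonly mapped into the trained space as S_k^-1 * U_k^T * d. The code below is a minimal pure-Python illustration of that convention, not Mahout's implementation; `fold_in`, the sparse document dictionary, and the tiny matrices are hypothetical, and some setups omit the scaling by the singular values.

```python
def fold_in(doc, u_k, singular_values):
    """Project a term-space document into the trained k-dimensional factor space.

    doc:             sparse document vector {term index: weight}
    u_k:             u_k[t][f] is the loading of term t on factor f (terms x k)
    singular_values: the k retained singular values
    Returns S_k^-1 * U_k^T * d. The trained factors are left unchanged.
    """
    k = len(singular_values)
    projected = [0.0] * k
    for term, weight in doc.items():
        for f in range(k):
            projected[f] += u_k[term][f] * weight
    return [projected[f] / singular_values[f] for f in range(k)]


# Hypothetical trained model: 3 terms, 2 factors, orthonormal columns in U_k.
U_k = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
S_k = [2.0, 4.0]

# A new document using all three terms; term 2 has no loading on either factor.
new_doc = {0: 2.0, 1: 4.0, 2: 7.0}
print(fold_in(new_doc, U_k, S_k))
```

Term 2 contributes nothing to the result because the trained space has no factor for it, which illustrates why fold-in cannot learn anything new from the documents it projects.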

Re: lsi

2011-11-17 Thread Dmitriy Lyubimov
PS the danger of using an overly specific corpus is that training may not be able to learn polysemy very well unless it sees other documents with examples of use of the industry jargon words that may also mean something else. But you definitely want to include documents that do have words

Re: Problem compiling mahout

2011-11-17 Thread Lance Norskog
'chmod' is the program that sets readable permission. It does whatever Windows magic is required to match the POSIX command-line semantics. The Cygwin path is not the true Windows path. So, when Java runs it gets the true path, which has no Cygwin. You have to add c:\cygwin\bin to the Windows path

OutofMemory problem in ClusterDumper

2011-11-17 Thread zou.cl
Hi guys, I just noticed the out-of-memory problem in the ClusterDumper class. It seems that it loads all the data (for example, the clusteredPoints) into the Map container, which costs a huge amount of memory if we have GBs of data. I think we could also use MapReduce to print the results instead of