Hi Pat,

ParallelALSFactorizationJob actually implements two different flavours of matrix factorization: one aimed at explicit feedback data (such as ratings), described in "Large-scale Parallel Collaborative Filtering for the Netflix Prize" [1], and another aimed at implicit feedback data (clicks, pageviews, etc.), described in "Collaborative Filtering for Implicit Feedback Datasets" [2]. The first approach is the default behavior; if you specify implicitFeedback = true, then the second approach is used.

I'd recommend having a look at the papers for the details of the parameters, but here's a brief explanation:

--numIterations controls the number of iterations to execute. It might be a burden to have users set this explicitly, but otherwise we would have to check training-error convergence after every iteration, which would slow down this already very slow job even more.

--numFeatures controls the number of latent features used to model the user and item factors. For a production setting I would start experimenting with a low number such as 10 or 20.

--alpha is a special parameter that is only necessary for handling implicit data; have a look at [2].

--lambda is a hyperparameter that controls the regularization. You want your solution to work well on unseen data and not overfit the training data. To find a good lambda, look at the RMSE your factorization gives on unseen data.

Best,
Sebastian

[1] http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
[2] http://research.yahoo.com/pub/2433

On 02.01.2013 17:27, Pat Ferrel wrote:
> What is the intuition regarding the choice or tuning of the ALS params?
>
> Job-Specific Options:
>
>   --lambda lambda                        regularization parameter
>   --implicitFeedback implicitFeedback    data consists of implicit feedback?
>   --alpha alpha                          confidence parameter (only used on
>                                          implicit feedback)
>   --numFeatures numFeatures              dimension of the feature space
>   --numIterations numIterations          number of iterations
>
> I've set up an iterative search for the lambda that gets the lowest rmse but
> what is the likely range? Can the range to search be determined from the data
> (all 1 or nothing in my case).
>
> I do plan to include implicit feedback (values less than 1) eventually. Not
> sure what this controls. I would think implicit feedback means preferences of
> varying strengths and that could be seen in the input so I'm unsure about
> this flag's meaning and use.
>
> No idea what the confidence factor should be or how it is used.
>
> Features? I suppose the number should be much less than the number of items
> but there is a rule of thumb that applies to SVD so I wonder if there is also
> one for ALS-WR?
>
> Iterations seems straightforward since the greater the number the better the
> results. I just need to see where the improvement is too small to warrant the
> time spent.
>
> The only parameter I wonder about for recommendfactorized is the maxRating? I
> assume it is just a scaling factor so all ratings are between 0 and
> maxRating? It doesn't do something unexpected like return anything >
> maxRating as maxRating? In my case I have prefs 0-1 so maxRating is 1? I
> imagine that the math might sometimes produce a rating higher than the max
> pref so this is to clean up the returned ratings range?
>
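P.S. Since you are already sweeping lambda against held-out RMSE, here is a rough, self-contained sketch of what that loop computes. This is plain Python, not Mahout code; it uses a single latent feature (numFeatures = 1) so each regularized least-squares solve reduces to a scalar, uses plain L2 regularization rather than the weighted-lambda variant from [1], and the toy ratings are invented for illustration:

```python
# Toy explicit-feedback ALS with one latent feature (NOT Mahout code).
# Sweeping lambda and picking the one with the lowest held-out RMSE is
# exactly the search described above; all data here is made up.

ratings = {          # (user, item) -> rating, the observed explicit feedback
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 5.0,
}
held_out = {(0, 2): 2.0}   # unseen rating, used to estimate RMSE per lambda

num_users, num_items = 3, 3

def als_rmse(lambda_, num_iterations=20):
    u = [1.0] * num_users  # user factors (numFeatures = 1 for readability)
    v = [1.0] * num_items  # item factors
    for _ in range(num_iterations):
        # Fix V, solve the regularized least squares for each user:
        #   u_i = sum_j r_ij * v_j / (sum_j v_j^2 + lambda)
        for i in range(num_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = sum(v[j] ** 2 for (ui, j) in ratings if ui == i) + lambda_
            u[i] = num / den
        # Fix U, solve for each item symmetrically.
        for j in range(num_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = sum(u[i] ** 2 for (i, ij) in ratings if ij == j) + lambda_
            v[j] = num / den
    # RMSE on held-out data -- the quantity to minimize when sweeping lambda.
    sq = [(u[i] * v[j] - r) ** 2 for (i, j), r in held_out.items()]
    return (sum(sq) / len(sq)) ** 0.5

for lam in (0.01, 0.1, 1.0):
    print("lambda=%.2f  held-out RMSE=%.3f" % (lam, als_rmse(lam)))
```

With more features you would solve a small k-by-k linear system per user and per item instead of a scalar division, which is what the Hadoop job does in parallel.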