Re: ALS-WR on Million Song dataset

2013-03-21 Thread Han JU
Hi Sebastian, it runs much faster! On the same recommendation task it now terminates in 45 minutes rather than ~2 hours yesterday. I think with further tuning it can be even faster. I'm trying to read the code; any hints for a good starting point? Thanks a lot!

Re: ALS-WR on Million Song dataset

2013-03-21 Thread Sebastian Schelter
Nice to hear that! In order to get into the code, I suggest you first read the papers regarding ALS for Collaborative Filtering: Large-scale Parallel Collaborative Filtering for the Netflix Prize

Re: ALS-WR on Million Song dataset

2013-03-20 Thread Han JU
Hi Sebastian, I've tried the svn trunk. Hadoop constantly complains about memory, e.g. out-of-memory errors. The datanode has 4 physical cores and, with hyper-threading, 16 logical cores, so I set --numThreadsPerSolver to 16, and that seems to cause a problem with memory. How you set your

Re: ALS-WR on Million Song dataset

2013-03-20 Thread Sebastian Schelter
Hi JU, the job creates an OpenIntObjectHashMap<Vector> holding the feature vectors as DenseVectors. In one map job, it is filled with the user-feature vectors, in the next one with the item-feature vectors. I used 4 gigabytes for a dataset with 1.8M users (using 20 features), so I guess that
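A back-of-envelope estimate (my own sketch, not from the thread) of what such a map costs in heap: 1.8M DenseVectors of 20 doubles, plus a rough per-entry allowance for object and hash-map overhead. The 64-byte overhead figure is an assumption for illustration.

```java
// Rough heap estimate for holding per-user feature vectors in an
// OpenIntObjectHashMap-style structure. Overhead constant is a guess.
public class AlsMemoryEstimate {

  static long estimateBytes(long numVectors, int numFeatures) {
    long vectorData = numVectors * numFeatures * 8L; // doubles inside each DenseVector
    long overheadPerEntry = 64L;                     // assumed object + map-entry overhead
    return vectorData + numVectors * overheadPerEntry;
  }

  public static void main(String[] args) {
    long bytes = estimateBytes(1_800_000L, 20);
    System.out.println(bytes / (1024 * 1024) + " MB"); // prints "384 MB"
  }
}
```

So the raw vector data alone is a few hundred megabytes, which is consistent with a 4 GB heap being comfortable for this dataset once JVM and Hadoop overhead are added on top.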

Re: ALS-WR on Million Song dataset

2013-03-20 Thread Sebastian Schelter
I concur with everything that you state. In an ideal world, we would have a framework that offers a well-implemented hybrid hash join [1] that takes advantage of all available memory and gracefully spills to disk once memory is not enough, such as the one used by Stratosphere [2]. Best,

Re: ALS-WR on Million Song dataset

2013-03-20 Thread Han JU
Thanks again Sebastian and Sean, I set -Xmx4000m for mapred.child.java.opts and 8 threads for each mapper. Now the job runs smoothly and the whole factorization ends in 45 minutes. With your settings I think it should be even faster. One more thing is that the RecommenderJob is kind of slow (for all

Re: ALS-WR on Million Song dataset

2013-03-20 Thread Sebastian Schelter
Hi JU, I reworked the RecommenderJob in a similar way to the ALS job. Can you give it a try? You have to apply the patch from https://issues.apache.org/jira/browse/MAHOUT-1169 It introduces a new parameter for RecommenderJob called --numThreads. The configuration of the job should be done similarly to
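A sketch of what invoking the reworked RecommenderJob might look like after applying the MAHOUT-1169 patch. Only --numThreads comes from the message above; the driver name and the other flags and paths are illustrative assumptions.

```shell
# Illustrative invocation; driver name, paths and flags other than
# --numThreads are assumptions, not confirmed by the thread.
bin/mahout recommendfactorized \
  --input /msd/als/userRatings \
  --userFeatures /msd/als/U \
  --itemFeatures /msd/als/M \
  --numRecommendations 10 \
  --numThreads 8 \
  --output /msd/recommendations
```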

Re: ALS-WR on Million Song dataset

2013-03-19 Thread Han JU
Thanks Sebastian and Sean, I will dig more into the paper. From a quick try on a small part of the data, it seems a larger alpha (~40) gets me a better result. Do you have an idea how long ParallelALS will take on the complete 700MB dataset? It contains ~48 million triples. The hadoop cluster

Re: ALS-WR on Million Song dataset

2013-03-19 Thread Sebastian Schelter
Hi JU, we recently rewrote the factorization code; it should be much faster now. You should use the current trunk, make Hadoop schedule only one mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make it reuse the JVMs, and add the parameter --numThreadsPerSolver with the number

ALS-WR on Million Song dataset

2013-03-18 Thread Han JU
Hi, I'm wondering whether someone has tried the ParallelALS implicit-feedback job on the Million Song dataset? Any pointers on alpha and lambda? In the paper alpha is 40 and lambda is 150, but I don't know what their r values in the matrix are. They said it is based on time units that users have watched the

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
One word of caution: there are at least two papers on ALS, and they define lambda differently. I think you are talking about "Collaborative Filtering for Implicit Feedback Datasets". I've been working with some folks who point out that alpha=40 seems to be too high for most datasets. After

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sebastian Schelter
JU, are you referring to this dataset? http://labrosa.ee.columbia.edu/millionsong/tasteprofile

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Han JU
Thanks for the quick responses. Yes, it's that dataset. What I'm using is triples of user_id song_id play_times, for ~1M users. No audio features, just plain-text triples. It seems to me that the paper about implicit feedback matches this dataset well: no explicit ratings, but times of listening to a

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sebastian Schelter
You should also be aware that the alpha parameter comes from a formula the authors introduce to measure the confidence in the observed values: confidence = 1 + alpha * observed_value. You can also change that formula in the code to something you see as a better fit; the paper even suggests
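The confidence formula quoted above is a one-liner; with the paper's alpha=40, even a single play already pushes the confidence far above the baseline of 1 given to unobserved cells:

```java
// confidence = 1 + alpha * observed_value, from the implicit-feedback paper
// discussed in this thread.
public class Confidence {

  static double confidence(double alpha, double observed) {
    return 1.0 + alpha * observed;
  }

  public static void main(String[] args) {
    double alpha = 40.0;
    System.out.println(confidence(alpha, 0)); // unobserved cell: 1.0
    System.out.println(confidence(alpha, 3)); // played 3 times: 121.0
  }
}
```

This is why a large alpha can dominate the loss for heavy listeners, and why Sean's caution about alpha=40 being too high for many datasets matters.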

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
Yes, that's fine input then. A large alpha should go with small R values, not large R. Really, alpha controls how much observed input (R != 0) is weighted towards 1 versus how much unobserved input (R = 0) is weighted towards 0. I scale lambda by alpha to complete this effect.
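Sean's point can be sketched as follows: each observed R is mapped to a binary preference, and lambda is scaled by alpha, as he describes for his implementation. The method names here are illustrative, not Mahout's API.

```java
// Sketch of the implicit-feedback weighting described above: observed cells
// get preference 1, unobserved cells preference 0, and the regularization
// term uses lambda scaled by alpha. Names are illustrative.
public class ImplicitWeighting {

  static double preference(double r) {
    return r > 0 ? 1.0 : 0.0; // any observed play counts as preference 1
  }

  static double scaledRegularization(double lambda, double alpha) {
    return lambda * alpha; // Sean's lambda-by-alpha scaling
  }

  public static void main(String[] args) {
    double alpha = 40.0, lambda = 0.1;
    System.out.println(preference(5));                         // 1.0
    System.out.println(preference(0));                         // 0.0
    System.out.println(scaledRegularization(lambda, alpha));   // 4.0
  }
}
```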