Hi Sebastian,
It runs much faster! On the same recommendation task it now finishes in
45 minutes rather than the ~2 hours it took yesterday. I think with
further tuning it can be even faster.
I'm trying to read the code now; any hints on a good starting point?
Thanks a lot!
2013/3/20 Sebastian Schelter
Nice to hear that!
In order to get into the code, I suggest you first read the papers
regarding ALS for Collaborative Filtering:
Large-scale Parallel Collaborative Filtering for the Netflix Prize
Hi Sebastian,
I've tried the svn trunk. Hadoop constantly complains about memory, e.g.
with out-of-memory errors.
The datanode has 4 physical cores, which hyper-threading exposes as 16
logical cores, so I set --numThreadsPerSolver to 16, and that seems to
cause the memory problem.
How did you set your ...
Hi JU,
the job creates an OpenIntObjectHashMap<Vector> holding the feature
vectors as DenseVectors. In one map job it is filled with the
user-feature vectors, in the next one with the item-feature vectors.
I used 4 gigabytes for a dataset with 1.8M users (using 20 features),
so I guess that ...
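To make that concrete, here is a rough back-of-envelope estimate (my
own, not measured):

  1.8M vectors * 20 doubles * 8 bytes ~= 288 MB of raw feature data

plus per-object JVM overhead for each DenseVector, its backing array and
its hash-map entry, which can easily double or triple that figure, so a
heap of a few gigabytes is plausible for a dataset of this size.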
I concur with everything you state. In an ideal world, we would have a
framework that offers a well-implemented hybrid hash join [1] that takes
advantage of all available memory and gracefully spills to disk once the
amount of memory is not enough, such as the one used by Stratosphere [2].
Best,
Thanks again Sebastian and Sean, I set -Xmx4000m for
mapred.child.java.opts and 8 threads for each mapper. Now the job runs
smoothly and the whole factorization finishes in 45 minutes. With your
settings I think it could be even faster.
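Concretely, those two knobs amount to the following additions to the
job invocation (quoted from memory, so double-check the exact spelling):

  -Dmapred.child.java.opts=-Xmx4000m   # heap for each child JVM
  --numThreadsPerSolver 8              # solver threads per mapper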
One more thing: the RecommenderJob is kind of slow (for all ...
Hi JU,
I reworked the RecommenderJob in a similar way as the ALS job. Can you
give it a try?
You have to try the patch from
https://issues.apache.org/jira/browse/MAHOUT-1169
It introduces a new parameter to RecommenderJob called --numThreads. The
configuration of the job should be done similarly to ...
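An untested sketch of an invocation with the patch applied; the paths
are placeholders, and the flag names besides --numThreads are from
memory, so check them against the job's --help output:

  hadoop jar mahout-core-*-job.jar \
    org.apache.mahout.cf.taste.hadoop.als.RecommenderJob \
    --input /data/triples \
    --userFeatures /data/als/U --itemFeatures /data/als/M \
    --numRecommendations 10 \
    --output /data/recommendations \
    --numThreads 8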
Thanks Sebastian and Sean, I will dig more into the paper.
From a quick try on a small part of the data, it seems that a larger
alpha (~40) gets me a better result.
Do you have an idea how long ParallelALS will take for the complete
700 MB dataset? It contains ~48 million triples. The Hadoop cluster ...
Hi JU,
We recently rewrote the factorization code; it should be much faster
now. You should use the current trunk, make Hadoop schedule only one
mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
number ...
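Putting those pieces together, the invocation should look roughly like
the following; the paths and parameter values are placeholders, and the
JVM-reuse property name is from memory:

  hadoop jar mahout-core-*-job.jar \
    org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
    -Dmapred.tasktracker.map.tasks.maximum=1 \
    -Dmapred.job.reuse.jvm.num.tasks=-1 \
    --input /data/triples --output /data/als \
    --numFeatures 20 --numIterations 10 --lambda 0.065 \
    --implicitFeedback true --alpha 40 \
    --numThreadsPerSolver 16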
Hi,
I'm wondering whether someone has tried the ParallelALS with implicit
feedback job on the Million Song Dataset? Any pointers on alpha and
lambda?
In the paper alpha is 40 and lambda is 150, but I don't know what their
r values in the matrix are. They say it is based on the time units for
which users have watched the ...
One word of caution: there are at least two papers on ALS, and they
define lambda differently. I think you are talking about Collaborative
Filtering for Implicit Feedback Datasets.
I've been working with some folks who point out that alpha=40 seems to
be too high for most datasets. After ...
JU,
are you referring to this dataset?
http://labrosa.ee.columbia.edu/millionsong/tasteprofile
Thanks for the quick responses.
Yes, it's that dataset. What I'm using are the triplets of user_id,
song_id, play_times, covering ~1M users. No audio features, just
plain-text triples.
It seems to me that the paper about implicit feedback matches this
dataset well: no explicit ratings, but counts of how often users
listened to a ...
You should also be aware that the alpha parameter comes from a formula
the authors introduce to measure the confidence in the observed values:
confidence = 1 + alpha * observed_value
You can also change that formula in the code to something that you find
more fitting; the paper even suggests ...
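For reference, the variant the paper discusses is a log-based
confidence, with epsilon as an additional scaling parameter you have to
choose:

  confidence = 1 + alpha * log(1 + observed_value / epsilon)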
Yes, that's fine input then.
A large alpha should go with small R values, not large R. Really, alpha
controls how much observed input (R != 0) is weighted towards 1 versus
how much unobserved input (R = 0) is weighted towards 0. I scale lambda
by alpha to complete this effect.
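To spell out what that scaling balances: the implicit-feedback
objective from the paper is

  sum over (u,i) of confidence(R_ui) * (P_ui - x_u . y_i)^2
    + lambda * (sum_u ||x_u||^2 + sum_i ||y_i||^2)

where P_ui = 1 if R_ui != 0 and 0 otherwise. Because alpha inflates the
confidence weights on the observed cells, scaling lambda by alpha keeps
the regularization in proportion to the error term.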