Hi JU, are you sure regarding 1.? That would be a bug. How exactly do you invoke the job?
2. The threshold is used during the similarity computation and is a lower bound on the similarities considered. For certain measures (like Pearson or cosine) it also allows some item pairs to be pruned early. You have to choose it experimentally according to your use case.

3. The job has a higher computational complexity than ALS, and its runtime depends on the distribution of the interactions; e.g., users with a high number of interactions cause the job to take very long. There is a parameter that controls this, maxPrefsPerUserInItemSimilarity; by default it is 1000 (which means at most 1000 interactions per user are considered). You can set this to something like 500 if you want.

Regarding the fact that only one reducer runs: how large is your input? Does it span several blocks in HDFS?

48M datapoints is not that much; you could try to do the recommendation on a single machine if you have sufficient memory. The class o.a.m.cf.taste.similarity.precompute.example.BatchItemSimilaritiesGroupLens shows how to precompute similarities efficiently on a single machine. After that, you can instantiate a recommender with the similarities to get your 110000 recommendations.

On 25.03.2013 10:31, Han JU wrote:
> Hi,
>
> After ParallelAlsJob, I'm now trying the parallel item-based recommender
> job. Here are some questions.
>
> 1. I specified a usersFile containing 110000 distinct users, but the
> output contains more than that, nearly 130000 users' recommendations. Why
> is this?
> 2. How is the threshold value chosen in real cases? For example, I'm
> using boolean data and log-likelihood.
> 3. The job runs slowly, nearly 8h on 48M datapoints. By default all jobs
> have only one reducer, which is the slowest part. How should I choose and
> set the reducer number to make it faster? For example, the last job,
> PartialMultiplyMapper-Reducer, takes 7h and its reducer takes 5h. On the
> same data ParallelAls finishes in 1.5h with the threaded version.
>
> Thanks!
>
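PS: To make the threshold behavior concrete, here is a small Python sketch (not Mahout's actual code; the function names and the 2x2-table formulation are mine) of the log-likelihood ratio on item co-occurrence counts, Mahout-style mapping of the score into [0, 1), and threshold-based pruning of low-similarity pairs:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2 statistic) for a 2x2
    co-occurrence table:
      k11 = users who interacted with both items,
      k12 / k21 = users who interacted with only one of them,
      k22 = users who interacted with neither."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    def entropy(*counts):
        return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def similarity(k11, k12, k21, k22):
    # squash the unbounded LLR score into [0, 1)
    return 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))

def prune(pairs, threshold):
    # the threshold is a lower bound: pairs scoring below it are dropped
    return {p: s for p, s in pairs.items() if s >= threshold}
```

Two items that always co-occur (k11=10, k22=10, nothing else) score close to 1, while a statistically independent table scores 0, so a threshold of, say, 0.5 keeps only clearly associated pairs.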
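The quadratic blow-up that maxPrefsPerUserInItemSimilarity guards against is easy to see in a few lines. This sketch is illustrative only (random sampling here stands in for the downsampling; Mahout's exact strategy may differ):

```python
import random

def pairs_per_user(num_items):
    # a user with n interactions contributes n * (n - 1) / 2 item pairs
    # to the similarity computation, so one power user dominates runtime
    return num_items * (num_items - 1) // 2

def cap_interactions(items, cap, seed=42):
    """Keep at most `cap` interactions per user, sampled at random."""
    if len(items) <= cap:
        return list(items)
    return random.Random(seed).sample(list(items), cap)

power_user = list(range(4000))
print(pairs_per_user(len(power_user)))                        # 7998000
print(pairs_per_user(len(cap_interactions(power_user, 500))))  # 124750
```

Capping a 4000-interaction user at 500 cuts that user's pair count by a factor of roughly 64, which is why lowering the parameter speeds the job up.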
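And for the single-machine route: once the similarities are precomputed, serving recommendations is cheap. Here is a generic sketch of item-based top-N scoring over boolean data (this is the algorithm, not the Mahout API; in Mahout you would feed the precomputed similarities to a recommender instead):

```python
from collections import defaultdict

def recommend(user_items, sims, n=10):
    """Score each unseen item by summing its similarity to the user's
    known items (boolean data, so there are no ratings to weight by)."""
    seen = set(user_items)
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    return sorted(scores, key=scores.get, reverse=True)[:n]

# tiny hypothetical similarity matrix, symmetric entries listed explicitly
sims = {
    "a": {"b": 0.9, "c": 0.2},
    "b": {"a": 0.9, "d": 0.7},
    "c": {"a": 0.2, "d": 0.4},
}
print(recommend(["a"], sims, n=2))  # ['b', 'c']
```

Looping this over your 110000 users is a single pass over an in-memory map, which is why it can beat an 8h MapReduce run on 48M datapoints.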