Hi JU,

Are you sure regarding 1.? That would be a bug. How exactly do you call
the job?

2. The threshold is used during the similarity computation and acts as a
lower bound for the similarities that are considered. For certain measures
(like Pearson or Cosine) it also makes it possible to prune some item pairs
early. You have to choose it experimentally according to your use case (see
the sketch after 3. for how to pass it to the job).

3. The job has a higher computational complexity than ALS and its
runtime depends on the distribution of the interactions, e.g. users with
a high number of interactions cause the job to take a very long time. There
is a parameter that controls this, maxPrefsPerUserInItemSimilarity; by
default it is 1000 (which means at most 1000 interactions per user are
considered). You can set it to something like 500 if you want.
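
Something along these lines should work. This is only a sketch: the option
names match what I see in trunk and might differ slightly in your version,
and the paths and the threshold value of 0.5 are placeholders that you have
to adapt to your setup.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class LaunchItemBasedJob {
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new RecommenderJob(), new String[] {
        "--input", "/path/to/interactions",
        "--output", "/path/to/recommendations",
        "--usersFile", "/path/to/users.txt",
        "--booleanData", "true",
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        // lower bound on the similarities, 0.5 is just an example value
        "--threshold", "0.5",
        // cap the number of interactions considered per user
        "--maxPrefsPerUserInItemSimilarity", "500"
    });
  }
}

You can of course pass the same options on the command line via 'hadoop jar'
instead of using ToolRunner.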

Regarding the fact that only one reducer runs: how large is your input?
Does it span several blocks in HDFS?

48M datapoints is not that much, you could try to do the recommendation
on a single machine if you have sufficient memory. The class
o.a.m.cf.taste.similarity.precompute.example.BatchItemSimilaritiesGroupLens
shows how to precompute the similarities efficiently on a single machine.
After that, you can instantiate a recommender with the precomputed
similarities to get the recommendations for your 110000 users.
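
Roughly like this (again only a sketch, modeled after the GroupLens
example). The file paths, the number of similar items per item and the
2 hour limit are placeholders, and I'm assuming here that the file written
by FileSimilarItemsWriter can be read back with FileItemSimilarity; if that
format doesn't match in your version, load the pairs into a
GenericItemSimilarity yourself.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

public class PrecomputeAndRecommend {
  public static void main(String[] args) throws Exception {
    DataModel dataModel = new FileDataModel(new File("/path/to/interactions.csv"));

    // step 1: precompute the 50 most similar items per item using all cores,
    // give up after at most 2 hours, write the similarity pairs to a file
    ItemBasedRecommender forPrecomputation = new GenericItemBasedRecommender(
        dataModel, new LogLikelihoodSimilarity(dataModel));
    BatchItemSimilarities batch =
        new MultithreadedBatchItemSimilarities(forPrecomputation, 50);
    File similaritiesFile = new File("/path/to/similarities.csv");
    int numSimilarities = batch.computeItemSimilarities(
        Runtime.getRuntime().availableProcessors(), 2,
        new FileSimilarItemsWriter(similaritiesFile));
    System.out.println("precomputed " + numSimilarities + " similarities");

    // step 2: serve the recommendations from the precomputed similarities
    // (boolean data, therefore the boolean variant of the recommender)
    Recommender recommender = new GenericBooleanPrefItemBasedRecommender(
        dataModel, new FileItemSimilarity(similaritiesFile));
    for (long userID : new long[] { 1L, 2L, 3L }) { // loop over your 110000 users here
      List<RecommendedItem> recommendations = recommender.recommend(userID, 10);
      System.out.println(userID + " -> " + recommendations);
    }
  }
}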

On 25.03.2013 10:31, Han JU wrote:
> Hi,
> 
> After ParallelAlsJob, I'm now trying the parallel item-based recommender
> job. Here are some questions.
> 
>    1. I specified a userFile, which contains 110000 different users, but the
>    output contains recommendations for more than this, nearly 130000 users.
>    Why is this?
>    2. How is the threshold value chosen in real cases? For example, I'm
>    using boolean data and LogLikelihood.
>    3. The job runs slowly, nearly 8h on 48M datapoints. By default all jobs
>    have only one reducer, which is the slowest part. How should I choose and
>    set the reducer number to make it faster? For example, the last job,
>    PartialMultiplyMapper-Reducer, takes 7h and its reducer takes 5h. On the
>    same data ParallelAls finishes in 1.5h with the threaded version.
> 
> Thanks!
> 
