Hi Stefano, happy new year too!

The running time of RecommenderJob is proportional neither to the number
of users you want recommendations for nor to the number of
recommendations per user. Those parameters only influence the last step
of the job; most of the time is spent earlier, computing the item-item
similarities, which happens independently of both.

We have some parameters to control the amount of data considered in the
recommendation process. Have you tried adjusting them to your needs? If
you haven't, I think playing with these would be the best place to
start:

  --maxPrefsPerUser maxPrefsPerUser
        Maximum number of preferences considered per user in final
        recommendation phase

  --maxSimilaritiesPerItem maxSimilaritiesPerItem
        Maximum number of similarities considered per item

  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
        Try to cap the number of cooccurrences per item to this number

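For reference, here is a rough sketch of how these flags could be passed
on the command line. The jar name, input/output paths, and the
--usersFile and --numRecommendations flags are illustrative; flag
availability varies by Mahout version, so check the job's --help output
first.

```shell
# Sketch of a RecommenderJob invocation that caps the amount of data
# considered. Paths, jar version, and some flags are assumptions;
# verify against your Mahout version's --help output.
hadoop jar mahout-core-0.4-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input ratings.csv \
  --output recommendations \
  --usersFile users.txt \
  --numRecommendations 10 \
  --maxPrefsPerUser 100 \
  --maxSimilaritiesPerItem 50 \
  --maxCooccurrencesPerItem 100
```

Lowering these caps shrinks the intermediate data the similarity
computation has to shuffle, which is usually where the time goes.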

It would be very cool if you could keep us up to date on your progress
and maybe provide some numbers. There are a lot of things in
RecommenderJob that we could optimize to improve its performance and
scalability, and I think we'd be happy to patch it for you if you run
into a problem.

--sebastian


Am 02.01.2011 10:36, schrieb Stefano Bellasio:
> Hi guys, happy new year :) After several weeks of testing I finally have a 
> complete Amazon EC2/Hadoop environment working, thanks to the Cloudera EC2 
> scripts. Right now I'm running some tests with MovieLens (the 10-million 
> version) and I just need to compute recommendations with different 
> similarities via RecommenderJob; all is ok. I ran an Amazon EC2 cluster 
> with 3 instances, 1 master node and 2 worker nodes (large instances), but 
> even though I know the recommender is not fast, I thought 3 instances 
> would be very fast... my process took about 3 hours to complete for 1 user 
> (I specified the user that needs recommendations in a user.txt file) and 
> just 10 recommendations. So my question is: what is the correct setup for 
> my cluster? How many nodes? How many data nodes, and so on? Is there 
> something I can do to speed up this process? My goal is to recommend on a 
> dataset of about 20/30 GB and 200 million items, so I'm worried about 
> that. 
> 
> Thanks :) Stefano