Hi guys, I'm doing some tests these days and I have a few questions. Here is my environment and basic configuration:
1) An Amazon EC2 cluster launched with the Cloudera script via Apache Whirr; I'm using 3 worker nodes (large instances) plus one master node to control the cluster.
2) The MovieLens data sets (100k, 1M and 10M); my tests right now are on the 10M version.

This is the command I'm using to start the job on my cluster:

hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
  --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 \
  -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt

I'm trying different values for maxSimilaritiesPerItem, maxPrefsPerUser and maxCooccurrencesPerItem, and I'm asking for recommendations for about 10 users at a time (via the users.txt passed with -u; a sample is sketched below). With this command and the 10M data set, my cluster (3 nodes) took more than 4 hours to produce the recommendations. Is that a reasonable time?

Right now I have two goals, and I'm posting here to ask for your help with a few problems :) My primary goal is to run item-based recommendations and see how changing the parameters affects the running time and performance of my cluster. I also need to look at the similarities; I will test three of them: cosine, Pearson and co-occurrence. Are those good choices? I also noticed that the similarity computation happens in RAM (right?), so my matrix is built and stored in RAM; is there another way to do that?

- I need to understand what kind of scalability I get with more nodes (3 for now, I can go up to 5); I think the similarity computation takes most of the time, am I right?
- I know there is something like the mapred.task* properties to define how many tasks a job can use... do I need that? How can I specify it? (See the first sketch below.)
- I need to see the exact time of each computation. I'm looking at the JobTracker, but nothing ever seems to show up on my cluster, even while a job (with mapping and reducing) is running. Is there another way to measure the exact time of each computation? (See the second sketch below.)
- Finally, I will take all the data and try to plot it, looking for trends based on the number of nodes, running time and data set size.

Any suggestion you want to give me is welcome :)

Thank you guys
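To make the mapred question concrete, this is the kind of thing I mean. Just a sketch, assuming the stock Hadoop 0.20 (MR1) properties that I believe the Cloudera setup uses; the numbers are placeholders, not recommendations:

  # Per job: request a reducer count via the generic -D options that
  # RecommenderJob already accepts (same mechanism as -Dmapred.input.dir).
  # I believe this only takes effect where the job doesn't fix the number
  # of reducers itself.
  hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -Dmapred.reduce.tasks=6 \
    -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
    --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 \
    -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt

  # Per node: the number of concurrent task slots is set in mapred-site.xml
  # on each worker (mapred.tasktracker.map.tasks.maximum,
  # mapred.tasktracker.reduce.tasks.maximum) and needs a TaskTracker restart.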
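And on the timing question, the best I can do right now is measure the wall-clock time of the whole pipeline from the client, roughly like this (I assume the per-job breakdown should also show up in the JobTracker web UI, but as I said I don't see it there):

  # Wall-clock time of the complete RecommenderJob pipeline from the client;
  # <options> stands for the same options as in the command above.
  time hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob <options>

  # The JobTracker web UI (http://<master>:50030/ on a stock 0.20 install,
  # if I'm not mistaken) should show start/finish times per individual job.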
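For completeness, the users.txt passed via -u is just a plain text file with one userID per line (as Sebastian confirms in the quoted mail below), for example:

  123
  456
  789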
On 02 Jan 2011, at 11:24, Sebastian Schelter wrote:

> On 02.01.2011 11:21, Stefano Bellasio wrote:
>> One question related to users.txt, where I specify the user IDs: how can
>> I list more users? In what format? Right now I think it's one number per
>> row, is that right? Thanks
>>
>
> Exactly, it's one userID per line.
>
> --sebastian
>
>> On 02 Jan 2011, at 11:08, Sebastian Schelter wrote:
>>
>>> Hi Stefano, happy new year too!
>>>
>>> The running time of RecommenderJob is neither proportional to the number
>>> of users you want to compute recommendations for nor to the number of
>>> recommendations per single user. Those parameters only influence the
>>> last step of the job; most of the time is spent before that, computing
>>> item-item similarities, which is done independently of the number of
>>> users you want recommendations for or the number of recommendations
>>> per user.
>>>
>>> We have some parameters to control the amount of data considered in the
>>> recommendation process, have you tried adjusting them to your needs?
>>> If you haven't, I think playing with those would be the best place for
>>> you to start:
>>>
>>> --maxPrefsPerUser maxPrefsPerUser
>>> Maximum number of preferences considered per user in the final
>>> recommendation phase
>>>
>>> --maxSimilaritiesPerItem maxSimilaritiesPerItem
>>> Maximum number of similarities considered per item
>>>
>>> --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
>>> Try to cap the number of co-occurrences per item to this number
>>>
>>> It would be very cool if you could keep us up to date with your progress
>>> and maybe provide some numbers. I think there are a lot of things in
>>> RecommenderJob that we could optimize to increase its performance and
>>> scalability, and we'd be happy to patch it for you if you encounter a
>>> problem.
>>>
>>> --sebastian
>>>
>>>
>>> On 02.01.2011 10:36, Stefano Bellasio wrote:
>>>> Hi guys, happy new year :) After several weeks of testing I finally
>>>> have a complete Amazon EC2/Hadoop working environment, thanks to the
>>>> Cloudera EC2 script. Right now I'm doing some tests with MovieLens
>>>> (the 10M version) and I just need to compute recommendations with
>>>> different similarities via RecommenderJob; everything works. I ran an
>>>> Amazon EC2 cluster with 3 instances, 1 master node and 2 worker nodes
>>>> (large instances), but even though I know the recommender is not fast,
>>>> I expected 3 instances to be quite fast... my process took about
>>>> 3 hours to complete for 1 user (I specified the user that needs
>>>> recommendations in a user.txt file) and just 10 recommendations. So my
>>>> question is: what is the correct setup for my cluster? How many nodes?
>>>> How many data nodes, and so on? Is there something I can do to speed
>>>> this process up? My goal is to compute recommendations on a data set
>>>> of about 20-30 GB and 200 million items, so I'm worried about that.
>>>>
>>>> Thanks :) Stefano
