OK, so can I continue with just 3 nodes? I'm a bit confused right now. By computation time I mean that I need to know how long each test takes... as I said, I can see nothing in my JobTracker: it shows the number of nodes, but no active jobs or map/reduce operations, and I don't know why :/
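
In the meantime, here is a rough sketch of what I plan to try from the command line to get per-job start/finish times, since the web UI shows me nothing. I'm assuming the standard Hadoop 0.20 "hadoop job" commands are available on the Cloudera image; the job ID and output directory below are just placeholders:

# List running jobs (or all jobs, including completed ones)
hadoop job -list
hadoop job -list all

# Show map/reduce completion and counters for one job (placeholder job ID)
hadoop job -status job_201101061530_0001

# Print the recorded history (per-task start/finish times) for a finished job,
# pointing at that job's output directory on HDFS
hadoop job -history data/movielens_2gennaio

Does that sound like a sensible way to time each run? I've also put a sketch of the command for my next round of tests after the quoted message below.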
On Jan 6, 2011, at 17:52, Sean Owen wrote:

> Those numbers seem "reasonable" to a first approximation, maybe a
> little higher than I would have expected given past experience.
>
> You should be able to increase speed with more nodes, sure, but I use
> 3 for testing too.
>
> The jobs are I/O bound for sure. I don't think you will see an
> appreciable difference with different algorithms.
>
> Yes, the amount of data used in the similarity computation is the big
> factor for time. You probably need to tell it to keep fewer item-item
> pairs with the "max" parameters you mentioned earlier.
>
> mapred.num.tasks controls the number of mappers -- or at least
> suggests it to Hadoop.
>
> What do you mean about the time of computation? The JobTracker shows
> you when the individual tasks start and finish.
>
> On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
> <[email protected]> wrote:
>> Hi guys, I'm doing some tests these days and I have some questions.
>> Here are my environment and basic configuration:
>>
>> 1) An Amazon EC2 cluster powered by the Cloudera script with Apache Whirr; I'm
>> using 3 large-instance nodes plus one master node to control the cluster.
>> 2) The MovieLens data sets (100k, 1M, and 10M ratings)... my tests right now
>> are on the 10M version.
>>
>> This is the command I'm using to run the job:
>>
>> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
>>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
>>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
>>   --maxCooccurrencesPerItem 100 \
>>   -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
>>
>> I'm trying different values for:
>>
>> maxSimilaritiesPerItem
>> maxPrefsPerUser
>> maxCooccurrencesPerItem
>>
>> and recommending for about 10 users per run. With this command and the 10M
>> data set, my cluster took more than 4 hours (with 3 nodes) to produce
>> recommendations. Is that a reasonable time?
>>
>> Right now I have two goals, and I'm posting here to ask for your help in
>> figuring out some problems :) My primary goal is to run item-based
>> recommendations and see what happens to my cluster's running time and
>> performance when I change the parameters. I also need to look at the
>> similarities; I will test three of them: cosine, Pearson, and co-occurrence.
>> Are those good choices? I also noticed that the whole similarity computation
>> happens in RAM (right?), so my matrix is built and stored in RAM. Is there
>> another way to do that?
>>
>> - I need to understand what kind of scalability I get with more nodes (3 for
>> now, I can go up to 5). I think the similarity calculation takes most of the
>> time, am I right?
>>
>> - I know there is something like mapred.task to define how many task
>> instances a job can use... do I need that? How can I specify it?
>>
>> - I need to see the exact time of each computation. I'm looking at the
>> JobTracker, but nothing ever seems to appear for my cluster, even when a job
>> (with mapping and reducing) is running. Is there another way to know the
>> exact time of each computation?
>>
>> - Finally, I will take all the data and try to plot it to identify trends
>> based on number of nodes, time, and data set size.
>>
>> Any suggestion you want to give me is welcome :) Thank you, guys
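
PS: just to check that I understood the hint about the task properties and the -s switch, this is the kind of invocation I'm planning for the next round of tests. The output directory and the task counts are placeholders, and SIMILARITY_PEARSON_CORRELATION is my own guess at the constant name, which I still need to check against the 0.5-SNAPSHOT sources:

# Same RecommenderJob, but hinting to Hadoop how many map/reduce tasks to use
# and swapping the similarity measure (constant name is my guess, to be verified)
hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input \
  -Dmapred.output.dir=data/movielens_pearson \
  -Dmapred.map.tasks=12 \
  -Dmapred.reduce.tasks=6 \
  --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
  --maxCooccurrencesPerItem 100 \
  -s SIMILARITY_PEARSON_CORRELATION -n 10 -u users.txt

As far as I understand, mapred.map.tasks is only a hint to Hadoop, while mapred.reduce.tasks is actually honored. Is that right?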
