My experience is that until you get to 5-10 nodes, using Hadoop will be slower than a sequential implementation.
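If you want to sanity-check that, it's easy to time the same item-based approach
in one JVM with the non-distributed Taste classes. A minimal sketch (the class
name, file name, and user ID are placeholders, the input must be one
userID,itemID,rating triple per line, and Pearson stands in for whichever
similarity you are testing):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class SequentialBaseline {
      public static void main(String[] args) throws Exception {
        // ratings.csv is a placeholder: comma-separated preference file
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Item-item similarity computed in memory
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        long start = System.currentTimeMillis();
        // Top-10 recommendations for one sample user (ID 42 is arbitrary)
        List<RecommendedItem> recs = recommender.recommend(42L, 10);
        System.out.println(recs);
        System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
      }
    }

On the 10M set this needs a fairly large heap, but it gives you the sequential
number the cluster has to beat.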
You can definitely continue with 3 nodes as Sean suggests for testing, but I
would not expect this to be a performant solution.

On Thu, Jan 6, 2011 at 9:00 AM, Stefano Bellasio
<[email protected]> wrote:
> Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By
> computation time I mean that I need to know how much time every test
> takes... as I said, I can see nothing from my JobTracker: it shows the
> number of nodes but no active job or map/reduce operations, and I don't
> know why :/
>
> On 06 Jan 2011, at 17:52, Sean Owen wrote:
>
> > Those numbers seem "reasonable" to a first approximation, maybe a
> > little higher than I would have expected given past experience.
> >
> > You should be able to increase speed with more nodes, sure, but I use
> > 3 for testing too.
> >
> > The jobs are I/O bound for sure. I don't think you will see an
> > appreciable difference with different algorithms.
> >
> > Yes, the amount of data used in the similarity computation is the big
> > factor for time. You probably need to tell it to keep fewer item-item
> > pairs with the "max" parameters you mentioned earlier.
> >
> > mapred.map.tasks controls the number of mappers -- or at least
> > suggests it to Hadoop.
> >
> > What do you mean about the time of computation? The job tracker shows
> > you when the individual tasks start and finish.
> >
> > On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
> > <[email protected]> wrote:
> >> Hi guys, I'm running some tests these days and I have a few questions.
> >> Here are my environment and basic configuration:
> >>
> >> 1) An Amazon EC2 cluster launched with the Cloudera script and Apache
> >> Whirr; I'm using 3 worker nodes on large instances plus one master
> >> node to control the cluster.
> >> 2) The MovieLens data sets with 100K, 1M, and 10M ratings... my tests
> >> right now are on the 10M version.
> >>
> >> This is the command I'm using to run the job:
> >>
> >> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
> >>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
> >>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
> >>   --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 \
> >>   -u users.txt
> >>
> >> I'm trying different values for:
> >>
> >> maxSimilaritiesPerItem
> >> maxPrefsPerUser
> >> maxCooccurrencesPerItem
> >>
> >> and using about 10 users at a time. With this command and the 10M data
> >> set, my cluster took more than 4 hours (with 3 nodes) to produce
> >> recommendations. Is that a good time?
> >>
> >> Right now I have 2 goals, and I'm posting here to ask for your help in
> >> figuring out some problems :) My primary goal is to run item-based
> >> recommendations and see how changing the parameters affects the time
> >> and performance of my cluster. I also need to look at the
> >> similarities; I will test three of them: cosine, Pearson, and
> >> co-occurrence. Good choices? I also noticed that all the similarity
> >> computation happens in RAM (right?), so my matrix is built and stored
> >> in RAM; is there another way to do that?
> >>
> >> - I need to understand what kind of scalability I get with more nodes
> >> (3 for now, I can go up to 5); I think the similarity calculation
> >> takes most of the time, am I right?
> >>
> >> - I know there is something like mapred.task to define how many
> >> instances a task can use... do I need that? How can I specify it?
> >>
> >> - I need to see the exact time of each computation; I'm looking at
> >> the JobTracker but nothing ever seems to show up in my cluster, even
> >> when a job (with mapping and reducing) is running. Is there another
> >> way to get the exact time of each computation?
> >>
> >> - Finally, I will take all the data and try to plot it to find trends
> >> based on the number of nodes, time, and data set size.
> >>
> >> Well, any suggestion you want to give me is welcome :) Thank you, guys
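Two concrete notes on the questions in the quoted mail: the reduce count can be
passed with -D like the other job settings, and for overall timing you can wrap
the job in a tiny driver. A sketch (the class name, paths, and reduce count are
placeholders, and six reducers is just a guess at roughly 2 per worker on a
3-node cluster; the other parameters are the ones from Stefano's command):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class TimedRecommenderJob {
      public static void main(String[] args) throws Exception {
        String[] jobArgs = {
          "-Dmapred.input.dir=input",
          "-Dmapred.output.dir=output",
          "-Dmapred.reduce.tasks=6",     // guess: ~2 reducers per worker
          "--similarityClassname", "SIMILARITY_COOCCURRENCE",
          "--numRecommendations", "10",
          "--maxSimilaritiesPerItem", "150",
          "--maxPrefsPerUser", "30",
          "--maxCooccurrencesPerItem", "100",
          "--usersFile", "users.txt"
        };
        long start = System.currentTimeMillis();
        // RecommenderJob implements Tool, so ToolRunner parses the -D options
        ToolRunner.run(new RecommenderJob(), jobArgs);
        System.out.println("Total wall-clock time: "
            + (System.currentTimeMillis() - start) / 1000 + " s");
      }
    }

For the per-phase times, the JobTracker is still the right place to look:
RecommenderJob submits several MapReduce jobs in sequence, and each shows its
own start and finish times once it is actually submitted. If nothing appears
there at all, the client may be pointed at the wrong JobTracker address.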

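One last thing on the input data: the 1M and 10M MovieLens ratings files are
"::"-delimited, while RecommenderJob wants one userID,itemID,rating triple per
line, so if you haven't already converted them you'll need something like this
throwaway converter (class and file names are placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class ConvertMovieLens {
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("ratings.dat"));
        PrintWriter out = new PrintWriter("ratings.csv");
        String line;
        while ((line = in.readLine()) != null) {
          // UserID::MovieID::Rating::Timestamp -> userID,itemID,rating
          String[] f = line.split("::");
          out.println(f[0] + ',' + f[1] + ',' + f[2]);
        }
        out.close();
        in.close();
      }
    }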