Hi Ken, thanks. Right now I'm not saving my output data in S3 or EBS, so when my cluster finishes I download the output file and switch the machines off. I was assuming the JobTracker view was real-time; is it not?
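If I decide to keep the results around before shutting everything down, I suppose something like this would copy them from HDFS to S3 (the bucket name and keys here are made up, and I haven't actually tried it yet):

    # copy the job output from HDFS to S3 before terminating the instances
    hadoop distcp \
      -Dfs.s3n.awsAccessKeyId=MY_ACCESS_KEY \
      -Dfs.s3n.awsSecretAccessKey=MY_SECRET_KEY \
      data/movielens_2gennaio \
      s3n://my-mahout-results/movielens_2gennaio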
On Jan 6, 2011, at 18:48, Ken Krugler wrote:

> On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:
>
>> Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By
>> computation time I mean that I need to know how long every test takes...
>> as I said, I can see nothing in my JobTracker: it shows the number of
>> nodes, but no active jobs or map/reduce operations, and I don't know why :/
>
> If it's been more than a day, your job history will disappear (at least
> with default settings).
>
> Some other things that can cause jobs on EC2 m1.large instances to run
> slower than expected:
>
> * If you're using too much memory on a slave, based on TaskTracker +
>   DataNode + simultaneous active child JVMs for map & reduce tasks, then
>   you can get into swap hell. But that's an easy one to check: just log
>   onto the slave while a job is active, run the top command, and check
>   swap space usage (see the sketch below).
>
> * An m1.large has two drives. With a typical default configuration you're
>   only using one of them, and thus cutting your I/O performance in half
>   (see the config sketch below).
>
> * Depending on where your servers get allocated, network traffic may be
>   going through multiple routers to get between systems. EMR is better at
>   (or will be better at) getting all of the servers provisioned close to
>   each other.
>
> * Depending on who you're sharing your virtualized server with, you can
>   get a node that runs much slower than expected. Usually this happens
>   when somebody else is hammering the same disk, in my experience. But
>   it's also usually short-lived, and over time the effects disappear.
>   For this reason we try to run jobs with the number of reduce tasks set
>   to 1.75 * available slots, to help avoid waiting on one slow server to
>   complete.
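> For example, the swap check on a slave is just (assuming you can SSH into
> the node as usual):
>
>     # on a slave, while a job is active
>     top          # look at the Swap: line and per-process RES sizes
>     free -m      # anything non-zero in the Swap "used" column is bad news
>     vmstat 5     # si/so columns above zero mean you're actively paging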
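> And for the two drives, something like the following in hdfs-site.xml and
> mapred-site.xml spreads I/O across both (the /mnt and /mnt2 mount points
> are a guess on my part; check where your AMI actually mounts the
> ephemeral drives):
>
>     <!-- hdfs-site.xml: let the DataNode write blocks to both drives -->
>     <property>
>       <name>dfs.data.dir</name>
>       <value>/mnt/hadoop/dfs/data,/mnt2/hadoop/dfs/data</value>
>     </property>
>
>     <!-- mapred-site.xml: put map spill/shuffle files on both drives too -->
>     <property>
>       <name>mapred.local.dir</name>
>       <value>/mnt/hadoop/mapred/local,/mnt2/hadoop/mapred/local</value>
>     </property>
>
> On the reduce count: with 3 slaves at the usual 2 reduce slots each, the
> 1.75 rule works out to roughly 10, i.e. -Dmapred.reduce.tasks=10 on your
> command line.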
>
> BTW, none of this is specific to Mahout; it's just general Hadoop/EC2
> tuning.
>
> -- Ken
>
>> On Jan 6, 2011, at 17:52, Sean Owen wrote:
>>
>>> Those numbers seem "reasonable" to a first approximation, maybe a
>>> little higher than I would have expected given past experience.
>>>
>>> You should be able to increase speed with more nodes, sure, but I use
>>> 3 for testing too.
>>>
>>> The jobs are I/O bound for sure. I don't think you will see an
>>> appreciable difference between algorithms.
>>>
>>> Yes, the amount of data used in the similarity computation is the big
>>> factor for time. You probably need to tell it to keep fewer item-item
>>> pairs with the "max" parameters you mentioned earlier.
>>>
>>> mapred.map.tasks controls the number of mappers -- or at least
>>> suggests it to Hadoop.
>>>
>>> What do you mean about the time of computation? The job tracker shows
>>> you when the individual tasks start and finish.
>>>
>>> On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
>>> <[email protected]> wrote:
>>>> Hi guys, I'm running some tests these days and I have some
>>>> questions. Here is my environment and basic configuration:
>>>>
>>>> 1) An Amazon EC2 cluster created with the Cloudera script and Apache
>>>> Whirr: 3 large-instance slave nodes, plus one master node to control
>>>> the cluster.
>>>> 2) The MovieLens data sets (100k, 1M, and 10M ratings); my tests
>>>> right now are on the 10M version.
>>>>
>>>> This is the command that I'm using to start my job:
>>>>
>>>> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
>>>>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>>>>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
>>>>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
>>>>   --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 \
>>>>   -u users.txt
>>>>
>>>> I'm trying different values for:
>>>>
>>>> maxSimilaritiesPerItem
>>>> maxPrefsPerUser
>>>> maxCooccurrencesPerItem
>>>>
>>>> and using about 10 users per run. With this command and the 10M data
>>>> set, my cluster took more than 4 hours (with 3 nodes) to produce
>>>> recommendations. Is that a good time?
>>>>
>>>> Right now I have two goals, and I'm posting here to ask for your
>>>> help with some problems :) My primary goal is to run item-based
>>>> recommendations and see what happens to time and cluster performance
>>>> when I change the parameters. I also need to look at the similarity
>>>> measures; I will test three of them: cosine, Pearson, and
>>>> co-occurrence. Good choices? I also noticed that the similarity
>>>> computation happens in RAM (right?), so my matrix is built and
>>>> stored in RAM. Is there another way to do that?
>>>>
>>>> - I need to understand what kind of scalability I get with more
>>>> nodes (3 for now; I can go up to 5). I think the similarity
>>>> calculation takes most of the time; am I right?
>>>>
>>>> - I know there is something like mapred.task to define how many
>>>> instances a task can use... do I need that? How can I specify it?
>>>>
>>>> - I need to see the exact time of each computation. I'm watching the
>>>> JobTracker, but it seems nothing ever shows up there, even while a
>>>> job (with its map and reduce phases) is running. Is there another
>>>> way to get the exact time of each computation?
>>>>
>>>> - Finally, I will take all the data and try to plot it, to see
>>>> trends based on number of nodes, time, and data set size.
>>>>
>>>> Any suggestion you want to give me is welcome :) Thank you guys

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
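PS: on the timing question, if the JobTracker history has already expired, I guess I could also pull start/finish times out of the job history files with something like this (untested; the path is whatever mapred.output.dir pointed to):

    # wall-clock time for the whole run
    time hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
        org.apache.mahout.cf.taste.hadoop.item.RecommenderJob ...

    # per-job breakdown, read from <output>/_logs/history
    hadoop job -history data/movielens_2gennaio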
