About the performance issues of EC2: yes, I know it's not a perfect 
environment, but right now I'm looking for hints about tuning the machines 
to do a better job with Hadoop and Mahout. My final goal, as I said, is to plot 
some results for my thesis, so I need to tune everything possible. I'm also 
reading about JVM tuning in Mahout in Action... hope it's worth the read.
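
For what it's worth, the first knob I plan to try is the child-task heap via 
mapred.child.java.opts; something like the sketch below, where the -Xmx value 
and the GC flag are only a guess for m1.large slaves and the output dir is a 
throwaway name:

  hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -Dmapred.child.java.opts="-Xmx1024m -XX:+UseParallelGC" \
    -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_jvmtest \
    --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
    --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
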
On Jan 6, 2011, at 6:48 PM, Ken Krugler wrote:

> 
> On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:
> 
>> Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By 
>> computation time I mean that I need to know how long each test takes... as I 
>> said, I can see nothing from my JobTracker: it shows the number of nodes but 
>> no active jobs or map/reduce operations, and I don't know why :/
> 
> If it's been more than a day, your job history will disappear (at least with 
> default settings).
> 
> Some other things that can cause jobs on EC2 m1.large instances to run slower 
> than expected:
> 
> * If you're using too much memory on a slave, based on TaskTracker + DataNode 
> + simultaneous active child JVMs for map & reduce tasks, then you can get 
> into swap hell. But that's an easy one to check: just log onto the slave 
> while a job is active, run the top command, and check swap space usage 
> (there's a quick sketch of this after the list).
> 
> * An m1.large has two drives. With a typical default configuration, you're 
> only using one of these, and thus cutting your I/O performance in half.
> 
> * Depending on where your servers get allocated, network traffic can end up 
> going through multiple routers to get between systems. EMR is better (or will 
> be) at getting all of the servers provisioned close to each other.
> 
> * Depending on who you're sharing your virtualized server with, you can get a 
> node that runs much slower than expected. In my experience this usually 
> happens when somebody else is hammering the same disk. It's also usually 
> short-lived, and over time the effects disappear. But for this reason we try 
> to run jobs with the number of reduce tasks set to 1.75 * available slots, to 
> help avoid waiting on one slow server to complete.
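> 
> If it helps, here's a rough sketch of how I'd check the first two items and 
> size the reduces. The config paths, mount points, and slot counts are 
> assumptions for a stock Cloudera m1.large setup, so adjust them to what you 
> actually see on your slaves:
> 
>   # 1. Swap check on a slave while the job is running
>   free -m
>   vmstat 5 3    # non-zero si/so columns mean the node is swapping
> 
>   # 2. Are both instance-store drives actually in use?
>   df -h /mnt /mnt2
>   grep -A 1 -E 'dfs.data.dir|mapred.local.dir' \
>       /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/mapred-site.xml
> 
>   # 3. Reduce count ~ 1.75 * available reduce slots, e.g. 3 slaves with
>   #    2 reduce slots each: 1.75 * 6 = 10.5, so pass -Dmapred.reduce.tasks=10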
> 
> BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.
> 
> -- Ken
> 
>> 
>> On Jan 6, 2011, at 5:52 PM, Sean Owen wrote:
>> 
>>> Those numbers seem "reasonable" to a first approximation, maybe a
>>> little higher than I would have expected given past experience.
>>> 
>>> You should be able to increase speed with more nodes, sure, but I use
>>> 3 for testing too.
>>> 
>>> The jobs are I/O bound for sure. I don't think you will see
>>> appreciable difference with different algorithms.
>>> 
>>> Yes, the amount of data used in the similarity computation is the big
>>> factor for time. You probably need to tell it to keep fewer item-item
>>> pairs via the "max" parameters you mentioned earlier.
>>> 
>>> mapred.map.tasks controls the number of mappers -- or at least
>>> suggests it to Hadoop.
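>>> 
>>> For example, something like this tweak of your existing command (the map
>>> task count and the smaller "max" values are only illustrative, not tuned):
>>> 
>>>   hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
>>>     org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>>>     -Dmapred.map.tasks=12 \
>>>     -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_test \
>>>     --maxSimilaritiesPerItem 50 --maxPrefsPerUser 20 \
>>>     --maxCooccurrencesPerItem 50 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt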
>>> 
>>> What do you mean about the time of computation? The job tracker shows
>>> you when the individual tasks start and finish.
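>>> 
>>> If the JobTracker UI really shows you nothing, you can also pull timing
>>> from the command line; roughly (the job id and output dir are placeholders):
>>> 
>>>   hadoop job -list all                    # find the job id
>>>   hadoop job -status <job_id>             # completion and counters
>>>   hadoop job -history all <job output dir on HDFS>   # per-task timings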
>>> 
>>> On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
>>> <[email protected]> wrote:
>>>> Hi guys, I'm doing some tests these days and I have some questions. Here 
>>>> is my environment and basic configuration:
>>>> 
>>>> 1) Amazon EC2 cluster powered by the Cloudera script with Apache Whirr; I'm 
>>>> using 3 nodes with large instances plus one master node to control the 
>>>> cluster.
>>>> 2) MovieLens data sets: 100k, 1 mln and 10 mln ratings... my tests right 
>>>> now are on the 10 mln version.
>>>> 
>>>> This is the command I'm using to run the job on my cluster:
>>>> 
>>>> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar 
>>>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob 
>>>> -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio 
>>>> --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 
>>>> --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
>>>> 
>>>> I'm trying different values for:
>>>> 
>>>> maxSimilaritiesPerItem
>>>> maxPrefsPerUser
>>>> maxCooccurrencesPerItem
>>>> 
>>>> and using about 10 users at a time. With this command and the 10 mln data 
>>>> set, my cluster took more than 4 hours (with 3 nodes) to give 
>>>> recommendations. Is that a good time?
>>>> 
>>>> 
>>>> Well, right now I have 2 goals, and I'm posting here to ask for your help 
>>>> figuring out some problems :) My primary goal is to run item-based 
>>>> recommendations and see what happens to the running time and the performance 
>>>> of my cluster when I change the parameters. I also need to look at the 
>>>> similarities; I will test three of them: cosine, Pearson, and co-occurrence. 
>>>> Good choices? I also noted that all the similarity computation is in RAM 
>>>> (right?), so my matrix is built and stored in RAM; is there another way to 
>>>> do that?
>>>> 
>>>> - I need to understand what kind of scalability I get with more nodes 
>>>> (3 for now, I can go up to 5). I think the similarity calculation takes 
>>>> most of the time, am I right?
>>>> 
>>>> - I know there is something like mapred.task to define how many instances 
>>>> a task can use... do I need that? How can I specify it?
>>>> 
>>>> - I need to see the exact time of each computation. I'm looking at the 
>>>> JobTracker, but it seems nothing ever shows up in my cluster even when a job 
>>>> (with mapping and reducing) is running. Is there another way to know the 
>>>> exact time of each computation?
>>>> 
>>>> - Finally, I will take all the data and try to plot it to figure out 
>>>> some good trends based on the number of nodes, time and data set size.
>>>> 
>>>> Well, any suggestion you want to give me is welcome :) Thank you guys
>>>> 
>>>> 
>> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 
> 
> 
