Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?

Ken Krugler Thu, 06 Jan 2011 09:49:30 -0800


On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:

Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By computation time I mean that I need to know how long each test takes... As I said, I can see nothing in my JobTracker: it shows the number of nodes but no active jobs or map/reduce operations, and I don't know why :/

If it's been more than a day, your job history will disappear (at least with default settings).

Some other things that can cause jobs on EC2 m1.large instances to run slower than expected:

* If you're using too much memory on a slave, based on TaskTracker + DataNode + simultaneous active child JVMs for map & reduce tasks, then you can get into swap hell. But that's an easy one to check: just log on to the slave while a job is active, run the top command, and check swap space usage.

* An m1.large has two drives. With a typical default configuration, you're only using one of these, and thus cutting your I/O performance in half.

* Depending on where your servers get allocated, network traffic can be going through multiple routers to get between systems. EMR is better at (or will be) getting all of the servers provisioned close to each other.

* Depending on who you're sharing your virtualized server with, you can get a node that runs much slower than expected. Usually this happens when somebody else is hammering the same disk, in my experience. But it's also usually short-lived, and over time the effects disappear. For this reason we try to run jobs with the number of reduce tasks set to 1.75 * available slots, to help avoid the lag of waiting for one slow server to complete.
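
The memory and reduce-count points in the list above can be sanity-checked with a little arithmetic. What follows is a minimal Python sketch, not a measured configuration. The 7.5 GB figure is the advertised RAM of an m1.large. The daemon heaps, child JVM heap, and slot counts are assumed values to be replaced with your own settings (typically HADOOP_HEAPSIZE, mapred.child.java.opts, and the mapred.tasktracker.*.tasks.maximum properties). The function names are hypothetical.

```python
import math

def slave_memory_mb(tasktracker_mb, datanode_mb, map_slots, reduce_slots, child_heap_mb):
    """Rough worst-case heap demand on one slave: both daemons plus every task slot busy."""
    return tasktracker_mb + datanode_mb + (map_slots + reduce_slots) * child_heap_mb

def suggested_reduce_tasks(slaves, reduce_slots_per_slave, factor=1.75):
    """Reduce-task count as factor * available reduce slots, rounded down."""
    return math.floor(factor * slaves * reduce_slots_per_slave)

# Assumed example values, not taken from the thread.
M1_LARGE_RAM_MB = 7680  # 7.5 GB advertised for an m1.large
demand = slave_memory_mb(tasktracker_mb=1000, datanode_mb=1000,
                         map_slots=4, reduce_slots=2, child_heap_mb=1024)
print("heap demand MB:", demand, "fits in RAM:", demand < M1_LARGE_RAM_MB)
print("reduce tasks for 3 slaves:", suggested_reduce_tasks(3, 2))
```

With these assumed numbers the heaps alone exceed physical RAM, which is the situation that leads to swapping. Real usage is higher still, since a JVM needs memory beyond its heap.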


BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.

-- Ken

On 06 Jan 2011, at 17:52, Sean Owen wrote:
Those numbers seem "reasonable" to a first approximation, maybe a
little higher than I would have expected given past experience.

You should be able to increase speed with more nodes, sure, but I use
3 for testing too.

The jobs are I/O bound for sure. I don't think you will see
appreciable difference with different algorithms.

Yes the amount of data used in the similarity computation is the big
factor for time. You probably need to tell it to keep fewer item-item
pairs with the "max" parameters you mentioned earlier.

mapred.map.tasks controls the number of mappers -- or at least
suggests it to Hadoop.
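
Why the setting is only a suggestion can be seen from how the old-API FileInputFormat picks a split size. The requested map count sets a goal size, which is then capped by the block size and floored by the minimum split size. A simplified Python sketch of that rule follows. It assumes a single splittable input file, ignores the small slop Hadoop allows on the last split, and uses made-up file and block sizes. In stock Hadoop of this era the hint is read from mapred.map.tasks.

```python
import math

def split_size(total_bytes, requested_maps, block_bytes, min_split_bytes=1):
    """Mirror of max(minSize, min(goalSize, blockSize)) from the old-API FileInputFormat."""
    goal = total_bytes // max(requested_maps, 1)
    return max(min_split_bytes, min(goal, block_bytes))

def approx_map_tasks(total_bytes, requested_maps, block_bytes, min_split_bytes=1):
    """Approximate number of splits (and so map tasks) for one splittable file."""
    size = split_size(total_bytes, requested_maps, block_bytes, min_split_bytes)
    return math.ceil(total_bytes / size)

MB = 1024 * 1024
# Hypothetical 256 MB input stored in 64 MB blocks.
print(approx_map_tasks(256 * MB, requested_maps=2, block_bytes=64 * MB))   # 4: block size wins over the hint
print(approx_map_tasks(256 * MB, requested_maps=16, block_bytes=64 * MB))  # 16: hint honoured via smaller splits
```

In this simplified model, asking for fewer maps than there are blocks has no effect, while asking for more produces smaller splits.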

What do you mean about the time of computation? The job tracker shows
you when the individual tasks start and finish.

On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio wrote:
Hi guys, I've been doing some tests these days and I have some questions. Here is my environment and basic configuration:
1) Amazon EC2 cluster powered by the Cloudera script with Apache Whirr; I'm using 3 nodes with large instances + one master node to control the cluster.
2) MovieLens data sets with 100k, 1 million and 10 million ratings... my tests right now are on the 10 million version.
This is the command that I'm using to start my job:
hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
I'm trying different values for:

maxSimilaritiesPerItem
maxPrefsPerUser
maxCooccurrencesPerItem
and using about 10 users at a time. With this command on the 10 million data set, my cluster took more than 4 hours (with 3 nodes) to produce recommendations. Is that a good time?
Well, right now I have 2 goals, and I'm posting here to ask for your help in figuring out some problems :) My primary goal is to run item-based recommendations and see how changing the parameters affects the run time and performance of my cluster. Also, I need to look at the similarities; I will be testing three of them: cosine, Pearson, and co-occurrence. Good choices? I also noticed that all the similarity computation is done in RAM (right?), so my matrix is built and stored in RAM; is there another way to do that?
- I need to understand what kind of scalability I get with more nodes (3 for now, I can go up to 5). I think the similarity calculation takes most of the time, am I right?
- I know there is something like mapred.task to define how many instances a task can use... do I need that? How can I specify this?
- I need to see the exact time of each computation. I'm looking at the JobTracker, but it seems that nothing ever shows up there for my cluster, even while a job (with mapping and reducing) is running. Is there another way to know the exact time of each computation?
- Finally, I will take all the data and try to plot it to find some good trends based on number of nodes, time and data set size.
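
For the timing and scaling questions in the list above, one option that does not depend on the JobTracker UI is to time each run from the machine that submits it, then derive speedup from the recorded wall-clock times. Below is a minimal Python sketch, assuming the job is launched as a blocking command. The helper names are hypothetical, and the command shown is a placeholder rather than the full RecommenderJob invocation.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run a blocking command and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start

def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds

def efficiency(baseline_nodes, baseline_seconds, nodes, seconds):
    """Speedup divided by the increase in node count; 1.0 means linear scaling."""
    return speedup(baseline_seconds, seconds) / (nodes / baseline_nodes)

if __name__ == "__main__":
    # Placeholder command; substitute the real "hadoop jar ..." argument list.
    code, elapsed = time_command([sys.executable, "-c", "pass"])
    print("exit code", code, "elapsed seconds", round(elapsed, 3))
```

Recording the node count, data set size, and elapsed seconds for each run gives the table needed for the plots described in the last item.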
Well, any suggestion you want to give me is welcome :) Thank you guys


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
