On Jan 6, 2011, at 9:00am, Stefano Bellasio wrote:
OK, so can I continue with just 3 nodes? I'm a bit confused right
now. By computation time I mean that I need to know how long each
test takes... As I said, I can see nothing in my JobTracker: it
shows the number of nodes but no active jobs or map/reduce
operations, and I don't know why :/
If it's been more than a day, your job history will disappear (at
least with default settings).
Some other things that can cause jobs on EC2 m1.large instances to run
slower than expected:
* If you're using too much memory on a slave, based on TaskTracker +
DataNode + simultaneous active child JVMs for map & reduce tasks, then
you can get into swap hell. But that's an easy one to check, just log
onto the slave while a job is active, run the top command, and check
swap space usage.
* An m1.large has two drives. With a typical default configuration,
you're only using one of these, and thus cutting your I/O performance
in half.
* Depending on where your servers get allocated, network traffic can
be going through multiple routers to get between systems. EMR is
better at (or will be) getting all of the servers provisioned close to
each other.
* Depending on who you're sharing your virtualized server with, you
can get a node that runs much slower than expected. Usually this
happens when somebody else is hammering the same disk, in my
experience. But it's also usually short-lived, and over time the
effects disappear. But for this reason we try to run jobs with the
number of reduce tasks set to 1.75 * available slots, to help avoid
the lags of waiting for one slow server to complete.
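As a rough illustration of the 1.75 rule in the last bullet, here is a minimal Python sketch. The function name and the example figure of 2 reduce slots per slave are assumptions made for illustration; substitute whatever your TaskTrackers are actually configured with.

```python
import math


def reduce_task_count(num_slaves, reduce_slots_per_slave, factor=1.75):
    """Suggest a reduce task count as factor * total reduce slots.

    Hypothetical helper: the 1.75 multiplier is the rule of thumb from
    the message above; everything else is an illustrative assumption.
    """
    total_slots = num_slaves * reduce_slots_per_slave
    # Round down, but never suggest fewer than one reducer.
    return max(1, math.floor(total_slots * factor))


if __name__ == "__main__":
    # Assumed example: 3 slaves with 2 reduce slots each = 6 slots.
    print(reduce_task_count(3, 2))
```

For 3 slaves with 2 reduce slots each this suggests 10 reduce tasks, which could then be passed to the job as -Dmapred.reduce.tasks=10, assuming the job does not set its own value.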
BTW none of this is specific to Mahout, just general Hadoop/EC2 tuning.
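For the memory bullet above, a back-of-the-envelope budget can hint at whether a slave is likely to swap before you log in and run top. This is only a sketch: the function names, the 512 MB allowance for the OS and the example heap sizes are assumptions, and a real JVM uses more memory than its configured heap.

```python
def slave_memory_budget_mb(daemon_heap_mb, map_slots, reduce_slots,
                           child_heap_mb, num_daemons=2):
    """Estimate worst-case heap demand on one slave, in MB.

    num_daemons=2 stands for the TaskTracker and the DataNode; every map
    and reduce slot is assumed to run a child JVM at the same time.
    """
    daemons = num_daemons * daemon_heap_mb
    children = (map_slots + reduce_slots) * child_heap_mb
    return daemons + children


def likely_to_swap(budget_mb, physical_mb, os_reserve_mb=512):
    """Return True if the budget plus an assumed OS reserve exceeds RAM."""
    return budget_mb + os_reserve_mb > physical_mb


if __name__ == "__main__":
    # Assumed example figures: 7.5 GB of RAM (an m1.large), 1000 MB per
    # daemon, 4 map + 2 reduce slots, 1024 MB per child JVM.
    budget = slave_memory_budget_mb(1000, 4, 2, 1024)
    print(budget, likely_to_swap(budget, 7680))
```

If the check comes back True, reducing the number of slots or the child heap size is the usual first thing to try.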
-- Ken
On 06/Jan/2011, at 17:52, Sean Owen wrote:
Those numbers seem "reasonable" to a first approximation, maybe a
little higher than I would have expected given past experience.
You should be able to increase speed with more nodes, sure, but I use
3 for testing too.
The jobs are I/O bound for sure. I don't think you will see
appreciable difference with different algorithms.
Yes the amount of data used in the similarity computation is the big
factor for time. You probably need to tell it to keep fewer item-item
pairs with the "max" parameters you mentioned earlier.
mapred.map.tasks controls the number of mappers -- or at least
suggests it to Hadoop.
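To see why the mapper count is only a suggestion: with file-based input, Hadoop derives the number of map tasks from the input splits, and the requested count only feeds into the target split size. The sketch below is a simplified model of the split-size rule in the old-API FileInputFormat, max(minSize, min(totalSize / requestedMaps, blockSize)). It ignores per-file boundaries and the small slop allowed on the last split, and the function names and the 64 MB block size are assumptions for illustration. Jobs written against the newer mapreduce API size their splits from min/max split-size settings instead, so the hint may have no effect there.

```python
import math

DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size (64 MB)


def split_size(total_bytes, requested_maps,
               block_size=DEFAULT_BLOCK_SIZE, min_size=1):
    """Simplified split size: max(min_size, min(goal_size, block_size))."""
    goal_size = total_bytes // max(requested_maps, 1)
    return max(min_size, min(goal_size, block_size))


def estimated_map_tasks(total_bytes, requested_maps,
                        block_size=DEFAULT_BLOCK_SIZE, min_size=1):
    """Estimate how many map tasks a single input of total_bytes yields."""
    if total_bytes <= 0:
        return 0
    size = split_size(total_bytes, requested_maps, block_size, min_size)
    return math.ceil(total_bytes / size)


if __name__ == "__main__":
    mb = 1024 * 1024
    # Requesting fewer maps than there are blocks does not reduce the count;
    # requesting more makes the splits smaller than a block.
    for requested in (1, 2, 8):
        print(requested, estimated_map_tasks(256 * mb, requested))
```

In this model a 256 MB input never gets fewer than 4 map tasks, however small the requested number is.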
What do you mean about the time of computation? The job tracker shows
you when the individual tasks start and finish.
On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio wrote:
Hi guys, I've been running some tests these days and I have some
questions. Here are my environment and basic configuration:
1) Amazon EC2 cluster launched with the Cloudera scripts and Apache
Whirr; I'm using 3 nodes on large instances plus one master node
to control the cluster.
2) MovieLens data sets (100K, 1M and 10M); my tests right now are
on the 10M version.
This is the command that I'm using to start the job:
hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
  --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
  --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 \
  -u users.txt
I'm trying different values for:
maxSimilaritiesPerItem
maxPrefsPerUser
maxCooccurrencesPerItem
and using about 10 users at a time. With this command and the 10M
data set, my cluster (3 nodes) took more than 4 hours to produce
recommendations. Is that a good time?
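Since the plan is to try several values for the three parameters, it may help to generate the command lines rather than edit them by hand. Here is a minimal sketch that reuses only the jar path, class name and flags from the command above. The function name, the value grids and the output directory naming scheme are hypothetical.

```python
import itertools

JAR = "/home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar"
JOB_CLASS = "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"


def sweep_commands(max_sims, max_prefs, max_cooccs,
                   similarity="SIMILARITY_COOCCURRENCE"):
    """Build one argument list per combination of the three parameters."""
    commands = []
    for sims, prefs, cooccs in itertools.product(max_sims, max_prefs,
                                                 max_cooccs):
        # Hypothetical naming scheme so each run writes to its own directory.
        out_dir = f"data/run_s{sims}_p{prefs}_c{cooccs}"
        commands.append([
            "hadoop", "jar", JAR, JOB_CLASS,
            "-Dmapred.input.dir=input",
            f"-Dmapred.output.dir={out_dir}",
            "--maxSimilaritiesPerItem", str(sims),
            "--maxPrefsPerUser", str(prefs),
            "--maxCooccurrencesPerItem", str(cooccs),
            "-s", similarity, "-n", "10", "-u", "users.txt",
        ])
    return commands


if __name__ == "__main__":
    # Illustrative grid only; pick values that make sense for your tests.
    for cmd in sweep_commands([50, 150], [30], [100]):
        print(" ".join(cmd))
```

Giving each run its own output directory keeps the results of different parameter settings from overwriting each other.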
Well, right now I have 2 goals, and I'm posting here to ask for
your help in figuring out some problems :) My primary goal is to run
item-based recommendations and see how changing the parameters
affects the running time and performance of my cluster. Also, I need
to look at the similarities; I will be testing three of them: cosine,
Pearson, and co-occurrence. Good choices? I also noticed that all the
similarity computation is done in RAM (right?), so my matrix is built
and stored in RAM. Is there another way to do that?
- I need to understand what kind of scalability I get with more
nodes (3 for now, I can go up to 5). I think the similarity
calculation takes most of the time, am I right?
- I know there is something like mapred.task to define how many
instances a task can use... do I need that? How can I specify
it?
- I need to see the exact time of each computation. I'm looking at
the JobTracker, but nothing ever shows up there for my cluster, even
while a job (with mapping and reducing) is running. Is there another
way to get the exact time of each computation?
- Finally, I will take all the data and plot it to look for
trends based on number of nodes, time and data set size.
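One way to get wall-clock times without relying on the JobTracker pages is to time each run from the machine that submits it and append the result to a CSV file for plotting later. Here is a minimal sketch using only the Python standard library. The function names and the CSV columns are assumptions, and the measured time includes job submission overhead, not just map/reduce work.

```python
import csv
import subprocess
import sys
import time


def timed_run(cmd):
    """Run cmd (a list of arguments) and return (seconds, exit_code)."""
    start = time.monotonic()
    completed = subprocess.run(cmd)
    return time.monotonic() - start, completed.returncode


def log_run(csv_path, label, nodes, seconds, exit_code):
    """Append one row (label, nodes, seconds, exit_code) to a CSV file."""
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow(
            [label, nodes, f"{seconds:.1f}", exit_code])


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real "hadoop jar ..." argument list.
    seconds, code = timed_run([sys.executable, "-c", "pass"])
    print(f"took {seconds:.1f}s, exit code {code}")
```

After each real run, a call such as log_run("timings.csv", "10M-cooccurrence", 3, seconds, code) would record one data point per configuration; the file name and label here are only examples.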
Well, any suggestions you want to give me are welcome :) Thank you
guys
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g