Ok, so can I continue with just 3 nodes? I'm a bit confused right now. By
computation time I mean that I need to know how long each test takes... as
I said, I can see nothing from my JobTracker: it shows the number of nodes
but no active jobs or map/reduce operations, and I don't know why :/

Il giorno 06/gen/2011, alle ore 17.52, Sean Owen ha scritto:

> Those numbers seem "reasonable" to a first approximation, maybe a
> little higher than I would have expected given past experience.
> 
> You should be able to increase speed with more nodes, sure, but I use
> 3 for testing too.
> 
> The jobs are I/O bound for sure. I don't think you will see an
> appreciable difference with different algorithms.
> 
> Yes, the amount of data used in the similarity computation is the big
> factor in running time. You probably need to tell it to keep fewer
> item-item pairs with the "max" parameters you mentioned earlier.
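> 
> For example (the caps and the output directory below are just for
> illustration, pick whatever values you want to test), you could rerun the
> same job with smaller limits and compare the running times:
> 
>   hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
>     org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>     -Dmapred.input.dir=input -Dmapred.output.dir=output_smaller_caps \
>     --maxSimilaritiesPerItem 50 --maxPrefsPerUser 20 \
>     --maxCooccurrencesPerItem 50 \
>     -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt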
> 
> mapred.map.tasks controls the number of mappers -- or at least
> suggests it to Hadoop.
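> 
> You pass it as a -D option before the job's own arguments, e.g. (the
> counts here are made up; note that the map count is only a hint to
> Hadoop, while the reduce count is actually honored):
> 
>   hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
>     org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>     -Dmapred.map.tasks=12 -Dmapred.reduce.tasks=12 \
>     -Dmapred.input.dir=input -Dmapred.output.dir=output \
>     ... (rest of the arguments as before)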
> 
> What do you mean about the time of computation? The job tracker shows
> you when the individual tasks start and finish.
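> 
> If the web UI really shows nothing, you can also ask the JobTracker from
> the command line with the standard Hadoop commands (nothing
> Mahout-specific):
> 
>   hadoop job -list            # running jobs and their job IDs
>   hadoop job -list all        # includes completed jobs
>   hadoop job -status <job_id> # progress and counters for one job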
> 
> On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
> <[email protected]> wrote:
>> Hi guys, I'm doing some tests these days and I have some questions.
>> Here is my environment and basic configuration:
>> 
>> 1) Amazon EC2 cluster launched by the Cloudera script with Apache Whirr; I'm
>> using 3 large-instance nodes plus one master node to control the cluster.
>> 2) MovieLens data sets: the 100k, 1 million and 10 million versions... my tests
>> right now are on the 10 million version.
>> 
>> This is the command that I'm using to run the job:
>> 
>> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
>>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
>>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
>>   --maxCooccurrencesPerItem 100 \
>>   -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
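>> 
>> (For context, users.txt is just the list of user IDs I want recommendations
>> for, one ID per line, about 10 IDs per run, e.g.:
>> 
>> 1
>> 2
>> 3
>> 
>> ...the IDs above are only an example.)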
>> 
>> I'm trying different values for:
>> 
>> maxSimilaritiesPerItem
>> maxPrefsPerUser
>> maxCooccurrencesPerItem
>> 
>> and using about 10 users at a time. With this command and the 10 million
>> data set, my cluster took more than 4 hours (with 3 nodes) to produce
>> recommendations. Is that a good time?
>> 
>> 
>> Well, right now I have 2 goals, and I'm posting here to ask for your help in
>> figuring out some problems :) My primary goal is to run item-based
>> recommendations and see how changing the parameters affects the time and
>> performance of my cluster. Also, I need to look at the similarities; I will
>> test three of them: cosine, Pearson, and co-occurrence. Good choices? I also
>> noticed that all the similarity computation happens in RAM (right?), so my
>> matrix is built and stored in RAM; is there another way to do that?
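>> 
>> (Concretely, my plan is to run the same command three times, changing only
>> the -s option and keeping everything else fixed; from the RecommenderJob
>> help the names should be something like:
>> 
>> ... -s SIMILARITY_COOCCURRENCE ...
>> ... -s SIMILARITY_UNCENTERED_COSINE ...
>> ... -s SIMILARITY_PEARSON_CORRELATION ...
>> 
>> but I still have to double-check the exact option values.)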
>> 
>> - I need to understand what kind of scalability I get with more nodes (3 for
>> now, I can go up to 5); I think the similarity calculation takes most of the
>> time, am I right?
>> 
>> - I know there is something like mapred.task to define how many task
>> instances a job can use... do I need that? How can I specify it?
>> 
>> - I need to see the exact time of each computation; I'm looking at the
>> JobTracker but it seems nothing ever shows up in my cluster, even when a job
>> (with mapping and reducing) is running. Is there another way to know the
>> exact time of each computation?
>> 
>> - Finally, I will take all the data and try to plot it to identify trends
>> based on number of nodes, time and data set size.
>> 
>> Well, any suggestions you want to give me are welcome :) Thank you guys
>> 
>> 
