Hi Sebastian,

My data model contains 17733658 data points; there are 230116 unique users (U(Ix)) and 208760 unique items (I(Ux)). The data points are in a way both dense and sparse, because I am testing merging 2 datasets and inverting the result so that I can use the ItemSimilarityJob:

e.g.:

I = Item
U = User

Dataset1 (the sparse one):
  I1 I2 I3 I4
U1 9        8
U2 7     4
U3    8     5
U5 5     9

Dataset2 (the dense one, but it has far fewer items than Dataset1):

  I5 I6
U1 1  2
U2 3  2
U3 2
U4 5  3
U5 1  1

Inverted Dataset (1+2), so users are items and vice versa:

     I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9     7                 5
U(I2)             8
U(I3)       4                 9
U(I4) 8           5
U(I5) 1     3     2     5     1
U(I6) 2     2           3     1

So yes, you're right: because of this inversion I have users with a lot of preferences (nearly as many as there are users in Dataset2), and I can understand why the system seems to stop.
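
To make the inversion step concrete: the transposition could be done with a tiny Hadoop mapper that swaps the user and item columns before the data goes into the ItemSimilarityJob. A minimal sketch, assuming "userID,itemID,pref" CSV input (the class name TransposeMapper is just my placeholder):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps user and item IDs in each "userID,itemID,pref" line, so that the
// ItemSimilarityJob afterwards effectively computes user-user similarities.
public class TransposeMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] t = line.toString().split(",");            // userID, itemID, pref
    ctx.write(new Text(t[1] + "," + t[0] + "," + t[2]), // itemID, userID, pref
              NullWritable.get());
  }
}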

Maybe inverting the data isn't a good approach for this purpose, and I have to write my own UserUserSimilarityJob. (OK, at the moment I have no idea how to do this, because I have just started with Hadoop and MapReduce, but I can try ;-) ).
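
One thing I might do before writing a full distributed job: sanity-check the user-user similarities in memory with Taste's non-distributed classes on a small sample. A minimal sketch, assuming the same CSV format; that LogLikelihoodSimilarity matches the distributed SIMILARITY_LOGLIKELIHOOD exactly is my assumption:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserSimilaritySanityCheck {
  public static void main(String[] args) throws Exception {
    // sample.csv holds "userID,itemID,pref" lines, like the job input
    DataModel model = new FileDataModel(new File("sample.csv"));
    UserSimilarity similarity = new LogLikelihoodSimilarity(model);
    // log-likelihood similarity between user 1 and user 2
    System.out.println(similarity.userSimilarity(1L, 2L));
  }
}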

Do you have some other hints I can try?


Can you say how many data points your data contains and how dense they are? 200 MB doesn't seem that much; it shouldn't take hours with 8 m1.large instances.

Can you give us the values of the following counters?

MaybePruneRowsMapper: Elements.USED
MaybePruneRowsMapper: Elements.NEGLECTED

CooccurrencesMapper: Counter.COOCCURRENCES

I'm not sure if I can find the data you want in the logs, but maybe this log sample helps:

MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670

I can't find Counter.COOCCURRENCES.
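
(For what it's worth, adding the two pruning counters up: 6821670 used + 8798627 neglected = 15620297 elements in total, so roughly 56% of the elements were pruned away before the co-occurrence computation.)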


INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main):     Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%


By the way, I see that you're located in Berlin. I have some free time in the next 2 weeks; if you want, we could meet for a coffee and you'll get some free consultation!

It would be really great to meet you, but only the head office is in Berlin. I am in Dresden, and although that is not far away, it does not look like I can go to Berlin. Maybe it will work out later when I visit the headquarters. I am sure you could explain a lot to me.



Thanks in advance
Thomas








On 14.04.2011 12:18, Thomas Rewig wrote:
 Hello
Right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?

In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.small --master-instance-type m1.small --num-instances 3 --name mahout-0.5-itemSimJob-TEST


ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId

Here everything worked well, even though it took a few minutes.

In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.large --master-instance-type m1.large --num-instances 8 --name mahout-0.5-itemSimJob-TEST2

I logged in to the master node via SSH and watched the syslog. For the first few hours everything looked OK, and then it seemed to stop at a 63% reduce step. I waited a few more hours, but nothing happened, so I terminated the job. I couldn't even find any errors in the logs.

So here are my questions:
1. Are there any proven best-practice cluster sizes and instance types (Standard, High-Memory, or High-CPU instances) that work well for big recommender jobs, or do I have to test this for every different job I use?
2. Would it have a positive effect if I split my big data_in.csv into many small CSVs?

Does anyone have experience with this and some hints?

Thanks in advance
Thomas





