Hi Sebastian
My data model contains 17733658 data points; there are 230116 unique users (U(Ix)) and 208760 unique items (I(Ux)).
The data points are in some ways both dense and sparse, because I am trying to merge two datasets and invert the result so that I can use the ItemSimilarityJob (a small sketch of the inversion follows the tables below).
e.g.:
I = Item
U = User
Dataset1 (the sparse one):
I1 I2 I3 I4
U1 9 8
U2 7 4
U3 8 5
U5 5 9
Dataset2 (the dense one, but it has far fewer items than Dataset1):
I5 I6
U1 1 2
U2 3 2
U3 2
U4 5 3
U5 1 1
Invert Dataset(1+2) so that users become items and vice versa:
I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9 7 5
U(I2) 8
U(I3) 4 9
U(I4) 8 5
U(I5) 1 3 2 5 1
U(I6) 2 2 3 1
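To make the inversion concrete, it is roughly the following transformation; this is only a sketch and assumes the merged data is a "userID,itemID,preference" CSV (the file names are just examples):
# Swap the first two columns so that items become "users" and users become "items".
# data_merged.csv and data_inverted.csv are placeholder names.
awk -F',' 'BEGIN { OFS="," } { print $2, $1, $3 }' data_merged.csv > data_inverted.csv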
So yes, you are right: because of this inversion I have users with a lot of preferences (nearly as many as there are users in Dataset2), and I can understand why the system seems to stall.
Maybe inverting the data isn't a good approach for this purpose and I have to write my own UserUserSimilarityJob (at the moment I have no idea how to do this, because I just started with Hadoop and MapReduce, but I can try ;-) ).
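One thing I might try before writing my own job (only a rough, untested sketch): run Mahout's RowSimilarityJob directly on a matrix whose rows are users, so that it computes user-user similarities without the inversion. The options below are the ones that appear in the log further down (--numberOfColumns, --similarityClassname, --maxSimilaritiesPerRow); the input would have to be a SequenceFile of IntWritable/VectorWritable with one row vector per user, and the paths and the column count (number of items) are placeholders:
hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob \
  --input /path/to/userVectors \
  --output /path/to/userSimilarities \
  --numberOfColumns 208760 \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --maxSimilaritiesPerRow 500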
Do you have some other hints I can try?
Can you say how many data points your data contains and how dense they are? 200 MB doesn't seem like much; it shouldn't take hours with 8 m1.large instances.
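For reference, the numbers at the top can be checked roughly like this, assuming a "userID,itemID,preference" CSV (the file name is a placeholder):
wc -l data_in.csv                            # total number of data points
cut -d',' -f1 data_in.csv | sort -u | wc -l  # unique users
cut -d',' -f2 data_in.csv | sort -u | wc -l  # unique items
# density = data points / (users * items) = 17733658 / (230116 * 208760), roughly 0.04 %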
Can you give us the values of the following counters?
MaybePruneRowsMapper: Elements.USED
MaybePruneRowsMapper: Elements.NEGLECTED
CooccurrencesMapper: Counter.COOCCURRENCES
I'm not sure whether I can find the data you want in the logs, but maybe this log sample helps:
MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670
I can't find Counter.COOCCURRENCES.
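In case it is only hidden somewhere else, this is how I would search for it on the master node; the directory is just my assumption about where EMR keeps the step logs and may differ:
# the log path is an assumption about the EMR master node's layout
grep -R "COOCCURRENCES" /mnt/var/log/hadoop/steps/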
INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main): NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main): USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Job Counters
INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
By the way, I see that you're located in Berlin. I have some free time in the next two weeks; if you want, we could meet for a coffee and you'll get some free consultation!
It would be really great to meet you, but only the head office is in Berlin. I am in Dresden, and although that is not far away, it does not look like I can come to Berlin. Maybe it will work out later when I visit the headquarters. I am sure you could explain a lot to me.
Thanks in advance
Thomas
On 14.04.2011 12:18, Thomas Rewig wrote:
Hello
Right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?
In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:
ruby elastic-mapreduce --create --alive \
  --slave-instance-type m1.small --master-instance-type m1.small \
  --num-instances 3 --name mahout-0.5-itemSimJob-TEST
ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId
Here everything worked well, even though it took a few minutes.
In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:
ruby elastic-mapreduce --create --alive \
  --slave-instance-type m1.large --master-instance-type m1.large \
  --num-instances 8 --name mahout-0.5-itemSimJob-TEST2
I logged in to the master node via SSH and looked at the syslog. For the first few hours everything looked OK, and then it seemed to stop at 63% of a reduce step. I waited a few more hours, but nothing happened, so I terminated the job. I couldn't even find any errors in the logs.
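For reference, this is roughly how one can check from the master node whether a reducer that looks stuck is still making progress (the job id is whatever -list prints):
hadoop job -list              # ids and state of the running jobs
hadoop job -status <job-id>   # per-job progress and counters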
So here my questions:
1. Are there any proven best-practice cluster sizes and instance types (Standard, High-Memory, or High-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small CSVs?
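Just to make question 2 concrete, by splitting I mean something like the following; the chunk size and file names are only examples:
split -l 1000000 data_in.csv data_in_part_   # many CSV chunks of about 1M lines each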
Does anyone have experience with this and some hints?
Thanks in advance
Thomas