Also, can you limit users to N preferences, say 50? I don't know the Mahout
job; is this possible in the job flow?

Lance
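A minimal sketch of what such a per-user cap could look like as a
pre-processing pass over a "userID,itemID,value" CSV, run before
ItemSimilarityJob ever sees the data. The file names and the cap of 50 are
placeholders for illustration, not existing Mahout options:

    import java.io.*;
    import java.util.*;

    // Keep at most MAX_PREFS preferences per user in a userID,itemID,value CSV.
    public class CapPreferencesPerUser {
      private static final int MAX_PREFS = 50;   // assumed cap

      public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("data_in.csv"));    // assumed input
        PrintWriter out = new PrintWriter(new FileWriter("data_capped.csv"));     // assumed output
        String line;
        while ((line = in.readLine()) != null) {
          int comma = line.indexOf(',');
          if (comma < 0) continue;               // skip blank/malformed lines
          String user = line.substring(0, comma);
          Integer seen = counts.get(user);
          int n = (seen == null) ? 0 : seen.intValue();
          if (n < MAX_PREFS) {                   // keep only the first MAX_PREFS rows per user
            out.println(line);
            counts.put(user, n + 1);
          }
        }
        in.close();
        out.close();
      }
    }

Keeping the first 50 rows seen per user is arbitrary; sampling, or keeping
only the highest-valued preferences per user, would work just as well.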
On 4/14/11, Ted Dunning <[email protected]> wrote:
> Thomas,
>
> You need to avoid operations that cause dense matrices. If you can
> possibly do that, you need to.
>
> One way is to rethink what you mean by similarity.
>
> On Thu, Apr 14, 2011 at 2:43 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Thomas,
>>
>> I'd say the long running time now comes from the items in dataset two
>> that have lots of user preferences. It seems like your data is too dense
>> to compare all pairs of users with ItemSimilarityJob.
>>
>> What exactly is the problem you're trying to solve by computing similar
>> users? Do you need that as input for the computation of recommendations?
>> Maybe we'll find another approach for you on this list.
>>
>> --sebastian
>>
>> On 14.04.2011 16:59, Thomas Rewig wrote:
>>
>>> Hi Sebastian,
>>>
>>> my data model contains 17733658 data points, with 230116 unique users
>>> (U(Ix)) and 208760 unique items (I(Ux)).
>>> The data points are in some way both dense and sparse, because I am
>>> trying to merge two datasets and invert the result, so that I can use
>>> the ItemSimilarityJob:
>>>
>>> e.g.:
>>>
>>> I = Item
>>> U = User
>>>
>>> Dataset1 (the sparse one):
>>>
>>>       I1  I2  I3  I4
>>>  U1    9           8
>>>  U2    7       4
>>>  U3        8       5
>>>  U5    5       9
>>>
>>> Dataset2 (the dense one, but with far fewer items than Dataset1):
>>>
>>>       I5  I6
>>>  U1    1   2
>>>  U2    3   2
>>>  U3    2
>>>  U4    5   3
>>>  U5    1   1
>>>
>>> Inverted Dataset (1+2), so users are items and vice versa:
>>>
>>>         I(U1)  I(U2)  I(U3)  I(U4)  I(U5)
>>>  U(I1)    9      7                    5
>>>  U(I2)                  8
>>>  U(I3)           4                    9
>>>  U(I4)    8             5
>>>  U(I5)    1      3      2      5      1
>>>  U(I6)    2      2             3      1
>>>
>>> So yes, you're right: because of this inversion I have users with lots
>>> of preferences (nearly the number of users in Dataset2), and I can
>>> understand why the system seems to stop.
>>>
>>> Maybe inverting the data isn't a good way to do this and I have to
>>> write my own UserUserSimilarityJob. (OK, at the moment I have no idea
>>> how to do that, because I just started with Hadoop and MapReduce, but I
>>> can try ;-) )
>>>
>>> Do you have some other hints I can try?
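A minimal sketch of what the merge-and-invert step above amounts to on the
"userID,itemID,value" triples that ItemSimilarityJob reads: swap the first
two columns of the merged file, so that the "items" the job compares are the
original users. The file names are placeholders:

    import java.io.*;

    // Swap the user and item columns of a userID,itemID,value CSV so that
    // ItemSimilarityJob run on the output compares the original users.
    public class InvertUserItemCsv {
      public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("merged_datasets.csv"));  // assumed input
        PrintWriter out = new PrintWriter(new FileWriter("data_inverted.csv"));         // assumed output
        String line;
        while ((line = in.readLine()) != null) {
          String[] f = line.split(",");
          if (f.length < 3) continue;                   // skip blank/malformed lines
          out.println(f[1] + "," + f[0] + "," + f[2]);  // itemID,userID,value
        }
        in.close();
        out.close();
      }
    }

Producing the inverted file is the cheap part; the expensive part is what the
inversion does to row density, since every Dataset2 item becomes a row with
almost as many entries as there are users.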
>>>
>>>> Can you say how many data points your data contains and how dense
>>>> they are? 200MB doesn't seem that much; it shouldn't take hours with
>>>> 8 m1.large instances.
>>>>
>>>> Can you give us the values of the following counters?
>>>>
>>>> MaybePruneRowsMapper: Elements.USED
>>>> MaybePruneRowsMapper: Elements.NEGLECTED
>>>>
>>>> CooccurrencesMapper: Counter.COOCCURRENCES
>>>
>>> I'm not sure if I can find the data you want in the logs, but maybe
>>> this log sample helps:
>>>
>>> MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
>>> MaybePruneRowsMapper: Elements.USED = 6821670
>>>
>>> I can't find Counter.COOCCURRENCES.
>>>
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
>>> INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
>>> INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
>>> INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
>>> INFO org.apache.hadoop.mapred.JobClient (main):   NEGLECTED=8798627
>>> INFO org.apache.hadoop.mapred.JobClient (main):   USED=6821670
>>> INFO org.apache.hadoop.mapred.JobClient (main): Job Counters
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Launched reduce tasks=24
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Rack-local map tasks=3
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Launched map tasks=24
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Data-local map tasks=21
>>> INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters
>>> INFO org.apache.hadoop.mapred.JobClient (main):   FILE_BYTES_READ=45452916
>>> INFO org.apache.hadoop.mapred.JobClient (main):   HDFS_BYTES_READ=120672701
>>> INFO org.apache.hadoop.mapred.JobClient (main):   FILE_BYTES_WRITTEN=106567950
>>> INFO org.apache.hadoop.mapred.JobClient (main):   HDFS_BYTES_WRITTEN=51234800
>>> INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce input groups=208760
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Combine output records=0
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map input records=230201
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce shuffle bytes=60461985
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce output records=208760
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Spilled Records=13643340
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map output bytes=136433400
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Combine input records=0
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map output records=6821670
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce input records=6821670
>>> INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
>>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
>>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
>>> INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
>>> INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
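A rough back-of-the-envelope that makes the "too dense" concern above
concrete: the counters say about 6.8 million elements survive pruning, but
pairwise similarity work grows with the square of the entries in a row, not
with their total number. If one of the inverted Dataset2 rows still holds on
the order of 200000 user entries, and the job ends up enumerating the
co-occurring pairs within that row, that single row alone yields roughly
200000 * 199999 / 2, i.e. about 2 * 10^10 pairs, dwarfing the 6.8 million
used elements and looking very much like a reduce step that has stopped
making progress. The 200000 is only a guess; the quadratic growth is the
point.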
>>>> By the way, I see that you're located in Berlin. I have some free time
>>>> in the next two weeks; if you want, we could meet for a coffee and
>>>> you'll get some free consultation!
>>>>
>>> It would be really great to meet you, but only our head office is in
>>> Berlin. I am in Dresden, and although that is not far away, it does not
>>> look like I can come to Berlin. Maybe it will work out later when I
>>> visit the headquarters. I am sure you could explain a lot to me.
>>>
>>> Thanks in advance
>>> Thomas
>>>
>>>> On 14.04.2011 12:18, Thomas Rewig wrote:
>>>>
>>>>> Hello,
>>>>> right now I'm testing Mahout (Taste) jobs on AWS EMR.
>>>>> I wonder if anyone has any experience with the best cluster size and
>>>>> the best EC2 instances. Are there any best practices for Mahout
>>>>> (Taste) jobs?
>>>>>
>>>>> In my first test I used a small 22 MB user-item model and computed an
>>>>> ItemSimilarityJob with 3 small EC2 instances:
>>>>>
>>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.small
>>>>>   --master-instance-type m1.small --num-instances 3 --name
>>>>>   mahout-0.5-itemSimJob-TEST
>>>>>
>>>>> ruby elastic-mapreduce
>>>>>   --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar
>>>>>   --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>>>   --arg -i --arg s3://some-uri/input/data_small_in.csv
>>>>>   --arg -o --arg s3://some-uri/output/data_out_small.csv
>>>>>   --arg -s --arg SIMILARITY_LOGLIKELIHOOD
>>>>>   --arg -m --arg 500
>>>>>   --arg -mo --arg 500
>>>>>   -j JobId
>>>>>
>>>>> Here everything worked well, even if it took a few minutes.
>>>>>
>>>>> In a second test I used a bigger 200 MB user-item model and did the
>>>>> same with a cluster of large instances:
>>>>>
>>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.large
>>>>>   --master-instance-type m1.large --num-instances 8 --name
>>>>>   mahout-0.5-itemSimJob-TEST2
>>>>>
>>>>> I logged in to the master node with SSH and watched the syslog. For
>>>>> the first few hours everything looked OK, and then it seemed to stop
>>>>> at a 63% reduce step. I waited a few more hours, but nothing happened,
>>>>> so I terminated the job. I couldn't find any errors in the logs
>>>>> either.
>>>>>
>>>>> So here are my questions:
>>>>> 1. Are there any proven best-practice cluster sizes and instance types
>>>>> (Standard, High-Memory or High-CPU instances) that work well for big
>>>>> recommender jobs, or do I have to test this for every different job I
>>>>> run?
>>>>> 2. Would it have a positive effect if I split my big data_in.csv into
>>>>> many small CSVs?
>>>>>
>>>>> Does anyone have experience with this and have some hints?
>>>>>
>>>>> Thanks in advance
>>>>> Thomas

--
Lance Norskog
[email protected]
