Thomas,

You need to avoid operations that cause dense matrices.  If you can possibly
do that, you need to.

One way is to rethink what you mean by similarity.
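
For example (just a sketch, nothing that exists in Mahout; the class name and
the cap value are made up), you can put a hard cap on how many preferences any
single row contributes before anything pairwise happens, so that one dense row
cannot generate millions of cooccurrence pairs on its own:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustration only: keep at most maxPrefs entries per row by uniform sampling. */
public final class RowDownsampler {

  private final int maxPrefs;
  private final Random random = new Random(42);

  public RowDownsampler(int maxPrefs) {
    this.maxPrefs = maxPrefs;
  }

  /** Returns at most maxPrefs item IDs; dense rows are randomly thinned. */
  public List<Long> sample(List<Long> itemIds) {
    if (itemIds.size() <= maxPrefs) {
      return itemIds;
    }
    List<Long> copy = new ArrayList<Long>(itemIds);
    Collections.shuffle(copy, random);  // uniform sample without replacement
    return copy.subList(0, maxPrefs);
  }
}

The point is to bound the pairwise work per row before it happens, rather than
trying to cope with the dense result afterwards.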

On Thu, Apr 14, 2011 at 2:43 PM, Sebastian Schelter <[email protected]> wrote:

> Hi Thomas,
>
> I'd say the long running time now comes from the items of dataset two that
> have lots of user preferences. It seems your data is too dense to compare
> all pairs of users with ItemSimilarityJob.
>
> What exactly is the problem you're trying to solve with computing similar
> users? Do you need that as input for the computation of recommendations?
> Maybe we'll find another approach for you on this list.
>
> --sebastian
>
> On 14.04.2011 16:59, Thomas Rewig wrote:
>
>>  Hi Sebastian
>>
>> In my data model there are 17733658 data points: 230116 unique users
>> (U(Ix)) and 208760 unique items (I(Ux)).
>> The data points are in some ways both dense and sparse, because I am testing
>> merging two datasets and inverting the result so I can use the ItemSimilarityJob:
>>
>> e.g.:
>>
>> I = Item
>> U = User
>>
>> Dataset1 (the sparse one):
>>  I1 I2 I3 I4
>> U1 9        8
>> U2 7     4
>> U3    8     5
>> U5 5     9
>>
>> Dataset2 (the dense one, but with far fewer items than Dataset1):
>>
>>  I5 I6
>> U1 1  2
>> U2 3  2
>> U3 2
>> U4 5  3
>> U5 1  1
>>
>> Invert Dataset (1+2) so users are items and vice versa:
>>
>>     I(U1) I(U2) I(U3) I(U4) I(U5)
>> U(I1) 9     7                 5
>> U(I2)             8
>> U(I3)       4                 9
>> U(I4) 8           5
>> U(I5) 1     3     2     5     1
>> U(I6) 2     2           3     1
>>
>> So yes, you're right: because of this inversion I have users with lots of
>> preferences (nearly the number of users in Dataset2), and I can understand
>> why the system seems to stop.
>>
>> Maybe inverting the data isn't a good way to do this, and I have to write my
>> own UserUserSimilarityJob. (At the moment I have no idea how to do this,
>> because I just started with Hadoop and MapReduce, but I can try ;-) ).
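>>
>> To make the idea concrete, something like the following could be what I'd
>> try (just a sketch, not tested; I'm assuming a userID,itemID,preference CSV
>> layout, and the class names are made up): a map-only job that swaps the
>> first two columns, so the output can be fed to the ItemSimilarityJob:
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>
>> /**
>>  * Map-only job that rewrites each "userID,itemID,preference" line as
>>  * "itemID,userID,preference", so that running ItemSimilarityJob on the
>>  * output effectively computes user-user similarities.
>>  */
>> public class TransposeCsvJob {
>>
>>   public static class SwapColumnsMapper
>>       extends Mapper<LongWritable, Text, NullWritable, Text> {
>>
>>     private final Text out = new Text();
>>
>>     @Override
>>     protected void map(LongWritable offset, Text line, Context ctx)
>>         throws IOException, InterruptedException {
>>       String[] f = line.toString().split(",");
>>       if (f.length < 3) {
>>         return; // skip malformed lines
>>       }
>>       out.set(f[1] + "," + f[0] + "," + f[2]); // swap user and item columns
>>       ctx.write(NullWritable.get(), out);
>>     }
>>   }
>>
>>   public static void main(String[] args) throws Exception {
>>     Job job = new Job(new Configuration(), "transpose-csv");
>>     job.setJarByClass(TransposeCsvJob.class);
>>     job.setMapperClass(SwapColumnsMapper.class);
>>     job.setNumReduceTasks(0); // map-only, lines go straight to the output
>>     job.setOutputKeyClass(NullWritable.class);
>>     job.setOutputValueClass(Text.class);
>>     FileInputFormat.addInputPath(job, new Path(args[0]));
>>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>   }
>> }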
>>
>> Do you have some other hints I can try?
>>
>>
>>> Can you say how many datapoints your data contains and how dense they
>>> are? 200MB doesn't seem like much; it shouldn't take hours with 8 m1.large
>>> instances.
>>>
>>> Can you give us the values of the following counters?
>>>
>>> MaybePruneRowsMapper: Elements.USED
>>> MaybePruneRowsMapper: Elements.NEGLECTED
>>>
>>> CooccurrencesMapper: Counter.COOCCURRENCES
>>>
>>
>> I'm not sure if I can find the data you want in the logs, but maybe this
>> log sample helps:
>>
>> MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
>> MaybePruneRowsMapper: Elements.USED = 6821670
>>
>> I can't find Counter.COOCCURRENCES.
>>
>>
>> INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
>> INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
>> INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
>> INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
>> INFO org.apache.hadoop.mapred.JobClient (main):   org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
>> INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
>> INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
>> INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
>> INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=24
>> INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=3
>> INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=24
>> INFO org.apache.hadoop.mapred.JobClient (main):     Data-local map tasks=21
>> INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
>> INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=45452916
>> INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_READ=120672701
>> INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=106567950
>> INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_WRITTEN=51234800
>> INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
>> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=208760
>> INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=0
>> INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=230201
>> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=60461985
>> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=208760
>> INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=13643340
>> INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=136433400
>> INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=0
>> INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=6821670
>> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=6821670
>> INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
>> INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
>> INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
>> INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
>> INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%
>>
>>
>>> By the way, I see that you're located in Berlin. I have some free time in
>>> the next two weeks; if you want, we could meet for a coffee and you'll get
>>> some free consultation!
>>>
>>
>> It would be really great to meet you, but only the head office is in
>> Berlin. I am in Dresden, and although that is not far away, it does not look
>> like I can come to Berlin. Maybe it will work out later when I visit the
>> headquarters. I am sure you could explain a lot to me.
>>
>>
>>
>> Thanks in advance
>> Thomas
>>
>>
>>
>>
>>
>>
>>
>>>
>>> On 14.04.2011 12:18, Thomas Rewig wrote:
>>>
>>>> Hello,
>>>> right now I'm testing Mahout (Taste) jobs on AWS EMR.
>>>> I wonder if anyone has experience with the best cluster size
>>>> and the best EC2 instance types. Are there any best practices for Mahout
>>>> (Taste) jobs?
>>>>
>>>> In my first test I used a small 22 MB user-item model and ran an
>>>> ItemSimilarityJob with 3 small EC2 instances:
>>>>
>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.small \
>>>>   --master-instance-type m1.small --num-instances 3 --name \
>>>>   mahout-0.5-itemSimJob-TEST
>>>>
>>>>
>>>> ruby elastic-mapreduce \
>>>>   --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
>>>>   --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
>>>>   --arg -i --arg s3://some-uri/input/data_small_in.csv \
>>>>   --arg -o --arg s3://some-uri/output/data_out_small.csv \
>>>>   --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
>>>>   --arg -m --arg 500 \
>>>>   --arg -mo --arg 500 \
>>>>   -j JobId
>>>>
>>>> Here everything worked well, even if it took a few minutes.
>>>>
>>>> In a second test I used a bigger 200 MB user-item model and did the same
>>>> with a cluster of large instances:
>>>>
>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.large \
>>>>   --master-instance-type m1.large --num-instances 8 --name \
>>>>   mahout-0.5-itemSimJob-TEST2
>>>>
>>>> I logged in to the master node with ssh and watched the syslog. For the
>>>> first few hours everything looked OK, and then it seemed to stop at a 63%
>>>> reduce step. I waited a few hours but nothing happened, so I terminated the
>>>> job. I couldn't even find any errors in the logs.
>>>>
>>>> So here are my questions:
>>>> 1. Are there any proven best-practice cluster sizes and instance types
>>>> (standard, high-memory, or high-CPU instances) that work well for big
>>>> recommender jobs, or do I have to test this for every different job I run?
>>>> 2. Would it have a positive effect if I split my big data_in.csv into
>>>> many small CSVs?
>>>>
>>>> Does anyone have experience with this and some hints?
>>>>
>>>> Thanks in advance
>>>> Thomas
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
