Also, can you limit users to N preferences, say 50? I don't know the Mahout
job; is this possible in the job flow?

Lance
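A minimal sketch of what such a per-user cap could look like as a
pre-processing pass over a "userID,itemID,value" CSV, run before
ItemSimilarityJob ever sees the data. The file names and the cap of 50 are
placeholders for illustration, not existing Mahout options:

    import java.io.*;
    import java.util.*;

    // Keep at most MAX_PREFS preferences per user in a userID,itemID,value CSV.
    public class CapPreferencesPerUser {
      private static final int MAX_PREFS = 50;   // assumed cap

      public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("data_in.csv"));    // assumed input
        PrintWriter out = new PrintWriter(new FileWriter("data_capped.csv"));     // assumed output
        String line;
        while ((line = in.readLine()) != null) {
          int comma = line.indexOf(',');
          if (comma < 0) continue;               // skip blank/malformed lines
          String user = line.substring(0, comma);
          Integer seen = counts.get(user);
          int n = (seen == null) ? 0 : seen.intValue();
          if (n < MAX_PREFS) {                   // keep only the first MAX_PREFS rows per user
            out.println(line);
            counts.put(user, n + 1);
          }
        }
        in.close();
        out.close();
      }
    }

Keeping the first 50 rows seen per user is arbitrary; sampling, or keeping
only the highest-valued preferences per user, would work just as well.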
On 4/14/11, Ted Dunning <[email protected]> wrote:
> Thomas,
>
> You need to avoid operations that cause dense matrices. If you can
> possibly do that, you need to.
>
> One way is to rethink what you mean by similarity.
>
> On Thu, Apr 14, 2011 at 2:43 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Thomas,
>>
>> I'd say the long running time now comes from the items in dataset two
>> that have lots of user preferences. It seems like your data is too dense
>> to compare all pairs of users with ItemSimilarityJob.
>>
>> What exactly is the problem you're trying to solve by computing similar
>> users? Do you need that as input for the computation of recommendations?
>> Maybe we'll find another approach for you on this list.
>>
>> --sebastian
>>
>> On 14.04.2011 16:59, Thomas Rewig wrote:
>>
>>> Hi Sebastian,
>>>
>>> my data model contains 17733658 data points, with 230116 unique users
>>> (U(Ix)) and 208760 unique items (I(Ux)).
>>> The data points are in some way both dense and sparse, because I am
>>> trying to merge two datasets and invert the result, so that I can use
>>> the ItemSimilarityJob:
>>>
>>> e.g.:
>>>
>>> I = Item
>>> U = User
>>>
>>> Dataset1 (the sparse one):
>>>
>>>       I1  I2  I3  I4
>>>  U1    9           8
>>>  U2    7       4
>>>  U3        8       5
>>>  U5    5       9
>>>
>>> Dataset2 (the dense one, but with far fewer items than Dataset1):
>>>
>>>       I5  I6
>>>  U1    1   2
>>>  U2    3   2
>>>  U3    2
>>>  U4    5   3
>>>  U5    1   1
>>>
>>> Inverted Dataset (1+2), so users are items and vice versa:
>>>
>>>         I(U1)  I(U2)  I(U3)  I(U4)  I(U5)
>>>  U(I1)    9      7                    5
>>>  U(I2)                  8
>>>  U(I3)           4                    9
>>>  U(I4)    8             5
>>>  U(I5)    1      3      2      5      1
>>>  U(I6)    2      2             3      1
>>>
>>> So yes, you're right: because of this inversion I have users with lots
>>> of preferences (nearly the number of users in Dataset2), and I can
>>> understand why the system seems to stop.
>>>
>>> Maybe inverting the data isn't a good way to do this and I have to
>>> write my own UserUserSimilarityJob. (OK, at the moment I have no idea
>>> how to do that, because I just started with Hadoop and MapReduce, but I
>>> can try ;-) )
>>>
>>> Do you have some other hints I can try?
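A minimal sketch of what the merge-and-invert step above amounts to on the
"userID,itemID,value" triples that ItemSimilarityJob reads: swap the first
two columns of the merged file, so that the "items" the job compares are the
original users. The file names are placeholders:

    import java.io.*;

    // Swap the user and item columns of a userID,itemID,value CSV so that
    // ItemSimilarityJob run on the output compares the original users.
    public class InvertUserItemCsv {
      public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("merged_datasets.csv"));  // assumed input
        PrintWriter out = new PrintWriter(new FileWriter("data_inverted.csv"));         // assumed output
        String line;
        while ((line = in.readLine()) != null) {
          String[] f = line.split(",");
          if (f.length < 3) continue;                   // skip blank/malformed lines
          out.println(f[1] + "," + f[0] + "," + f[2]);  // itemID,userID,value
        }
        in.close();
        out.close();
      }
    }

Producing the inverted file is the cheap part; the expensive part is what the
inversion does to row density, since every Dataset2 item becomes a row with
almost as many entries as there are users.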
>>>
>>>> Can you say how many data points your data contains and how dense
>>>> they are? 200MB doesn't seem that much; it shouldn't take hours with
>>>> 8 m1.large instances.
>>>>
>>>> Can you give us the values of the following counters?
>>>>
>>>> MaybePruneRowsMapper: Elements.USED
>>>> MaybePruneRowsMapper: Elements.NEGLECTED
>>>>
>>>> CooccurrencesMapper: Counter.COOCCURRENCES
>>>
>>> I'm not sure if I can find the data you want in the logs, but maybe
>>> this log sample helps:
>>>
>>> MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
>>> MaybePruneRowsMapper: Elements.USED = 6821670
>>>
>>> I can't find Counter.COOCCURRENCES.
>>>
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
>>> INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
>>> INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
>>> INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
>>> INFO org.apache.hadoop.mapred.JobClient (main):   NEGLECTED=8798627
>>> INFO org.apache.hadoop.mapred.JobClient (main):   USED=6821670
>>> INFO org.apache.hadoop.mapred.JobClient (main): Job Counters
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Launched reduce tasks=24
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Rack-local map tasks=3
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Launched map tasks=24
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Data-local map tasks=21
>>> INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters
>>> INFO org.apache.hadoop.mapred.JobClient (main):   FILE_BYTES_READ=45452916
>>> INFO org.apache.hadoop.mapred.JobClient (main):   HDFS_BYTES_READ=120672701
>>> INFO org.apache.hadoop.mapred.JobClient (main):   FILE_BYTES_WRITTEN=106567950
>>> INFO org.apache.hadoop.mapred.JobClient (main):   HDFS_BYTES_WRITTEN=51234800
>>> INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce input groups=208760
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Combine output records=0
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map input records=230201
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce shuffle bytes=60461985
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce output records=208760
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Spilled Records=13643340
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map output bytes=136433400
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Combine input records=0
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Map output records=6821670
>>> INFO org.apache.hadoop.mapred.JobClient (main):   Reduce input records=6821670
>>> INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
>>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
>>> INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
>>> INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
>>> INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
>>> INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
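A rough back-of-the-envelope that makes the "too dense" concern above
concrete: the counters say about 6.8 million elements survive pruning, but
pairwise similarity work grows with the square of the entries in a row, not
with their total number. If one of the inverted Dataset2 rows still holds on
the order of 200000 user entries, and the job ends up enumerating the
co-occurring pairs within that row, that single row alone yields roughly
200000 * 199999 / 2, i.e. about 2 * 10^10 pairs, dwarfing the 6.8 million
used elements and looking very much like a reduce step that has stopped
making progress. The 200000 is only a guess; the quadratic growth is the
point.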
>>>> By the way, I see that you're located in Berlin. I have some free time
>>>> in the next two weeks; if you want, we could meet for a coffee and
>>>> you'll get some free consultation!
>>>>
>>> It would be really great to meet you, but only our head office is in
>>> Berlin. I am in Dresden, and although that is not far away, it does not
>>> look like I can come to Berlin. Maybe it will work out later when I
>>> visit the headquarters. I am sure you could explain a lot to me.
>>>
>>> Thanks in advance
>>> Thomas
>>>
>>>> On 14.04.2011 12:18, Thomas Rewig wrote:
>>>>
>>>>> Hello,
>>>>> right now I'm testing Mahout (Taste) jobs on AWS EMR.
>>>>> I wonder if anyone has any experience with the best cluster size and
>>>>> the best EC2 instances. Are there any best practices for Mahout
>>>>> (Taste) jobs?
>>>>>
>>>>> In my first test I used a small 22 MB user-item model and computed an
>>>>> ItemSimilarityJob with 3 small EC2 instances:
>>>>>
>>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.small
>>>>>   --master-instance-type m1.small --num-instances 3 --name
>>>>>   mahout-0.5-itemSimJob-TEST
>>>>>
>>>>> ruby elastic-mapreduce
>>>>>   --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar
>>>>>   --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>>>   --arg -i --arg s3://some-uri/input/data_small_in.csv
>>>>>   --arg -o --arg s3://some-uri/output/data_out_small.csv
>>>>>   --arg -s --arg SIMILARITY_LOGLIKELIHOOD
>>>>>   --arg -m --arg 500
>>>>>   --arg -mo --arg 500
>>>>>   -j JobId
>>>>>
>>>>> Here everything worked well, even if it took a few minutes.
>>>>>
>>>>> In a second test I used a bigger 200 MB user-item model and did the
>>>>> same with a cluster of large instances:
>>>>>
>>>>> ruby elastic-mapreduce --create --alive --slave-instance-type m1.large
>>>>>   --master-instance-type m1.large --num-instances 8 --name
>>>>>   mahout-0.5-itemSimJob-TEST2
>>>>>
>>>>> I logged in to the master node with SSH and watched the syslog. For
>>>>> the first few hours everything looked OK, and then it seemed to stop
>>>>> at a 63% reduce step. I waited a few more hours, but nothing happened,
>>>>> so I terminated the job. I couldn't find any errors in the logs
>>>>> either.
>>>>>
>>>>> So here are my questions:
>>>>> 1. Are there any proven best-practice cluster sizes and instance types
>>>>> (Standard, High-Memory or High-CPU instances) that work well for big
>>>>> recommender jobs, or do I have to test this for every different job I
>>>>> run?
>>>>> 2. Would it have a positive effect if I split my big data_in.csv into
>>>>> many small CSVs?
>>>>>
>>>>> Does anyone have experience with this and have some hints?
>>>>>
>>>>> Thanks in advance
>>>>> Thomas

--
Lance Norskog
[email protected]
