Hi Sebastian,

Computing an input model for an item-based recommendation system is not my primary goal, although it would of course be a nice bonus. What I am trying to do is get an ordered list of the n most similar items for every item (could be songs, artists, playlists, users ... the usual stuff) based on collaborative data (play histories and the like) mixed with static aspect data (signal-based features).

To me, Taste seems to be the easiest way to get this ordered list, and the results on the small test sets are really nice. Maybe there is a way to reach the same result with some clustering technique, but I haven't tested that yet.
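
Just to make it concrete, this is roughly the plain (non-distributed) Taste code I mean - a minimal sketch, the file name and the item id are only placeholders:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    // one "userID,itemID,preference" triple per line; "data.csv" is just a placeholder
    DataModel model = new FileDataModel(new File("data.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // ordered list of the 10 most similar items to item 42
    List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}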

Of course I could compute all item-item similarities on the sparse data and then, in a second step, use the dense data as a filter on the similarity lists. But I thought it would be cool to drop all the data into one model and see whether the results get better. This is something I have wanted to try for quite some time, and I was hoping it would work on the Hadoop cluster.
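
The two-step variant would look something like this, with a Rescorer over the similarity lists - again just a sketch, and passesStaticFilter is a made-up placeholder for whatever test I would run against the dense data:

import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Rescorer;
import org.apache.mahout.common.LongPair;

public class DenseDataFilterSketch {

  // made-up placeholder: would look up the dense/static data for the two items
  static boolean passesStaticFilter(long itemA, long itemB) {
    return true;
  }

  static List<RecommendedItem> filteredSimilarItems(
      GenericItemBasedRecommender recommender, long itemID, int howMany)
      throws TasteException {
    Rescorer<LongPair> denseFilter = new Rescorer<LongPair>() {
      @Override
      public double rescore(LongPair pair, double originalScore) {
        return originalScore; // keep the collaborative score unchanged
      }
      @Override
      public boolean isFiltered(LongPair pair) {
        // the pair holds the query item and a candidate item
        return !passesStaticFilter(pair.getFirst(), pair.getSecond());
      }
    };
    return recommender.mostSimilarItems(itemID, howMany, denseFilter);
  }
}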

Does anyone have other ideas or techniques I could try for calculating item similarity based on collaborative data?

Thanks in advance
Thomas




Hi Thomas,

I'd say the long running time now comes from the items from dataset two that have lots of user preferences. It seems your data is too dense to compare all pairs of users with ItemSimilarityJob.

What exactly is the problem you're trying to solve with computing similar users? Do you need that as input for the computation of recommendations? Maybe we'll find another approach for you on this list.

--sebastian

On 14.04.2011 16:59, Thomas Rewig wrote:
Hi Sebastian,

my data model contains 17733658 data points; there are 230116 unique users (U(Ix)) and 208760 unique items (I(Ux)). The data is in some way both dense and sparse, because I am testing merging 2 datasets and inverting the result so that I can use the ItemSimilarityJob:

e.g.:

I = Item
U = User

Dataset1 (the sparse one):
   I1 I2 I3 I4
U1  9        8
U2  7     4
U3     8     5
U5  5     9

Dataset2 (the dense one, but has much less Items than Dataset1):

  I5 I6
U1 1  2
U2 3  2
U3 2
U4 5  3
U5 1  1

Inverting Dataset(1+2) so users are items and vice versa:

     I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9     7                 5
U(I2)             8
U(I3)       4                 9
U(I4) 8           5
U(I5) 1     3     2     5     1
U(I6) 2     2           3     1
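
The inversion itself is nothing special, I basically just swap the first two columns of the merged csv before it goes into the ItemSimilarityJob - roughly like this (file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class InvertPreferences {
  public static void main(String[] args) throws Exception {
    // input lines:  userID,itemID,preference
    // output lines: itemID,userID,preference
    BufferedReader in = new BufferedReader(new FileReader("merged_prefs.csv"));
    PrintWriter out = new PrintWriter(new FileWriter("inverted_prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] tokens = line.split(",");
      out.println(tokens[1] + "," + tokens[0] + "," + tokens[2]);
    }
    in.close();
    out.close();
  }
}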

So yes, you're right: because of this inversion I have users with lots of preferences (nearly as many as there are users in Dataset2), and I can understand why the system seems to stop.

Maybe inverting the data isn't a good way to do this and I will have to write my own UserUserSimilarityJob (at the moment I have no idea how to do that, because I just started with Hadoop and MapReduce, but I can try ;-) ).

Do you have any other hints I could try?


Can you say how many data points your data contains and how dense it is? 200MB doesn't seem like that much; it shouldn't take hours with 8 m1.large instances.

Can you give us the values of the following counters?

MaybePruneRowsMapper: Elements.USED
MaybePruneRowsMapper: Elements.NEGLECTED

CooccurrencesMapper: Counter.COOCCURRENCES

I'm not sure whether I can find the data you want in the logs, but maybe this log sample helps:

MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670

I can't find Counter.COOCCURRENCES.


INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%


By the way, I see that you're located in Berlin. I have some free time in the next 2 weeks; if you want, we could meet for a coffee and you'll get some free consultation!

It would be really great to meet you, but only our head office is in Berlin. I am in Dresden, and although that is not far away, it does not look like I can get to Berlin. Maybe it will work out later when I visit the head office. I am sure you could explain a lot to me.



Thanks in advance
Thomas








On 14.04.2011 12:18, Thomas Rewig wrote:
 Hello
Right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?

In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.small --master-instance-type m1.small --num-instances 3 --name mahout-0.5-itemSimJob-TEST


ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId

Here everything worked well, even if it took a few minutes.

In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.large --master-instance-type m1.large --num-instances 8 --name mahout-0.5-itemSimJob-TEST2

I logged in to the master node with ssh and watched the syslog. For the first few hours everything looked OK, then the job seemed to stall at 63% of a reduce step. I waited a few more hours, but nothing happened, so I terminated the job. I couldn't even find any errors in the logs.

So here are my questions:
1. Are there any proven best-practice cluster sizes and instance types (Standard, High-Memory, or High-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small csv files?

Does anyone have experience with this and some hints?

Thanks in advance
Thomas








