Hi Sebastian,

Computing an input model for an item-based recommendation system is not my primary goal, although it would of course be a nice bonus. What I am trying to do is get an ordered list of the n most similar items for every item (could be songs, artists, playlists, users ... the usual stuff) based on collaborative data (play histories and the like) mixed with static aspect data (signal-based features).

To me, Taste seems to be the easiest way to get this ordered list, and the results on the small test sets are really nice. Maybe there is a way to reach the same result with some clustering technique, but I haven't tested that yet.
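
Just to make it concrete, this is roughly the plain (non-distributed) Taste code I mean - a minimal sketch, the file name and the item id are only placeholders:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    // one "userID,itemID,preference" triple per line; "data.csv" is just a placeholder
    DataModel model = new FileDataModel(new File("data.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // ordered list of the 10 most similar items to item 42
    List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}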

Of course I could compute all item-item similarities on the sparse data and then, in a second step, use the dense data as a filter on the similarity lists. But I thought it would be cool to drop all the data into one model and see whether the results get better. This is something I have wanted to try for quite some time, and I was hoping it would work on the Hadoop cluster.
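
The two-step variant would look something like this, with a Rescorer over the similarity lists - again just a sketch, and passesStaticFilter is a made-up placeholder for whatever test I would run against the dense data:

import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Rescorer;
import org.apache.mahout.common.LongPair;

public class DenseDataFilterSketch {

  // made-up placeholder: would look up the dense/static data for the two items
  static boolean passesStaticFilter(long itemA, long itemB) {
    return true;
  }

  static List<RecommendedItem> filteredSimilarItems(
      GenericItemBasedRecommender recommender, long itemID, int howMany)
      throws TasteException {
    Rescorer<LongPair> denseFilter = new Rescorer<LongPair>() {
      @Override
      public double rescore(LongPair pair, double originalScore) {
        return originalScore; // keep the collaborative score unchanged
      }
      @Override
      public boolean isFiltered(LongPair pair) {
        // the pair holds the query item and a candidate item
        return !passesStaticFilter(pair.getFirst(), pair.getSecond());
      }
    };
    return recommender.mostSimilarItems(itemID, howMany, denseFilter);
  }
}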

Does anyone have other ideas or techniques I could try for calculating item similarity based on collaborative data?

Thanks in advance
Thomas




Hi Thomas,

I'd say the long running time now comes from the items from dataset two that have lots of user preferences. It seems your data is too dense to compare all pairs of users with ItemSimilarityJob.

What exactly is the problem you're trying to solve with computing similar users? Do you need that as input for the computation of recommendations? Maybe we'll find another approach for you on this list.

--sebastian

On 14.04.2011 16:59, Thomas Rewig wrote:
Hi Sebastian,

my data model contains 17733658 data points; there are 230116 unique users (U(Ix)) and 208760 unique items (I(Ux)). The data is in some way both dense and sparse, because I am testing merging 2 datasets and inverting the result so that I can use the ItemSimilarityJob:

e.g.:

I = Item
U = User

Dataset1 (the sparse one):
   I1 I2 I3 I4
U1  9        8
U2  7     4
U3     8     5
U5  5     9

Dataset2 (the dense one, but has much less Items than Dataset1):

  I5 I6
U1 1  2
U2 3  2
U3 2
U4 5  3
U5 1  1

Inverting Dataset(1+2) so users are items and vice versa:

     I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9     7                 5
U(I2)             8
U(I3)       4                 9
U(I4) 8           5
U(I5) 1     3     2     5     1
U(I6) 2     2           3     1
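
The inversion itself is nothing special, I basically just swap the first two columns of the merged csv before it goes into the ItemSimilarityJob - roughly like this (file names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class InvertPreferences {
  public static void main(String[] args) throws Exception {
    // input lines:  userID,itemID,preference
    // output lines: itemID,userID,preference
    BufferedReader in = new BufferedReader(new FileReader("merged_prefs.csv"));
    PrintWriter out = new PrintWriter(new FileWriter("inverted_prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] tokens = line.split(",");
      out.println(tokens[1] + "," + tokens[0] + "," + tokens[2]);
    }
    in.close();
    out.close();
  }
}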

So yes, you're right: because of this inversion I have users with lots of preferences (nearly as many as there are users in Dataset2), and I can understand why the system seems to stop.

Maybe inverting the data isn't a good way to do this and I will have to write my own UserUserSimilarityJob (at the moment I have no idea how to do that, because I just started with Hadoop and MapReduce, but I can try ;-) ).

Do you have any other hints I could try?


Can you say how many data points your data contains and how dense it is? 200MB doesn't seem like that much; it shouldn't take hours with 8 m1.large instances.

Can you give us the values of the following counters?

MaybePruneRowsMapper: Elements.USED
MaybePruneRowsMapper: Elements.NEGLECTED

CooccurrencesMapper: Counter.COOCCURRENCES

I'm not sure whether I can find the data you want in the logs, but maybe this log sample helps:

MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670

I can't find Counter.COOCCURRENCES.


INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%


By the way, I see that you're located in Berlin. I have some free time in the next 2 weeks; if you want, we could meet for a coffee and you'll get some free consultation!

It would be really great to meet you, but only our head office is in Berlin. I am in Dresden, and although that is not far away, it does not look like I can get to Berlin. Maybe it will work out later when I visit the head office. I am sure you could explain a lot to me.



Thanks in advance
Thomas








On 14.04.2011 12:18, Thomas Rewig wrote:
 Hello
Right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?

In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.small --master-instance-type m1.small --num-instances 3 --name mahout-0.5-itemSimJob-TEST


ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId

Here everything worked well, even if it took a few minutes.

In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:

ruby elastic-mapreduce --create --alive --slave-instance-type m1.large --master-instance-type m1.large --num-instances 8 --name mahout-0.5-itemSimJob-TEST2

I logged in to the master node with ssh and watched the syslog. For the first few hours everything looked OK, then the job seemed to stall at 63% of a reduce step. I waited a few more hours, but nothing happened, so I terminated the job. I couldn't even find any errors in the logs.

So here are my questions:
1. Are there any proven best-practice cluster sizes and instance types (Standard, High-Memory, or High-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small csv files?

Does anyone have experience with this and some hints?

Thanks in advance
Thomas








