Hi Sebastian
My data model contains 17733658 data points; there are 230116 unique users (U(Ix)) and 208760 unique items (I(Ux)).
The data points are in some ways both dense and sparse, because I am trying to merge two datasets and invert the result so that I can use the ItemSimilarityJob (a small sketch of the inversion follows the tables below).
e.g.:
I = Item
U = User
Dataset1 (the sparse one):
I1 I2 I3 I4
U1 9 8
U2 7 4
U3 8 5
U5 5 9
Dataset2 (the dense one, but it has far fewer items than Dataset1):
I5 I6
U1 1 2
U2 3 2
U3 2
U4 5 3
U5 1 1
Invert Dataset(1+2) so that users become items and vice versa:
I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9 7 5
U(I2) 8
U(I3) 4 9
U(I4) 8 5
U(I5) 1 3 2 5 1
U(I6) 2 2 3 1
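To make the inversion concrete, it is roughly the following transformation; this is only a sketch and assumes the merged data is a "userID,itemID,preference" CSV (the file names are just examples):
# Swap the first two columns so that items become "users" and users become "items".
# data_merged.csv and data_inverted.csv are placeholder names.
awk -F',' 'BEGIN { OFS="," } { print $2, $1, $3 }' data_merged.csv > data_inverted.csv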
So yes, you are right: because of this inversion I have users with a lot of preferences (nearly as many as there are users in Dataset2), and I can understand why the system seems to stall.
Maybe inverting the data isn't a good approach for this purpose and I have to write my own UserUserSimilarityJob (at the moment I have no idea how to do this, because I just started with Hadoop and MapReduce, but I can try ;-) ).
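One thing I might try before writing my own job (only a rough, untested sketch): run Mahout's RowSimilarityJob directly on a matrix whose rows are users, so that it computes user-user similarities without the inversion. The options below are the ones that appear in the log further down (--numberOfColumns, --similarityClassname, --maxSimilaritiesPerRow); the input would have to be a SequenceFile of IntWritable/VectorWritable with one row vector per user, and the paths and the column count (number of items) are placeholders:
hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob \
  --input /path/to/userVectors \
  --output /path/to/userSimilarities \
  --numberOfColumns 208760 \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --maxSimilaritiesPerRow 500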
Do you have some other hints I can try?
Can you say how many data points your data contains and how dense they are? 200 MB doesn't seem like much; it shouldn't take hours with 8 m1.large instances.
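For reference, the numbers at the top can be checked roughly like this, assuming a "userID,itemID,preference" CSV (the file name is a placeholder):
wc -l data_in.csv                            # total number of data points
cut -d',' -f1 data_in.csv | sort -u | wc -l  # unique users
cut -d',' -f2 data_in.csv | sort -u | wc -l  # unique items
# density = data points / (users * items) = 17733658 / (230116 * 208760), roughly 0.04 %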
Can you give us the values of the following counters?
MaybePruneRowsMapper: Elements.USED
MaybePruneRowsMapper: Elements.NEGLECTED
CooccurrencesMapper: Counter.COOCCURRENCES
I'm not sure whether I can find the data you want in the logs, but maybe this log sample helps:
MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670
I can't find Counter.COOCCURRENCES.
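In case it is only hidden somewhere else, this is how I would search for it on the master node; the directory is just my assumption about where EMR keeps the step logs and may differ:
# the log path is an assumption about the EMR master node's layout
grep -R "COOCCURRENCES" /mnt/var/log/hadoop/steps/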
INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main): org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main): NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main): USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Job Counters
INFO org.apache.hadoop.mapred.JobClient (main): Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main): Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main): Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main): FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main): FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main): HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main): Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main): Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main): Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main): Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main): Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main): Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main): Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main): Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main): Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
By the way, I see that you're located in Berlin. I have some free time in the next two weeks; if you want, we could meet for a coffee and you'll get some free consultation!
It would be really great to meet you, but only the head office is in Berlin. I am in Dresden, and although that is not far away, it does not look like I can come to Berlin. Maybe it will work out later when I visit the headquarters. I am sure you could explain a lot to me.
Thanks in advance
Thomas
On 14.04.2011 12:18, Thomas Rewig wrote:
Hello
Right now I'm testing Mahout (Taste) jobs on AWS EMR.
I wonder if anyone has experience with the best cluster size and the best EC2 instance types. Are there any best practices for Mahout (Taste) jobs?
In my first test I used a small 22 MB user-item model and computed an ItemSimilarityJob with 3 small EC2 instances:
ruby elastic-mapreduce --create --alive \
  --slave-instance-type m1.small --master-instance-type m1.small \
  --num-instances 3 --name mahout-0.5-itemSimJob-TEST
ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  --arg -i --arg s3://some-uri/input/data_small_in.csv \
  --arg -o --arg s3://some-uri/output/data_out_small.csv \
  --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
  --arg -m --arg 500 \
  --arg -mo --arg 500 \
  -j JobId
Here everything worked well, even though it took a few minutes.
In a second test I used a bigger 200 MB user-item model and did the same with a cluster of large instances:
ruby elastic-mapreduce --create --alive \
  --slave-instance-type m1.large --master-instance-type m1.large \
  --num-instances 8 --name mahout-0.5-itemSimJob-TEST2
I logged in to the master node via SSH and looked at the syslog. For the first few hours everything looked OK, and then it seemed to stop at 63% of a reduce step. I waited a few more hours, but nothing happened, so I terminated the job. I couldn't even find any errors in the logs.
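For reference, this is roughly how one can check from the master node whether a reducer that looks stuck is still making progress (the job id is whatever -list prints):
hadoop job -list              # ids and state of the running jobs
hadoop job -status <job-id>   # per-job progress and counters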
So here my questions:
1. Are there any proven best-practice cluster sizes and instance types (Standard, High-Memory, or High-CPU instances) that work well for big recommender jobs, or do I have to test this for every job I run?
2. Would it have a positive effect if I split my big data_in.csv into many small CSVs?
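Just to make question 2 concrete, by splitting I mean something like the following; the chunk size and file names are only examples:
split -l 1000000 data_in.csv data_in_part_   # many CSV chunks of about 1M lines each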
Does anyone have experience with this and some hints?
Thanks in advance
Thomas