Sure. I called the Mahout 0.5 job with these parameters:

$ jar mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -Dmapred.input.dir=/user/kjack/input/ \
    -Dmapred.output.dir=/user/kjack/output/ \
    --similarityClassname org.apache.mahout.math.hadoop.similarity.vector.DistributedUncenteredCosineVectorSimilarity \
    --booleanData
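(If I re-run it, I'll probably set the task counts and sort buffer explicitly rather than letting Hadoop pick. Something like the following; the -D values here are illustrative guesses I haven't tested, and mapred.map.tasks is only a hint to Hadoop, while mapred.reduce.tasks is honored:)

```shell
# Illustrative re-run with explicit parallelism and a larger map-side
# sort buffer. The numeric values are untested guesses, not recommendations.
hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -Dmapred.input.dir=/user/kjack/input/ \
    -Dmapred.output.dir=/user/kjack/output/ \
    -Dmapred.map.tasks=60 \
    -Dmapred.reduce.tasks=30 \
    -Dio.sort.mb=512 \
    --similarityClassname org.apache.mahout.math.hadoop.similarity.vector.DistributedUncenteredCosineVectorSimilarity \
    --booleanData
```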
The data set is quite sparse. ~3M items have been rated by at least 2 users, ~50K items by at least 10 users, and ~500 items by at least 50 users.

The job ran on a cluster with 15 nodes, 24GB RAM each. It's not that the job didn't finish, it's just that it ran very slowly. When I let Hadoop assign mappers and reducers automatically, only 9 mappers and 1 reducer were assigned. I might have to up these too.

I looked at the output from the MaybePruneRowsMapper and CooccurrencesMapper jobs and can't find any counters with those names, just the basic ones (e.g. FILE_BYTES_READ, HDFS_BYTES_READ, FILE_BYTES_WRITTEN, etc.). I'm looking at the counters from the job output through the jobtracker. Am I looking in the wrong place?

Thanks,
Kris

2011/4/11 Sebastian Schelter <s...@apache.org>

> Could you give us some more details? Like all the parameters you called the
> job with, some hints about the cluster you ran this on, and some hints
> about your data, especially how dense/sparse it is?
>
> My first suspicion would be that far too many items need to be compared
> with each other. Are you using Mahout 0.5? (You should be, because 0.4 has
> a bug in this job.) And can you give us the values of the following
> counters?
>
> - MaybePruneRowsMapper: Elements.USED and Elements.NEGLECTED
> - CooccurrencesMapper: Counter.COOCCURRENCES
>
> --sebastian
>
> On 11.04.2011 14:17, Kris Jack wrote:
>
>> Hi everyone,
>>
>> I'm running the RecommenderJob with a fairly large dataset (e.g. 175K+
>> users, 3M+ items, 40M+ preferences).
>>
>> When it gets to the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer
>> job, it takes a long time to run.
>>
>> I think that the main problem is that there's a lot of spilling going on
>> when the record buffer is full.
>> Here's an extract from the logs:
>>
>> 2011-04-01 21:53:24,245 WARN org.apache.hadoop.conf.Configuration: /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a attempt to override final parameter: mapred.local.dir; Ignoring.
>> 2011-04-01 21:53:25,630 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
>> 2011-04-01 21:53:25,792 WARN org.apache.hadoop.conf.Configuration: /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a attempt to override final parameter: mapred.local.dir; Ignoring.
>> 2011-04-01 21:53:25,968 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
>> 2011-04-01 21:53:26,040 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
>> 2011-04-01 21:53:26,041 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 10492787; bufvoid = 99614720
>> 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
>> 2011-04-01 21:53:31,420 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
>> 2011-04-01 21:53:31,422 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 2011-04-01 21:53:31,424 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
>> 2011-04-01 21:53:32,747 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10492787; bufend = 21060166; bufvoid = 99614720
>> 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
>> 2011-04-01 21:53:40,004 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: bufstart = 21060166; bufend = 31021600; bufvoid = 99614720
>> 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
>> 2011-04-01 21:53:54,636 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
>> 2011-04-01 21:53:54,921 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31021600; bufend = 41605937; bufvoid = 99614720
>> 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
>> 2011-04-01 21:54:04,370 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: bufstart = 41605937; bufend = 52550080; bufvoid = 99614720
>> 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
>> 2011-04-01 21:54:17,717 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: bufstart = 52550080; bufend = 63431004; bufvoid = 99614720
>> 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680
>> 2011-04-01 21:54:24,260 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5
>>
>> .
>> .
>> .
>>
>> 2011-04-06 11:18:09,932 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42207
>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53614451; bufend = 64698505; bufvoid = 99614720
>> 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: kvstart = 97306; kvend = 31769; length = 327680
>> 2011-04-06 11:18:18,905 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42208
>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: bufstart = 64698505; bufend = 75588890; bufvoid = 99614720
>> 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: kvstart = 31769; kvend = 293913; length = 327680
>> 2011-04-06 11:18:32,523 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42209
>> 2011-04-06 11:18:32,683 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
>> 2011-04-06 11:18:35,894 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42210
>> 2011-04-06 11:19:11,530 INFO org.apache.hadoop.mapred.Merger: Merging 42211 sorted segments
>> 2011-04-06 11:19:11,595 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,605 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,621 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,636 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,646 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,656 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,667 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,685 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,686 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,704 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>> 2011-04-06 11:19:11,705 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42211
>> 2011-04-06 11:19:37,409 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42202
>> 2011-04-06 11:19:59,087 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42193
>> 2011-04-06 11:20:19,060 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42184
>> 2011-04-06 11:20:36,387 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42175
>> 2011-04-06 11:21:00,826 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42166
>> 2011-04-06 11:21:20,049 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42157
>> 2011-04-06 11:21:42,120 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42148
>> 2011-04-06 11:21:58,431 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42139
>> 2011-04-06 11:22:15,594 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 4213
>>
>> .
>> .
>> .
>>
>> 2011-04-09 00:43:01,590 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
>> 2011-04-09 01:20:59,688 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
>> 2011-04-09 01:57:23,938 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
>> 2011-04-09 02:26:39,086 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
>> 2011-04-09 05:08:55,609 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 37
>> 2011-04-09 09:59:05,399 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 28
>> 2011-04-09 14:59:58,730 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 19
>> 2011-04-09 19:28:18,921 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 24786384532 bytes
>> 2011-04-10 14:55:46,726 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201103291618_0156_m_000000_1 is done. And is in the process of commiting
>> 2011-04-10 14:55:46,763 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201103291618_0156_m_000000_1' done.
>>
>> This task took 10 days to run. I'm guessing that it should run faster?
>>
>> Any general advice for speeding this up?
>>
>> Should I increase the io.sort.mb parameter? It's currently set at 100;
>> should it be more like 1,000? I see that Mahout attempts to increase
>> this parameter for the last job in RecommenderJob, but it doesn't seem
>> to for the preceding jobs. Is that right?
>>
>> Thanks,
>>
>> Kris
>>
>> --
>> Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
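The buffer numbers in that log extract follow directly from io.sort.mb. A quick sketch of the arithmetic (assuming the stock Hadoop 0.20-era defaults io.sort.record.percent = 0.05 and io.sort.spill.percent = 0.8, with 16 bytes of accounting per record; the helper name is mine):

```python
def sort_buffers(io_sort_mb=100, record_percent=0.05, spill_percent=0.8):
    """Map-side sort buffer sizes as MapTask reports them in the log."""
    total = io_sort_mb * 1024 * 1024
    record_bytes = int(total * record_percent)
    kv_slots = record_bytes // 16        # record (kv-accounting) buffer slots
    data_bytes = total - record_bytes    # serialized key/value data buffer
    return (data_bytes, int(data_bytes * spill_percent),
            kv_slots, int(kv_slots * spill_percent))

data, data_soft, kv, kv_soft = sort_buffers()
# Matches the log: "data buffer = 79691776/99614720" and
# "record buffer = 262144/327680"
print(data_soft, data)   # 79691776 99614720
print(kv_soft, kv)       # 262144 327680

# Each spill flushes ~10.5 MB of data for 262144 records, i.e. ~40-byte
# records, so the record buffer fills long before the data buffer does.
print(round(10492787 / 262144))  # ~40 bytes per record
```

If that arithmetic is right, just raising io.sort.mb helps both buffers proportionally, but raising io.sort.record.percent as well (toward roughly 16/(16+40), i.e. ~0.29, for 40-byte records) would let each buffer fill at about the same time and cut the 42,000+ spills much further.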