Hi everyone, I'm running the RecommenderJob with a fairly large dataset (e.g. 175K+ users, 3M+ items, 40M+ preferences).
When it gets to the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job, it takes a long time to run. I think that the main problem is that there's a lot of spilling going on when records are full. Here's extract from the logs: 2011-04-01 21:53:24,245 WARN org.apache.hadoop.conf.Configuration: /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a attempt to override final parameter: mapred.local.dir; Ignoring. 2011-04-01 21:53:25,630 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2011-04-01 21:53:25,792 WARN org.apache.hadoop.conf.Configuration: /srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a attempt to override final parameter: mapred.local.dir; Ignoring. 2011-04-01 21:53:25,968 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2011-04-01 21:53:26,040 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2011-04-01 21:53:26,041 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 10492787; bufvoid = 99614720 2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680 2011-04-01 21:53:31,420 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2011-04-01 21:53:31,422 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 2011-04-01 21:53:31,424 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor 2011-04-01 21:53:32,747 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10492787; bufend = 21060166; bufvoid = 99614720 2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680 2011-04-01 21:53:40,004 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: bufstart = 21060166; bufend = 31021600; bufvoid = 99614720 2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680 2011-04-01 21:53:54,636 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2 2011-04-01 21:53:54,921 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31021600; bufend = 41605937; bufvoid = 99614720 2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680 2011-04-01 21:54:04,370 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: bufstart = 41605937; bufend = 52550080; bufvoid = 99614720 2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680 2011-04-01 21:54:17,717 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: bufstart = 52550080; bufend = 63431004; bufvoid = 99614720 2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680 2011-04-01 21:54:24,260 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5 . . . 2011-04-06 11:18:09,932 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42207 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53614451; bufend = 64698505; bufvoid = 99614720 2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: kvstart = 97306; kvend = 31769; length = 327680 2011-04-06 11:18:18,905 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42208 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: bufstart = 64698505; bufend = 75588890; bufvoid = 99614720 2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: kvstart = 31769; kvend = 293913; length = 327680 2011-04-06 11:18:32,523 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42209 2011-04-06 11:18:32,683 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output 2011-04-06 11:18:35,894 INFO org.apache.hadoop.mapred.MapTask: Finished spill 42210 2011-04-06 11:19:11,530 INFO org.apache.hadoop.mapred.Merger: Merging 42211 sorted segments 2011-04-06 11:19:11,595 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,605 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,621 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,636 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,646 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,656 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,667 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,685 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,686 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,704 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor 2011-04-06 11:19:11,705 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42211 2011-04-06 11:19:37,409 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42202 2011-04-06 11:19:59,087 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42193 2011-04-06 11:20:19,060 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42184 2011-04-06 11:20:36,387 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42175 2011-04-06 11:21:00,826 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42166 2011-04-06 11:21:20,049 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42157 2011-04-06 11:21:42,120 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42148 2011-04-06 11:21:58,431 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 42139 2011-04-06 11:22:15,594 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 4213 . . . 2011-04-09 00:43:01,590 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73 2011-04-09 01:20:59,688 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64 2011-04-09 01:57:23,938 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55 2011-04-09 02:26:39,086 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46 2011-04-09 05:08:55,609 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 37 2011-04-09 09:59:05,399 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 28 2011-04-09 14:59:58,730 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 19 2011-04-09 19:28:18,921 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 24786384532 bytes 2011-04-10 14:55:46,726 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201103291618_0156_m_000000_1 is done. And is in the process of commiting 2011-04-10 14:55:46,763 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201103291618_0156_m_000000_1' done. This task took 10 days to run. I'm guessing that it should run faster? Any general advice for speeding this up? Should I increase the io.sort.mb parameter? It's currently set at 100, should it be more like 1,000? I see that mahout attempts to increase the parameter for the last job in RecommenderJob but it doesn't seem to for the preceding jobs. Is that right? Thanks, Kris