Hi everyone,

I'm running the RecommenderJob with a fairly large dataset (e.g. 175K+
users, 3M+ items, 40M+ preferences).

When it gets to the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer
job, it takes a long time to run.

I think that the main problem is that there's a lot of spilling going on
when records are full.  Here's extract from the logs:


2011-04-01 21:53:24,245 WARN org.apache.hadoop.conf.Configuration:
/srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a
attempt to override final parameter: mapred.local.dir;  Ignoring.
2011-04-01 21:53:25,630 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-04-01 21:53:25,792 WARN org.apache.hadoop.conf.Configuration:
/srv/hadoop/disk3/cache/mapred/taskTracker/jobcache/job_201103291618_0156/attempt_201103291618_0156_m_000000_1/job.xml:a
attempt to override final parameter: mapred.local.dir;  Ignoring.
2011-04-01 21:53:25,968 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2011-04-01 21:53:26,040 INFO org.apache.hadoop.mapred.MapTask: data
buffer = 79691776/99614720
2011-04-01 21:53:26,041 INFO org.apache.hadoop.mapred.MapTask: record
buffer = 262144/327680
2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 0; bufend = 10492787; bufvoid = 99614720
2011-04-01 21:53:26,757 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 0; kvend = 262144; length = 327680
2011-04-01 21:53:31,420 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2011-04-01 21:53:31,422 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
2011-04-01 21:53:31,424 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new compressor
2011-04-01 21:53:32,747 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 10492787; bufend = 21060166; bufvoid = 99614720
2011-04-01 21:53:33,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 262144; kvend = 196607; length = 327680
2011-04-01 21:53:40,004 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 21060166; bufend = 31021600; bufvoid = 99614720
2011-04-01 21:53:40,235 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 196607; kvend = 131070; length = 327680
2011-04-01 21:53:54,636 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2011-04-01 21:53:54,921 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 31021600; bufend = 41605937; bufvoid = 99614720
2011-04-01 21:53:54,922 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 131070; kvend = 65533; length = 327680
2011-04-01 21:54:04,370 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 41605937; bufend = 52550080; bufvoid = 99614720
2011-04-01 21:54:04,667 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 65533; kvend = 327677; length = 327680
2011-04-01 21:54:17,717 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 52550080; bufend = 63431004; bufvoid = 99614720
2011-04-01 21:54:17,997 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 327677; kvend = 262140; length = 327680
2011-04-01 21:54:24,260 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5

.


.

.


2011-04-06 11:18:09,932 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 42207
2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 53614451; bufend = 64698505; bufvoid = 99614720
2011-04-06 11:18:10,250 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 97306; kvend = 31769; length = 327680
2011-04-06 11:18:18,905 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 42208
2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask:
Spilling map output: record full = true
2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask:
bufstart = 64698505; bufend = 75588890; bufvoid = 99614720
2011-04-06 11:18:19,193 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 31769; kvend = 293913; length = 327680
2011-04-06 11:18:32,523 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 42209
2011-04-06 11:18:32,683 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2011-04-06 11:18:35,894 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 42210
2011-04-06 11:19:11,530 INFO org.apache.hadoop.mapred.Merger: Merging
42211 sorted segments
2011-04-06 11:19:11,595 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,605 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,621 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,636 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,646 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,656 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,667 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,685 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,686 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,704 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
2011-04-06 11:19:11,705 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42211
2011-04-06 11:19:37,409 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42202
2011-04-06 11:19:59,087 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42193
2011-04-06 11:20:19,060 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42184
2011-04-06 11:20:36,387 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42175
2011-04-06 11:21:00,826 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42166
2011-04-06 11:21:20,049 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42157
2011-04-06 11:21:42,120 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42148
2011-04-06 11:21:58,431 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 42139
2011-04-06 11:22:15,594 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 4213

.

.

.


2011-04-09 00:43:01,590 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 73
2011-04-09 01:20:59,688 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 64
2011-04-09 01:57:23,938 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 55
2011-04-09 02:26:39,086 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 46
2011-04-09 05:08:55,609 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 37
2011-04-09 09:59:05,399 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 28
2011-04-09 14:59:58,730 INFO org.apache.hadoop.mapred.Merger: Merging
10 intermediate segments out of a total of 19
2011-04-09 19:28:18,921 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 10 segments left of total size: 24786384532
bytes
2011-04-10 14:55:46,726 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201103291618_0156_m_000000_1 is done. And is in the
process of commiting
2011-04-10 14:55:46,763 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201103291618_0156_m_000000_1' done.


This task took 10 days to run.  I'm guessing that it should run faster?

Any general advice for speeding this up?

Should I increase the io.sort.mb parameter?  It's currently set at
100, should it be more like 1,000?  I see that mahout attempts to
increase the parameter for the last job in RecommenderJob but it
doesn't seem to for the preceding jobs.  Is that right?

Thanks,

Kris

Reply via email to