I debugged the reducer and to my surprise there were a number of
records that had more than db.max.outlinks.per.page outlinks! The
reducer chokes on http://www.toplist.cz/, which has only 115 outlinks,
while I've seen records with over 500 outlinks pass without problems!
It seems the OOM is triggered somewhere in the following code, because
the debug log line after this block is never printed:
337     while (values.hasNext()) {
338       Writable value = values.next().get();
339
340       if (value instanceof LinkDatum) {
341         // loop through, change out most recent timestamp if needed
342         LinkDatum next = (LinkDatum)value;
343         long timestamp = next.getTimestamp();
344         if (mostRecent == 0L || mostRecent < timestamp) {
345           mostRecent = timestamp;
346         }
347         outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
348         reporter.incrCounter("WebGraph.outlinks", "added links", 1);
349       }
350       else if (value instanceof BooleanWritable) {
351         BooleanWritable delete = (BooleanWritable)value;
352         // Actually, delete is always true, otherwise we don't emit it in the mapper in the first place
353         if (delete.get() == true) {
354           // This page is gone, do not emit it's outlinks
355           reporter.incrCounter("WebGraph.outlinks", "removed links", 1);
356           return;
357         }
358       }
359     }
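To pinpoint which key blows up before the task dies, something like the
sketch below could temporarily be dropped into that loop. This is a rough
sketch only; getUrl() on LinkDatum, the key variable and the LOG field are
assumptions on my side, not an existing Nutch change:

// Rough diagnostic sketch, not current Nutch code: count the values and the
// URL characters seen per key, so the task log points at the record that
// blows up before the heap fills.
int valueCount = 0;
long urlChars = 0;
while (values.hasNext()) {
  Writable value = values.next().get();
  valueCount++;
  if (value instanceof LinkDatum) {
    urlChars += ((LinkDatum)value).getUrl().length();
  }
  if (valueCount % 1000 == 0) {
    LOG.info("key " + key + ": " + valueCount + " values, " + urlChars
        + " url chars so far");
  }
  // ... existing LinkDatum / BooleanWritable handling as above ...
}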
According to the stack trace, cloning a specific LinkDatum causes
something to consume a lot of memory:
2012-04-11 16:02:11,530 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:760)
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.encode(Text.java:369)
        at org.apache.hadoop.io.Text.writeString(Text.java:409)
        at org.apache.nutch.scoring.webgraph.LinkDatum.write(LinkDatum.java:126)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.util.ReflectionUtils.copy(ReflectionUtils.java:274)
        at org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:209)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:347)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:111)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
This is the line where the object is cloned:
outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
In LinkDatum it's the writing of the URL field that is the last piece of
Nutch code in the trace:
Text.writeString(out, url);
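A quick way to convince myself that the allocation in that trace scales with
the length of the string being written would be a tiny standalone test like
the one below. Illustration only; it uses Text directly with a made-up
oversized URL, not the real failing record:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Text;

// Illustration only: Text.writeString() goes through Text.encode(), where the
// charset encoder allocates an output ByteBuffer proportional to the string
// length, so one huge url value asks for a very large buffer in a single call.
public class TextWriteCheck {
  public static void main(String[] args) throws IOException {
    StringBuilder url = new StringBuilder("http://www.example.com/?q="); // placeholder, not the real URL
    while (url.length() < 50000000) { // ~50M chars, deliberately absurd
      url.append('a');
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Text.writeString(new DataOutputStream(bytes), url.toString()); // same call as in LinkDatum.write()
    System.out.println("wrote " + bytes.size() + " bytes");
  }
}

If a record somehow ended up with a huge or corrupted URL string, a single
clone of it would be enough to trigger exactly this allocation.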
I'm puzzled here. Could the code leak something somewhere? We have had
this URL for a long time and it has happily passed through all jobs many
times before.
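Even a worst case back-of-envelope from the numbers in my earlier mail
below doesn't come close: 100 outlinks per record with URLs capped at
roughly 400 characters is about 100 * 400 * 2 bytes, or 80 KB of URL
character data per record, and even the 500+ outlink records I'm now
seeing would stay well under half a megabyte, nowhere near a heap of
hundreds of megabytes.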
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
<markus.jel...@openindex.io> wrote:
Hi,
Recently a reducer got killed because of this. Increasing the heap did
work, but the next job some days later failed as well. I looked at the
code and I cannot seem to find why it would take more than 400 MB of
RAM to process the outlinks of a single record. We do limit outlinks,
so the HashSets pages and domains are used. But we also limit the
number of outlinks per record in the parser to the default of 100, so
I would not expect the List and the two Sets in the reducer to use
that much.
Also, URLs longer than about 400 characters are discarded anyway.
Any thoughts to share?
Thanks,
Markus
--