I debugged the reducer and, to my surprise, there were a number of records with more than db.max.outlinks.per.page outlinks! The reducer chokes on http://www.toplist.cz/, which has only 115 outlinks, while I've seen records with over 500 outlinks pass without trouble.

It seems the OOM is triggered somewhere in the following code, because the debug log line after this block is never printed:

337     while (values.hasNext()) {
338       Writable value = values.next().get();
339
340       if (value instanceof LinkDatum) {
341         // loop through, change out most recent timestamp if needed
342         LinkDatum next = (LinkDatum)value;
343         long timestamp = next.getTimestamp();
344         if (mostRecent == 0L || mostRecent < timestamp) {
345           mostRecent = timestamp;
346         }
347         outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
348         reporter.incrCounter("WebGraph.outlinks", "added links", 1);
349       }
350       else if (value instanceof BooleanWritable) {
351         BooleanWritable delete = (BooleanWritable)value;
352         // Actually, delete is always true, otherwise we don't emit it in the mapper in the first place
353         if (delete.get() == true) {
354           // This page is gone, do not emit it's outlinks
355           reporter.incrCounter("WebGraph.outlinks", "removed links", 1);
356           return;
357         }
358       }
359     }

According to the stack trace, cloning a specific LinkDatum causes something to consume a lot of memory:

2012-04-11 16:02:11,530 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:760)
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.encode(Text.java:369)
        at org.apache.hadoop.io.Text.writeString(Text.java:409)
        at org.apache.nutch.scoring.webgraph.LinkDatum.write(LinkDatum.java:126)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.util.ReflectionUtils.copy(ReflectionUtils.java:274)
        at org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:209)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:347)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:111)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

This is the line where the object is cloned (WebGraph.java:347 in the trace):
outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
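
For what it's worth, WritableUtils.clone() is essentially a serialize/deserialize round trip: it creates a new instance and copies the source into it through WritableSerialization, which is why LinkDatum.write() and Text.writeString() show up in the trace. Roughly, the call does something like this internally (a sketch based on the Hadoop classes in the trace, using org.apache.hadoop.util.ReflectionUtils, not code of ours):

    // Roughly what WritableUtils.clone(next, conf) does under the hood:
    // new instance + copy via serialization, so the whole LinkDatum is
    // encoded into a fresh buffer on every single clone.
    LinkDatum copy = ReflectionUtils.newInstance(LinkDatum.class, conf);
    ReflectionUtils.copy(conf, next, copy); // calls next.write(...) internally
    outlinkList.add(copy);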

In LinkDatum, it's the write of the URL field that is the last piece of Nutch code in the trace (LinkDatum.java:126):
Text.writeString(out, url);
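
That Text.writeString() call UTF-8-encodes the string through Text.encode(), which allocates a byte buffer proportional to the string length before anything is written, and that matches the ByteBuffer.allocate() frame at the top of the trace. A small standalone sketch of that allocation pattern (the string length here is made up, just to mimic a pathologically large value):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.Text;

    public class EncodeSizeCheck {
      public static void main(String[] args) throws Exception {
        // Build an artificially huge "URL"; Text.encode() allocates an
        // encode buffer proportional to its length before writing it out.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10 * 1000 * 1000; i++) {
          sb.append('a');
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Text.writeString(new DataOutputStream(bos), sb.toString());
        System.out.println("encoded bytes written: " + bos.size());
      }
    }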

I'm puzzled here. Could the code be leaking something somewhere? We have had this URL for a long time and it has happily passed all jobs many times before.
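
One way to narrow it down would be a temporary guard inside the LinkDatum branch of that loop, just before the clone, to see whether a single record carries an oversized url or anchor. Something along these lines (the threshold is arbitrary, and LOG/key are just the reducer's existing logger and key as I assume them here; this is only a sketch, not current code):

    // Hypothetical debug check, not in the original reducer: flag any
    // LinkDatum whose url or anchor is suspiciously large before cloning it.
    String u = next.getUrl();
    String a = next.getAnchor();
    if ((u != null && u.length() > 2000) || (a != null && a.length() > 2000)) {
      LOG.warn("Oversized LinkDatum for key " + key + ": url="
        + (u == null ? 0 : u.length()) + " chars, anchor="
        + (a == null ? 0 : a.length()) + " chars");
    }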


On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma <markus.jel...@openindex.io> wrote:
Hi,

Recently a reducer got killed because of this. Increasing the heap did
work, but the next job some days later also failed. I looked at the
code and I cannot see why it would take more than 400MB of RAM to
process the outlinks of a single record. We do limit outlinks, so the
HashSets pages and domains are used. But we also limit the number of
outlinks per record in the parser to the default of 100, so I would
not expect the List and the two Sets in the reducer to use that much.
Also, URLs longer than about 400 characters are discarded anyway.

Any thoughts to share?

Thanks,
Markus
