I debugged the reducer and to my surprise there were a number of
records that had more than db.max.outlinks.per.page outlinks! The
reducer chokes on http://www.toplist.cz/, which has only 115 outlinks,
while I've seen records with over 500 outlinks pass without problems!
It seems the OOM is triggered somewhere in the following code, because
the debug log line after this block is never printed:
337     while (values.hasNext()) {
338       Writable value = values.next().get();
339
340       if (value instanceof LinkDatum) {
341         // loop through, change out most recent timestamp if needed
342         LinkDatum next = (LinkDatum)value;
343         long timestamp = next.getTimestamp();
344         if (mostRecent == 0L || mostRecent < timestamp) {
345           mostRecent = timestamp;
346         }
347         outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
348         reporter.incrCounter("WebGraph.outlinks", "added links", 1);
349       }
350       else if (value instanceof BooleanWritable) {
351         BooleanWritable delete = (BooleanWritable)value;
352         // Actually, delete is always true, otherwise we don't emit it in the mapper in the first place
353         if (delete.get() == true) {
354           // This page is gone, do not emit it's outlinks
355           reporter.incrCounter("WebGraph.outlinks", "removed links", 1);
356           return;
357         }
358       }
359     }
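To pinpoint which key blows up before the task dies, something like the
sketch below could temporarily be dropped into that loop. This is a rough
sketch only; getUrl() on LinkDatum, the key variable and the LOG field are
assumptions on my side, not an existing Nutch change:

// Rough diagnostic sketch, not current Nutch code: count the values and the
// URL characters seen per key, so the task log points at the record that
// blows up before the heap fills.
int valueCount = 0;
long urlChars = 0;
while (values.hasNext()) {
  Writable value = values.next().get();
  valueCount++;
  if (value instanceof LinkDatum) {
    urlChars += ((LinkDatum)value).getUrl().length();
  }
  if (valueCount % 1000 == 0) {
    LOG.info("key " + key + ": " + valueCount + " values, " + urlChars
        + " url chars so far");
  }
  // ... existing LinkDatum / BooleanWritable handling as above ...
}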
According to the stack trace, cloning a specific LinkDatum causes
something to consume a lot of memory:
2012-04-11 16:02:11,530 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:760)
        at org.apache.hadoop.io.Text.encode(Text.java:388)
        at org.apache.hadoop.io.Text.encode(Text.java:369)
        at org.apache.hadoop.io.Text.writeString(Text.java:409)
        at org.apache.nutch.scoring.webgraph.LinkDatum.write(LinkDatum.java:126)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.util.ReflectionUtils.copy(ReflectionUtils.java:274)
        at org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:209)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:347)
        at org.apache.nutch.scoring.webgraph.WebGraph$OutlinkDb.reduce(WebGraph.java:111)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
This is the line where the object is cloned:
outlinkList.add((LinkDatum)WritableUtils.clone(next, conf));
In LinkDatum it's the writing of the URL field that is the last piece of
Nutch code in the trace:
Text.writeString(out, url);
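A quick way to convince myself that the allocation in that trace scales with
the length of the string being written would be a tiny standalone test like
the one below. Illustration only; it uses Text directly with a made-up
oversized URL, not the real failing record:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Text;

// Illustration only: Text.writeString() goes through Text.encode(), where the
// charset encoder allocates an output ByteBuffer proportional to the string
// length, so one huge url value asks for a very large buffer in a single call.
public class TextWriteCheck {
  public static void main(String[] args) throws IOException {
    StringBuilder url = new StringBuilder("http://www.example.com/?q="); // placeholder, not the real URL
    while (url.length() < 50000000) { // ~50M chars, deliberately absurd
      url.append('a');
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Text.writeString(new DataOutputStream(bytes), url.toString()); // same call as in LinkDatum.write()
    System.out.println("wrote " + bytes.size() + " bytes");
  }
}

If a record somehow ended up with a huge or corrupted URL string, a single
clone of it would be enough to trigger exactly this allocation.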
I'm puzzled here. Could the code leak something somewhere? We have had
this URL for a long time and it has happily passed through all jobs many
times before.
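Even a worst case back-of-envelope from the numbers in my earlier mail
below doesn't come close: 100 outlinks per record with URLs capped at
roughly 400 characters is about 100 * 400 * 2 bytes, or 80 KB of URL
character data per record, and even the 500+ outlink records I'm now
seeing would stay well under half a megabyte, nowhere near a heap of
hundreds of megabytes.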
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
<markus.jel...@openindex.io> wrote:
Hi,
Recently a reducer got killed because of this. Increasing the heap did
work, but the next job some days later failed as well. I looked at the
code and I cannot seem to find why it would take more than 400 MB of
RAM to process the outlinks of a single record. We do limit outlinks,
so the HashSets pages and domains are used. But we also limit the
number of outlinks per record in the parser to the default of 100, so
I would not expect the List and the two Sets in the reducer to use
that much.
Also, URLs longer than about 400 characters are discarded anyway.
Any thoughts to share?
Thanks,
Markus
--