[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: ExtremeNestedTags.patch

This is a rudimentary fix for those who want an immediate workaround for this issue. The patch simply alters DomContentUtils to skip parsing links that are nested more than 50 levels deep. I will provide a more robust patch with configuration options and unit tests when time allows. I have run this patch successfully on a production system.

> Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-497
>                 URL: https://issues.apache.org/jira/browse/NUTCH-497
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0, 1.0.0
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: ExtremeNestedTags.patch
>
>
> Some webpages contain a form of spider trap that causes a
> StackOverflowException in DomContentUtils by nesting tags thousands of
> layers deep. When extracting outlinks, DomContentUtils uses a recursive
> method to parse the HTML, and with this degree of nesting it errors out.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
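
For readers who want to see the idea behind the workaround, here is a minimal sketch of a depth-limited recursive walk over a DOM tree. It is not the contents of ExtremeNestedTags.patch and not the actual DomContentUtils code. The class name DepthLimitedLinks, the methods collectLinks and nestedDocument, and the MAX_DEPTH constant are all hypothetical names chosen for illustration; only the 50-level cutoff comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative sketch only; names here are assumptions, not Nutch APIs.
final class DepthLimitedLinks {

    /** Hypothetical cutoff mirroring the 50-level limit mentioned in the patch note. */
    static final int MAX_DEPTH = 50;

    /** Collects href values of <a> elements found at most maxDepth levels below root. */
    static List<String> collectLinks(Node root, int maxDepth) {
        List<String> links = new ArrayList<>();
        walk(root, 0, maxDepth, links);
        return links;
    }

    private static void walk(Node node, int depth, int maxDepth, List<String> links) {
        if (depth > maxDepth) {
            return; // too deep: abandon this subtree instead of recursing further
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, maxDepth, links);
        }
    }

    /** Builds a chain of `levels` nested <div> elements with one <a href> at the bottom. */
    static Document nestedDocument(int levels, String href) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element current = doc.createElement("div");
        doc.appendChild(current);
        for (int i = 1; i < levels; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        Element anchor = doc.createElement("a");
        anchor.setAttribute("href", href);
        current.appendChild(anchor);
        return doc;
    }
}
```

Calling collectLinks(nestedDocument(10, url), MAX_DEPTH) returns the single link, while a document built with 5000 levels yields an empty list because the walk stops once it passes depth 50. The trade-off is the one the patch note implies: links buried below the cutoff are silently dropped. An iterative traversal with an explicit stack would avoid deep recursion without losing those links, but that is a different design from the one described here.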