[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797990#action_12797990 ]
Julien Nioche commented on NUTCH-269:
-------------------------------------

I will shortly commit a variant of this approach in which the inlinks are stored in a priority queue so that only the best-scoring ones are kept. The size of the queue is set by the parameter db.update.max.inlinks, which has a default value of 10000.

> CrawlDbReducer: OOME because no upper-bound on inlinks count
> ------------------------------------------------------------
>
>                 Key: NUTCH-269
>                 URL: https://issues.apache.org/jira/browse/NUTCH-269
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Julien Nioche
>            Priority: Trivial
>         Attachments: too-many-links.patch, too-many-links2.patch
>
>
> A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands
> of inlinks (the British Foreign Office likes putting a clear.gif multiple
> times into each page:
> http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).
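
For readers unfamiliar with the bounded priority queue idea mentioned in the comment, here is a minimal sketch of one way it could work. It is not the committed Nutch code: the class name BoundedInlinks, the method keepBest, and the use of bare Double scores in place of Nutch's inlink records are all assumptions made for illustration. The maxInlinks argument plays the role of db.update.max.inlinks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not part of Nutch.
final class BoundedInlinks {
    private BoundedInlinks() {}

    /**
     * Returns at most maxInlinks of the highest scores, best first.
     * Memory use is bounded by maxInlinks, however many scores are seen.
     */
    static List<Double> keepBest(Iterable<Double> scores, int maxInlinks) {
        if (maxInlinks <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the head is the lowest score currently retained,
        // so it is the one to evict when a better score arrives.
        PriorityQueue<Double> best = new PriorityQueue<>();
        for (Double score : scores) {
            if (best.size() < maxInlinks) {
                best.add(score);
            } else if (score > best.peek()) {
                best.poll();      // drop the current worst
                best.add(score);  // keep the better one
            }
            // Otherwise the score is no better than anything kept; skip it.
        }
        List<Double> result = new ArrayList<>(best);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Calling BoundedInlinks.keepBest(scores, 10000) on a URL with hundreds of thousands of inlinks would retain only 10000 entries at any time, which is the property that avoids the OOME described in the issue.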