[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738191#comment-17738191 ]
Markus Jelsma commented on NUTCH-2993:
--------------------------------------

To be honest, I am not too happy with the implementation as it stands. Ideally we would run the regex against all outlinks, but that would be even more costly. The crawler still ends up in bad sections of the site and further out on the web, but with low depth settings it is manageable.

> ScoringDepth plugin to skip depth check based on URL Pattern
> ------------------------------------------------------------
>
>                 Key: NUTCH-2993
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2993
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.20
>
>         Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured
> pattern. URLs not matching the pattern get the configured default max depth
> value.
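
For readers who have not opened the attached patches, the behaviour described in the issue can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name `DepthOverrideSketch`, its method names, and the constructor parameters are all hypothetical, and it does not use the Nutch scoring filter API. It assumes the pattern is matched once against the parent URL and the resulting max depth applies to all of that page's outlinks, which is why it is cheaper than matching every outlink.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not the class or API used in the NUTCH-2993 patch. */
final class DepthOverrideSketch {
    private final Pattern overridePattern; // assumed config value: URL pattern
    private final int overrideMaxDepth;    // assumed config value: depth for matching URLs
    private final int defaultMaxDepth;     // assumed config value: depth for everything else

    DepthOverrideSketch(String regex, int overrideMaxDepth, int defaultMaxDepth) {
        this.overridePattern = Pattern.compile(regex);
        this.overrideMaxDepth = overrideMaxDepth;
        this.defaultMaxDepth = defaultMaxDepth;
    }

    /** Max depth applied to every outlink of the given parent page. */
    int maxDepthForOutlinksOf(String parentUrl) {
        // One regex match per parent page, not one per outlink.
        return overridePattern.matcher(parentUrl).find()
                ? overrideMaxDepth
                : defaultMaxDepth;
    }

    /** True if an outlink found at outlinkDepth may still be followed. */
    boolean mayFollow(String parentUrl, int outlinkDepth) {
        return outlinkDepth <= maxDepthForOutlinksOf(parentUrl);
    }
}
```

With a pattern such as `^https?://example\.org/docs/`, an override depth of 10, and a default of 2, outlinks of pages under `/docs/` would be followed up to depth 10 while outlinks of all other pages stop at depth 2. Whether the override raises or lowers the limit is a configuration choice in this sketch.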