Markus Jelsma created NUTCH-3056:
------------------------------------

Summary: Injector to support resolving seed URLs
Key: NUTCH-3056
URL: https://issues.apache.org/jira/browse/NUTCH-3056
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.21

We have a case where clients submit huge, uncurated seed files: a host may no longer exist, a seed may redirect through several hops to somewhere else, the protocol may be incorrect, etc. The large crawl itself is not supposed to venture much beyond the seed list, except for the regex exceptions listed in db-ignore-external-exemptions. To control the size of the crawl, it is also not allowed to jump to other domains/hosts. This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded host/domain/protocol/redirect resolver to the injector.
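
To illustrate the idea of resolving seeds before injection, here is a minimal Java sketch. It is not the NUTCH-3056 patch and uses no Nutch APIs: the class `SeedResolverSketch`, its methods, and the `redirectOf` lookup are all hypothetical names. The lookup is assumed to stand in for a real HTTP request: it returns the redirect target, returns an empty `Optional` when the URL answers without redirecting, and throws when the host cannot be reached.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch only; none of these names come from Nutch or the NUTCH-3056 patch.
class SeedResolverSketch {

  // Follows redirects for one seed. redirectOf is an assumed lookup: it returns the
  // redirect target, an empty Optional when the URL answers without redirecting,
  // and throws a RuntimeException when the host cannot be reached.
  static Optional<String> resolve(String seed,
      Function<String, Optional<String>> redirectOf, int maxHops) {
    String current = seed;
    for (int hop = 0; hop <= maxHops; hop++) {
      Optional<String> next;
      try {
        next = redirectOf.apply(current);
      } catch (RuntimeException e) {
        return Optional.empty(); // dead host or bad protocol: drop the seed
      }
      if (!next.isPresent()) {
        return Optional.of(current); // no further redirect: this is the URL to inject
      }
      current = next.get();
    }
    return Optional.empty(); // redirect chain too long or looping: drop the seed
  }

  // Resolves all seeds on a fixed-size thread pool. Returns seed -> resolved URL,
  // in seed order, omitting seeds that could not be resolved.
  static Map<String, String> resolveAll(List<String> seeds,
      Function<String, Optional<String>> redirectOf, int maxHops, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Optional<String>>> futures = new ArrayList<>();
      for (String seed : seeds) {
        Callable<Optional<String>> task = () -> resolve(seed, redirectOf, maxHops);
        futures.add(pool.submit(task));
      }
      Map<String, String> resolved = new LinkedHashMap<>();
      for (int i = 0; i < seeds.size(); i++) {
        Optional<String> target = futures.get(i).get();
        if (target.isPresent()) {
          resolved.put(seeds.get(i), target.get());
        }
      }
      return resolved;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while resolving seeds", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("seed resolution failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

In this sketch the resolved URL, rather than the original seed, would be what gets injected, so a seed that redirects to another host can still be crawled under the no-external-links restriction. How the actual patch handles dead hosts, redirect limits, and thread counts may differ.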