Markus Jelsma created NUTCH-3056:
------------------------------------

             Summary: Injector to support resolving seed URLs
                 Key: NUTCH-3056
                 URL: https://issues.apache.org/jira/browse/NUTCH-3056
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.21


We have a case where clients submit huge, uncurated seed files: a host may no 
longer exist, a seed may redirect through several hops to somewhere else, the 
protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirecter/resolver to the injector.
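
As a rough illustration of the idea only, and not the implementation proposed 
for this ticket, a resolver of this kind could follow each seed's redirect 
chain on a thread pool and keep only the final URL. The class name 
SeedResolver, the redirectLookup function and the maxRedirects/threads 
parameters are all hypothetical. The lookup is passed in as a function so the 
sketch does no network I/O of its own; a real version would presumably issue 
HTTP requests through Nutch's protocol plugins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a multi-threaded seed resolver; not Nutch code. */
class SeedResolver {
  // Assumption: returns the redirect target of a URL, or empty if the URL
  // answers without redirecting. May throw if the host is unreachable.
  private final Function<String, Optional<String>> redirectLookup;
  private final int maxRedirects;
  private final int threads;

  SeedResolver(Function<String, Optional<String>> redirectLookup,
      int maxRedirects, int threads) {
    this.redirectLookup = redirectLookup;
    this.maxRedirects = maxRedirects;
    this.threads = threads;
  }

  /** Follows redirects; empty on a loop, a too-long chain, or a failure. */
  Optional<String> resolve(String seed) {
    Set<String> seen = new HashSet<>();
    String current = seed;
    try {
      for (int hops = 0; hops <= maxRedirects; hops++) {
        if (!seen.add(current)) {
          return Optional.empty(); // redirect loop
        }
        Optional<String> next = redirectLookup.apply(current);
        if (!next.isPresent()) {
          return Optional.of(current); // final destination reached
        }
        current = next.get();
      }
    } catch (RuntimeException e) {
      return Optional.empty(); // e.g. host no longer exists
    }
    return Optional.empty(); // more than maxRedirects redirects
  }

  /** Resolves all seeds in parallel, dropping those that cannot be resolved. */
  List<String> resolveAll(List<String> seeds) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Optional<String>>> futures = new ArrayList<>();
      for (String seed : seeds) {
        Callable<Optional<String>> task = () -> resolve(seed);
        futures.add(pool.submit(task));
      }
      List<String> resolved = new ArrayList<>();
      for (Future<Optional<String>> future : futures) {
        future.get().ifPresent(resolved::add); // keeps input order
      }
      return resolved;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("Seed resolution failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Injecting the resolved URL instead of the original seed would let a seed that 
redirects externally pass the ignore-external rule, since the crawl then 
starts from the final host.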



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
