[jira] [Updated] (NUTCH-3056) Injector to support resolving seed URLs

Markus Jelsma (Jira) Thu, 16 May 2024 03:05:06 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Markus Jelsma updated NUTCH-3056:
---------------------------------
    Description: 
We have a case where clients submit huge uncurated seed files: the host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirect resolver to the injector.
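
As an illustration of the idea only, and not the Nutch implementation, a 
threaded resolver can be sketched in Python. The names follow_redirects and 
resolve_seeds, the HEAD request, the timeout and the thread count are all 
assumptions made for this sketch.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def follow_redirects(url, timeout=10):
    """Return the final URL after redirects, or None if the seed is unreachable."""
    try:
        # HEAD avoids downloading the body; urlopen follows redirects itself.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        # Dead hosts, HTTP errors, timeouts and malformed URLs all count as unresolved.
        return None


def resolve_seeds(seeds, resolve=follow_redirects, threads=8):
    """Resolve seeds concurrently; return {original: final} for reachable seeds."""
    seeds = list(seeds)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with seeds.
        finals = list(pool.map(resolve, seeds))
    return {seed: final for seed, final in zip(seeds, finals) if final is not None}
```

Injecting the resolved URL instead of the original seed keeps a redirecting 
seed inside the crawl even when external links are otherwise ignored.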

If you have a seed file with 10k+ or millions of records, it is highly 
recommended to split the input file into chunks so that multiple mappers can 
get to work. Passing a few million records through one mapper without resolving 
is no problem, but resolving millions with one mapper, even if threaded, will 
take many hours.
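
The splitting step can be sketched as below. This is a minimal example, 
assuming one URL per line; the function name split_seed_file, the chunk file 
naming and the default chunk size are hypothetical and not part of Nutch.

```python
from pathlib import Path


def split_seed_file(seed_file, out_dir, lines_per_chunk=10000):
    """Copy seed_file into numbered chunk files of at most lines_per_chunk lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    chunk = None
    try:
        with open(seed_file, encoding="utf-8") as source:
            for index, line in enumerate(source):
                if index % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines.
                    if chunk is not None:
                        chunk.close()
                    path = out_dir / f"seeds-{len(chunk_paths):05d}.txt"
                    chunk = open(path, "w", encoding="utf-8")
                    chunk_paths.append(path)
                chunk.write(line)
    finally:
        if chunk is not None:
            chunk.close()
    return chunk_paths
```

The resulting directory can then be given to the injector in place of the 
single file. This assumes the usual Hadoop behaviour in which each input file 
yields at least one split, and therefore at least one mapper.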

  was:
We have a case where clients submit huge uncurated seed files: the host may no 
longer exist or may redirect through several hops to elsewhere, the protocol 
may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, 
except for regex exceptions listed in db-ignore-external-exemptions. To control 
the size of the crawl, it is also not allowed to jump to other domains/hosts. 
This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded 
host/domain/protocol/redirect resolver to the injector.


> Injector to support resolving seed URLs
> ---------------------------------------
>
>                 Key: NUTCH-3056
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3056
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.21
>
>
> We have a case where clients submit huge uncurated seed files: the host may 
> no longer exist or may redirect through several hops to elsewhere, the 
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list, 
> except for regex exceptions listed in db-ignore-external-exemptions. To 
> control the size of the crawl, it is also not allowed to jump to other 
> domains/hosts. This means seeds that redirect externally will not be crawled.
> This ticket will add support for a multi-threaded 
> host/domain/protocol/redirect resolver to the injector.
> If you have a seed file with 10k+ or millions of records, it is highly 
> recommended to split the input file into chunks so that multiple mappers can 
> get to work. Passing a few million records through one mapper without 
> resolving is no problem, but resolving millions with one mapper, even if 
> threaded, will take many hours.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
