Hi Maciek,

There are multiple configuration properties which limit the number of items
put into a fetch list during generation.

- topN (Ok, it's obviously not the reason)

- a limit per host is defined by the property generate.max.count
  - default is -1 (no limit)
  - you may want to set a limit per host in order to avoid that a single
    host with an overlong fetch queue slows down the overall crawl
  - via generate.count.mode this limit can be applied per registered
    domain or IP address instead (see the configuration sketch after
    this list)

- generate.min.score (default: 0.0): only CrawlDatum items with a score
  above this threshold are put into fetch lists

- the fetch schedule: pages are re-fetched only after a certain amount of
  time (default: 30 days); a page which failed to fetch with an error
  (not a 404) is retried only after waiting 1 day


Running the CrawlDb statistics

  bin/nutch readdb crawldb -stats

shows the number of items per status, retry count, the distribution
of scores and fetch intervals.
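
To look at individual records (e.g., to check why a known unfetched URL is
not selected), the CrawlDb reader can also print single entries or dump
filtered subsets. The exact options may vary between Nutch versions, so
please verify with the usage output of "bin/nutch readdb":

  # print status, fetch time, retry count and score of a single URL
  # (https://www.example.org/ is just a placeholder)
  bin/nutch readdb crawldb -url https://www.example.org/

  # dump only the db_unfetched entries for inspection
  bin/nutch readdb crawldb -dump crawldb-dump -status db_unfetched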


> should generate a segment with 500k pages to fetch.

Not necessarily, see above. The final size of the fetch list
is shown by the counter
  Reduce output records=NNN
Note: because of NUTCH-3059, look at the second appearance of this counter
in the generator log.

> fetches only around 100k pages.

The fetcher counters also show how many items were skipped because of the
fetcher timelimit and the like.
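
Note that the crawl script passes a fetcher time limit by default (if I
remember correctly, 180 minutes, via the property fetcher.timelimit.mins).
If the fetch list contains many slow hosts, this limit alone can cut the
run short. For example, to raise it (option name as in recent versions of
bin/crawl, please verify against the script's usage message):

  bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20 \
    --size-fetchlist 500000 --time-limit-fetch 360 /nutch/crawl 1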


Let us know whether you need more information. If possible, please share the
CrawlDb statistics or the generator and fetcher counters; that might help
to find the reason.

Best,
Sebastian

On 3/27/25 18:56, Maciek Puzianowski wrote:
> Hi,
> I have a problem with the topN value in Apache Nutch.
> I have 8 million+ db_unfetched pages in the crawldb. I use the crawl script
> with the following command:
> bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
> --size-fetchlist 500000 /nutch/crawl 1
> The --size-fetchlist parameter is the topN for the generate method, meaning
> that it should generate a segment with 500k pages to fetch. However, the
> fetcher fetches only around 100k pages. Also, I get around 1 million in the
> SCHEDULE_REJECTED counter in the generate step, but I think it's just pages
> that I have already fetched.
>
> I have checked the URL filters and they affect only a few pages.
>
> What could be causing such an issue with such a big difference?
>
> Best,
> Maciej

