Hi,
I have a problem with the topN value in Apache Nutch.
I have 8 million+ db_unfetched pages in the crawldb. I use the crawl script with
the following command:
bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
--size-fetchlist 500000 /nutch/crawl 1
The --size-fetchlist parameter is the topN for the generate step, meaning that it
should generate a segment with 500k pages to fetch. However, the fetcher
fetches only around 100k pages. I also get a SCHEDULE_REJECTED counter of around
1 million in the generate step, but I think those are just pages
that I have already fetched.
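
To make the expectation concrete, here is a minimal sketch of how a top-N selection could return fewer URLs than topN when a per-host cap is in effect. It is not Nutch's Generator code: `select_fetchlist`, `max_per_host` and the toy URLs are hypothetical names and data made up for illustration. Nutch has a `generate.max.count` property for this kind of limit, but whether it is set in this setup is an assumption, so this shows only one possible cause worth ruling out.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_fetchlist(scored_urls, top_n, max_per_host=-1):
    """Pick up to top_n URLs by descending score, optionally capping each host.

    Hypothetical helper for illustration; this is not Nutch's Generator code.
    max_per_host=-1 means there is no per-host cap.
    """
    per_host = defaultdict(int)
    selected = []
    for url, score in sorted(scored_urls, key=lambda pair: pair[1], reverse=True):
        if len(selected) >= top_n:
            break
        host = urlparse(url).netloc
        if max_per_host >= 0 and per_host[host] >= max_per_host:
            continue  # this host's quota is used up, so skip the URL
        per_host[host] += 1
        selected.append(url)
    return selected


# Toy crawldb: 2 hosts with 10 unfetched URLs each (made-up data).
urls = [(f"http://host{h}.example/page{i}", 1.0 / (i + 1))
        for h in range(2) for i in range(10)]

print(len(select_fetchlist(urls, top_n=15)))                  # 15: topN is reached
print(len(select_fetchlist(urls, top_n=15, max_per_host=3)))  # 6: cap wins over topN
```

If the generated segment already holds far fewer than 500k URLs, a cap of this kind would be one candidate. If the segment does hold about 500k, the loss would have to happen in the fetch step instead.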

I have checked the URL filters and they affect only a few pages.

What could be causing such a big difference?

Best,
Maciej
