Hi,

I have a problem with the topN value in Apache Nutch. I have more than 8 million db_unfetched pages in the crawldb. I run the crawl script with the following command:

    bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20 --size-fetchlist 500000 /nutch/crawl 1

The --size-fetchlist parameter is the topN for the generate step, so it should generate a segment with 500k pages to fetch. However, the fetcher fetches only around 100k pages. I also see a value of around 1 million for the SCHEDULE_REJECTED counter in the generate step, but I think those are just pages that I have already fetched.
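
To make the expectation concrete, here is a minimal Python sketch of fetchlist selection in which topN acts as an upper bound. It is a toy model under stated assumptions, not Nutch's actual Generator code: the function name, the tuple layout, the counter keys, and the `max_per_host` cap are all hypothetical and only illustrate how a list can end up shorter than topN even when plenty of pages are due.

```python
from collections import defaultdict


def generate_fetchlist(entries, top_n, now, max_per_host=None):
    """Toy model of fetchlist selection (hypothetical, not Nutch code).

    entries: iterable of (url, host, score, next_fetch_time) tuples.
    Returns (selected_urls, counters).
    """
    counters = {"schedule_rejected": 0, "skipped_host_cap": 0}

    # Step 1: drop entries that are not yet due for fetching.
    due = []
    for url, host, score, next_fetch_time in entries:
        if next_fetch_time > now:
            counters["schedule_rejected"] += 1
            continue
        due.append((url, host, score))

    # Step 2: take the highest-scoring entries, at most top_n in total,
    # optionally limited to max_per_host entries per host (an assumption).
    due.sort(key=lambda e: e[2], reverse=True)
    per_host = defaultdict(int)
    selected = []
    for url, host, _score in due:
        if len(selected) >= top_n:
            break
        if max_per_host is not None and per_host[host] >= max_per_host:
            counters["skipped_host_cap"] += 1
            continue
        per_host[host] += 1
        selected.append(url)
    return selected, counters


# Made-up data: 6 due pages on one host, 2 due on another, 2 not yet due.
entries = (
    [(f"http://a.example/{i}", "a.example", 1.0, 0) for i in range(6)]
    + [(f"http://b.example/{i}", "b.example", 0.5, 0) for i in range(2)]
    + [(f"http://a.example/old{i}", "a.example", 2.0, 100) for i in range(2)]
)

uncapped, _ = generate_fetchlist(entries, top_n=5, now=10)
capped, stats = generate_fetchlist(entries, top_n=5, now=10, max_per_host=2)
print(len(uncapped), len(capped), stats)
```

In this model the uncapped run fills the list up to topN, while the capped run returns fewer URLs than requested even though enough pages are due. Whether anything comparable applies to the crawl described here is not known.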

I have checked the URL filters and they affect only a few pages. What could be causing such a big difference between the requested 500k and the roughly 100k pages actually fetched?

Best,
Maciej

