Hi Bob,

it's impossible to make any diagnosis without the full log files,
the complete configuration, and a detailed description of what is missing.

It could be a bug, of course. But it's more likely a configuration issue;
you should check the log files. Also have a look at:
- the robots.txt of the crawled sites
- your URL filters
- http.content.limit

These are often the reason why links are not found or not fetched.
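
For example, pages larger than http.content.limit (64 kB by default) are
truncated, and links beyond the cut-off are silently lost. If that's what
happens in your crawl, you can raise or disable the limit in
conf/nutch-site.xml:

  <property>
    <name>http.content.limit</name>
    <value>-1</value>  <!-- -1 disables the limit -->
  </property>

Similarly, the default conf/regex-urlfilter.txt skips URLs containing
certain characters ('?', '=', etc.), so pages reachable only via
query-string links are never fetched. Comment out or relax this rule if
your site needs those URLs:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]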


> even when I use the sitemap.xml as a seed url.

You need to use the SitemapProcessor:
  bin/nutch sitemap
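
For example (paths are placeholders, and the exact options depend on your
Nutch version - run bin/nutch sitemap without arguments to see the usage
message):

  bin/nutch sitemap crawl/crawldb -sitemap urls/sitemaps -threads 4

This parses the sitemaps and injects the listed URLs into the CrawlDb, so
the following generate/fetch rounds can pick them up. Using sitemap.xml as
a plain seed URL is not enough.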

Best,
Sebastian

On 05/09/2018 07:08 PM, Robert Scavilla wrote:
>  Hello and thank you for your help. I'm confused why nutch 1.14 (I've had
> the same issues with earlier versions) is not crawling full websites. I set
> the number of rounds to a generous number and the crawl quits without
> crawling the whole site with the message "No New Links Found". This happens
> even when I use the sitemap.xml as a seed url.
> 
> Any help is greatly appreciated.
> 
> Best,
> ...bob
> 
