I've been using Nutch for a few years to do conventional link-to-link
crawls of our local websites, but I would like to switch to doing crawls
based on sitemaps. So far I've had no luck doing this.
I'm not sure I've configured this correctly, and the documentation I've
found has left me guessing at many things. Why aren't the pages listed
in the sitemaps being fetched and indexed?
I've installed Nutch 1.14 and Solr 6.6.0. My urls/seeds.txt file
contains only the URLs for the 4 sitemaps I'm interested in. After running:
bin/crawl -i -D "solr.server.url=http://localhost:8983/solr/nutch" -s urls crawl 5
the crawl ends after 3 of 5 iterations and only 3 documents are in the
index: 3 of the seeds.
I do get error messages saying that 3 of the sitemap files (ones that
contain <urlset> elements) are malformed, for example:
2018-05-23 08:57:24,564 ERROR tika.TikaParser - Error parsing
https://uwaterloo.ca/library/sitemap.xml
Caused by: org.xml.sax.SAXParseException; lineNumber: 420; columnNumber:
122; XML document structures must start and end within the same entity.
But I can't find anything wrong with the sitemaps: other validators say
they're OK, and the location pointed to (line 420, column 122) falls in
the middle of a directory name inside a URL.
Is there good documentation or a tutorial on using Nutch with sitemaps?