I just started using Nutch to crawl an intranet site. My urls file contains a single link to a jhtml page, which itself contains roughly 2000 links. Those links contain characters like '?' and '=', so I removed the following lines from crawl-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I finally got the crawl to work, but I only see 111 results under "TOTAL urls:" when I run the following command:

bin/nutch readdb crawlTest/crawldb -stats

I'm not sure where to look at this point. Any ideas? (See the P.S. for the commands I'm running.)

By the way, what's the command that dumps all the links? None of the ones I've found online work...

Cheers
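P.S. In case it helps, the crawl command I'm running looks roughly like this (the directory name, depth, and topN are just my local choices, not anything special):

# depth/topN values are placeholders from my setup; crawlTest matches the readdb path above
bin/nutch crawl urls -dir crawlTest -depth 3 -topN 1000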

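And for the link dump, these are the commands I've been trying, adapted from posts I found (no luck so far; the output directories are just names I picked):

# dump the crawldb entries / the linkdb inlinks to text
bin/nutch readdb crawlTest/crawldb -dump crawldb-dump
bin/nutch readlinkdb crawlTest/linkdb -dump linkdb-dump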