Hello, I'm still unable to figure out why Nutch won't fetch and index all of the links on the page. To recap: the Nutch urls file contains a single link to a jhtml file that itself contains roughly 2000 links, all hosted on the same server in the same folder.
Previously, I only got 111 links when I crawled. This was due to this property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.</description>
</property>

I changed the value to 2000, but I only got back 719 results. I also tried setting the value to -1, and I still got 719 results. What other settings could affect this? I've been tweaking nutch-default.xml, but I couldn't improve the number of results. Any help with this would be appreciated. Thank you.

Cheers


ahammad wrote:
>
> I just started using Nutch to crawl an intranet site. In my urls file, I
> have a single link that refers to a jhtml page, which contains roughly
> 2000 links in it. The links contain characters like '?' and '=', so I
> removed the following line from the crawl-urlfilter.txt file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
>
> bin/nutch readdb crawlTest/crawldb -stats
>
> I'm not sure where to look at this point. Any ideas?
>
> BTW, what's the command that dumps all the links? Every one that I found
> online doesn't work...
>
> Cheers
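P.S. In case it helps, this is the kind of override I'm now putting in conf/nutch-site.xml rather than editing nutch-default.xml directly (nutch-site.xml is the intended place for local overrides). The http.content.limit entry is only a guess on my part: if the fetched page is truncated at the default 65536 bytes, I imagine any outlinks past the cutoff would be lost as well.

<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>Negative value: process all outlinks on a page.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Negative value: do not truncate fetched content. The default
    in nutch-default.xml is 65536 bytes, which could drop trailing links on a
    very large page. (This is a guess, not a confirmed fix.)</description>
  </property>
</configuration>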
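P.P.S. For the dump question in my earlier message quoted above, the commands I've been trying look like this (assuming the linkdb sits next to the crawldb under crawlTest, which is where bin/nutch crawl put mine; the output directory names are arbitrary):

bin/nutch readdb crawlTest/crawldb -dump crawldb-dump
bin/nutch readlinkdb crawlTest/linkdb -dump linkdb-dump

The first dumps every URL in the crawl db along with its status; the second dumps the inlinks recorded for each URL.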
