Hello,

The links all have the same format, and they are not redirects. Is there anything significant I need to know about redirects other than the http.redirect.max property?
In any case, I figured out the issue. As Eric suggested, it was the file.content.limit property. I increased the value a hundredfold and it fetched every link.

Thank you all for your advice.

Cheers

Doğacan Güney-3 wrote:
>
> On Wed, Jan 14, 2009 at 8:44 PM, ahammad <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm still unable to find why Nutch is unable to fetch and index all the
>> links that are on the page. To recap, the Nutch urls file contains a link to
>> a jhtml file that contains roughly 2000 links, all hosted on the same server
>> in the same folder.
>>
>> Previously, I only got 111 links when I crawl. This was due to this:
>>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>100</value>
>>   <description>The maximum number of outlinks that we'll process for a page.
>>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>>   will be processed for a page; otherwise, all outlinks will be processed.
>>   </description>
>> </property>
>>
>> I changed the value to 2000, but I only got back 719 results. I also tried
>> to make the value -1, and I still get 719 results.
>>
>> What other settings can affect this? I've been trying to tweak
>> nutch-default.xml, but I couldn't improve the number of results. Any help
>> with this would be appreciated.
>>
>
> What does urls that are not fetched look like? Are they redirects?
>
>> Thank you.
>>
>> Cheers
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
> --
> Doğacan Güney
>

--
View this message in context: http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21482360.html
Sent from the Nutch - User mailing list archive at Nabble.com.
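
For anyone who finds this thread later, here is a rough sketch of the change described above. It assumes the override goes in conf/nutch-site.xml rather than being edited directly into nutch-default.xml, and it assumes the stock default of 65536 bytes, so a hundredfold increase gives 6553600. Both the file location and the exact number are assumptions; check them against your own installation. The likely explanation for the 719 results is that the page was truncated at the content limit, so only the links in the part that was downloaded were ever parsed.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Assumed: 100 x the stock default of 65536 bytes.
         A negative value should disable truncation entirely. -->
    <value>6553600</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- -1 means all outlinks on a page are processed. -->
    <value>-1</value>
  </property>
</configuration>
```

If the page is fetched over HTTP instead of from the file system, the corresponding property would be http.content.limit.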
