On Wed, Jan 14, 2009 at 8:44 PM, ahammad <[email protected]> wrote:
>
> Hello,
>
> I'm still unable to find out why Nutch fails to fetch and index all of
> the links on the page. To recap, the Nutch urls file contains a link to
> a jhtml file that holds roughly 2000 links, all hosted on the same
> server in the same folder.
>
> Previously, I only got 111 links when I crawled. This was due to this:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>   outlinks will be processed for a page; otherwise, all outlinks will be
>   processed.
>   </description>
> </property>
>
> I changed the value to 2000, but I only got back 719 results. I also tried
> setting the value to -1, and I still got 719 results.
>
> What other settings can affect this? I've been trying to tweak
> nutch-default.xml, but I couldn't improve the number of results. Any help
> with this would be appreciated.
>
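One note on the quoted property: overrides normally go in conf/nutch-site.xml
(which takes precedence over nutch-default.xml) rather than being edited in
place. And since -1 did not change the count, the cap may not be the problem
at all: http.content.limit (65536 bytes by default) truncates fetched content
before parsing, which would hide the later links on a ~2000-link page. A
minimal sketch of both overrides in conf/nutch-site.xml:

<?xml version="1.0"?>
<configuration>
  <!-- Process all outlinks on a page; a negative value disables the cap. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- Do not truncate fetched content (default limit is 65536 bytes),
       so the parser sees the whole page and all of its links. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>

The page would then need to be re-fetched and re-parsed for the extra
outlinks to reach the crawldb.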
What do the URLs that are not fetched look like? Are they redirects?

> Thank you.
>
> Cheers
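One way to check: dump the crawldb and look at the status recorded for the
missing URLs (a sketch, assuming a standard layout with the crawldb at
crawl/crawldb; adjust the path to match your crawl directory):

# Per-status counts: db_unfetched, db_gone, db_redir_temp, db_redir_perm, ...
bin/nutch readdb crawl/crawldb -stats

# Full text dump of every entry; output lands in Hadoop part files
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Redirected URLs show up with a db_redir_* status
grep -i "db_redir" crawldb-dump/part-00000

A single URL can also be inspected directly with readdb's -url option.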
--
Doğacan Güney