On Wed, Jan 14, 2009 at 8:44 PM, ahammad <[email protected]> wrote:
>
> Hello,
>
> I'm still unable to figure out why Nutch won't fetch and index all the
> links on the page. To recap, the Nutch urls file contains a link to
> a jhtml file that holds roughly 2000 links, all hosted on the same server
> in the same folder.
>
> Previously, I only got 111 links when I crawled. This was due to this setting:
>
> <property>
>  <name>db.max.outlinks.per.page</name>
>  <value>100</value>
>  <description>The maximum number of outlinks that we'll process for a page.
>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>  will be processed for a page; otherwise, all outlinks will be processed.
>  </description>
> </property>
>
> I changed the value to 2000, but I only got back 719 results. I also tried
> to make the value -1, and I still get 719 results.
>
> What other settings can affect this? I've been trying to tweak
> nutch-default.xml, but I couldn't improve the number of results. Any help
> with this would be appreciated.
>

What do the URLs that are not fetched look like? Are they redirects?
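
Also worth checking: http.content.limit. By default Nutch truncates fetched
content at 65536 bytes (see nutch-default.xml), and outlinks past the cutoff
are never parsed, which could explain a count that stops short of 2000 on a
large page. A sketch of the overrides you could try in conf/nutch-site.xml
(the -1 values are just an example that disables both limits, not a verified
fix for your case):

```xml
<!-- conf/nutch-site.xml: example overrides, adjust to taste -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Don't truncate fetched content; the default 65536-byte
  cap can silently drop outlinks near the end of a large page.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Process all outlinks found on a page.</description>
</property>
```

After changing these you'd need to re-fetch and re-parse the page for the
new limits to take effect.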

> Thank you.
>
> Cheers
>
>
>
> --
> View this message in context: 
> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney
