Re: Crawler not fetching all the links

ahammad Mon, 12 Jan 2009 10:35:31 -0800

Doğacan Güney-3 wrote:
> 
> Hi,
> 
> Nutch only considers the first 100 links from a page by default. You
> can change this with this option:
> 
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> 
> You can dump inverted links with the "readlinkdb" command. To see all links
> from a page you can do:
> 
> bin/nutch readseg -get <segment> <url> -nocontent -nofetch -noparse -nogenerate -noparsetext
> 
>> Cheers
>> --
>> View this message in context:
>> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21418679.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> -- 
> Doğacan Güney
> 
> 



Thank you for pointing me in the right direction. I changed the value from
100 to 2000, and now I get 719 results. That is certainly an improvement, but
it is still far fewer than the actual number of links on the jhtml page.
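
For reference, an override like the one described above could look roughly
like this. This is only a sketch: it assumes the change goes in
conf/nutch-site.xml (the usual place for local overrides) rather than being
edited directly in nutch-default.xml, and the description text is illustrative.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>2000</value>
  <description>Raised from the default of 100 so that more outlinks per page
  are processed.</description>
</property>

Per the property description quoted above, a negative value (for example -1)
would remove the cap entirely.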

What other settings could affect this (e.g. a file size limit)? Do you have
any suggestions?

Thank you very much for your time.

Cheers
-- 
View this message in context: 
http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21420769.html
Sent from the Nutch - User mailing list archive at Nabble.com.