Re: Crawler not fetching all the links

ahammad Thu, 15 Jan 2009 09:19:25 -0800

Hello,

The links are all in the same format, and they are not redirects. Is there
something significant I need to know about redirects other than the
http.redirect.max property?


In any case, I figured out the issue. As Eric suggested, it was the
file.content.limit property. I increased the value a hundredfold and it
fetched every link. Thank you all for your advice.
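
For reference, a minimal sketch of what such an override might look like. It
assumes the stock default of 65536 bytes for file.content.limit (so a
hundredfold increase gives 6553600) and assumes the override is placed in
conf/nutch-site.xml rather than edited directly in nutch-default.xml; neither
the exact value nor the file is stated above.

```xml
<!-- Hypothetical override; the value assumes a 65536-byte default times 100. -->
<property>
  <name>file.content.limit</name>
  <value>6553600</value>
  <description>Length limit, in bytes, for content downloaded via the
  file:// protocol. Content longer than this is truncated, so outlinks
  past the cutoff are never seen by the parser. A negative value
  disables truncation.
  </description>
</property>
```

If the page is fetched over HTTP rather than from the local file system, the
analogous property is http.content.limit.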

Cheers



Doğacan Güney-3 wrote:
> 
> On Wed, Jan 14, 2009 at 8:44 PM, ahammad <[email protected]> wrote:
>>
>> Hello,
>>
>> I still can't work out why Nutch does not fetch and index all the links
>> that are on the page. To recap, the Nutch urls file contains a link to a
>> jhtml file that contains roughly 2000 links, all hosted on the same server
>> in the same folder.
>>
>> Previously, I only got 111 links when I crawled. This was due to the
>> following setting:
>>
>> <property>
>>  <name>db.max.outlinks.per.page</name>
>>  <value>100</value>
>>  <description>The maximum number of outlinks that we'll process for a page.
>>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>>  outlinks will be processed for a page; otherwise, all outlinks will be
>>  processed.
>>  </description>
>> </property>
>>
>> I changed the value to 2000, but I only got back 719 results. I also tried
>> setting the value to -1, and I still got 719 results.
>>
>> What other settings can affect this? I've been trying to tweak
>> nutch-default.xml, but I couldn't improve the number of results. Any help
>> with this would be appreciated.
>>
> 
> What do the URLs that are not fetched look like? Are they redirects?
> 
>> Thank you.
>>
>> Cheers
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21462474.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Doğacan Güney
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Crawler-not-fetching-all-the-links-tp21418679p21482360.html
Sent from the Nutch - User mailing list archive at Nabble.com.
