Hello,

I'm still unable to figure out why Nutch won't fetch and index all the
links on the page. To recap, the Nutch urls file contains a single link to
a jhtml file with roughly 2000 links, all hosted on the same server in the
same folder.

Previously, I only got 111 links when I crawled. That was due to this property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>

I changed the value to 2000, but I only got back 719 results. I also tried
setting the value to -1, and I still get 719 results.
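
For reference, here is how I believe the override should look. (My
understanding is that overrides belong in conf/nutch-site.xml, which takes
precedence over nutch-default.xml; the property name is the same either way,
and the value below is just the one I tried last.)

<!-- conf/nutch-site.xml: override so that all outlinks are processed -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Negative value means process all outlinks per page.</description>
</property>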

What other settings could affect this? I've been tweaking
nutch-default.xml, but I haven't been able to improve the number of
results. Any help with this would be appreciated.
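
In case it helps, this is how I have been counting results. I am assuming
that readdb's -dump option writes one plain-text record per URL, each
starting with the URL itself; if that assumption is wrong, so is my count:

# Overall statistics (this is where I see 719 under "TOTAL urls:"):
bin/nutch readdb crawlTest/crawldb -stats

# Dump the crawldb to text so I can compare it against the ~2000 links
# on the jhtml page:
bin/nutch readdb crawlTest/crawldb -dump crawldb-dump
cat crawldb-dump/part-* | grep -c '^http'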

Thank you.

Cheers


ahammad wrote:
> 
> I just started using Nutch to crawl an intranet site. In my urls file, I
> have a single link that refers to a jhtml page, which contains roughly
> 2000 links in it. The links contain characters like '?' and '=', so I
> removed the following from the crawl-urlfilter.txt file:
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
> 
> bin/nutch readdb crawlTest/crawldb -stats 
> 
> I'm not sure where to look at this point. Any ideas?
> 
> BTW, what's the command that dumps all the links? None of the ones I've
> found online seem to work...
> 
> Cheers
> 

