Hi,

On Mon, Jan 12, 2009 at 7:03 PM, ahammad <[email protected]> wrote:
>
> I just started using Nutch to crawl an intranet site. In my urls file, I have
> a single link that refers to a jhtml page, which contains roughly 2000 links
> in it. The links contain characters like '?' and '=', so I removed the
> following from the crawl-urlfilter.txt file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I finally got the crawl to work, but I only see 111 results under "TOTAL
> urls:" when I run the following command:
>
> bin/nutch readdb crawlTest/crawldb -stats
>
> I'm not sure where to look at this point. Any ideas?
>
> BTW what's the command that dumps all the links? Every one that I found
> online doesn't work...
>

Nutch only considers the first 100 links from a page by default. You
can change this with the following option:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
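
For example, to process all outlinks, you could override it in
conf/nutch-site.xml (just a sketch; per the description above, a
negative value means no limit):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Process all outlinks (a negative value disables the limit).
  </description>
</property>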

You can dump inverted links with the "readlinkdb" command. To see all
links from a page you can do:

bin/nutch readseg -get <segment> <url> -nocontent -nofetch -noparse -nogenerate -noparsetext
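
For dumping the inverted link database, something like this should work
(assuming the linkdb from your crawl above is at crawlTest/linkdb; the
output directory name is arbitrary):

bin/nutch readlinkdb crawlTest/linkdb -dump linkdb-dump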

> Cheers



-- 
Doğacan Güney
