Re: Crawler not fetching all the links

Eric J. Christeson Wed, 14 Jan 2009 14:05:12 -0800


On Jan 14, 2009, at 12:44 PM, ahammad wrote:

Hello,
I still can't work out why Nutch fails to fetch and index all the links on the page. To recap, the Nutch urls file contains a link to a jhtml file that contains roughly 2000 links, all hosted on the same server in the same folder.

Previously, I only got 111 links when I crawled. This was caused by the following setting:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>


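You may also want to change this one, since a page with roughly 2000 links may be larger than the default limit and get truncated before all of its links are parsed: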

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
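
As a rough sketch, both overrides might look like the following. This assumes they go in conf/nutch-site.xml (the usual place for local overrides, rather than editing nutch-default.xml) and that lifting both limits entirely is acceptable for your crawl. The -1 values are my choice for illustration, based only on the "otherwise" clauses in the descriptions above; a large positive value would also work.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides (illustrative values, not tested against your setup) -->
<configuration>

  <!-- Negative value: process all outlinks on a page instead of capping them. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>

  <!-- Negative value: do not truncate downloaded content. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

</configuration>
```

If the page is fetched over HTTP rather than from the local filesystem, the analogous http.content.limit property is probably the one that applies, so it may be worth overriding that as well.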

Eric
--

Eric J. Christeson <[email protected]>

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University
