On Jan 14, 2009, at 12:44 PM, ahammad wrote:


Hello,

I still can't work out why Nutch fails to fetch and index all the links on the page. To recap, the Nutch urls file contains a link to a jhtml file that contains roughly 2000 links, all hosted on the same server in the same folder.

Previously, I only got 111 links when I crawled. This was due to the following property:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

You may also want to change this one:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
  </description>
</property>
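
As a rough sketch of how both limits might be lifted together: the fragment below assumes the overrides go in conf/nutch-site.xml (the usual place for local overrides of the defaults) and uses -1 for each value, since the descriptions above say a negative value disables the limit. If the page is fetched over HTTP rather than from the local file system, the analogous http.content.limit property is probably the one that matters, so it is included here as an assumption as well.

```xml
<?xml version="1.0"?>
<!-- Sketch of local overrides; the file location and the choice of -1 are assumptions. -->
<configuration>

  <!-- Negative value: process all outlinks found on a page. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>

  <!-- Negative value: do not truncate content read via the file protocol. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

  <!-- Assumption: only relevant if the page is fetched over HTTP. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

</configuration>
```

Removing the limits entirely can make very large pages expensive to fetch and parse, so a large positive value may be a safer choice than -1 if the page size is known.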

Eric
--
Eric J. Christeson <[email protected]>
Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University
