On Jan 14, 2009, at 12:44 PM, ahammad wrote:
Hello,

I'm still unable to find why Nutch is unable to fetch and index all the links that are on the page. To recap, the Nutch urls file contains a link to a jhtml file that contains roughly 2000 links, all hosted on the same server in the same folder. Previously, I only got 111 links when I crawled. This was due to this:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

You may also want to change this one:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

Eric

--
Eric J. Christeson <[email protected]>
Enterprise Computing and Infrastructure
(701) 231-8693 (Voice)
North Dakota State University
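
For reference, a minimal sketch of how both settings might be overridden. It assumes the usual Nutch convention of putting overrides in conf/nutch-site.xml rather than editing nutch-default.xml. The -1 values are illustrative only: going by the property descriptions quoted above, a negative value removes the outlink cap and disables truncation, respectively. Note also that file.content.limit applies to the file protocol; if the jhtml page is fetched over HTTP, the analogous http.content.limit property is probably the one that matters, which is an assumption about this particular setup.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Negative value: process every outlink found on a page (no cap). -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- Negative value: do not truncate downloaded content. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

A large positive limit may be a safer choice than -1 on crawls that can hit very large pages, since removing the caps entirely means every outlink and every byte is kept.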
