It turns out a few parameters do differ between the two modes. Specifically, "http.max.delays" is set to 3 in the main config file but to 100 for the crawler. This matters a lot when a site is slow: with the low value, many fetches time out and fail. The failed pages still show up as page entries (they are linked from pages that did fetch), but they never get parsed, which is why there are far fewer link entries. Increasing the value reduces the fetch errors.
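For anyone hitting the same problem, a minimal sketch of the fix: override the property in a local config file so the step-by-step run uses the crawler's more patient setting. The file name nutch-site.xml is assumed from the usual Nutch convention of overriding nutch-default.xml there; the value 100 is the one observed above.

```xml
<!-- nutch-site.xml (assumed override file): raise the retry limit for
     slow sites so manual fetch steps behave like the crawl tool -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
```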

Sorry to have bothered you all.

When I index a site with the "crawl" option, I get all the pages and their
links.  However, when I index the same site step by step, following the whole
web crawling example with a filter limited to a single domain, I get very
limited link output.  I set "db.ignore.internal.links" to false in both
cases.  Is there any other configuration setting I missed?  Looking at the
CrawlTool, the manual steps seem exactly the same.  For our application I need
the complete link data more than the text index.  Thanks.



-Kenji





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
