It turns out a few parameters do differ between the two modes. Specifically, "http.max.delays" is set to 3 in the main config file but to 100 for the crawler. This matters a lot when a site is slow: with the low value, many fetches time out and fail. The failed pages still show up as page entries (they are linked from pages that did fetch), but they never get parsed, which is why there are far fewer link entries. Increasing the value reduces the fetch errors.
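For anyone hitting the same problem, a minimal sketch of the fix: override the property in a local config file so the step-by-step run uses the crawler's more patient setting. The file name nutch-site.xml is assumed from the usual Nutch convention of overriding nutch-default.xml there; the value 100 is the one observed above.

```xml
<!-- nutch-site.xml (assumed override file): raise the retry limit for
     slow sites so manual fetch steps behave like the crawl tool -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
```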

Sorry to have bothered you all.

When I index a site with the "crawl" option, I get all the pages and their
links.  However, when I index the same site step by step, following the whole
web crawling example with a filter limited to a single domain, I get very
limited link output.  I set "db.ignore.internal.links" to false in both
cases.  Is there any other configuration setting I missed?  Looking at the
CrawlTool, the manual steps seem exactly the same.  For our application I need
the complete link data more than the text index.  Thanks.



-Kenji





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
