I am a VERY new Nutch user. I thought I had made some progress when I was able to crawl the apache site. The problem is I have *not* been able to crawl anything else.
The crawl command fires up and produces some console output, but nothing is ever actually fetched. I know this because the "fetching: http..." lines that appear when crawling the apache site never show up - and of course I don't get any hits when attempting to search my resulting database. What could be wrong?

Here are the urls that worked for me:

http://lucene.apache.org/
http://lucene.apache.org/Nutch/

Here are the ones that did not:

http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov

Am I setting up these links correctly?

There is one thing I did a bit differently: I put my input url directories and output crawl directories outside of the Nutch home directory, and used a symbolic link to switch which of the outputs is the active 'searcher' directory. That is the purpose of the first property in my nutch-site.xml. Could that be my problem? Below is a rough sketch of how things are laid out, followed by the full text of my config file. Thanks for your help!
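(In the sketch below, the directory names other than /home/clipper/crawl/searchdir - which is the real value from my config - and the crawl options are just stand-ins for what I actually typed.)

# Seed urls live in a directory outside the Nutch home:
mkdir -p /home/clipper/crawl/urls
cat > /home/clipper/crawl/urls/seeds.txt <<'EOF'
http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov
EOF

# Run the crawl from the Nutch home directory, with the output directory
# also outside the Nutch home (the depth/topN values here are placeholders):
bin/nutch crawl /home/clipper/crawl/urls -dir /home/clipper/crawl/out-news -depth 3 -topN 50

# searcher.dir in nutch-site.xml is really a symbolic link that I re-point
# at whichever crawl output I want to search:
ln -sfn /home/clipper/crawl/out-news /home/clipper/crawl/searchdir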
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>searcher.dir</name>
  <value>/home/clipper/crawl/searchdir</value>
  <description>Path to root of crawl - searcher looks here to find its index
  (oversimplified description: see nutch-default.xml).
  </description>
</property>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>newscrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>clipper,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should put the
  value of http.agent.name as the first agent name, and keep the default *
  at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>news search engine</value>
  <description>Further description of our bot - this text is used in the
  User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://decisionsmith.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>clipper twenty nine at gmail dot com</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this address
  (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from the
  same host are ignored. This is an effective way to limit the size of the
  link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

</configuration>