I am a VERY new Nutch user. I thought I had made some progress when I was
able to crawl the Apache site. The problem is I have *not* been able to
crawl anything else.

The crawl command fires up and produces some console output, but nothing is
ever actually fetched. I know this because the "fetching: http..." lines
that appear when crawling the Apache site never show up - and of course I
don't get any hits when attempting to search my resulting database.
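
For reference, the invocation looks roughly like this (the paths, depth,
and topN values here are illustrative):

bin/nutch crawl /home/clipper/crawl/urls -dir /home/clipper/crawl/out1 -depth 3 -topN 50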

What could be wrong?

Here are the URLs that worked for me:

http://lucene.apache.org/
http://lucene.apache.org/Nutch/

Here are the ones that did not:

http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov

Am I setting up these links correctly?
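
In case it matters, the seed list is a directory containing a plain text
file with one URL per line, set up roughly like this (the file name is
arbitrary):

mkdir -p /home/clipper/crawl/urls
cat > /home/clipper/crawl/urls/seeds.txt <<EOF
http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov
EOF

One thing I am now wondering about: conf/crawl-urlfilter.txt. The tutorial
has you scope its accept pattern to a single domain, so if mine is still
scoped to apache.org, that alone might explain why only the Apache site
gets fetched.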


There is one thing I did a bit differently: I put my input URL directories
and output crawl directories outside of the Nutch home directory, and used
a symbolic link to switch which of the outputs would be the active
'searcher' directory. This is the purpose of the first property below in my
nutch-site.xml. Could that be my problem?
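
Concretely, after a crawl finishes into a new output directory I repoint
the link with something like this (directory names are illustrative):

# make e.g. out-2 the active searcher directory
ln -sfn /home/clipper/crawl/out-2 /home/clipper/crawl/searchdir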

What follows is the text of my config file.

Thanks for your help!


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
<name>searcher.dir</name>
<value>/home/clipper/crawl/searchdir</value>
<description>
  Path to root of crawl - searcher looks here to find its index
  (oversimplified description: see nutch-default.xml)
</description>
</property>



<!-- file properties -->

<property>
 <name>file.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.
 If this value is nonnegative (>=0), content longer than it will be
 truncated; otherwise, no truncation at all.
 </description>
</property>

<!-- HTTP properties -->

<property>
 <name>http.agent.name</name>
 <value>newscrawler</value>
 <description>HTTP 'User-Agent' request header. MUST NOT be empty -
 please set this to a single word uniquely related to your organization.
 </description>
</property>

<property>
 <name>http.robots.agents</name>
 <value>clipper,*</value>
 <description>The agent strings we'll look for in robots.txt files,
 comma-separated, in decreasing order of precedence. You should
 put the value of http.agent.name as the first agent name, and keep the
 default * at the end of the list. E.g.: BlurflDev,Blurfl,*
 </description>
</property>
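
<!-- Note to self while writing this mail: http.agent.name above is
     'newscrawler', but this list starts with 'clipper'. If Nutch checks
     that the agent name is listed first (as the description says it
     should be), I would guess the value wants to be:
       <value>newscrawler,*</value>
-->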

<property>
 <name>http.agent.description</name>
 <value>news search engine</value>
 <description>Further description of our bot - this text is used in
 the User-Agent header. It appears in parentheses after the agent name.
 </description>
</property>

<property>
 <name>http.agent.url</name>
 <value>http://decisionsmith.com</value>
 <description>A URL to advertise in the User-Agent header.  This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
 </description>
</property>

<property>
 <name>http.agent.email</name>
 <value>clipper twenty nine at gmail dot com</value>
 <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
 </description>
</property>
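
<!-- If I am reading these agent properties right, they get combined into
     a single User-Agent string, so mine should come out something like
     (Nutch version made up):
       newscrawler/Nutch-0.8 (news search engine; http://decisionsmith.com;
       clipper twenty nine at gmail dot com)
-->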

<property>
 <name>http.verbose</name>
 <value>false</value>
 <description>If true, HTTP will log more verbosely.</description>
</property>

<!-- web db properties -->

<property>
 <name>db.default.fetch.interval</name>
 <value>1</value>
 <description>The default number of days between re-fetches of a page.
 </description>
</property>

<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
 <description>If true, when adding new links to a page, links from
 the same host are ignored.  This is an effective way to limit the
 size of the link database, keeping only the highest quality
 links.
 </description>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>If true, outlinks leading from a page to external hosts
 will be ignored. This is an effective way to limit the crawl to include
 only initially injected hosts, without creating complex URLFilters.
 </description>
</property>

</configuration>
