Thanks folks! Very helpful!
On 3/10/07, Michael Wechner <[EMAIL PROTECTED]> wrote:
d e wrote:
> I am a VERY new Nutch user. I thought I had made some progress when I
> was able to crawl the apache site. The problem is I have *not* been
> able to crawl anything else.
>
> The crawl command fires up and produces some console output, but
> nothing is ever actually fetched. I know this because the lines
> "fetching: http...." that occur when crawling the apache site never
> appear - and of course I don't get any hits when attempting to search
> my resulting database.
>
> What could be wrong ?
Have you added your domains to the URL filters?

HTH

Michael
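(For anyone following along: with the one-step "bin/nutch crawl" tool the
filters live in conf/crawl-urlfilter.txt, and the file as shipped only
admits MY.DOMAIN.NAME, so every other host is dropped before fetching. A
minimal sketch of entries for the sites quoted below - the exact patterns
are my assumption, modelled on the shipped default:

# accept pages from the seed domains
+^http://([a-z0-9]*\.)*birminghamfreepress\.com/
+^http://([a-z0-9]*\.)*bhamnews\.com/
+^http://([a-z0-9]*\.)*irs\.gov/
# skip everything else
-.

That would also explain why the apache site worked, if an apache.org
pattern had already been added there.)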
>
> Here are the urls that worked for me:
>
> http://lucene.apache.org/
> http://lucene.apache.org/Nutch/
>
> Here are the ones that did not:
>
> http://www.birminghamfreepress.com/
> http://www.bhamnews.com/
>
> http://www.irs.gov
>
> Am I setting up these links correctly?
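The URLs themselves look reasonable; what Nutch cares about is that they
sit one per line in a plain-text file inside the directory passed to the
crawl command, and that the URL filters let them through. A sketch,
assuming a seed directory at /home/clipper/urls (the path and file name
are made up for illustration):

$ cat /home/clipper/urls/seeds.txt
http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov/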
>
>
> There is one thing I did a bit differently. I put my input url
> directories and output crawl directories outside of the nutch home
> directory, and used a symbolic link to switch which of the outputs
> would be the active 'searcher' directory. This is the purpose of the
> first property below in my nutch-site.xml. Could that be my problem?
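The symlink approach itself should be harmless - searcher.dir just has to
resolve to a crawl directory containing the index. A sketch of the
switch, with crawl-20070310 standing in for whatever the real output
directory is called (the name is invented here):

$ ln -sfn /home/clipper/crawl/crawl-20070310 /home/clipper/crawl/searchdir

Note that the search webapp may need a restart after the switch, since it
keeps the previously opened index readers around.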
>
> What follows is the text of my config file.
>
> Thanks for your help!
>
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>
> <property>
> <name>searcher.dir</name>
> <value>/home/clipper/crawl/searchdir</value>
> <description>
> Path to root of crawl - searcher looks here to find its index
> (oversimplified description: see nutch-default.xml)
> </description>
> </property>
>
>
>
> <!-- file properties -->
>
> <property>
> <name>file.content.limit</name>
> <value>65536</value>
> <description>The length limit for downloaded content, in bytes.
> If this value is nonnegative (>=0), content longer than it will be
> truncated;
> otherwise, no truncation at all.
> </description>
> </property>
>
> <!-- HTTP properties -->
>
> <property>
> <name>http.agent.name</name>
> <value>newscrawler</value>
> <description>HTTP 'User-Agent' request header. MUST NOT be empty -
> please set this to a single word uniquely related to your organization.
> </description>
> </property>
>
> <property>
> <name>http.robots.agents</name>
> <value>clipper,*</value>
> <description>The agent strings we'll look for in robots.txt files,
> comma-separated, in decreasing order of precedence. You should
> put the value of http.agent.name as the first agent name, and keep the
> default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> </description>
> </property>
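As an aside: per the description just above, the first token here should
normally match http.agent.name, which is set to "newscrawler" in this
file. Assuming "newscrawler" is the intended agent name, that would be:

<value>newscrawler,*</value>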
>
> <property>
> <name>http.agent.description</name>
> <value>news search engine</value>
> <description>Further description of our bot - this text is used in
> the User-Agent header. It appears in parentheses after the agent name.
> </description>
> </property>
>
> <property>
> <name>http.agent.url</name>
> <value>http://decisionsmith.com</value>
> <description>A URL to advertise in the User-Agent header. This will
> appear in parentheses after the agent name. Custom dictates that this
> should be a URL of a page explaining the purpose and behavior of this
> crawler.
> </description>
> </property>
>
> <property>
> <name>http.agent.email</name>
> <value>clipper twenty nine at gmail dot com</value>
> <description>An email address to advertise in the HTTP 'From' request
> header and User-Agent header. A good practice is to mangle this
> address (e.g. 'info at example dot com') to avoid being spammed.
> </description>
> </property>
>
> <property>
> <name>http.verbose</name>
> <value>false</value>
> <description>If true, HTTP will log more verbosely.</description>
> </property>
>
> <!-- web db properties -->
>
> <property>
> <name>db.default.fetch.interval</name>
> <value>1</value>
> <description>The default number of days between re-fetches of a page.
> </description>
> </property>
>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> <description>If true, when adding new links to a page, links from
> the same host are ignored. This is an effective way to limit the
> size of the link database, keeping only the highest quality
> links.
> </description>
> </property>
>
> <property>
> <name>db.ignore.external.links</name>
> <value>false</value>
> <description>If true, outlinks leading from a page to external hosts
> will be ignored. This is an effective way to limit the crawl to include
> only initially injected hosts, without creating complex URLFilters.
> </description>
> </property>
>
> </configuration>
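For completeness, a typical invocation of the one-step crawl tool looks
something like the line below, with placeholder directory names and
limits (not values taken from this mail):

$ bin/nutch crawl /home/clipper/urls -dir /home/clipper/crawl/crawl-20070310 -depth 3 -topN 1000

The "fetching: http..." lines should then show up in the console output,
provided the filters admit the seed URLs.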
>
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61