d e wrote:
> I am a VERY new Nutch user. I thought I had made some progress when I was
> able to crawl the apache site. The problem is I have *not* been able to
> crawl anything else.
>
> The crawl command fires up and produces some console output, but nothing
> is ever actually fetched. I know this because the lines "fetching: http...."
> that occur when crawling the apache site never appear - and of course I
> don't get any hits when attempting to search my resulting database.
>
> What could be wrong?
have you added your domains to the url filters?

HTH

Michael
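For what that suggestion means in practice: the one-step crawl command applies
conf/crawl-urlfilter.txt, and the stock file (as set up in the tutorial) only
accepts URLs matching the domain substituted for MY.DOMAIN.NAME, so seeds from
any other host are dropped before fetching ever starts. A minimal sketch,
assuming the stock 0.8-era filter file and the domains from the message quoted
below (the exact regexes are illustrative, not tested):

  # conf/crawl-urlfilter.txt: add one accept rule per site to crawl,
  # keeping the catch-all rule last
  +^http://([a-z0-9]*\.)*birminghamfreepress.com/
  +^http://([a-z0-9]*\.)*bhamnews.com/
  +^http://([a-z0-9]*\.)*irs.gov/

  # skip everything else
  -.

The same idea applies to conf/regex-urlfilter.txt if the whole-web tools are
used instead of the crawl command.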
> Here are the urls that worked for me:
>
> http://lucene.apache.org/
> http://lucene.apache.org/Nutch/
>
> Here are the ones that did not:
>
> http://www.birminghamfreepress.com/
> http://www.bhamnews.com/
> http://www.irs.gov
>
> Am I setting up these links correctly?
>
> There is one thing I did a bit differently. I put my input url directories
> and output crawl directories outside of the nutch home directory, and used
> a symbolic link to switch which of the outputs would be the active
> 'searcher' directory. This is the purpose of the first property below in
> my nutch-site.xml. Could that be my problem?
>
> What follows is the text of my config file.
>
> Thanks for your help!
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>
> <property>
>   <name>searcher.dir</name>
>   <value>/home/clipper/crawl/searchdir</value>
>   <description>
>     Path to root of crawl - searcher looks here to find its index
>     (oversimplified description: see nutch-defaults.xml)
>   </description>
> </property>
>
> <!-- file properties -->
>
> <property>
>   <name>file.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
> <!-- HTTP properties -->
>
> <property>
>   <name>http.agent.name</name>
>   <value>newscrawler</value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>   </description>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>clipper,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value>news search engine</value>
>   <description>Further description of our bot - this text is used in
>   the User-Agent header. It appears in parenthesis after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value>http://decisionsmith.com</value>
>   <description>A URL to advertise in the User-Agent header. This will
>   appear in parenthesis after the agent name. Custom dictates that this
>   should be a URL of a page explaining the purpose and behavior of this
>   crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>clipper twenty nine at gmail dot com</value>
>   <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header. A good practice is to mangle this
>   address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
>
> <property>
>   <name>http.verbose</name>
>   <value>false</value>
>   <description>If true, HTTP will log more verbosely.</description>
> </property>
>
> <!-- web db properties -->
>
> <property>
>   <name>db.default.fetch.interval</name>
>   <value>1</value>
>   <description>The default number of days between re-fetches of a page.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> </configuration>

--
Michael Wechner
Wyona      -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                  http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61
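On the symbolic-link setup described in the quoted message: searcher.dir is
read by the search side (NutchBean / the web application), not by the crawler,
so that arrangement by itself should not prevent pages from being fetched. A
rough sketch of the setup as described, assuming hypothetical crawl output
directory names (only the searchdir path comes from the quoted config):

  # hypothetical crawl outputs, kept outside the Nutch home directory:
  #   /home/clipper/crawl/crawl-old
  #   /home/clipper/crawl/crawl-new
  # searcher.dir stays fixed at /home/clipper/crawl/searchdir, a symlink
  # that is repointed to whichever crawl should serve queries:
  ln -sfn /home/clipper/crawl/crawl-new /home/clipper/crawl/searchdir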
