Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=33&rev2=34

Comment:
Removed reference to crawl-urlfilter.txt

   * Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:
  {{{
  http://lucene.apache.org/nutch/
  }}}
+  * Edit the file conf/regex-urlfilter.txt and replace
-  * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
- {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}
  This will include any url in the domain apache.org.
-  * Until someone could explain this...When I use the file crawl-urlfilter.txt the filter doesn't work, instead of it use the file conf/regex-urlfilter.txt and change the last line from "+." to "-."
+ {{{
+ # accept anything else
+ +.
+ }}}
+
+ with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+
+ {{{
+ +^http://([a-z0-9]*\.)*apache.org/
+ }}}
+
+ This will include any url in the domain apache.org.

  === Crawl Command: Running the Crawl ===
  Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

@@ -162, +172 @@

  Now we're ready to search!

- == Command Line Searching ==
+ == Command Line Searching (version < 1.3) ==
  Simplest way to verify the integrity of your crawl is to launch NutchBean from command line:
  {{{
  bin/nutch org.apache.nutch.searcher.NutchBean apache
  }}}
  where ''apache'' is the search term (note that NutchBean will only search pages in the {{{crawl}}} directory, so if you named the crawl directory something else, NutchBean will not find any results).
  After you have verified that the above command returns results you can proceed to setting up the web interface.

- == Installing in Tomcat ==
+ == Installing in Tomcat (version < 1.3) ==
  To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command {{{ant war}}}.) Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands: