[Nutch Wiki] Update of "NutchTutorial" by JulienNioche

Apache Wiki Tue, 12 Jul 2011 02:39:53 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NutchTutorial" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=33&rev2=34

Comment:
Removed reference to crawl-urlfilter.txt

   * Create a directory with a flat file of root urls. For example, to crawl 
the nutch site you might start with a file named urls/nutch containing the url 
of just the Nutch home page. All other Nutch pages should be reachable from 
this page. The urls/nutch file would thus contain:
   {{{ http://lucene.apache.org/nutch/ }}}
  
+  * Edit the file conf/regex-urlfilter.txt and replace 
-  * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the 
name of the domain you wish to crawl. For example, if you wished to limit the 
crawl to the apache.org domain, the line should read:
-  {{{ +^http://([a-z0-9]*\.)*apache.org/ }}} This will include any url in the 
domain apache.org.
  
- * Until someone could explain this...When I use the file crawl-urlfilter.txt 
the filter doesn't work, instead of it use the file conf/regex-urlfilter.txt 
and change the last line from "+." to "-."
+ {{{
+ # accept anything else
+ +.  
+ }}}
+ 
+ with a regular expression matching the domain you wish to crawl. For example, 
if you wished to limit the crawl to the apache.org domain, the line should read:
+ 
+ {{{
+  +^http://([a-z0-9]*\.)*apache.org/ 
+ }}} 
+ 
+ This will include any url in the domain apache.org.
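
To see which URLs the apache.org rule above would let through, the pattern can be tried outside Nutch. The following is a minimal sketch, assuming the filter applies the pattern as an unanchored search; Nutch itself uses Java regular expressions, which Python's `re.search` only approximates here. The helper name `accepted` and the sample URLs are hypothetical and chosen for illustration.

```python
import re

# Pattern from the tutorial line. The leading "+" (accept) belongs to the
# regex-urlfilter.txt syntax and is not part of the regular expression.
RULE = r"^http://([a-z0-9]*\.)*apache.org/"


def accepted(url: str) -> bool:
    """Return True if the URL matches the accept pattern (hypothetical helper)."""
    return re.search(RULE, url) is not None


if __name__ == "__main__":
    samples = [
        "http://lucene.apache.org/nutch/",  # subdomain of apache.org
        "http://apache.org/",               # bare domain
        "https://apache.org/",              # https is not covered by the pattern
        "http://example.com/",              # different domain
        "http://apachexorg/",               # unescaped "." matches any character
    ]
    for url in samples:
        print(url, accepted(url))
```

Note that the dot in `apache.org` is unescaped in the tutorial's pattern, so it matches any single character; writing `apache\.org` would be stricter.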
  
  === Crawl Command: Running the Crawl ===
  Once things are configured, running the crawl is easy. Just use the crawl 
command. Its options include:
@@ -162, +172 @@

  
  Now we're ready to search!
  
- == Command Line Searching ==
+ == Command Line Searching (version < 1.3)  ==
  The simplest way to verify the integrity of your crawl is to launch NutchBean 
from the command line:
  
  {{{ bin/nutch org.apache.nutch.searcher.NutchBean apache }}}
  
  where ''apache'' is the search term (note that NutchBean will only search 
pages in the {{{crawl}}} directory, so if you named the crawl directory 
something else, NutchBean will not find any results). After you have verified 
that the above command returns results you can proceed to setting up the web 
interface.
  
- == Installing in Tomcat ==
+ == Installing in Tomcat (version < 1.3) ==
  To search you need to put the nutch war file into your servlet container. (If 
instead of downloading a Nutch release you checked the sources out of SVN, then 
you'll first need to build the war file, with the command {{{ant war}}}.)
  
  Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file 
may be installed with the commands:
