Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DanielLaLiberte:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

The comment on the change is:
Correct my misunderstanding of previous (ambiguous) instructions

------------------------------------------------------------------------------
  NOTE: The steps below are assumed to be carried out from inside the 
/home/tyrell/nutch-0.7 directory created when extracting the archive. Change 
the path according to your local instance.
  
  
- === 3.2.1 Create a flat file of root urls. ===
+ === 3.2.1 Create a directory of root urls. ===
  
- For example, to crawl the http://www.virtusa.com  site from scratch, you 
might start with a file named 'urls' containing just the URL for the Virtusa 
home page. All other pages should be reachable by links from this page. The 
‘urls’ file would therefore contain: http://www.virtusa.com
+ The nutch 'crawl' command expects to be given a directory containing files 
that list all the root level urls to be crawled.  So create a 'urls' directory 
in the nutch directory.  Then, to crawl the http://www.virtusa.com  site from 
scratch, you might start with a file named 'virtusa' in the 'urls' directory, 
and in the file, add just the URL for the Virtusa home page, 
http://www.virtusa.com. All other pages should be reachable by links from this 
page.  
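As a minimal sketch of the step above, the seed directory and file could be created as follows. It assumes the commands are run from inside the Nutch directory and that the seed file holds one root URL per line; the names 'urls' and 'virtusa' are the ones used in the text.

```shell
# Assumption: the current directory is the Nutch directory (e.g. nutch-0.7).
# Create the directory that will be passed to the crawl command.
mkdir -p urls

# Write the single root URL into a seed file named 'virtusa'.
printf '%s\n' 'http://www.virtusa.com' > urls/virtusa

# Show the seed file to confirm its contents.
cat urls/virtusa
```

More seed files can be placed in the same directory if other root URLs are needed.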
  
- The 'urls' file could be put anywhere.  It will be used below in the nutch 
crawl command, which assumes the file is in the nutch directory.
-       
+ The 'depth' option to the crawl command limits the link depth, counted from 
the root urls, that the crawl will reach. Also, the conf/crawl-urlfilter.txt 
file, described next, limits which sites are crawled. 
+ 
  === 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
  
  If you are using TRUNK then there is no file called conf/crawl-urlfilter.txt 
but conf/crawl-urlfilter.txt.template. Just do 

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs