Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change
notification.
The following page has been changed by DanielLaLiberte:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
The comment on the change is:
Correct my misunderstanding of previous (ambiguous) instructions
------------------------------------------------------------------------------
NOTE: The steps below are assumed to be carried out from inside the
/home/tyrell/nutch-0.7 directory created when extracting the archive. Change
the path according to your local instance.
- === 3.2.1 Create a flat file of root urls. ===
+ === 3.2.1 Create a directory of root urls. ===
- For example, to crawl the http://www.virtusa.com site from scratch, you
might start with a file named 'urls' containing just the URL for the Virtusa
home page. All other pages should be reachable by links from this page. The
'urls' file would therefore contain: http://www.virtusa.com
+ The nutch 'crawl' command expects a directory containing files that list
the root-level URLs to be crawled. So create a 'urls' directory in the nutch
directory. Then, to crawl the http://www.virtusa.com site from scratch, you
might start with a file named 'virtusa' in the 'urls' directory, containing
just the URL for the Virtusa home page, http://www.virtusa.com. All other
pages should be reachable by links from this page.
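As a rough sketch of the step above, assuming a POSIX shell started in the
nutch directory (for example /home/tyrell/nutch-0.7), the seed directory and
file could be created like this. The names 'urls' and 'virtusa' come from the
text; the one-URL-per-line layout is an assumption.

```shell
# Run from the nutch directory; adjust the path for your local instance.
mkdir -p urls

# Seed file holding the single root URL (assumed layout: one URL per line).
echo 'http://www.virtusa.com' > urls/virtusa

# Show what the crawl will start from.
cat urls/virtusa
```

More root URLs for other sites could go in the same file or in additional
files inside 'urls'.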
- The 'urls' file could be put anywhere. It will be used below in the nutch
crawl command, which assumes the file is in the nutch directory.
-
+ The 'depth' option to the crawl command limits how far the crawl goes from
the root URLs. Also, the conf/crawl-urlfilter.txt file, described next,
limits which sites are crawled.
+
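For illustration only, a crawl that passes the 'urls' directory and a depth
limit might look like the sketch below. The -dir and -depth options are written
as I recall them from the Nutch 0.7-era tutorial, so check them against your
version. The output directory name crawl.test and the depth value 3 are
assumptions, not values taken from this page.

```shell
# Illustrative values (assumptions): link depth and output directory.
DEPTH=3
CRAWL_DIR=crawl.test

# Only attempt the crawl when the nutch launcher is present in this directory.
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -dir "$CRAWL_DIR" -depth "$DEPTH"
else
  echo "bin/nutch not found; run this from the nutch directory" >&2
fi
```

A small depth keeps a first test crawl short; it can be raised once the URL
filter is known to restrict the crawl to the intended site.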
=== 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
If you are using TRUNK, there is no file called conf/crawl-urlfilter.txt,
only conf/crawl-urlfilter.txt.template. Just do
_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs