Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change
notification.
The following page has been changed by DanielLaLiberte:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
------------------------------------------------------------------------------
=== 3.2.1 Create a flat file of root urls. ===
- For example, to crawl the http://www.virtusa.com site from scratch, you
might start with a file named 'urls' containing just the Virtusa home page. All
other pages should be reachable from this page. The 'urls' file would
therefore contain: http://www.virtusa.com
+ For example, to crawl the http://www.virtusa.com site from scratch, you
might start with a file named 'urls' containing just the URL for the Virtusa
home page. All other pages should be reachable by links from this page. The
'urls' file would therefore contain: http://www.virtusa.com
+
+ The 'urls' file could be put anywhere. It will be used below in the nutch
crawl command, which assumes the file is in the nutch directory.
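As a minimal sketch of the step above, the 'urls' file can be created from the shell. This assumes the current directory is the nutch directory, as the crawl command below expects; the file name and URL are the ones used in the example.

```shell
# Write the single root URL into a flat file named 'urls'
# (assumes we are already in the nutch directory).
printf '%s\n' 'http://www.virtusa.com' > urls

# Show the file to confirm its contents.
cat urls
```

Further root URLs, if any, would go on additional lines of the same file.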
=== 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
If you are using TRUNK then there is no file called conf/crawl-urlfilter.txt
but conf/crawl-urlfilter.txt.template. Just do
{{{
- cat conf/crawl-urlfilter.txt.template|sed
's/MY.DOMAIN.NAME/criaturitas.org/'g> conf/crawl-urlfilter.txt
+ cat conf/crawl-urlfilter.txt.template|sed
's/MY.DOMAIN.NAME/criaturitas.org/'g> conf/crawl-urlfilter.txt }}}
- }}}
If you already have this file then replace the existing domain name with the
name of the domain you wish to crawl. For example, if you wished to limit the
crawl to the virtusa.com domain, the line should read:
+ {{{
- {{{ +^http://([a-z0-9]*\.)*virtusa.com/ }}}
+ +^http://([a-z0-9]*\.)*virtusa.com/ }}}
This will include any url in the domain virtusa.com in the crawl.
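To see roughly which URLs such a line lets through, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the filter with Java regular expressions, not grep, and the leading + (which marks an include rule) is not part of the pattern itself. The check_url helper and the sample URLs are hypothetical, chosen for illustration.

```shell
# The pattern from the filter line, without the leading '+' include marker.
pattern='^http://([a-z0-9]*\.)*virtusa.com/'

# Hypothetical helper: report whether a URL matches the pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "include $1"
  else
    echo "skip    $1"
  fi
}

check_url 'http://www.virtusa.com/'
check_url 'http://virtusa.com/about'
check_url 'http://www.example.com/'
```

The first two URLs match and the third does not. Note that the pattern requires a slash after the host name, and that the unescaped dot in virtusa.com matches any single character.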
@@ -127, +129 @@
* -dir dir names the directory to put the crawl in.
* -depth depth indicates the link depth from the root page that should be
crawled.
* -delay delay determines the number of seconds between accesses to each
host.
- * -threads threads determines the number of threads that will fetch in
parallel.
+ * -threads threads determines the number of threads that will fetch in
parallel. }}}
- }}}
For example, a typical command might be:
+ {{{
- {{{ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
+ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
=== 3.2.4 Output of the crawl ===
@@ -175, +177 @@
<property>
<name>searcher.dir </name>
<value>/home/tyrell/nutch-0.7/crawl.virtusa </value>
- </property>
+ </property> }}}
- }}}
4. Re-start Tomcat
@@ -196, +197 @@
Now that all is working, we need to think about the long-term maintenance of
the index. This is a required activity because the web gets updated frequently.
New content will appear on sites while existing content might get modified or
deleted altogether.
- Nutch provides the administrator with a set of commands to update a given
index, however performing them manually will not only be tiresome but also non
productive. Since this task need to be carried out periodically it should
ideally be scheduled.
+ Nutch provides the administrator with a set of commands to update a given
index, however performing them manually will not only be tiresome but also
unproductive. Since this task need to be carried out periodically it should
ideally be scheduled.
=== 3.5.1 Creating a Maintenance Shell Script ===
@@ -233, +234 @@
}}}
- === 3.5.2 Scheduling Index Updations ===
+ === 3.5.2 Scheduling Index Updates ===
The above shell script can be scheduled to be run periodically using a
'cron' job.
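As a hedged sketch, a crontab entry for a nightly run could be built like this. The script path, the log path and the 02:00 schedule are all assumptions made for illustration; substitute the actual location of the maintenance script.

```shell
# Hypothetical locations; adjust to where the maintenance script really lives.
script_path="$HOME/nutch/update-index.sh"
log_path="$HOME/nutch/update-index.log"

# Crontab fields: minute hour day-of-month month day-of-week command.
# This line runs the script every day at 02:00 and appends its output to the log.
cron_line="0 2 * * * $script_path >> $log_path 2>&1"

# Print the entry for review instead of installing it.
echo "$cron_line"

# To install it alongside any existing entries, one option is:
#   (crontab -l 2>/dev/null; echo "$cron_line") | crontab -
```

Printing the line first makes it easy to check the schedule before it is added to the crontab.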