Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=77&rev2=78

 * Set `JAVA_HOME` if you see an error that `JAVA_HOME` is not set. On Mac, you can run the following command or add it to `~/.bashrc`:

{{{
- export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
+ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home
+ # note that the actual path may be different on your system
}}}

On Debian or Ubuntu, you can run the following command or add it to `~/.bashrc`:

{{{
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
}}}

+
+ You may also have to update your /etc/hosts file. If so, you can add the following:
+
+ {{{
+ ##
+ # Host Database
+ #
+ # localhost is used to configure the loopback interface
+ # when the system is booting. Do not change this entry.
+ ##
+ 127.0.0.1       localhost.localdomain localhost LMC-032857
+ ::1             ip6-localhost ip6-loopback
+ fe80::1%lo0     ip6-localhost ip6-loopback
+ }}}
+
+ Note that `LMC-032857` above should be replaced with your machine name.
+

== 3. Crawl your first website ==

Nutch requires two configuration changes before a website can be crawled:

@@ -120, +138 @@

NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all domains linked from your seed URLs being crawled as well.

- === 3.3 Using the Crawl Command ===
- {{{#!wiki caution
- The crawl command is deprecated. Please see section [[#a3.5._Using_the_crawl_script|3.5]] on how to use the crawl script that is intended to replace the crawl command.
- }}}
- Now we are ready to initiate a crawl; use the following parameters:
-
-  * '''-dir''' ''dir'' names the directory to put the crawl in.
-  * '''-threads''' ''threads'' determines the number of threads that will fetch in parallel.
-  * '''-depth''' ''depth'' indicates the link depth from the root page that should be crawled.
-  * '''-topN''' ''N'' determines the maximum number of pages that will be retrieved at each level, up to the depth.
-  * Run the following command:
-
- {{{
- bin/nutch crawl urls -dir crawl -depth 3 -topN 5
- }}}
-  * Now you should be able to see the following directories created:
-
- {{{
- crawl/crawldb
- crawl/linkdb
- crawl/segments
- }}}
- '''NOTE''': If you have a Solr core already set up and wish to index to it, you are required to add the `-solr <solrUrl>` parameter to your `crawl` command, e.g.
-
- {{{
- bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
- }}}
- If not, please skip to [[#A4._Setup_Solr_for_search|here]] for how to set up your Solr instance and index your crawl data.
-
- Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (`-topN`), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (`-topN`) for a full crawl can range from tens of thousands to millions, depending on your resources.
-
- === 3.4 Using Individual Commands for Whole-Web Crawling ===
+ === Using Individual Commands for Whole-Web Crawling ===

'''NOTE''': If you previously modified the file `conf/regex-urlfilter.txt` as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.

Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we ran the `crawl` command (above).
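
The URL list mentioned above is just a plain-text file of seed URLs, one per line. As a minimal sketch (the directory name `urls`, the file name `seed.txt`, and the URL itself are illustrative; substitute your own seeds):

{{{
# Create a seed list for a limited whole-Web crawl.
# The directory, file name, and URL below are illustrative.
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt
}}}

A seed directory like this is what the individual-commands workflow starts from, typically by injecting it into the crawldb with `bin/nutch inject`.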