[Nutch Wiki] Update of "NutchTutorial" by kiranchitturi

Apache Wiki Thu, 21 Mar 2013 21:50:23 -0700

The "NutchTutorial" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=61&rev2=62

  This will include any URL in the domain `nutch.apache.org`.
  
  === 3.1 Using the Crawl Command ===
+ 
+ {{{#!wiki caution
+ The crawl command is deprecated. See section [[#A3.3._Using_the_crawl_script|3.3]] for how to use the crawl script, which is intended to replace it.
+ }}}
+ 
  Now we are ready to initiate a crawl. Use the following parameters:
  
   * '''-dir''' ''dir'' names the directory to put the crawl in.
@@ -220, +225 @@

  }}}
  We are now ready to search with Apache Solr.
  
+ === 3.3. Using the crawl script ===
+ 
+ If you followed section 3.2 above, which walks through the crawl step by step, you might be wondering how a bash script could automate the whole process.
+ 
+ Nutch developers have written one for you :), and it is available at 
[[bin/crawl]]. 
+ 
+ {{{
+      Usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
+      Example: bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
+ }}}
+ 
+ 
+ The crawl script sets many parameters, and you can modify them to suit your needs. It is best to understand these parameters before setting up large crawls.
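
As a minimal sketch of how the usage line above might be wrapped, the following bash functions check the four documented arguments before handing them to `bin/crawl`. The function names `check_crawl_args` and `run_crawl` and the `NUTCH_HOME` variable are assumptions made for this illustration; they are not part of Nutch, and the argument checks are only a plausible reading of the usage line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around bin/crawl; the names are illustrative only.

# Succeed only if the arguments match the documented usage:
#   <seedDir> <crawlID> <solrURL> <numberOfRounds>
check_crawl_args() {
  if [ "$#" -ne 4 ]; then
    echo "Usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  # Reject an empty value or anything containing a non-digit character.
  case "$4" in
    ''|*[!0-9]*)
      echo "numberOfRounds must be a whole number, got: $4" >&2
      return 1
      ;;
  esac
  return 0
}

# Validate the arguments, then call the crawl script.
# NUTCH_HOME is an assumed variable; it defaults to the current directory.
run_crawl() {
  check_crawl_args "$@" || return 1
  "${NUTCH_HOME:-.}/bin/crawl" "$1" "$2" "$3" "$4"
}
```

With these functions sourced into a shell, the example from the usage text would be run as `run_crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2`.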
+ 
+ 
  == 4. Setup Solr for search ==
   * download the binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
   * unzip it to `$HOME/apache-solr-3.X`; we will now refer to this directory as `${APACHE_SOLR_HOME}`
