Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by ThiloPfennig:
http://wiki.apache.org/nutch/GettingNutchRunningWithFedoraCore

------------------------------------------------------------------------------
  This is based on GettingNutchRunningWithRedHatApplicationServer. To make this easier to start, we are using the yum command line as an example.

  /!\ This is not yet a working installation description.

  == Repositories we need ==
@@ -41, +38 @@
   * No Match for argument: jta-javadoc
+
+ == Install Java ==
+
+  * [http://javashoplm.sun.com/ECom/docs/Welcome.jsp?StoreId=22&PartDetailId=jdk-1.5.0_08-oth-JPR&SiteId=JSC&TransactionId=noregDownload Install Linux RPM in self-extracting file]
+
+ == Download and Testing ==
+  * DownloadingNutch: downloaded nutch-0.8.tar.gz
+ {{{
+ tar xzf nutch-0.8.tar.gz
+ cd nutch-0.8
+ export JAVA_HOME=/usr/java/jdk1.5.0_08/
+ bin/nutch
+ }}}
+
+  * Test using http://lucene.apache.org/nutch/tutorial.html
+
+  1. add a URL in a new file "urls"
+  1. add/edit conf/crawl-urlfilter.txt (under # accept hosts in MY.DOMAIN.NAME)
+
+ '''result:'''
+ {{{
+ Exception in thread "main" java.io.IOException: Input directory /home/vinci/Downloads/nutch-0.8/urls in local is invalid.
+     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
+     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
+     at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
+     at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
- Usage: nutch COMMAND
- where COMMAND is one of:
-   crawl             one-step crawler for intranets
-   readdb            read / dump crawl db
-   mergedb           merge crawldb-s, with optional filtering
-   readlinkdb        read / dump link db
-   inject            inject new urls into the database
-   generate          generate new segments to fetch
-   fetch             fetch a segment's pages
-   parse             parse a segment's pages
-   segread           read / dump segment data
-   mergesegs         merge several segments, with optional filtering and slicing
-   updatedb          update crawl db from segments after fetching
-   invertlinks       create a linkdb from parsed segments
-   mergelinkdb       merge linkdb-s, with optional filtering
-   index             run the indexer on parsed segments and linkdb
-   merge             merge several segment indexes
-   dedup             remove duplicates from a set of segment indexes
-   plugin            load a plugin and run one of its classes main()
-   server            run a search server
- or
-   CLASSNAME         run the class named CLASSNAME
- Most commands print help when invoked w/o parameters.
  }}}
+ ---
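A note on the IOException recorded above: in Nutch 0.8 the argument passed to the crawl/inject step is treated as a *directory* of seed-list files, not a single flat file, so creating "urls" as a plain file can trigger exactly this "Input directory ... is invalid" error. A minimal sketch of preparing the seed directory, assuming the layout from the page; the seed URL and file name here are illustrative placeholders, not from the original text:

```shell
# Sketch, not from the wiki page itself: make "urls" a directory
# containing one or more text files, one URL per line.
mkdir -p urls
echo 'http://lucene.apache.org/nutch/' > urls/seed.txt

# With JAVA_HOME exported as shown earlier on the page, the tutorial's
# test crawl would then be run from the nutch-0.8 directory, e.g.:
#   bin/nutch crawl urls -dir crawl -depth 2
```

The crawl command itself is left as a comment because it needs a working Nutch/Java installation; the point of the sketch is only the directory layout that the injector expects.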