[Nutch-cvs] [Nutch Wiki] Update of "GettingNutchRunningWithWindows" by JamesVictor

Apache Wiki Tue, 10 Jul 2007 15:20:46 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by JamesVictor:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Clarified language and directions; updated with nutch 0.9 information and Win2k3

------------------------------------------------------------------------------
  Since Nutch is written in Java, it is possible to get Nutch working in a 
Windows environment, provided that the correct software is installed.
  
- The following documents describe how I got it working on Windows XP Pro 
running Tomcat 5.28.  
+ The following documents describe how I got it working on Windows XP Pro 
running Tomcat 5.28.  Edit: page updated with my experience installing on 
Windows Server 2003.
  
+ == Required Software ==
+ 
- == Java ==
+ === Java ===
  
  You will need to have Java 1.4.2 (or Java 1.5 for Nutch 0.8.x or higher) 
installed.
  
- == Cygwin ==
+ This also works with Java 6, Nutch 0.9, and Tomcat 6. Only the Java 6 JRE is 
necessary, unless you want to build nutch from source yourself.
  
+ === Cygwin ===
+ 
- You'll need cygwin to run the shell commands since there are no separate 
scripts for NT cmd (the NT cmd shell does not nest environments recursively).  
Mks ksh does not work correctly with the scripts.
+ You'll need [http://www.cygwin.com/ cygwin] to run the shell commands since 
there are no separate scripts for NT cmd (the NT cmd shell does not nest 
environments recursively).  Mks ksh does not work correctly with the scripts.
  Make sure you have installed the utility 'uname' in cygwin.
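A quick way to confirm the prerequisites from inside the cygwin shell is to check that the commands are on the PATH. The following is only a minimal sketch: `missing_commands` is a hypothetical helper (not part of Nutch or cygwin), and it assumes `java` is expected on the PATH alongside `uname`.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is NOT found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Check the two tools this page relies on (assumption: java is on the PATH).
missing=$(missing_commands uname java)
if [ -n "$missing" ]; then
    echo "Missing from PATH: $missing"
else
    echo "uname and java found"
fi
```

If `uname` is reported missing, re-run the cygwin installer and add the package that provides it.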
  
- == Tomcat ==
+ === Tomcat ===
  
- You'll need Tomcat 4.* or higher running on your machine.
+ You'll need Tomcat 4.* or higher running on your machine. I know of no reason 
to not go with the latest release ([http://tomcat.apache.org/download-60.cgi 
Tomcat 6] at time of last writing).
  
  == Setup ==
  
- Download the release and extract anywhere on your hard disk e.g. 
c:\nutch-0.7.1
+ === Download ===
  
- Create an empty text file in your nutch directory e.g. "urls" and add the 
urls of the sites you want to crawl as shown in the tutorial.
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract 
anywhere on your hard disk e.g. `c:\nutch-0.9`
  
- Add your urls to the crawl-urlfilter.txt (e.g. 
C:\nutch-0.7.1\conf\crawl-urlfilter.txt). An entry could look like this: 
+^http://([a-z0-9]*\.)*apache.org/
+ Create an empty text file in your nutch directory e.g. `urls` and add the 
URLs of the sites you want to crawl.
  
- Load up cygwin and naviagte to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. C:\Documents and 
Settings\username).
+ Add your URLs to the `crawl-urlfilter.txt` (e.g. 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
+ {{{
+ +^http://([a-z0-9]*\.)*apache.org/
+ }}}
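Before crawling, a filter entry can be tried against a few sample URLs from the cygwin shell. This is a rough sketch only: `matches_filter` is a hypothetical helper, and it assumes `grep -E` treats this pattern closely enough to the way Nutch's Java regex engine does. The leading `+` in the filter file marks the entry as "include" and is not part of the regular expression, so it is left out here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the URL matches the given filter regex.
matches_filter() {
    pattern=$1   # regex from crawl-urlfilter.txt, without the leading '+'
    url=$2
    printf '%s\n' "$url" | grep -Eq "$pattern"
}

pattern='^http://([a-z0-9]*\.)*apache.org/'

# Try one URL inside the apache.org domain and one outside it.
for url in http://lucene.apache.org/nutch/ http://www.example.com/; do
    if matches_filter "$pattern" "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first URL should be accepted and the second rejected. Note that the unescaped `.` in `apache.org` matches any character, so the entry is slightly looser than it looks.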
  
+ Load up cygwin and navigate to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
+ 
- If your workstation needs to go through a windows authentication proxy to get 
to the internet then you can use an application such as the NTLM Authorization 
Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it.  
You'll then need to edit the nutch-site.xml file to point to the port opened by 
the app.
+ If your workstation needs to go through a windows authentication proxy to get 
to the internet then you can use an application such as the NTLM Authorization 
Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it.  
You'll then need to edit the `nutch-site.xml` file to point to the port opened 
by the app.
  
  == Intranet Crawling ==
  
@@ -37, +46 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have. 
You'll need to delete or move the crawl directory before starting the crawl off 
again unless you specify another path on the command above.
+ then a folder called crawl/ is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
+ 
+ You'll need to delete or move the crawl directory before starting the crawl 
off again unless you specify another path on the command above.
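One way to handle the existing crawl directory is to move it aside under a timestamped name before each new crawl. The sketch below assumes it is run from the nutch directory; `archive_crawl_dir` is a hypothetical helper, not a Nutch command.

```shell
#!/bin/sh
# Hypothetical helper: if the directory exists, rename it with a timestamp
# suffix and print the new name; do nothing if it does not exist.
archive_crawl_dir() {
    dir=$1
    if [ -d "$dir" ]; then
        backup="$dir.$(date +%Y%m%d%H%M%S)"
        mv "$dir" "$backup"
        echo "$backup"
    fi
}

# Usage, from the nutch directory:
#   archive_crawl_dir crawl
#   bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log
```

Keeping the old directory rather than deleting it means the previous index is still available if the new crawl fails.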
  
  == Web Interface for Search ==
  
- In your Environment Variables settings, add NUTCH_JAVA_HOME and the location 
of your JVM (e.g. C:\j2sdk1.4.2_09) as a new Environment Variable
+ In your Environment Variables settings, add `NUTCH_JAVA_HOME` and the 
location of your JVM (e.g. `C:\j2sdk1.4.2_09`) as a new Environment Variable.
  
- Open up a web browser and navigate to the Tomcat webapps manager (e.g. 
http://localhost:8080/manager/html) and upload the nutch WAR file to the 
context.
+ Open up a web browser and navigate to the Tomcat webapps manager (e.g. 
`http://localhost:8080/manager/html`) and upload the nutch WAR file to the 
context.
  
- If a root context already exists, undeploy it.
+ If you are going to run nutch in the root context ''and'' a root context 
already exists, undeploy it. Otherwise, skip to the Alternative, below.
  
- You now need to create a context fragment file so that the root url points to 
your nutch webapp.
+ Create a context fragment file so that the root url points to your nutch 
webapp.
  Navigate to your [tomcat_home]/conf/Catalina/localhost/ and put it there.
- Create a new xml file (name it the same as the webapp?) e.g. nutch-0.7.1.xml 
and add something like the following line to it
+ Create a new xml file (name it the same as the webapp?) e.g. nutch-0.9.xml 
and add something like the following line to it.
+ 
  {{{
  <Context path="" debug="5" privileged="true" docBase="nutch-0.7.1"/>
  }}}
  
+ '''Alternative:''' if you want to run other web applications alongside nutch, 
copy or rename the `nutch-0.9.0.war` to whatever you'd like the subdirectory 
URL to be. Deploy the renamed version using the Tomcat Web Application Manager.
+ 
+ For example, to use nutch via `http://localhost/search/`, rename the nutch 
`.war` file to `search.war` and then deploy `search.war`.
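Since Tomcat takes the context path from the WAR file name, the rename can be scripted from the cygwin shell. A small sketch, assuming the WAR file is in the current directory; `war_for_context` is a hypothetical helper introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn a URL path such as "/search/" into the WAR file
# name Tomcat would need for that context ("search.war").
war_for_context() {
    name=${1#/}      # strip one leading slash, if present
    name=${name%/}   # strip one trailing slash, if present
    echo "$name.war"
}

# Usage, in the directory holding the nutch WAR file:
#   cp nutch-0.9.0.war "$(war_for_context /search/)"
```

Copying rather than renaming keeps the original WAR file around in case the deployment needs to be repeated under a different name.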
+ 
+ 
+ === Set Your Searcher Directory ===
+ 
- Next, navigate to your nutch webapp folder then WEB-INF/classes.
+ Next, navigate to your nutch webapp folder then `WEB-INF/classes`.
- Edit the nutch-site.xml file and add the following to it (make sure you don't 
have two <nutch-conf></nutch-conf> tags!):
+ Edit the `nutch-site.xml` file and add the following to it (make sure you 
don't have two sets of <configuration></configuration> tags!):
  
  {{{
- <nutch-conf>
+ <configuration>
  <property>
      <name>searcher.dir</name>
      <value>your_crawl_folder_here</value>
    </property>
- </nutch-conf>
+ </configuration>
  }}}
  
- For example, if your nutch directory resides at C:\nutch-0.7.1 and you 
specified crawled as the directory after the -dir command, then enter 
C:\nutch-0.7.1\crawl\ instead of your_crawl_folder_here.
+ For example, if your nutch directory resides at `C:\nutch-0.9.0` and you 
specified `crawl` as the directory after the `-dir` option, then enter 
`C:\nutch-0.9.0\crawl\` instead of `your_crawl_folder_here`.
  
- Restart Tomcat using the windows services tool, open up a browser and enter 
the url http://localhost:8080.  The nutch search page should appear.  As long 
as you've defined the correct location of your nutch index directory as shown 
above then clicking search should yield results.
+ === Reload ===
  
+ Reload the Application. Use the Tomcat Manager and simply click the "Reload" 
command for nutch, or restart Tomcat using the windows services tool.
+ 
+ Open up a browser and enter the url `http://localhost:8080`.  The nutch 
search page should appear.  As long as you've defined the correct location of 
your nutch index directory (as shown above), clicking search should yield 
results.
+ 
