Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows The comment on the change is: Clarified language and directions; updated with nutch 0.9 information and Win2k3 ------------------------------------------------------------------------------ Since Nutch is written in Java, it is possible to get Nutch working in a Windows environment, provided that the correct software is installed. - The following documents describe how I got it working on Windows XP Pro running Tomcat 5.28. + The following documents describe how I got it working on Windows XP Pro running Tomcat 5.28. Edit: page updated with my experience installing on Windows Server 2003. + == Required Software == + - == Java == + === Java === You will need to have Java 1.4.2 (or Java 1.5 for Nutch 0.8.x or higher) installed. - == Cygwin == + This also works with Java 6, Nutch 0.9, and Tomcat 6. Just the Java 6 JRE is necessary, unless you want to build nutch from sources yourself. + === Cygwin === + - You'll need cygwin to run the shell commands since there are no separate scripts for NT cmd (the NT cmd shell does not nest environments recursively). Mks ksh does not work correctly with the scripts. + You'll need [http://www.cygwin.com/ cygwin] to run the shell commands since there are no separate scripts for NT cmd (the NT cmd shell does not nest environments recursively). Mks ksh does not work correctly with the scripts. Make sure you have installed the utility 'uname' in cygwin. - == Tomcat == + === Tomcat === - You'll need Tomcat 4.* or higher running on your machine. + You'll need Tomcat 4.* or higher running on your machine. I know of no reason to not go with the latest release ([http://tomcat.apache.org/download-60.cgi Tomcat 6] at time of last writing). == Setup == - Download the release and extract anywhere on your hard disk e.g. c:\nutch-0.7.1 + === Download === - Create an empty text file in your nutch directory e.g. "urls" and add the urls of the sites you want to crawl as shown in the tutorial. + [http://lucene.apache.org/nutch/release/ Download] the release and extract anywhere on your hard disk e.g. `c:\nutch-0.9` - Add your urls to the crawl-urlfilter.txt (e.g. C:\nutch-0.7.1\conf\crawl-urlfilter.txt). An entry could look like this: +^http://([a-z0-9]*\.)*apache.org/ + Create an empty text file in your nutch directory e.g. `urls` and add the URLs of the sites you want to crawl. - Load up cygwin and naviagte to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. C:\Documents and Settings\username). + Add your URLs to the `crawl-urlfilter.txt` (e.g. `C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this: + {{{ + +^http://([a-z0-9]*\.)*apache.org/ + }}} + Load up cygwin and naviagte to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. `C:\Documents and Settings\username`). + - If your workstation needs to go through a windows authentication proxy to get to the internet then you can use an application such as the NTLM Authorization Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it. You'll then need to edit the nutch-site.xml file to point to the port opened by the app. + If your workstation needs to go through a windows authentication proxy to get to the internet then you can use an application such as the NTLM Authorization Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it. You'll then need to edit the `nutch-site.xml` file to point to the port opened by the app. == Intranet Crawling == @@ -37, +46 @@ {{{ bin/nutch crawl urls -dir crawl -depth 3 >& crawl.log }}} - then a folder called crawl/ is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. You'll need to delete or move the crawl directory before starting the crawl off again unless you specify another path on the command above. + then a folder called crawl/ is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. + + You'll need to delete or move the crawl directory before starting the crawl off again unless you specify another path on the command above. == Web Interface for Search == - In your Environment Variables settings, add NUTCH_JAVA_HOME and the location of your JVM (e.g. C:\j2sdk1.4.2_09) as a new Environment Variable + In your Environment Variables settings, add `NUTCH_JAVA_HOME` and the location of your JVM (e.g. `C:\j2sdk1.4.2_09`) as a new Environment Variable. - Open up a web browser and navigate to the Tomcat webapps manager (e.g. http://localhost:8080/manager/html) and upload the nutch WAR file to the context. + Open up a web browser and navigate to the Tomcat webapps manager (e.g. `http://localhost:8080/manager/html`) and upload the nutch WAR file to the context. - If a root context already exists, undeploy it. + If you are going to run nutch in the root context ''and'' a root context already exists, undeploy it. Otherwise, skip to the Alternative, below. - You now need to create a context fragment file so that the root url points to your nutch webapp. + Create a context fragment file so that the root url points to your nutch webapp. Navigate to your [tomcat_home]/conf/Catalina/localhost/ and put it there. - Create a new xml file (name it the same as the webapp?) e.g. nutch-0.7.1.xml and add something like the following line to it + Create a new xml file (name it the same as the webapp?) e.g. nutch-0.9.xml and add something like the following line to it. + {{{ <Context path="" debug="5" privileged="true" docBase="nutch-0.7.1"/> }}} + '''Alternative:''' if you want to run other web applications alongside nutch, copy or rename the `nutch-0.9.0.war` to whatever you'd like the subdirectory URL to be. Deploy the renamed version using the Tomcat Web Application Manager. + + For example, to use nutch via `http://localhost/search/`, rename the nutch `.war` file to `search.war` and then deploy `search.war`. + + + === Set Your Searcher Directory === + - Next, navigate to your nutch webapp folder then WEB-INF/classes. + Next, navigate to your nutch webapp folder then `WEB-INF/classes`. - Edit the nutch-site.xml file and add the following to it (make sure you don't have two <nutch-conf></nutch-conf> tags!): + Edit the `nutch-site.xml` file and add the following to it (make sure you don't have two sets of <configuration></configuration> tags!): {{{ - <nutch-conf> + <configuration> <property> <name>searcher.dir</name> <value>your_crawl_folder_here</value> </property> - </nutch-conf> + </configuration> }}} - For example, if your nutch directory resides at C:\nutch-0.7.1 and you specified crawled as the directory after the -dir command, then enter C:\nutch-0.7.1\crawl\ instead of your_crawl_folder_here. + For example, if your nutch directory resides at `C:\nutch-0.9.0` and you specified `crawl` as the directory after the `-dir` command, then enter `C:\nutch-0.9.0\crawl\` instead of `your_crawl_folder_here`. - Restart Tomcat using the windows services tool, open up a browser and enter the url http://localhost:8080. The nutch search page should appear. As long as you've defined the correct location of your nutch index directory as shown above then clicking search should yield results. + === Reload === + Reload the Application. Use the Tomcat Manager and simply click the "Reload" command for nutch, or restart Tomcat using the windows services tool. + + Open up a browser and enter the url `http://localhost:8080`. The nutch search page should appear. As long as you've defined the correct location of your nutch index directory (as shown above), clicking search should yield results. + ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ Nutch-cvs mailing list Nutch-cvs@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-cvs