Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by MiddleForkMaps:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

------------------------------------------------------------------------------
  Under Debian Etch, the Catalina configuration files are located under 
'''/etc/tomcat5.5/policy.d'''.  At runtime they are combined into a single file, 
''/usr/share/tomcat5.5/conf/catalina.policy''.  Do not edit the latter, as it 
will be overwritten.[[BR]]
  At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following 
code:[[BR]]
  
- ''grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
+ ''grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {[[BR]]
-     permission java.util.PropertyPermission "user.dir", "read";
+     permission java.util.PropertyPermission "user.dir", "read";[[BR]]
-     permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
+     permission java.util.PropertyPermission "java.io.tmpdir", 
"read,write";[[BR]]
-     permission java.util.PropertyPermission "org.apache.*", "read,execute";
+     permission java.util.PropertyPermission "org.apache.*", 
"read,execute";[[BR]]
-     permission java.io.FilePermission "/usr/local/nutch/crawls/-" , "read";
+     permission java.io.FilePermission "/usr/local/nutch/crawls/-" , 
"read";[[BR]]
-     permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
+     permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";[[BR]]
-     permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", 
"read,write,execute,delete";
+     permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", 
"read,write,execute,delete";[[BR]]
-     permission java.lang.RuntimePermission "createClassLoader", "";
+     permission java.lang.RuntimePermission "createClassLoader", "";[[BR]]
-     permission java.security.AllPermission;
+     permission java.security.AllPermission;[[BR]]
+ };[[BR]]
- };
- ''
- '''Warning:  The last line here was necessary in order to make things work 
for me.  If anybody can supply a more restrictive permission set, please do 
so!!!  The effects of this are unknown'''
+ '''Warning:  The last line here (java.security.AllPermission) was necessary in 
order to make things work for me.  It grants every permission to the web 
applications, so if anybody can supply a more restrictive permission set, please 
do so!  The full effects of this are unknown.'''[[BR]]
  
  == Acquire, install and configure Nutch ==
- Follow '''ONLY''' the section ''Getting Started'' in the Nutch tutorial at 
http://lucene.apache.org/nutch/tutorial8.html
+ Acquire a copy of Nutch and unpack it in a new directory.  I suggest 
using /usr/local/nutch as the top-level directory, but this is of course 
optional.[[BR]]
+ 
- ===Configure for multiple, independent site crawls and searches===
+ === Configure for multiple, independent site crawls and searches ===
+ Follow the section '''Intranet:Configuration''' from the Nutch tutorial at 
http://lucene.apache.org/nutch/tutorial8.html.  However, plan in advance for 
crawling and searching sites independently from one another:[[BR]]
- Given two sites, site1 and site2 which you wish to crawl/index (and later 
search) independently from each other:[[BR]]
+ Given two sites, site1 and site2, which you wish to crawl/index (and later 
search) independently of each other, you may make multiple copies of the conf 
directory:[[BR]]
+  ''#cd /usr/local/nutch''[[BR]]
   ''#cp -rp conf conf.site1''[[BR]]
   ''#cp -rp conf conf.site2''[[BR]]
+ Then work through steps one through four of the above-mentioned section 
for '''each''' site.[[BR]]
+ 
+ Create simple shell scripts which allow for the independent crawling of each 
site, such as '''/usr/local/nutch/crawl_site1.sh'''[[BR]]
+   ''NUTCH_CONF_DIR=conf.site1''[[BR]]
+   ''export NUTCH_CONF_DIR''[[BR]]
+   ''bin/nutch crawl urls/site1  -dir crawls/site1 -depth 10 -topN 
100000''[[BR]]
+ and the same for site2.[[BR]]
+ Crawl each site:[[BR]]
+   ''sh crawl_site1.sh''[[BR]]
+   ''sh crawl_site2.sh''[[BR]]
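If the number of sites grows, the per-site scripts above could be folded into one script. The following is only a sketch under assumptions: the script name crawl_all.sh, the crawl_site function and the NUTCH_CMD variable are hypothetical names introduced here, and it assumes it is run from /usr/local/nutch with a conf.<site> directory and a urls/<site> directory already prepared for each site. The crawl options are the same ones used in crawl_site1.sh.

```shell
#!/bin/sh
# crawl_all.sh (hypothetical name): crawl each site named on the command line.
# Assumes the current directory is the Nutch top-level directory and that
# conf.<site> and urls/<site> exist for every site passed in.

# NUTCH_CMD is an illustrative override, not a Nutch setting.
# Set NUTCH_CMD=echo to print the crawl commands instead of running them.
NUTCH_CMD="${NUTCH_CMD:-bin/nutch}"

crawl_site() {
    site="$1"
    # The subshell keeps NUTCH_CONF_DIR from leaking into the next site's crawl.
    (
        NUTCH_CONF_DIR="conf.$site"
        export NUTCH_CONF_DIR
        $NUTCH_CMD crawl "urls/$site" -dir "crawls/$site" -depth 10 -topN 100000
    )
}

# Crawl every site given as an argument; report failures but keep going.
for site in "$@"; do
    crawl_site "$site" || echo "crawl of $site failed" >&2
done
```

It would be invoked as ''sh crawl_all.sh site1 site2''. Running ''NUTCH_CMD=echo sh crawl_all.sh site1 site2'' first prints the commands without crawling, which is a cheap way to check the paths.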
  
  
  
  
  
+ 
+ 
