Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by ThorstenScherler:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
      </property>
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that Mozilla will '''not''' load file: URLs from a web 
page fetched over http, so if you test with the Nutch web application running 
in Tomcat, clicking on results will, annoyingly, do nothing, because Mozilla 
by default refuses to load file: URLs. This is mentioned 
[http://www.mozilla.org/quality/networking/testing/filetests.html here], and 
the behavior can be disabled with a 
[http://www.mozilla.org/quality/networking/docs/netprefs.html preference] (see 
security.checkloaduri). IE5 does not have this problem.
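  As a minimal sketch, and assuming a Mozilla-era browser that still honors 
this preference, the check can be switched off by adding the following line to 
the user.js file in your browser profile directory:
{{{
// disables the check that blocks file: links on pages fetched over http
user_pref("security.checkloaduri", false);
}}}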
+ 
+ ==== Nutch crawling parent directories for the file protocol -> misconfigured 
URLFilters ====
+ See [http://issues.apache.org/jira/browse/NUTCH-407 NUTCH-407]. For example, 
with urlfilter-regex you should put the following in regex-urlfilter.txt (the 
final -. rule rejects every URL that is not explicitly accepted, which keeps 
Nutch out of parent directories):
+ {{{
+ 
+ +^file:///c:/top/directory/
+ -.
+ }}}
  
  ==== How do I index remote file shares? ====
  
@@ -379, +387 @@

  
  The crawl tool expects as its first parameter the name of the folder that 
contains the seed urls file, so for example if your urls.txt is located in 
/nutch/seeds, the crawl command would look like: crawl /nutch/seeds -dir 
/user/nutchuser...
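  A complete command line might then look like the sketch below; the output 
directory /user/nutchuser/crawl and the -depth and -topN values are 
hypothetical, chosen only for illustration:
{{{
# output directory, depth and topN are illustrative placeholders
bin/nutch crawl /nutch/seeds -dir /user/nutchuser/crawl -depth 3 -topN 1000
}}}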
  
- ==== Nutch crawling parent directories for file protocol ->  misconfigured 
URLFilters ====
- [http://issues.apache.org/jira/browse/NUTCH-407] E.g. for urlfilter-regex you 
should put the following in regex-urlfilter.txt :
- {{{
- 
- +^file:///c:/top/directory/
- -.
- }}}
- 
  === Discussion ===
  
  [http://grub.org/ Grub] has some interesting ideas about building a search 
engine using distributed computing. ''And how is that relevant to nutch?''
