Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by ThorstenScherler:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  </property>

  Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla will '''not''' load file: URLs from a web page fetched over http, so if you test with the Nutch web container running in Tomcat, clicking on results will, annoyingly, do nothing, because Mozilla by default refuses to load file: URLs. This is mentioned [http://www.mozilla.org/quality/networking/testing/filetests.html here], and the behavior can be disabled via a [http://www.mozilla.org/quality/networking/docs/netprefs.html preference] (see security.checkloaduri). IE5 does not have this problem.

+ ==== Nutch crawling parent directories for file protocol -> misconfigured URLFilters ====
+ [http://issues.apache.org/jira/browse/NUTCH-407] E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt:
+ {{{
+ +^file:///c:/top/directory/
+ -.
+ }}}

  ==== How do I index remote file shares? ====

@@ -379, +387 @@

  The crawl tool expects as its first parameter the folder name where the seed urls file is located. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like:

  crawl seed -dir /user/nutchuser...

- ==== Nutch crawling parent directories for file protocol -> misconfigured URLFilters ====
- [http://issues.apache.org/jira/browse/NUTCH-407] E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt:
- {{{
- +^file:///c:/top/directory/
- -.
- }}}

  === Discussion ===

  [http://grub.org/ Grub] has some interesting ideas about building a search engine using distributed computing.

  ''And how is that relevant to nutch?''
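The regex-urlfilter.txt rules added in the diff above rely on first-match semantics: each line is a `+` (accept) or `-` (reject) sign followed by a regular expression, the first pattern that matches a URL decides its fate, and a URL matching no rule is rejected. A minimal Python sketch of that evaluation (the `filter_url` helper is illustrative, not Nutch's actual API) shows why `+^file:///c:/top/directory/` followed by `-.` keeps the crawler out of parent directories:

```python
import re

def filter_url(url, rules):
    # Each rule is one line of regex-urlfilter.txt: a '+' or '-' sign
    # followed by a regex. The first matching pattern decides; URLs
    # that match no rule are rejected (assumed Nutch default).
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == '+'
    return False

rules = [
    "+^file:///c:/top/directory/",  # accept the chosen subtree
    "-.",                           # reject everything else
]

print(filter_url("file:///c:/top/directory/sub/page.html", rules))  # True
print(filter_url("file:///c:/", rules))  # False: parent dir hits '-.'
```

With only the `-.` catch-all missing, every URL outside the subtree would fall through to the (rejecting) default anyway, but being explicit documents the intent and guards against other filter plugins accepting the URL first.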