The "FAQ" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/FAQ?action=diff&rev1=139&rev2=140

Comment:
Update regarding parent directories, single slash after file:/, cf. NUTCH-1483

  <value>protocol-file|...copy original values from nutch-default here...</value>
  </property>
  }}}
- where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin. Make sure that urlfilter-regex is included, or else '''the urlfilter files will be ignored''', leading Nutch to accept all URLs. You need to enable crawl URL filters to prevent Nutch from crawling up the parent directory, see below.
+ where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin.
  Now you can invoke the crawler and index all or part of your disk.
  ==== Nutch crawling parent directories for file protocol ====
- If you find Nutch crawling parent directories when using the file protocol, the following Jira issue may help:
- http://issues.apache.org/jira/browse/NUTCH-407 E.g. for urlfilter-regex you could put the following in regex-urlfilter.txt :
+ By default, Nutch will step into parent directories. You can avoid this by setting the following property to false:
  {{{
+ <property>
+   <name>file.crawl.parent</name>
+   <value>false</value>
+   <description>The crawler is not restricted to the directories that you specified in the
+     Urls file but it is jumping into the parent directories as well. For your own crawlings you can
+     change this behavior (set to false) the way that only directories beneath the directories that you specify get
+     crawled.</description>
+ </property>
+ }}}
+
+ Alternatively, you could add a regex URL filter rule, e.g.
+ {{{
- +^file:///c:/top/directory/
+ +^file:/c:/top/directory/
  -.
  }}}
- Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on this page]], which would avoid the hard-wiring of the site-specific /top/directory in your configuration file.
+ - and don't forget to make sure that the plugin urlfilter-regex is enabled in plugin.includes.
+
+ ==== A note on slashes after file: ====
+
+ When a {{{file:}}} URL is parsed by the Java URL class and converted back to a string, only one slash remains:
+ {{{
+ String url = "file:///path/index.html";
+ java.net.URL u = new java.net.URL(url);
+ url = u.toString(); // url is now file:/path/index.html
+ }}}
+ Because such conversions are quite frequent, you had better write URLs (and also URL filter rules, etc.) with a single slash ({{{file:/path/index.html}}}). Nutch's URL normalizers in the default configuration also normalize file: URLs to have only one slash.
  ==== How do I index remote file shares? ====
  At the current time, Nutch does not have built-in support for accessing files over SMB (Windows) shares. This means the only available method is to mount the shares yourself, then index the contents as though they were local directories (see above).
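
The single-slash note in the diff above can be checked outside Nutch. The following is a minimal standalone sketch, not Nutch code: the class name `FileUrlSlashes` and the helpers `roundTrip` and `matchesRule` are hypothetical names chosen for illustration, and `matchesRule` only imitates an anchored prefix rule with `java.util.regex`, which is an assumption about how such a filter pattern is applied. It shows why a rule written with three slashes would no longer match a URL that has passed through `java.net.URL`.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class FileUrlSlashes {

    // Parse the string with java.net.URL and convert it back to a string.
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs but still available
    static String roundTrip(String url) {
        try {
            return new URL(url).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Assumption for illustration: a filter rule is a regex searched for in the URL.
    // The leading '+' or '-' of a regex-urlfilter.txt line is not part of the regex.
    static boolean matchesRule(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String converted = roundTrip("file:///path/index.html");
        System.out.println("after round trip: " + converted);

        // A URL as it would look after conversion, with a single slash.
        String crawled = "file:/c:/top/directory/index.html";
        System.out.println("three-slash rule matches: "
                + matchesRule("^file:///c:/top/directory/", crawled));
        System.out.println("single-slash rule matches: "
                + matchesRule("^file:/c:/top/directory/", crawled));
    }
}
```

Under these assumptions, only the single-slash pattern matches the converted URL, which is consistent with the change from `+^file:///c:/top/directory/` to `+^file:/c:/top/directory/` in this revision.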