[Nutch Wiki] Update of "FAQ" by SebastianNagel

Apache Wiki Mon, 12 Jun 2017 14:36:44 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/FAQ?action=diff&rev1=139&rev2=140

Comment:
Update regarding parent directories, single slash after file:/, cf. NUTCH-1483

        <value>protocol-file|...copy original values from nutch-default 
here...</value>
      </property>
  }}}
- where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-ins 
normally enabled will be enabled, plus the protocol-file plugin. Make sure that 
urlfilter-regex is included, or else '''the urlfilter files will be ignored''', 
leading Nutch to accept all URLs. You need to enable crawl URL filters to 
prevent Nutch from crawling up the parent directory, see below.
+ where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-ins 
normally enabled will be enabled, plus the protocol-file plugin.
  
  Now you can invoke the crawler and index all or part of your disk.
  
  ==== Nutch crawling parent directories for file protocol ====
- If you find Nutch crawling parent directories when using the file protocol, 
the following Jira issue may help:
  
- http://issues.apache.org/jira/browse/NUTCH-407 E.g. for urlfilter-regex you 
could put the following in regex-urlfilter.txt :
+ By default, Nutch will step into parent directories. You can avoid this by 
setting the following property to false:
  
  {{{
+ <property>
+   <name>file.crawl.parent</name>
+   <value>false</value>
+   <description>The crawler is not restricted to the directories that you 
specified in the
+     Urls file but it is jumping into the parent directories as well. For your 
own crawlings you can
+     change this behavior (set to false) the way that only directories beneath 
the directories that you specify get
+     crawled.</description>
+ </property>
+ }}}
+ 
+ Alternatively, you could add a regex URL filter rule, e.g.
+ {{{
- +^file:///c:/top/directory/
+ +^file:/c:/top/directory/
  -.
  }}}
- Alternatively, you could apply the patch described 
[[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on 
this page]], which would avoid the hard-wiring of the site-specific 
/top/directory in your configuration file.
+ Make sure that the urlfilter-regex plugin is enabled in plugin.includes; 
otherwise the URL filter rules are ignored.
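
The effect of the two rules above can be sketched in a few lines of Java. This is only an illustration, not the plugin's actual code: it assumes that rules are tried top to bottom, that the first rule whose pattern is found in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class FirstMatchFilter {
    /**
     * Hypothetical helper: applies "+regex" / "-regex" rule lines in order.
     * Assumption: the first rule whose pattern is found in the URL decides.
     */
    static boolean accepts(String[] ruleLines, String url) {
        for (String line : ruleLines) {
            boolean accept = line.charAt(0) == '+';
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // assumption: no matching rule means the URL is rejected
    }

    public static void main(String[] args) {
        String[] rules = { "+^file:/c:/top/directory/", "-." };
        // Inside the allowed directory: accepted by the first rule.
        System.out.println(accepts(rules, "file:/c:/top/directory/a.txt"));
        // Parent directory: falls through to "-." and is rejected.
        System.out.println(accepts(rules, "file:/c:/top/"));
        // Triple-slash form does not match the single-slash rule, so it is rejected.
        System.out.println(accepts(rules, "file:///c:/top/directory/a.txt"));
    }
}
```

The last call shows why the number of slashes in the rule has to agree with the form of the URLs being filtered.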
+ 
+ ==== A note on slashes after file: ====
+ 
+ When a {{{file:}}} URL is parsed by the Java URL class and converted back to a 
string, only one slash remains:
+ {{{
+ String url = "file:///path/index.html";
+ java.net.URL u = new java.net.URL(url);
+ url = u.toString();  // url is now file:/path/index.html
+ }}}
+ Because such conversions are frequent, it is better to write URLs (and also 
URL filter rules, etc.) with a single slash ({{{file:/path/index.html}}}). 
Nutch's URL normalizers in the default configuration also normalize file: URLs 
to have only one slash.
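
A self-contained version of the snippet above, for anyone who wants to try the round trip directly. The class and method names are invented for the example; only java.net.URL is used, and nothing from Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FileUrlSlashes {
    /** Parses the string with java.net.URL and converts it back to a string. */
    static String roundTrip(String url) {
        try {
            return new URL(url).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        // Three slashes go in, one slash comes out.
        System.out.println(roundTrip("file:///path/index.html"));
        // The single-slash form is unchanged by the round trip.
        System.out.println(roundTrip("file:/path/index.html"));
    }
}
```

Both calls should print the same single-slash URL, so rules and seed URLs written with one slash stay stable across such conversions.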
  
  ==== How do I index remote file shares? ====
  Currently, Nutch does not have built-in support for accessing files 
over SMB (Windows) shares.  This means the only available method is to mount 
the shares yourself, then index the contents as though they were local 
directories (see above).
