Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by Paul Ruiz:
http://wiki.apache.org/nutch/Features

------------------------------------------------------------------------------
    * Judging from the names of the available parser plugins, the list below is 
probably complete.  However, only the plain text and HTML parsers are enabled 
by default.  To handle other document types, edit conf/nutch-site.xml and add 
the corresponding plugins to the value of the plugin.includes property:
     * Plain Text (plugin: parse-text)
     * HTML (parse-html)
+    * XML (parse-xml): uses XPath and namespaces to map XML elements to 
Lucene fields. 
     * JavaScript (for extracting links only?) (parse-js)
     * Microsoft PowerPoint, .ppt files (parse-mspowerpoint)
     * Microsoft Word, .doc files (parse-msword)
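
As a rough illustration of the plugin.includes change described above, a 
property override in conf/nutch-site.xml might look like the sketch below.  
Only the parse-* plugin names come from the list; the other entries in the 
value (protocol-http, urlfilter-regex, index-basic, query-...) are assumptions 
standing in for whatever your version's nutch-default.xml already lists, so 
copy the default value from there and extend only the parse-(...) group.  The 
enclosing root element is omitted because it is not named on this page.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- The value is a regular expression matched against plugin names.
       Entries outside parse-(...) are assumed defaults; take the real ones
       from nutch-default.xml for your Nutch version. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|xml|js|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
</property>
```

Because the value is a regular expression, adding a document type is usually 
a matter of adding one more alternative inside the parse-(...) group.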
