Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features

------------------------------------------------------------------------------
  
   *How does the search engine handle punctuation and special characters? (and 
what's configurable?)
   *Which document formats are supported?
-   * Guessing from the names of the available parser plugins, this is probably 
it:
+   * Guessing from the names of the available parser plugins, this is probably 
it.  However, only plain text and HTML are enabled by default.  Edit 
conf/nutch-site.xml and change the value of the plugin.includes property to 
include the plugins for the document types that you want Nutch to handle:
-    *Plain Text (in a fixed preconfigured charset only)
+    * Plain Text (in a fixed preconfigured charset only) (plugin: parse-text)
-    * HTML (in most any charsets)
+    * HTML (in most any charsets) (parse-html)
-    * JavaScript (for extracting links only?)
+    * JavaScript (for extracting links only?) (parse-js)
-    * Microsoft Power Point, the .ppt file
+    * Microsoft PowerPoint, the .ppt file (parse-mspowerpoint)
-    * Microsoft Word, the .doc file
+    * Microsoft Word, the .doc file (parse-msword)
-    * Adobe PDF
-    * RSS
-    * RTF
+    * Adobe PDF (parse-pdf)
+    * RSS (parse-rss)
+    * RTF (parse-rtf)
-    * MP3 (?) Is there any text in MP3?
+    * MP3 (?) Is there any text in MP3? (parse-mp3)
-    * ZIP (?) This seems to expand the zip of plain text files and return the 
concatenated text.
+    * ZIP (?) This seems to expand the zip of plain text files and return the 
concatenated text. (parse-zip)
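
As a rough illustration of the plugin.includes edit described above, the 
following property block enables the PDF and Microsoft Word parsers in 
addition to plain text and HTML.  It is a sketch, not a verified default: the 
value is a regular expression over plugin names, and the non-parser entries 
shown here (protocol-http, urlfilter-regex, index-basic, query-*) are 
assumptions that differ between Nutch versions.  Copy the actual default value 
from conf/nutch-default.xml and extend only the parse-(...) group.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: everything outside parse-(...) is illustrative only.
       Take the real default from conf/nutch-default.xml for your version. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

Any other parser from the list can be enabled the same way by adding its 
suffix to the group, for example rtf for parse-rtf or zip for parse-zip.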
  
   *What post-coordination options are available? (hey Karen, what does this 
mean?)
  
