Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KurosakaTeruhiko:
http://wiki.apache.org/nutch/Features

------------------------------------------------------------------------------
   *How does the search engine handle punctuation and special characters? (and what's configurable?)
   *Which document formats are supported?
-   * Guessing from the names of the available parser plugins, this is probably it:
+   * Guessing from the names of the available parser plugins, this is probably it. However, only the plain text and HTML are enabled by default. Edit conf/nutch-site.xml and change the value of plugin.includes property to include the plugins for the document types that you want Nutch to handle:
-    *Plain Text (in a fixed preconfigured charset only)
+    * Plain Text (in a fixed preconfigured charset only) (plugin: parse-text)
-    * HTML (in most any charsets)
+    * HTML (in most any charsets) (parse-html)
-    * JavaScript (for extracting links only?)
+    * JavaScript (for extracting links only?) (parse-js)
-    * Microsoft Power Point, the .ppt file
+    * Microsoft Power Point, the .ppt file (parse-mspowerpoint)
-    * Microsoft Word, the .doc file
+    * Microsoft Word, the .doc file (parse-msword)
-    * Adobe PDF
-    * RSS
-    * RTF
+    * Adobe PDF (parse-pdf)
+    * RSS (parse-rss)
+    * RTF (parse-rtf)
-    * MP3 (?) Is there any text in MP3?
+    * MP3 (?) Is there any text in MP3? (parse-mp3)
-    * ZIP (?) This seems to expand the zip of plain text files and return the concatenated text.
+    * ZIP (?) This seems to expand the zip of plain text files and return the concatenated text. (parse-zip)
   *What post-coordination options are available? (hey Karen, what does this mean?)
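
For the plugin.includes edit mentioned under "Which document formats are supported?", a rough sketch of what the override in conf/nutch-site.xml might look like is given here. It assumes you want PDF and Word parsing in addition to plain text and HTML. The property value is a regular expression over plugin names; the parser names come from the list above, while the non-parser names (protocol-http, urlfilter-regex, index-basic, query-...) are only illustrative assumptions and should be copied from the plugin.includes value in your own conf/nutch-default.xml, since they differ between Nutch versions.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml. -->
<!-- Assumption: the non-parser plugin names are placeholders; take the real
     ones from plugin.includes in your conf/nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(text|html|pdf|msword) adds parse-pdf and parse-msword
       to the two parsers that are enabled by default. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

Other formats from the list would presumably be added the same way, by putting the suffix of the plugin name (for example rss or zip) into the parse-(...) alternation.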