[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--------------------------------------------------
    Attachment:     (was: mimetype-patch-v2.patch)

> Indexing filter of documents by the MIME type
> ---------------------------------------------
>
>                 Key: NUTCH-1928
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>              Labels: filter, mime-type, plugin
>             Fix For: 1.10
>
>
> This allows filtering the indexed documents by the MIME type property of the
> crawled content. Basically, it lets you restrict the MIME types of the
> content stored in the Solr/Elasticsearch index without having to restrict
> the crawling/parsing process, so there is no need to use the URLFilter
> plugin family. It also addresses one particular corner case: some URLs have
> no suffix to filter on, such as certain RSS feeds
> (http://www.awesomesite.com/feed), so they end up in your index mixed with
> all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}}
> property in {{nutch-site.xml}}. This file uses the same format as the
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an
> {{allow all}} policy is used instead, so all your crawled documents will be
> indexed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
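
For illustration, the {{mimetype.filter.file}} property described in the issue could be set in {{nutch-site.xml}} roughly as follows. This is only a sketch: the file name {{mimetype-filter.txt}} is a hypothetical placeholder, and the description text is a paraphrase of the issue, not taken from the patch.

```xml
<!-- Sketch of a nutch-site.xml entry; the value below is a hypothetical file name -->
<property>
  <name>mimetype.filter.file</name>
  <!-- Assumed: a rules file on the Nutch configuration path, written in the
       same format as the urlfilter-suffix plugin's rules file -->
  <value>mimetype-filter.txt</value>
  <description>
    Rules file used to decide, by MIME type, which documents are indexed.
    If this property is absent, every crawled document is indexed.
  </description>
</property>
```

Leaving the property out entirely keeps the allow-all behaviour mentioned in the description.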