[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319744#comment-14319744 ]
Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
-------------------------------------------------------

[~lewismc] I've added the configuration key to the {{nutch-default.xml}} file, along with example content for the {{mimetype-filter.txt}} file that includes a description of the format. By default the configuration blocks all MIME types except {{text/html}}. Would it be wise to allow a couple more MIME types, or is this sufficient? I haven't set the plugin to be activated by default, so I don't see any problem with allowing only {{text/html}} as a usage example, but comments are welcome.

> Indexing filter of documents by the MIME type
> ---------------------------------------------
>
>                 Key: NUTCH-1928
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>              Labels: filter, mime-type, plugin
>             Fix For: 1.10
>
>         Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch,
> mimetype-patch-v3.patch
>
>
> This allows filtering the indexed documents by the MIME type property of the
> crawled content. Basically, it lets you restrict the MIME types of the
> content stored in the Solr/Elasticsearch index without restricting the
> crawling/parsing process, so there is no need to use the URLFilter plugin
> family. It also addresses one particular corner case: certain URLs have no
> format to filter on, such as some RSS feeds
> (http://www.awesomesite.com/feed), and would otherwise end up in your index
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}}
> property in {{nutch-site.xml}}. This file uses the same format as the
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an
> {{allow all}} policy is used instead, so all your crawled documents will be
> indexed.
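
The allow/deny behaviour described above can be illustrated with a small sketch. This is not the plugin's code (the plugin itself is a Nutch indexing filter written in Java), and the rule format below is an assumption made for illustration only: a first non-comment line of {{+}} or {{-}} sets the default policy, and each following line names a MIME type that is an exception to that default. The function names {{parse_rules}} and {{should_index}} are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

# (default_allow, exceptions): hypothetical in-memory form of the rule file.
Rules = Tuple[bool, Set[str]]


def parse_rules(lines: Iterable[str]) -> Rules:
    """Parse rule lines under the ASSUMED format described in the lead-in."""
    default_allow = True
    exceptions: Set[str] = set()
    first_rule_seen = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not first_rule_seen and line in ("+", "-"):
            # The first non-comment line may set the default policy.
            default_allow = line == "+"
        else:
            # Every other line is a MIME type that flips the default.
            exceptions.add(line.lower())
        first_rule_seen = True
    return default_allow, exceptions


def should_index(mime_type: str, rules: Optional[Rules]) -> bool:
    """Decide whether a document with this MIME type would be indexed."""
    if rules is None:
        # No rule file configured: allow-all policy, as the issue describes.
        return True
    default_allow, exceptions = rules
    if mime_type.lower() in exceptions:
        return not default_allow
    return default_allow


# Block everything except text/html, mirroring the example default.
rules = parse_rules(["# block all, allow only HTML", "-", "text/html"])
```

With these rules, {{should_index("text/html", rules)}} allows the document, while an RSS feed served as {{application/rss+xml}} is rejected. Passing {{None}} instead of rules models the case where {{mimetype.filter.file}} is not set.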
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)