[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365 ]
Brian commented on NUTCH-1614:
------------------------------

Can you please tell me how to do this? I couldn't find anything about it. From what I can tell, URL filters apply to crawling, not just to indexing, and I couldn't see how to apply one to indexing only. I also don't see how normalizing a URL would help in this case if the URL is still filtered from the crawl and not just from indexing.

> Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1614
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1614
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 2.2.1
>            Reporter: Brian
>            Priority: Minor
>              Labels: plugin
>         Attachments: NUTCH-1614.patch
>
>
> We need to crawl some pages (such as some main pages and different views of a
> main page) to reach all the other pages, but we don't want to index those pages
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file).
> If one of the regex strings matches an entire URL, that URL will be
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>
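
The matching rule described in the issue (skip "#" comment lines, exclude a URL only when a regex matches the entire URL) could be sketched roughly as follows. This is an illustration only, not the code in NUTCH-1614.patch: the class name IndexExcludeSketch, the methods parseRules and isExcluded, and the example.com rules are all hypothetical, and skipping blank lines is an assumption the issue does not state. It uses only java.util.regex and does not touch any Nutch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the exclude-from-indexing rule; not the attached patch. */
public class IndexExcludeSketch {

    /** Compiles rule-file lines into patterns, skipping "#" comments and (by assumption) blank lines. */
    public static List<Pattern> parseRules(List<String> lines) {
        List<Pattern> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(Pattern.compile(trimmed));
        }
        return rules;
    }

    /** Returns true when any rule matches the whole URL; matches() requires a full match. */
    public static boolean isExcluded(String url, List<Pattern> rules) {
        for (Pattern rule : rules) {
            if (rule.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of regex-indexer-exclude-urls.txt (made-up rules).
        List<String> lines = Arrays.asList(
            "# listing pages: crawl them for links, but do not index them",
            "https?://example\\.com/category/.*",
            "https?://example\\.com/");
        List<Pattern> rules = parseRules(lines);

        for (String url : Arrays.asList(
                "http://example.com/",
                "http://example.com/category/shoes",
                "http://example.com/product/42")) {
            System.out.println(url + " excluded=" + isExcluded(url, rules));
        }
    }
}
```

Because the check runs only at indexing time in this sketch, the excluded pages would still be fetched and parsed, so their outlinks remain available to the crawl. That is the difference from a URL filter that the comment above asks about.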