[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876748#action_12876748 ]
Andrzej Bialecki commented on NUTCH-828:
-----------------------------------------

I generally like the idea of a decision point, but I think the place where this decision is taken in this patch (Fetcher) is not right. Since you rely on the presence of ParseResult (understandably so), it seems to me that a much better place to run the filters would be inside ParseUtils.parse(content), and you could return null (or a special ParseResult) to indicate that the content is to be discarded. This way you can run this filtering both as part of a Fetcher in parsing mode and as part of ParseSegment, without duplicating the same logic. Consequently, I propose to change the name from FetchFilter to ParseFilter.

> Fetch Filter
> ------------
>
>                 Key: NUTCH-828
>                 URL: https://issues.apache.org/jira/browse/NUTCH-828
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>        Environment: All
>           Reporter: Dennis Kubes
>           Assignee: Dennis Kubes
>            Fix For: 1.1
>
>         Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch
>
>
> Adds a Nutch extension point for a fetch filter. The fetch filter allows filtering content and parse data/text after it is fetched but before it is written to segments. The filter returns true if the content is to be written, or false if it is not.
>
> Some use cases for this filter would be topical search engines that only want to fetch/index certain types of content, for example a news-only or sports-only search engine. In these situations, the only way to determine whether content belongs to a particular set is to fetch the page and then analyze the content. If the content passes, meaning it belongs to the set of, say, sports pages, then we want to include it. If it doesn't, we want to ignore it, never fetch that same page in the future, and ignore any URLs on that page.
>
> If content is rejected by a fetch filter, its status is written to the CrawlDb as gone, and its content is ignored and not written to segments. This effectively stops crawling along the crawl path of that page and the URLs from that page. An example filter, fetch-safe, is provided that allows fetching content that does not contain a list of bad words.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
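The accept/reject contract described in the issue (return true to write the content, false to discard it and mark the page gone) can be sketched as a small interface. This is a hypothetical illustration, not the actual Nutch extension-point API: the names ParseFilter and SafeWordsFilter, and the accept(url, text) signature, are assumptions made for the example; the fetch-safe bad-word check is modeled on the description above.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical filter contract, illustrating the decision point discussed
// in the issue: true = keep the parsed content, false = discard it.
interface ParseFilter {
    boolean accept(String url, String parseText);
}

// Illustrative analogue of the fetch-safe example filter: reject any page
// whose parse text contains a word from a configured block list.
class SafeWordsFilter implements ParseFilter {
    private final List<String> badWords;

    SafeWordsFilter(List<String> badWords) {
        this.badWords = badWords;
    }

    @Override
    public boolean accept(String url, String parseText) {
        String lower = parseText.toLowerCase(Locale.ROOT);
        for (String word : badWords) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                // Rejected: the caller would mark the URL gone in the CrawlDb
                // and skip writing the content to segments.
                return false;
            }
        }
        return true;
    }
}
```

Per the comment above, the natural place to invoke such a filter would be inside the parsing path (returning null or a special ParseResult on rejection) rather than in the Fetcher itself, so the same logic serves both a parsing Fetcher and ParseSegment.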