[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876748#action_12876748 ]

Andrzej Bialecki  commented on NUTCH-828:
-----------------------------------------

I generally like the idea of a decision point, but I think the place where this 
decision is made in this patch (the Fetcher) is not right. Since you rely on the 
presence of ParseResult (understandably so), it seems to me that a much better 
place to run the filters would be inside ParseUtils.parse(content), which could 
return null (or a special ParseResult) to indicate that the content is to 
be discarded.

This way you can both run this filtering as a part of a Fetcher in parsing 
mode, and as a part of ParseSegment, without duplicating the same logic. 
Consequently, I propose to change the name from FetchFilter to ParseFilter.
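To illustrate the proposed placement, here is a minimal sketch of what a filter hook inside a parse() method could look like. Note that ParseFilter, ParseResult, and the parse() signature below are simplified stand-ins for illustration, not Nutch's actual classes or the patch's API:

```java
import java.util.List;

// Simplified stand-in for Nutch's ParseResult.
class ParseResult {
    final String text;
    ParseResult(String text) { this.text = text; }
}

// Hypothetical ParseFilter contract: return false to signal that the
// parsed content should be discarded.
interface ParseFilter {
    boolean accept(ParseResult result);
}

class ParseUtilSketch {
    private final List<ParseFilter> filters;

    ParseUtilSketch(List<ParseFilter> filters) {
        this.filters = filters;
    }

    // Runs every configured filter after parsing and returns null when
    // any filter rejects the content. Because the decision lives here,
    // both a Fetcher in parsing mode and ParseSegment share the same
    // filtering logic without duplication.
    ParseResult parse(String content) {
        ParseResult result = new ParseResult(content); // stand-in for real parsing
        for (ParseFilter f : filters) {
            if (!f.accept(result)) {
                return null; // content discarded by a filter
            }
        }
        return result;
    }
}
```

The null return (or a special ParseResult) lets both call sites treat "filtered out" uniformly, which is the point of moving the decision out of the Fetcher.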

> Fetch Filter
> ------------
>
>                 Key: NUTCH-828
>                 URL: https://issues.apache.org/jira/browse/NUTCH-828
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch
>
>
> Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
> filtering content and parse data/text after it is fetched but before it is 
> written to segments.  The filter can return true if content is to be written 
> or false if it is not.  
> Some use cases for this filter would be topical search engines that only want 
> to fetch/index certain types of content, for example a news or sports only 
> search engine.  In these types of situations the only way to determine if 
> content belongs to a particular set is to fetch the page and then analyze the 
> content.  If the content passes, meaning it belongs to the set of, say, 
> sports pages, then we want to include it.  If it doesn't, then we want to 
> ignore it, never fetch that same page in the future, and ignore any urls on 
> that page.  
> If content is rejected due to a fetch filter then its status is written to 
> the CrawlDb as gone and its content is ignored and not written to segments.  
> This effectively stops crawling along the crawl path of that page and the 
> urls from that page.  An example filter, fetch-safe, is provided that allows 
> fetching content that does not contain a list of bad words.
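The true/false contract described above might be sketched roughly as follows. The interface name, method signature, and the SafeWordFilter class are assumptions made for illustration; the actual patch attached to this issue may define them differently:

```java
import java.util.List;

// Hypothetical fetch-filter contract (illustrative only): return true
// to write the fetched content to segments, false to discard it and
// mark the URL as gone in the CrawlDb.
interface FetchFilter {
    boolean accept(String url, String parseText);
}

// Illustrative "fetch-safe"-style filter: rejects any page whose parse
// text contains a word from a configured block list.
class SafeWordFilter implements FetchFilter {
    private final List<String> badWords;

    SafeWordFilter(List<String> badWords) {
        this.badWords = badWords;
    }

    @Override
    public boolean accept(String url, String parseText) {
        String lower = parseText.toLowerCase();
        for (String bad : badWords) {
            if (lower.contains(bad.toLowerCase())) {
                return false; // content fails the filter: discard
            }
        }
        return true; // content passes the filter: write to segments
    }
}
```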

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
