[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876923#action_12876923 ]

Dennis Kubes commented on NUTCH-828:
------------------------------------

I agree that we want the decision available not just in the Fetcher while 
parsing but also in the ParseSegment job.  Here is the problem, as I see it, 
with returning null content.

Say we want to create a topical search engine about sports.  We fetch pages 
and run each through a fetch filter that makes a yes/no decision, based on 
content, about whether the page belongs to sports.  If we null out the 
Content, and with it the ParseText and ParseData, we still have the 
CrawlDatum to deal with.  If we leave it as is, the CrawlDatum will get 
updated into CrawlDb as successfully fetched.  Content and Parse won't get 
collected because they are null, so we avoid the problem of that page's 
Outlinks getting queued in CrawlDb, but the original URL will still be there 
and will be queued again after its fetch interval.  Over time what we have is 
a large number of URLs that we know to be filtered being repeatedly crawled.
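
To make that concrete, here is a minimal sketch of the kind of filter I have 
in mind.  The interface name and signature below are illustrative only, not 
the exact API from the attached patches:

    // FetchFilter.java -- hypothetical extension point: return true to
    // keep the page, false to reject it.
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    public interface FetchFilter {
      boolean filter(String url, Content content, Parse parse);
    }

    // SportsFilter.java -- toy topical filter for the sports example.
    public class SportsFilter implements FetchFilter {
      public boolean filter(String url, Content content, Parse parse) {
        return parse.getText().toLowerCase().contains("sport");
      }
    }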

The decision point isn't just whether to keep the content.  It is whether we 
should keep the URL and its content/parse and continue crawling down the path 
of the URL's outlinks, or ignore this URL and not crawl anything it points 
to, breaking the crawl graph at that point.  Hence FetchFilter.  My solution 
was to null out the content/parse and write a different CrawlDatum that 
essentially said the page was gone.  Ideally we should have a separate 
status, but gone worked as a first pass.  That gets updated back into CrawlDb 
and the URL won't get recrawled at a later date.  This was only possible in 
the Fetcher, though.
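
In code, the Fetcher-side handling I described looks roughly like this.  
CrawlDatum.STATUS_FETCH_GONE is the existing status I reused; the helper 
class and method names are made up for illustration:

    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    public class FetchFilterHandler {

      private final FetchFilter filter;

      public FetchFilterHandler(FetchFilter filter) {
        this.filter = filter;
      }

      // Returns the content to write to the segment, or null if the
      // page was rejected by the filter.
      public Content apply(String url, CrawlDatum datum, Content content,
          Parse parse) {
        if (filter.filter(url, content, parse)) {
          return content;  // keep: write content/parse as usual
        }
        // Reject: nothing goes to the segment, and the datum is marked
        // gone so updatedb won't reschedule the URL or follow its
        // outlinks.
        datum.setStatus(CrawlDatum.STATUS_FETCH_GONE);
        return null;
      }
    }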

Thoughts on how we might approach this?



> Fetch Filter
> ------------
>
>                 Key: NUTCH-828
>                 URL: https://issues.apache.org/jira/browse/NUTCH-828
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch
>
>
> Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
> filtering content and parse data/text after it is fetched but before it is 
> written to segments.  The filter can return true if content is to be written 
> or false if it is not.  
> Some use cases for this filter would be topical search engines that only want 
> to fetch/index certain types of content, for example a news or sports only 
> search engine.  In these types of situations the only way to determine if 
> content belongs to a particular set is to fetch the page and then analyze the 
> content.  If the content passes, meaning belongs to the set of say sports 
> pages, then we want to include it.  If it doesn't then we want to ignore it, 
> never fetch that same page in the future, and ignore any URLs on that page.  
> If content is rejected due to a fetch filter then its status is written to 
> the CrawlDb as gone and its content is ignored and not written to segments.  
> This effectively stops crawling along the crawl path of that page and the URLs 
> from that page.  An example filter, fetch-safe, is provided that allows 
> fetching content that does not contain a list of bad words.
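
For illustration, the fetch-safe filter described above might look roughly 
like the following, written against the hypothetical interface sketched 
earlier in this thread; the actual patch's API and wiring may differ, and a 
real filter would load its word list from configuration:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    public class FetchSafeFilter implements FetchFilter {

      // Toy word list for the sketch; not the patch's actual list.
      private final Set<String> badWords =
          new HashSet<String>(Arrays.asList("badword1", "badword2"));

      public boolean filter(String url, Content content, Parse parse) {
        for (String token : parse.getText().toLowerCase().split("\\s+")) {
          if (badWords.contains(token)) {
            return false;  // reject: page contains a listed bad word
          }
        }
        return true;  // keep: no bad words found
      }
    }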

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
