[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239
 ] 

Asitang Mishra edited comment on NUTCH-2038 at 6/22/15 5:09 PM:
----------------------------------------------------------------

From what I understand, the problem is that a URL filter in Nutch has a very 
simple interface (it has no provision for content) and is only "fired" in the 
generator step.

Problems raised:
[~chrismattmann]: 
1> It cannot be made part of the core; it should be a plugin and be called from 
the core like any general plugin (right now, in my patch, it is more visible 
than a general plugin).
2> It should be a URL filter and not a scoring filter, to preserve the 
simplicity and transparency of the methodology.
[~wastl-nagel]: 
1> It should not read content or call Tika in the plugin, as that would be a 
Hadoop job and also not lightweight. 
2> It should be a scoring filter, as the interface already in place supports 
such an improvement.


I suggest that, if we all agree to let it be a URL filter (and that's 
completely up to you guys), I can either enhance the existing urlfilter 
interface or make an abstract class (which will be very generic and have a 
filter function that takes some args and a string). 
All the URL filters would then also be called from the parser, but without 
firing the original filter() function (keep that for the generator); the parser 
fires the new filter function instead. That way the only visible change in 
Nutch will be that the parser now also calls the urlfilters (and this will be 
very generic). We also won't need to read the crawl db or call Tika for my 
specific filter.
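
To make the abstract-class option concrete, here is a minimal sketch of what I 
have in mind. It is only an illustration under assumptions, not the patch: 
UrlFilter is a stand-in for the existing one-method filter contract, and 
ContextualUrlFilter, HotWordFilter and the "parentRelevant" arg are 
hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ParserSideFilterSketch {

    /** Stand-in for the existing contract: return the URL to keep it, null to drop it. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Hypothetical generic base class with a second filter function taking extra args. */
    abstract static class ContextualUrlFilter implements UrlFilter {
        /** Original entry point, still used by the generator; a pass-through in this sketch. */
        @Override
        public String filter(String url) {
            return url;
        }

        /** New entry point, meant to be fired from the parser. */
        abstract String filter(String url, Map<String, Object> args);
    }

    /** Hypothetical filter: keeps outlinks of irrelevant pages only if they contain a hot word. */
    static class HotWordFilter extends ContextualUrlFilter {
        private final List<String> hotWords;

        HotWordFilter(List<String> hotWords) {
            this.hotWords = hotWords;
        }

        @Override
        String filter(String url, Map<String, Object> args) {
            // "parentRelevant" is an assumed arg: the classifier's verdict on the parent page.
            if (Boolean.TRUE.equals(args.get("parentRelevant"))) {
                return url;
            }
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : hotWords) {
                if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                    return url;
                }
            }
            return null;
        }
    }

    public static void main(String[] argv) {
        ContextualUrlFilter filter = new HotWordFilter(Arrays.asList("nutch", "crawler"));
        Map<String, Object> args =
                Collections.<String, Object>singletonMap("parentRelevant", Boolean.FALSE);
        // Parser-side call: the new two-argument function.
        System.out.println(filter.filter("http://example.org/nutch/docs", args));
        System.out.println(filter.filter("http://example.org/about", args));
        // Generator-side call: the original one-argument function is untouched.
        System.out.println(filter.filter("http://example.org/about"));
    }
}
```

The point of the design is that the generator keeps calling the one-argument 
filter() exactly as before, while the parser passes whatever it already knows 
(here, the classifier's verdict) through the args, so the filter itself never 
has to read the crawl db or call Tika.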


> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs found on 
> pages that the classifier marks as irrelevant, keeping only those URLs that 
> contain some "hot words" provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to