Stefan, I'm back from the holidays, so hopefully I can contribute something useful to our discussion again. :)
Attached is a diff to Fetcher.java (CVS) that adds a pre-fetch
URLFilter. I simply added a "urlFilter" instance variable to Fetcher, a
"setUrlFilter()" method, and a test in FetcherThread that applies the
filter if it is set. A post-fetch content filter could be implemented
similarly, if we defined a suitable ContentFilter interface.
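
The shape of such a change could be sketched as follows. This is not the attached diff: FetcherSketch, fakeFetch() and the ContentFilter interface are hypothetical names used only for illustration, and URLFilter here is a local stand-in that assumes the usual contract (return the URL to keep it, null to reject it).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Local stand-in for a URL filter: return the URL to keep it, null to reject it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical post-fetch filter interface (an assumption, not an existing API):
// return the content to keep it, null to reject it.
interface ContentFilter {
    byte[] filter(String url, byte[] content);
}

// Simplified stand-in for Fetcher; only the two filter hooks are shown.
class FetcherSketch {
    private URLFilter urlFilter;          // optional pre-fetch filter
    private ContentFilter contentFilter;  // optional post-fetch filter

    void setUrlFilter(URLFilter filter) { this.urlFilter = filter; }
    void setContentFilter(ContentFilter filter) { this.contentFilter = filter; }

    // Mimics the per-URL work of a fetcher thread and returns the URLs
    // whose content passed both filters.
    List<String> fetchAll(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            // Pre-fetch filter point: skip the URL before any network access.
            if (urlFilter != null && urlFilter.filter(url) == null) {
                continue;
            }
            byte[] content = fakeFetch(url);
            // Post-fetch filter point: drop content the filter rejects.
            if (contentFilter != null && contentFilter.filter(url, content) == null) {
                continue;
            }
            kept.add(url);
        }
        return kept;
    }

    // Placeholder for the real protocol fetch; just echoes the URL as bytes.
    private byte[] fakeFetch(String url) {
        return url.getBytes(StandardCharsets.UTF_8);
    }
}
```

With no filter set, every URL is fetched; calling setUrlFilter(u -> u.endsWith(".gif") ? null : u) would skip image URLs before the fetch.
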
This seems very simple and effective to me, but it does not use
ExtensionPoint, which you recommended. How would you
reimplement this using ExtensionPoint?
One issue with my code is that it accepts any URLFilter, but expects the
filter to be thread-safe, since every FetcherThread calls it. A naive user
might be tempted to use RegexURLFilter at this filter-point, which is a
bad idea.
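
One possible way to guard against this, sketched under assumptions: wrap any filter that is not known to be thread-safe so that only one thread calls it at a time. SynchronizedURLFilter is a hypothetical name, and URLFilter is again a local stand-in for the real interface.

```java
// Local stand-in for a URL filter: return the URL to keep it, null to reject it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical wrapper that serializes all calls to a delegate filter,
// so a filter that is not thread-safe can be shared by several threads.
final class SynchronizedURLFilter implements URLFilter {
    private final URLFilter delegate;

    SynchronizedURLFilter(URLFilter delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized String filter(String url) {
        // Only one thread at a time may enter the delegate.
        return delegate.filter(url);
    }
}
```

The trade-off is that all fetcher threads queue on a single lock; giving each thread its own filter instance would avoid that contention at the cost of more memory.
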
--Matt
On Mon, 20 Dec 2004 22:15:23 +0100, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi,
> we do whois-based filtering during fetchlist generation, so we use the
> url filter.
> During fetchlist generation we query the whois information and store
> it temporarily in a db.
> During indexing we use the cached information, add more metadata, and
> index all the metadata.
> Maybe next week we can show code.
>
> Stefan
>
> On 20.12.2004 at 22:05, Matt Kangas wrote:
>
> > I'm definitely more interested in a good solution. I already have
> > enough quick solutions in my private fork of the Nutch tree. :)
> >
> > I can see the plugin being very useful from a legal perspective too.
> > The GeoIP API I want to use is GPL'd, and I know there has been
> > discussion about separating all GPL'd code because of Apache
> > requirements.
> >
> > But how do you suggest we manage & load the plugins? The example I see
> > from FetcherThread.run() is:
> >
> > Protocol protocol = ProtocolFactory.getProtocol(url);
> > Content content = protocol.getContent(url);
> >
> > Since we want to stack filters immediately before & after this code,
> > does this mean we should create two new factories, one for each
> > filter-point? I dislike that the factories are only configurable via
> > the global-per-JVM conf files. What's a better approach?
> >
> > Using UrlFilter, I would simply create two UrlFilter instance
> > variables in Fetcher, and then implement the filter points as:
> >
> > if (preFetchFilter != null && preFetchFilter.filter(url) == null)
> > { continue; } // skip this URL
> > // do fetch
> > // do similar with postFetchFilter and content
> >
> > Actually, that won't work. Content isn't a String, so UrlFilter is the
> > wrong interface to use. Maybe a new ContentFilter interface? Or maybe
> > you'll show me how ExtensionPoint solves all of these problems... :-)
> >
> > --Matt
> >
> > On Mon, 20 Dec 2004 11:05:14 +0100, Stefan Groschupf
> > <[EMAIL PROTECTED]> wrote:
> >> Matt,
> >>
> >>> What do you think is the next step? Should I simply write an
> >>> implementation and post it to the list?
> >>
> >> Well feel free!
> >> The question is whether you need a quick solution or a good solution.
> >> For a good solution I would suggest changing the fetcherFilter from the
> >> old but still used interface-and-configuration approach to a plugin
> >> extension point mechanism.
> >> This would allow multiple filters as well.
> >>
> >> The whois code is already under development here and we may
> >> contribute it in the first quarter of 2005.
> >>
> >> Feel free to join us; then we can benefit from each other. :-)
> >>
> >> Greetings,
> >> Stefan
> >>
Fetcher_with_URLFilter.diff
Description: Binary data
