Stefan and/or Doug,

Here's a follow-up to my Jan 3 diff. This time I added two hooks to the
Fetcher: one for URLFilter and one for a new interface, ContentFilter.
These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing it to a segment.
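
To make the two hook points concrete, here is a minimal sketch of where
they sit in a fetch. It is not the code from the attached diff: the
interface shapes, the FetcherSketch class, the download function and the
in-memory "segment" are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for a URL filter (assumed shape):
// return the URL to keep it, or null to drop it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical shape for the new interface; the real signature is
// whatever the attached ContentFilter.java declares.
interface ContentFilter {
    boolean accept(String url, byte[] content);
}

// Hypothetical fetcher showing only the two hook points.
final class FetcherSketch {
    private final URLFilter urlFilter;
    private final ContentFilter contentFilter;
    private final Function<String, byte[]> download;       // stand-in for the network fetch
    private final List<String> segment = new ArrayList<>(); // stand-in for a segment on disk

    FetcherSketch(URLFilter urlFilter, ContentFilter contentFilter,
                  Function<String, byte[]> download) {
        this.urlFilter = urlFilter;
        this.contentFilter = contentFilter;
        this.download = download;
    }

    /** Returns true if the page made it into the segment. */
    boolean fetch(String url) {
        String kept = urlFilter.filter(url);       // hook 1: before fetching
        if (kept == null) {
            return false;
        }
        byte[] content = download.apply(kept);
        if (!contentFilter.accept(kept, content)) { // hook 2: before writing
            return false;
        }
        segment.add(kept);
        return true;
    }

    List<String> segment() {
        return segment;
    }
}
```

With this shape, a URL rejected by the first hook is never downloaded,
and a page rejected by the second hook is downloaded but never written.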

This should provide a lot of flexibility for people who don't want to
index the entire web. The only drawback I see is that the interface is
too simple to be driven from the command line; you'd have to write
your own custom CrawlTool and plug in the filters at the appropriate
point in the crawl cycle.

Speaking of CrawlTool, I think it'd be great if end users could
customize specific steps of the crawl cycle, in Java, without having to
cut-and-paste the whole class. The template method pattern is what I'm
thinking of here. Does this sound useful to anybody else?
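
As a rough sketch of the template method idea (not the existing
CrawlTool code): a base class fixes the order of the steps and a
subclass overrides only the step it cares about. The class names, the
step names and the recorded step list below are hypothetical and exist
only to show the pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class; the step names are assumptions, not CrawlTool's API.
abstract class CrawlTemplate {
    private final List<String> steps = new ArrayList<>();

    // The template method: the order of steps is fixed here and cannot be overridden.
    public final void crawl(int depth) {
        inject();
        for (int round = 0; round < depth; round++) {
            generate(round);
            fetch(round);
            update(round);
        }
        index();
    }

    // Default steps only record that they ran; real ones would do the work.
    protected void inject() { record("inject"); }
    protected void generate(int round) { record("generate " + round); }
    protected void fetch(int round) { record("fetch " + round); }
    protected void update(int round) { record("update " + round); }
    protected void index() { record("index"); }

    protected final void record(String step) { steps.add(step); }

    public List<String> steps() { return steps; }
}

// A user customizes one step without copying the rest of the cycle.
class FilteredCrawl extends CrawlTemplate {
    @Override
    protected void fetch(int round) {
        // This is where custom URL and content filters would be plugged in.
        record("filtered fetch " + round);
    }
}
```

Calling new FilteredCrawl().crawl(2) runs the inherited cycle with only
the fetch step replaced.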

--Matt

On Wed, 12 Jan 2005 10:50:15 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Good point.  I meant thread-safe, not re-entrant.
> 
> Doug
> 
> Kragen Sitaker wrote:
> > On Fri, 2005-01-07 at 11:34 -0800, Doug Cutting wrote:
> >
> >>It's usually pretty easy to replace fields that must be synchronized
> >>with ThreadLocals in order to make a class re-entrant.  Perhaps we
> >>should do this to RegexURLFilter?
> >
> >
> > Nitpick --- as far as I know, ThreadLocals don't make things re-entrant,
> > only thread-safe, which is a strictly weaker property.  RegexURLFilter
> > probably doesn't need to be re-entrant, because it's not very likely
> > that it's going to call some client-provided code in the middle of
> > filtering a URL and have that client-provided code call RegexURLFilter
> > again --- right?
> >
> > I'd hate to have to argue with someone who thinks ThreadLocals make
> > things re-entrant in some context where re-entrancy matters, having
> > gotten the idea from a trusted source.
> >

Attachment: fetcher_add_url_content_filters.diff
Description: Binary data

Attachment: ContentFilter.java
Description: Binary data
