Stefan and/or Doug,

Here's a followup to my Jan 3 diff. This time I added two hooks to the Fetcher, for URLFilter and also for a new interface, ContentFilter. These allow one to:

- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment

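
For anyone who doesn't want to open the diff, here is a rough sketch of where the two hooks sit in a fetch loop. It is not the code in the attached diff: the interfaces are local stand-ins, the `accept(url, content)` signature for ContentFilter is hypothetical (the real one is in the attached ContentFilter.java), and `FilterHookSketch`, `fetch` and `crawl` are made-up names with a fake fetch in place of real network access.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the URL filter hook (assumed shape, for illustration only):
// return the URL to keep it, or null to drop it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical stand-in for the new ContentFilter hook; the real signature
// lives in the attached ContentFilter.java and may differ.
interface ContentFilter {
    boolean accept(String url, byte[] content);
}

class FilterHookSketch {
    // Fake fetch so the sketch runs offline; a real fetcher would hit the network.
    static byte[] fetch(String url) {
        return ("<html>page at " + url + "</html>").getBytes(StandardCharsets.UTF_8);
    }

    // Returns the URLs whose content would be written to the segment.
    static List<String> crawl(List<String> urls, URLFilter urlFilter, ContentFilter contentFilter) {
        List<String> written = new ArrayList<>();
        for (String url : urls) {
            String kept = urlFilter.filter(url);      // hook 1: before fetching
            if (kept == null) {
                continue;                             // rejected, never fetched
            }
            byte[] content = fetch(kept);
            if (!contentFilter.accept(kept, content)) {
                continue;                             // hook 2: fetched but not written
            }
            written.add(kept);
        }
        return written;
    }
}
```

With these stand-ins, a site-restricted crawl would pass something like `url -> url.contains("example.org") ? url : null` as the first filter and a size or type check as the second.
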
This should provide a lot of flexibility for people who don't want to index the entire web. The only drawback I see is that the interface is too simple to be leveraged from the command-line; you'd have to make your own custom CrawlTool and plug in filters at the appropriate point in the crawl cycle.

Speaking of CrawlTool, I think it'd be great if end users could customize specific steps of the crawl cycle, in Java, w/o having to cut-and-paste the whole class. Template method is the pattern I'm thinking of here. Does this sound useful to anybody else?

--Matt

On Wed, 12 Jan 2005 10:50:15 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Good point. I meant thread-safe, not re-entrant.
>
> Doug
>
> Kragen Sitaker wrote:
> > On Fri, 2005-01-07 at 11:34 -0800, Doug Cutting wrote:
> >
> >>It's usually pretty easy to replace fields that must be synchronized
> >>with ThreadLocals in order to make a class re-entrant. Perhaps we
> >>should do this to RegexURLFilter?
> >
> >
> > Nitpick --- as far as I know, ThreadLocals don't make things re-entrant,
> > only thread-safe, which is a strictly weaker property. RegexURLFilter
> > probably doesn't need to be re-entrant, because it's not very likely
> > that it's going to call some client-provided code in the middle of
> > filtering a URL and have that client-provided code call RegexURLFilter
> > again --- right?
> >
> > I'd hate to have to argue with someone who thinks ThreadLocals make
> > things re-entrant in some context where re-entrancy matters, having
> > gotten the idea from a trusted source.
> >

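
On the ThreadLocal suggestion quoted above, here is a minimal sketch of the general idea: mutable matching state moves from a shared, synchronized field into a per-thread copy. It uses `java.util.regex` and a made-up class name, `ThreadLocalRegexFilterSketch`; I have not checked which fields RegexURLFilter actually synchronizes on, so treat this as an illustration of the technique, not a patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical filter, not the real RegexURLFilter.
class ThreadLocalRegexFilterSketch {
    // Matcher is mutable and must not be shared between threads,
    // so each thread lazily gets its own instance.
    private final ThreadLocal<Matcher> matcher;

    ThreadLocalRegexFilterSketch(String rejectRegex) {
        final Pattern reject = Pattern.compile(rejectRegex); // Pattern itself is immutable
        this.matcher = ThreadLocal.withInitial(() -> reject.matcher(""));
    }

    // Returns the URL to keep it, or null to drop it. No synchronized block needed.
    String filter(String url) {
        Matcher m = matcher.get().reset(url); // reuse this thread's Matcher
        return m.find() ? null : url;
    }
}
```

This also shows the distinction Kragen draws: two threads can call `filter` at once without interfering, but if `filter` were somehow re-entered on the same thread in the middle of a match, the inner call would reset the very Matcher the outer call is still using. So the class is thread-safe, not re-entrant.
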
fetcher_add_url_content_filters.diff
Description: Binary data
ContentFilter.java
Description: Binary data
