Matt Kangas wrote:
Stefan and/or Doug,

Here's a followup to my Jan 3 diff. This time I added two hooks to the
Fetcher, for URLFilter and also for a new interface, ContentFilter.
These allow one to:
- filter out URLs prior to fetching, and
- filter out fetched content prior to writing to a segment

While the idea of ContentFilter is very useful, I have some doubts regarding the use of URLFilter during fetching. If you don't want to fetch some URLs, then you should not put them in the fetchlist in the first place. In other words, I think this part of the patch should be moved to FetchListTool.java, between lines 508 and 509.
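
To illustrate the point, here is a rough sketch of filtering at fetchlist-generation time rather than in the Fetcher. It is not the actual FetchListTool code: UrlFilterSketch, FetchListSketch and buildFetchList are made-up names, and the "return null to reject" convention is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a URL filter (not the real Nutch interface):
// return the URL to keep it, or null to drop it.
interface UrlFilterSketch {
    String filter(String url);
}

class FetchListSketch {
    // Apply the filter while the fetchlist is being built,
    // so rejected URLs never reach the fetcher at all.
    static List<String> buildFetchList(List<String> candidates, UrlFilterSketch filter) {
        List<String> fetchList = new ArrayList<>();
        for (String url : candidates) {
            String kept = filter.filter(url);
            if (kept != null) {
                fetchList.add(kept);
            }
        }
        return fetchList;
    }

    public static void main(String[] args) {
        // Example filter for the sketch: keep only plain http URLs.
        UrlFilterSketch httpOnly = url -> url.startsWith("http://") ? url : null;
        List<String> candidates =
                Arrays.asList("http://example.org/", "ftp://example.org/file");
        System.out.println(buildFetchList(candidates, httpOnly));
    }
}
```

With this arrangement the Fetcher needs no URL hook; it simply fetches whatever the list contains.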


Also, in other places we use the factory pattern to get an instance of URLFilter, without using setters. Perhaps we should use the same pattern here as well?
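
Roughly what I mean, as a self-contained sketch. All names here (PluggableUrlFilter, PassAllUrlFilter, UrlFilterFactorySketch, FetcherSketch, and the sketch.urlfilter.class property) are invented for illustration and are not the existing Nutch classes or configuration keys.

```java
// Hypothetical filter interface for this sketch:
// return the URL to keep it, or null to drop it.
interface PluggableUrlFilter {
    String filter(String url);
}

// Default implementation that accepts every URL.
class PassAllUrlFilter implements PluggableUrlFilter {
    public String filter(String url) {
        return url;
    }
}

final class UrlFilterFactorySketch {
    // Hypothetical property name; a real tool would read its own configuration.
    static final String FILTER_CLASS_PROPERTY = "sketch.urlfilter.class";

    private static PluggableUrlFilter instance;

    private UrlFilterFactorySketch() {
    }

    // Create the configured filter once, then hand out the same instance.
    static synchronized PluggableUrlFilter getFilter() {
        if (instance == null) {
            String className = System.getProperty(
                    FILTER_CLASS_PROPERTY, PassAllUrlFilter.class.getName());
            try {
                instance = (PluggableUrlFilter) Class.forName(className)
                        .getDeclaredConstructor()
                        .newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(
                        "Cannot create URL filter " + className, e);
            }
        }
        return instance;
    }
}

class FetcherSketch {
    // No setter: the component asks the factory for its filter.
    private final PluggableUrlFilter urlFilter = UrlFilterFactorySketch.getFilter();

    boolean shouldFetch(String url) {
        return urlFilter.filter(url) != null;
    }
}
```

The caller never chooses the implementation; swapping filters becomes a configuration change rather than a code change.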


This should provide a lot of flexibility for people who don't want to index the entire web. The only drawback I see is that the interface is too simple to be leveraged from the command-line; you'd have to make your own custom CrawlTool and plug in filters at the appropriate point in the crawl cycle.

There is a middle-ground solution here, I think: you could implement a simple content filter which filters, e.g., based on a regex match against the content metadata. The regexes could be read from a text file. The filter could then be activated from the command line with a switch pointing to the location of the regex file.
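
A minimal sketch of such a filter follows. The ContentFilter interface in the patch may look different, so ContentFilterSketch, RegexMetadataFilter, the -contentFilterFile switch and the file format (one regex per line, '#' for comments) are all assumptions. I also assume that a match means "reject"; the opposite convention would work just as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical interface: true means the content may be written to the segment.
interface ContentFilterSketch {
    boolean accept(Map<String, String> metadata);
}

class RegexMetadataFilter implements ContentFilterSketch {
    // Hypothetical command-line switch naming the regex file.
    static final String SWITCH = "-contentFilterFile";

    private final List<Pattern> rejectPatterns = new ArrayList<>();

    // One regex per line; blank lines and lines starting with '#' are skipped.
    RegexMetadataFilter(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                rejectPatterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // Each metadata entry is rendered as "name: value" and tested against
    // every pattern; a single match rejects the content.
    public boolean accept(Map<String, String> metadata) {
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String rendered = entry.getKey() + ": " + entry.getValue();
            for (Pattern pattern : rejectPatterns) {
                if (pattern.matcher(rendered).find()) {
                    return false;
                }
            }
        }
        return true;
    }

    // Build the filter from the command line; accept everything if the
    // switch is absent.
    static ContentFilterSketch fromArgs(String[] args) throws IOException {
        for (int i = 0; i + 1 < args.length; i++) {
            if (SWITCH.equals(args[i])) {
                return new RegexMetadataFilter(
                        Files.readAllLines(Paths.get(args[i + 1])));
            }
        }
        return metadata -> true;
    }
}
```

For example, a file containing the single line `(?i)^content-type: image/` would keep images out of the segment, with no custom CrawlTool required.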


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
