Andrzej: On the same note, let me list examples of the kinds of analysis that should be helpful, and I'd appreciate it if you could point me to an appropriate place to add the code. Right now these sit outside Nutch for us, but it would be nice to integrate them.
1. Content - total size < X bytes -- discard and mark.
2. Content - HTML tag to content ratio < threshold -- discard and mark.
3. Link analysis - the incoming-to-outgoing link ratio is too low.
4. File size - the maximum file size to fetch, based on type. For example, a 64k file may be fine for HTML but not for a PDF -- currently in Nutch this causes a "fetched but cannot parse" error. It would be nice to have a property in the plugin xml file that specifies the max fetch bytes and the action to take when that limit is hit (parse or discard). A rough sketch of such checks appears after the quoted thread below.

Thx

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrzej Bialecki
Sent: Monday, January 17, 2005 7:34 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Implementing geography-by-IP filtering?

Matt Kangas wrote:
> Stefan and/or Doug,
>
> Here's a followup to my Jan 3 diff. This time I added two hooks to the
> Fetcher, for URLFilter and also for a new interface, ContentFilter.
> These allow one to:
> - filter out URLs prior to fetching, and
> - filter out fetched content prior to writing to a segment

While the idea of ContentFilter is very useful, I have some doubts about the use of URLFilter during fetching. If you don't want to fetch some URLs, you should not put them in the fetchlist in the first place. In other words, I think this patch should be moved to FetchListTool.java, between lines 508-509. Also, in other places we use the factory pattern to get an instance of URLFilter, without using setters. Perhaps we should use the same pattern here as well?

> This should provide a lot of flexibility for people who don't want to
> index the entire web. The only drawback I see is that the interface is
> too simple to be leveraged from the command-line; you'd have to make
> your own custom CrawlTool and plug in filters at the appropriate point
> in the crawl cycle.

There is a middle-ground solution here, I think: you could implement a simple content filter which filters, e.g., based on a regex match against the content metadata. The regexes could be read from a text file, and the filter could then be activated from the command line with a switch pointing to the location of the regex file.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
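[Editor's sketch] To make the four checks at the top of this message concrete, here is a minimal sketch in Java. The ContentFilter name comes from Matt's patch, but the method signature, the "filter.reason" metadata key, and the per-type byte limits are assumptions made for illustration; they are not the actual patch or a Nutch API.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Hypothetical ContentFilter hook. Returning null means "discard and mark";
 * the reason is recorded in the metadata so the segment writer could act on it.
 */
interface ContentFilter {
  /** @return the content bytes to keep, or null to discard the page */
  byte[] filter(String url, String contentType, byte[] content, Properties meta);
}

/** Example implementation of checks 1, 2 and 4 from the list above. */
class SizeAndRatioContentFilter implements ContentFilter {

  private final int minBytes;          // check 1: total size < X bytes -> discard
  private final double minTextRatio;   // check 2: tag-to-content ratio threshold
  private final Map<String, Integer> maxBytesPerType = new HashMap<String, Integer>();

  SizeAndRatioContentFilter(int minBytes, double minTextRatio) {
    this.minBytes = minBytes;
    this.minTextRatio = minTextRatio;
    // check 4: per-type fetch limits; the values here are made up for illustration
    maxBytesPerType.put("text/html", 1024 * 1024);
    maxBytesPerType.put("application/pdf", 64 * 1024);
  }

  public byte[] filter(String url, String contentType, byte[] content, Properties meta) {
    if (content == null || content.length < minBytes) {
      meta.setProperty("filter.reason", "too-small");
      return null;                     // discard and mark
    }
    Integer max = maxBytesPerType.get(contentType);
    if (max != null && content.length > max.intValue()) {
      meta.setProperty("filter.reason", "over-max-for-type");
      return null;                     // discard instead of "fetched but cannot parse"
    }
    if ("text/html".equals(contentType)) {
      String html = new String(content);   // default charset is good enough for a rough ratio
      // crude tag-to-content ratio: strip tags and compare remaining text length
      int textLen = html.replaceAll("<[^>]*>", "").trim().length();
      double ratio = html.length() == 0 ? 0.0 : (double) textLen / html.length();
      if (ratio < minTextRatio) {
        meta.setProperty("filter.reason", "low-text-ratio");
        return null;
      }
    }
    return content;
  }
}

Check 3 (incoming-to-outgoing link ratio) is left out on purpose: it needs link-graph data that is not available at fetch time, so it would belong in a post-processing step rather than in this hook.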

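[Editor's sketch] Andrzej's middle-ground suggestion could look roughly like the following, reusing the hypothetical ContentFilter interface from the sketch above. The file format (one regex per line, '#' for comments) is an assumption, and no such filter or command-line switch exists in Nutch as of this thread.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

/**
 * Regex-over-metadata filter: patterns are read from a plain text file and
 * any match against a metadata value causes the page to be discarded before
 * it is written to the segment.
 */
class RegexMetadataFilter implements ContentFilter {

  private final List<Pattern> patterns = new ArrayList<Pattern>();

  RegexMetadataFilter(String regexFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(regexFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0 || line.startsWith("#")) continue;   // skip blanks and comments
        patterns.add(Pattern.compile(line));
      }
    } finally {
      in.close();
    }
  }

  public byte[] filter(String url, String contentType, byte[] content, Properties meta) {
    for (Object key : meta.keySet()) {
      String value = meta.getProperty((String) key);
      if (value == null) continue;
      for (Pattern p : patterns) {
        if (p.matcher(value).find()) {
          meta.setProperty("filter.reason", "metadata-regex: " + p.pattern());
          return null;                 // discard prior to writing the segment
        }
      }
    }
    return content;
  }
}

Passing the regex file location in through a command-line switch (or a config property), as Andrzej suggests, would then be a small change to whichever tool drives the crawl; the flag name and wiring are left to whoever implements it.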