Andrzej:

On the same note, let me list a few examples of analyses that would be
helpful, and I'd appreciate it if you could point out an appropriate place to
add the code (a rough sketch of some of them follows the list). Right now
these live outside Nutch for us, but it would be nice to integrate them into
Nutch.

1. Content: total size < X bytes -- discard and mark.

2. Content: HTML-tag-to-content ratio below a threshold -- discard and mark.

3. Link analysis: the incoming-to-outgoing link ratio is too low.

4. File size: the maximum file size to fetch should depend on the content
type. For example, a 64k limit may be fine for HTML but not for a PDF -- today
a PDF truncated that way causes a "fetched but cannot parse" error in Nutch.
It would be nice to have a property in the plugin xml file that specifies the
maximum fetch bytes, and the action to take when the limit is hit (parse
anyway or discard).
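
Here is a rough, standalone sketch of how I picture checks 1, 2 and 4 as a
single keep/discard test. The class, constants and thresholds are all made up
for illustration and are not existing Nutch code; I read item 2 as "discard
pages that are mostly markup" (flip the comparison if the intent is the
opposite):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; names and thresholds are invented for illustration.
public class ContentCheck {

  // Item 1: minimum total size in bytes; anything smaller is discarded and marked.
  private static final int MIN_CONTENT_BYTES = 512;

  // Item 2: minimum fraction of visible text after stripping tags.
  private static final double MIN_TEXT_FRACTION = 0.05;

  // Item 4: per-content-type fetch limits (values are illustrative only).
  private static final Map<String, Integer> MAX_BYTES_BY_TYPE =
      new HashMap<String, Integer>();
  static {
    MAX_BYTES_BY_TYPE.put("text/html", 64 * 1024);
    MAX_BYTES_BY_TYPE.put("application/pdf", 4 * 1024 * 1024);
  }

  /** Returns true to keep the fetched content, false to discard and mark it. */
  public static boolean keep(byte[] raw, String contentType) {
    if (raw == null || raw.length < MIN_CONTENT_BYTES) {
      return false;                                   // item 1: too small
    }
    Integer max = MAX_BYTES_BY_TYPE.get(contentType);
    if (max != null && raw.length > max.intValue()) {
      return false;                                   // item 4: too large for its type
    }
    if ("text/html".equals(contentType)) {
      String html = new String(raw);
      String text = html.replaceAll("<[^>]*>", " ").trim();  // crude tag stripping
      double textFraction = (double) text.length() / html.length();
      if (textFraction < MIN_TEXT_FRACTION) {
        return false;                                 // item 2: mostly markup
      }
    }
    return true;
  }
}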


Thx


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andrzej
Bialecki
Sent: Monday, January 17, 2005 7:34 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Implementing geography-by-IP filtering?

Matt Kangas wrote:
> Stefan and/or Doug,
> 
> Here's a followup to my Jan 3 diff. This time I added two hooks to the 
> Fetcher, for URLFilter and also for a new interface, ContentFilter.
> These allow one to:
> - filter out URLs prior to fetching, and
> - filter out fetched content prior to writing to a segment

While the idea of a ContentFilter is very useful, I have some doubts about
using a URLFilter during fetching. If you don't want to fetch some URLs, you
should not put them in the fetchlist in the first place. In other words, I
think this part of the patch should be moved into FetchListTool.java, between
lines 508-509.

Also, in other places we use the factory pattern to obtain a URLFilter
instance, rather than injecting one through setters. Perhaps we should use the
same pattern here as well?
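
To make both points a bit more concrete, here is a rough sketch of what I have
in mind: the filter is obtained from a factory and applied while the fetchlist
is being built, so unwanted URLs never reach the Fetcher. The class names
below are invented for illustration and are not the actual Nutch classes:

// Hypothetical illustration only; these class names are invented.
public class UrlFilterExample {

  /** Minimal filter contract: return the URL to keep it, or null to drop it. */
  public interface UrlFilter {
    String filter(String url);
  }

  /** Factory style: callers ask for a configured instance instead of having
      one pushed in through a setter. */
  public static final class UrlFilterFactory {
    public static UrlFilter getFilter() {
      // In practice this would read the implementation class name from the
      // configuration and instantiate it reflectively; a fixed rule here.
      return new UrlFilter() {
        public String filter(String url) {
          return url.startsWith("http://") ? url : null;
        }
      };
    }
  }

  /** Fetchlist-side usage: drop URLs before they are ever written out,
      so the Fetcher never sees them. */
  public static boolean shouldAddToFetchlist(String url) {
    return UrlFilterFactory.getFilter().filter(url) != null;
  }

  public static void main(String[] args) {
    System.out.println(shouldAddToFetchlist("http://example.com/")); // true
    System.out.println(shouldAddToFetchlist("ftp://example.com/"));  // false
  }
}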

> 
> This should provide a lot of flexibility for people who don't want to 
> index the entire web. The only drawback I see is that the interface is 
> too simple to be leveraged from the command-line; you'd have to make 
> your own custom CrawlTool and plug in filters at the appropriate point 
> in the crawl cycle.

There is a middle-ground solution here, I think: you could implement a simple
content filter that filters, e.g., based on a regex match against the content
metadata. The regexes could be read from a text file, and the filter could
then be activated from the command line with a switch pointing to the location
of the regex file.
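
As a sketch of what such a filter might look like (standalone for
illustration, not wired into the ContentFilter hook from your patch; the class
name and file format are assumptions, with one regex per line and '#' starting
a comment):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexContentFilter {

  private final List<Pattern> patterns = new ArrayList<Pattern>();

  /** Load one regex per line from the file named on the command line. */
  public RegexContentFilter(String regexFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(regexFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0 && !line.startsWith("#")) {
          patterns.add(Pattern.compile(line));
        }
      }
    } finally {
      in.close();
    }
  }

  /** Returns false (discard) if any metadata value matches any configured regex. */
  public boolean keep(Map<String, String> metadata) {
    for (String value : metadata.values()) {
      for (Pattern p : patterns) {
        if (p.matcher(value).find()) {
          return false;
        }
      }
    }
    return true;
  }
}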

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


