[Nutch-dev] Re: Urlfilter Patch

Matt Kangas Thu, 01 Dec 2005 18:44:33 -0800

Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (esp. if we discard it).


I guess the question is, what's worse:
- two server hits when we find content we want?, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?


--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug
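
For illustration only, here is a rough sketch of the HEAD-before-GET idea discussed above. It is not how Nutch works, and it is not Nutch code: the class name HeadProbe, the WANTED allow-list, and the helper names are all made up for this example. It uses only the standard java.net classes (Java 9+), and it leaves out the politeness delay that Doug points out would still be needed between requests to the same host.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class HeadProbe {
    // Hypothetical allow-list of mime-types worth fetching; not taken from Nutch.
    static final Set<String> WANTED =
            Set.of("text/html", "text/plain", "application/xhtml+xml");

    /** Strips parameters such as "; charset=utf-8" and lower-cases the media type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the Content-Type header names a type on the allow-list. */
    static boolean isWanted(String contentType) {
        return WANTED.contains(baseType(contentType));
    }

    /** Issues a HEAD request and returns the raw Content-Type header (may be null). */
    static String headContentType(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no response body
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            conn.getResponseCode();      // forces the request to be sent
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HeadProbe <url>");
            return;
        }
        String type = headContentType(args[0]);
        // A real fetcher would wait out its per-host delay before the GET.
        System.out.println(args[0] + " -> " + type
                + (isWanted(type) ? " (would GET)" : " (would skip)"));
    }
}
```

Note that this costs two server hits for every page that is kept, and that a server may report a different Content-Type on HEAD than on GET (or none at all), so a check like this could only complement the url-based filters, not replace them.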


--
Matt Kangas / [EMAIL PROTECTED]




