Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves bandwidth whenever it lets us skip the GET. That's a cost savings for them over a blind fetch, especially if we would have discarded the content anyway.

I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?
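
To make the trade-off concrete, here is a rough sketch of the HEAD-before-GET idea. It is in Python purely for illustration; it is not how Nutch works today, and the names `WANTED_TYPES`, `is_wanted` and `fetch_if_wanted` are hypothetical, as is the allow-list of mime-types. It also assumes the server returns a sensible Content-Type on HEAD, which is not guaranteed.

```python
import urllib.request

# Hypothetical allow-list; a real installation would take this from its own config.
WANTED_TYPES = {"text/html", "text/plain", "application/xhtml+xml"}


def is_wanted(content_type, wanted=WANTED_TYPES):
    """Return True if a Content-Type header value names a type we would keep."""
    if not content_type:
        return True  # no header: we can't tell, so fall back to fetching
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in wanted


def fetch_if_wanted(url, timeout=10):
    """Issue a HEAD first; only GET the body when the type looks useful."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        content_type = resp.headers.get("Content-Type")
    if not is_wanted(content_type):
        return None  # one server hit, no body transferred
    # Second server hit: this is the extra cost when the content is wanted.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Note that the sketch leaves out the politeness delay between the HEAD and the GET, so a real fetcher would pay that wait twice for every page it keeps.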

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug

--
Matt Kangas / [EMAIL PROTECTED]




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
