Doug,

After sleeping on this idea, I realized that there's a middle ground that may give us (and website operators) the best of both worlds.

The question: how to avoid fetching unparseable content?

Value in answering this:
- save crawl operators bandwidth, disk space, and CPU time
- save website operators bandwidth (and maybe CPU time) = be better web citizens

Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate answer; sample entries below)
- HTTP HEAD before GET (cheaper than a blind GET, but mainly saves bandwidth, not server CPU)
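
For reference, each line in regex-urlfilter.txt is a regex prefixed with "+" (fetch) or "-" (skip), and the first matching rule wins. Something like this (the extension list here is just an illustration, not the shipped default):

  # skip extensions we can't parse anyway
  -\.(gif|jpg|png|css|zip|gz|exe|mpg|mov)$
  # accept everything else
  +.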

Proposed strategy:

1) Define regex-urlfilter.txt, as we do now. Continue to weed out known-unparseable file extensions as early as possible.
2) Also define another regex list for extensions that are very likely to be text/html (e.g. .html, .php). Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is unparseable, do not follow with HTTP GET. (Sketched below.)
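
To make the flow concrete, here is a rough Java sketch of the decision logic. The class name, the regex lists, and the use of java.net.HttpURLConnection are my own assumptions, not actual Nutch code, and it omits politeness delays and robots.txt handling:

  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.regex.Pattern;

  public class FetchDecider {

    // Step 1: extensions we know we can't parse -- reject with no network I/O.
    private static final Pattern REJECT =
        Pattern.compile("(?i)\\.(gif|jpg|png|css|zip|gz|exe|mpg|mov)$");

    // Step 2: extensions very likely to be text/html -- fetch blindly, no HEAD.
    private static final Pattern LIKELY_HTML =
        Pattern.compile("(?i)\\.(html?|php|asp|jsp)$");

    /** Returns true if the URL is worth an HTTP GET. */
    public static boolean shouldFetch(String url) throws java.io.IOException {
      if (REJECT.matcher(url).find()) {
        return false;                       // step 1: weeded out early
      }
      if (LIKELY_HTML.matcher(url).find()) {
        return true;                        // step 2: skip the extra server hit
      }
      // Step 3: questionable case -- issue a HEAD and inspect the mime-type.
      // (A real implementation would also honor politeness delays here.)
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("HEAD");
      String type = conn.getContentType();  // e.g. "text/html; charset=UTF-8"
      conn.disconnect();
      return type != null && type.startsWith("text/");
    }
  }

So shouldFetch("http://example.com/report.pdf") would take the HEAD path, and the GET would be skipped if the server reports application/pdf. (URLs with query strings would need looser patterns than these $-anchored ones, but that's a detail.)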

Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)

Disadvantages: ?


On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:

Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (esp. if we discard it).

I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug

--
Matt Kangas / [EMAIL PROTECTED]



--
Matt Kangas / [EMAIL PROTECTED]



