Doug,

After sleeping on this idea, I realized that there's a middle ground that may give us (and website operators) the best of both worlds.

The question: how to avoid fetching unparseable content?

Value in answering this:
- save crawl operators bandwidth, disk space, and CPU time
- save website operators bandwidth (and maybe CPU time) = be better web citizens

Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate answer; sample entries below)
- HTTP HEAD before GET (cheaper than a blind GET, but mainly saves bandwidth, not server CPU)
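
For reference, each line in regex-urlfilter.txt is a regex prefixed with "+" (fetch) or "-" (skip), and the first matching rule wins. Something like this (the extension list here is just an illustration, not the shipped default):

  # skip extensions we can't parse anyway
  -\.(gif|jpg|png|css|zip|gz|exe|mpg|mov)$
  # accept everything else
  +.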

Proposed strategy:

1) Define regex-urlfilter.txt, as we do now. Continue to weed out known-unparseable file extensions as early as possible.
2) Also define another regex list for extensions that are very likely to be text/html (e.g. .html, .php). Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is unparseable, do not follow with HTTP GET. (Sketched below.)
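
To make the flow concrete, here is a rough Java sketch of the decision logic. The class name, the regex lists, and the use of java.net.HttpURLConnection are my own assumptions, not actual Nutch code, and it omits politeness delays and robots.txt handling:

  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.regex.Pattern;

  public class FetchDecider {

    // Step 1: extensions we know we can't parse -- reject with no network I/O.
    private static final Pattern REJECT =
        Pattern.compile("(?i)\\.(gif|jpg|png|css|zip|gz|exe|mpg|mov)$");

    // Step 2: extensions very likely to be text/html -- fetch blindly, no HEAD.
    private static final Pattern LIKELY_HTML =
        Pattern.compile("(?i)\\.(html?|php|asp|jsp)$");

    /** Returns true if the URL is worth an HTTP GET. */
    public static boolean shouldFetch(String url) throws java.io.IOException {
      if (REJECT.matcher(url).find()) {
        return false;                       // step 1: weeded out early
      }
      if (LIKELY_HTML.matcher(url).find()) {
        return true;                        // step 2: skip the extra server hit
      }
      // Step 3: questionable case -- issue a HEAD and inspect the mime-type.
      // (A real implementation would also honor politeness delays here.)
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("HEAD");
      String type = conn.getContentType();  // e.g. "text/html; charset=UTF-8"
      conn.disconnect();
      return type != null && type.startsWith("text/");
    }
  }

So shouldFetch("http://example.com/report.pdf") would take the HEAD path, and the GET would be skipped if the server reports application/pdf. (URLs with query strings would need looser patterns than these $-anchored ones, but that's a detail.)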

Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)

Disadvantages: ?


On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:

Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for them over doing a blind fetch (esp. if we discard it).

I guess the question is, what's worse:
- two server hits when we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore anyway?

--matt

On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based filters. A HEAD request must still be polite, so this could substantially slow fetching, as it would incur more delays. Also, for most dynamic pages, a HEAD is as expensive for the server as a GET, so this would cause more load on servers.

Doug

--
Matt Kangas / [EMAIL PROTECTED]



--
Matt Kangas / [EMAIL PROTECTED]



