[Nutch-dev] RE: Urlfilter Patch

Chris Mattmann Thu, 01 Dec 2005 14:09:51 -0800

Hi Doug,

> 
> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> > whether it ends in .foo, .bar or the like.
> 
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that. 

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
> 
> It's not how Nutch works now, but this might be more useful than a 
> super-detailed set of regexes...

I liked Matt's idea of the HEAD request, though. I wonder if some benchmarks
of its performance would be useful, because in some cases (such as focused
crawling, or "non-whole-internet" crawling of an intranet, etc.), it would
seem that knowing the content-type before the GET would be worth the cost of
the extra HEAD request...
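
To make the HEAD idea concrete, here is a minimal sketch in Python (not Nutch
code, and not how Nutch works today) of probing the content type before
deciding whether to issue the GET. The function names and the PARSEABLE_TYPES
allow-list are hypothetical, chosen only for illustration; a real crawler
would also need politeness delays, redirect handling, and a policy for
servers that answer HEAD incorrectly.

```python
import urllib.request

# Hypothetical allow-list of types the crawler can parse; not taken from Nutch.
PARSEABLE_TYPES = {"text/html", "text/plain", "application/pdf"}


def base_mime_type(content_type_header):
    """Strip parameters such as '; charset=UTF-8' and lower-case the type."""
    if not content_type_header:
        return None
    return content_type_header.split(";", 1)[0].strip().lower() or None


def head_content_type(url, timeout=10.0):
    """Issue an HTTP HEAD and return the base MIME type, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return base_mime_type(response.headers.get("Content-Type"))


def worth_fetching(url):
    """Decide whether to GET the URL, based only on the HEAD response."""
    try:
        mime_type = head_content_type(url)
    except OSError:
        # HEAD failed, timed out, or was rejected; assume we should still try the GET.
        return True
    # A missing Content-Type tells us nothing, so do not filter on it.
    return mime_type is None or mime_type in PARSEABLE_TYPES
```

Each URL that passes the filter costs two round trips instead of one, which
is the penalty a benchmark would have to weigh against the GETs it avoids.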

Cheers,
  Chris

