Hi Doug,

> > Chris Mattmann wrote:
> > In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> > whether it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that.

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD
> before the HTTP GET, and determine the mime-type before actually
> grabbing the content.
>
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...

I liked Matt's idea of the HEAD request though. I wonder if some
benchmarks on the performance of this would be useful, because in some
cases (such as focused crawling, or "non-whole-internet" crawling, such
as an intranet), it would seem that the extra HEAD request to get the
content-type would be worth its performance cost...

Cheers,
Chris
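
For reference, a minimal sketch of the HEAD-before-GET idea discussed above. It is written in Python for brevity and is not Nutch code; the function names and the PARSEABLE_TYPES allow-list are hypothetical, chosen only for illustration. It assumes the server answers HEAD with a usable Content-Type header, which not every server does.

```python
import urllib.request

# Hypothetical allow-list (an assumption, not taken from Nutch); a real
# crawler would derive this from whichever parsers it has configured.
PARSEABLE_TYPES = {"text/html", "text/plain", "application/pdf"}


def normalize_content_type(header_value):
    """Strip parameters such as '; charset=utf-8' and lowercase the media type."""
    if not header_value:
        return None
    return header_value.split(";", 1)[0].strip().lower() or None


def head_content_type(url, timeout=10.0):
    """Issue an HTTP HEAD request and return the normalized Content-Type, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return normalize_content_type(response.headers.get("Content-Type"))


def worth_fetching(content_type, parseable=PARSEABLE_TYPES):
    """Decide whether to follow the HEAD with a full GET.

    A missing Content-Type is fetched anyway, since the server gave us
    nothing to filter on.
    """
    return content_type is None or content_type in parseable
```

A crawler would call worth_fetching(head_content_type(url)) before issuing the GET. Each URL that passes costs one extra round trip, which is the penalty the proposed benchmarks would need to measure.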