Hi Doug,

> Chris Mattmann wrote:
> > In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of whether
> > it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that.

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD
> before the HTTP GET, and determine the mime-type before actually
> grabbing the content.
>
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...

I liked Matt's idea of the HEAD request though. I wonder if some benchmarks on the performance of this would be useful, because in some cases (such as focused crawling, or "non-whole-internet" crawling of an intranet, etc.), it would seem that performing the HEAD to get the content-type would be useful, and worth the performance penalty...

Cheers,
Chris
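
For reference, the HEAD-before-GET idea discussed above could be sketched roughly as follows. This is only an illustration and not how Nutch works today: the class `HeadProbe`, its method names, the timeout, and the set of "parseable" types are all hypothetical, and it uses only the JDK's `HttpURLConnection` rather than any Nutch protocol plugin.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): ask for the Content-Type via HEAD before a GET. */
public class HeadProbe {

    /** Drops parameters such as "; charset=UTF-8" and lower-cases the type; null if absent or empty. */
    static String baseType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semi = contentTypeHeader.indexOf(';');
        String base = semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    /** Assumed policy: fetch when the type is parseable, and also when the server sent no type. */
    static boolean worthFetching(String contentTypeHeader, Set<String> parseable) {
        String base = baseType(contentTypeHeader);
        return base == null || parseable.contains(base);
    }

    /** Issues a HEAD request and returns the raw Content-Type header, or null if none was sent. */
    static String headContentType(URL url, int timeoutMs) throws IOException {
        // Assumes an http/https URL; other schemes would fail this cast.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.getResponseCode(); // forces the request to be sent
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative list only; a real crawler would derive this from its configured parsers.
        Set<String> parseable = Set.of("text/html", "text/plain", "application/pdf");
        for (String arg : args) {
            String header = headContentType(URI.create(arg).toURL(), 5000);
            System.out.println(arg + " -> " + header + " fetch=" + worthFetching(header, parseable));
        }
    }
}
```

Each URL costs one extra round trip this way, which is the penalty a benchmark would need to weigh against the downloads it avoids. Servers that omit or misreport Content-Type on HEAD are treated here as "fetch anyway", which is an assumption rather than a recommendation.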
