I'm not so sure. When crawling Apache we had trouble with this feature. Nutch decided to treat some HTML files, which had an XML header and which the server identified as "text/html", as XML rather than HTML.
Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and other XML-based files are not always correctly identified. (This problem is well known, and it causes trouble, for instance, with RSS feeds that are served with a text/xml content-type.)
We had to turn off the guessing of content types to index Apache correctly.
Instead of turning off the guessing of content types, you only need to remove the magic for XML in mime-types.xml. In the new version (based on freedesktop), which has been sleeping on my disk for a while, I think such problems are solved, since it introduces a lot of information not included in the current version: a hierarchy between content-types (text/html is a subclass of text/xml), a way to express complex magic clauses, and so on. For instance, it can now correctly identify RSS documents. Today, RSS feeds are generally served with a generic text/xml content-type and we cannot identify them, so they fall back to the generic parse-text parser.
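To make the idea of a content-type hierarchy concrete, here is a minimal sketch in Python (not Nutch code). It assumes a hypothetical subclass table and a very naive root-element "magic" check. The names PARENTS, sniff and resolve are made up for illustration, and application/rss+xml is used only as a conventional label. The point is that a guess is accepted only when it refines the type declared by the server, so a generic XML guess never overrides a declared text/html.

```python
import re

# Hypothetical subclass table (child -> parent), following the example above.
# The real freedesktop database is much richer; this is only an illustration.
PARENTS = {
    "text/html": "text/xml",
    "application/rss+xml": "text/xml",
    "text/xml": "text/plain",
}


def is_subtype(child, ancestor):
    """Return True if `child` equals `ancestor` or descends from it."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENTS.get(child)
    return False


def sniff(content):
    """Naive magic: look at the first element name in the first 1024 bytes."""
    head = content[:1024].decode("utf-8", errors="replace")
    # "<?xml", "<!DOCTYPE" and "<!--" are skipped because a letter must follow "<".
    match = re.search(r"<([A-Za-z][\w:.-]*)", head)
    root = match.group(1).lower() if match else None
    if root == "rss":
        return "application/rss+xml"
    if root == "html":
        return "text/html"
    if head.lstrip().startswith("<?xml"):
        return "text/xml"
    return None


def resolve(declared, content):
    """Combine the server-declared type with the sniffed one."""
    guessed = sniff(content)
    if guessed is None:
        return declared
    if declared is None:
        return guessed
    # Accept the guess only if it is the declared type or a subclass of it.
    return guessed if is_subtype(guessed, declared) else declared
```

With these assumptions, an HTML page that starts with an XML header and is served as text/html stays text/html, while a feed served as text/xml whose root element is rss is refined to the RSS type.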
I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.
Yes, but actually Nutch cannot act as a browser. For instance, with RSS: a browser knows that a URL is an RSS feed because there is a <link rel="alternate" type="..."/> with the correct content-type (application/rss+xml) in the referring HTML page. Nutch doesn't keep such information for guessing a content-type (it could be a good thing to add), so it must find the content-type from the URL alone (without any context). Since servers generally just return the generic text/xml content-type, the only way to know it is an RSS document is to use magic content-type guessing (note that many browsers don't identify it as an RSS document either, but simply as a generic XML file). One more thing: there is actually no officially registered content-type for RSS. So we can only rely on guessing from the document content to know it is an RSS document.
Jérôme
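As an illustration of the context a browser has and Nutch currently does not keep, here is a minimal Python sketch (not Nutch code) that collects content-type hints from <link rel="alternate" type="..."/> elements in a referring page. The class and function names are hypothetical, and how such hints would be stored alongside outlinks is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkCollector(HTMLParser):
    """Collects {absolute URL: declared type} from <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        ctype = attr.get("type")
        if "alternate" in rel and href and ctype:
            self.hints[urljoin(self.base_url, href)] = ctype.strip().lower()


def collect_type_hints(base_url, html):
    """Parse the referring page and return the type hints for its alternate links."""
    parser = AlternateLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hints
```

A crawler could consult such hints when it later fetches the linked URL and the server only answers with text/xml. This is one possible design, not something the current resolver does.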