I'm not so sure.  When crawling Apache we had trouble with this feature.
  Some HTML files that had an XML header, and that the server identified as
"text/html", were treated by Nutch as XML, not HTML.

Yes, the current version of the mime-type resolver is a crude one.
XML, HTML, RSS and all XML-based files are not always correctly identified
(this problem is well known, and causes trouble, for instance, with RSS feeds
that return a text/xml content-type).

 We had to turn off
the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you should only need to
remove the magic for XML in mime-types.xml.
In the new version (based on freedesktop) that has been sleeping for a while
on my disk, I think such problems are solved, since it introduces a lot of
information not included in the current version: a hierarchy between
content-types (text/html is a subclass of text/xml), a way to express
complex magic clauses, and so on.
For instance, it can now correctly identify RSS documents. With the current
version, RSS feeds are generally associated with a generic text/xml
content-type and we cannot identify them, so they fall back to the generic
parse-text parser.
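
To illustrate how a hierarchy between content-types could be combined with content-based guessing, here is a minimal sketch in Python. It is not the Nutch resolver and not the freedesktop-based code: the PARENT table, the function names, and the rule "accept a guess only if it is a subtype of the declared type" are all assumptions made for the example. The text/html entry mirrors the hierarchy mentioned above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: each type points to its more generic parent.
PARENT = {
    "application/rss+xml": "text/xml",
    "text/html": "text/xml",
    "text/xml": "text/plain",
}

def is_subtype(mime, ancestor):
    """Return True if mime equals ancestor or descends from it."""
    while mime is not None:
        if mime == ancestor:
            return True
        mime = PARENT.get(mime)
    return False

def guess_from_content(content):
    """Guess a type from the root element; None if the content is not XML."""
    try:
        for _event, elem in ET.iterparse(io.BytesIO(content), events=("start",)):
            root_tag = elem.tag  # the first "start" event is the root element
            break
        else:
            return None
    except ET.ParseError:
        return None
    if root_tag == "rss":
        return "application/rss+xml"
    return "text/xml"

def resolve(declared, content):
    """Refine the declared type, but never override it with a more generic one."""
    guessed = guess_from_content(content)
    if guessed is not None and is_subtype(guessed, declared):
        return guessed
    return declared
```

With this rule, a feed served as text/xml is refined to application/rss+xml, while a page served as text/html keeps its declared type even if it starts with an XML header, because text/xml is not a subtype of text/html.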


  I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.

Yes, but Nutch cannot actually act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because there
is a <link rel="alternate" type="..."/>
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).
Since all servers simply return the generic text/xml content-type, the only
way to know it is an RSS-related document is to use magic content-type
guessing (you can notice that many browsers don't identify it as an RSS
document, but simply as a generic XML file).
One more thing: there is actually no officially registered
content-type for RSS. So we can only use guessing from the document content
to know it is an RSS document.
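
As an illustration of the context a browser has and Nutch does not keep, here is a small sketch in Python that collects the content-types announced by <link rel="alternate"> elements in a referring page. The class and function names are hypothetical, and nothing like this is claimed to exist in Nutch; it only shows what kind of hint could be stored alongside an outlink.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AlternateLinkCollector(HTMLParser):
    """Collect {absolute URL: announced type} from <link rel="alternate">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        announced = attributes.get("type")
        if "alternate" in rel_values and href and announced:
            # Resolve relative links against the referring page URL.
            self.hints[urljoin(self.base_url, href)] = announced.strip().lower()

def collect_type_hints(base_url, html):
    """Return the content-type hints found in one referring HTML page."""
    collector = AlternateLinkCollector(base_url)
    collector.feed(html)
    return collector.hints
```

A fetcher could look up such a hint when the server only answers text/xml, and fall back to guessing from the document content when no referring page announced a type.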


Jérôme
