Jérôme Charron wrote:
We had to turn off the guessing of content types to index Apache
correctly.
Instead of turning off the guessing of content types, you should only
need to remove the magic for XML in mime-types.xml.
Perhaps that would have worked also, but, with Apache, simply trusting
the declared Content-Type seems to work quite well.
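To make "trusting the declared Content-Type" concrete, here is a minimal sketch in Python (not Nutch code). The function name declared_media_type and the fallback default are hypothetical, chosen only for illustration. It takes the raw header value, drops parameters such as charset, and does no magic-based guessing at all.

```python
from email.message import Message


def declared_media_type(content_type_header, default="application/octet-stream"):
    """Return the media type a server declared, without sniffing the body.

    Hypothetical helper for illustration; not part of Nutch.
    """
    if not content_type_header:
        # No declaration at all: fall back to a caller-chosen default.
        return default
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() lowercases the value and strips parameters.
    # Note: for a malformed value it returns "text/plain".
    return msg.get_content_type()
```

For example, declared_media_type("application/rss+xml; charset=UTF-8") yields "application/rss+xml", which is all a parser selector would need if servers declare their types correctly.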
I think we shouldn't aim to guess things any more than a browser does.
If browsers require standards compliance, then our lives will be simpler.
Yes, but actually Nutch cannot act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because
there is a <link rel="alternate" type="..."/> with the correct
content-type (application/rss+xml) in the referring HTML page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).
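As a rough illustration of what keeping that information could look like, here is a sketch in Python (not Nutch code, and Nutch does not do this today). The class AlternateLinkCollector and the function collect_type_hints are hypothetical names. The idea is to record, while parsing a referring HTML page, the type attribute of every <link rel="alternate"> so it can later serve as a hint for the linked URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkCollector(HTMLParser):
    """Maps absolute target URLs to the type declared by the referring page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        href = attributes.get("href")
        declared = attributes.get("type", "").strip().lower()
        if "alternate" in rels and href and declared:
            # Resolve relative hrefs against the referring page's URL.
            self.hints[urljoin(self.base_url, href)] = declared


def collect_type_hints(base_url, html):
    """Return {target URL: declared type} for one referring HTML page."""
    parser = AlternateLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hints
```

A crawler could store these hints alongside its outlinks and consult them only when the fetched resource has no usable Content-Type of its own; that ordering is a design assumption, not something settled in this thread.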
Shouldn't RSS feeds declare the correct content-type?
http://feedvalidator.org/docs/warning/NonSpecificMediaType.html
I don't see that context should be required for feeds.
Doug