Jérôme Charron wrote:
We had to turn off
the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you should only
remove the magic for xml in mime-types.xml.

Perhaps that would have worked also, but, with Apache, simply trusting the declared Content-Type seems to work quite well.

I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.

Yes, but Nutch cannot actually act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because there
is a <link rel="alternate" type="..."/>
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).
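
To make the browser behaviour described above concrete, here is a minimal sketch of collecting the feed links advertised by a referring HTML page. It is not Nutch code: the names FeedLinkParser and find_feed_links are hypothetical, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class FeedLinkParser(HTMLParser):
    """Collects (href, type) pairs from <link rel="alternate"> tags.

    Hypothetical helper for illustration; not part of Nutch.
    """

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link .../> tags also arrive here, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("href"):
            declared = attr.get("type", "").strip().lower()
            self.feeds.append((attr["href"], declared))


def find_feed_links(html):
    """Return the alternate links found in an HTML document."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds
```

A crawler that kept this pair alongside the outlink would know the type the referring page claims for the URL before fetching it.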

Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.
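
As a rough illustration of trusting the declared type, the sketch below classifies a Content-Type header value without looking at the content. The function name and the two sets of media types are assumptions made for this example, not anything taken from Nutch or its mime-types.xml.

```python
from email.message import Message

# Assumed for illustration: media types treated as feed-specific,
# and generic XML types that say nothing about the document being a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
GENERIC_XML_TYPES = {"text/xml", "application/xml"}


def classify_declared_type(content_type_header):
    """Classify a Content-Type header value as 'feed', 'generic-xml' or 'other'."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases the value.
    media_type = msg.get_content_type()
    if media_type in FEED_TYPES:
        return "feed"
    if media_type in GENERIC_XML_TYPES:
        return "generic-xml"
    return "other"
```

With a check like this, a server that declares a feed-specific type needs no guessing at all, and only the generic XML case leaves the question open.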

Doug
