Jérôme,

>> Why should Nutch treat it as HTML?
>
> Simply because it is a HTML file, with a strange name, of course, but
> it is a HTML file.
> My example is a kind of "caricature". But some more real case could be
> : a HTML file with a text/plain content-type, or with an text/xml

These cases don't sound "real" to me either. In the first case (text/plain), the page would be displayed with all the HTML tags visible; only very patient readers would try to decipher it. In the second case (text/xml), the document would most likely not be displayed at all, because most HTML documents are not well-formed XML.

The site admins, not Nutch, must fix this inconsistency; I don't think Nutch needs to be "smarter" than browsers. It's actually better for Nutch to miss these pages: I don't want to see a hit that leads me to a page that cannot be viewed.

-kuro