Jérôme,

>>              Why should Nutch treat it as HTML? 
>
>       Simply because it is a HTML file, with a strange name, of course, but 
> it is a HTML file.
>       My example is a kind of "caricature". But some more real case could be 
> : a HTML file with a text/plain content-type, or with an text/xml 

These cases don't sound "real" to me either.  
In the first case (text/plain), the page would be displayed with all HTML tags 
visible; only very patient readers would try to decipher it.
In the second case (text/xml), the document would most likely not be displayed 
at all because most HTML documents are not well-formed XML.  
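
To illustrate the text/xml case, here is a minimal sketch using only the 
Python standard library. The sample markup and the helper name 
is_well_formed_xml are made up for illustration; this is not how Nutch or any 
browser actually decides what to render.

```python
import xml.etree.ElementTree as ET

# Hypothetical "tag soup" page: unclosed <br> and <p>, unquoted attribute value.
html_page = "<html><body><p>Hello<br>world<p align=center>bye</body></html>"

# The same kind of content written so that it also happens to be valid XML.
xhtml_page = "<html><body><p>Hello<br/>world</p></body></html>"


def is_well_formed_xml(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml(html_page))   # the strict XML parser rejects this
print(is_well_formed_xml(xhtml_page))  # this one parses
```

A strict XML parser gives up at the first well-formedness error, which is why 
an ordinary HTML page served as text/xml tends to produce an error rather 
than a rendered page.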

The site admins, not Nutch, must fix this inconsistency; I don't think Nutch 
needs to be "smarter" than browsers.
It's actually better for Nutch to miss these pages. I don't want to see a hit 
that leads me to a page that cannot be viewed.

-kuro
