> I'm not sure that is the right thing to do.
> If the site administrator did a poor job and a wrong media type is
> advertised, it's the site's problem and Nutch shouldn't be fixing it,
> in my opinion. Those sites would not work properly with browsers
> anyway, and Nutch doesn't need to work properly with them either,
> except that it should protect itself from crashing. I tried to visit
> your fake.zip page with IE and Firefox, and both faithfully trusted
> the media type as advertised by the server and asked me whether I
> wanted to open it with WinZip or save it; there was no option to open
> it as HTML.
> Why should Nutch treat it as HTML?

Simply because it is an HTML file. It has a strange name, of course, but
it is an HTML file.
My example is a kind of caricature. A more realistic case would be an
HTML file served with a text/plain content type, or with text/xml.

In the end, it is good news that Nutch seems to be more "intelligent" at
content-type guessing than Firefox or IE, no?

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
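
For illustration, the kind of content-based guessing discussed above could
look roughly like the sketch below. This is not Nutch's actual code: the
class ContentSniffer, its two methods, and the 512-byte limit are
hypothetical names and values chosen only for this example. It looks at
the first bytes of the fetched content and, when they resemble HTML,
prefers text/html over the type advertised by the server.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: guesses HTML from the content itself.
final class ContentSniffer {

    // Assumption: inspecting the first 512 bytes is enough for a guess.
    private static final int SNIFF_LIMIT = 512;

    /** Returns true when the leading bytes look like an HTML document. */
    static boolean looksLikeHtml(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        return head.startsWith("<!doctype html")
                || head.startsWith("<html")
                || head.contains("<head")
                || head.contains("<body");
    }

    /** Keeps the advertised type unless the content clearly looks like HTML. */
    static String resolveType(String advertisedType, byte[] content) {
        return looksLikeHtml(content) ? "text/html" : advertisedType;
    }

    public static void main(String[] args) {
        // An HTML page served under a misleading type, as with fake.zip.
        byte[] page = "<html><body>Hello</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(resolveType("application/zip", page)); // prints "text/html"
    }
}
```

With this rule, an HTML page served as text/plain or text/xml would also be
resolved to text/html, while a real zip archive would keep its advertised
type. A real detector would need many more signatures than these four.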