[Nutch-dev] Re: Content-Type inconsistency?

Doug Cutting Thu, 27 Apr 2006 14:50:03 -0700

Jérôme Charron wrote:

Finally, it is good news that Nutch seems to be more "intelligent" at
content-type guessing than Firefox or IE, no?

I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files had an XML header, and although the server identified them as "text/html", Nutch decided to treat them as XML, not HTML. We had to turn off the guessing of content types to index Apache correctly. I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.
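
To make the suggestion concrete, here is a minimal sketch of "trust the server, guess only as a fallback". It is not Nutch code: the class ContentTypeResolver, the method resolve, and the guessEnabled flag are hypothetical names chosen for illustration, and the sniffing rules are deliberately crude.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
class ContentTypeResolver {

    /**
     * Returns the content type to use for a fetched document.
     * The server-declared type always wins; the body is inspected
     * only when no usable header exists and guessing is enabled.
     */
    static String resolve(String serverHeader, String body, boolean guessEnabled) {
        if (serverHeader != null) {
            // Drop parameters such as "; charset=UTF-8" and normalize case.
            String type = serverHeader.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (!type.isEmpty()) {
                return type;
            }
        }
        if (guessEnabled && body != null) {
            String head = body.trim();
            // Crude sniffing, assumed here only as a stand-in for real detection.
            if (head.startsWith("<?xml")) {
                return "application/xml";
            }
            if (head.toLowerCase(Locale.ROOT).contains("<html")) {
                return "text/html";
            }
        }
        // Nothing declared and nothing guessed.
        return "application/octet-stream";
    }
}
```

With this ordering, an HTML page that starts with an XML declaration but is served as "text/html" stays HTML, which is the case that went wrong in the Apache crawl.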


Doug


