Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron
I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML. Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML based

Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types, you should only remove the magic for xml in mime-types.xml. Perhaps that would have worked also, but, with Apache, simply trusting the
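The two options discussed above (turning guessing off entirely versus removing only the XML magic) can be contrasted with a small sketch. This is not Nutch's resolver: the rule list, the `guess_type` function and its `use_magic` and `skip_types` parameters are hypothetical names invented for illustration, and the rule set is far simpler than a real mime-types.xml.

```python
import re

# Hypothetical magic rules: (pattern tested at the start of the content, media type).
MAGIC_RULES = [
    (re.compile(rb"\s*<\?xml"), "application/xml"),
    (re.compile(rb"\s*(<!DOCTYPE\s+html|<html)", re.IGNORECASE), "text/html"),
]


def guess_type(content, header_type, use_magic=True, skip_types=()):
    """Return a media type, letting a magic match override the server's header."""
    if use_magic:
        for pattern, media_type in MAGIC_RULES:
            if media_type in skip_types:
                continue  # behaves as if this rule had been removed from the rule set
            if pattern.match(content):
                return media_type
    # No magic rule applied: fall back to what the server advertised.
    return header_type


# An HTML page that starts with an XML declaration, served as text/html.
page = b'<?xml version="1.0"?>\n<html><body>Apache docs</body></html>'

print(guess_type(page, "text/html"))                   # magic overrides the header
print(guess_type(page, "text/html", use_magic=False))  # guessing turned off
print(guess_type(page, "text/html", skip_types=("application/xml",)))  # XML magic removed
```

In this sketch both workarounds end up trusting the server's text/html for such a page; the difference is that removing only the XML rule leaves the other magic rules active for content served with a wrong or missing header.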

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
Are you mainly concerned with charset in Content-Type? Not specifically. But while looking at these content-type inconsistencies, I noticed that there are some possible problems with charset in content-type. Currently, what happens when Content-Type exists in both the HTTP layer and in a META tag
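To make the charset question concrete, here is a minimal sketch that reads the charset from both places and reports whether they disagree. It does not describe what Nutch does in that case; the function names and the META regular expression are assumptions made for illustration, and the regex only handles the common `http-equiv` form.

```python
import re
from email.message import Message

# Assumed pattern for <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+content\s*=\s*["'][^"']*charset=([\w.:-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Charset parameter of an HTTP Content-Type value, lower-cased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html):
    """Charset declared in a META tag of the raw HTML bytes, lower-cased, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None


def compare_charsets(header_value, html):
    """Return (http_charset, meta_charset, conflict)."""
    http_cs = charset_from_header(header_value)
    meta_cs = charset_from_meta(html)
    conflict = http_cs is not None and meta_cs is not None and http_cs != meta_cs
    return http_cs, meta_cs, conflict


html = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head></html>')
print(compare_charsets("text/html; charset=ISO-8859-1", html))
```

Which of the two values should win when they conflict is exactly the policy question raised in the message; the sketch only detects the conflict.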

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
I'm not sure if that is the right thing. If the site administrator did a poor job and a wrong media type is advertised, it's the site's problem and Nutch shouldn't be fixing it, in my opinion. Those sites would not work properly with the browsers anyway, and Nutch doesn't need to work

Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting
Jérôme Charron wrote: Finally, it is good news that Nutch seems to be more intelligent at content-type guessing than Firefox or IE, no? I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML.

Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back to this issue: The Content object holds two content-types: 1. The raw content-type from the protocol layer (the HTTP header in the case of HTTP), stored in the Content's metadata. 2. The guessed content-type, stored in a private field content-type. When a ParseData object is created, it takes
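The two places a content-type can live, as described above, can be modelled with a small sketch. This is a simplified stand-in, not the actual Nutch `Content` class: the field names, the properties and the `is_consistent` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """Simplified stand-in for a fetched document (hypothetical, not the Nutch class)."""
    url: str
    data: bytes
    metadata: dict = field(default_factory=dict)  # protocol-level headers
    _content_type: str = ""                       # guessed type, kept in a private field

    @property
    def raw_content_type(self):
        """Content-type exactly as the protocol layer reported it, or None."""
        return self.metadata.get("Content-Type")

    @property
    def guessed_content_type(self):
        """Content-type chosen by the resolver."""
        return self._content_type

    def is_consistent(self):
        """True when the raw type (ignoring parameters) matches the guessed type."""
        raw = self.raw_content_type
        if raw is None:
            return True
        return raw.split(";")[0].strip().lower() == self._content_type.lower()


doc = Content(
    url="http://example.org/index.html",
    data=b'<?xml version="1.0"?><html></html>',
    metadata={"Content-Type": "text/html; charset=UTF-8"},
    _content_type="application/xml",
)
print(doc.raw_content_type, doc.guessed_content_type, doc.is_consistent())
```

Keeping both values side by side makes it easy to see when they diverge, which is the inconsistency this thread is about.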