Content-Type inconsistency?

2006-04-10 Thread Jérôme Charron
It seems there is an inconsistency in content-type handling in Nutch: 1. The protocol-level content-type header is added to the content's metadata. 2. The content-type is then checked/guessed while instantiating the Content object and stored in a private field (at this step, the Content object can h

Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back to this issue: The Content object holds two content-types: 1. The raw content-type from the protocol layer (the HTTP header in the case of HTTP) in the Content's metadata 2. The guessed content-type in a private field content-type. When a ParseData object is created, it takes onl
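
To make the two stored values concrete, here is a minimal Python sketch of an object that keeps the declared type in its metadata and a guessed type in a private field. It is only an illustration under stated assumptions: the `Content` class, `guess_type`, `HTML_MARKERS`, and the example.com URL are hypothetical names chosen here, not Nutch's actual API or its real guessing logic.

```python
# Hypothetical illustration only; none of these names are Nutch's actual API.
HTML_MARKERS = (b"<!doctype html", b"<html")


def guess_type(data, declared):
    """Return a guessed media type, falling back to the declared one."""
    head = data[:256].lstrip().lower()
    if head.startswith(HTML_MARKERS):
        return "text/html"
    return declared


class Content:
    def __init__(self, url, data, declared_type):
        self.url = url
        self.data = data
        # 1. The raw protocol-level value is kept in the metadata.
        self.metadata = {"Content-Type": declared_type}
        # 2. The guessed value is kept separately in a private field.
        self._content_type = guess_type(data, declared_type)

    def get_content_type(self):
        return self._content_type


# An HTML page that the (hypothetical) server declares as a zip archive.
page = Content(
    "http://example.com/fake.zip",
    b"<html><body>hi</body></html>",
    "application/zip",
)
print(page.metadata["Content-Type"], page.get_content_type())
```

Any consumer that reads only the metadata and any consumer that reads only the private field can now disagree about the same document, which is the kind of mismatch the message describes.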

RE: Content-Type inconsistency?

2006-04-17 Thread Teruhiko Kurosaka
differs from the content-type in the metadata? If so, what class uses which? -kuro > -Original Message- > From: Jérôme Charron [mailto:[EMAIL PROTECTED] > Sent: 2006-4-13 12:57 > To: nutch-dev@lucene.apache.org > Subject: Re: Content-Type inconsistency? > > I would like

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
> Are you mainly concerned with charset in Content-Type? Not specifically. But while looking at this content-type inconsistency, I noticed that there may be some trouble with the charset in the content-type. > Currently, what happens when Content-Type exists in both HTTP layer and in >

RE: Content-Type inconsistency?

2006-04-27 Thread Teruhiko Kurosaka
Jérôme, Thank you for the explanation. Here is an easy way to reproduce what I mean by content-type inconsistency: 1. Perform a crawl of the following URL : http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file, in fact it is a html one) 2

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
> I'm not sure if that is the right thing. > If the site administrator did a poor job and a wrong media type is > advertised, it's the site's > problem and Nutch shouldn't be fixing it, in my opinion. Those sites > would > not work properly with the browsers anyway, and Nutch doesn't need to > wor

Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting
Jérôme Charron wrote: Finally, it is good news that Nutch seems to be more "intelligent" on content-type guessing than Firefox or IE, no? I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and the server identified as "text/html" N

RE: Content-Type inconsistency?

2006-04-27 Thread Teruhiko Kurosaka
Jérôme, >> Why should Nutch treat it as HTML? > > Simply because it is an HTML file, with a strange name, of course, but > it is an HTML file. > My example is a kind of "caricature". But a more realistic case could be > : an HTML file with a text/plain content-type, or with

Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron
I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and the server identified as "text/html" Nutch decided to treat as XML, not HTML. Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML based

Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types you only need to remove the magic for xml in mime-types.xml Perhaps that would have worked also, but, with Apache, simply trusting the decl
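
The two remedies discussed in this exchange (turning guessing off entirely versus removing only the XML magic) can be compared with a small Python sketch. Everything here is an assumption made for illustration: the `MAGIC` table, the `resolve` function, and its `use_magic` flag are hypothetical and do not reproduce the contents of mime-types.xml or the behaviour of Nutch's resolver.

```python
# Hypothetical magic table of (byte prefix, media type) pairs.
# It is NOT the real contents of mime-types.xml.
MAGIC = [
    (b"<?xml", "text/xml"),
    (b"<html", "text/html"),
]


def resolve(data, declared, use_magic=True, magic=MAGIC):
    """Pick a media type: first matching magic prefix, else the declared type."""
    if not use_magic:
        return declared  # trust the server's declaration
    head = data.lstrip().lower()
    for prefix, media_type in magic:
        if head.startswith(prefix):
            return media_type
    return declared


# An HTML page with an XML header, declared by the server as text/html.
xhtml = b'<?xml version="1.0"?>\n<html><body>docs</body></html>'

with_guessing = resolve(xhtml, "text/html")
trusting = resolve(xhtml, "text/html", use_magic=False)
no_xml_magic = resolve(
    xhtml, "text/html", magic=[m for m in MAGIC if m[0] != b"<?xml"]
)
print(with_guessing, trusting, no_xml_magic)
```

In this toy setup both remedies recover text/html for the page, but only the second keeps guessing active for other formats.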

Re: Content-Type inconsistency?

2006-05-04 Thread Jérôme Charron
Shouldn't RSS feeds declare the correct content-type? Yes, they should, but generally, they don't (a lot of RSS feeds return a text/xml content-type). I don't know why. Perhaps because application/rss+xml is not registered with IANA (http://www.iana.org/assignments/media-types/application/) In pra