> Are you mainly concerned with charset in Content-Type?

Not specifically. But while looking at these content-type inconsistencies, I noticed that there are some possible problems with the charset in the content-type.

> Currently, what happens when Content-Type exists in both HTTP layer and in
> META tag (if contents is HTML)?

We cannot use the one in the META tags: to extract it, we would first need to know that the HTML parser should be used. Only the HTTP header is used. It is then checked/guessed using the mime-type repository (a mime-type database that contains mime-types with their associated file extensions and, optionally, some magic bytes).

> How does Nutch guesses Content-Type, and when does it need to do that?

See my response above.

> Is there a situation where the guessed content-type differs from the
> content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).

Here is an easy way to reproduce what I mean by content-type inconsistency:

1. Perform a crawl of the following URL: http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file; it is in fact an HTML file).
2. While crawling, you can see that the content-type returned by the server is application/zip.
3. But you can see that Nutch correctly guesses the content-type to be text/html (it uses the HtmlParser).
4. At this step, all is OK.
5. Then start your Tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the index-more plugin was activated, you can see that the stored content-type is application/zip and not text/html.

What I suggest is simply to use the content-type that Nutch used to select the parser, instead of the one returned by the server.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
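
P.S. To make the suggestion above concrete, here is a minimal sketch in Python. It is not Nutch code: every name in it (MAGIC, EXTENSIONS, resolve_content_type, build_metadata) is hypothetical, the tiny tables only stand in for the mime-type repository, and the priority order (content evidence, then header, then file extension) is an assumption made for illustration. The only point is that the type used to pick the parser is also the type that gets stored.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical stand-in for the mime-type repository (magic bytes + extensions).
MAGIC = [
    (b"PK\x03\x04", "application/zip"),
    (b"%PDF-", "application/pdf"),
]
EXTENSIONS = {
    ".zip": "application/zip",
    ".html": "text/html",
    ".pdf": "application/pdf",
}


def looks_like_html(data):
    """Crude HTML check on the first bytes of the content (assumption)."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def resolve_content_type(url, header_type, data):
    """Return the content-type that would be used to pick a parser."""
    # 1. Evidence from the content itself wins over what the server says.
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime
    if looks_like_html(data):
        return "text/html"
    # 2. Otherwise trust the HTTP header, without its parameters (charset...).
    if header_type:
        return header_type.split(";")[0].strip().lower()
    # 3. Last resort: the file extension in the URL path.
    ext = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSIONS.get(ext, "application/octet-stream")


def build_metadata(url, header_type, data):
    """Store the resolved type, keeping the server's value only for reference."""
    resolved = resolve_content_type(url, header_type, data)
    return {
        "parser_type": resolved,
        "stored_content_type": resolved,
        "server_content_type": header_type,
    }


meta = build_metadata(
    "http://jerome.charron.free.fr/nutch/fake.zip",
    "application/zip",
    b"<html><body>not really a zip</body></html>",
)
print(meta["stored_content_type"])
```

With the fake.zip case from the steps above, the stored type follows the parser choice (text/html), while the value sent by the server (application/zip) stays available separately if someone needs it.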