> Are you mainly concerned with charset in Content-Type?

Not specifically.
But while looking at these content-type inconsistencies, I noticed that
there are also some possible problems with the charset in Content-Type.


> Currently, what happens when Content-Type exists in both HTTP layer and in
> META tag (if contents is HTML)?

We cannot use the one in the META tags: to extract it, we would first need
to know that the HTML parser is the one to use.
So only the HTTP header is used.
It is then checked/guessed against the mime-type repository (a mime-type
database that contains mime-types, their associated file extensions and,
optionally, some magic bytes).
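
To make the checking/guessing step concrete, here is a minimal sketch of a
lookup by magic bytes and file extension. It is not the actual Nutch
mime-type repository: the class name MimeGuesser, the tiny hard-coded
tables, and the precedence (magic bytes, then extension, then HTTP header)
are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not Nutch's actual mime-type repository. */
final class MimeGuesser {
    // Illustrative extension table; a real repository would load many more entries.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "zip", "application/zip",
        "html", "text/html",
        "htm", "text/html");

    /** Returns a type based on the leading bytes, or null when nothing matches. */
    static String fromMagic(byte[] head) {
        // Typical ZIP archives start with the bytes 'P' 'K' 0x03 0x04.
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4) {
            return "application/zip";
        }
        // Very rough HTML check: look at the first non-blank characters.
        String text = new String(head, StandardCharsets.ISO_8859_1)
            .trim().toLowerCase(Locale.ROOT);
        if (text.startsWith("<!doctype html") || text.startsWith("<html")) {
            return "text/html";
        }
        return null;
    }

    /** Returns a type based on the URL's file extension, or null when unknown. */
    static String fromExtension(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Assumed precedence: magic bytes, then extension, then the HTTP header. */
    static String guess(String url, String headerType, byte[] head) {
        String magic = fromMagic(head);
        if (magic != null) {
            return magic;
        }
        String ext = fromExtension(url);
        return ext != null ? ext : headerType;
    }
}
```

With this sketch, a URL ending in .zip that is served as application/zip but
whose body starts with "<html" resolves to text/html, because the content
check runs before the extension and header are consulted.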

> How does Nutch guess Content-Type, and when does it need to do that?

See my response above.


> Is there a situation where the guessed content-type differs from the
> content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).


Here is an easy way to reproduce what I mean by content-type inconsistency:
1. Crawl the following URL:
http://jerome.charron.free.fr/nutch/fake.zip
(fake.zip is a fake zip file; it is actually an HTML file)
2. While crawling, you can see that the content-type returned by the server
is application/zip
3. But you can also see that Nutch correctly guesses the content-type as
text/html (it uses the HtmlParser)
4. At this step, all is OK.
5. Then start Tomcat and run the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the
index-more plugin was activated, you can see that the stored
content-type is application/zip and not text/html

What I suggest is simply to use the content-type that Nutch itself used to
choose the parser, instead of the one returned by the server.
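
As a rough illustration of that suggestion (not the index-more plugin's
actual code), the sketch below picks the value to store. The class name
ContentTypeRecorder and the field name "contentType" are hypothetical; the
only point is that the type resolved for parser selection takes priority,
and the server header is used only when nothing was resolved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; class and field names are illustrative only. */
final class ContentTypeRecorder {
    /**
     * Builds the fields to store for a document.
     *
     * @param headerType   Content-Type returned by the server (may be null)
     * @param resolvedType type that was used to select the parser (may be null)
     */
    static Map<String, String> indexFields(String headerType, String resolvedType) {
        Map<String, String> fields = new HashMap<>();
        // Prefer the type that drove parser selection; fall back to the header.
        String effective = (resolvedType != null && !resolvedType.isEmpty())
            ? resolvedType
            : headerType;
        if (effective != null) {
            fields.put("contentType", effective); // hypothetical field name
        }
        return fields;
    }
}
```

For the fake.zip case, indexFields("application/zip", "text/html") stores
text/html, which matches the parser that actually processed the page.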

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
