Re: Content-Type inconsistency?
I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML.

Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML-based files are not always correctly identified. (This problem is well known, and causes trouble, for instance, with RSS feeds that are returned with a text/xml content-type.)

We had to turn off the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you only need to remove the magic for XML in mime-types.xml. In the new version (based on freedesktop) that has been sleeping for a while on my disk, I think such problems are solved, since it introduces a lot of information not included in the current version: a hierarchy between content-types (text/html is a subclass of text/xml), a way to express complex magic clauses, and so on. For instance, it can now correctly identify RSS documents: generally RSS feeds are associated with a generic text/xml content-type, and with the current version we cannot identify them, so they fall back to the generic parse-text parser.

I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.

Yes, but actually Nutch cannot act as a browser. For instance, with RSS: a browser knows that a URL is an RSS feed because there is a link rel=alternate type=.../ with the correct content-type (application/rss+xml) in the referring HTML page. Nutch doesn't keep such information for guessing a content-type (it could be a good thing to add), so it must find the content-type from the URL (without any context). Since all servers simply return the generic text/xml content-type, the only way to know it is an RSS document is to use magic content-type guessing (you can notice that many browsers don't identify it as an RSS document, but simply as a generic XML file).
One more thing: there is actually no officially registered content-type for RSS. So, we can only guess from the document content to know that it is an RSS document.

Jérôme
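To make the "magic" guessing discussed above concrete, here is a minimal sketch of how a generic XML content-type could be refined by looking at the document's root element. It is not Nutch's mime-type resolver: the class `FeedSniffer`, the method `refine`, and the 1024-byte window are assumptions made for illustration, and only RSS 0.9x/2.0 documents (root element `rss`) are recognized.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): refines a generic XML type from the content. */
final class FeedSniffer {

    // Skips an optional XML declaration, processing instructions, comments
    // and a DOCTYPE, then captures the name of the root element.
    private static final Pattern ROOT = Pattern.compile(
        "\\s*(?:<\\?.*?\\?>\\s*|<!--.*?-->\\s*|<!DOCTYPE[^>]*>\\s*)*<([A-Za-z_][\\w.:-]*)",
        Pattern.DOTALL);

    /** Returns application/rss+xml for RSS content declared as generic XML, else the declared type. */
    static String refine(String declaredType, byte[] content) {
        // Drop parameters such as "; charset=utf-8" before comparing.
        String base = declaredType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        if (!base.equals("text/xml") && !base.equals("application/xml")) {
            return declaredType; // only second-guess the generic XML types
        }
        // Assumption: the root element appears within the first 1024 bytes.
        int len = Math.min(content.length, 1024);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ROOT.matcher(head);
        if (m.lookingAt() && m.group(1).equals("rss")) {
            return "application/rss+xml";
        }
        return declaredType;
    }
}
```

Calling `FeedSniffer.refine("text/xml", bytes)` on a document whose root element is `rss` yields `application/rss+xml`; any other declared type is returned untouched, so a server that declares its type precisely is still trusted.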
Re: Content-Type inconsistency?
Jérôme Charron wrote:

We had to turn off the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you only need to remove the magic for XML in mime-types.xml.

Perhaps that would have worked also, but, with Apache, simply trusting the declared Content-Type seems to work quite well.

I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.

Yes, but actually Nutch cannot act as a browser. For instance, with RSS: a browser knows that a URL is an RSS feed because there is a link rel=alternate type=.../ with the correct content-type (application/rss+xml) in the referring HTML page. Nutch doesn't keep such information for guessing a content-type (it could be a good thing to add), so it must find the content-type from the URL (without any context).

Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.

Doug
Re: Content-Type inconsistency?
Are you mainly concerned with charset in Content-Type?

Not specifically. But while looking at these content-type inconsistencies, I noticed that there are some possible problems with the charset in the content-type.

Currently, what happens when Content-Type exists in both the HTTP layer and in a META tag (if the content is HTML)?

We cannot use the one in the META tags: to extract it, we first need to know that we should use the HTML parser. Only the HTTP header is used. It is then checked/guessed using the mime-type repository (a mime-type database that contains mime-types with their associated file extensions and, optionally, some magic bytes).

How does Nutch guess Content-Type, and when does it need to do that?

See my response above.

Is there a situation where the guessed content-type differs from the content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).

Here is an easy way to reproduce what I mean by content-type inconsistency:

1. Perform a crawl of the following URL: http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file; in fact it is an HTML file).
2. While crawling, you can see that the content-type returned by the server is application/zip.
3. But you can see that Nutch correctly guesses the content-type as text/html (it uses the HtmlParser).
4. At this step, all is OK.
5. Then start your Tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the index-more plugin was activated, then you can see that the stored content-type is application/zip and not text/html.

What I suggest is simply to store the content-type that Nutch used to choose the parser, instead of the one returned by the server.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Content-Type inconsistency?
I'm not sure if that is the right thing. If the site administrator did a poor job and a wrong media type is advertised, it's the site's problem and Nutch shouldn't be fixing it, in my opinion. Those sites would not work properly with browsers anyway, and Nutch doesn't need to work properly with them, except that it should protect itself from crashing. I tried to visit your fake.zip page with IE and Firefox, and both faithfully trusted the media type as advertised by the server, and asked me if I wanted to open it with WinZip or save it; there was no option to open it as HTML. Why should Nutch treat it as HTML?

Simply because it is an HTML file. With a strange name, of course, but it is an HTML file. My example is a kind of caricature. But some more realistic cases could be: an HTML file with a text/plain content-type, or with a text/xml one.

Finally, it is good news that Nutch seems to be more intelligent at content-type guessing than Firefox or IE, no?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Content-Type inconsistency?
Jérôme Charron wrote:

Finally, it is good news that Nutch seems to be more intelligent at content-type guessing than Firefox or IE, no?

I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML. We had to turn off the guessing of content types to index Apache correctly.

I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.

Doug
Re: Content-Type inconsistency?
I would like to come back to this issue. The Content object holds two content-types:

1. The raw content-type from the protocol layer (the HTTP header in the case of HTTP), in the Content's metadata.
2. The guessed content-type, in a private content-type field.

When a ParseData object is created, it takes only the Content's metadata. So, the ParseData can only access the raw content-type and not the guessed one.

What I suggest is one of the following:

1. Add a content-type parameter to the ParseData constructors (so that Parsers can pass the guessed content-type to ParseData).
2. Have the Content object store the guessed content-type in its metadata, in a special attribute named, for instance, GUESSED_CONTENT_TYPE, so that the ParseData can access it.

I think 1. is really the cleanest way to implement this, but a lot of code is impacted: all the parsers. Solution 2. has no impact on the APIs, so the code changes are very small.

Suggestions? Comments?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
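As a rough illustration of solution 2, the sketch below uses simplified stand-ins for the real classes. `SketchContent` and `SketchParseData` are hypothetical and are not Nutch's `Content` and `ParseData`; the metadata key `X-Guessed-Content-Type` and the accessor names are also assumptions. The point is only that the guessed type travels inside the metadata, so no constructor signature changes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for Content: keeps both types in its metadata. */
final class SketchContent {
    static final String CONTENT_TYPE = "Content-Type";
    // Assumed key name for the special attribute proposed in solution 2.
    static final String GUESSED_CONTENT_TYPE = "X-Guessed-Content-Type";

    private final Map<String, String> metadata = new HashMap<>();

    SketchContent(String rawType, String guessedType) {
        metadata.put(CONTENT_TYPE, rawType);             // as sent by the server
        metadata.put(GUESSED_CONTENT_TYPE, guessedType); // as used to pick the parser
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}

/** Hypothetical stand-in for ParseData: still built from the metadata only. */
final class SketchParseData {
    private final Map<String, String> metadata;

    SketchParseData(Map<String, String> contentMetadata) {
        this.metadata = new HashMap<>(contentMetadata);
    }

    /** Prefers the guessed type and falls back to the raw one when no guess was stored. */
    String getContentType() {
        String guessed = metadata.get(SketchContent.GUESSED_CONTENT_TYPE);
        return guessed != null ? guessed : metadata.get(SketchContent.CONTENT_TYPE);
    }

    /** The type declared by the protocol layer, kept for diagnostics. */
    String getRawContentType() {
        return metadata.get(SketchContent.CONTENT_TYPE);
    }
}
```

With the fake.zip example, a `SketchContent` built with `application/zip` as the raw type and `text/html` as the guessed type gives a `SketchParseData` whose `getContentType()` returns `text/html`, while the raw header value remains available through `getRawContentType()`.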