Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron

I'm not so sure.  When crawling Apache we had trouble with this feature.
  Some HTML files that had an XML header, and that the server identified as
text/html, Nutch decided to treat as XML, not HTML.


Yes, the current version of the mime-type resolver is a crude one.
XML, HTML, RSS and other XML-based files are not always correctly identified
(this problem is well known, and causes trouble, for instance, with RSS feeds
that return a text/xml content-type).

We had to turn off
the guessing of content types to index Apache correctly.


Instead of turning off the guessing of content types, you only need to
remove the magic for xml in mime-types.xml.
In the new version (based on freedesktop), which has been sitting on my disk
for a while, I think such problems are solved, since it introduces
information not included in the current version:
a hierarchy between content-types (text/html is a subclass of text/xml), a
way to express complex magic clauses, and so on.
For instance, it can now correctly identify RSS documents. RSS feeds are
generally served with a generic text/xml content-type, so the current
version cannot identify them and they fall back to the generic parse-text
parser.
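
To make the idea of a hierarchy between content-types concrete, here is a minimal sketch. It is not code from Nutch or from the freedesktop database: the class MimeHierarchy and its methods addSubclass and isA are hypothetical names, and the parent relations are supplied by the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a content-type hierarchy; not Nutch code. */
final class MimeHierarchy {
    // Maps a child content-type to its parent content-type.
    private final Map<String, String> parents = new HashMap<>();

    /** Declares that child is a subclass of parent. */
    void addSubclass(String child, String parent) {
        parents.put(child, parent);
    }

    /** True if type equals ancestor or reaches it by following parent links. */
    boolean isA(String type, String ancestor) {
        String current = type;
        int hops = 0;
        // The hop limit guards against accidental cycles in the table.
        while (current != null && hops <= parents.size()) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
            hops++;
        }
        return false;
    }
}
```

With such a structure, a document resolved to a specific type can still be handed to a parser registered for one of its ancestors when no more specific parser exists.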



I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.


Yes, but actually Nutch cannot act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because there
is a link rel=alternate type=.../
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).
Since all servers simply return the generic text/xml content-type, the only
way to know it is an RSS document is to use magic content-type
guessing (note that many browsers don't identify it as an RSS
document, but simply as a generic XML file).
One more thing: actually, there is no officially registered
content-type for RSS. So we can only guess from the document content
that it is an RSS document.
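
As a rough illustration of guessing from the document content, the sketch below looks at the first element name in the leading bytes of a document and refines a generic XML type to application/rss+xml when that element is rss. It is an assumption-laden sketch, not the Nutch resolver: FeedSniffer, rootElementName, guess and the 1024-byte limit are hypothetical, and the scan does not skip XML comments, so a commented-out tag would fool it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical content sniffer for RSS; not the Nutch mime-type resolver. */
final class FeedSniffer {
    // Only a short prefix is inspected (assumed limit).
    private static final int SNIFF_LIMIT = 1024;

    /** Returns the name of the first element found in the prefix, or null. */
    static String rootElementName(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        int i = 0;
        while ((i = head.indexOf('<', i)) >= 0) {
            int start = i + 1;
            // Skips "<?xml ...?>" and "<!DOCTYPE ...>", which do not start with a letter.
            if (start < head.length() && Character.isLetter(head.charAt(start))) {
                int end = start;
                while (end < head.length() && isNameChar(head.charAt(end))) {
                    end++;
                }
                return head.substring(start, end);
            }
            i = start;
        }
        return null;
    }

    private static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == ':' || c == '-' || c == '_' || c == '.';
    }

    /** Refines a generic XML type to RSS when the root element is rss. */
    static String guess(String declaredType, byte[] content) {
        boolean genericXml = "text/xml".equals(declaredType)
                || "application/xml".equals(declaredType);
        if (genericXml && "rss".equals(rootElementName(content))) {
            return "application/rss+xml";
        }
        return declaredType;
    }
}
```

The declared type is only refined, never contradicted: a document served as text/html keeps that type even if its body happens to start with an rss element.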


Jérôme


Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting

Jérôme Charron wrote:

We had to turn off
the guessing of content types to index Apache correctly.


Instead of turning off the guessing of content types, you only need to
remove the magic for xml in mime-types.xml


Perhaps that would have worked also, but, with Apache, simply trusting 
the declared Content-Type seems to work quite well.



I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.


Yes, but actually Nutch cannot act as a browser.
For instance, with RSS: a browser knows that a URL is an RSS feed because
there is a link rel=alternate type=.../
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).


Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.

Doug


Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 Are you mainly concerned with charset in Content-Type?

Not specifically.
But while looking at these content-type inconsistencies, I noticed that there
are some possible problems with charset in content-type.


 Currently, what happens when Content-Type exists in both HTTP layer and in
 META tag (if contents is HTML)?

We cannot use the one in the META tags: to extract it, we first need to know
that we should use the HTML parser.
Only the HTTP header is used.
It is then checked/guessed using the mime-type repository (a mime-type
database that contains mime-types with their associated file extensions and,
optionally, some magic bytes).
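
The kind of repository described above can be sketched as follows. This is a hypothetical illustration, not the Nutch implementation: MimeRepository, Entry, add and resolve are invented names, magic is reduced to a fixed byte prefix at offset 0, and the resolution order (magic bytes, then file extension, then the declared header) is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a mime-type repository; not the Nutch implementation. */
final class MimeRepository {
    /** One database entry: a type, its file extensions, and optional magic bytes. */
    static final class Entry {
        final String type;
        final List<String> extensions;
        final byte[] magic; // may be null: not every type has magic bytes

        Entry(String type, List<String> extensions, byte[] magic) {
            this.type = type;
            this.extensions = extensions;
            this.magic = magic;
        }

        /** True if the content starts with this entry's magic bytes. */
        boolean matchesMagic(byte[] content) {
            if (magic == null || content.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (content[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers a type; asciiMagic may be null when the type has no magic. */
    void add(String type, List<String> extensions, String asciiMagic) {
        byte[] magic = asciiMagic == null
                ? null
                : asciiMagic.getBytes(StandardCharsets.US_ASCII);
        entries.add(new Entry(type, extensions, magic));
    }

    /** Assumed order: magic bytes, then file extension, then the declared header. */
    String resolve(String declaredType, String url, byte[] content) {
        for (Entry e : entries) {
            if (e.matchesMagic(content)) {
                return e.type;
            }
        }
        String lower = url.toLowerCase(Locale.ROOT);
        for (Entry e : entries) {
            for (String ext : e.extensions) {
                if (lower.endsWith("." + ext)) {
                    return e.type;
                }
            }
        }
        return declaredType;
    }
}
```

With entries for application/zip (extension zip, magic "PK") and text/html (magic "<html"), an HTML body served as application/zip under a .zip URL resolves to text/html, because the magic match is tried first.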

How does Nutch guess Content-Type, and when does it need to do that?

See my response above


 Is there a situation where the guessed content-type differs from the
 content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).


Here is an easy way to reproduce what I mean by content-type inconsistency:
1. Perform a crawl of the following URL:
http://jerome.charron.free.fr/nutch/fake.zip
(fake.zip is a fake zip file; it is in fact an HTML file)
2. While crawling, you can see that the content-type returned by the server
is application/zip
3. But you can see that Nutch correctly guesses the content-type as text/html
(it uses the HtmlParser)
4. At this step, all is OK.
5. Then start your Tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the
index-more plugin is activated, you can see that the stored
content-type is application/zip and not text/html

What I suggest is simply to store the content-type that Nutch used to choose
the parser, instead of the one returned by the server.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 I'm not sure if that is the right thing.
 If the site administrator did a poor job and a wrong media type is
 advertised, it's the site's problem and Nutch shouldn't be fixing it, in my
 opinion.  Those sites would not work properly with browsers anyway, and
 Nutch doesn't need to work properly with them, except that it should
 protect itself from crashing.  I tried to visit your fake.zip page with IE
 and Firefox, and both faithfully trusted the media type as advertised by
 the server, and asked me if I wanted to open it with WinZip or save it;
 there was no option to open it as HTML.
 Why should Nutch treat it as HTML?

Simply because it is an HTML file, with a strange name, of course, but it is
an HTML file.
My example is a kind of caricature. A more realistic case would be an
HTML file served with a text/plain content-type, or with a text/xml one.
Finally, it is good news that Nutch seems to be more intelligent at
content-type guessing than Firefox or IE, no?

Jérôme



Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting

Jérôme Charron wrote:

Finally, it is good news that Nutch seems to be more intelligent at
content-type guessing than Firefox or IE, no?


I'm not so sure.  When crawling Apache we had trouble with this feature.
Some HTML files that had an XML header, and that the server identified as
text/html, Nutch decided to treat as XML, not HTML.  We had to turn off
the guessing of content types to index Apache correctly.  I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.


Doug


Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back to this issue:
The Content object holds two content-types:
1. The raw content-type from the protocol layer (the HTTP header in the case
of HTTP), in the Content's metadata
2. The guessed content-type, in a private field content-type.

When a ParseData object is created, it takes only the Content's metadata.
So the ParseData can only access the raw content-type, not the guessed
one.

What I suggest is one of the following:
1. Add a content-type parameter to the ParseData constructors (so that
Parsers can pass the guessed content-type to ParseData).
2. Have the Content object store the guessed content-type in its metadata, in
a special attribute named, for instance, GUESSED_CONTENT_TYPE, so that the
ParseData can access it.

I think 1. is really the cleanest way to implement this, but a lot of
code is impacted: all the parsers.
Solution 2. has no impact on APIs, so the code changes are very small.
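
A rough sketch of solution 2, under stated assumptions: ContentSketch and ParseDataSketch are simplified stand-ins written for this illustration, not the real Nutch Content and ParseData classes, and the metadata is modelled as a plain string map. Only the attribute name GUESSED_CONTENT_TYPE comes from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for Content; hypothetical, not the Nutch class. */
final class ContentSketch {
    static final String CONTENT_TYPE = "Content-Type";
    // Attribute name suggested in the proposal above.
    static final String GUESSED_CONTENT_TYPE = "GUESSED_CONTENT_TYPE";

    private final Map<String, String> metadata = new HashMap<>();

    ContentSketch(String rawType, String guessedType) {
        // The raw type comes from the protocol layer (the HTTP header for HTTP).
        metadata.put(CONTENT_TYPE, rawType);
        // Solution 2: the guessed type travels in the metadata as well.
        metadata.put(GUESSED_CONTENT_TYPE, guessedType);
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}

/** Simplified stand-in for ParseData; it only ever sees the metadata. */
final class ParseDataSketch {
    private final Map<String, String> metadata;

    ParseDataSketch(Map<String, String> contentMetadata) {
        this.metadata = new HashMap<>(contentMetadata);
    }

    /** Prefers the guessed type and falls back to the raw one. */
    String getContentType() {
        String guessed = metadata.get(ContentSketch.GUESSED_CONTENT_TYPE);
        return guessed != null ? guessed : metadata.get(ContentSketch.CONTENT_TYPE);
    }
}
```

The constructor signature of the ParseData stand-in is unchanged by this approach, which is why no parser code would need to be touched.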

Suggestions? Comments?

Jérôme
