Hi Markus,

Jörg Schaible wrote on Wednesday, April 19, 2006 8:46 AM:

> Hi Markus,
> 
> Markus Härnvi wrote on Wednesday, April 19, 2006 8:47 AM:
> 
>> Hi!
>> 
>>> Starting from scratch would possibly be the best anyway. I
>>> had it also on my todo list at a very low priority ... but
>>> only because I found that jMimeMagic has a really bad
>>> implementation - extremely slow and not working correctly. I
>>> have a good pile of image files it does not detect. The main
>>> reason is that the implementation is simply wrong. The
>>> original magic files have a clear idea of precedence of
>>> patterns - this has been lost completely in the
>>> conversion/implementation of jMimeMagic.
>>> 
>>> - Jörg
>>> 
>> 
>> Using the original magic file and parsing it in Java also makes it
>> easier to keep it updated. Just add the newest magic file to the jar
>> file and we are done.
> 
> That would have been my approach also. I was just not sure
> whether we should bundle the magic file or try to locate it
> (this is the interesting part and highly system dependent).
> And a user might have an additional magic file in their home
> directory - at least this can be located.

After looking into the magic files (magic and magic.mime) I am somewhat
disappointed. While file magic is good at binary formats with fixed headers,
its definition language is poor for string-based formats, e.g. the rules for
detecting XML and XSL:

===== %< =====
0       string/cb       \<?xml                  XML document text
0       string          \<?xml\ version "       XML
0       string          \<?xml\ version="       XML
>15     string          >\0                     %.3s document text
>>23    string          \<xsl:stylesheet        (XSL stylesheet)
>>24    string          \<xsl:stylesheet        (XSL stylesheet)
0       string/b        \<?xml                  XML document text
0       string/cb       \<?xml                  broken XML document text
===== %< =====

This is quite poor. The second rule matches invalid XML (there is no "=" after
"version"). The nested rules look at offset 23 or 24 for "<xsl:stylesheet",
totally ignoring the fact that the offset might be quite different if the XML
declaration contains an encoding attribute, or depending on the whitespace and
line endings. See the detection of XML MIME formats:

===== %< =====
0       string          \<?xml
>38     string          \<\!DOCTYPE\040svg      image/svg+xml
0       string          \<?xml                  text/xml
===== %< =====

Again, I am quite sure that a lot of SVG documents are not recognized.

The main problem is that the format specification cannot deal with
variable-length content. See "man magic" for the format definition. You cannot
express that a file with an XML declaration followed by a non-empty line with a
DOCTYPE declaration for SVG is "image/svg+xml".
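
To illustrate the difference, here is a rough Java sketch. It is not taken
from any existing library; the class and method names (SvgSniffer,
fixedOffsetMatch, variableOffsetMatch) are made up for this mail. The first
method mimics the fixed-offset rule from magic.mime, the second skips the XML
declaration and any whitespace before looking for the DOCTYPE. It assumes a
single-byte ASCII-compatible encoding and ignores BOMs, comments and
processing instructions, so take it as an idea only.

```java
import java.nio.charset.StandardCharsets;

final class SvgSniffer {
    private static final String XML_DECL = "<?xml";
    private static final String SVG_DOCTYPE = "<!DOCTYPE svg";

    /** Mimics the magic.mime rule: "<?xml" at offset 0, DOCTYPE at offset 38. */
    static boolean fixedOffsetMatch(byte[] head) {
        // ISO-8859-1 maps bytes to chars 1:1, so char index == byte offset.
        String s = new String(head, StandardCharsets.ISO_8859_1);
        return s.startsWith(XML_DECL) && s.startsWith(SVG_DOCTYPE, 38);
    }

    /** Skips the XML declaration and following whitespace, then checks the DOCTYPE. */
    static boolean variableOffsetMatch(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        if (!s.startsWith(XML_DECL)) {
            return false;
        }
        int end = s.indexOf("?>"); // end of the XML declaration
        if (end < 0) {
            return false;
        }
        int pos = end + 2;
        // Tolerate any mix of spaces, tabs, LF and CRLF after the declaration.
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return s.startsWith(SVG_DOCTYPE, pos);
    }
}
```

A file starting with <?xml version="1.0" standalone="no"?> plus a single LF
puts the DOCTYPE exactly at offset 38, so both methods accept it. With
<?xml version="1.0" encoding="UTF-8"?> plus LF the DOCTYPE moves to offset 39,
and only the second method still recognizes the file.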

Bottom line: I am no longer sure whether MIME detection based on the
definitions of file magic is really a good idea :-/

- Jörg
