On 2017-06-14 00:18:44 +0200, Vincent Lefevre wrote:
> I don't think that libmagic is a good idea. It has too many false
> positives; by that, I mean that it can say that some file has
> some specific MIME type (e.g. text/x-c) while it is actually
> just text that may have some lines of C in a small part of it.
> Many bugs have been reported and most of them are still not fixed.

In particular, the UTF-8 code in libmagic is too permissive.  It claims
UTF-8 even if the file ends with an incomplete byte sequence.  Moreover,
its notion of UTF-8 has been obsolete since 2003, when RFC 3629 tightened
the definition: it still allows the old 5- and 6-byte sequences and
doesn't enforce minimality, so overlong encodings pass.  This has been
reported.

A UTF-8 checker handles 10GB of text rather rapidly.  The code at
   http://marc.info/?l=mutt-dev&m=149687650903097
treats control characters conservatively, blacklisting NUL and many
others.  (It does need more work to handle large files well: an
incomplete byte sequence of 1-3 bytes at the end of one buffer-load
of input should be carried over to the beginning of the next
buffer-load.)
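One way to handle that buffer-boundary point, as a sketch under stated
assumptions (the names and the chunked-reading structure are mine, not
the code at the URL above): validate each buffer-load, and when the
load ends inside a sequence, shift those 1-3 bytes to the front of the
buffer before the next read.

#include <stdio.h>
#include <string.h>

#define CARRY_MAX 3     /* at most 3 leading bytes of a 4-byte sequence */
#define CHUNK     65536

/* Returns -1 on invalid input; otherwise the number of trailing bytes
 * (0..3) that begin a sequence running past the end of the buffer.
 * Overlong/surrogate strictness is omitted here for brevity; see the
 * earlier sketch. */
static int validate_utf8_chunk(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;

        if (b < 0x80)                 need = 1;
        else if ((b & 0xE0) == 0xC0)  need = 2;
        else if ((b & 0xF0) == 0xE0)  need = 3;
        else if ((b & 0xF8) == 0xF0)  need = 4;
        else                          return -1;  /* stray continuation
                                                     or 5/6-byte lead */
        if (i + need > len)
            return (int)(len - i);    /* incomplete tail: defer judgment */
        for (size_t k = 1; k < need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return -1;
        i += need;
    }
    return 0;
}

/* Returns 1 if the whole stream is UTF-8, 0 otherwise. */
static int check_stream(FILE *fp)
{
    unsigned char buf[CARRY_MAX + CHUNK];
    size_t carry = 0, n;

    while ((n = fread(buf + carry, 1, CHUNK, fp)) > 0) {
        size_t total = carry + n;
        int tail = validate_utf8_chunk(buf, total);

        if (tail < 0)
            return 0;                 /* invalid: fail early */
        /* move the incomplete tail to the front of the buffer so the
         * next fread() completes the sequence */
        memmove(buf, buf + total - (size_t)tail, (size_t)tail);
        carry = (size_t)tail;
    }
    return carry == 0;   /* bytes left over at EOF = truncated sequence */
}

At end of file, a non-empty carry means the data ended with an
incomplete sequence, which is precisely the case libmagic gets wrong.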

Any encoding applied to a file will swamp a properly written UTF-8
check.  The check fails early with high probability if the input is
not UTF-8 (on random binary data it typically fails within the first
few bytes, since only a quarter of all byte values are valid
continuation bytes), whereas the encoding must touch every character.
So arguments about huge attachments are only relevant if every byte of
the attachment really does need to be checked (i.e. it really is
UTF-8), and for some reason the checker does more work per byte than
the encoder.

> IMHO, the rules should be (step by step):

This sequence of rules seems sensible.

-- Andras Salamon                   [email protected]
