On 2017-06-14 00:18:44 +0200, Vincent Lefevre wrote:
> I don't think that libmagic is a good idea. It has too many false
> positives; by that I mean that it can claim a file has some specific
> MIME type (e.g. text/x-c) when it is actually just plain text that
> happens to contain a few lines of C. Many bugs have been reported,
> and most of them are still not fixed.
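For reference, the kind of libmagic call at issue is roughly the
following; a minimal sketch using the documented
magic_open/magic_load/magic_file interface, with error handling kept
short:

    /* Ask libmagic for a MIME type; this is the call whose
     * answers are being questioned above. */
    #include <stdio.h>
    #include <magic.h>

    int main(int argc, char **argv)
    {
        magic_t m = magic_open(MAGIC_MIME_TYPE);
        if (m == NULL || magic_load(m, NULL) != 0) {
            fprintf(stderr, "libmagic init failed\n");
            return 1;
        }
        /* May report e.g. text/x-c for plain text that merely
         * contains a few C-like lines. */
        const char *type = (argc > 1) ? magic_file(m, argv[1]) : NULL;
        printf("%s\n", type ? type : "(no result)");
        magic_close(m);
        return 0;
    }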
> In particular, the UTF-8 code in libmagic is too permissive: it
> claims UTF-8 even if the file ends with an incomplete code sequence.
> Moreover, its notion of UTF-8 has been obsolete since 2003, when
> RFC 3629 restricted sequences to four bytes: libmagic still allows
> 6-byte sequences and does not enforce minimality (shortest-form
> encoding). This has been reported.

A properly written UTF-8 checker handles 10GB of text rather rapidly.
The code at http://marc.info/?l=mutt-dev&m=149687650903097 is
conservative about blacklisted control characters, rejecting NUL but
also many others. (It does need more work to handle large files well:
an incomplete code sequence of 1-3 bytes at the end of one buffer-load
of input should be carried over to the beginning of the next
buffer-load.)

Any encoding applied to a file will swamp a properly written UTF-8
check. The check fails early with high probability if the input is not
UTF-8, while the encoding must be applied to every character. So
arguments about huge attachments are only relevant if each byte of the
attachment really does need to be checked (i.e. it really is UTF-8),
and for some reason the checker does more work per byte than the
encoder.
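To make that concrete, here is a minimal sketch of a strict, streaming
UTF-8 checker along the lines of the code linked above (it is not that
code, just an illustration; the names u8state, u8_check, u8_finish and
blacklisted are mine). It enforces RFC 3629 rules: at most four bytes
per sequence, shortest form only, no UTF-16 surrogates, nothing above
U+10FFFF. It fails early on the first bad byte, and an incomplete
trailing sequence is kept in the state so it carries over to the next
buffer-load. The control-character blacklist below is a placeholder;
the real list is in the linked patch.

    #include <stddef.h>

    struct u8state {
        unsigned cp;   /* code point being assembled */
        int need;      /* continuation bytes still expected */
        unsigned min;  /* smallest code point legal for this length */
    };

    /* Placeholder blacklist: NUL and the other C0 controls except
     * TAB, LF and CR.  The linked patch uses its own list. */
    static int blacklisted(unsigned cp)
    {
        return cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';
    }

    /* Feed one buffer-load.  Returns 0 as soon as the input cannot
     * be UTF-8, 1 if it is valid so far.  An incomplete sequence at
     * the end of the buffer stays in *st, so the caller just calls
     * again with the next buffer-load. */
    static int u8_check(struct u8state *st, const unsigned char *buf,
                        size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char c = buf[i];
            if (st->need == 0) {
                if (c < 0x80) {
                    if (blacklisted(c))
                        return 0;
                } else if ((c & 0xE0) == 0xC0) {
                    st->cp = c & 0x1F; st->need = 1; st->min = 0x80;
                } else if ((c & 0xF0) == 0xE0) {
                    st->cp = c & 0x0F; st->need = 2; st->min = 0x800;
                } else if ((c & 0xF8) == 0xF0) {
                    st->cp = c & 0x07; st->need = 3; st->min = 0x10000;
                } else {
                    return 0;  /* stray continuation, or 5/6-byte lead */
                }
            } else {
                if ((c & 0xC0) != 0x80)
                    return 0;  /* not a continuation byte */
                st->cp = (st->cp << 6) | (c & 0x3F);
                if (--st->need == 0) {
                    if (st->cp < st->min)
                        return 0;  /* overlong (non-minimal) form */
                    if (st->cp >= 0xD800 && st->cp <= 0xDFFF)
                        return 0;  /* UTF-16 surrogate */
                    if (st->cp > 0x10FFFF)
                        return 0;  /* beyond Unicode */
                    if (blacklisted(st->cp))
                        return 0;
                }
            }
        }
        return 1;
    }

    /* After the last buffer-load: a nonzero need means the file
     * ends with an incomplete sequence, which libmagic accepts and
     * this checker rejects. */
    static int u8_finish(const struct u8state *st)
    {
        return st->need == 0;
    }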
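Driving it over a file in fixed-size buffer-loads (building on the
sketch above) then looks like this, and shows why a non-UTF-8 file
costs almost nothing: the loop stops at the first offending byte,
while an encoder would have to touch every character.

    #include <stdio.h>

    /* Returns 1 if the stream is entirely valid UTF-8. */
    static int file_is_utf8(FILE *fp)
    {
        unsigned char buf[65536];
        struct u8state st = {0, 0, 0};
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            if (!u8_check(&st, buf, n))
                return 0;       /* fail early at the first bad byte */
        return u8_finish(&st);  /* reject incomplete trailing sequence */
    }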
> IMHO, the rules should be (step by step):
> [...]
This sequence of rules seems sensible.

-- Andras Salamon
[email protected]
