On 2013-08-12 15:16:59 +0200, Adam Borowski wrote:
> On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> > On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > > Detecting non-UTF files is easy:
> > > * false positives are impossible
> > > * false negatives are extremely unlikely: combinations of letters
> > >   that would happen to match a valid utf character don't happen
> > >   naturally, and even if they did, every single combination in the
> > >   file tested would need to match valid utf.
> >
> > Not that unlikely, and it is rather annoying that Firefox (and
> > therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
> > IMHO, in case of ambiguity, UTF-8 should always be preferred by
> > default (applications could have options to change the preferences).
>
> That's the opposite of what I'm talking about: it is hard to reliably
> detect ancient encodings, because they tend to assign a character to
> every possible bit stream. On the other hand, only certain combinations
> of bytes with the 8th bit set are valid UTF-8, and thus it is possible
> to detect UTF-8 with good accuracy. It is obviously trivial to fool
> such detection deliberately, but such combinations don't happen in real
> languages, and thus if something validates as UTF-8, it is safe to
> assume it indeed is.
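The detection idea described above can be sketched in a few lines of Python (a hypothetical helper, not anything Firefox or Debian tooling actually uses): strict UTF-8 validation rejects almost any non-ASCII text produced by a legacy 8-bit encoding, while accepting all genuine UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Illustrative sketch of UTF-8 detection by strict validation.

    Legacy 8-bit encodings (Latin-1, TIS-620, ...) assign a character
    to nearly every byte value, so their output with the 8th bit set
    rarely forms the specific lead-byte/continuation-byte sequences
    that UTF-8 requires.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café" encoded as UTF-8 validates; the same text encoded as
# Latin-1 (b'caf\xe9') does not, because the lone 0xE9 byte is an
# incomplete UTF-8 sequence.
print(looks_like_utf8("café".encode("utf-8")))    # True
print(looks_like_utf8("café".encode("latin-1")))  # False
```

Pure ASCII passes too, of course, which is fine: ASCII is a subset of UTF-8, so the ambiguity is harmless there.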
I don't know the exact cause that makes Firefox recognize some files as
TIS-620 instead of UTF-8, but it is being fooled, and not deliberately.

> > > On the other hand, detecting text files is hard.
> >
> > Deciding whether a file is a text file may be hard even for a human.
> > What about text files with ANSI control sequences?
>
> Same as, say, a Word97 document: not text for my purposes. It might be
> just coloured plain text, but there is no generic way to handle that.

I think I've already seen such files distributed as plain-text
documentation, or perhaps they just used backspace characters to get
bold (x\bx) and underline (x\b_). The less utility can handle them.

> > I think better questions could be: why do you want to regard a file
> > as text? For what purpose(s)? For the "all shipped text files in
> > UTF-8" rule only?
>
> A shipped config file will have some settings the user may edit and
> comments he may read. Being able to see what's going on is a
> prerequisite here.

However, some config files may be byte-oriented (like procmailrc, AFAIK).

> HTML can include http-equiv, which takes care of rendering, but
> editing is still a problem. And if you edit it, or, say, fill in some
> fields from a database, you risk data loss. If everything is UTF-8
> end-to-end, this risk goes away. (I do care about plain text more,
> though.)

You may still have NFC/NFD problems (this is also true for filenames).

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20130812172807.ga2...@ioooi.vinc17.net
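The NFC/NFD problem mentioned in the message can be shown with Python's standard unicodedata module: two strings that render identically, and are both valid UTF-8 when encoded, still differ byte-for-byte until they are normalized to the same form.

```python
import unicodedata

# "é" as one precomposed code point (NFC form) versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
nfc = "\u00e9"
nfd = "e\u0301"

# Both are perfectly valid UTF-8 once encoded, yet they are not equal,
# which is exactly why end-to-end UTF-8 does not remove this risk
# (e.g. for filenames, where some systems store NFD and others NFC).
print(nfc == nfd)                                   # False
print(unicodedata.normalize("NFC", nfd) == nfc)     # True
print(unicodedata.normalize("NFD", nfc) == nfd)     # True
```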