Chris Angelico writes:

 > Can anyone give an example of a current system encoding (ie one that
 > is likely to be the default currently used by open()) that can have
 > byte values below 128 which do NOT mean what they would mean in ASCII?
 > In other words, is it possible to read in a section of a file, think
 > that it's ASCII, and then find that you decoded it wrongly?

Japanese Shift JIS, as mentioned by Richard.  The Japanese just
redefine the glyph for byte 0x5C, which Windows paths and character
escapes use as the backslash, to be the yen sign.  So it's a total
muddle, because they also use that byte for the actual yen sign.  They
also use a broken vertical bar for the pipe symbol, but the visual
similarity there is so strong that you have to know a *lot* about
Japanese computing to realize that they're different characters (they
are, in JIS, but nobody cares -- there's almost never a reason to use
both).
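
To make the byte-level point concrete, here is a minimal sketch.  It
assumes CPython's bundled "shift_jis" codec, and the helper name
low_bytes is mine, purely for illustration.  It shows a second way
Shift JIS puts bytes below 128 to non-ASCII use: the trail byte of a
two-byte character can be 0x5C, the same byte that is shown as the yen
sign (or backslash) when it stands alone.

```python
def low_bytes(text, encoding="shift_jis"):
    """Return the encoded bytes of `text` that fall in the ASCII range (< 128)."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


kanji = "表"  # U+8868, a very common character
encoded = kanji.encode("shift_jis")

print(encoded)           # two bytes; the second one is 0x5C
print(low_bytes(kanji))  # the byte an ASCII-minded reader takes for a backslash

# Misreading the same bytes as ASCII turns one kanji into junk plus "\".
print(encoded.decode("ascii", errors="replace"))
```

So a scanner that looks for 0x5C in Shift JIS bytes without tracking
lead bytes can find "backslashes" that are really half of a kanji.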

 > I'm assuming here that there is a *single* default system encoding,
 > meaning that the automatic handler has only three cases to worry
 > about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system
 > encoding.

Sure, that handles a lot of cases ... but the vast majority are already
handled with just the system encoding and UTF-8.  In my experience the
UTF-16 cases are not going to be the majority of what's left.  YMMV.

 > Right, but as long as there's only one system encoding, that's not
 > our problem. If you're on a Greek system and you want to decode
 > ISO-8859-9 text, you have to state that explicitly. For the
 > situations where you want heuristics based on byte distributions,
 > there's always chardet.

But that's the big question.  If you're just going to fall back to
chardet, you might as well start there.  No?  Consider: if 'open'
detects the encoding for you, *you can't find out what it is*.  'open'
has no facility to tell you!

As somebody else pointed out, if you're writing a text editor,
autodetection makes a lot of sense.  You just provide a facility for
the user to choose something different and reread the file.  But if
you're running non-interactive, it's much harder to recover -- and
'open' can't do it for you.
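
A rough sketch of doing the three-case check in the caller instead, so
that the program knows which encoding was picked and can log it or
retry.  The names guess_encoding and fallback are hypothetical, not an
existing API, and the sketch deliberately ignores UTF-32 BOMs and
BOM-less UTF-16.

```python
import codecs
import locale


def guess_encoding(raw, fallback=None):
    """Pick an encoding for `raw` (bytes) using the three cases quoted above."""
    # Case 1: UTF-16 announced by a byte order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Case 2: UTF-8, with or without a signature; pure ASCII lands here too.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Case 3: whatever the system default is (or the caller's override).
        return fallback or locale.getpreferredencoding(False)
    return "utf-8"


data = "naïve café\n".encode("utf-8")
encoding = guess_encoding(data)
text = data.decode(encoding)
print(encoding, repr(text))  # the caller can see, log, or override the choice
```

Because the guess is an ordinary return value, a non-interactive
program can at least record it next to its output, which is exactly
what a detecting 'open' would not give you.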

 > > Program source code where the higher-level functions (likely to
 > > contain literal strings) come late in the file are frequently
 > > misdetected based on the earlier bytes.
 > 
 > Yup; and the real question is whether anything would have been decoded
 > incorrectly.

If I recall correctly, several Latin-1 characters have UTF-8 encodings
that are plausible Windows 125x digraphs.  So, yes, it's quite possible.
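
One such pair, as a small illustration (my choice of character, using
the cp1252 codec that ships with Python): the UTF-8 encoding of "é" is
two bytes that are both printable letters or symbols in Windows-1252,
so the same bytes decode without error either way.

```python
utf8_bytes = "é".encode("utf-8")         # two bytes: 0xC3 0xA9
as_cp1252 = utf8_bytes.decode("cp1252")  # both bytes are printable in cp1252
print(repr(as_cp1252))

# The reverse direction: cp1252 text that happens to contain this digraph
# is also valid UTF-8, so a "try UTF-8 first" detector accepts it silently.
cp1252_bytes = "Ã©".encode("cp1252")
print(repr(cp1252_bytes.decode("utf-8")))
```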

 > If you read in a bunch of ASCII-only text and yield it to
 > the app, and then come across something that proves that the file is
 > not UTF-8, then as far as I am aware, you won't have to un-yield any
 > of the previous text - it'll all have been correctly decoded.

Not if it's UTF-16.  And again, if you put the detection logic in
'open', once you've yielded anything to the main logic *it's too late
to change your mind*.
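
A small sketch of the UTF-16 case, using a made-up one-line source
file encoded without a BOM.  Every byte is below 128, so ASCII and
UTF-8 decoding both succeed, and both hand the application the wrong
text.

```python
raw = "print('hi')\n".encode("utf-16-le")  # no BOM written

as_ascii = raw.decode("ascii")      # succeeds: every byte is < 128
as_utf8 = raw.decode("utf-8")       # succeeds too, with the same result
as_utf16 = raw.decode("utf-16-le")  # what the author of the file meant

print(repr(as_ascii))        # every other character is a NUL
print(as_ascii == as_utf16)  # the "successful" decode was the wrong one
```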

 > In theory, UTF-16 without a BOM can consist entirely of byte values
 > below 128,

It's not just theory, it's my life.  In UTF-16, 62/80 of the Japanese
"hiragana" syllabary encode as 2 printing ASCII characters (including SPC).
A large fraction of the Han ideographs satisfy that condition, and I
wouldn't be surprised if a majority of the 1000 most common ones do.
(Not a good bet because half of the ideographs have a low byte > 127,
but the order of characters isn't random, so if you get a couple of
popular radicals that have 50 or so characters in a group in that
range, you'd be much of the way there.)
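
The hiragana count can be checked with a few lines.  This is a sketch
under one assumption of mine: "hiragana" is taken to mean the letters
U+3041 through U+3096, so the total it prints depends on that choice
and need not match the denominator above.  The helper name is
hypothetical.

```python
def looks_like_ascii(ch):
    """True if both UTF-16 bytes of `ch` are printing ASCII (0x20..0x7E)."""
    data = ch.encode("utf-16-be")
    return len(data) == 2 and all(0x20 <= b <= 0x7E for b in data)


# Assumption: the syllabary is the run of letters U+3041..U+3096.
hiragana = [chr(cp) for cp in range(0x3041, 0x3097)]
ascii_like = [ch for ch in hiragana if looks_like_ascii(ch)]

print(len(ascii_like), "of", len(hiragana))
print("あ".encode("utf-16-be"))  # the two bytes read as the ASCII text "0B"
```

The high byte of every one of these is 0x30, ASCII "0", so only the
low byte decides whether a character passes for two ASCII characters.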

 > But there's no solution to that,

Well, yes, but that's my line. ;-)
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/CIBX3EFFW2OMFUXQ4KPUJ4OZIYMQK5PH/