On 1/24/21 1:18 PM, MRAB wrote:
> On 2021-01-24 17:04, Chris Angelico wrote:
>> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
>> <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>>>
>>> Chris Angelico writes:
>>>  > Right, but as long as there's only one system encoding, that's not
>>>  > our problem. If you're on a Greek system and you want to decode
>>>  > ISO-8859-9 text, you have to state that explicitly. For the
>>>  > situations where you want heuristics based on byte distributions,
>>>  > there's always chardet.
>>>
>>> But that's the big question.  If you're just going to fall back to
>>> chardet, you might as well start there.  No?  Consider: if 'open'
>>> detects the encoding for you, *you can't find out what it is*.  'open'
>>> has no facility to tell you!
>>
>> Isn't that what file objects have attributes for? You can find out,
>> for instance, what newlines a file uses, even if it's being
>> autodetected.
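
For illustration, a minimal sketch of those attributes (the filename
is made up; this is just CPython's existing io.TextIOWrapper
behaviour, not any new detection machinery):

    with open("example.txt") as f:   # hypothetical file
        f.read()
        print(f.encoding)   # the encoding the file object is using
        print(f.newlines)   # newline style(s) seen so far: None, '\n',
                            # '\r\n', '\r', or a tuple of those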
>>
>>>  > In theory, UTF-16 without a BOM can consist entirely of byte values
>>>  > below 128,
>>>
>>> It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
>>> syllabary is composed of 2 printing ASCII characters (including SPC).
>>> A large fraction of the Han ideographs satisfy that condition, and I
>>> wouldn't be surprised if a majority of the 1000 most common ones do.
>>> (Not a good bet because half of the ideographs have a low byte > 127,
>>> but the order of characters isn't random, so if you get a couple of
>>> popular radicals that have 50 or so characters in a group in that
>>> range, you'd be much of the way there.)
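
As a quick worked example of that (purely illustrative): hiragana sit
at U+3041..U+3096, so in UTF-16-BE the high byte is always 0x30 ("0"),
and for code points up to U+307E the low byte is also a printing ASCII
code:

    word = "かな"                    # two hiragana characters
    raw = word.encode("utf-16-be")   # no BOM
    print(raw)                       # b'0K0j' -- looks like plain ASCII
    print(all(0x20 <= b < 0x7f for b in raw))   # True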
>>>
>>>  > But there's no solution to that,
>>>
>>> Well, yes, but that's my line. ;-)
>>>
>>
>> Do you get files that lack the BOM? If so, there's fundamentally no
>> way for the autodetection to recognize them. That's why, in my
>> quickly-whipped-up algorithm above, I basically had it assume that no
>> BOM means not UTF-16. After all, there's no way to know whether it's
>> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
>> of it), so IMO it's not unreasonable to assert that all files that
>> don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using
>> the ASCII-compatible detection method.
>>
>> (Of course, this is *ONLY* if you don't specify an encoding. That part
>> won't be going away.)
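
A rough sketch of that rule (not the actual algorithm from the earlier
message, just the BOM part of it; the fallback codec is a placeholder):

    import codecs

    def guess_encoding(prefix: bytes) -> str:
        # Trust a UTF-16 BOM; otherwise assume something ASCII-compatible.
        if prefix.startswith(codecs.BOM_UTF16_LE):   # b"\xff\xfe"
            return "utf-16-le"
        if prefix.startswith(codecs.BOM_UTF16_BE):   # b"\xfe\xff"
            return "utf-16-be"
        return "utf-8"   # stand-in for the ASCII-compatible detection step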
>>
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
> probably UTF16-BE and if you see patterns like
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
>
> You could also look for, say, sequences of Latin characters and
> sequences of Han characters.
>
Yes, if you happen to see that sort of pattern you could perhaps make a
guess, but since part of the goal is to avoid reading far ahead in the
file, it isn't a very reliable test for confirming a UTF-16 file when
the text doesn't begin with Latin-1 characters.
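
For what it's worth, a naive version of that null-byte heuristic over a
short prefix might look something like this (illustrative only, and it
shows the problem: with only a small read-ahead of non-Latin text there
may be no NUL bytes at all to count):

    def sniff_utf16(prefix: bytes):
        # Count NUL bytes at even vs. odd offsets in the sample.
        even = prefix[0::2].count(0)
        odd = prefix[1::2].count(0)
        if even > odd and even > len(prefix) // 4:
            return "utf-16-be"    # high (zero) byte comes first
        if odd > even and odd > len(prefix) // 4:
            return "utf-16-le"
        return None               # no evidence either way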

-- 
Richard Damon