On 1/24/21 1:18 PM, MRAB wrote:
> On 2021-01-24 17:04, Chris Angelico wrote:
>> On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull
>> <turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>>>
>>> Chris Angelico writes:
>>> > Right, but as long as there's only one system encoding, that's not
>>> > our problem. If you're on a Greek system and you want to decode
>>> > ISO-8859-9 text, you have to state that explicitly. For the
>>> > situations where you want heuristics based on byte distributions,
>>> > there's always chardet.
>>>
>>> But that's the big question. If you're just going to fall back to
>>> chardet, you might as well start there. No? Consider: if 'open'
>>> detects the encoding for you, *you can't find out what it is*. 'open'
>>> has no facility to tell you!
>>
>> Isn't that what file objects have attributes for? You can find out,
>> for instance, what newlines a file uses, even if it's being
>> autodetected.
>>
>>> > In theory, UTF-16 without a BOM can consist entirely of byte values
>>> > below 128,
>>>
>>> It's not just theory, it's my life. 62/80 of the Japanese "hiragana"
>>> syllabary is composed of 2 printing ASCII characters (including SPC).
>>> A large fraction of the Han ideographs satisfy that condition, and I
>>> wouldn't be surprised if a majority of the 1000 most common ones do.
>>> (Not a good bet because half of the ideographs have a low byte > 127,
>>> but the order of characters isn't random, so if you get a couple of
>>> popular radicals that have 50 or so characters in a group in that
>>> range, you'd be much of the way there.)
>>>
>>> > But there's no solution to that,
>>>
>>> Well, yes, but that's my line. ;-)
>>>
>>
>> Do you get files that lack the BOM? If so, there's fundamentally no
>> way for the autodetection to recognize them. That's why, in my
>> quickly-whipped-up algorithm above, I basically had it assume that no
>> BOM means not UTF-16. After all, there's no way to know whether it's
>> UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
>> of it), so IMO it's not unreasonable to assert that all files that
>> don't start with either b"\xFF\xFE" or b"\xFE\xFF" should be decoded
>> using the ASCII-compatible detection method.
>>
>> (Of course, this is *ONLY* if you don't specify an encoding. That part
>> won't be going away.)
>>
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
> probably UTF16-BE, and if you see patterns like
> b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
>
> You could also look for, say, sequences of Latin characters and
> sequences of Han characters.
>
Yes, if you happen to see that sort of pattern you could perhaps make a
guess, but since part of the goal is to avoid reading very far into the
file, it isn't a reliable test for confirming a UTF-16 file when the
file doesn't happen to begin with Latin characters.
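For concreteness, here is a rough sketch of the kind of detection being
discussed: BOM first, then the null-byte pattern heuristic, then an
ASCII-compatible fallback. This is not anyone's actual proposal; the
function name detect_encoding(), the 64-byte sample size, and the
latin-1 fallback are all illustrative assumptions.

    import codecs

    def detect_encoding(sample: bytes) -> str:
        """Guess an encoding from the first few bytes of a file (sketch only)."""
        # A BOM is the only reliable signal for UTF-16.
        if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"      # the utf-16 codec consumes the BOM itself
        if sample.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        # Null-byte pattern heuristic: ASCII text encoded as UTF-16 has a
        # zero byte in every other position. As noted above, this fails
        # when the file doesn't begin with Latin characters.
        if len(sample) >= 2:
            if all(b == 0 for b in sample[0::2]):
                return "utf-16-be"
            if all(b == 0 for b in sample[1::2]):
                return "utf-16-le"
        # No BOM: assume an ASCII-compatible encoding, trying UTF-8 first.
        try:
            sample.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "latin-1"    # or the locale's preferred encoding

    with open("example.txt", "rb") as f:       # hypothetical file name
        enc = detect_encoding(f.read(64))
    with open("example.txt", encoding=enc) as f:
        text = f.read()

Note that the heuristic branch only fires when the entire sample fits
the alternating-zero pattern, which illustrates the objection: a short
sample of non-Latin UTF-16 text gives it nothing to work with.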
--
Richard Damon