On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
>
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> >
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
>
> Can you use `open('filename')` to read a pipe?

Yes. You can even use it with stdin:

>>> open("/proc/self/fd/0").read(1)
a
'a'

The second line was me typing something, even though I was otherwise
at the REPL.

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

Definitely could be a problem if you read too much just for the sake
of autodetection. It needs to be possible to do everything with an
absolute minimum of reading.

> > Second problem is that the first N bytes are all in ASCII and only later
> > do you see Windows code page signature (odd lack of utf-8 signature).
>
> UTF-8 is a strict superset of ASCII, so if the file is actually
> ASCII, there is no harm in using UTF-8.
>
> The bigger issue is if you have N bytes of pure ASCII followed by some
> non-UTF superset, such as one of the ISO-8859-* encodings. So you end up
> detecting what you think is ASCII/UTF-8 but is actually some legacy
> encoding. But if N is large, say 512 bytes, that's unlikely in practice.

There's no problem if you think it's ASCII, so the only problem would
be if you start thinking that it's UTF-8 and then discover that it
isn't. The scheme used by UTF-8 is designed such that this is highly
unlikely with random data or actual text in an eight-bit encoding, so
it's more likely to be broken UTF-8 than legit ISO-8859-X.

ChrisA
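
To illustrate the point about UTF-8's byte scheme, here is a minimal
sketch of checking whether a short prefix of a file is plausibly UTF-8.
It is not a proposal for how open() should behave. The helper name
looks_like_utf8 and the sample byte strings are made up for this
example; only the standard codecs module is used.

```python
import codecs


def looks_like_utf8(prefix: bytes) -> bool:
    """Return True if `prefix` is valid UTF-8, allowing a cut-off tail.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False means a multi-byte sequence that is merely
        # truncated at the end of the prefix is not treated as an error.
        decoder.decode(prefix, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Sample inputs (assumptions made for the example).
ascii_only = b"plain ASCII text\n"
utf8_text = "caf\u00e9\n".encode("utf-8")         # b'caf\xc3\xa9\n'
latin1_text = "caf\u00e9\n".encode("iso-8859-1")  # b'caf\xe9\n'
truncated = "\u20ac".encode("utf-8")[:2]          # first 2 of 3 bytes

for name, data in [
    ("ascii", ascii_only),
    ("utf-8", utf8_text),
    ("latin-1", latin1_text),
    ("truncated utf-8", truncated),
]:
    print(name, looks_like_utf8(data))
```

Pure ASCII passes trivially, which is harmless, as noted above. The
ISO-8859-1 sample fails because 0xE9 looks like the lead byte of a
three-byte sequence but is not followed by continuation bytes. A legacy
byte sitting at the very end of the prefix would still pass as
"incomplete", so a real detector would need to handle that case.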