On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
>
> > I think that you are going to create a bug magnet if you attempt to auto
> > detect the encoding.
> >
> > First problem I see is that the file may be a pipe and then you will block
> > until you have enough data to do the auto detect.
>
> Can you use `open('filename')` to read a pipe?

Yes. You can even use it with stdin:

>>> open("/proc/self/fd/0").read(1)
a
'a'

The second line was me typing something, even though I was otherwise
at the REPL.
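
The same thing works with a named pipe on disk. A minimal sketch, assuming a POSIX system (os.mkfifo is not available on Windows); the helper name read_from_fifo and the file name demo_fifo are made up for illustration:

```python
import os
import tempfile
import threading


def read_from_fifo(payload: str) -> str:
    """Push payload through a named pipe and read it back with plain open()."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_fifo")  # hypothetical file name
        os.mkfifo(path)  # POSIX only

        def writer() -> None:
            # Opening a FIFO for writing blocks until a reader opens it too.
            with open(path, "w", encoding="utf-8") as f:
                f.write(payload)

        t = threading.Thread(target=writer)
        t.start()
        # Ordinary open() by filename; this blocks until the writer connects,
        # and read() blocks until the writer closes its end.
        with open(path, encoding="utf-8") as f:
            data = f.read()
        t.join()
        return data
```

Note that both the open() and the read() here can block, which is exactly the behaviour being discussed.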

> Is blocking a problem in practice? If you try to open a network file,
> that could block too, if there are network issues. And since you're
> likely to follow the open with a read, the read is likely to block. So
> over all I don't think that blocking is an issue.

Definitely could be a problem if you read too much just for the sake
of autodetection. It needs to be possible to do everything with an
absolute minimum of reading.
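
To illustrate what "minimum of reading" could look like, here is a rough sketch that only peeks at the first few bytes for a BOM. This is my own assumption about one possible approach, not a proposal from this thread; sniff_bom is a made-up name. It relies on io.BufferedReader.peek(), which does at most one read on the underlying stream and does not advance the position.

```python
import codecs
import io
from typing import Optional


def sniff_bom(buffered: io.BufferedReader) -> Optional[str]:
    """Return a codec name if the stream starts with a known BOM, else None."""
    # peek() may return more or fewer bytes than requested, so slice it.
    head = buffered.peek(4)[:4]
    # UTF-32 must be checked before UTF-16: the UTF-32-LE BOM begins with
    # the same two bytes as the UTF-16-LE BOM.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if head.startswith(bom):
            return name
    return None
```

Since nothing is consumed, the caller can still wrap the same buffered stream in an io.TextIOWrapper with whatever encoding it settles on. On a pipe the peek still blocks until some data arrives, but it never waits for a large sample.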

> > Second problem is that the first N bytes are all in ASCII and only later
> > do you see Windows code page signature (odd lack of utf-8 signature).
>
> UTF-8 is a strict superset of ASCII, so if the file is actually
> ASCII, there is no harm in using UTF-8.
>
> The bigger issue is if you have N bytes of pure ASCII followed by some
> non-UTF superset, such as one of the ISO-8859-* encodings. So you end up
> detecting what you think is ASCII/UTF-8 but is actually some legacy
> encoding. But if N is large, say 512 bytes, that's unlikely in practice.

There's no problem if you think it's ASCII, so the only problem would
be if you start thinking that it's UTF-8 and then discover that it
isn't. The scheme used by UTF-8 is designed such that this is highly
unlikely with random data or actual text in an eight-bit encoding, so
it's more likely to be broken UTF-8 than legit ISO-8859-X.
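
A quick way to see this for yourself (a small sketch; looks_like_utf8 is a made-up helper, and a strict decode of the whole buffer is just the simplest check I can think of):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is valid UTF-8, so guessing UTF-8 here does no harm.
print(looks_like_utf8(b"plain ascii"))             # True
# Real UTF-8 text passes.
print(looks_like_utf8("café".encode("utf-8")))     # True
# Latin-1 "é" is the lone byte 0xE9, a UTF-8 lead byte with no
# continuation bytes after it, so the decode fails.
print(looks_like_utf8("café".encode("latin-1")))   # False
# A false positive needs something like "Ã©" in Latin-1 (bytes C3 A9),
# which is rare in real text.
print(looks_like_utf8("Ã©".encode("latin-1")))     # True
```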

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MBBCCHLFHFHYPCS54AKOVOCA4ELBFNPD/
Code of Conduct: http://python.org/psf/codeofconduct/
