Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Paul Moore Fri, 02 Sep 2016 02:25:59 -0700

On 2 September 2016 at 03:35, Steve Dower <[email protected]> wrote:
> I'd need to test to be sure, but writing an incomplete code point should
> just truncate to before that point. It may currently raise OSError if that
> truncated to zero length, as I believe that's not currently distinguished
> from an error. What behavior would you propose?


For "correct" behaviour, you should retain the unwritten bytes, and
write them as part of the next call (essentially making the API
stateful, in the same way that incremental codecs work). I'm pretty
sure that truncating could cause actual problems; for example, I think invoke
(https://github.com/pyinvoke/invoke) gets byte streams from
subprocesses and dumps them direct to stdout in blocks (so could
easily end up splitting multibyte sequences). It's arguable that it
should be decoding the bytes from the subprocess and then re-encoding
them, but that gets us into "guess the encoding used by the
subprocess" territory.
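
To illustrate what "stateful, in the same way that incremental codecs
work" could look like, here is a minimal sketch. StatefulUtf8Writer is
a hypothetical name, not anything from the PEP or from invoke, and the
StringIO sink only stands in for the console. The one real API it
relies on is codecs.getincrementaldecoder, which holds back a trailing
partial sequence until the rest arrives.

```python
import codecs
import io


class StatefulUtf8Writer:
    """Hypothetical wrapper: accepts UTF-8 bytes, forwards only complete text."""

    def __init__(self, text_sink):
        self._sink = text_sink
        # The incremental decoder keeps any trailing partial sequence
        # internally and prepends it to the next chunk it is given.
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def write(self, data):
        text = self._decoder.decode(data)
        if text:
            self._sink.write(text)
        # Report every byte as consumed, including bytes still held back.
        return len(data)


sink = io.StringIO()
writer = StatefulUtf8Writer(sink)
euro = "€".encode("utf-8")            # three bytes: e2 82 ac

writer.write(b"price: " + euro[:1])   # block ends mid-character
held_back = sink.getvalue()           # only the complete part so far
writer.write(euro[1:] + b"\n")        # remainder arrives in the next block
result = sink.getvalue()
```

The caller sees every byte accepted, and nothing is lost when a block
boundary falls inside a character.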

The problem is that we're not going to simply drop some bad data in
the common case - it's not so much the dropping of the start of an
incomplete code point that bothers me, as the encoding error you hit
at the start of the *next* block of data you send. So people will get
random, unexplained, encoding errors.
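
A small sketch of that failure mode, under the assumption that each
write is handled with no memory of the previous one. stateless_write is
a hypothetical stand-in, not the actual console code; it uses a fresh
incremental decoder per call so that a trailing partial sequence is
silently discarded, which is the truncating behaviour described above.

```python
import codecs

euro = "€".encode("utf-8")        # three bytes: e2 82 ac
block1 = b"cost " + euro[:2]      # ends in the middle of the character
block2 = euro[2:] + b" each"      # begins with a stray continuation byte


def stateless_write(data):
    # A fresh decoder per call: any incomplete trailing sequence is
    # dropped when the decoder is thrown away.
    return codecs.getincrementaldecoder("utf-8")().decode(data)


first = stateless_write(block1)   # the partial character quietly vanishes

second = None
error = None
try:
    second = stateless_write(block2)
except UnicodeDecodeError as exc:
    # The *next* block is the one that blows up.
    error = exc
```

The first call appears to succeed, so the error surfaces on a block
that looks unrelated to the one that was actually split.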

I don't see an easy answer here other than a stateful API.

> Reads of less than four bytes fail instantly, as in the worst case we need
> four bytes to represent one Unicode character. This is an unfortunate
> reality of trying to limit it to one system call - you'll never get a full
> buffer from a single read, as there is no simple mapping between
> length-as-utf8 and length-as-utf16 for an arbitrary string.

And here - "read a single byte" is a not uncommon way of getting some
data. Once again see invoke:

https://github.com/pyinvoke/invoke/blob/master/invoke/platform.py#L147

used at

https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L548

I'm not saying that there's an easy answer here, but this *will* break
code. And actually, it's in violation of the documentation: see
https://docs.python.org/3/library/io.html#io.RawIOBase.read

"""
read(size=-1)

Read up to size bytes from the object and return them. As a
convenience, if size is unspecified or -1, readall() is called.
Otherwise, only one system call is ever made. Fewer than size bytes
may be returned if the operating system call returns fewer than size
bytes.

If 0 bytes are returned, and size was not 0, this indicates end of
file. If the object is in non-blocking mode and no bytes are
available, None is returned.
"""

You're not allowed to return 0 bytes if the requested size was not 0,
and you're not at EOF.
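
One possible way to honour that contract for small reads, sketched
under assumptions: SmallReadBuffer and its read_text callable are
hypothetical, with read_text(n) standing in for whatever returns up to
n decoded characters from the console. The idea is to keep leftover
encoded bytes between calls, so read(1) always returns a byte unless
the source really is at EOF, while still making at most one underlying
call per read.

```python
import io


class SmallReadBuffer:
    """Hypothetical raw reader over a text source (stand-in for the console)."""

    def __init__(self, read_text):
        # read_text(n) is an assumed callable returning up to n characters,
        # or "" at end of input.
        self._read_text = read_text
        self._pending = b""

    def read(self, size):
        if size == 0:
            return b""
        if not self._pending:
            # At most one underlying call per read(). Asking for `size`
            # characters may yield more than `size` bytes once encoded.
            self._pending = self._read_text(size).encode("utf-8")
        # Hand back what fits; keep the rest for the next call.
        chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk


source = io.StringIO("a€")
reader = SmallReadBuffer(source.read)

pieces = []
while True:
    piece = reader.read(1)
    if not piece:          # b"" now only ever means end of input
        break
    pieces.append(piece)
```

Here the three bytes of the euro sign come back one per call, and the
empty result appears only once the source is exhausted.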

Having said all this, I'm strongly +1 on the idea of this PEP; it
would be fantastic to resolve the above issues and get this in.

Paul
