Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8

eryk sun Mon, 05 Sep 2016 17:31:38 -0700

On Mon, Sep 5, 2016 at 9:45 PM, Steve Dower <steve.do...@python.org> wrote:
>
> So it works, though the behaviour is a little strange when you do it from
> the interactive prompt:
>
> >>> sys.stdin.buffer.raw.read(1)
> ɒprint('hi')
> b'\xc9'
> >>> hi
> >>> sys.stdin.buffer.raw.read(1)
> b'\x92'
> >>>
>
> What happens here is the raw.read(1) rounds one byte up to one character,
> reads the turned alpha, returns a single byte of the two byte encoded form
> and caches the second byte. Then interactive mode reads from stdin and gets
> the rest of the characters, starting from the print() and executes that.
> Finally the next call to raw.read(1) returns the cached second byte of the
> turned alpha.
>
> This is basically only a problem because the readline implementation is
> totally separate from the stdin object and doesn't know about the small
> cache (and for now, I think it's going to stay that way - merging readline
> and stdin would be great, but is a fairly significant task that won't make
> 3.6 at this stage).


It needs to read a minimum of 2 UTF-16 code units in case the first one
is a lead surrogate. It can use a length-2 WCHAR buffer and remember how
many bytes have been written (for the general case -- not specifically
for this case).

Example failure using your 3rd patch:

    >>> _ = write_console_input("\U00010000print('hi')\r\n");\
    ... raw_read(1)
    𐀀print('hi')
    b'\xef'
    >>>   File "<stdin>", line 1
        �print('hi')
             ^
    SyntaxError: invalid character in identifier
    >>> raw_read(1)
    b'\xbf'
    >>> raw_read(1)
    b'\xbd'

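The raw read captures only the lead surrogate and transcodes it as the
replacement character b'\xef\xbf\xbd' (U+FFFD), which is why the first
raw_read(1) returns b'\xef' and the later calls return the cached b'\xbf'
and b'\xbd'. Then PyOS_Readline captures the trail surrogate on its own
and also decodes it as the replacement character.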
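
To illustrate the split, here is a rough sketch in pure Python. It is
only a stand-in for the console code path: transcode_units is a
hypothetical helper, and Python's errors="replace" is assumed to behave
like the WideCharToMultiByte conversion for a lone surrogate.

```python
def transcode_units(units_le: bytes) -> bytes:
    """Convert UTF-16-LE code units to UTF-8, replacing anything invalid."""
    return units_le.decode("utf-16-le", errors="replace").encode("utf-8")

# U+10000 is encoded in UTF-16 as the surrogate pair D800 DC00.
pair = "\U00010000".encode("utf-16-le")
lead, trail = pair[:2], pair[2:]

# Two separate readers each see half of the pair.
split_result = transcode_units(lead) + transcode_units(trail)

# A single reader that sees both code units gets the real character.
whole_result = transcode_units(pair)

print(split_result)  # two copies of U+FFFD encoded as UTF-8
print(whole_result)  # the 4-byte UTF-8 form of U+10000
```

Each half on its own becomes U+FFFD, while the intact pair transcodes to
a single 4-byte UTF-8 sequence.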

In the general case in which a lead surrogate is the last code unit
read, but not at index 0, it can use the internal buffer to save that
code unit for the next call.
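
A minimal sketch of that carry-over, again in Python rather than the
actual C patch. The class name, the _pending attribute and the
chunk-of-bytes interface are all made up for illustration, and chunks
are assumed to hold whole UTF-16-LE code units.

```python
class Utf16ToUtf8:
    """Transcode UTF-16-LE chunks to UTF-8, holding back a trailing lead surrogate."""

    def __init__(self):
        self._pending = b""  # saved lead surrogate (one code unit), if any

    def transcode(self, chunk: bytes) -> bytes:
        data = self._pending + chunk
        self._pending = b""
        if len(data) >= 2:
            last = int.from_bytes(data[-2:], "little")
            if 0xD800 <= last <= 0xDBFF:
                # Lead surrogate at the end: keep it for the next call.
                self._pending = data[-2:]
                data = data[:-2]
        text = data.decode("utf-16-le", errors="surrogatepass")
        return text.encode("utf-8", errors="surrogatepass")


units = "a\U00010000b".encode("utf-16-le")  # 61 00 | 00 D8 | 00 DC | 62 00
conv = Utf16ToUtf8()
first = conv.transcode(units[:4])   # 'a' plus the lead surrogate
second = conv.transcode(units[4:])  # the trail surrogate plus 'b'
print(first, second)
```

This sketch returns nothing when the lead surrogate is the only code
unit available, which a real raw read could not do without signalling
EOF. That is the index 0 case, and it is the reason for reading at least
2 code units up front.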

Surrogates that aren't in valid pairs should be allowed to pass
through via the surrogatepass error handler. This aims for consistency
with the filesystem encoding PEP (PEP 529).
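
For reference, this is how the surrogatepass error handler treats a lone
surrogate with Python's own UTF-8 codec. It only shows the codec
behaviour, not the console reader itself.

```python
lone_lead = "\ud800"  # a lead surrogate with no trail surrogate

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone_lead.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes it as a 3-byte sequence and decodes it back unchanged.
passed = lone_lead.encode("utf-8", errors="surrogatepass")
round_trip = passed.decode("utf-8", errors="surrogatepass")

print(strict_ok, passed, round_trip == lone_lead)
```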
