2016-08-16 17:56 GMT+02:00 Steve Dower <steve.do...@python.org>:
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
> it is". We know exactly what the encoding is on every supported version of
> Windows. UTF-16.

I think that you missed an important issue (or "use case"), which is
called the "Makefile problem" by Mercurial developers:
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

I already explained it before, but maybe you misunderstood or just
missed it, so here is a more concrete example.

A runner.py script produces a bytes filename and sends it to a second
read_file.py script through stdin/stdout. The read_file.py script
opens the file using open(filename). The read_file.py script is run by
Python 2, which works naturally on bytes. The question is how
runner.py should produce (encode) the filename.

runner.py (script run by Python 3.7):
---
import os, sys, subprocess, tempfile

filename = 'h\xe9.txt'
content = b'foo bar'
print("filename unicode: %a" % filename)

root = os.path.realpath(os.path.dirname(__file__))
script = os.path.join(root, 'read_file.py')

old_cwd = os.getcwd()

with tempfile.TemporaryDirectory() as tmpdir:
    os.chdir(tmpdir)
    with open(filename, 'wb') as fp:
        fp.write(content)

    filenameb = os.listdir(b'.')[0]
    # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
    # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
    print("filename bytes: %a" % filenameb)

    proc = subprocess.Popen(['py', '-2', script],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    stdout = proc.communicate(filenameb)[0]
    print("File content: %a" % stdout)

    os.chdir(old_cwd)
---

read_file.py (run by Python 2):
---
import sys
filename = sys.stdin.read()
# Python 2 calls the Windows C open() function
# which expects a filename encoded to the ANSI code page
with open(filename) as fp:
    content = fp.read()
sys.stdout.write(content)
sys.stdout.flush()
---

read_file.py only works if the non-ASCII filename is encoded to the
ANSI code page.
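
To make the mismatch concrete, here is a small sketch of the bytes that
each candidate encoding produces for the filename above. It assumes
cp1252 as the ANSI code page (the real value depends on the system
locale) so that it also runs outside Windows, where the 'mbcs' codec is
not available. The variable names are only illustrative.

```python
filename = 'h\xe9.txt'

# Assumption: the ANSI code page is cp1252. On Windows, the 'mbcs'
# codec would select the real ANSI code page instead.
ansi_bytes = filename.encode('cp1252')
utf8_bytes = filename.encode('utf-8')

# What a bytes-oriented consumer gets if it interprets the UTF-8
# bytes as ANSI: a different (wrong) filename.
misread = utf8_bytes.decode('cp1252')

print("ANSI bytes:  %a" % ansi_bytes)
print("UTF-8 bytes: %a" % utf8_bytes)
print("UTF-8 bytes read as ANSI: %a" % misread)
```

The two byte strings differ, so a consumer that expects ANSI bytes
would look for a file that does not exist if it were given UTF-8 bytes.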

The question is how you expect developers to handle such a use case.

For example, are developers responsible for transcoding communicate()
data (inputs and outputs) manually?
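
As a rough sketch of what such manual transcoding could look like,
assuming the producer emits UTF-8 and the consumer expects cp1252 as its
ANSI code page (both are assumptions; transcode_filename() is a
hypothetical helper, not an existing API):

```python
def transcode_filename(data, src='utf-8', dst='cp1252'):
    """Re-encode a bytes filename from src to dst.

    Both default encodings are assumptions made for illustration.
    Raises UnicodeEncodeError if a character has no representation
    in the destination encoding.
    """
    return data.decode(src).encode(dst)


utf8_name = 'h\xe9.txt'.encode('utf-8')
ansi_name = transcode_filename(utf8_name)
print("transcoded: %a" % ansi_name)
```

Every caller of communicate() would need to do something like this on
input and the reverse on output, and it fails for any filename that the
ANSI code page cannot represent.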

That's why I keep repeating that the ANSI code page is the best *default*
encoding: it is the encoding expected by other applications.

I know that the ANSI code page is usually limited and causes various
painful issues when handling non-ASCII data, but it's the status quo
if you really want to handle data as bytes...

Sorry, I didn't read all the emails in this long thread, so maybe I missed
your answer to this issue.

Victor
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
