I just want to clearly address two points, since I feel like multiple posts have been unclear on them.

1. The bytes API for paths was deprecated in 3.3, and the deprecation is listed in https://docs.python.org/3/whatsnew/3.3.html. The lack of a mention in the docs is an unfortunate oversight, but the deprecation was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.

2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.

This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"

The choices are:

* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
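
To make the conversion choices above concrete, here is a rough sketch using Python's codecs directly rather than the real path APIs. It makes two assumptions that are not from the list above: cp1252 stands in for the "active code page" (the real ACP varies by machine, and its conversion may substitute characters differently), and UTF-8 is used as one example of an encoding that fails only on invalid surrogates.

```python
# Illustrative only: plain str/bytes conversions, not the actual os-level behaviour.
# Assumption: cp1252 stands in for the legacy active code page.
name = "data_\u4e2d.txt"  # contains a character that cp1252 cannot represent

# "convert and drop": the unrepresentable character is silently replaced
lossy = name.encode("cp1252", errors="replace")

# "convert and fail": the same conversion raises instead of losing data
try:
    name.encode("cp1252", errors="strict")
    acp_failed = False
except UnicodeEncodeError:
    acp_failed = True

# "fail only on invalid surrogate pairs" (assumption: UTF-8 as the encoding):
# every valid string round-trips, a lone surrogate does not encode
utf8 = name.encode("utf-8")
round_tripped = utf8.decode("utf-8")
try:
    "\ud800".encode("utf-8")
    lone_surrogate_failed = False
except UnicodeEncodeError:
    lone_surrogate_failed = True

# "UTF-16-LE in bytes": lossless, but ASCII characters carry embedded NUL bytes
utf16 = name.encode("utf-16-le")
has_embedded_nul = b"\x00" in utf16
```

The lossy result no longer identifies the original file, while the UTF-16-LE result cannot be passed to anything that treats a NUL byte as a terminator.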

Currently we have the second option.

My preference is the fourth option, as it will cause the least breakage of existing code and let the most code just work in the presence of non-ACP characters.

The fifth option is the best for round-tripping within Windows APIs.

The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.

If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?

Cheers,
Steve
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/