STINNER Victor <[email protected]> added the comment:
I'm not sure that I understand your issue. There are 3 ways to enable the UTF-8
Mode:
* if the LC_CTYPE locale is "C" or "POSIX"
* if PYTHONUTF8 env var is equal to "1"
* using -X utf8 or -X utf8=1 command line option
The first 2 cases are fine if the locale encoding is gb18030.
For the command line option, Python first decodes the command line from
gb18030. If -X utf8 is present, the command line is decoded again from UTF-8
(and the old configuration is removed, to parse the new configuration).
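The activation paths listed above can be checked from a script. This is only a minimal sketch: the helper name utf8_mode_flag is hypothetical, and it assumes Python 3.7 or newer, where sys.flags.utf8_mode exists. The LC_ALL=C result depends on the platform, so no expected value is given for it.

```python
import os
import subprocess
import sys


def utf8_mode_flag(extra_args=(), extra_env=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as a string.

    Hypothetical helper for illustration; not part of CPython.
    """
    env = dict(os.environ)
    # Start from a known state so an inherited PYTHONUTF8 does not skew the result.
    env.pop("PYTHONUTF8", None)
    if extra_env:
        env.update(extra_env)
    cmd = [sys.executable, *extra_args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("ascii").strip()


# Command line option.
print("-X utf8     :", utf8_mode_flag(extra_args=["-X", "utf8"]))
# Environment variable.
print("PYTHONUTF8=1:", utf8_mode_flag(extra_env={"PYTHONUTF8": "1"}))
# "C" locale: platform dependent (POSIX only), so the value is just printed.
print("LC_ALL=C    :", utf8_mode_flag(extra_env={"LC_ALL": "C"}))
```

Running the child with sys.executable keeps the check tied to the same interpreter build as the parent process.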
I understand that your question is whether decoding the command line arguments
from gb18030 can miss -X utf8 or enable the UTF-8 Mode by mistake.
It seems like gb18030 encodes the "-X utf8" text the same way as ASCII:
>>> "-X utf8".encode("gb18030")
b'-X utf8'
>>> b'-X utf8'.decode("gb18030")
'-X utf8'
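The round trip above only covers one string. A broader check, sketched here under the assumption that Python's built-in "gb18030" codec is representative, is to confirm that every ASCII byte decodes to itself and that the "-" byte never shows up inside the encoding of a non-ASCII character. Other ASCII-range bytes such as "X" can appear as trail bytes, so it is the leading "-" that keeps the option unambiguous. Both helper names are hypothetical.

```python
import sys


def ascii_is_preserved(encoding="gb18030"):
    """Return True if bytes 0x00-0x7F decode to the same characters as ASCII."""
    ascii_bytes = bytes(range(128))
    return ascii_bytes.decode(encoding) == ascii_bytes.decode("ascii")


def byte_appears_in_multibyte(byte_value, encoding="gb18030"):
    """Return True if byte_value occurs in the encoding of any non-ASCII character."""
    for code_point in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # lone surrogates cannot be encoded
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # skip anything this codec version cannot encode
        if byte_value in encoded:
            return True
    return False


print("ASCII preserved:", ascii_is_preserved())
print("'-' inside a multibyte sequence:", byte_appears_in_multibyte(ord("-")))
print("'X' inside a multibyte sequence:", byte_appears_in_multibyte(ord("X")))
```

The scan walks every code point, so it takes a moment to run. It says nothing about invalid byte sequences, which the decoder handles separately.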
I'm aware of mojibake causing a security issue, but it was for a function
checking for a single byte, not a substring:
https://unicodebook.readthedocs.io/issues.html#check-byte-strings-before-decoding-them-to-character-strings
I don't know gb18030 well, so maybe I missed something. To me, using gb18030
with the UTF-8 Mode doesn't seem to cause any issue when decoding the command
line arguments.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34914>
_______________________________________