Michael Felt <aixto...@felt.demon.nl> added the comment:
On 23/08/2018 12:51, STINNER Victor wrote:
> STINNER Victor <vstin...@redhat.com> added the comment:
>
> Your issue is about decoding command line argument which is done from main()
> function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

This is beyond my understanding atm.

Early on I tried making the expected value just 'arg' and went from situation A to situation B, which looked much closer, BUT the 'types' differed:

Situation A (original)

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

I tried saying the "expected" is arg, but arg is still a bytes object, while the cmd_line result is not (printed as such).

Situation B

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

After further digging, to understand why it was coming out as "\x" encoding rather than "\udc", I looked at what was happening here:

out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

And finally, at the CLI becomes:

['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

Note:

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

Summary:
a) I am concerned about how b'h....' becomes 'bh....'
b) Whatever argv[1] is, it is very close to what is returned. So whatever happens during the transformation from self.get_output('-X', utf8_opt, '-c', code, arg, **kw) determines the output and the (failed) comparison.

>> Question 1: why is windows excluded? Because it does not use UTF-8 as it's
>> default (it's default is CP1252)
> Windows uses wmain() which gets command line arguments as wchar_t* strings:
> Unicode. No decoding is needed.
>
> ----------
>
> _______________________________________
> Python tracker <rep...@bugs.python.org>
> <https://bugs.python.org/issue34347>
> _______________________________________
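
The two forms seen in situation A, and the 'bh....' result from point (a), can be reproduced in isolation. What follows is a minimal sketch, not the code the test suite or the interpreter runs. It assumes the "\udc" form comes from an ASCII decode with the surrogateescape error handler and the "\x" form from an ISO8859-1 decode of the same bytes. It also uses shlex.split() only as a stand-in for POSIX shell word splitting, to show that b'...' typed at a prompt is the letter b joined to a single-quoted string in which the backslashes stay literal.

```python
import shlex

# The test argument: 'h', e-acute, euro sign, encoded as UTF-8.
raw = 'h\xe9\u20ac'.encode('utf-8')   # b'h\xc3\xa9\xe2\x82\xac'

# Assumption: the expected "\udc" form is what an ASCII decode with
# surrogateescape gives: each undecodable byte 0xNN becomes U+DCNN.
as_surrogates = raw.decode('ascii', 'surrogateescape')

# Assumption: the "\x" form is an ISO8859-1 decode, where every byte
# maps to the code point with the same value, so nothing is undecodable.
as_latin1 = raw.decode('iso8859-1')

print(ascii(as_surrogates))   # 'h\udcc3\udca9\udce2\udc82\udcac'
print(ascii(as_latin1))       # 'h\xc3\xa9\xe2\x82\xac'

# Point (a): the text b'h\xc3...' as typed at a shell prompt. shlex.split()
# approximates POSIX word splitting: the b is an ordinary character and the
# quoted part keeps its backslashes, so no bytes object ever exists.
typed = r"b'h\xc3\xa9\xe2\x82\xac'"
argv = shlex.split(typed)

print(ascii(argv))            # ['bh\\xc3\\xa9\\xe2\\x82\\xac']
```

If those assumptions hold, the hand-typed commands pass different data than get_output() does, so their 'bh....' output says nothing about how the real bytes argument is decoded.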