[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order
New submission from Bill Fenner fen...@gmail.com:

In Python 2.5, shlex handled unicode input fine:

    Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
    [GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import shlex
    >>> shlex.split( u'Hello, World!' )
    ['Hello,', 'World!']

In Python 2.6, shlex turns unicode input into UCS-4 output, thus utterly confusing execl:

    Python 2.6 (r26:66714, Jun 8 2009, 16:07:29)
    [GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import shlex
    >>> shlex.split( u'Hello, World' )
    ['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

Even weirder, the two returned strings have different byte order (see 'H\x00\x00\x00' vs. '\x00\x00\x00W'!)

--
components: Library (Lib)
messages: 93074
nosy: fenner
severity: normal
status: open
title: shlex.split() converts unicode input to UCS-4 output with varying byte order
versions: Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue6988
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Bill Fenner fen...@gmail.com added the comment:

A colleague pointed out that the bad behavior was introduced in 2.5.2:

    Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03)
    [GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import shlex
    >>> shlex.split( u'Hello, World!' )
    ['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']

--
versions: +Python 2.5
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

I'll take the opposite point of view: the bad behavior was introduced with 2.5.1 (issue1548891, r52302), and reverted for 2.5.2 because it broke backwards compatibility with arbitrary read buffers (issue1730114, r53831).

The difference is in cStringIO:

    >>> from cStringIO import StringIO
    >>> StringIO(u'Hello, World!').read()
    'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'

The byte order is not different in the two strings; rather, u' ' becomes ' \x00\x00\x00', and the three zero bytes are copied into the second item.

--
nosy: +amaury.forgeotdarc
resolution: -> wont fix
status: open -> pending
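The effect described in this comment can be imitated in a few lines. This is only a sketch in Python 3 syntax, not the cStringIO code itself: it assumes a UCS-4, little-endian build (as in the sessions in this thread), models the raw buffer with the 'utf-32-le' codec, and uses a plain split on the space byte as a stand-in for shlex's tokenizing.

```python
# Model the raw buffer that cStringIO exposed on a UCS-4, little-endian
# build: four bytes per character, least significant byte first, no BOM.
raw = u'Hello, World!'.encode('utf-32-le')

# Stand-in for shlex: split the byte string on the single ASCII space byte.
# The space character occupies four bytes (b' \x00\x00\x00'), so only its
# first byte is consumed as a delimiter.
first, second = raw.split(b' ')

print(repr(first))   # starts with b'H\x00\x00\x00'
print(repr(second))  # starts with the three leftover zero bytes, then b'W'
```

The second token begins with the three zero bytes that followed the space, which is why the two strings look as if they had different byte order.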
Bill Fenner fen...@gmail.com added the comment:

So, just to be clear, your position is that the output of shlex.split( u'Hello, World!' ) is *supposed* to be the following?

    ['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00', '\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

--
status: pending -> open
Antoine Pitrou pit...@free.fr added the comment:

Hm, while the StringIO behaviour supposedly cannot be changed for backwards-compatibility reasons, we can probably improve shlex behaviour with unicode strings.

--
nosy: +pitrou
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:

(Presented this way, my opinion becomes difficult to defend... OTOH the docs say that the module does not support Unicode, so it's not strictly a bug.)
http://docs.python.org/library/shlex.html

Yes, shlex could be improved to encode unicode strings to ASCII.

--
resolution: wont fix ->
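As a small illustration of what encoding to ASCII would mean in practice, here is a sketch in Python 3 syntax. The helper name to_ascii_bytes is hypothetical and not part of shlex; it only shows that ASCII-only text encodes cleanly while any other character raises UnicodeEncodeError.

```python
def to_ascii_bytes(text):
    """Hypothetical helper: encode text as ASCII bytes.

    Raises UnicodeEncodeError if text contains a non-ASCII character.
    """
    return text.encode('ascii')


# ASCII-only input is encoded without trouble.
print(to_ascii_bytes(u'Hello, World!'))

# Input with a non-ASCII character is rejected rather than mangled.
try:
    to_ascii_bytes(u'caf\u00e9')
except UnicodeEncodeError as exc:
    print('not representable in ASCII:', exc.reason)
```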
Marc-Andre Lemburg m...@egenix.com added the comment:

Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>
> (Presented this way, my opinion becomes difficult to stand... OTOH the docs say that the module does not support Unicode, so it's not strictly a bug)
> http://docs.python.org/library/shlex.html
>
> Yes, shlex could be improved and encode unicode strings to ascii.

I'd suggest converting Unicode input to a string using an optional encoding parameter which defaults to 'utf-8' (most shells nowadays default to UTF-8).

This is only a compromise, though, albeit a practical one. POSIX has the notion of a portable character set:

http://www.opengroup.org/onlinepubs/95399/basedefs/xbd_chap06.html#tagtcjh_3

which is pretty much the same as ASCII. Any ASCII-compatible encoding is then allowed via variable-length encodings (see further down on that page).

--
nosy: +lemburg
title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output with varying byte order
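A rough sketch of this suggestion, written as a wrapper rather than as a change to shlex itself. The name split_text and its encoding parameter are hypothetical and only illustrate the idea. On Python 2 the wrapper encodes unicode input before handing it to shlex.split(); on Python 3, where shlex works on text directly, the encoding argument is unused.

```python
import shlex
import sys


def split_text(text, encoding='utf-8'):
    """Hypothetical wrapper around shlex.split(), not part of the module.

    On Python 2, unicode input is first encoded to a byte string in an
    ASCII-compatible encoding, so the result is a list of byte strings.
    On Python 3, the text is passed through unchanged.
    """
    # The name `unicode` is only evaluated on Python 2 (short-circuit).
    if sys.version_info[0] == 2 and isinstance(text, unicode):
        return shlex.split(text.encode(encoding))
    return shlex.split(text)


print(split_text(u'Hello, World!'))
```

Defaulting to UTF-8 keeps the ASCII whitespace and quote characters as single bytes, which is what lets a byte-oriented tokenizer find them reliably.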