[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Bill Fenner

New submission from Bill Fenner <fen...@gmail.com>:

In Python 2.5, shlex handled unicode input fine:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) 
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']

In Python 2.6, shlex turns unicode input into UCS-4 output, which utterly
confuses execl:

Python 2.6 (r26:66714, Jun  8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

Even weirder, the two returned strings have different byte orders (see
'H\x00\x00\x00' vs. '\x00\x00\x00W')!

--
components: Library (Lib)
messages: 93074
nosy: fenner
severity: normal
status: open
title: shlex.split() converts unicode input to UCS-4 output with varying byte order
versions: Python 2.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6988>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Bill Fenner

Bill Fenner <fen...@gmail.com> added the comment:

A colleague pointed out that the bad behavior was introduced in 2.5.2:

Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03) 
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']

--
versions: +Python 2.5




[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment:

I'll take the opposite point of view:
the bad behavior was introduced with 2.5.1 (issue1548891, r52302), and
reverted for 2.5.2 because it broke backwards compatibility with
arbitrary read buffers (issue1730114, r53831)

The difference is in cStringIO:

>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'

The byte order is not different in the two strings: but u" " becomes
" \x00\x00\x00" and the three zeros are copied into the second item.
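
As a rough illustration of the point above, the following Python 3 sketch reproduces the same byte pattern without cStringIO (which only exists in Python 2). It assumes that a UCS-4 build on a little-endian machine lays out each character the way the "utf-32-le" codec does; the variable names are made up for this example.

```python
# Python 3 sketch; it only mimics what the raw cStringIO buffer looks like.
text = "Hello, World!"

# Assumption: a UCS-4, little-endian build stores every character as
# 4 bytes, which is the layout produced by the "utf-32-le" codec.
raw = text.encode("utf-32-le")

# Splitting on the single space byte leaves the space's three trailing
# zero bytes at the front of the second token. The byte order of both
# tokens is the same; the second one is merely shifted by three bytes.
tokens = raw.split(b" ")
print(tokens)

# Decoding the whole buffer, rather than each token, round-trips cleanly.
assert raw.decode("utf-32-le") == text
```

The second token is not a valid UTF-32 string on its own (its length is not a multiple of four), which is why it looks as if the byte order had flipped.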

--
nosy: +amaury.forgeotdarc
resolution:  -> wont fix
status: open -> pending




[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Bill Fenner

Bill Fenner <fen...@gmail.com> added the comment:

So, just to be clear, your position is that the output of shlex.split(
u'Hello, World!' ) is *supposed* to be
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']?

--
status: pending -> open




[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Antoine Pitrou

Antoine Pitrou <pit...@free.fr> added the comment:

Hm, while the StringIO behaviour supposedly cannot be changed for
backwards-compatibility reasons, we can probably improve shlex behaviour
with unicode strings.

--
nosy: +pitrou




[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment:

(Presented this way, my opinion becomes difficult to stand...
OTOH the docs say that the module does not support Unicode, so it's not
strictly a bug)
http://docs.python.org/library/shlex.html

Yes, shlex could be improved and encode unicode strings to ascii.

--
resolution: wont fix -> 




[issue6988] shlex.split() converts unicode input to UCS-4 output with varying byte order

2009-09-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg <m...@egenix.com> added the comment:

Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amaur...@gmail.com> added the comment:
> 
> (Presented this way, my opinion becomes difficult to stand...
> OTOH the docs say that the module does not support Unicode, so it's not
> strictly a bug)
> http://docs.python.org/library/shlex.html
> 
> Yes, shlex could be improved and encode unicode strings to ascii.

I'd suggest converting Unicode input to a string using an
optional encoding parameter which defaults to 'utf-8' (most
shells nowadays default to UTF-8).

This is only a compromise, though, albeit a practical one.
POSIX has the notion of a portable character set:

http://www.opengroup.org/onlinepubs/95399/basedefs/xbd_chap06.html#tagtcjh_3

which is pretty much the same as ASCII. Any ASCII compatible
encoding is then allowed via variable length encodings (see
further down on that page).
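
A minimal sketch of what such a conversion might look like, written as a wrapper rather than a change to shlex itself. The helper name `split_text` and its `encoding` parameter are hypothetical and only mirror the suggestion above; nothing here is an existing shlex API. The code is meant to run on both Python 2 and Python 3, where shlex works on byte strings and on str respectively.

```python
import shlex
import sys

PY2 = sys.version_info[0] == 2


def split_text(text, encoding="utf-8"):
    """Hypothetical helper (not part of shlex): split text of either type."""
    if PY2:
        # Python 2: shlex works on byte strings, so encode unicode input
        # with an ASCII-compatible encoding before splitting.
        if not isinstance(text, str):
            text = text.encode(encoding)
        return shlex.split(text)
    # Python 3: shlex works on str, so decode byte input instead.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return shlex.split(text)


print(split_text(u"Hello, World!"))
```

Because UTF-8 never uses ASCII byte values inside a multi-byte sequence, splitting the encoded form on ASCII whitespace should not cut a character in half, unlike the UCS-4 buffer discussed earlier in the thread.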

--
nosy: +lemburg
title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output with varying byte order
