Re: sys.argv as a list of bytes

2012-01-19 Thread jmfauth

> In short: if you need to write system scripts on Unix, and you need them
> to work reliably, you need to stick with Python 2.x.


I think understanding how the characters are encoded helps a bit.

I cannot see why the example below could not be
done on other systems.

D:\tmp>chcp
Active code page: 1252

D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
arg: 1   unicode name: LATIN SMALL LETTER A
arg: 2   unicode name: LATIN SMALL LETTER B
arg: 3   unicode name: LATIN SMALL LETTER E WITH ACUTE
arg: 4   unicode name: EURO SIGN
arg: 5   unicode name: CYRILLIC SMALL LETTER A
arg: 6   unicode name: GREEK SMALL LETTER ALPHA
arg: 7   unicode name: LATIN SMALL LETTER Z
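
The script sysarg.py itself is not shown in the thread. A minimal sketch that would produce output in this shape, assuming each argument is looked up character by character with unicodedata.name (the function name describe_args is hypothetical), might look like this:

```python
import sys
import unicodedata


def describe_args(args):
    """Return one line per argument, naming each character it contains."""
    lines = []
    for index, arg in enumerate(args, start=1):
        # unicodedata.name() raises ValueError for unnamed code points
        # unless a default is given, so supply one.
        names = ", ".join(unicodedata.name(ch, "<unnamed>") for ch in arg)
        lines.append("arg: {}   unicode name: {}".format(index, names))
    return lines


if __name__ == "__main__":
    for line in describe_args(sys.argv[1:]):
        print(line)
```

This only shows what Python received in sys.argv; it says nothing about which encoding was used to decode the command line.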

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sys.argv as a list of bytes

2012-01-18 Thread Peter Otten
Olive wrote:

> In Unix the operating system passes arguments as a list of C strings. But
> C strings correspond to the bytes notion of Python 3. Is it
> possible to have sys.argv as a list of bytes? What happens if I pass
> to a program an argument containing funny characters, for example
> (with a bash shell)?
>
> python -i ./test.py $'\x01'$'\x05'$'\xFF'

Python has a special error handler, surrogateescape, to deal with bytes that
are not valid UTF-8. If you try to print such a string you get an error:

$ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
'\x01\x05\udcff'
$ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed

It is still possible to get the original bytes:

$ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
b'\x01\x05\xff'
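
The same round trip can be tried without a shell. The following is a small sketch, assuming the argument bytes were decoded as UTF-8 with the surrogateescape handler, as in the session above; the helper name argv_bytes_roundtrip is made up for illustration:

```python
def argv_bytes_roundtrip(raw):
    """Decode bytes with surrogateescape, then recover the original bytes."""
    # Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
    text = raw.decode("utf-8", "surrogateescape")
    # Encoding with the same handler turns those surrogates back into bytes.
    restored = text.encode("utf-8", "surrogateescape")
    return text, restored


text, restored = argv_bytes_roundtrip(b"\x01\x05\xff")
print(repr(text))
print(restored)

# A strict encode (the default) refuses lone surrogates, which is why
# a plain print() of the argument fails on a UTF-8 terminal.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)
```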




Re: sys.argv as a list of bytes

2012-01-18 Thread Olive
On Wed, 18 Jan 2012 09:05:42 +0100
Peter Otten <__pete...@web.de> wrote:

> Olive wrote:
>
> > In Unix the operating system passes arguments as a list of C strings.
> > But C strings correspond to the bytes notion of Python 3. Is
> > it possible to have sys.argv as a list of bytes? What happens if I
> > pass to a program an argument containing funny characters, for
> > example (with a bash shell)?
> >
> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
>
> Python has a special error handler, surrogateescape, to deal with
> bytes that are not valid UTF-8. If you try to print such a string you
> get an error:
>
> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> '\x01\x05\udcff'
> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
>
> It is still possible to get the original bytes:
>
> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> b'\x01\x05\xff'

But is it safe even if the locale is not UTF-8? I would like to be able
to pass a file name to a script. I can use bytes for file names in the
open function. If I keep the file name as bytes everywhere, it will work
reliably whatever the locale and whatever strange characters the file
name may contain.

Olive



Re: sys.argv as a list of bytes

2012-01-18 Thread Peter Otten
Olive wrote:

> On Wed, 18 Jan 2012 09:05:42 +0100
> Peter Otten <__pete...@web.de> wrote:
>
> > Olive wrote:
> >
> > > In Unix the operating system passes arguments as a list of C strings.
> > > But C strings correspond to the bytes notion of Python 3. Is
> > > it possible to have sys.argv as a list of bytes? What happens if I
> > > pass to a program an argument containing funny characters, for
> > > example (with a bash shell)?
> > >
> > > python -i ./test.py $'\x01'$'\x05'$'\xFF'
> >
> > Python has a special error handler, surrogateescape, to deal with
> > bytes that are not valid UTF-8. If you try to print such a string you
> > get an error:
> >
> > $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> > '\x01\x05\udcff'
> > $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
> >
> > It is still possible to get the original bytes:
> >
> > $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> > b'\x01\x05\xff'
>
> But is it safe even if the locale is not UTF-8? I would like to be able
> to pass a file name to a script. I can use bytes for file names in the
> open function. If I keep the file name as bytes everywhere, it will work
> reliably whatever the locale and whatever strange characters the file
> name may contain.

I believe you need not convert back to bytes explicitly; you can open the
file with open(sys.argv[i]). I don't know if there are corner cases where
that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you
figure it out.
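
As a rough sketch of both options: the str from sys.argv can be passed to open() directly, or converted to bytes with os.fsencode(), which on POSIX uses the filesystem encoding with the surrogateescape handler. This assumes sys.argv was decoded with that same encoding, which is exactly the corner case in question. The helper names are hypothetical:

```python
import os


def argv_to_bytes(arg):
    """Convert a sys.argv entry back to bytes for use as a file name."""
    # os.fsencode() applies the filesystem encoding; on POSIX it uses
    # surrogateescape, so escaped bytes are restored unchanged.
    return os.fsencode(arg)


def read_file_named_by_arg(arg):
    """Open the file named by a sys.argv entry without converting it."""
    # open() accepts the str as-is and encodes it internally.
    with open(arg, "rb") as handle:
        return handle.read()
```

A script would call these as argv_to_bytes(sys.argv[1]) or read_file_named_by_arg(sys.argv[1]).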



Re: sys.argv as a list of bytes

2012-01-18 Thread Nobody
On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> Python has a special error handler, surrogateescape, to deal with
> bytes that are not valid UTF-8.

On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

> But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
mechanism is used to represent anything which cannot be decoded according
to the locale's encoding. E.g. in the C locale, any byte >= 128 will be
represented as a surrogate.
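
To illustrate, here is a sketch that stands in for the C locale by using Python's "ascii" codec (an assumption: the real decoding goes through the C library, not this codec). Each byte 0xNN at or above 0x80 maps to the lone surrogate U+DCNN:

```python
def decode_like_c_locale(raw):
    """Decode bytes as ASCII, escaping every byte >= 128 as a surrogate."""
    return raw.decode("ascii", "surrogateescape")


decoded = decode_like_c_locale(b"caf\xe9")
# The byte 0xE9 is not ASCII, so it becomes U+DCE9.
print(repr(decoded))

# Encoding with the same codec and handler restores the original bytes.
print(decoded.encode("ascii", "surrogateescape"))
```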

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> It is still possible to get the original bytes:
>
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except, it isn't. Because the Python devs can't make up their minds which
encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with
(the OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a
Python function to query which encoding this used (if there even _is_ a
corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a
library function needs to pass a Unicode string to an API function, it
will normally decode it using sys.getfilesystemencoding(), which isn't
guaranteed to be the encoding which was used to fabricate sys.argv in
the first place.
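
A quick way to see the two encodings Python does report is sketched below. Neither call says which encoding actually decoded sys.argv; the sketch only shows whether the filesystem encoding and the locale's preferred encoding agree on a given system. The helper names are hypothetical:

```python
import codecs
import locale
import sys


def canonical(name):
    """Normalise an encoding name so aliases compare equal."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # The C library may report an encoding Python has no codec for.
        return name.lower()


def reported_encodings():
    """Return the encodings Python reports for file names and the locale."""
    return {
        "filesystem": canonical(sys.getfilesystemencoding()),
        "locale": canonical(locale.getpreferredencoding(False)),
    }


if __name__ == "__main__":
    for label, name in sorted(reported_encodings().items()):
        print(label, name)
```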

In short: if you need to write system scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.



sys.argv as a list of bytes

2012-01-17 Thread Olive
In Unix the operating system passes arguments as a list of C strings. But
C strings correspond to the bytes notion of Python 3. Is it
possible to have sys.argv as a list of bytes? What happens if I pass
to a program an argument containing funny characters, for example
(with a bash shell)?

python -i ./test.py $'\x01'$'\x05'$'\xFF'

