Re: sys.argv as a list of bytes
> In short: if you need to write system scripts on Unix, and you need them
> to work reliably, you need to stick with Python 2.x.

I think understanding how the characters are encoded helps a bit. I cannot see why the example below could not be done on other systems.

D:\tmp>chcp
Page de codes active : 1252

D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
arg: 1 unicode name: LATIN SMALL LETTER A
arg: 2 unicode name: LATIN SMALL LETTER B
arg: 3 unicode name: LATIN SMALL LETTER E WITH ACUTE
arg: 4 unicode name: EURO SIGN
arg: 5 unicode name: CYRILLIC SMALL LETTER A
arg: 6 unicode name: GREEK SMALL LETTER ALPHA
arg: 7 unicode name: LATIN SMALL LETTER Z

jmf

-- http://mail.python.org/mailman/listinfo/python-list
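The sysarg.py script itself was not posted. As a rough guess at what such a script might look like, here is a minimal sketch; the function name describe_args and the handling of multi-character arguments are assumptions, not the poster's code.

```python
import unicodedata


def describe_args(args):
    """Return one report line per argument, numbered from 1 (hypothetical helper)."""
    lines = []
    for index, arg in enumerate(args, start=1):
        # unicodedata.name() accepts a single character, so name each one;
        # the second argument is returned for characters without a name.
        names = " + ".join(unicodedata.name(ch, "<unnamed>") for ch in arg)
        lines.append("arg: {} unicode name: {}".format(index, names))
    return lines


# Escapes keep this file ASCII-only; they stand for the characters typed
# on the command line in the session above. A real script would pass
# sys.argv[1:] instead of this sample list.
sample = ["a", "b", "\u00e9", "\u20ac", "\u0430", "\u03b1", "z"]
for line in describe_args(sample):
    print(line)
```

This only shows that, once the arguments arrive as str, looking up their Unicode names is straightforward. It says nothing about how each platform decodes the command line in the first place.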
Re: sys.argv as a list of bytes
Olive wrote:

> In Unix the operating system passes arguments as a list of C strings. But
> C strings correspond to the bytes notion of Python 3. Is it possible to
> have sys.argv as a list of bytes? What happens if I pass to a program an
> argument containing funny characters, for example (with a bash shell)?
>
> python -i ./test.py $'\x01'$'\x05'$'\xFF'

Python has a special error handler, surrogateescape, to deal with bytes that are not valid UTF-8. If you try to print such a string you get an error:

$ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
'\x01\x05\udcff'
$ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed

It is still possible to get the original bytes:

$ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
b'\x01\x05\xff'
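The same round trip can be reproduced without a shell. The sketch below assumes, as the session above does, that UTF-8 is the encoding Python used to build sys.argv; the variable names are illustrative only.

```python
raw = b"\x01\x05\xff"  # the bytes bash passes for $'\x01'$'\x05'$'\xFF'

# Decoding: 0xFF is not valid UTF-8, so surrogateescape maps it to the lone
# surrogate U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")
print(repr(text))

# Encoding with the strict default fails; this is the error print() runs into.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored)
```

The round trip only holds if the same encoding and error handler are used in both directions.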
Re: sys.argv as a list of bytes
On Wed, 18 Jan 2012 09:05:42 +0100 Peter Otten __pete...@web.de wrote:

> Olive wrote:
>
> > In Unix the operating system passes arguments as a list of C strings.
> > But C strings correspond to the bytes notion of Python 3. Is it possible
> > to have sys.argv as a list of bytes? What happens if I pass to a program
> > an argument containing funny characters, for example (with a bash shell)?
> >
> > python -i ./test.py $'\x01'$'\x05'$'\xFF'
>
> Python has a special error handler, surrogateescape, to deal with bytes
> that are not valid UTF-8. If you try to print such a string you get an
> error:
>
> $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> '\x01\x05\udcff'
> $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
>
> It is still possible to get the original bytes:
>
> $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> b'\x01\x05\xff'

But is it safe even if the locale is not UTF-8? I would like to be able to pass a file name to a script. I can use bytes for file names in the open function. If I keep the file name as bytes everywhere, it will work reliably whatever the locale and whatever strange characters the file name contains.

Olive
Re: sys.argv as a list of bytes
Olive wrote:

> On Wed, 18 Jan 2012 09:05:42 +0100 Peter Otten __pete...@web.de wrote:
>
> > Olive wrote:
> >
> > > In Unix the operating system passes arguments as a list of C strings.
> > > But C strings correspond to the bytes notion of Python 3. Is it
> > > possible to have sys.argv as a list of bytes? What happens if I pass
> > > to a program an argument containing funny characters, for example
> > > (with a bash shell)?
> > >
> > > python -i ./test.py $'\x01'$'\x05'$'\xFF'
> >
> > Python has a special error handler, surrogateescape, to deal with bytes
> > that are not valid UTF-8. If you try to print such a string you get an
> > error:
> >
> > $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
> > '\x01\x05\udcff'
> > $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed
> >
> > It is still possible to get the original bytes:
> >
> > $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
> > b'\x01\x05\xff'
>
> But is it safe even if the locale is not UTF-8? I would like to be able to
> pass a file name to a script. I can use bytes for file names in the open
> function. If I keep the file name as bytes everywhere, it will work
> reliably whatever the locale and whatever strange characters the file name
> contains.

I believe you need not convert back to bytes explicitly; you can open the file with open(sys.argv[i]). I don't know if there are corner cases where that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you figure it out.
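If bytes are wanted anyway, one option is os.fsencode (available since Python 3.2), which applies the filesystem encoding with surrogateescape. The sketch below assumes a POSIX system and assumes the arguments were decoded with that same encoding, which a later message in this thread disputes; the helper name argv_as_bytes is made up for illustration.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return command-line arguments re-encoded to bytes (POSIX assumption)."""
    if argv is None:
        argv = sys.argv
    # os.fsencode applies the filesystem encoding with surrogateescape,
    # turning escaped surrogates back into the original bytes.
    return [os.fsencode(arg) for arg in argv]


# Simulate an argument as Python would present it for the raw name below.
raw_name = b"data-\xff.txt"
as_str = os.fsdecode(raw_name)
print(argv_as_bytes(["prog", as_str]))
```

In a script, either form can then be handed to open(): the str from sys.argv directly, or the bytes returned by the helper.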
Re: sys.argv as a list of bytes
On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> Python has a special error handler, surrogateescape, to deal with bytes
> that are not valid UTF-8.

On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

> But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape mechanism is used to represent anything which cannot be decoded according to the locale's encoding. E.g. in the C locale, any byte >= 128 will be encoded as a surrogate.

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> It is still possible to get the original bytes:
>
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except, it isn't. Because the Python devs can't make up their minds which encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with (the OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a Python function to query which encoding this used (if there even _is_ a corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a library function needs to pass a Unicode string to an API function, it will normally encode it using sys.getfilesystemencoding(), which isn't guaranteed to be the encoding which was used to fabricate sys.argv in the first place.

In short: if you need to write system scripts on Unix, and you need them to work reliably, you need to stick with Python 2.x.
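The C-locale case mentioned above can be illustrated in isolation. This sketch uses the "ascii" codec as a stand-in for the C locale's encoding, which is an assumption; it does not query what mbstowcs() actually used, and the function name is hypothetical.

```python
def decode_like_c_locale(raw):
    """Decode bytes as ASCII, escaping undecodable bytes as lone surrogates."""
    return raw.decode("ascii", "surrogateescape")


raw = bytes([0x41, 0x80, 0xE9, 0xFF])
text = decode_like_c_locale(raw)

# Each byte b >= 0x80 becomes the code point 0xDC00 + b (PEP 383).
for byte, char in zip(raw, text):
    print("byte 0x{:02X} -> U+{:04X}".format(byte, ord(char)))

# The mapping is reversible with the same codec and handler.
assert text.encode("ascii", "surrogateescape") == raw
```

Re-encoding with a different codec than the one used for decoding is where the mismatch described above would show up.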
sys.argv as a list of bytes
In Unix the operating system passes arguments as a list of C strings. But C strings correspond to the bytes notion of Python 3. Is it possible to have sys.argv as a list of bytes? What happens if I pass to a program an argument containing funny characters, for example (with a bash shell)?

python -i ./test.py $'\x01'$'\x05'$'\xFF'