On Thursday, 8 November 2012 19:32:14 UTC+1, Oscar Benjamin wrote:
> On 8 November 2012 15:05, <wxjmfa...@gmail.com> wrote:
> > On Thursday, 8 November 2012 15:07:23 UTC+1, Oscar Benjamin wrote:
> >> On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> >> > On 7 November 2012 23:51, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> >> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> >> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> >> >>> support unicode
> >> >>
> >> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> >> >> the OP since Python versions below 3.3 don't support cp65001, but I
> >> >> think it's important to point out that the Windows command line system
> >> >> (it is not unique to cmd) does in fact support Unicode.
> >> >
> >> > I have tried to use code page 65001 and it didn't work for me even if
> >> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> >> > support it.
> >>
> >> I stand corrected. I've just checked and codepage 65001 does work in
> >> cmd.exe (on this machine):
> >>
> >> O:\>chcp 65001
> >> Active code page: 65001
> >>
> >> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> >> abc-def
> >>
> >> O:\>Q:\tools\Python33\python -c print('\u03b1')
> >> α
> >>
> >> It would be a lot better though if it just worked straight away
> >> without me needing to set the code page (like the terminal in every
> >> other OS I use).
> >
> > It *WORKS* straight away. The problem is that
> > people do not wish to use unicode correctly
> > (eg. Mulder's example).
> > Read the point 1) and 4) in my previous post.
> >
> > Unicode and in general the coding of the characters
> > have nothing to do with the os's or programming languages.
>
> I don't know what you mean that it works "straight away".
>
> The default code page on my machine is cp850.
>
> O:\>chcp
> Active code page: 850
>
> cp850 doesn't understand utf-8. It just prints garbage:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> ╬▒
>
> Using the correct encoding doesn't help:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>
>
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> α
>
> Oscar

You are confusing two things: the coding of the characters, and the
set of characters (glyphs/graphemes) of a coding scheme. It is always
possible to encode a unicode string safely, but the target coding may
not contain every character.

Take a look at the output of this "special" interactive interpreter,
where the host coding (sys.stdout.encoding) can be changed on the fly.

>>> s = 'éléphant\u2013abc需'
>>> sys.stdout.encoding
'<unicode>'
>>> s
'éléphant–abc需'
>>>
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abc需'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>>
>>> sys.stdout.encoding = 'utf-8'
>>> s
'éléphant–abc需'
>>> s.encode('utf-8')
'éléphant–abc需'
>>>
>>> sys.stdout.encoding = 'utf-16-le'    <<<<<<<<<
>>> s
' é l é p h a n t a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abc需'    <<<<<<<<<<<

Some cheating here, due to the mail system; it really looks like this.

jmf
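
PS. For anyone without such a "special" interpreter, the distinction
between a coding and its character set can be checked in plain
Python 3. This is only a rough sketch: missing_chars is a hypothetical
helper written for this post, not a standard library function, and the
non-ASCII characters are written as escapes so the snippet does not
depend on the encoding of the source file or the console.

```python
# Same string as in the session above, written with escapes:
# e-acute, en dash (U+2013), oe ligature (U+0153), euro sign (U+20AC).
s = '\u00e9l\u00e9phant\u2013abc\u00e9\u0153\u20ac'


def missing_chars(text, codec):
    """Return the characters of text that codec cannot represent.

    Hypothetical helper for illustration only.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Report, per coding, which characters fall outside its character set.
# ascii() keeps the output printable whatever the console code page is.
for codec in ('cp1252', 'cp850', 'utf-8', 'utf-16-le'):
    print(codec, ascii(missing_chars(s, codec)))

# With the 'replace' error handler, encoding to cp850 no longer raises;
# each character missing from cp850 becomes '?'.
replaced = s.encode('cp850', 'replace').decode('cp850')
print(ascii(replaced))
```

The encoding step itself never depends on the console; only the
character set of the target coding decides whether a character
survives.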