I'm curious about something I've encountered while updating a very old Tk app (originally written in Python 1, but I've ported it to Python 2 as a first step towards getting it running on modern systems). The app downloads emails from a POP server and displays them. At the moment, the code is completely unaware of character encodings (which is something I plan to fix), and I have found that I don't understand what Python is doing when no character encoding is specified.
To demonstrate, I have written this short example program that displays a variety of UTF-8 characters to check whether they are decoded properly: ---- Example Code ---- import Tkinter as tk window = tk.Tk() mytext = """ \xc3\xa9 LATIN SMALL LETTER E WITH ACUTE \xc5\x99 LATIN SMALL LETTER R WITH CARON \xc4\xb1 LATIN SMALL LETTER DOTLESS I \xef\xac\x84 LATIN SMALL LIGATURE FFL \xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q \xc2\xbd VULGAR FRACTION ONE HALF \xe2\x82\xac EURO SIGN \xc2\xa5 YEN SIGN \xd0\x96 CYRILLIC CAPITAL LETTER ZHE \xea\xb8\x80 HANGUL SYLLABLE GEUL \xe0\xa4\x93 DEVANAGARI LETTER O \xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57 \xe2\x99\xa9 QUARTER NOTE \xf0\x9f\x90\x8d SNAKE \xf0\x9f\x92\x96 SPARKLING HEART """ mytext = mytext.decode(encoding="utf-8") greeting = tk.Label(text=mytext) greeting.pack() window.mainloop() ---- End Example Code ---- This works exactly as expected, with all the characters displaying correctly. However, if I comment out the line 'mytext = mytext.decode (encoding="utf-8")', the program still displays *almost* everything correctly. All of the characters appear correctly apart from the two four-byte emoji characters at the end, which instead display as four characters. For example, the "SNAKE" character actually displays as: U+00F0 LATIN SMALL LETTER ETH U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK U+FF90 HALFWIDTH KATAKANA LETTER MI U+FF8D HALFWIDTH KATAKANA LETTER HE What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii', but it's clearly not attempting to display the bytes as ASCII (or cp1252, or ISO-8859-1). How is it deciding on some sort of almost-but- not-quite UTF-8 decoding? I am using Python 2.7.18 on a Windows 10 system. If there's any other relevant information I should provide please let me know. Many thanks, Rayner -- https://mail.python.org/mailman/listinfo/python-list