On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:

> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <r...@panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό". By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
>
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.

Roy's post, which was sent via Usenet rather than email, doesn't declare an encoding. Since he's sending from a Mac, his software may believe that the entire universe understands the Mac Roman encoding, which makes a certain amount of sense since, if I recall correctly, the fi and fl ligatures originally appeared in early Mac fonts.

I'm going to give Roy the benefit of the doubt and assume he actually entered the fi ligature at his end. If his software was using Mac Roman, it would insert a single byte DE into the message:

py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'

But that's not what his post includes. The message actually includes two bytes, CF 8C, in other words:

'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'

Since nearly all of his post is in single bytes, it's some variable-width encoding, but not UTF-8, which would encode the fi ligature as the three bytes EF AC 81. With no encoding set, our newsreader software starts off assuming that the post uses UTF-8 ('cos that's the only sensible default), and those two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.

I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when the tools he uses are apparently so broken. But it isn't Unicode's fault, it's the tools.

The really bizarre thing is that apparently Roy's software, MT-NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE FFL (sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to the ASCII letters "ffl". That's astonishingly weird.

I suppose it is not entirely impossible that the software is actually being clever rather than dumb. Having correctly decoded the UTF-8 bytes, perhaps it realised that there was no glyph for the ligature, and rather than display a MISSING CHAR glyph (usually one of those empty boxes you sometimes see), it normalised it to ASCII. But if it's that clever, why the hell doesn't it set an encoding line in posts?

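For anyone who wants to check the byte values above without a Mac to hand, here is a small sketch using only the standard library. The variable names are mine, and using NFKC for the last step is purely an assumption: I have no idea what MT-NewsWatcher actually does internally, only that compatibility normalisation would produce the same visible result.

```python
import unicodedata

FI = "\N{LATIN SMALL LIGATURE FI}"    # U+FB01
FFL = "\N{LATIN SMALL LIGATURE FFL}"  # U+FB04

# Mac Roman has a single-byte slot for the fi ligature.
mac_roman_fi = FI.encode("mac_roman")

# UTF-8 needs three bytes for either ligature.
utf8_fi = FI.encode("utf-8")
utf8_ffl = FFL.encode("utf-8")

# The two bytes actually found in the post, read as if they were UTF-8.
mystery = b"\xcf\x8c"
decoded = mystery.decode("utf-8")
decoded_name = unicodedata.name(decoded)

# Assumption: compatibility normalisation (NFKC) stands in for whatever
# the newsreader really does when it folds the ligature to plain letters.
folded = unicodedata.normalize("NFKC", FFL)

print(mac_roman_fi, utf8_fi, utf8_ffl)
print(decoded_name)
print(folded)
```

None of this tells us which encoding produced CF 8C in the first place. It only confirms that it was neither Mac Roman nor UTF-8.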
--
Steven