On Thu, Jan 16, 2014 at 11:43 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> Worse, linguists sometimes disagree as to what counts as a grapheme. For
> instance, some authorities consider the English "sh" to be a separate
> grapheme. As a native English speaker, I'm not sure about that. Certainly
> it isn't a separate letter of the alphabet, but on the other hand I can't
> think of any words containing "sh" that should be considered as two
> graphemes "s" followed by "h". Wait, no, that's not true... compound
> words such as "glasshouse" or "disheartened" are counter examples.

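To make the counterexamples concrete, here is a rough sketch of what goes wrong if code treats every "s" followed by "h" as a single unit. The helper `naive_units` and its `digraphs` parameter are hypothetical names made up for illustration, not part of Python or any Unicode library; the only assumption is a greedy left-to-right match on spelling alone.

```python
# Hypothetical helper (not from any library): greedily treat each listed
# digraph as one unit, looking only at spelling, never at pronunciation.
def naive_units(word, digraphs=("sh",)):
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in digraphs:
            units.append(pair)   # two code points lumped into one unit
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

# Python (like Unicode) sees "sh" as two separate code points.
print(len("sh"))

# Fine for "ship", where "sh" really is one sound...
print(naive_units("ship"))

# ...but the compound words get an "sh" unit that isn't there in speech.
print(naive_units("glasshouse"))
print(naive_units("disheartened"))
```

The spelling-only rule cannot tell "ship" from "glasshouse"; distinguishing them would need a pronunciation dictionary, which is not something a character encoding can carry.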
Digression: When I was taught basic English during my school days, my mum used Spalding's book and the 70 phonograms. 25 of them are single letters (Q is not a phonogram - QU is), and the others are mostly pairs (there are a handful of 3- and 4-letter phonograms). Not every instance of "s" followed by "h" is the phonogram "sh" - only the ones where it makes the single sound "sh" (which it doesn't in "glasshouse" or "disheartened").

Thing is, you can't define spelling and pronunciation in terms of each other, because you'll always be bitten by corner cases. Everyone knows how "Thames" is pronounced... right? Well, no. There are (at least) two rivers of that name, the famous one in London [1] and another one further north [2]. The obscure one is pronounced the way the word looks, the famous one isn't. And don't even get started on English family names... Marjoribanks, Meux and Cholmondeley, as lampshaded [3] in this song [4]!

Even without names, though, there are the tricky cases and the ones where different localities pronounce the same word very differently; Unicode shouldn't have to deal with that by changing whether something's a single character or two. Considering that phonograms aren't even ligatures (though there is overlap, eg "Th"), it's much cleaner to leave them as multiple characters.

ChrisA

[1] https://en.wikipedia.org/wiki/River_Thames
[2] Though it's better known as the Isis. https://en.wikipedia.org/wiki/The_Isis
[3] http://tvtropes.org/pmwiki/pmwiki.php/Main/LampshadeHanging
[4] http://www.stagebeauty.net/plays/th-arca2.html - "Mosh-banks", "Mow", and "Chumley" are the pronunciations used