Re: unicode Digest V12 #108
From: Philippe Verdy verd...@wanadoo.fr
Date: Sat, 2 Jul 2011 15:59:18 +0200
Subject: Re: ch ligature in a monospace font

2011/7/1 Richard Wordingham richard.wording...@ntlworld.com:

I wonder if anyone has some statistics on the use of CGJ. Its revised intended use was to disrupt collating sequences, but you may be right about its most frequent use being to disrupt canonical reordering. A few years ago I concluded it wasn't yet safe to type the Welsh place name Llan͏gollen with CGJ.

Interestingly, I can't get this name to render correctly in my Chrome version on Windows 7; it just displays the occurrence of CGJ as a non-spacing dotted box, overwriting the surrounding characters n and g so that the place name is completely unreadable. I just wonder why Chrome needs to display this control in such a disruptive way (I have not checked with other browsers).

Why do you need CGJ between n and g?

- Is that to make sure that they won't collate as a single element ng but separately? How is it different here from the collation of language where the situation would be similar?

- Or do you intend to do the reverse, i.e. effectively collate ng in Llangollen as a single element?

Sorry, I don't know Welsh; all I know is that ng is a digram of its alphabet, which also includes n and g as separate letters... Other digrams are dd contrasting with isolated d, ff contrasting with isolated f, ll contrasting with isolated l, ph contrasting with isolated p and h, rh contrasting with isolated r and h, and finally th contrasting with isolated t and h.

The ng in Llangollen is not the digram ng but two separate letters (unlike the ll in the name which is the digram).
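As an editorial illustration of the "disrupt canonical reordering" use mentioned above, here is a minimal Python sketch. It relies only on the standard `unicodedata` module; the choice of base letter and combining marks is an arbitrary assumption made for the example and does not come from the thread.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, canonical combining class 0
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220

# Illustrative input only: two marks typed in "above, below" order.
plain = "a" + DOT_ABOVE + DOT_BELOW
with_cgj = "a" + DOT_ABOVE + CGJ + DOT_BELOW

# Without CGJ, NFD sorts adjacent marks by combining class (220 before 230).
reordered = unicodedata.normalize("NFD", plain)

# With CGJ (class 0) in between, the marks are no longer adjacent,
# so normalization leaves their order alone.
preserved = unicodedata.normalize("NFD", with_cgj)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

The sketch says nothing about rendering or collation; it only shows the normalization behaviour.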
Re: unicode Digest V12 #108
2011/7/2 Andrew Miller a.j.mil...@bcs.org.uk:

The ng in Llangollen is not the digram ng but two separate letters (unlike the ll in the name which is the digram).

Why not simply use a soft hyphen between n and g in this case? Soft hyphens are normally recognized as such by smart correctors, as well as by search engines and collators. It seems enough to me to indicate that this is not the Welsh digram ng; CGJ is certainly not the correct disjoiner in your case anyway.
Re: unicode Digest V12 #108
On 7/2/2011 8:59 AM, Philippe Verdy wrote:

2011/7/2 Andrew Miller a.j.mil...@bcs.org.uk:

The ng in Llangollen is not the digram ng but two separate letters (unlike the ll in the name which is the digram).

Why not simply use a soft hyphen between n and g in this case? Soft hyphens are normally recognized as such by smart correctors, as well as by search engines and collators. It seems enough to me to indicate that this is not the Welsh digram ng; CGJ is certainly not the correct disjoiner in your case anyway.

This solution works well if the word can split between the n and the g. In fact, if such a split is possible, I would call it the preferred solution for indicating an accidental digraph.

An example: the Danish digraph aa, normally spelled å in modern orthography but retained in names etc., can occur accidentally in compound nouns, such as dataanalyse. Adding a SHY is the preferred method to indicate that the aa is accidental.

Other characters may have the same effect of breaking the digraph, but their use might require an *additional* SHY to be inserted if and when a line break opportunity needs to be manually marked (say, for an unusual compound not recognized by the automatic hyphenator). It would be bad to have to have *two* invisible characters at that location.
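To make the dataanalyse example concrete, here is a toy Python sketch of how a SHY could keep an accidental aa from being picked up as a digraph. The function `toy_elements` and its greedy left-to-right matching are hypothetical, invented for this illustration; this is not how any real collator or the Unicode Collation Algorithm is implemented.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def toy_elements(word, digraphs=("aa",)):
    """Split a word into toy sorting units, treating listed digraphs as one unit.

    A SHY is dropped from the output, but because it sits between the two
    letters it prevents them from being matched as a digraph.
    """
    elements = []
    i = 0
    while i < len(word):
        if word[i] == SHY:
            i += 1  # invisible: skip it, but it has already separated the letters
            continue
        pair = word[i:i + 2]
        if pair in digraphs:
            elements.append(pair)
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements

unmarked = toy_elements("dataanalyse")
marked = toy_elements("data" + SHY + "analyse")
print(unmarked)  # the accidental "aa" is grouped as one unit
print(marked)    # every letter stands alone
```

In this toy model the unmarked word yields an `aa` unit in the middle, while the marked word yields eleven single letters.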
Re: unicode Digest V12 #108
Asmus Freytag wrote:

On 7/2/2011 8:59 AM, Philippe Verdy wrote:

[...] Why not simply use a soft hyphen between n and g in this case? Soft hyphens are normally recognized as such by smart correctors, as well as by search engines and collators. It seems enough to me to indicate that this is not the Welsh digram ng; CGJ is certainly not the correct disjoiner in your case anyway.

This solution works well if the word can split between the n and the g.

It would still be hackery, since word division is something different from digraphs, no matter what one really means by “digraph.” It’s a trick comparable to inserting a left-to-right mark in the hope of making relevant software treat the characters before and after it as separate, not as candidates for being treated as a digraph.

A soft hyphen probably has the desired effect, but its meaning is really something different. In practice, it does not just say that there is a possible word division point. It may also affect hyphenation so that no automatic hyphenation is applied in the word at all, or within some distance from the soft hyphen.

An isolated soft hyphen also introduces a line breaking opportunity that is often pragmatically odd. In a context where no hyphenation is normally performed, as on a web page, just throwing in a soft hyphen often causes that very word to be split, in the midst of otherwise unhyphenated text. Moreover, in such a situation, the word will be split at the soft hyphen, no matter how odd that particular division may look in a word with many word division opportunities.

The moral is: Don’t play with the soft hyphen unless you are prepared to address the word division problem as a whole, and at least check that the soft hyphen you introduce is the optimal division point for the word, or that you can ensure that better division points will be used when applicable.

The Danish digraph aa, normally spelled å in modern orthography but retained in names etc., can occur accidentally in compound nouns, such as dataanalyse. Adding a SHY is the preferred method to indicate that the aa is accidental.

While the point is the optimal division point (between components of a compound) in this case, this is not generally true for the possible use cases. Besides, it may prevent other divisions of the word, which might be applied due to automatic hyphenation and might really be needed for good typography.

And there is really no guarantee that programs support the soft hyphen. For one, Microsoft Word doesn’t; it treats it as just another printable character. Software that recognizes “words” in some sense, e.g. search engines, may or may not treat the soft hyphen as ignorable, so it may treat a word with a soft hyphen as two words. And so on.

We may need to take our chances when we really need discretionary hyphenation hints. But why take those risks when you don’t really want to affect hyphenation at all?

I may have missed some parts of the discussion, but I don’t see why you couldn’t just use the zero-width non-joiner. Using it may cause risks of its own, but at least you would be dealing with risks related to the original problem.

Jukka
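The three invisible characters discussed so far can be compared with a short Python sketch using the standard `unicodedata` module. It also shows why a search that does not ignore them misses the word. The helper names `describe` and `strip_candidates` are hypothetical and exist only for this illustration; real search engines may behave quite differently.

```python
import unicodedata

# The three candidate "disjoiners" mentioned in the thread.
CANDIDATES = {"SHY": "\u00ad", "CGJ": "\u034f", "ZWNJ": "\u200c"}
INVISIBLE = frozenset(CANDIDATES.values())

def describe(ch):
    """Return (character name, General_Category, canonical combining class)."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

def strip_candidates(text):
    """Drop the three candidate characters, as a tolerant matcher might."""
    return "".join(c for c in text if c not in INVISIBLE)

for label, ch in CANDIDATES.items():
    print(label, describe(ch))

marked = "data" + CANDIDATES["SHY"] + "analyse"

# A naive substring search sees the SHY as an ordinary character and fails.
naive_hit = "dataanalyse" in marked

# Stripping the invisible characters first makes the same search succeed.
tolerant_hit = "dataanalyse" in strip_candidates(marked)

print(naive_hit, tolerant_hit)
```

Whether a given application strips these characters, and which ones, is exactly the uncertainty described above.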
Questions about UAX #29
I have two questions about this.

1) UAX #44 says to see UAX #29 for information about the Grapheme_Base property, but that document doesn't mention this property.

2) The definition in UAX #29 for both legacy and extended grapheme clusters effectively says that any Gc=Cn code point followed by any number of Grapheme_Extend code points is a grapheme cluster. Is that what is meant? I notice that Grapheme_Base excludes Cn code points.
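The reading in question 2 can be illustrated with a deliberately simplified Python sketch. It is not an implementation of UAX #29: it assumes that Grapheme_Extend can be approximated by General_Category Mn or Me, it ignores every other segmentation rule (CR/LF, controls, Hangul and so on), and the names `is_extend_approx` and `toy_clusters` are hypothetical. U+FFFF is used as the Gc=Cn code point because it is a noncharacter.

```python
import unicodedata

def is_extend_approx(ch):
    # Assumption: approximate Grapheme_Extend by Gc in {Mn, Me}.
    # The real property also includes a few other code points.
    return unicodedata.category(ch) in ("Mn", "Me")

def toy_clusters(text):
    """Attach each extend-like character to whatever code point precedes it."""
    clusters = []
    for ch in text:
        if clusters and is_extend_approx(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

cn_code_point = "\uffff"  # noncharacter, General_Category Cn
sample = "x" + cn_code_point + "\u0301" + "\u0300" + "y"

clusters = toy_clusters(sample)
print(unicodedata.category(cn_code_point))
print([[f"U+{ord(c):04X}" for c in cluster] for cluster in clusters])
```

Under these assumptions the Cn code point and the two combining marks end up in a single cluster, which matches the literal reading described in the question; whether that reading is the intended one is the open point.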
SHY, CGJ, etc. (was: Re: unicode Digest V12 #108)
I'm a bit concerned about the implication that correctly encoded Breton, Welsh, etc. Unicode text needs to be sprinkled liberally with SHY or CGJ or other invisible formatting characters, to resolve any possible ambiguity in these languages' orthographies. This is like saying English text needs to have a SHY at every potential hyphenation point, so text processors don't have to use a dictionary to hyphenate.

I can easily see this thread being misinterpreted or taken out of context by newcomers, or reposted or blogged by someone eager to make a point about unneeded complexity in Unicode. Really, for 99.9% of applications, shouldn't we just write the letters?

--Doug

Sent via BlackBerry by ATT

-----Original Message-----
From: Asmus Freytag asm...@ix.netcom.com
Sender: unicode-bou...@unicode.org
Date: Sat, 02 Jul 2011 10:02:03
To: verd...@wanadoo.fr
Cc: Andrew Miller a.j.mil...@bcs.org.uk; unicode@unicode.org
Subject: Re: unicode Digest V12 #108

On 7/2/2011 8:59 AM, Philippe Verdy wrote:

2011/7/2 Andrew Miller a.j.mil...@bcs.org.uk:

The ng in Llangollen is not the digram ng but two separate letters (unlike the ll in the name which is the digram).

Why not simply using a soft hyphen between n and g in this case? Soft hyphens are normally recognized as such by smart correctors and as well by search engines or collators. It seems enough for me to indicate that this is not the Welsh digram ng; CGJ anyway is certainly not the correct disjoiner in your case.

This solution works well if the word can split between the n and the g. In fact, if such split is possible, I would call it the preferred solution to indicating an accidental digraph.

An example: The Danish digraph aa, normally spelled å in modern orthography, but retained in names etc. can occur accidentally in compound nouns, such as dataanalyse. Adding a SHY is the preferred method to indicate that the aa is accidental.

Other characters may have the same effect of breaking the digraph, their use might require an *additional* SHY to be inserted, if and when a linebreak opportunity needs to be manually marked (say for an unusual compound not recognized by the automatic hyphenator). It would be bad to have to have *two* invisible characters at that location.