Re: unicode Digest V12 #108

2011-07-02 Thread Andrew Miller
 From: Philippe Verdy verd...@wanadoo.fr
 Date: Sat, 2 Jul 2011 15:59:18 +0200
 Subject: Re: ch ligature in a monospace font

 2011/7/1 Richard Wordingham richard.wording...@ntlworld.com:
  I wonder if anyone has some statistics on the use of CGJ.  Its revised
  intended use was to disrupt collating sequences, but you may be right
  about its most frequent use being to disrupt canonical reordering.  A
  few years ago I concluded it wasn't yet safe to type the Welsh place
  name Llan͏gollen with CGJ.
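
As a rough illustration of what "disrupting canonical reordering" means, here is a minimal Python sketch using only the standard unicodedata module. The sample string and the codepoints helper are made up for this note; they are not taken from the thread.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, combining class 0

def codepoints(text):
    """Return the code points of text as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Dot above (combining class 230) typed before dot below (combining class 220).
plain = "a\u0307\u0323"
# The same two marks with CGJ placed between them.
blocked = "a\u0307" + CGJ + "\u0323"

# Normalization sorts adjacent combining marks by combining class...
reordered = unicodedata.normalize("NFD", plain)
# ...but a character with combining class 0 interrupts that sort.
kept = unicodedata.normalize("NFD", blocked)

print(codepoints(reordered))
print(codepoints(kept))
```

In the first case the two marks swap places under normalization; in the second the CGJ keeps them in typed order.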

 Interestingly, this name does not render correctly in my version of
 Chrome on Windows 7; it displays the occurrence of CGJ as a
 non-spacing dotted box that overwrites the surrounding characters n
 and g, so that the place name is completely unreadable.

 I just wonder why Chrome needs to display this control in such a
 disruptive way (I have not checked with other browsers).

 Why do you need CGJ between n and g?

 - Is that to make sure that they won't collate as a single element
 ng, but separately? How is that different from the collation of
 "language", where the situation would be similar?

 - Or do you intend to do the reverse, i.e. effectively collate ng in
 Llangollen as a single element?

 Sorry I don't know Welsh, all I know is that ng is a digram of its
 alphabet, which also includes n and g as separate letters... Other
 digrams are dd contrasting with isolated d, ff contrasting with
 isolated f, ll contrasting with isolated l, ph contrasting
 with isolated p and h, rh contrasting with isolated r and h,
 and finally th contrasting with isolated t and h.

The ng in Llangollen is not the digram ng but two separate letters
(unlike the ll in the name which is the digram).
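
To make the distinction concrete, here is a small Python sketch of a naive letter splitter for Welsh. It is only an illustration under stated assumptions: the function name, the greedy left-to-right pairing, and the choice of which invisible characters count as separators are all made up for this note (ch is included in the digraph set alongside the ones listed above, since it is part of the Welsh alphabet). Real collation software works from tailored collation data, not from a loop like this.

```python
# Welsh digraphs, each counted as a single letter of the alphabet.
DIGRAPHS = {"ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"}
# Invisible characters discussed in this thread as possible separators
# (assumption: all three are treated the same way here).
BREAKERS = {"\u034f", "\u00ad", "\u200c"}  # CGJ, SHY, ZWNJ

def welsh_letters(word):
    """Split word into alphabet letters, greedily pairing digraphs.

    A breaker between two characters prevents pairing and is dropped.
    """
    letters = []
    i = 0
    while i < len(word):
        if word[i] in BREAKERS:
            i += 1
            continue
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters

print(welsh_letters("Llangollen"))
print(welsh_letters("Llan\u034fgollen"))
```

Without a separator the greedy splitter wrongly pairs n and g; with one between them, the two letters stay separate while both ll digraphs are still recognized.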




Re: unicode Digest V12 #108

2011-07-02 Thread Philippe Verdy
2011/7/2 Andrew Miller a.j.mil...@bcs.org.uk:
 The ng in Llangollen is not the digram ng but two separate letters
 (unlike the ll in the name which is the digram).

Why not simply use a soft hyphen between n and g in this case?
Soft hyphens are normally recognized as such by smart correctors, as
well as by search engines and collators. It seems to me enough to
indicate that this is not the Welsh digram ng; CGJ, in any case, is
certainly not the correct disjoiner here.



Re: unicode Digest V12 #108

2011-07-02 Thread Asmus Freytag

On 7/2/2011 8:59 AM, Philippe Verdy wrote:

2011/7/2 Andrew Miller a.j.mil...@bcs.org.uk:

The ng in Llangollen is not the digram ng but two separate letters
(unlike the ll in the name which is the digram).

Why not simply use a soft hyphen between n and g in this case?
Soft hyphens are normally recognized as such by smart correctors, as
well as by search engines and collators. It seems to me enough to
indicate that this is not the Welsh digram ng; CGJ, in any case, is
certainly not the correct disjoiner here.



This solution works well if the word can split between the n and the g.

In fact, if such a split is possible, I would call it the preferred
way of indicating an accidental digraph.


An example:

The Danish digraph aa, normally spelled å in modern orthography but
retained in names etc., can occur accidentally in compound nouns, such
as dataanalyse. Adding a SHY is the preferred method to indicate that
the aa is accidental.
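
A minimal Python sketch of the idea, assuming the compound parts are known in advance. The mark_boundary helper is hypothetical and only serves to show where the SHY goes and what it does to a plain substring search for aa.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def mark_boundary(first, second):
    """Join two compound parts with a soft hyphen (hypothetical helper)."""
    return first + SHY + second

plain = "data" + "analyse"
marked = mark_boundary("data", "analyse")

print("aa" in plain)                     # the accidental digraph is present
print("aa" in marked)                    # the soft hyphen separates the two letters
print(marked.replace(SHY, "") == plain)  # stripping SHY recovers the plain spelling
```

Whether a given collator or spell checker actually honours the SHY in this way depends on the software; the sketch only shows the effect at the level of the character sequence.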


Other characters may have the same effect of breaking the digraph, but
their use might require an *additional* SHY to be inserted, if and when a
linebreak opportunity needs to be manually marked (say for an unusual
compound not recognized by the automatic hyphenator). It would be bad to
have to have *two* invisible characters at that location.





Re: unicode Digest V12 #108

2011-07-02 Thread Jukka K. Korpela

Asmus Freytag wrote:

On 7/2/2011 8:59 AM, Philippe Verdy wrote:

[...]

Why not simply use a soft hyphen between n and g in this case?
Soft hyphens are normally recognized as such by smart correctors, as
well as by search engines and collators. It seems to me enough to
indicate that this is not the Welsh digram ng; CGJ, in any case, is
certainly not the correct disjoiner here.


This solution works well if the word can split between the n and the
g.


It would still be hackery, since word division is something different from 
digraphs, no matter what one really means by “digraph.” It’s a trick 
comparable to inserting a left-to-right mark in the hope of making relevant 
software treat the characters before and after it as separate, not as 
candidates for being treated as a digraph.


A soft hyphen probably has the desired effect, but its meaning is really 
something different. In practice, it does not just say that there is a 
possible word division point. It may also affect hyphenation so that no 
automatic hyphenation is applied in the word at all, or within some distance 
from the soft hyphen.


An isolated soft hyphen also introduces a line breaking opportunity that is 
often pragmatically odd. In a context where no hyphenation is normally 
performed, as on a web page, just throwing in a soft hyphen often causes 
that very word to be split, in the midst of otherwise unhyphenated text. 
Moreover, in such a situation, the word will be split at the soft hyphen, no
matter how odd that particular division may look in a word with many
word division opportunities.


The moral is: Don’t play with the soft hyphen unless you are prepared to
address the word division problem as a whole and at least check that the soft
hyphen you introduce is the optimal division point for the word, or you can
ensure that better division points will be used when applicable.



The Danish digraph aa, normally spelled å in modern orthography,
but retained in names etc. can occur accidentally in compound
nouns, such as dataanalyse. Adding a SHY is the preferred method to
indicate that the aa is accidental.


While the point is the optimal division point (between components of a
compound) in this case, this is not generally true for the possible use
cases. Besides, it may prevent other divisions of the word, which might be
applied due to automatic hyphenation and might really be needed for good
typography.


And there is really no guarantee that programs support the soft hyphen. For
one, Microsoft Word doesn’t; it treats it as just another printable
character. Software that recognizes “words” in some sense, e.g. search
engines, may or may not treat the soft hyphen as ignorable, so it may treat
a word containing a soft hyphen as two words. And so on.
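
The word-splitting risk is easy to reproduce. The following Python sketch uses the standard re module; the two tokenizer functions and the sample text are invented for illustration and do not describe how any particular search engine works.

```python
import re

SHY = "\u00ad"  # SOFT HYPHEN

def naive_tokens(text):
    """Tokenize on runs of word characters, ignoring nothing."""
    return re.findall(r"\w+", text)

def shy_aware_tokens(text):
    """Same tokenizer, but drop soft hyphens before matching."""
    return re.findall(r"\w+", text.replace(SHY, ""))

sample = "en data" + SHY + "analyse"

print(naive_tokens(sample))
print(shy_aware_tokens(sample))
```

The soft hyphen is not a word character for \w, so the naive tokenizer cuts the compound in two; only the variant that strips it first sees a single word.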


We may need to take our chances when we really need discretionary 
hyphenation hints. But why take those risks when you don’t really want to 
affect hyphenation at all?


I may have missed some parts of the discussion, but I don’t see why you
couldn’t just use the zero-width non-joiner. Using it may carry risks of its
own, but at least you would be dealing with risks related to the original
problem.


Jukka 
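
For reference, the three invisible characters proposed in this thread can be compared with a few lines of Python and the standard unicodedata module. The describe helper is just an illustrative name; the sketch only reports code point, character name and general category, and says nothing about how any renderer or collator treats them.

```python
import unicodedata

# Soft hyphen, combining grapheme joiner, zero-width non-joiner.
CANDIDATES = ["\u00ad", "\u034f", "\u200c"]

def describe(ch):
    """Return (code point, character name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(describe(ch))
```

Note that CGJ is a nonspacing mark (Mn), while SHY and ZWNJ are format characters (Cf).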





Questions about UAX #29

2011-07-02 Thread Karl Williamson

I have two questions about this.

1) UAX #44 says to see UAX #29 for information about the Grapheme_Base
property, but that document doesn't mention this property.


2) The definition in UAX #29 for both legacy and extended grapheme
clusters effectively says that any Gc=Cn code point followed by any
number of Grapheme_Extend code points is a grapheme cluster.  Is that
what is meant?  I notice that Grapheme_Base excludes Cn code points.
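
The situation in question 2 can be sketched in Python. This is a deliberately rough approximation, not the UAX #29 algorithm: Python's standard library has no grapheme segmentation, so Grapheme_Extend is approximated here by the Mn and Me general categories, and the control and Hangul rules are ignored. The function names are invented, and U+0378 is used on the assumption that it is still unassigned in the Unicode version shipped with the interpreter.

```python
import unicodedata

def is_extend_like(ch):
    """Rough stand-in for Grapheme_Extend: nonspacing and enclosing marks."""
    return unicodedata.category(ch) in ("Mn", "Me")

def rough_clusters(text):
    """Attach each extend-like mark to the character that precedes it."""
    clusters = []
    for ch in text:
        if clusters and is_extend_like(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

unassigned = "\u0378"                  # assumed to be Gc=Cn (reserved)
sample = unassigned + "\u0301" + "b"   # Cn, combining acute accent, letter

print(unicodedata.category(unassigned))
print(len(rough_clusters(sample)))
```

Under this reading the unassigned code point and the following mark end up in one cluster, which is the behaviour the question asks about.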




SHY, CGJ, etc. (was: Re: unicode Digest V12 #108)

2011-07-02 Thread doug
I'm a bit concerned about the implication that correctly encoded Breton, Welsh, 
etc. Unicode text needs to be sprinkled liberally with SHY or CGJ or other 
invisible formatting characters, to resolve any possible ambiguity in these 
languages' orthographies.  This is like saying English text needs to have a SHY 
at every potential hyphenation point, so text processors don't have to use a 
dictionary to hyphenate.

I can easily see this thread being misinterpreted or taken out of context by 
newcomers, or reposted or blogged by someone eager to make a point about 
unneeded complexity in Unicode.  Really, for 99.9% of applications, shouldn't 
we just write the letters?

--Doug
