Re: What are the issues in having U+FB06 fold to U+FB05?

2011-07-06 Thread Ken Whistler

On 7/6/2011 1:40 PM, Mark Davis ☕ wrote:


The other two are special cases; they casefold together because of the
way that the full case mapping is computed. Their equivalence is
normally captured by a canonical-equivalent folding. Because the simple
folding is only codepoint by codepoint, and only resulting in single
code points, they can't be added.

I didn't understand the sentence above.  But would it be fair to
say that a plausible case could be made for FB06 folding to FB05
simply, but that there really shouldn't be a simple fold for the
other two cases?


Yes, that's what I mean. You can propose all three if you want, via 
the reporting form, but I think only #1 is a real possibility (IMO).


For those following along (or not), this has to do with entries in
CaseFolding.txt. The current relevant sections of CaseFolding.txt are:

FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
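
As an aside for readers who want to check the entries above mechanically,
here is a minimal Python sketch. It assumes only the "code; status; mapping;
# name" line layout shown in the excerpt; the function names
parse_casefolding and shared_full_folds are made up for this illustration
and are not part of any Unicode tool.

```python
from collections import defaultdict

# The six entries quoted in this message, one per line.
CASEFOLDING_EXCERPT = """\
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
"""


def parse_casefolding(text):
    """Yield (code_point, status, mapping) for each data line (hypothetical helper)."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        yield int(code, 16), status, tuple(int(cp, 16) for cp in mapping.split())


def shared_full_folds(entries):
    """Group code points by their F mapping and keep groups with two or more members."""
    groups = defaultdict(list)
    for code, status, mapping in entries:
        if status == "F":
            groups[mapping].append(code)
    return {mapping: cps for mapping, cps in groups.items() if len(cps) > 1}


pairs = shared_full_folds(parse_casefolding(CASEFOLDING_EXCERPT))
for mapping, cps in pairs.items():
    sources = ", ".join(f"{cp:04X}" for cp in cps)
    target = " ".join(f"{cp:04X}" for cp in mapping)
    print(f"{sources} -> {target}")
```

Run against this excerpt, the grouping reports the three pairs under
discussion, each sharing one full (F) mapping.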


What Karl is suggesting amounts to updating those entries to:

FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA


Note that I think the plausible simple folding for the first group is 
FB05 *to* FB06, not vice versa.


As for the other two, taking the 0390/1FD3 pair as the example
we would have, currently, for simple case folding:

simpleCaseFold(0390) = 0390
simpleCaseFold(1FD3) = 1FD3

simpleCaseFold(NFD(0390)) = 03B9 0308 0301
simpleCaseFold(NFD(1FD3)) = 03B9 0308 0301

and for full case folding:

CaseFold(0390) = 03B9 0308 0301
CaseFold(1FD3) = 03B9 0308 0301

CaseFold(NFD(0390)) = 03B9 0308 0301
CaseFold(NFD(1FD3)) = 03B9 0308 0301

In all of these instances, because 1FD3 is canonically equivalent to 0390, the
results of the folding are canonically equivalent. While there might not be any
actual prohibition against adding a simple case folding of 1FD3 to 0390 explicitly
in CaseFolding.txt, I don't see that it buys anybody anything. This is roughly the
same problem as, for example:

simpleCaseFold(00E1) = 00E1
simpleCaseFold(0061 0301) = 0061 0301

simpleCaseFold(NFD(00E1)) = 0061 0301
simpleCaseFold(NFD(0061 0301)) = 0061 0301

and noting that the results of the simpleCaseFold of those two different sources
are canonically equivalent, even if you don't do the normalization before the
case folding. An application which is doing case folding, but which isn't
checking for canonical equivalence is kinda out to lunch, anyway, as this
example demonstrates.
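
To make the point above concrete, here is a small Python sketch. It relies
on the standard library only: str.casefold() applies full case folding (the
standard library does not, as far as I know, expose simple case folding), and
unicodedata.normalize gives NFD. The helper names are invented for this
example; the comparison follows the NFD(toCasefold(NFD(X))) shape of the
Unicode Standard's canonical caseless matching.

```python
import unicodedata


def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_key(s: str) -> str:
    """NFD(toCasefold(NFD(s))), using Python's full case folding."""
    return nfd(nfd(s).casefold())


def canonical_caseless_match(x: str, y: str) -> bool:
    """Compare two strings caselessly, modulo canonical equivalence."""
    return canonical_caseless_key(x) == canonical_caseless_key(y)


# Full folding alone already unifies the Greek pair from CaseFolding.txt.
print("\u0390".casefold() == "\u1fd3".casefold())      # True

# Folding alone does not unify U+00E1 with <U+0061, U+0301> ...
print("\u00e1".casefold() == "a\u0301".casefold())     # False

# ... but a comparison that respects canonical equivalence does.
print(canonical_caseless_match("\u00e1", "a\u0301"))   # True
print(canonical_caseless_match("\u0390", "\u1fd3"))    # True
```

The second comparison is the "out to lunch" case: folding without any check
for canonical equivalence treats two canonically equivalent strings as
different.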

So while I don't quite understand Mark's claim that "they can't be added", I
would say that I agree at least that I don't see any point to adding them.

I'm not sure whether the FB05/FB06 instance is important enough to add
or not. Neither of those compatibility ligatures should ordinarily be used
in text, anyway, and it is hard to see that an algorithmic neatness argument
buys much here in the way of actual utility.

--Ken




Re: What are the issues in having U+FB06 fold to U+FB05?

2011-07-06 Thread Mark Davis ☕
Mark
*— Il meglio è l’inimico del bene —*


On Sat, Jun 11, 2011 at 08:04, Karl Williamson wrote:

> On 06/08/2011 03:33 PM, Mark Davis ☕ wrote:
>
>> As to the first, it would seem reasonable. The simple folding is not
>> covered by the following stability policies:
>>
>> http://www.unicode.org/policies/stability_policy.html#Case_Folding
>> http://www.unicode.org/policies/stability_policy.html#Case_Pair
>>
>> However, the committee may be leery of changing these even though they
>> are not covered by those policies. You can file a request form for the
>> committee to consider it, at
>> http://unicode.org/reporting.html
>>
>> The other two are special cases; they casefold together because of the
>> way that the full case mapping is computed. Their equivalence is
>> normally captured by a canonical-equivalent folding. Because the simple
>> folding is only codepoint by codepoint, and only resulting in single
>> code points, they can't be added.
>>
> I didn't understand the sentence above.  But would it be fair to say that
> a plausible case could be made for FB06 folding to FB05 simply, but that
> there really shouldn't be a simple fold for the other two cases?
>

Yes, that's what I mean. You can propose all three if you want, via the
reporting form, but I think only #1 is a real possibility (IMO).


>
>  Mark
>>
>> /— Il meglio è l’inimico del bene —/
>>
>>
>> On Sun, Jun 5, 2011 at 08:17, Karl Williamson wrote:
>>
>>    There are three pairs of characters in Unicode 6.0 in which each
>>    member of the pair has a full fold to the same sequence, yet there
>>    is no simple fold relation between them.  They are:
>>
>>    U+FB05 LATIN SMALL LIGATURE LONG S T and
>>    U+FB06 LATIN SMALL LIGATURE ST
>>    both fold to 'st';
>>
>>    U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
>>    U+1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
>>    both fold to the sequence "U+03B9 U+0308 U+0301" or (the dot
>>    standing for concatenation)
>>    GREEK SMALL LETTER IOTA . COMBINING DIAERESIS . COMBINING ACUTE ACCENT
>>
>>    U+03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
>>    U+1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
>>    both fold to the sequence "U+03C5 U+0308 U+0301" or
>>    GREEK SMALL LETTER UPSILON . COMBINING DIAERESIS . COMBINING ACUTE
>>    ACCENT
>>
>>    Under full case folding rules, each member of one of these pairs is
>>    caselessly equivalent to the other member, even without adding NFD
>>    rules.  Correct me if I'm wrong, but shouldn't they also be
>>    caselessly equivalent under simple folding rules?  If so, I'm
>>    wondering what issues there would be in creating an S rule for these
>>    pairs in CaseFolding.txt, so that they would be considered
>>    caselessly equivalent even for applications that don't do full case
>>    folding?


Re: Questions about UAX #29

2011-07-06 Thread Mark Davis ☕
I wouldn't be averse to adding [:cn:][:cs:][:co:] to [:gcb:control:]. It
would make it align more with the current definition of Grapheme_Base.

As to how to handle private use characters, UAX #29 already allows
overriding:

"This specification defines *default* mechanisms; more sophisticated
implementations can *and should* tailor them for particular locales or
environments."

I'll file an agenda item for the August UTC meeting to consider this; you
can also add your feedback to the UTC using the reporting form.
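
For readers who want to see which code points the proposal above would touch,
here is a rough Python sketch. It only tests the General_Category values
named in the proposal (Cn, Cs, Co); it does not reproduce the actual
Grapheme_Cluster_Break=Control definition from UAX #29, and the function
name treat_as_control is invented for this illustration. Results for Cn
depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# General_Category values under discussion: unassigned (Cn),
# surrogate (Cs), and private use (Co).
EXTRA_CONTROL_CATEGORIES = {"Cn", "Cs", "Co"}


def treat_as_control(ch: str) -> bool:
    """Return True if ch falls in one of the categories proposed for Control.

    Hypothetical helper: a tailored implementation could consult something
    like this before applying its usual grapheme cluster break rules.
    """
    return unicodedata.category(ch) in EXTRA_CONTROL_CATEGORIES


for ch in ("a", "\ue000", "\ud800", "\uffff"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), treat_as_control(ch))
```

An implementation that wants private use characters to behave as base
characters, as UAX #29 allows through tailoring, could simply drop "Co" from
the set.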

Mark
*— Il meglio è l’inimico del bene —*


On Tue, Jul 5, 2011 at 16:31, Karl Williamson wrote:

> On 07/05/2011 09:29 AM, Mark Davis ☕ wrote:
>
>> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>>
>> Yes, an unassigned code point (Cn) is treated as a base character.
>>
>> Unassigned code points are peculiar beasts, since we don't know really
>> how they should behave until (and if) they are assigned. Their treatment
>> by the Unicode algorithms varies based on some factors:
>>
>>  * safety - don't have them behave in a way that causes problems
>>  * foresight - have them behave like the most likely candidate for
>>    future assignment
>>  * simplicity - since they shouldn't occur normally in text, don't
>>    spend too much time worrying about them.
>>
>> These are not formalized principles, just my observations on how we've
>> operated over the years.
>>
>> Mark
>> /— Il meglio è l’inimico del bene —/
>>
>
> Thanks for the answer.  It does seem weird to me to treat them as base
> characters.
>
> But, I'm wondering then about Cs, isolated surrogates.  They also are
> treated as base characters.  That seems wrong to me.  Since UTS18 is
> starting to mention the possibility of them in regexes, perhaps this should
> be addressed?
>
> Also, my understanding of UAX #44 is that private use code points may or
> may not be treated as base characters at the application's discretion. But
> this isn't mentioned in UAX #29.
>


Re: unicode Digest V12 #108

2011-07-06 Thread Ken Whistler

On 7/6/2011 11:18 AM, Asmus Freytag wrote:
The Danes, over a decade ago, when they made the official 
recommendation to use SHY appear to have come to the conclusion that 
"AA" can never occur accidentally, except at word division in compounds.


Not really a safe conclusion. :)

http://da.wikipedia.org/wiki/Hank_Aaron

Although I guess it depends on what the meaning of "aaccidental" is. The point
here is that in citations of non-Danish names in Danish text, an "aa" can never
be equated to an å, but nor would you want to insert a SHY between the a's in
that case to distinguish the sequence.
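
A toy Python sketch of the problem, under loudly stated assumptions: the
function naive_aa_fold and its rule (treat every "aa" as å unless a SHY,
U+00AD, separates the two letters) are invented here to illustrate the
argument and are not taken from any Danish recommendation or library. The
compound "ekstraarbejde" is used only as an illustrative example of an "aa"
at a compound boundary.

```python
SHY = "\u00ad"  # SOFT HYPHEN


def naive_aa_fold(text: str) -> str:
    """Equate "aa" with å unless a SHY sits between the two letters.

    Hypothetical rule for illustration only. The SHY is removed afterwards,
    so it acts purely as a "not a digraph" marker here.
    """
    folded = (
        text.replace("AA", "\u00c5")
        .replace("Aa", "\u00c5")
        .replace("aa", "\u00e5")
    )
    return folded.replace(SHY, "")


print(naive_aa_fold("Aabenraa"))               # digraphs become Å / å
print(naive_aa_fold("ekstra" + SHY + "arbejde"))  # SHY keeps the a's apart
print(naive_aa_fold("Hank Aaron"))             # foreign name is mangled
```

The last call shows the failure: the name comes out with Å even though
nobody would put a SHY inside it.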

--Ken




Re: unicode Digest V12 #108

2011-07-06 Thread Asmus Freytag

On 7/6/2011 12:16 AM, Jukka K. Korpela wrote:
Allowing word division just to say that some characters do not 
constitute a digraph (or trigraph…) is not practical e.g. when the 
text has otherwise no word divisions, for one reason or another, or 
when the particular word division point is typographically suboptimal 
or even bad. 

I quite agree. But that's been my position from the start.

In my very first post in this thread I had written:

   ...*if* such split [=word division] *is possible*, I would call it
   [=SHY] the preferred solution to indicating an "accidental" digraph.

The corollary is that it's not a good thing to use SHY when there's no 
coinciding word division.


True digraphs are usually not word division points, but in any language 
forming compounds, accidental combinations occur at word-division 
boundaries with some frequency.


The Danes, over a decade ago, when they made the official recommendation 
to use SHY appear to have come to the conclusion that "AA" can never 
occur accidentally, except at word division in compounds.


A./


Re: SHY, CGJ, etc.

2011-07-06 Thread Andreas Prilop
On Tue, 5 Jul 2011, Philippe Verdy wrote:

>> Even MS Word 2010 continues to use U+001F as soft hyphen
>> but does not recognize U+00AD as soft hyphen.
>
> I've not spoken at all about U+001F and not even tested it

alt+0031
alt+0173

> I have entered TRUE soft hyphens as U+00AD, in a plain-text
> document, and opened it in word. And this works effectively
> as expected. I could also copy-paste a SHY from a plain-text
> document, or from the Charmap utility, or from my keyboard,
> and it works as well.

File > Options > Advanced > Cut, copy, and paste >
Pasting from other programs > Keep Text Only



Re: unicode Digest V12 #108

2011-07-06 Thread Jukka K. Korpela

2011-07-06 9:25, Asmus Freytag wrote:


Because accidental digraphs (in Danish) happen at word boundaries in a
compound, the SHY is an elegant way to mark them.


It may often be a practical trick, given the current repertoire of 
characters in Unicode and the way they are handled in different 
programs. But I don’t see any elegance in it, and it may turn into an
impractical method rather easily.


You don’t really want to say “this is an allowable word division point” 
but “these two (or more!) characters are not to be treated as one unit
of text, even in a context where they normally would be so treated.” You
_might_ want to explicitly allow word division, but that’s coincidental.


Allowing word division just to say that some characters do not 
constitute a digraph (or trigraph…) is not practical e.g. when the text 
has otherwise no word divisions, for one reason or another, or when the 
particular word division point is typographically suboptimal or even bad.


--
Yucca, http://www.cs.tut.fi/~jkorpela/