> On Apr 2, 2017, at 1:50 AM, Gerriet M. Denkmann <gerri...@icloud.com> wrote:
> 
>> 
>> On 2 Apr 2017, at 10:59, Aki Inoue <a...@apple.com> wrote:
>> 
>> 
>>> On Apr 1, 2017, at 4:57 PM, Gerriet M. Denkmann <gerri...@icloud.com> wrote:
>>> 
>>> 
>>>> On 2 Apr 2017, at 06:33, Jens Alfke <j...@mooseyard.com> wrote:
>>>> 
>>>> 
>>>>> On Apr 1, 2017, at 11:58 AM, Gerriet M. Denkmann <gerri...@icloud.com> 
>>>>> wrote:
>>>>> 
>>>>> I think that the examples above show, that NSURL does indeed do something 
>>>>> about normalising Unicode strings.
>>>> 
>>>> That makes sense; I’d expect that one of the RFCs covering URLs describes 
>>>> normalization. Otherwise constructing URLs (for a REST API, say) could 
>>>> become quite ambiguous because you wouldn’t know which way to encode 
>>>> various Unicode characters.
>>>> 
>>>>> But my point is that NSURL gets the normalisation wrong in this case; or 
>>>>> at least that it is not very consistent in normalising strings.
>>>> 
>>>> Yes, it does seem wrong that you can have two filenames that are treated 
>>>> as distinct by the filesystem, but whose URL.path properties produce 
>>>> identical NSStrings.
>>> 
>>> Sorry, my explanation was not quite clear: these two filenames look 
>>> absolutely identical, but as a sequence of Unicode code points, they are 
>>> not (tone-mark and vowel are in different order).
>>> 
>>> What puzzles me is that consonant + THAI CHARACTER MAI EK + THAI CHARACTER 
>>> SARA UU gets normalised by NSURL to:  consonant + THAI CHARACTER SARA UU + 
>>> THAI CHARACTER MAI EK (note the different order), whereas consonant + THAI 
>>> CHARACTER MAI EK + THAI CHARACTER SARA II is left unchanged.
>> Gerriet,
>> 
>> This is standard Unicode normalization behavior. Each Unicode character 
>> is assigned a Canonical_Combining_Class (ccc), an integer value that 
>> defines the canonical ordering of combining marks.
>> 
>> The combining class of THAI CHARACTER SARA UU is 103, and that of THAI 
>> CHARACTER MAI EK is 107. So MAI EK always comes after SARA UU in canonical 
>> order.
>> 
>> On the other hand, THAI CHARACTER SARA II has the value 0, which marks 
>> the start of a new reordering segment. That’s why it is not reordered 
>> with respect to other Thai combining characters.
>> 
>> Aki
> 
> Thanks a lot for this explanation.
> 
> I just read about Canonical_Combining_Class in 
> <http://unicode.org/reports/tr44/#Validation_of_CCC>.
> 
> What I did not find was an explanation of why all Thai top vowels (plus 
> THAI CHARACTER MAI HAN-AKAT) have Canonical_Combining_Class 0, 
> Not_Reordered, whereas the bottom vowels have 103.
I’m not a linguistics expert, but my understanding of the Unicode canonical 
combining class is that two characters can share the same class when either:
- their relative order carries meaning (swapping them changes the meaning, 
for example)
or
- they can never be attached to the same base character at the same time, 
linguistically or grammatically
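
The practical consequence of sharing a class is that normalization never swaps two such marks, so both orders survive as distinct strings. A minimal sketch in Python, assuming the standard-library unicodedata module matches the Unicode Character Database; the choice of KO KAI as the base and the helper name nfd are only for illustration:

```python
import unicodedata

BASE = "\u0e01"      # THAI CHARACTER KO KAI, a consonant (class 0)
MAI_EK = "\u0e48"    # tone mark, class 107
MAI_THO = "\u0e49"   # tone mark, class 107


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


ek_then_tho = BASE + MAI_EK + MAI_THO
tho_then_ek = BASE + MAI_THO + MAI_EK

# Both marks share one class, so canonical ordering leaves each
# sequence exactly as typed and the two strings stay different.
print(unicodedata.combining(MAI_EK), unicodedata.combining(MAI_THO))
print(nfd(ek_then_tho) == ek_then_tho)
print(nfd(tho_then_ek) == tho_then_ek)
print(nfd(ek_then_tho) == nfd(tho_then_ek))
```

Stacking two tone marks is not meaningful Thai; the pair is used here only because both marks are in class 107.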

> Another strange thing: the tone marks have 107, but THAI CHARACTER 
> THANTHAKHAT has 0. (This sometimes occurs together with ิ, e.g. เกียรติ์, or 
> ุ, e.g. บงสุ์ )
As far as I know, the class 0 Thai vowels can appear multiple times on a 
single consonant, and their order carries distinct meaning. So these 
characters must all be in the same Unicode combining class, 0.

The Unicode specification is carefully crafted so that the general rules for 
combining classes work universally (except for the Hebrew accent characters).
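
The class values discussed in this thread can be checked directly. The following is a rough sketch in Python, assuming the standard-library unicodedata module reflects the Unicode Character Database. NFD is used only as an example form, since the thread does not say which form NSURL applies, and canonical ordering is part of both NFD and NFC. The helper names describe and code_points are made up for this example.

```python
import unicodedata

KO_KAI = "\u0e01"        # consonant used as the base character
SARA_II = "\u0e35"       # vowel written above the consonant
SARA_UU = "\u0e39"       # vowel written below the consonant
MAI_EK = "\u0e48"        # tone mark
MAI_HAN_AKAT = "\u0e31"
THANTHAKHAT = "\u0e4c"


def describe(ch):
    """Return 'U+XXXX NAME ccc=N' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}"


def code_points(s):
    """Render a string as a space-separated list of code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Combining class of each mark mentioned in the thread.
for ch in (SARA_II, SARA_UU, MAI_EK, MAI_HAN_AKAT, THANTHAKHAT):
    print(describe(ch))

# The two cases from the original question: tone mark typed before the vowel.
for vowel in (SARA_UU, SARA_II):
    typed = KO_KAI + MAI_EK + vowel
    print(code_points(typed), "->", code_points(unicodedata.normalize("NFD", typed)))
```

With SARA UU (103) the tone mark (107) is moved after the vowel. With SARA II (0) the vowel starts a new segment and the sequence is left as typed.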

> If you have any links to an explanation for these (to me) rather strange 
> decisions of the Unicode people, I would appreciate this very much.
These questions would probably be appropriate for the Unicode mailing list 
<http://www.unicode.org/consortium/distlist.html>.

There are many real linguistics experts there (some of them have been involved 
since the beginning of Unicode) who should be able to answer your questions :)

Aki

> 
> 
> Kind regards,
> 
> Gerriet.

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
