Thanks a lot for your insights!

(I'm cc'ing the mailing list, just in case someone else later stumbles across
this.)

Best, G.


> On 14. Mar 2022, at 14:57, Aandi Inston <aa...@quite.com> wrote:
>
> This is largely from memory, so details might be wrong.
> Normalisation is something too few people consider when working with
> Unicode. (We all know that Unicode is a list of code points, i.e. integers.)
>
> Here are some Unicode code points for this discussion:
> U+0065 "e" Latin Small Letter E
> U+00E9 "é" Latin Small Letter E with Acute
> U+0301 "◌́" Combining Acute Accent (this may not display as expected)
>
> Many languages have accents that modify letters, so we have "e" plus "acute"
> to get "e acute". In Unicode there are two ways to get "e acute". One is the
> single code point U+00E9. The other is the TWO characters "e" and
> "combining acute accent", so U+0065 followed by U+0301. U+0301 does not take
> any space for itself, but puts an acute accent over the character before
> it. (Not all accented letters have two representations like this, and some
> have more than two.)
> So, what's the difference between U+00E9 and U+0065 followed by U+0301?
> They will look the same, but a string with the second form will be 1
> character longer, and the offsets of all characters after it will shift.
> Are they equal? Well, no, not in simple terms, because they are different
> lists of characters.
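
For anyone reading this in the archive, here is a minimal Swift sketch of the two spellings described above. The constant names are made up for illustration. One caveat that is easy to miss: Swift's own `==` on `String` compares by canonical equivalence, so the difference only shows up when you look at the code points or UTF-16 code units (which is what NSString's `length` counts).

```swift
// Two spellings of "e acute" (constant names are illustrative only).
let precomposed = "\u{00E9}"          // single code point U+00E9
let decomposed  = "\u{0065}\u{0301}"  // "e" followed by combining acute accent

// The second form is one code point longer.
print(precomposed.unicodeScalars.count)  // 1
print(decomposed.unicodeScalars.count)   // 2

// Same picture in UTF-16 code units.
print(precomposed.utf16.count)  // 1
print(decomposed.utf16.count)   // 2

// Compared code point by code point, they are different lists.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
print(sameScalars)  // false

// Swift's String == treats canonically equivalent strings as equal.
print(precomposed == decomposed)  // true
```

So whether the two forms "are equal" depends on which comparison you use: a literal one says no, Swift's `String` equality says yes.
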
> Do we get both? YES. The Mac OS file systems store the long form. On a French
> keyboard, if you type e acute, you get the short form. If you copy and paste,
> it could be either. This can be bad. For example, if you get a list of the
> files in a folder and let the user type a name to choose a file, there
> might not be a match, even though the user can see one.
> To get over this, we have normalisation forms. Unicode defines four of
> them: C, D, KC and KD. precomposedStringWithCanonicalMapping converts to
> form C. It doesn't really matter which form you pick, as long as you use it
> consistently: if you run all your strings through
> precomposedStringWithCanonicalMapping, then you will get more expected
> results when comparing strings.
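
A small sketch of that advice, assuming Foundation is available. The file names and the `literallyEqual` helper are hypothetical; the helper only stands in for any comparison that does not normalise on its own.

```swift
import Foundation

// Hypothetical data: a name as a file listing might return it (long form)
// and the same name as typed on a keyboard (short form).
let storedName = "cafe\u{0301}.txt"
let typedName  = "caf\u{00E9}.txt"

// Hypothetical helper: literal comparison, code unit by code unit.
func literallyEqual(_ a: String, _ b: String) -> Bool {
    return Array(a.utf16) == Array(b.utf16)
}

// Without normalisation the literal comparison fails.
let rawMatch = literallyEqual(storedName, typedName)
print(rawMatch)  // false

// Convert both sides to form C, then compare again.
let storedC = storedName.precomposedStringWithCanonicalMapping
let typedC  = typedName.precomposedStringWithCanonicalMapping
let normalisedMatch = literallyEqual(storedC, typedC)
print(normalisedMatch)  // true

// Form C uses the short spelling; form D uses the long one.
let typedD = typedName.decomposedStringWithCanonicalMapping
print(storedC.unicodeScalars.count)  // 8
print(typedD.unicodeScalars.count)   // 9
```

The point is to normalise both sides with the same form before comparing or indexing, whichever form that is.
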
>

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
