Hi John

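You can find out whether a Unicode string lies entirely inside the BMP 
by converting it to UTF-32 and checking that the new string is exactly 
twice the byte length of the original (UTF-16) string. Any character 
outside the BMP takes a surrogate pair (4 bytes) in UTF-16 but still 
only 4 bytes in UTF-32, so the result then comes out shorter than double.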
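
A minimal sketch of that check, in Python rather than Delphi purely for 
illustration. The function name is_bmp_only is my own, and the "-le" 
codecs are used only so that no byte-order mark gets counted:

```python
def is_bmp_only(text: str) -> bool:
    """Return True if every codepoint in text lies inside the BMP."""
    # The "-le" codec variants do not prepend a byte-order mark.
    utf16 = text.encode("utf-16-le")   # 2 bytes per BMP char, 4 per surrogate pair
    utf32 = text.encode("utf-32-le")   # always 4 bytes per codepoint
    # BMP-only text doubles in size exactly; anything else falls short.
    return len(utf32) == 2 * len(utf16)


print(is_bmp_only("na\u00efve"))       # all characters inside the BMP
print(is_bmp_only("a\U0001F600"))      # U+1F600 is outside the BMP
```

In Delphi the same idea should carry over to whatever UTF-32 conversion 
you have available, as long as you compare lengths in the same units.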
> A user could specifically choose to enter that character in either form - 
> this is unlikely, yes.  Or, two users using the same codepage could choose to 
> enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.
>    
Agreed. That will eliminate any issues with composite codepoints.
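
As a rough illustration (Python again, not Delphi; same_text is a 
hypothetical helper name, and NFC is just one of the normalisation forms 
you could pick), normalising both sides before comparing looks like this:

```python
import unicodedata

composed = "\u00f6"      # single precomposed codepoint U+00F6
decomposed = "o\u0308"   # 'o' followed by combining diaeresis U+0308


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different codepoint sequences
print(same_text(composed, decomposed))   # True: identical once normalised
```

Whichever form you choose, the important part is applying the same one 
to every input before comparing, searching or storing.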
>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>>      
> This makes no sense at all to me.
>
> "ö" encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8.  Whether you 
> encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint 
> vs a character followed by a diacritic are still two distinct "character" 
> sequences.
>    
True. I think the point is that UTF-8 is a compact format (for mostly 
Latin text, at least) that can hold every codepoint without data loss, 
regardless of whether the codepoints are composite or not.
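
To put numbers on it, here is a small sketch (Python for illustration; 
encoded_sizes is a made-up helper) that measures both forms of the 
character in each encoding. The two forms stay distinct byte sequences 
in all three; only the sizes change:

```python
composed = "\u00f6"      # precomposed U+00F6
decomposed = "o\u0308"   # U+006F followed by combining U+0308

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")   # "-le" variants add no BOM


def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


print(encoded_sizes(composed))     # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(decomposed))   # {'utf-8': 3, 'utf-16-le': 4, 'utf-32-le': 8}
```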

Todd.

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe
