Hi John,

You can find out whether a Unicode string lies entirely inside the BMP by converting it to UTF-32 and checking that the new string is exactly twice the byte length of the original (UTF-16) string. A character outside the BMP takes 4 bytes in both encodings (a surrogate pair in UTF-16), so the ratio drops below two as soon as one appears.

> A user could specifically choose to enter that character in either form -
> this is unlikely, yes. Or, two users using the same codepage could choose to
> enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.
>
Agreed. Normalising will eliminate any mismatch between the precomposed and decomposed forms of the same character.

>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>>
> This makes no sense at all to me.
>
> "ö" encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8. Whether you
> encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint
> vs a character followed by a diacritic are still two distinct "character"
> sequences.
>
True. I think the point is that UTF-8 is usually the most compact lossless encoding, at least for mostly Latin text, regardless of whether the codepoints are composite or not.
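
For what it's worth, here is a rough sketch of both checks. I've written it in Python rather than Delphi only because its standard library has normalisation built in; the function names `is_bmp_only` and `same_after_normalising` are made up for illustration, and the choice of NFC as the default form is my assumption, not something anyone in this thread specified.

```python
import unicodedata


def is_bmp_only(text: str) -> bool:
    """Return True if every code point in text lies in the BMP (U+0000..U+FFFF)."""
    # UTF-16 uses 2 bytes per BMP code point and 4 bytes per surrogate pair.
    utf16 = text.encode("utf-16-le")
    # UTF-32 always uses 4 bytes per code point.
    utf32 = text.encode("utf-32-le")
    # The 2:1 byte ratio only holds when no surrogate pairs were needed.
    return len(utf32) == 2 * len(utf16)


def same_after_normalising(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same normalisation form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings of "ö" discussed in the quoted text.
decomposed = "\u006F\u0308"   # "o" followed by a combining diaeresis
precomposed = "\u00F6"        # the single precomposed code point
```

The two spellings compare unequal as raw strings and equal once both are normalised, whichever UTF encoding they are later stored in. They also differ in size: the decomposed form takes 3 bytes in UTF-8 and the precomposed form takes 2.
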
Todd.

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe