> You can find out whether a unicode string is inside the BMP 
> by converting it to UTF-32

No need to go to that trouble; just test for surrogates:

uses Character;

  for i := 1 to Length(s) do
    if IsSurrogate(s[i]) then
    begin
      // s contains non-BMP characters
      Break;
    end;



> I think the point is that UTF-8 is the most compact format without
> data loss, regardless of whether the codepoints are composite or not.

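The point *seemed* to be that UTF-8 somehow avoided problems with composite 
characters, which is simply not the case, and I wanted to clarify that point.
A base letter followed by a combining mark is still more than one code point 
in UTF-8, just as it is in UTF-16.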
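
To make that concrete, here is a rough sketch, assuming Delphi 2009 or later 
(where string is UnicodeString and TEncoding lives in SysUtils).  The program 
and variable names are made up for illustration.  It stores the same visible 
character once precomposed and once as base letter plus combining accent:

```pascal
program CompositeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$00E9;        // U+00E9, e with acute as a single code point
  Decomposed  := 'e' + #$0301;  // U+0065 followed by U+0301 combining acute

  // UTF-16 code units (Length) and UTF-8 bytes for each form
  Writeln(Length(Precomposed), ' ', TEncoding.UTF8.GetByteCount(Precomposed)); // 1 2
  Writeln(Length(Decomposed), ' ', TEncoding.UTF8.GetByteCount(Decomposed));   // 2 3

  // The two forms render the same but are different code point sequences
  Writeln(Precomposed = Decomposed);
end.
```

Either way the decomposed form is two code points, so anything that wants to 
treat it as one character has the same work to do in both encodings.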

As for being the most compact: if your data is primarily ASCII in nature then 
yes, UTF-8 is the most compact, but if it isn't then UTF-16 could easily be more 
compact (code points from U+0800 to U+FFFF take three bytes in UTF-8 but only 
two in UTF-16).  It all depends on the data.  There is no absolute rule in that 
regard.
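
If you want to check this against your own data, something along these lines 
should do, again assuming Delphi 2009 or later.  Report is a hypothetical 
helper and the sample strings are just my picks; TEncoding.Unicode is UTF-16 
little-endian, and GetByteCount does not include a BOM:

```pascal
program EncodingSize;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the encoded size of s in both encodings
procedure Report(const ALabel, s: string);
begin
  Writeln(ALabel, ': UTF-8 = ', TEncoding.UTF8.GetByteCount(s),
          ' bytes, UTF-16 = ', TEncoding.Unicode.GetByteCount(s), ' bytes');
end;

begin
  // Five ASCII letters: 5 bytes in UTF-8, 10 in UTF-16
  Report('ASCII', 'hello');
  // Five hiragana (U+3053 U+3093 U+306B U+3061 U+306F):
  // 15 bytes in UTF-8, 10 in UTF-16
  Report('Japanese', #$3053#$3093#$306B#$3061#$306F);
end.
```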

And of course, you pay for that compactness by incurring additional processing 
overhead when dealing with the strings as soon as you have any non-ASCII 
character involved (and *some* of that overhead is incurred just IN CASE you 
have such non-ASCII characters).
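
As a small illustration of that overhead, here is a sketch of counting code 
points in UTF-8 data.  Utf8CodePointCount is a name I made up, and it assumes 
well-formed input.  Note that it has to look at every byte, even when the 
string turns out to be pure ASCII:

```pascal
program Utf8Scan;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx).  Assumes the input is well-formed UTF-8.
function Utf8CodePointCount(const s: UTF8String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Sample: string;
  Utf8: UTF8String;
begin
  Sample := 'na' + #$00EF + 've';  // "naive" spelled with U+00EF
  Utf8 := UTF8String(Sample);      // explicit UTF-16 to UTF-8 conversion
  // U+00EF needs two bytes in UTF-8, so 6 bytes hold 5 code points
  Writeln(Length(Utf8), ' bytes, ', Utf8CodePointCount(Utf8), ' code points');
end.
```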


_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe
