Re: [DUG] Upgrading to XE - Unicode strings questions

Jolyon Smith Tue, 23 Nov 2010 12:43:22 -0800

John, the problem is that in Unicode "single character" is meaningless unless 
you have performed some pre-processing to GIVE that term some meaning.  There 
are some standard forms for such processing, called "Normalisations".


The problem is that a single "character" to your eyes, e.g. an accented "a", 
could be represented in a Unicode string in at least two ways:

  1.  A single codepoint representing that accented "a"

  2.  TWO codepoints - the first representing "a" and the
      second a diacritic codepoint for the accent


> Iterating over a string is for the purpose of doing something with each
> individual character

That's fine, but in Unicode what you have is a string not of characters but of 
codepoints.  The concept of a "character" is not synonymous with "codepoint" in 
Unicode in the same way that it is with ASCII or even ANSI.

So you have compounded complications:

a.  Depending on encoding, a single codepoint (a value from 0 
     to $10FFFF) may be encoded in 1, 2, or more bytes.  Each 
     byte may represent a whole codepoint or only part of a 
     codepoint encoding.

b.  Each codepoint may represent a whole character or only 
     PART of a character encoding.
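
As a rough illustration of complication 'a', the console sketch below reports 
how many storage units a few codepoints need.  It assumes a Unicode Delphi 
(2009 or later, where string is UTF-16) and TEncoding from SysUtils.  The 
helper names Utf8Bytes and Report are made up for the example.

```pascal
program EncodingSizes;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of bytes S would occupy if encoded as UTF-8.
function Utf8Bytes(const S: string): Integer;
begin
  Result := TEncoding.UTF8.GetByteCount(S);
end;

// Hypothetical helper: prints the sizes for one sample.
procedure Report(const Name, S: string);
begin
  // Length counts UTF-16 code units (2 bytes each), not codepoints.
  Writeln(Name, ': ', Length(S), ' UTF-16 code unit(s), ',
    Utf8Bytes(S), ' UTF-8 byte(s)');
end;

begin
  Report('U+0041  Latin capital A', #$0041);
  Report('U+00E9  e with acute', #$00E9);
  Report('U+20AC  euro sign', #$20AC);
  // Outside the BMP, so UTF-16 needs two code units (a surrogate pair).
  Report('U+1D11E musical G clef', #$D834#$DD1E);
end.
```

The four samples should come out as 1, 1, 1 and 2 UTF-16 code units, and 1, 2, 
3 and 4 UTF-8 bytes respectively.  In UTF-32 each of them would take exactly 
4 bytes.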


Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY 
codepoint.  That is hugely wasteful in terms of memory/storage for most 
applications.  UTF-16 - the encoding used by Delphi and indeed by Windows 
natively itself - is a compromise.  It is less efficient than ANSI for ASCII, 
but more efficient than UTF-32 for ANSI character sets represented in the BMP.

For applications working entirely in the BMP UTF-16 is also relatively easy to 
process - for NORMALISED strings, each codepoint IS a character (in the BMP).  
But for non-normalised data that is still not necessarily the case.
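
One way to normalise on Windows, sketched here under some assumptions: the 
NormalizeString API (Vista and later, exported by normaliz.dll) is imported by 
hand, and ToNFC is a made-up wrapper name, not an RTL routine.  Treat it as a 
starting point rather than a finished implementation.

```pascal
program NormaliseDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NormalizationC = 1;  // NFC, value taken from the Windows NORM_FORM enum

// Windows API (Vista and later), imported by hand so the sketch does not
// depend on which RTL unit, if any, declares it.
function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'normaliz.dll';

// Hypothetical helper: returns S in Normalisation Form C.
function ToNFC(const S: string): string;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // With no destination buffer the call returns an estimated size.
  Len := NormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Len <= 0 then
    raise Exception.Create('NormalizeString failed (size estimate)');
  SetLength(Result, Len);
  // The second call returns the actual length of the normalised text.
  Len := NormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Length(Result));
  if Len <= 0 then
    raise Exception.Create('NormalizeString failed (conversion)');
  SetLength(Result, Len);
end;

var
  Composed, Decomposed: string;
begin
  Composed := #$00E9;        // single codepoint: e with acute
  Decomposed := 'e'#$0301;   // 'e' followed by combining acute accent
  Writeln(Composed = Decomposed);          // compares code units as stored
  Writeln(ToNFC(Decomposed) = Composed);   // compares after composing
  Writeln(Length(ToNFC(Decomposed)));
end.
```

The raw comparison should report FALSE and the normalised one TRUE, with a 
length of 1 after composition.  Note that NFC can only compose where Unicode 
defines a precomposed codepoint; sequences without one stay as two or more 
codepoints even after normalising.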



> could I build a string like this?

> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e';            //I would want the full e acute here

Yes, you can.

But you might also *receive* from another source, a string that is apparently 
the same at the visual representation level, but different at the data level, 
where:

 string1[1] = 'f';
 string1[2] = 'i';
 string1[3] = 'a';
 string1[4] = 'n';
 string1[5] = 'c';
 string1[6] = 'e';
 string1[7] = 'e';            // Normal 'e' character, i.e. identical to string1[6]
 string1[8] = U+0301;         // Combining acute diacritic

When displayed on screen this string will appear identical to your string, but 
it is represented in the data in a different way.
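
To see the difference at the data level, a small sketch along these lines can 
help.  It builds both forms of the string above using #$ escapes (so the 
source file's own encoding does not matter) and dumps their elements.  
CodeUnits is a hypothetical helper written for this example, and a Unicode 
Delphi (2009 or later) is assumed.

```pascal
program TwoSpellings;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: lists the UTF-16 code units of S as hex.
function CodeUnits(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 4);
  end;
end;

var
  Precomposed, Decomposed: string;
begin
  Precomposed := 'fiance' + #$00E9;        // last element is e with acute
  Decomposed  := 'fiance' + 'e' + #$0301;  // plain 'e' + combining acute

  Writeln(Length(Precomposed));            // 7
  Writeln(Length(Decomposed));             // 8
  Writeln(Precomposed = Decomposed);       // FALSE: the stored data differs
  Writeln(CodeUnits(Precomposed));
  Writeln(CodeUnits(Decomposed));
end.
```

The two dumps agree for the first six elements and then diverge: 00E9 in one, 
0065 0301 in the other.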


> hence I want to be able to go

>    for i :=1 to length(string1) do
>    begin
> ..
>    end

> Now everything Jolyon  are saying and Cary also implies that this is
> not going to work.   This looks to be a real nuisance!

I don't know what gave you that impression from what I said.

Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more 
work than people think - but what you want to do here can be done.
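
For instance, the plain for loop still visits every element of the string; 
what changes is that an element is a 16-bit code unit, so the loop body has to 
decide what each one is.  The sketch below only classifies elements.  Describe 
is a made-up helper, and the combining range it checks ($0300..$036F) is just 
the Combining Diacritical Marks block, not every combining codepoint in 
Unicode.

```pascal
program WalkString;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: says what kind of UTF-16 element C is.
function Describe(C: Char): string;
begin
  case Ord(C) of
    $0300..$036F: Result := 'combining diacritical mark';
    $D800..$DBFF: Result := 'high surrogate (first half of a pair)';
    $DC00..$DFFF: Result := 'low surrogate (second half of a pair)';
  else
    Result := 'ordinary BMP codepoint';
  end;
end;

var
  S: string;
  I: Integer;
begin
  // 'e' + combining acute, followed by U+1D11E stored as a surrogate pair.
  S := 'e'#$0301#$D834#$DD1E;
  for I := 1 to Length(S) do
    Writeln(I, ': $', IntToHex(Ord(S[I]), 4), ' ', Describe(S[I]));
end.
```

Whether the loop should then skip, merge or reject such elements depends on 
what "doing something with each character" means for the application.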



> Now I think the e acute could be one unicode character (as there is likely 
> to be a representation using one character, one code point and one code 
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair - 
> where eg one supplies the e and one the acute.   

NO!!!  This is NOT what a surrogate pair is.

A surrogate pair is encountered ONLY in UTF-16, and is found when you have a 
codepoint that is not in the BMP, i.e. a value > 65535 that cannot be encoded 
in a single 16-bit value.  These include the rarer CJKV ideographs 
(Chinese/Japanese/Korean/Vietnamese, sometimes called Han or Kanji 
characters), though the Han characters in everyday use are in the BMP.

The first 16-bit value indicates a "page" in the non-BMP.  The following 16-bit 
value then identifies an entry in that "page".  To obtain the codepoint that 
the PAIR of VALUES represents, you have to apply a transform, combining the 
page selector with the page entry.  But what you get is a single codepoint.  
(you don't have to do this - there are routines to do it for you, but you have 
to invoke them as appropriate).
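
Purely to show what the transform looks like, here is the arithmetic written 
out by hand.  DecodeSurrogatePair is a hypothetical name for this sketch; in 
real code the library routines mentioned above are the better choice.

```pascal
program SurrogateDecode;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: combines a high and a low surrogate into the single
// codepoint they encode.
function DecodeSurrogatePair(Hi, Lo: Char): Integer;
begin
  if (Ord(Hi) < $D800) or (Ord(Hi) > $DBFF) then
    raise Exception.Create('First value is not a high surrogate');
  if (Ord(Lo) < $DC00) or (Ord(Lo) > $DFFF) then
    raise Exception.Create('Second value is not a low surrogate');
  // The high surrogate carries the upper 10 bits (the "page"), the low
  // surrogate the lower 10 bits (the entry); $10000 is then added because
  // the pair only ever encodes codepoints beyond the BMP.
  Result := ((Ord(Hi) - $D800) shl 10) + (Ord(Lo) - $DC00) + $10000;
end;

var
  S: string;
begin
  S := #$D834#$DD1E;  // U+1D11E (musical G clef) as stored in UTF-16
  Writeln(Length(S));                                            // 2
  Writeln('U+', IntToHex(DecodeSurrogatePair(S[1], S[2]), 5));   // U+1D11E
end.
```

Two elements go in and one codepoint comes out, which is the whole point: the 
pair is an encoding detail of UTF-16, not two separate codepoints.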

A Surrogate Pair is a representation of a single codepoint, NOT a relationship 
between TWO codepoints.



When you have a visual character encoded as a codepoint + a following, 
combining codepoint, that is simply TWO Unicode codepoints that are combined to 
form one VISUAL "character".  That is NOT a surrogate pair however.  It is 
merely two codepoints that have to be combined.


_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: [email protected]
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to [email protected] with Subject: 
unsubscribe
