John, the problem is that in Unicode "single character" is meaningless unless
you have performed some pre-processing to GIVE that term some meaning. There
are standard forms for such processing, called "normalisation forms" (NFC, NFD,
NFKC and NFKD).
The problem is that a single "character" to your eyes, e.g. an accented "a",
could be represented in a Unicode string in at least two ways:
1. A single codepoint representing that accented "a"
2. TWO codepoints - the first representing "a" and the
second a diacritic codepoint for the accent
> Iterating over a string is for the purpose of doing something with each
> individual character
That's fine, but in Unicode what you have is a string not of characters but of
codepoints. The concept of a "character" is not synonymous with "codepoint" in
Unicode in the same way that it is with ASCII or even ANSI.
So you have compounded complications:
a. Depending on encoding, a single codepoint (a value in the
range U+0000 to U+10FFFF) may be encoded in 1, 2, or more
bytes. Each byte may represent a whole codepoint or only
part of a codepoint encoding.
b. Each codepoint may represent a whole character or only
PART of a character.
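To put some numbers on complication 'a', here is a rough sketch of how many bytes (UTF-8) or 16-bit code units (UTF-16) one codepoint needs. The function names are my own, not RTL routines, and I am assuming the input is a valid codepoint (no range checking).

```delphi
program EncodedSizes;
{$APPTYPE CONSOLE}

// Hypothetical helper: bytes needed to encode one codepoint in UTF-8.
// Assumes CodePoint is valid (<= $10FFFF and not a surrogate value).
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $80 then
    Result := 1
  else if CodePoint < $800 then
    Result := 2
  else if CodePoint < $10000 then
    Result := 3
  else
    Result := 4;
end;

// Hypothetical helper: 16-bit code units needed in UTF-16.
function Utf16UnitCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $10000 then
    Result := 1   // inside the BMP: one 16-bit value
  else
    Result := 2;  // outside the BMP: two 16-bit values
end;

begin
  Writeln(Utf8ByteCount($0061), ' ', Utf16UnitCount($0061));   // 'a'        -> 1 1
  Writeln(Utf8ByteCount($00E9), ' ', Utf16UnitCount($00E9));   // e acute    -> 2 1
  Writeln(Utf8ByteCount($20AC), ' ', Utf16UnitCount($20AC));   // euro sign  -> 3 1
  Writeln(Utf8ByteCount($1D11E), ' ', Utf16UnitCount($1D11E)); // U+1D11E    -> 4 2
end.
```

In UTF-32 the answer would be 4 bytes every time, which is exactly the simplicity (and the waste) described next.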
Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY
codepoint. That is hugely wasteful in terms of memory/storage for most
applications. UTF-16 - the encoding used natively by Delphi and by Windows
itself - is a compromise. It is less efficient than ANSI for ASCII, but more
efficient than UTF-32 for the ANSI character sets represented in the BMP.
For applications working entirely in the BMP UTF-16 is also relatively easy to
process - for strings NORMALISED to a composed form (NFC), each codepoint is
usually a whole character (in the BMP). Not every base + diacritic combination
has a precomposed codepoint, though, and for non-normalised data you cannot
rely on it at all.
> could I build a string like this?
> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e'; //I would want the full e acute here
Yes, you can.
But you might also *receive* from another source, a string that is apparently
the same at the visual representation level, but different at the data level,
where:
string1[1] = 'f';
string1[2] = 'i';
string1[3] = 'a';
string1[4] = 'n';
string1[5] = 'c';
string1[6] = 'e';
string1[7] = 'e'; // Normal 'e' character, i.e. identical to string1[6]
string1[8] = U+0301; // Combining acute diacritic
When displayed on screen this string will appear identical to your string, but
it is represented in the data in a different way.
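You can see the difference for yourself with something like the following. This is only a sketch, and it assumes a Unicode Delphi (2009 or later) where string is a UTF-16 UnicodeString; the variable names are mine.

```delphi
program TwoFiancees;
{$APPTYPE CONSOLE}

var
  Composed, Decomposed: string;
begin
  // Your version: the last element is U+00E9, the precomposed e acute.
  Composed := 'fiance' + #$00E9;
  // The "received" version: a plain 'e' followed by U+0301,
  // the combining acute accent.
  Decomposed := 'fiance' + 'e' + #$0301;

  Writeln(Length(Composed));       // 7
  Writeln(Length(Decomposed));     // 8
  Writeln(Composed = Decomposed);  // FALSE
end.
```

The = operator compares the 16-bit values one by one, so the two strings are different as far as your code is concerned, even though they render the same.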
> hence I want to be able to go
> for i :=1 to length(string1) do
> begin
> ..
> end
> Now everything Jolyon are saying and Cary also implies that this is
> not going to work. This looks to be a real nuisance!
I don't know what gave you that impression from what I said.
Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more
work than people think - but what you want to do here can be done.
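As a rough illustration of how such a loop might look, here is a sketch that steps through a string one VISUAL character at a time. Big assumptions: it only treats U+0300..U+036F (the Combining Diacritical Marks block) as combining, it assumes well-formed UTF-16 and 1-based string indexing, and all the function names are made up by me. Real-world text needs the full Unicode character data, not a single range check.

```delphi
program VisualChars;
{$APPTYPE CONSOLE}

// Second half of a surrogate pair?
function IsLowSurrogateUnit(C: Char): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Simplification: only U+0300..U+036F is treated as a combining mark.
function IsCombiningMark(C: Char): Boolean;
begin
  Result := (Ord(C) >= $0300) and (Ord(C) <= $036F);
end;

// Number of 16-bit values making up the visual character at S[Index].
function VisualCharLength(const S: string; Index: Integer): Integer;
var
  I: Integer;
begin
  I := Index + 1;
  while (I <= Length(S)) and
        (IsLowSurrogateUnit(S[I]) or IsCombiningMark(S[I])) do
    Inc(I);
  Result := I - Index;
end;

function CountVisualChars(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // Copy(S, I, VisualCharLength(S, I)) is the text of this character;
    // do your per-character work with it here.
    Inc(I, VisualCharLength(S, I));
    Inc(Result);
  end;
end;

begin
  Writeln(CountVisualChars('fiance' + #$00E9));        // 7
  Writeln(CountVisualChars('fiance' + 'e' + #$0301));  // 7
end.
```

The point is that the loop variable advances by a variable amount, so it is a while loop rather than a plain for loop, but the "do something with each character" idea survives intact.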
> Now I think the e acute could be one unicode character (as there is likely
> to be a representation using one character, one code point and one code
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair -
> where eg one supplies the e and one the acute.
NO!!! This is NOT what a surrogate pair is.
A surrogate pair is encountered ONLY in UTF-16, and is found when you have a
codepoint that is not in the BMP, i.e. a value > 65535 (U+FFFF) that cannot be
encoded in a single 16-bit value. These are typically the rarer CJKV ideographs
(Chinese/Japanese/Korean/Vietnamese, sometimes called Han or Kanji characters;
the common ones are in the BMP), historic scripts, and symbols such as emoji.
The first 16-bit value (the high surrogate, in the range $D800..$DBFF) indicates
a "page" outside the BMP. The following 16-bit value (the low surrogate, in the
range $DC00..$DFFF) then identifies an entry in that "page". To obtain the
codepoint that the PAIR of VALUES represents, you have to apply a transform,
combining the page selector with the page entry. But what you get is a single
codepoint. (You don't have to do this yourself - there are routines to do it
for you, but you have to invoke them as appropriate.)
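For the curious, the transform itself is just a bit of arithmetic. A minimal sketch follows; DecodeSurrogatePair is my own name for illustration, not an RTL routine, and it assumes it is handed a valid high/low pair.

```delphi
program SurrogateDecode;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: combine a high and a low surrogate into the single
// codepoint they represent. Assumes HiUnit is in $D800..$DBFF and
// LoUnit is in $DC00..$DFFF.
function DecodeSurrogatePair(HiUnit, LoUnit: Word): Integer;
begin
  // The high surrogate supplies the top 10 bits (the "page"), the low
  // surrogate the bottom 10 bits (the entry in that page). Adding
  // $10000 moves the result past the end of the BMP.
  Result := $10000 + ((HiUnit - $D800) shl 10) + (LoUnit - $DC00);
end;

begin
  // $D834 $DD1E is the UTF-16 encoding of U+1D11E (musical symbol G clef).
  Writeln(IntToHex(DecodeSurrogatePair($D834, $DD1E), 5));  // 1D11E
end.
```

Two 16-bit values go in, ONE codepoint comes out.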
A Surrogate Pair is a representation of a single codepoint, NOT a relationship
between TWO codepoints.
When you have a visual character encoded as a codepoint + a following,
combining codepoint, that is simply TWO Unicode codepoints that are combined to
form one VISUAL "character". That is NOT a surrogate pair however. It is
merely two codepoints that have to be combined.
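To make the distinction concrete, here is a small sketch that counts codepoints in a UTF-16 string. CountCodepoints is a name I have invented, and it assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```delphi
program PairVersusCombining;
{$APPTYPE CONSOLE}

// Hypothetical helper: number of codepoints in a well-formed UTF-16 string.
function CountCodepoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    // A low surrogate only completes the codepoint begun by the high
    // surrogate before it, so it does not start a new codepoint.
    if not ((Ord(S[I]) >= $DC00) and (Ord(S[I]) <= $DFFF)) then
      Inc(Result);
end;

var
  Clef, EAcute: string;
begin
  Clef   := #$D834 + #$DD1E;  // surrogate pair for U+1D11E
  EAcute := 'e' + #$0301;     // 'e' + combining acute accent

  Writeln(Length(Clef), ' ', CountCodepoints(Clef));      // 2 1
  Writeln(Length(EAcute), ' ', CountCodepoints(EAcute));  // 2 2
end.
```

Both strings have a Length of 2 and both display as one visual character, but the first is ONE codepoint and the second is TWO.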
_______________________________________________
NZ Borland Developers Group - Delphi mailing list