On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal
> <fpc-pascal@lists.freepascal.org> wrote:
>
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> var
>>   CodePointLen: longint;
>>   CodePoint: longword;
>> begin
>>   Result:=0;
>>   while (ByteCount>0) do begin
>>     inc(Result);
>>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>>     // ...do something with the CodePoint...
>>     inc(p,CodePointLen);
>>     dec(ByteCount,CodePointLen);
>>   end;
>> end;
>
> Thanks, this looks right. I guess this is how we need to iterate over Unicode
> now.
>
> Btw, why isn't there a for-loop we can use over Unicode strings? It seems like
> that should be supported out of the box. I had this same problem in Swift,
> where it's extremely confusing to merely iterate over a string and look at each
> character. Replacing characters will be tricky too, so we need some good
> library functions.

You're still confusing the Unicode terms. The above code iterates over
the Unicode code points of a UTF-8 encoded string, not over
"characters". A Unicode code point is not a "character":

https://unicode.org/glossary/#character
https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme
clusters - these terms can also be perceived as "characters".

https://unicode.org/glossary/#grapheme
https://unicode.org/glossary/#grapheme_cluster
https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example,
there's an iterator (written by me) in the unit graphemebreakproperty.pp
in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded
string, you're iterating over Unicode code units (units! not code
points! https://unicode.org/glossary/#code_unit).
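
To illustrate the code-unit view, here is a minimal sketch. The program name and the sample text are made up for the example: it builds the UTF-8 encoding of U+00E9 (é) byte by byte and walks it with 'char'.

```pascal
program Utf8CodeUnits;
{$mode objfpc}{$H+}
var
  S: UTF8String;
  I: SizeInt;
begin
  // U+00E9 is one code point, but UTF-8 encodes it as two code units.
  // The bytes are assigned by hand so no source-codepage conversion is involved.
  SetLength(S, 2);
  S[1] := Chr($C3);
  S[2] := Chr($A9);

  // Length counts code units (bytes), so this reports 2, not 1.
  WriteLn('code units: ', Length(S));

  // Indexing with 'char' visits each byte, not each code point.
  for I := 1 to Length(S) do
    WriteLn('unit ', I, ' = $', HexStr(Ord(S[I]), 2));
end.
```

The loop sees $C3 and $A9 separately; neither byte means anything on its own.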

If you use the 'widechar' type in Pascal to iterate over a UnicodeString
(which is a UTF-16 encoded string), you're also iterating over Unicode
code units, this time in the UTF-16 encoding.
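
The same point for UTF-16, again only as a sketch with an illustrative program name and sample value: U+1F600 lies outside the Basic Multilingual Plane, so a UnicodeString stores it as a surrogate pair of two 'widechar' code units.

```pascal
program Utf16CodeUnits;
{$mode objfpc}{$H+}
var
  S: UnicodeString;
  I: SizeInt;
begin
  // U+1F600 is one code point, stored in UTF-16 as the surrogate
  // pair $D83D $DE00, i.e. two code units.
  SetLength(S, 2);
  S[1] := WideChar($D83D);
  S[2] := WideChar($DE00);

  // Length counts UTF-16 code units, so this reports 2.
  WriteLn('code units: ', Length(S));

  // Indexing with 'widechar' visits each half of the pair separately.
  for I := 1 to Length(S) do
    WriteLn('unit ', I, ' = $', HexStr(Ord(S[I]), 4));
end.
```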

If you want to iterate over Unicode code points (not units! not
characters! not graphemes!) in a UTF-8 string, you need something like
the ReadUTF8 function above. If you want to iterate over Unicode code
points in a UTF-16 string, you need different code, because UTF-16
encodes code points above U+FFFF as surrogate pairs.
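
For the UTF-16 case, one possible shape of that "different code" is sketched below. NextCodePoint is a hypothetical helper written for this example, not an RTL function; it combines a valid surrogate pair into one code point and returns an unpaired surrogate unchanged, which real code may prefer to treat as an error.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): decodes the code point that
// starts at S[Index] and reports how many code units it occupied.
function NextCodePoint(const S: UnicodeString; Index: SizeInt;
  out UnitLen: SizeInt): LongWord;
var
  HighUnit, LowUnit: LongWord;
begin
  HighUnit := Ord(S[Index]);
  Result := HighUnit;
  UnitLen := 1;
  // A high surrogate followed by a low surrogate forms one code point.
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (Index < Length(S)) then
  begin
    LowUnit := Ord(S[Index + 1]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      UnitLen := 2;
    end;
  end;
end;

var
  S: UnicodeString;
  I, UnitLen: SizeInt;
  CodePoint: LongWord;
begin
  // 'A' (U+0041) followed by U+1F600: three code units, two code points.
  SetLength(S, 3);
  S[1] := WideChar($0041);
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);

  I := 1;
  while I <= Length(S) do
  begin
    CodePoint := NextCodePoint(S, I, UnitLen);
    WriteLn('code point: $', HexStr(CodePoint, 6));
    Inc(I, UnitLen);
  end;
end.
```

The loop has the same structure as ReadUTF8: decode one code point, then advance by the number of code units it consumed.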

You need to understand all these terms and know exactly what you need to
do. For example: are you dealing with keyboard input, are you dealing
with the low level parts of text display, are you searching for
something in the text, or are you just passing strings around and
letting the GUI deal with it? These are all different use cases, and
each requires a careful understanding of which Unicode unit you need to
iterate over.

Nikolay