On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

Thanks, this looks right. I guess this is how we need to iterate over Unicode 
now.

Btw, why isn't there a for-loop we can use over Unicode strings? It seems like 
that should be supported out of the box. I had the same problem in Swift, 
where it's extremely confusing merely to iterate over a string and look at each 
character. Replacing characters will also be tricky, so we need some good 
library functions.

You're still confusing the Unicode terms. The above code iterates over Unicode Code Points, not "characters" in a UTF-8 encoded string. A Unicode Code Point is not a "character":

https://unicode.org/glossary/#character

https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme clusters - these terms can also be perceived as "characters".

https://unicode.org/glossary/#grapheme

https://unicode.org/glossary/#grapheme_cluster

https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, there's an iterator (written by me) in the unit graphemebreakproperty.pp in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded string, you're iterating over Unicode code units (units! not code points! https://unicode.org/glossary/#code_unit).
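
As a rough illustration (a sketch, not taken from the RTL): the program below stores two code points, U+00E9 and U+20AC, as raw UTF-8 bytes and walks them with 'char'. The program name and the sample text are assumptions made for this example; the bytes are spelled out so the result does not depend on the source file encoding.

```pascal
program Utf8CodeUnits;
{$mode objfpc}{$H+}

var
  s: RawByteString;
  c: Char;
  i: Integer;
begin
  // U+00E9 is C3 A9 in UTF-8, U+20AC is E2 82 AC: two code points, five bytes.
  s := #$C3#$A9#$E2#$82#$AC;

  // Length counts 8-bit code units, not code points.
  WriteLn('Length(s) = ', Length(s), ' code units');

  // Each step of this loop sees one byte of the encoding.
  for i := 1 to Length(s) do
  begin
    c := s[i];
    WriteLn('s[', i, '] = $', HexStr(Ord(c), 2));
  end;
end.
```

The loop runs five times for two code points, which is why indexing with 'char' cannot be used to pick out code points directly.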

If you use the 'widechar' type in Pascal to iterate over a UnicodeString (which is a UTF-16 encoded string), you're also iterating over Unicode code units, however this time in UTF-16 encoding.
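
The same effect can be sketched for UTF-16. The example below is an illustration under assumed sample data: it fills a UnicodeString with U+0041 followed by U+1F600, which UTF-16 encodes as the surrogate pair D83D DE00, and walks it with 'widechar'.

```pascal
program Utf16CodeUnits;
{$mode objfpc}{$H+}

var
  u: UnicodeString;
  w: WideChar;
  i: Integer;
begin
  // Two code points, three 16-bit code units.
  SetLength(u, 3);
  u[1] := WideChar($0041);  // U+0041
  u[2] := WideChar($D83D);  // high surrogate of U+1F600
  u[3] := WideChar($DE00);  // low surrogate of U+1F600

  // Length counts 16-bit code units, not code points.
  WriteLn('Length(u) = ', Length(u), ' code units');

  // The second and third iterations each see half of one code point.
  for i := 1 to Length(u) do
  begin
    w := u[i];
    WriteLn('u[', i, '] = $', HexStr(Ord(w), 4));
  end;
end.
```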

If you want to iterate over Unicode code points (not units! not characters! not graphemes!) in a UTF-8 string, you need something like the ReadUTF8 function above. If you want to iterate over Unicode code points in a UTF-16 string, you need different code.
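
For the UTF-16 case, a minimal sketch of such code could look like the following. NextUTF16CodePoint is a hypothetical helper written for this example, not an RTL function; it combines a valid surrogate pair into one code point and, as a simplifying assumption, passes an unpaired surrogate through unchanged instead of reporting an error.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: decodes the code point starting at s[i] and
// advances i past the one or two code units that were consumed.
function NextUTF16CodePoint(const s: UnicodeString; var i: SizeInt): LongWord;
var
  HighUnit, LowUnit: Word;
begin
  HighUnit := Ord(s[i]);
  Inc(i);
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (i <= Length(s)) then
  begin
    LowUnit := Ord(s[i]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      Inc(i);
      Exit($10000 + (LongWord(HighUnit - $D800) shl 10)
                  + LongWord(LowUnit - $DC00));
    end;
  end;
  // BMP code point, or an unpaired surrogate passed through as-is.
  Result := HighUnit;
end;

var
  u: UnicodeString;
  i: SizeInt;
  Count: Integer;
  cp: LongWord;
begin
  SetLength(u, 3);
  u[1] := WideChar($0041);  // U+0041
  u[2] := WideChar($D83D);  // high surrogate of U+1F600
  u[3] := WideChar($DE00);  // low surrogate of U+1F600

  i := 1;
  Count := 0;
  while i <= Length(u) do
  begin
    cp := NextUTF16CodePoint(u, i);
    Inc(Count);
    WriteLn('U+', HexStr(LongInt(cp), 6));
  end;
  WriteLn(Count, ' code points in ', Length(u), ' code units');
end.
```

Note that this still yields code points only; grouping them into grapheme clusters is a separate step.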

You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, or are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require a careful understanding of which Unicode element you need to iterate over.

Nikolay
