Re: [fpc-pascal] Parse unicode scalar

Nikolay Nikolov via fpc-pascal Mon, 03 Jul 2023 19:59:07 -0700


On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

Thanks, this looks right. I guess this is how we need to iterate over unicode 
now.

Btw, why isn't there a for-loop we can use over unicode strings? seems like 
that should be supported out of the box. I had this same problem in Swift also 
where it's extremely confusing to merely iterate over a string and look at each 
character. Replacing characters will be tricky also so we need some good 
library functions.

You're still confusing the Unicode terms. The above code iterates over Unicode Code Points, not "characters" in a UTF-8 encoded string. A Unicode Code Point is not a "character":


https://unicode.org/glossary/#character

https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme clusters - these terms can also be perceived as "characters".


https://unicode.org/glossary/#grapheme

https://unicode.org/glossary/#grapheme_cluster

https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, there's an iterator (written by me) in the unit graphemebreakproperty.pp in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded string, you're iterating over Unicode code units (units! not code points! https://unicode.org/glossary/#code_unit).
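
To make the code unit point concrete, here is a minimal sketch (the program name and the sample string are only illustrative). It stores the letter 'a' followed by U+00E9 as explicit UTF-8 bytes, so the result does not depend on the source file encoding, and then walks the string one 'char' at a time:

```pascal
program Utf8CodeUnitsDemo;
{$mode objfpc}{$H+}
var
  s: RawByteString;
  i: Integer;
begin
  // 'a' followed by U+00E9, written as its two UTF-8 bytes $C3 $A9.
  // RawByteString is used so that no code page conversion is applied.
  s := 'a'#$C3#$A9;
  // Each s[i] is one 8-bit code unit, not one code point.
  for i := 1 to Length(s) do
    WriteLn(i, ': $', HexStr(Ord(s[i]), 2));
  WriteLn('code units: ', Length(s));
end.
```

The loop visits three code units ($61, $C3, $A9) although the string holds only two code points.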

If you use the 'widechar' type in Pascal to iterate over a UnicodeString (which is a UTF-16 encoded string), you're also iterating over Unicode code units, however this time in UTF-16 encoding.
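
The same thing in UTF-16, again as a small illustrative sketch (the sample string is my own choice). U+1F600 lies outside the Basic Multilingual Plane, so it is built here from its explicit surrogate pair $D83D $DE00:

```pascal
program Utf16CodeUnitsDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
  i: Integer;
begin
  // 'a' followed by U+1F600, appended as an explicit surrogate pair.
  s := 'a';
  s := s + WideChar($D83D) + WideChar($DE00);
  // Each s[i] is one 16-bit code unit; the surrogate halves show up separately.
  for i := 1 to Length(s) do
    WriteLn(i, ': $', HexStr(Ord(s[i]), 4));
  WriteLn('code units: ', Length(s));
end.
```

Length(s) is 3 here, while the string contains two code points.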

If you want to iterate over Unicode code points (not units! not characters! not graphemes!) in a UTF-8 string, you need something like the ReadUTF8 function above. If you want to iterate over Unicode code points in a UTF-16 string, you need different code.
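
For the UTF-16 case, one possible shape of that different code is sketched below. NextCodePoint is a hypothetical helper written for this example, not an RTL function, and returning an unpaired surrogate unchanged is just one assumed policy; real code may prefer to substitute U+FFFD or report an error.

```pascal
program Utf16CodePointsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: decodes the code point that starts at s[i]
// and advances i past the code units it consumed (1 or 2).
function NextCodePoint(const s: UnicodeString; var i: SizeInt): LongWord;
var
  w1, w2: LongWord;
begin
  w1 := Ord(s[i]);
  Inc(i);
  // A high surrogate followed by a low surrogate forms one code point.
  if (w1 >= $D800) and (w1 <= $DBFF) and (i <= Length(s)) then
  begin
    w2 := Ord(s[i]);
    if (w2 >= $DC00) and (w2 <= $DFFF) then
    begin
      Inc(i);
      Exit($10000 + ((w1 - $D800) shl 10) + (w2 - $DC00));
    end;
  end;
  // BMP code unit, or an unpaired surrogate returned as-is (assumed policy).
  Result := w1;
end;

var
  s: UnicodeString;
  i: SizeInt;
  cp: LongWord;
  count: Integer;
begin
  // 'a' followed by U+1F600 as an explicit surrogate pair.
  s := 'a';
  s := s + WideChar($D83D) + WideChar($DE00);
  i := 1;
  count := 0;
  while i <= Length(s) do
  begin
    cp := NextCodePoint(s, i);
    Inc(count);
    WriteLn('U+', HexStr(Int64(cp), 6));
  end;
  WriteLn(count, ' code points in ', Length(s), ' code units');
end.
```

The index is advanced by the decoder rather than by the loop, which is the same pattern ReadUTF8 uses with CodePointLen.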

You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require a careful understanding of which Unicode thing you need to iterate over.


Nikolay

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
