On 7/4/23 08:08, Nikolay Nikolov wrote:
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote:
You know, you're right: with properly enclosed patterns you can
capture everything inside and it works. You won't know whether the
string contained Unicode, but whether that matters depends on what's
being parsed (I'm writing a TOML parser).
Sorry I'm still curious even though it's not my current problem :)
How can I make this program output the expected results:
var
  w: widechar;
  a: array of widechar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(length(a));
end;
The user doesn't know about Unicode; they just want to get an array of
characters and not worry about all these little details. What can FPC
do to solve this problem?
Depends on what you need, but I suppose in this case you want to count
the number of extended grapheme clusters (a.k.a. "user perceived
characters" - how many character-like things are displayed on the
screen). You might be tempted to count the number of Unicode code
points, but that's not the same, due to the existence of combining
characters:
https://en.wikipedia.org/wiki/Combining_character
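To make the difference concrete, here is a minimal sketch that counts UTF-16 code units and Unicode code points for two strings. CodePointCount is a hypothetical helper written for this example, not an RTL function, and the characters are built from numeric codes so that the result does not depend on the source file encoding.

```pascal
program CountCodePoints;
{$mode objfpc}{$H+}

{ Hypothetical helper (not part of the RTL): counts Unicode code points
  in a UTF-16 string by skipping the low half of each surrogate pair. }
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  Bear, EAcute: UnicodeString;
begin
  { 'abc' followed by U+1F43B (bear), i.e. the surrogate pair D83D DC3B }
  Bear := 'abc';
  Bear := Bear + WideChar($D83D) + WideChar($DC3B);

  { 'e' followed by U+0301 COMBINING ACUTE ACCENT: shown as one character }
  EAcute := 'e';
  EAcute := EAcute + WideChar($0301);

  WriteLn(Length(Bear), ' code units, ', CodePointCount(Bear), ' code points');
  WriteLn(Length(EAcute), ' code units, ', CodePointCount(EAcute), ' code points');
end.
```

The first line reports 5 code units and 4 code points, which is the 4 the original poster expected. The second reports 2 and 2, even though a reader sees a single accented letter, so code point counting alone does not give the user-perceived character count.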
For extended grapheme clusters, there's an iterator in the
graphemebreakproperty unit. I implemented this for the Unicode KVM and
FreeVision. There it's needed for figuring out how many character
blocks in the console will be needed to display a certain string. For
the console or other GUIs that use fixed width fonts, there's also the
East Asian Width property - some characters (East Asian -
Chinese, Japanese, Korean) take double the space. So, to figure out
where to move the cursor, you need to take East Asian Width into
account as well.
For console apps that use the Unicode KVM video unit, I've introduced
two functions for determining the display width of a Unicode string in
the video unit:
function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): Integer;
  { Returns the number of display columns needed for the given extended
    grapheme cluster }
function StringDisplayWidth(const S: UnicodeString): Integer;
  { Returns the number of display columns needed for the given string }
Remember, the display width is different than the number of graphemes,
due to East Asian double width characters.
And these work with UnicodeString, which is UTF-16, not UTF-8. But Free
Pascal can convert between the two.
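As a rough sketch of that conversion, the program below uses UTF8Decode and UTF8Encode from the System unit. The UTF-8 input is spelled out byte by byte (an assumption made so the example does not depend on the source file encoding or on a codepage directive), and it only prints lengths rather than calling the video unit.

```pascal
program Utf8ToUtf16;
{$mode objfpc}{$H+}

var
  U8: RawByteString;
  U16: UnicodeString;
begin
  { 'abc' plus U+1F43B (bear) written as its four UTF-8 bytes }
  U8 := 'abc'#$F0#$9F#$90#$BB;

  { UTF-8 -> UTF-16; U16 is the form the UnicodeString functions expect }
  U16 := UTF8Decode(U8);

  WriteLn(Length(U8));                { UTF-8 bytes }
  WriteLn(Length(U16));               { UTF-16 code units }
  WriteLn(Length(UTF8Encode(U16)));   { back to UTF-8 }
end.
```

This prints 7, 5 and 7. The 7 matches what the quoted program reported, since it was counting UTF-8 bytes, and the 5 reflects the bear being stored as a surrogate pair in UTF-16.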
Nikolay