Re: [fpc-pascal] Parse unicode scalar




> On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal 
>  wrote:
> 
> For console apps that use the Unicode KVM video unit, I've introduced two 
> functions for determining the display width of a Unicode string in the video 
> unit:
> 
> function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): 
> Integer;
> { Returns the number of display columns needed for the given extended 
> grapheme cluster }
> 
> function StringDisplayWidth(const S: UnicodeString): Integer;
> { Returns the number of display columns needed for the given string }
> 
> Remember, the display width is different than the number of graphemes, due to 
> East Asian double width characters.
> 
> And these work with UnicodeString, which is UTF-16, not UTF-8. But Free 
> Pascal can convert between the two.

Is there an example snippet of how all this works? It's too low level for 
newbies to understand. :)

Regards,
Ryan Joseph

___
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: [fpc-pascal] Parse unicode scalar



On 7/4/23 08:08, Nikolay Nikolov wrote:


On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:



On Jul 4, 2023, at 11:50 AM, Hairy Pixels  wrote:

You know you're right, with properly enclosed patterns you can 
capture everything inside and it works. You won't know if you had 
unicode in your string or not though but that depends on what's 
being parsed and if you care or not (I'm doing a TOML parser).

Sorry I'm still curious even though it's not my current problem :)

How can I make this program output the expected results:

var
  w: widechar;
  a: array of widechar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(length(a));
end;

The user doesn't know about unicode they just want to get an array of 
characters and not worry about all these little details. What can FPC 
do to solve this problem?


Depends on what you need, but I suppose in this case you want to count 
the number of extended grapheme clusters (a.k.a. "user perceived 
characters" - how many character-like things are displayed on the 
screen). You might be tempted to count the number of Unicode code 
points, but that's not the same, due to the existence of combining 
characters:


https://en.wikipedia.org/wiki/Combining_character

For extended grapheme clusters, there's an iterator in the 
graphemebreakproperty unit. I implemented this for the Unicode KVM and 
FreeVision. There it's needed for figuring out how many character 
blocks in the console will be needed to display a certain string. For 
the console or other GUIs that use fixed width fonts, there's also the 
East Asian Width property as well - some characters (East Asian - 
Chinese, Japanese, Korean) take double the space. So, to figure out 
where to move the cursor, you need to take East Asian Width as well.


For console apps that use the Unicode KVM video unit, I've introduced 
two functions for determining the display width of a Unicode string in 
the video unit:


function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): 
Integer;
{ Returns the number of display columns needed for the given extended 
grapheme cluster }


function StringDisplayWidth(const S: UnicodeString): Integer;
{ Returns the number of display columns needed for the given string }

Remember, the display width is different than the number of graphemes, 
due to East Asian double width characters.


And these work with UnicodeString, which is UTF-16, not UTF-8. But Free 
Pascal can convert between the two.
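
A minimal sketch of that conversion, assuming only the System unit's UTF8Decode and UTF8Encode routines. The bear is written out as its four UTF-8 bytes so the result does not depend on the source file's encoding. U+1F43B lies outside the Basic Multilingual Plane, so in UTF-16 it becomes the surrogate pair $D83D $DC3B. The variable names are illustrative only.

```pascal
program ConvertDemo;
{$mode objfpc}{$H+}

var
  Utf8: RawByteString;
  Utf16: UnicodeString;
  I: Integer;
begin
  // U+1F43B (bear) as its four UTF-8 bytes
  Utf8 := #$F0#$9F#$90#$BB;
  // UTF8Decode and UTF8Encode are declared in the System unit
  Utf16 := UTF8Decode(Utf8);
  WriteLn('UTF-8 code units:  ', Length(Utf8));
  WriteLn('UTF-16 code units: ', Length(Utf16));
  // Print each UTF-16 code unit in hexadecimal
  for I := 1 to Length(Utf16) do
    WriteLn('  $', HexStr(Ord(Utf16[I]), 4));
  // Converting back gives the original number of bytes
  WriteLn('re-encoded bytes:  ', Length(UTF8Encode(Utf16)));
end.
```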


Nikolay


Re: [fpc-pascal] Parse unicode scalar



On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:



On Jul 4, 2023, at 11:50 AM, Hairy Pixels  wrote:

You know you're right, with properly enclosed patterns you can capture 
everything inside and it works. You won't know if you had unicode in your 
string or not though but that depends on what's being parsed and if you care or 
not (I'm doing a TOML parser).

Sorry I'm still curious even though it's not my current problem :)

How can I make this program output the expected results:

var
  w: widechar;
  a: array of widechar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(length(a));
end;

The user doesn't know about unicode they just want to get an array of 
characters and not worry about all these little details. What can FPC do to 
solve this problem?


Depends on what you need, but I suppose in this case you want to count 
the number of extended grapheme clusters (a.k.a. "user perceived 
characters" - how many character-like things are displayed on the 
screen). You might be tempted to count the number of Unicode code 
points, but that's not the same, due to the existence of combining 
characters:


https://en.wikipedia.org/wiki/Combining_character

For extended grapheme clusters, there's an iterator in the 
graphemebreakproperty unit. I implemented this for the Unicode KVM and 
FreeVision. There it's needed for figuring out how many character blocks 
in the console will be needed to display a certain string. For the 
console or other GUIs that use fixed width fonts, there's also the East 
Asian Width property as well - some characters (East Asian - Chinese, 
Japanese, Korean) take double the space. So, to figure out where to move 
the cursor, you need to take East Asian Width as well.
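
To make the difference between code units and code points concrete, here is a small sketch (the function name is made up for illustration). The 7 in the question is a count of UTF-8 code units (bytes): 3 for 'abc' plus 4 for the bear. Counting only the bytes that are not continuation bytes gives the number of code points, which is 4 for this string. As explained above, that still is not a count of extended grapheme clusters in general. The sketch assumes well-formed UTF-8.

```pascal
program CountUtf8;
{$mode objfpc}{$H+}

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes are $80..$BF, i.e. 10xxxxxx).
function Utf8CodePointCount(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'abc' plus U+1F43B (bear) spelled out as its four UTF-8 bytes
  S := 'abc' + #$F0#$9F#$90#$BB;
  WriteLn('bytes (code units): ', Length(S));              // 7
  WriteLn('code points:        ', Utf8CodePointCount(S));  // 4
end.
```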


Nikolay


Re: [fpc-pascal] Parse unicode scalar



> On Jul 4, 2023, at 11:50 AM, Hairy Pixels  wrote:
> 
> You know you're right, with properly enclosed patterns you can capture 
> everything inside and it works. You won't know if you had unicode in your 
> string or not though but that depends on what's being parsed and if you care 
> or not (I'm doing a TOML parser).

Sorry I'm still curious even though it's not my current problem :)

How can I make this program output the expected results:

var
  w: widechar;
  a: array of widechar;
begin
  for w in 'abc🐻' do
    a += [w];
  // Outputs 7 instead of 4
  writeln(length(a));
end;

The user doesn't know about unicode they just want to get an array of 
characters and not worry about all these little details. What can FPC do to 
solve this problem?


Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar




> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal 
>  wrote:
> 
> But you just don't need to do this, in order to tokenize Pascal. The 
> beginning and the end of the string literal is the apostrophe, which is 
> ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler), 
> that will not be mistaken for an apostrophe, or end of line, because they 
> will have their high bit set. There's simply no need for a Pascal tokenizer 
> to iterate over UTF-8 code points, instead of code units.

You know you're right, with properly enclosed patterns you can capture 
everything inside and it works. You won't know if you had unicode in your 
string or not though but that depends on what's being parsed and if you care or 
not (I'm doing a TOML parser).

Maybe I can skip that part and just focus on the decoding of the unicode scalars

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar



On 7/4/23 07:45, Nikolay Nikolov wrote:


On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:


On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
 wrote:


For what grammar? What characters are allowed in a token? For 
example, Free Pascal also has a parser/tokenizer, but since Pascal 
keywords are ASCII only, it doesn't need to understand Unicode 
characters, so it works on the byte (Pascal's char type) level (for 
UTF-8 files, this means UTF-8 Unicode code units). That's because 
UTF-8 has two nice properties:


1)  ASCII characters are encoded as they are - by using bytes in the 
range #0..#127


2) non-ASCII characters will always use a sequence of bytes, that 
are all in the range #128..#255 (they have their highest bit set), 
so they will never be misinterpreted as ASCII.


So, the tokenizer just works with UTF-8 like with any other 8-bit 
code page.
yes this works until you reach a non-ASCII ranged character and then 
the character index no longer matches the string 1 to 1. For example 
consider this was pascal:


i := '🐻';

You can advance by index like:

  Inc(currentIndex);
  c := text[currentIndex];

but once you hit the bear the offset is now wrong so you can't 
advance to the next character by doing +1.


But you just don't need to do this, in order to tokenize Pascal. The 
beginning and the end of the string literal is the apostrophe, which 
is ASCII. The bear is a sequence of UTF-8 code units (opaque to the 
compiler), that will not be mistaken for an apostrophe, or end of 
line, because they will have their high bit set. There's simply no 
need for a Pascal tokenizer to iterate over UTF-8 code points, instead 
of code units.


Sorry, the last sentence should read: "There's simply no need for a 
Pascal tokenizer to iterate over Unicode code points, instead of UTF-8 
code units." Hope that makes it more clear and accurate.


Nikolay


Re: [fpc-pascal] Parse unicode scalar



On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:



On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
 wrote:

For what grammar? What characters are allowed in a token? For example, Free 
Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, 
it doesn't need to understand Unicode characters, so it works on the byte 
(Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code 
units). That's because UTF-8 has two nice properties:

1)  ASCII characters are encoded as they are - by using bytes in the range 
#0..#127

2) non-ASCII characters will always use a sequence of bytes, that are all in 
the range #128..#255 (they have their highest bit set), so they will never be 
misinterpreted as ASCII.

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.

yes this works until you reach a non-ASCII ranged character and then the 
character index no longer matches the string 1 to 1. For example consider this 
was pascal:

i := '🐻';

You can advance by index like:

  Inc(currentIndex);
  c := text[currentIndex];

but once you hit the bear the offset is now wrong so you can't advance to the 
next character by doing +1.


But you just don't need to do this, in order to tokenize Pascal. The 
beginning and the end of the string literal is the apostrophe, which is 
ASCII. The bear is a sequence of UTF-8 code units (opaque to the 
compiler), that will not be mistaken for an apostrophe, or end of line, 
because they will have their high bit set. There's simply no need for a 
Pascal tokenizer to iterate over UTF-8 code points, instead of code units.


Nikolay


Re: [fpc-pascal] Parse unicode scalar



> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
>  wrote:
> 
> For what grammar? What characters are allowed in a token? For example, Free 
> Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, 
> it doesn't need to understand Unicode characters, so it works on the byte 
> (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code 
> units). That's because UTF-8 has two nice properties:
> 
> 1)  ASCII characters are encoded as they are - by using bytes in the range 
> #0..#127
> 
> 2) non-ASCII characters will always use a sequence of bytes, that are all in 
> the range #128..#255 (they have their highest bit set), so they will never be 
> misinterpreted as ASCII.
> 
> So, the tokenizer just works with UTF-8 like with any other 8-bit code page.

yes this works until you reach a non-ASCII ranged character and then the 
character index no longer matches the string 1 to 1. For example consider this 
was pascal:

i := '🐻';

You can advance by index like:

 Inc(currentIndex);
 c := text[currentIndex];

but once you hit the bear the offset is now wrong so you can't advance to the 
next character by doing +1.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal
wrote:

You need to understand all these terms and know exactly what you need to do.
E.g. are you dealing with keyboard input, are you dealing with the low level
parts of text display, are you searching for something in the text, are you
just passing strings around and letting the GUI deal with it? These are all
different use cases, and they require careful understanding what Unicode thing
you need to iterate over.

Thanks for trying to help but this is more complicated than I thought and I
don't have the patience for a deep dive right now :)

Unicode is complicated under the hood but we should have some libraries to help, right? I mean the
user thinks of these things as "characters", be it "A" or the unicode symbol 👍,
so we should be able to operate on that basis as well. Something like an iterator that returns the
character (wide char) and byte offset or writing would be a nice place to start.

I have a parser/tokenizer I want to update so I'm trying to find tokens by
advancing one character at a time. That's why I have a requirement to know
which character is next in the file and probably the byte offset also so it can
be referenced later.

For what grammar? What characters are allowed in a token? For example,
Free Pascal also has a parser/tokenizer, but since Pascal keywords are
ASCII only, it doesn't need to understand Unicode characters, so it
works on the byte (Pascal's char type) level (for UTF-8 files, this
means UTF-8 Unicode code units). That's because UTF-8 has two nice
properties:

1) ASCII characters are encoded as they are - by using bytes in the
range #0..#127

2) non-ASCII characters will always use a sequence of bytes, that are
all in the range #128..#255 (they have their highest bit set), so they
will never be misinterpreted as ASCII.

So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
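
A rough sketch of why those two properties are enough for a byte-level tokenizer. The helper name is hypothetical, and the Pascal doubled-apostrophe escape is ignored for brevity. The scan compares single bytes against the ASCII apostrophe. It never has to decode the bear, because every byte of a multi-byte sequence is in #128..#255.

```pascal
program ScanLiteral;
{$mode objfpc}{$H+}

// Returns the byte index of the apostrophe that closes the literal opened
// at StartPos, or 0 if there is none. Works byte by byte: no UTF-8 decoding.
function FindClosingQuote(const Src: AnsiString; StartPos: SizeInt): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := StartPos + 1 to Length(Src) do
    if Src[I] = '''' then
    begin
      Result := I;
      Exit;
    end;
end;

var
  Src: AnsiString;
begin
  // The text  i := '<bear>';  with the bear as its four UTF-8 bytes
  Src := 'i := ''' + #$F0#$9F#$90#$BB + ''';';
  // The opening apostrophe is byte 6, the bear is bytes 7..10
  WriteLn(FindClosingQuote(Src, 6)); // 11
end.
```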

Nikolay


Re: [fpc-pascal] Parse unicode scalar

> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal 
>  wrote:
> 
> You need to understand all these terms and know exactly what you need to do. 
> E.g. are you dealing with keyboard input, are you dealing with the low level 
> parts of text display, are you searching for something in the text, are you 
> just passing strings around and letting the GUI deal with it? These are all 
> different use cases, and they require careful understanding what Unicode 
> thing you need to iterate over.

Thanks for trying to help but this is more complicated than I thought and I 
don't have the patience for a deep dive right now :)

Unicode is complicated under the hood but we should have some libraries to help, 
right? I mean the user thinks of these things as "characters", be it "A" or the 
unicode symbol 👍, so we should be able to operate on that basis as well. 
Something like an iterator that returns the character (wide char) and byte 
offset or writing would be a nice place to start.

I have a parser/tokenizer I want to update so I'm trying to find tokens by 
advancing one character at a time. That's why I have a requirement to know 
which character is next in the file and probably the byte offset also so it can 
be referenced later.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar




On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:



On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal 
 wrote:

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

Thanks, this looks right. I guess this is how we need to iterate over unicode 
now.

Btw, why isn't there a for-loop we can use over unicode strings? seems like 
that should be supported out of the box. I had this same problem in Swift also 
where it's extremely confusing to merely iterate over a string and look at each 
character. Replacing characters will be tricky also so we need some good 
library functions.


You're still confusing the Unicode terms. The above code iterates over 
Unicode Code Points, not "characters" in a UTF-8 encoded string. A 
Unicode Code Point is not a "character":


https://unicode.org/glossary/#character

https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme 
clusters - these terms can also be perceived as "characters".


https://unicode.org/glossary/#grapheme

https://unicode.org/glossary/#grapheme_cluster

https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, 
there's an iterator (written by me) in the unit graphemebreakproperty.pp 
in the rtl-unicode package.


If you use the 'char' type in Pascal to iterate over an UTF-8 encoded 
string, you're iterating over Unicode code units (units! not code 
points! https://unicode.org/glossary/#code_unit).


If you use the 'widechar' type in Pascal to iterate over a UnicodeString 
(which is a UTF-16 encoded string), you're also iterating over Unicode 
code units, however this time in UTF-16 encoding.


If you want to iterate over Unicode code points (not units! not 
characters! not graphemes!) in a UTF-8 string, you need something like 
the ReadUTF8 function above. If you want to iterate over Unicode code 
points in a UTF-16 string, you need different code.
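
For the UTF-16 case, the "different code" could look roughly like the sketch below. The function name is invented for illustration. It assumes well-formed UTF-16 and returns an unpaired surrogate unchanged instead of reporting an error. A code point above U+FFFF is stored as a high surrogate ($D800..$DBFF) followed by a low surrogate ($DC00..$DFFF), and the two are combined arithmetically.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Decodes the code point starting at S[Index] and advances Index past it.
// Sketch only: an unpaired surrogate is returned as-is.
function NextCodePoint(const S: UnicodeString; var Index: SizeInt): Cardinal;
var
  HighUnit, LowUnit: Cardinal;
begin
  HighUnit := Ord(S[Index]);
  Inc(Index);
  Result := HighUnit;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (Index <= Length(S)) then
  begin
    LowUnit := Ord(S[Index]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      // Combine the 10 payload bits of each surrogate
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      Inc(Index);
    end;
  end;
end;

var
  S: UnicodeString;
  I: SizeInt;
begin
  S := 'abc';
  S := S + WideChar($D83D) + WideChar($DC3B); // U+1F43B as a surrogate pair
  I := 1;
  // Five UTF-16 code units, but only four loop iterations
  while I <= Length(S) do
    WriteLn('U+', HexStr(NextCodePoint(S, I), 6));
end.
```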


You need to understand all these terms and know exactly what you need to 
do. E.g. are you dealing with keyboard input, are you dealing with the 
low level parts of text display, are you searching for something in the 
text, are you just passing strings around and letting the GUI deal with 
it? These are all different use cases, and they require careful 
understanding what Unicode thing you need to iterate over.


Nikolay


Re: [fpc-pascal] Parse unicode scalar

> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
> // returns the number of codepoints
> var
>   CodePointLen: longint;
>   CodePoint: longword;
> begin
>   Result:=0;
>   while (ByteCount>0) do begin
>     inc(Result);
>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>     // ...do something with the CodePoint...
>     inc(p,CodePointLen);
>     dec(ByteCount,CodePointLen);
>   end;
> end;

Thanks, this looks right. I guess this is how we need to iterate over unicode 
now.

Btw, why isn't there a for-loop we can use over unicode strings? seems like 
that should be supported out of the box. I had this same problem in Swift also 
where it's extremely confusing to merely iterate over a string and look at each 
character. Replacing characters will be tricky also so we need some good 
library functions.

Swift is especially terrible because there's NO ANSI string so even a 1 byte 
sequence needs all these confusing as hell functions to do any work with 
strings at all. Terrible experience and slow.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

On Mon, 3 Jul 2023 17:18:56 +0700
Hairy Pixels via fpc-pascal  wrote:

>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?  
> 
> If they give the parser broken files that's their problem they need
> to fix? The user has control over the file so it's their
> responsibility I think.

User's responsibility?
 - I recommend to check for malicious codes. ;)


> >> Right now I've just read the file into an AnsiString and indexing
> >> assuming a fixed character size, which breaks of course if non-1
> >> byte characters exist  
> > 
> > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> > 
> > function UTF8CodepointToUnicode(p: PChar; out CodepointLen:
> > integer): Cardinal;  
> 
> Not sure how this works. You need to advance by character so the
> return value should be the byte location of the next character or
> something like that.

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;


> >> I also need to know if I come across something like \u1F496 I need
> >> to convert that to a unicode character.  
> > 
> > I guess you know how to convert a hex to a dword.  
> 
> Is there anything better than StrToInt?

Good start.

> I wouldn't be able to do it
> myself though without that function.

Hex to dword. That's easy enough for ChatGPT.
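
For reference, a hedged sketch of the hex-to-dword step built on SysUtils' TryStrToInt, which reads a string with a leading '$' as hexadecimal. ParseUnicodeEscape is a hypothetical helper, not an existing RTL routine. It greedily takes every hex digit after '\u', which is an assumption made to match the \u1F496 example in this thread. A real grammar may fix the number of digits.

```pascal
program ParseEscape;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: if S has '\u' followed by hex digits at Index, stores
// the value in CodePoint, moves Index past the escape and returns True.
function ParseUnicodeEscape(const S: AnsiString; var Index: SizeInt;
  out CodePoint: Cardinal): Boolean;
var
  Start, Stop: SizeInt;
  Value: LongInt;
begin
  Result := False;
  CodePoint := 0;
  if (Index + 2 > Length(S)) or (S[Index] <> '\') or (S[Index + 1] <> 'u') then
    Exit;
  Start := Index + 2;
  Stop := Start;
  while (Stop <= Length(S)) and (S[Stop] in ['0'..'9', 'a'..'f', 'A'..'F']) do
    Inc(Stop);
  // A leading '$' makes TryStrToInt read the digits as hexadecimal
  if (Stop = Start) or
     not TryStrToInt('$' + Copy(S, Start, Stop - Start), Value) then
    Exit;
  // Reject values outside the Unicode range and the surrogate block
  if (Value < 0) or (Value > $10FFFF) or
     ((Value >= $D800) and (Value <= $DFFF)) then
    Exit;
  CodePoint := Cardinal(Value);
  Index := Stop;
  Result := True;
end;

var
  Src: AnsiString;
  I: SizeInt;
  CP: Cardinal;
begin
  Src := 'x\u1F496y';
  I := 2; // byte index of the backslash
  if ParseUnicodeEscape(Src, I, CP) then
    WriteLn('code point $', HexStr(CP, 6), ', next byte index ', I);
end.
```

The resulting code point can then be handed to a routine such as the UnicodeToUTF8 function mentioned below to get the UTF-8 bytes.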


> > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> 
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.

Yes, no need to reinvent the wheel.

Mattias

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread José Mejuto via fpc-pascal


El 03/07/2023 a las 10:27, Hairy Pixels via fpc-pascal escribió:


Right now I've just read the file into an AnsiString and indexing assuming a 
fixed character size, which breaks of course if non-1 byte characters exist

  I also need to know if I come across something like \u1F496 I need to convert 
that to a unicode character.



Hello,

You are intermixing a lot of concepts, ASCII, Unicode, grapheme, 
representation, content, etc...


When talking about Unicode you must forget ASCII: the text is a sequence of 
bytes encoded in a particular format (UTF-8, UTF-16, UTF-32, ...), 
and it must be rendered on screen using Unicode rendering 
rules, which are not the same as ASCII's.


Just to keep this message reasonably short, think of a text with only one 
"letter":


"á"

This text (text, not one letter; Unicode is about texts) can be 
transmitted or stored using one of the Unicode encodings, each of which 
has its own rules for turning the information into a sequence of bytes. 
Each byte below is in hexadecimal:


UTF-8: C3 A1
UTF-16BE: 00 E1
UTF-32BE: 00 00 00 E1

You must know in advance the encoding format to get the text from the 
bytes sequence. There is also a BOM (Byte Order Mark) which is sometimes 
used in files as a header to indicate the encoding, but in general it is 
not used.


Now, decoding that sequence of bytes using the right encoding, you 
get a text which represents the letter "a" with an acute accent. But 
Unicode is *not* so *simple*: the same text could be displayed on 
screen using the letter "a" + "combining acute accent", and then the byte 
sequence is totally different, different at the encoding level but 
identical at the rendering level. So these two UTF-8 sequences:


"C3 A1" and "61 CC 81"

are different at the code point level and the encoding level but identical 
at the rendering level.


Just as a final note, this is the UTF-8 sequence of bytes for one single 
"character" on screen:


F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 
F3 A0 81 BF
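
The byte sequences in this message can be taken apart with a small decoder. The following is only a sketch with made-up names. It assumes well-formed UTF-8 and does no validation. It shows that "C3 A1" is the single code point U+00E1, that "61 CC 81" is U+0061 followed by U+0301, and that the 28 bytes above are seven code points: U+1F3F4 followed by six code points from the tag block, ending in U+E007F.

```pascal
program DecodeDemo;
{$mode objfpc}{$H+}

// Minimal UTF-8 decoder sketch: returns the code point starting at S[Index]
// and advances Index. Assumes well-formed input; the only check is that it
// never reads past the end of the string.
function NextUtf8CodePoint(const S: AnsiString; var Index: SizeInt): Cardinal;
var
  B: Byte;
  Extra: Integer;
begin
  B := Ord(S[Index]);
  Inc(Index);
  // The lead byte gives the payload bits and the number of bytes to follow
  if B < $80 then begin Result := B; Extra := 0; end
  else if B < $E0 then begin Result := B and $1F; Extra := 1; end
  else if B < $F0 then begin Result := B and $0F; Extra := 2; end
  else begin Result := B and $07; Extra := 3; end;
  // Each continuation byte contributes its low 6 bits
  while (Extra > 0) and (Index <= Length(S)) do
  begin
    Result := (Result shl 6) or (Ord(S[Index]) and $3F);
    Inc(Index);
    Dec(Extra);
  end;
end;

procedure Dump(const Title, S: AnsiString);
var
  I: SizeInt;
  Count: Integer;
begin
  Write(Title, ': ', Length(S), ' bytes ->');
  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    Write(' U+', HexStr(NextUtf8CodePoint(S, I), 6));
    Inc(Count);
  end;
  WriteLn(' (', Count, ' code points)');
end;

begin
  Dump('precomposed', #$C3#$A1);
  Dump('combining  ', #$61#$CC#$81);
  Dump('final note ', #$F0#$9F#$8F#$B4#$F3#$A0#$81#$A7#$F3#$A0#$81#$A2 +
    #$F3#$A0#$81#$B3#$F3#$A0#$81#$A3#$F3#$A0#$81#$B4#$F3#$A0#$81#$BF);
end.
```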


Unicode is far, far from easy.

Have a nice day.

Re: [fpc-pascal] Parse unicode scalar

> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
> 
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?

If they give the parser broken files that's their problem they need to fix? The 
user has control over the file so it's their responsibility I think.

> 
> 
>> Right now I've just read the file into an AnsiString and indexing
>> assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
> 
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> 
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): 
> Cardinal;

Not sure how this works. You need to advance by character so the return value 
should be the byte location of the next character or something like that.

> 
> 
>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
> 
> I guess you know how to convert a hex to a dword.

Is there anything better than StrToInt? I wouldn't be able to do it myself 
though without that function.

> Then
> 
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to 
> UTF8
> 

Ok I think this is basically what the other programmer submitted and what 
ChatGPT tried to do.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Jer Haan via fpc-pascal

Hi Ryan,

I’ve created the attached unit, which takes a code point and returns the UTF-8 char as a string. It’s based on the Wikipedia article on UTF-8.

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point:

[table not preserved in the archive]

This table is copied from Wikipedia.

[Attachment: uencoding.pas]

Hope it’s useful for you. If you improve the code pls let me know.

Best regards,
Jeroen

On 2 Jul 2023, at 15:30, Hairy Pixels via fpc-pascal wrote:

> I'm interested in parsing unicode scalars (I think they're called) to byte
> sized values but I'm not sure where to start. First thing I did was choose
> the unicode scalar U+1F496 (💖).
>
> Next I cheated and asked ChatGPT. :) Amazingly from my question it was able
> to tell me the scalar is comprised of these 4 bytes: 240 159 146 150
>
> I was able to correctly concatenate these characters and writeln printed
> the correct character.
>
> var
>   s: String;
> begin
>   s := char(240)+char(159)+char(146)+char(150);
>   writeln(s);
> end.
>
> The question is, how was 1F496 decomposed into 4 bytes?
>
> Regards,
> Ryan Joseph
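
Since the attachment is not reproduced here, the following is a sketch of the standard UTF-8 bit layout rather than the contents of uencoding.pas. The function name is illustrative. It answers the question of how 1F496 becomes four bytes: the 21 bits of the code point are split into groups of 3 + 6 + 6 + 6 and placed behind the fixed prefixes 11110, 10, 10 and 10.

```pascal
program EncodeDemo;
{$mode objfpc}{$H+}

// Standard UTF-8 bit layout (x = bits of the code point):
//   U+0000..U+007F      0xxxxxxx
//   U+0080..U+07FF      110xxxxx 10xxxxxx
//   U+0800..U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
//   U+10000..U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// Sketch only: no check for surrogates or values above U+10FFFF.
function EncodeUtf8(CP: Cardinal): AnsiString;
begin
  if CP < $80 then
    Result := Chr(CP)
  else if CP < $800 then
    Result := Chr($C0 or (CP shr 6)) + Chr($80 or (CP and $3F))
  else if CP < $10000 then
    Result := Chr($E0 or (CP shr 12)) + Chr($80 or ((CP shr 6) and $3F)) +
      Chr($80 or (CP and $3F))
  else
    Result := Chr($F0 or (CP shr 18)) + Chr($80 or ((CP shr 12) and $3F)) +
      Chr($80 or ((CP shr 6) and $3F)) + Chr($80 or (CP and $3F));
end;

var
  S: AnsiString;
  I: Integer;
begin
  S := EncodeUtf8($1F496);
  for I := 1 to Length(S) do
    Write(Ord(S[I]), ' ');   // 240 159 146 150
  WriteLn;
end.
```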

Re: [fpc-pascal] Parse unicode scalar

On Mon, 3 Jul 2023 15:27:10 +0700
Hairy Pixels via fpc-pascal  wrote:

>[...]
> I was just curious how ChatGPT's implementation compared to other
> programmers'.

Apparently the quality is often terrible. But it can be useful.

 
> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or
malicious sequences?

 
> Right now I've just read the file into an AnsiString and indexing
> assuming a fixed character size, which breaks of course if non-1 byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

 
>  I also need to know if I come across something like \u1F496 I need
> to convert that to a unicode character.

I guess you know how to convert a hex to a dword. Then

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to 
UTF8

Mattias

Re: [fpc-pascal] Parse unicode scalar

> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
> I wonder, is this thread about testing ChatGPT or do you want to
> implement something useful?
> There are already plenty of optimized UTF-8 functions in the FPC and
> Lazarus sources. Maybe too many, and you have trouble finding the right
> one? Just ask what your function needs to do.

I was just curious how ChatGPT's implementation compared to other programmers'.

What I'm really trying to do is improve a parser so it can read UTF-8 files and 
decode unicode literals in the grammar.

Right now I've just read the file into an AnsiString and indexing assuming a 
fixed character size, which breaks of course if non-1 byte characters exist

 I also need to know if I come across something like \u1F496 I need to convert 
that to a unicode character.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

On Mon, 3 Jul 2023 12:01:11 +0700
Hairy Pixels via fpc-pascal  wrote:

> > On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal
> >  wrote:
> > 
> > Useless array of.
> > And it does not return the bytecount.  
> 
> it's an open array so what's the problem?
>[...]
> > Wrong for byteCount=1  
> 
> really? How so? 
>
> ChatGPT is risky because it will give wrong information with perfect
> confidence and there's no way for the ignorant person to know.

I wonder, is this thread about testing ChatGPT or do you want to
implement something useful?
There are already plenty of optimized UTF-8 functions in the FPC and
Lazarus sources. Maybe too many, and you have trouble finding the right
one? Just ask what your function needs to do.

Mattias

Re: [fpc-pascal] Parse unicode scalar

On Mon, 3 Jul 2023 14:12:03 +0700
Hairy Pixels via fpc-pascal  wrote:

> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
> >  wrote:
> > 
> > No - in this case, the "header" is the highest bit of that byte
> > being 0.  
> 
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.

The first byte of a UTF-8 codepoint is in 0..127 or 192..247.
The second, third and fourth bytes are in 128..191, so you can easily
detect where a codepoint starts.
And from the first byte you can derive the length of the codepoint.
If you just want to skip over n codepoints, then the function below does
the job:

 
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>   end;
> end;
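
To make the "skip over n codepoints" idea concrete, here is a small sketch that counts codepoints by hopping from lead byte to lead byte. The names LeadByteSize and CountUTF8Codepoints are made up for this example; the size table is the same one as in the quoted function, just taking a character instead of a PChar. It assumes the input is valid UTF-8 and does no checking.

```pascal
program CountCodepoints;
{$mode objfpc}{$H+}

// Same table as the quoted UTF8CodepointSizeFast: length from the lead byte.
function LeadByteSize(c: AnsiChar): Integer;
begin
  case c of
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 1; // ASCII, or a byte that is not a valid lead byte
  end;
end;

// Counts codepoints by jumping over each sequence in one step.
// Assumes S holds valid UTF-8; malformed input is not detected.
function CountUTF8Codepoints(const S: RawByteString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Inc(i, LeadByteSize(S[i]));
    Inc(Result);
  end;
end;

begin
  // 'abc' followed by U+1F43B (bytes F0 9F 90 BB): 7 bytes, 4 codepoints.
  WriteLn(CountUTF8Codepoints('abc'#$F0#$9F#$90#$BB));  // prints 4
end.
```

Note that this counts codepoints, not what a user perceives as characters; combining sequences would still count as several.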

Mattias

Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal

On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal 
 wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal 
>>  wrote:
>> 
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
>Oh it's the header BIT. Admittedly I don't understand how this function 
>returns the highest bit using that case, which I think he was suggesting.
>
>function UTF8CodepointSizeFast(p: PChar): integer;
>begin
> case p^ of
>   #0..#191   : Result := 1;
>   #192..#223 : Result := 2;
>   #224..#239 : Result := 3;
>   #240..#247 : Result := 4;
>   else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
> end;
>end;

That's why I wrote "in this case". The "header" itself is not fixed size 
either, but the algorithm above shows how you can derive the length from the 
first byte.

Tomas


Re: [fpc-pascal] Parse unicode scalar




> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal 
>  wrote:
> 
> No - in this case, the "header" is the highest bit of that byte being 0.

Oh it's the header BIT. Admittedly I don't understand how this function returns 
the highest bit using that case, which I think he was suggesting.

function UTF8CodepointSizeFast(p: PChar): integer;
begin
 case p^ of
   #0..#191   : Result := 1;
   #192..#223 : Result := 2;
   #224..#239 : Result := 3;
   #240..#247 : Result := 4;
   else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
 end;
end;

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

2023-07-03 Thread Tomas Hajny via fpc-pascal

On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal 
 wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal 
>>  wrote:
>> 
>> No, the header of a codepoint to figure out the length.
>
>so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 
>1 for the character? 
>
>ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2 
>bytes?

No - in this case, the "header" is the highest bit of that byte being 0.
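
As an illustration of the variable-length "header", the sketch below classifies a single byte by its leading bits. The type and function names are invented for this example. It only looks at the bit pattern; a full validator would additionally reject lead bytes such as $C0, $C1 and $F5..$F7, which match a pattern here but never occur in valid UTF-8.

```pascal
program ClassifyBytes;
{$mode objfpc}{$H+}

type
  TUTF8ByteKind = (bkAscii, bkContinuation, bkLead2, bkLead3, bkLead4,
                   bkInvalid);

// Classifies a byte purely by its high bits.
function ClassifyUTF8Byte(b: Byte): TUTF8ByteKind;
begin
  if (b and $80) = $00 then Result := bkAscii              // 0xxxxxxx
  else if (b and $C0) = $80 then Result := bkContinuation  // 10xxxxxx
  else if (b and $E0) = $C0 then Result := bkLead2         // 110xxxxx
  else if (b and $F0) = $E0 then Result := bkLead3         // 1110xxxx
  else if (b and $F8) = $F0 then Result := bkLead4         // 11110xxx
  else Result := bkInvalid;                                // 11111xxx
end;

begin
  WriteLn(ClassifyUTF8Byte(100) = bkAscii);         // ASCII #100: TRUE
  WriteLn(ClassifyUTF8Byte($F0) = bkLead4);         // TRUE
  WriteLn(ClassifyUTF8Byte($9F) = bkContinuation);  // TRUE
end.
```

So ASCII #100 stays a single byte: its "header" is just the top bit being 0, and the remaining 7 bits are the character.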

Tomas


Re: [fpc-pascal] Parse unicode scalar

> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
> No, the header of a codepoint to figure out the length.

so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 
1 for the character? 

ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2 
bytes?

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal

On Mon, 3 Jul 2023 11:58:33 +0700
Hairy Pixels via fpc-pascal  wrote:

> > On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal
> >  wrote:
> > 
> > There is a header byte.
> > 
> > It depends, if you want to check for invalid UTF-8 sequences.
> > 
> > From LazUTF8:
> > 
> > function UTF8CodepointSizeFast(p: PChar): integer;
> > begin
> >   case p^ of
> >     #0..#191   : Result := 1;
> >     #192..#223 : Result := 2;
> >     #224..#239 : Result := 3;
> >     #240..#247 : Result := 4;
> >     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
> >   end;
> > end;
> 
> This is a header for the file?

No, the header of a codepoint to figure out the length.

> Does that mean the file itself must
> have uniform character sizes?

No.

> I thought the idea was to read the file
> one byte at a time but I don't understand how you would know if a 1
> byte character (like ascii) was part of a 4 byte character or not.

ASCII is #0..#127, which is the same character in UTF-8.
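
Put differently, every byte of a multi-byte sequence is 128 or above, so an ASCII byte can never be part of a longer character. A byte-wise search for an ASCII delimiter is therefore safe on UTF-8 text, as this small sketch shows (TakeUntil is a made-up helper, not an existing routine):

```pascal
program AsciiScan;
{$mode objfpc}{$H+}

// Returns the bytes of S before the first occurrence of Delim.
// Delim is expected to be an ASCII character (#0..#127); such a byte
// cannot appear inside a multi-byte UTF-8 sequence.
function TakeUntil(const S: RawByteString; Delim: AnsiChar): RawByteString;
var
  i: Integer;
begin
  i := 1;
  while (i <= Length(S)) and (S[i] <> Delim) do
    Inc(i);
  Result := Copy(S, 1, i - 1);
end;

begin
  // A 4-byte codepoint (F0 9F 90 BB) followed by '=1'.
  WriteLn(Length(TakeUntil(#$F0#$9F#$90#$BB'=1', '=')));  // prints 4
end.
```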

Mattias


Re: [fpc-pascal] Parse unicode scalar




> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
> Useless array of.
> And it does not return the bytecount.

it's an open array so what's the problem?

> 
>> var
>>   i: Integer;
>>   byteCount: Integer;
>> begin
>>   // Number of bytes required to represent the Unicode scalar
>>   if unicodeScalar < $80 then
>>     byteCount := 1
>>   else if unicodeScalar < $800 then
>>     byteCount := 2
>>   else if unicodeScalar < $10000 then
>>     byteCount := 3
>>   else if unicodeScalar < $110000 then
>>     byteCount := 4
>>   else
>>     raise Exception.Create('Invalid Unicode scalar');
>> 
>>   // Extract the individual bytes using bitwise operations
>>   for i := byteCount - 1 downto 0 do
>>   begin
>>     bytes[i] := $80 or (unicodeScalar and $3F);
> 
> Wrong for byteCount=1

really? How so? 

ChatGPT is risky because it will give wrong information with perfect confidence 
and there's no way for the ignorant person to know.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar




> On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal 
>  wrote:
> 
> There is a header byte.
> 
> It depends, if you want to check for invalid UTF-8 sequences.
> 
> From LazUTF8:
> 
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>  case p^ of
>#0..#191   : Result := 1;
>#192..#223 : Result := 2;
>#224..#239 : Result := 3;
>#240..#247 : Result := 4;
>else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>  end;
> end;

This is a header for the file? Does that mean the file itself must have uniform 
character sizes? I thought the idea was to read the file one byte at a time but 
I don't understand how you would know if a 1 byte character (like ascii) was 
part of a 4 byte character or not.

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal

On Mon, 3 Jul 2023 08:29:11 +0700
Hairy Pixels via fpc-pascal  wrote:

> > On Jul 2, 2023, at 11:16 PM, Jer Haan  wrote:
> > 
> > This table is copied from Wikipedia. Hope it’s useful
> > for you. If you improve the code pls let me know. 
> 
> This is perfect, thanks! Much more complicated than I thought.
> 
> I'm curious now, if you were going the other direction and parsing a
> string of different unicode characters with different code point
> sequence lengths how would you know which length it was? For example
> I started off knowing which unicode scalar to use by looking at a table
> but what if I had to find the character in a stream of text?
> 
> I think UTF8 can have 1-4 byte characters so you could encounter 1
> byte character followed by 4 byte characters interleaved and there's
> no header or terminator for each character. How is this solved?

There is a header byte.

It depends, if you want to check for invalid UTF-8 sequences.

From LazUTF8:

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;
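
If you do want to check for invalid sequences, a checked decoder could look roughly like the sketch below. This is not code from LazUTF8; DecodeUTF8 and its parameters are invented for illustration. It verifies that each following byte is a continuation byte (10xxxxxx) and rejects truncated input, overlong encodings, surrogates and values above U+10FFFF.

```pascal
program DecodeChecked;
{$mode objfpc}{$H+}

// Decodes the codepoint that starts at S[Index] (1-based).
// On success: returns True, sets CodePoint, and moves Index past the sequence.
// On failure: returns False and leaves Index unchanged.
function DecodeUTF8(const S: RawByteString; var Index: Integer;
  out CodePoint: Cardinal): Boolean;
const
  // Smallest value that legitimately needs 1, 2, 3 or 4 bytes.
  MinValue: array[1..4] of Cardinal = ($0, $80, $800, $10000);
var
  b: Byte;
  size, k: Integer;
begin
  Result := False;
  CodePoint := 0;
  if (Index < 1) or (Index > Length(S)) then Exit;
  b := Ord(S[Index]);
  if b < $80 then begin size := 1; CodePoint := b; end
  else if (b and $E0) = $C0 then begin size := 2; CodePoint := b and $1F; end
  else if (b and $F0) = $E0 then begin size := 3; CodePoint := b and $0F; end
  else if (b and $F8) = $F0 then begin size := 4; CodePoint := b and $07; end
  else Exit; // stray continuation byte or invalid lead byte
  if Index + size - 1 > Length(S) then Exit; // truncated sequence
  for k := 1 to size - 1 do
  begin
    b := Ord(S[Index + k]);
    if (b and $C0) <> $80 then Exit; // not a continuation byte
    CodePoint := (CodePoint shl 6) or (b and $3F);
  end;
  if CodePoint < MinValue[size] then Exit;                     // overlong
  if (CodePoint >= $D800) and (CodePoint <= $DFFF) then Exit;  // surrogate
  if CodePoint > $10FFFF then Exit;                            // out of range
  Inc(Index, size);
  Result := True;
end;

var
  idx: Integer;
  cp: Cardinal;
begin
  idx := 1;
  if DecodeUTF8(#$F0#$9F#$90#$BB, idx, cp) then
    WriteLn((cp = $1F43B) and (idx = 5));  // prints TRUE
end.
```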

Mattias

Re: [fpc-pascal] Parse unicode scalar

2023-07-02 Thread Mattias Gaertner via fpc-pascal

On Mon, 3 Jul 2023 09:34:10 +0700
Hairy Pixels via fpc-pascal  wrote:

>[...]
> Ok today I just tried to ask ChatGPT and got an answer. I must have
> asked the wrong thing yesterday but it got it right today (with one
> syntax error using an inline "var" in the code section for some
> reason).
> 
> How does this look?
> 
> procedure SplitUTF8Bytes(unicodeScalar: Integer; var bytes: array of
> Byte);

Useless array of.
And it does not return the bytecount.

> var
>   i: Integer;
>   byteCount: Integer;
> begin
>   // Number of bytes required to represent the Unicode scalar
>   if unicodeScalar < $80 then
>     byteCount := 1
>   else if unicodeScalar < $800 then
>     byteCount := 2
>   else if unicodeScalar < $10000 then
>     byteCount := 3
>   else if unicodeScalar < $110000 then
>     byteCount := 4
>   else
>     raise Exception.Create('Invalid Unicode scalar');
> 
>   // Extract the individual bytes using bitwise operations
>   for i := byteCount - 1 downto 0 do
>   begin
>     bytes[i] := $80 or (unicodeScalar and $3F);

Wrong for byteCount=1

>     unicodeScalar := unicodeScalar shr 6;
>   end;
> 
>   // Set the leading bits of each byte
>   case byteCount of
>     2:
>       bytes[0] := $C0 or bytes[0];
>     3:
>       bytes[0] := $E0 or bytes[0];
>     4:
>       bytes[0] := $F0 or bytes[0];
>   end;
> end;

Well, it got the basic idea of UTF-8 multibytes right and it compiles,
so maybe half the points?
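
For comparison, here is a sketch of how the three points above could be addressed: a fixed 4-byte buffer instead of an open array, the byte count as the return value, and a separate path for the 1-byte case. In the quoted code the loop also runs for byteCount=1 and ORs $80 into the only byte, which turns an ASCII character into a continuation byte. EncodeUTF8 and TUTF8Bytes are names made up for this example.

```pascal
program EncodeScalar;
{$mode objfpc}{$H+}

type
  TUTF8Bytes = array[0..3] of Byte;

// Encodes CodePoint into Bytes and returns the number of bytes used (1..4).
// Returns 0 for surrogates and values above U+10FFFF.
function EncodeUTF8(CodePoint: Cardinal; out Bytes: TUTF8Bytes): Integer;
var
  i: Integer;
begin
  for i := 0 to 3 do
    Bytes[i] := 0;
  if (CodePoint >= $D800) and (CodePoint <= $DFFF) then Exit(0);
  if CodePoint < $80 then Result := 1
  else if CodePoint < $800 then Result := 2
  else if CodePoint < $10000 then Result := 3
  else if CodePoint < $110000 then Result := 4
  else Exit(0);

  if Result = 1 then
  begin
    Bytes[0] := Byte(CodePoint); // ASCII is stored as-is, no marker bits
    Exit;
  end;

  // Continuation bytes, last one first: 10xxxxxx with 6 payload bits each.
  for i := Result - 1 downto 1 do
  begin
    Bytes[i] := Byte($80 or (CodePoint and $3F));
    CodePoint := CodePoint shr 6;
  end;

  // The lead byte carries the length marker plus the remaining high bits.
  case Result of
    2: Bytes[0] := Byte($C0 or CodePoint);
    3: Bytes[0] := Byte($E0 or CodePoint);
    4: Bytes[0] := Byte($F0 or CodePoint);
  end;
end;

var
  buf: TUTF8Bytes;
  n: Integer;
begin
  n := EncodeUTF8($1F496, buf);
  // U+1F496 encodes as F0 9F 92 96.
  WriteLn((n = 4) and (buf[0] = $F0) and (buf[1] = $9F) and
          (buf[2] = $92) and (buf[3] = $96));  // prints TRUE
  n := EncodeUTF8(100, buf);
  WriteLn(n, ' ', buf[0]);  // prints 1 100
end.
```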

Mattias

Re: [fpc-pascal] Parse unicode scalar




> On Jul 3, 2023, at 12:20 AM, Nikolay Nikolov via fpc-pascal 
>  wrote:
> 
> There's no such thing as "unicode scalar" in Unicode terminology:
> 
> https://unicode.org/glossary/

I got it from here 
https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/

Ok today I just tried to ask ChatGPT and got an answer. I must have asked the 
wrong thing yesterday but it got it right today (with one syntax error using an 
inline "var" in the code section for some reason).

How does this look?

procedure SplitUTF8Bytes(unicodeScalar: Integer; var bytes: array of Byte);
var
  i: Integer;
  byteCount: Integer;
begin
  // Number of bytes required to represent the Unicode scalar
  if unicodeScalar < $80 then
    byteCount := 1
  else if unicodeScalar < $800 then
    byteCount := 2
  else if unicodeScalar < $10000 then
    byteCount := 3
  else if unicodeScalar < $110000 then
    byteCount := 4
  else
    raise Exception.Create('Invalid Unicode scalar');

  // Extract the individual bytes using bitwise operations
  for i := byteCount - 1 downto 0 do
  begin
    bytes[i] := $80 or (unicodeScalar and $3F);
    unicodeScalar := unicodeScalar shr 6;
  end;

  // Set the leading bits of each byte
  case byteCount of
    2:
      bytes[0] := $C0 or bytes[0];
    3:
      bytes[0] := $E0 or bytes[0];
    4:
      bytes[0] := $F0 or bytes[0];
  end;
end;

Regards,
Ryan Joseph


Re: [fpc-pascal] Parse unicode scalar