Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 09:12, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal wrote: For console apps that use the Unicode KVM video unit, I've introduced two functions for determining the display width of a Unicode string in the video unit: function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): Integer; { Returns the number of display columns needed for the given extended grapheme cluster } function StringDisplayWidth(const S: UnicodeString): Integer; { Returns the number of display columns needed for the given string } Remember, the display width is different than the number of graphemes, due to East Asian double width characters. And these work with UnicodeString, which is UTF-16, not UTF-8. But Free Pascal can convert between the two. is there an example snippet of how all this works? It's too level for newbies to understand. :) Rendering Unicode to the screen is not for newbies :) Using Unicode (where another library, like GTK or QT or the console deals with it) is another matter. What is it that you need to do? From your emails I get the impression you're writing a parser for a language. For that, you don't usually need this sort of "length". If you're making a GUI app, e.g. with the LCL, there should be ways to determine the display length of a text control? Generally, you should use your GUI or TUI toolkit. The Unicode version of Free Vision is for fullscreen TUI apps, like the console IDE (which does not yet support Unicode). If that's what you want, here's a starting point: https://wiki.freepascal.org/Free_Vision#Unicode_version Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 12:38 PM, Nikolay Nikolov via fpc-pascal > wrote: > > For console apps that use the Unicode KVM video unit, I've introduced two > functions for determining the display width of a Unicode string in the video > unit: > > function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): > Integer; > { Returns the number of display columns needed for the given extended > grapheme cluster } > > function StringDisplayWidth(const S: UnicodeString): Integer; > { Returns the number of display columns needed for the given string } > > Remember, the display width is different than the number of graphemes, due to > East Asian double width characters. > > And these work with UnicodeString, which is UTF-16, not UTF-8. But Free > Pascal can convert between the two. is there an example snippet of how all this works? It's too level for newbies to understand. :) Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 08:08, Nikolay Nikolov wrote: On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on what's being parsed and if you care or not (I'm doing a TOML parser). Sorry I'm still curious even though it's not my current problem :) How can I make this program output the expected results: w: widechar; a: array of widechar; begin for w in 'abc🐻' do a += [w]; // Outputs 7 instead of 4 writeln(length(a)); end; The user doesn't know about unicode they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem? Depends on what you need, but I suppose in this case you want to count the number of extended grapheme clusters (a.k.a. "user perceived characters" - how many character-like things are displayed on the screen). You might be tempted to count the number of Unicode code points, but that's not the same, due to the existence of combining characters: https://en.wikipedia.org/wiki/Combining_character For extended grapheme clusters, there's an iterator in the graphemebreakproperty unit. I implemented this for the Unicode KVM and FreeVision. There it's needed for figuring out how many character blocks in the console will be needed to display a certain string. For the console or other GUIs that use fixed width fonts, there's also the East Asian Width property as well - some characters (East Asian - Chinese, Japanese, Korean) take double the space. So, to figure out where to move the cursor, you need to take East Asian Width as well. 
For console apps that use the Unicode KVM video unit, I've introduced two functions for determining the display width of a Unicode string in the video unit: function ExtendedGraphemeClusterDisplayWidth(const EGC: UnicodeString): Integer; { Returns the number of display columns needed for the given extended grapheme cluster } function StringDisplayWidth(const S: UnicodeString): Integer; { Returns the number of display columns needed for the given string } Remember, the display width is different than the number of graphemes, due to East Asian double width characters. And these work with UnicodeString, which is UTF-16, not UTF-8. But Free Pascal can convert between the two. Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on what's being parsed and if you care or not (I'm doing a TOML parser). Sorry I'm still curious even though it's not my current problem :) How can I make this program output the expected results: w: widechar; a: array of widechar; begin for w in 'abc🐻' do a += [w]; // Outputs 7 instead of 4 writeln(length(a)); end; The user doesn't know about unicode they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem? Depends on what you need, but I suppose in this case you want to count the number of extended grapheme clusters (a.k.a. "user perceived characters" - how many character-like things are displayed on the screen). You might be tempted to count the number of Unicode code points, but that's not the same, due to the existence of combining characters: https://en.wikipedia.org/wiki/Combining_character For extended grapheme clusters, there's an iterator in the graphemebreakproperty unit. I implemented this for the Unicode KVM and FreeVision. There it's needed for figuring out how many character blocks in the console will be needed to display a certain string. For the console or other GUIs that use fixed width fonts, there's also the East Asian Width property as well - some characters (East Asian - Chinese, Japanese, Korean) take double the space. So, to figure out where to move the cursor, you need to take East Asian Width as well. Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:50 AM, Hairy Pixels wrote: > > You know you're right, with properly enclosed patterns you can capture > everything inside and it works. You won't know if you had unicode in your > string or not though but that depends on what's being parsed and if you care > or not (I'm doing a TOML parser). Sorry I'm still curious even though it's not my current problem :) How can I make this program output the expected results: w: widechar; a: array of widechar; begin for w in 'abc🐻' do a += [w]; // Outputs 7 instead of 4 writeln(length(a)); end; The user doesn't know about unicode they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem? Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal > wrote: > > But you just don't need to do this, in order to tokenize Pascal. The > beginning and the end of the string literal is the apostrophe, which is > ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler), > that will not be mistaken for an apostrophe, or end of line, because they > will have their high bit set. There's simply no need for a Pascal tokenizer > to iterate over UTF-8 code points, instead of code units. You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on what's being parsed and if you care or not (I'm doing a TOML parser). Maybe I can skip that part and just focus on the decoding of the unicode scalars Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:45, Nikolay Nikolov wrote: On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote: For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties: 1) ASCII character are encoded as they are - by using bytes in the range #0..#127 2) non-ASCII characters will always use a sequence of bytes, that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII. So, the tokenizer just works with UTF-8 like with any other 8-bit code page. yes this works until you reach a non-ASCII ranged character and then the character index no longer matches the string 1 to 1. For example consider this was pascal: i := '🐻'; You can advance by index like: Inc(currrentIndex); c := text[currentIndex]; but once you hit the bear the offset is now wrong so you can't advance to the next character by doing +1. But you just don't need to do this, in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler), that will not be mistaken for an apostrophe, or end of line, because they will have their high bit set. There's simply no need for a Pascal tokenizer to iterate over UTF-8 code points, instead of code units. Sorry, the last sentence should read: "There's simply no need for a Pascal tokenizer to iterate over Unicode code points, instead of UTF-8 code units." Hope that makes it more clear and accurate. Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
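The point about tokenizing on code units can be sketched as follows. This is only an illustration, not code from the FPC scanner: FindLiteralEnd is a hypothetical helper, and it ignores Pascal's doubled '' escape. It finds the closing apostrophe by comparing single bytes, and the four bytes of the bear (240 159 144 187) can never match because each has its high bit set.

```pascal
program LiteralScanDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: Start is the 1-based index of the opening apostrophe.
// Returns the index of the closing apostrophe, or 0 if the literal is not
// terminated before the end of the line or of the buffer.
// Simplified: the doubled '' escape of Pascal is not handled.
function FindLiteralEnd(const Src: RawByteString; Start: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := Start + 1 to Length(Src) do
    case Src[i] of
      '''': Exit(i);        // ASCII apostrophe: end of the literal
      #10, #13: Exit(0);    // end of line inside the literal
    end;
end;

var
  Src: RawByteString;
  Last: SizeInt;
begin
  // The bytes of:  i := '<bear>';   with the bear given as its UTF-8 bytes
  Src := 'i := '''#240#159#144#187''';';
  Last := FindLiteralEnd(Src, 6);
  WriteLn('closing apostrophe at byte ', Last);        // 11
  WriteLn('literal payload bytes: ', Last - 6 - 1);    // 4
end.
```

The payload is copied as opaque bytes; nothing in the loop needs to know where one code point ends and the next begins.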
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal wrote: For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties: 1) ASCII character are encoded as they are - by using bytes in the range #0..#127 2) non-ASCII characters will always use a sequence of bytes, that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII. So, the tokenizer just works with UTF-8 like with any other 8-bit code page. yes this works until you reach a non-ASCII ranged character and then the character index no longer matches the string 1 to 1. For example consider this was pascal: i := '🐻'; You can advance by index like: Inc(currrentIndex); c := text[currentIndex]; but once you hit the bear the offset is now wrong so you can't advance to the next character by doing +1. But you just don't need to do this, in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler), that will not be mistaken for an apostrophe, or end of line, because they will have their high bit set. There's simply no need for a Pascal tokenizer to iterate over UTF-8 code points, instead of code units. Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal > wrote: > > For what grammar? What characters are allowed in a token? For example, Free > Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, > it doesn't need to understand Unicode characters, so it works on the byte > (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code > units). That's because UTF-8 has two nice properties: > > 1) ASCII character are encoded as they are - by using bytes in the range > #0..#127 > > 2) non-ASCII characters will always use a sequence of bytes, that are all in > the range #128..#255 (they have their highest bit set), so they will never be > misinterpreted as ASCII. > > So, the tokenizer just works with UTF-8 like with any other 8-bit code page. yes this works until you reach a non-ASCII ranged character and then the character index no longer matches the string 1 to 1. For example consider this was pascal: i := '🐻'; You can advance by index like: Inc(currrentIndex); c := text[currentIndex]; but once you hit the bear the offset is now wrong so you can't advance to the next character by doing +1. Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal wrote: You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require careful understanding what Unicode thing you need to iterate over. Thanks for trying to help but this is more complicated than I thought and I don't have the patience for a deep dive right now :) Unicode is complicated under the hood but we should have some libraries to help right? I mean the user thinks of these things as "characters" be it "A" or the unicode symbol 👍 so we should be able to operate on that basis as well. Something like an iterator that return the character (wide char) and byte offset or writing would be a nice place to start. I have a parser/tokenizer I want to update so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file and probably the byte offset also so it can be referenced later. For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties: 1) ASCII character are encoded as they are - by using bytes in the range #0..#127 2) non-ASCII characters will always use a sequence of bytes, that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII. So, the tokenizer just works with UTF-8 like with any other 8-bit code page. 
Nikolay
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal > wrote: > > You need to understand all these terms and know exactly what you need to do. > E.g. are you dealing with keyboard input, are you dealing with the low level > parts of text display, are you searching for something in the text, are you > just passing strings around and letting the GUI deal with it? These are all > different use cases, and they require careful understanding what Unicode > thing you need to iterate over. Thanks for trying to help but this is more complicated than I thought and I don't have the patience for a deep dive right now :) Unicode is complicated under the hood but we should have some libraries to help right? I mean the user thinks of these things as "characters" be it "A" or the unicode symbol 👍 so we should be able to operate on that basis as well. Something like an iterator that return the character (wide char) and byte offset or writing would be a nice place to start. I have a parser/tokenizer I want to update so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file and probably the byte offset also so it can be referenced later. Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote: On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal wrote: function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt; // returns the number of codepoints var CodePointLen: longint; CodePoint: longword; begin Result:=0; while (ByteCount>0) do begin inc(Result); CodePoint:=UTF8CodepointToUnicode(p,CodePointLen); ...do something with the CodePoint... inc(p,CodePointLen); dec(ByteCount,CodePointLen); end; end; Thanks, this looks right. I guess this is how we need to iterate over unicode now. Btw, why isn't there a for-loop we can use over unicode strings? seems like that should be supported out of the box. I had this same problem in Swift also where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also so we need some good library functions. You're still confusing the Unicode terms. The above code iterates over Unicode Code Points, not "characters" in a UTF-8 encoded string. A Unicode Code Point is not a "character": https://unicode.org/glossary/#character https://unicode.org/glossary/#code_point There are also graphemes, grapheme clusters and extended grapheme clusters - these terms can also be perceived as "characters". https://unicode.org/glossary/#grapheme https://unicode.org/glossary/#grapheme_cluster https://unicode.org/glossary/#extended_grapheme_cluster If you want to iterate over extended grapheme clusters, for example, there's an iterator (written by me) in the unit graphemebreakproperty.pp in the rtl-unicode package. If you use the 'char' type in Pascal to iterate over an UTF-8 encoded string, you're iterating over Unicode code units (units! not code points! https://unicode.org/glossary/#code_unit). If you use the 'widechar' type in Pascal to iterate over a UnicodeString (which is a UTF-16 encoded string), you're also iterating over Unicode code units, however this time in UTF-16 encoding. 
If you want to iterate over Unicode code points (not units! not characters! not graphemes!) in a UTF-8 string, you need something like the ReadUTF8 function above. If you want to iterate over Unicode code points in a UTF-16 string, you need different code. You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require careful understanding what Unicode thing you need to iterate over. Nikolay ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
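For the UTF-16 case mentioned above, a minimal sketch might look like this. CountCodePoints is a hypothetical helper, not an RTL function. It walks a UnicodeString and treats each valid surrogate pair as one code point; an unpaired surrogate is counted on its own. The bear (U+1F43B) is built from its two UTF-16 code units, $D83D and $DC3B, so the example does not depend on the encoding of the source file.

```pascal
program Utf16Demo;
{$mode objfpc}{$H+}

// Hypothetical helper: number of Unicode code points in a UTF-16 string.
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    // high surrogate followed by low surrogate = one code point in two units
    if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and (i < Length(S)) and
       (Ord(S[i + 1]) >= $DC00) and (Ord(S[i + 1]) <= $DFFF) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  S := 'abc';
  S := S + WideChar($D83D) + WideChar($DC3B);   // U+1F43B as a surrogate pair
  WriteLn(Length(S));            // 5 code units
  WriteLn(CountCodePoints(S));   // 4 code points
end.
```

This still counts code points, not extended grapheme clusters; a base letter followed by a combining accent would count as two.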
Re: [fpc-pascal] Parse unicode scalar
> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal > wrote: > > function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt; > // returns the number of codepoints > var > CodePointLen: longint; > CodePoint: longword; > begin > Result:=0; > while (ByteCount>0) do begin >inc(Result); >CodePoint:=UTF8CodepointToUnicode(p,CodePointLen); >...do something with the CodePoint... >inc(p,CodePointLen); >dec(ByteCount,CodePointLen); > end; > end; Thanks, this looks right. I guess this is how we need to iterate over unicode now. Btw, why isn't there a for-loop we can use over unicode strings? seems like that should be supported out of the box. I had this same problem in Swift also where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also so we need some good library functions. Swift is especially terrible because there's NO ANSII string so even a 1 byte sequence needs all these confusing as hell functions to do any work with strings at all. Terrible experience and slow. Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 17:18:56 +0700 Hairy Pixels via fpc-pascal wrote:
>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?
>
> If they give the parser broken files that's their problem they need
> to fix? the user has control over the file so it's their
> responsibility I think.

User's responsibility? - I recommend checking for malicious sequences. ;)

> >> Right now I've just read the file into an AnsiString and indexing
> >> assuming a fixed character size, which breaks of course if non-1
> >> byte characters exist
> >
> > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> >
> > function UTF8CodepointToUnicode(p: PChar; out CodepointLen:
> > integer): Cardinal;
>
> Not sure how this works. You need to advance by character so the
> return value should be the byte location of the next character or
> something like that.

function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;

> >> I also need to know if I come across something like \u1F496 I need
> >> to convert that to a unicode character.
> >
> > I guess you know how to convert a hex to a dword.
>
> Is there anything better than StrToInt?

Good start.

> I wouldn't be able to do it
> myself though without that function.

Hex to dword. That's easy enough for ChatGPT.

> > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
>
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.

Yes, no need to reinvent the wheel.
Mattias
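A self-contained variant of the ReadUTF8 loop is sketched below for readers without Lazarus at hand. DecodeUTF8 is a hypothetical stand-in for LazUTF8's UTF8CodepointToUnicode, written only so the program compiles with plain FPC; it is not the LazUTF8 implementation. It returns U+FFFD and consumes one byte when the lead byte is invalid, a continuation byte is missing, or the buffer ends early. It does not reject overlong forms or encoded surrogates, so it is not a full validator. The loop also prints the byte offset of each code point, which is what a tokenizer would store for later reference.

```pascal
program IterateDemo;
{$mode objfpc}{$H+}

// Hypothetical decoder (assumption: stands in for UTF8CodepointToUnicode).
// Len receives the number of bytes consumed; it is 1 on any error.
function DecodeUTF8(p: PByte; Remaining: PtrInt; out Len: Integer): Cardinal;
var
  i, Need: Integer;
  cp: Cardinal;
begin
  Len := 1;
  Result := $FFFD;                 // replacement character for broken input
  if p^ < $80 then
    Exit(p^)                       // ASCII, one byte
  else if (p^ and $E0) = $C0 then
    Need := 2
  else if (p^ and $F0) = $E0 then
    Need := 3
  else if (p^ and $F8) = $F0 then
    Need := 4
  else
    Exit;                          // stray continuation or invalid lead byte
  if Need > Remaining then
    Exit;                          // sequence cut off by the end of the buffer
  cp := p^ and ($FF shr (Need + 1));   // payload bits of the lead byte
  for i := 1 to Need - 1 do
  begin
    if (p[i] and $C0) <> $80 then
      Exit;                        // expected a 10xxxxxx continuation byte
    cp := (cp shl 6) or (p[i] and $3F);
  end;
  Len := Need;
  Result := cp;
end;

// Same shape as the ReadUTF8 above: returns the number of code points.
function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
var
  Offset: PtrInt;
  Len: Integer;
  cp: Cardinal;
begin
  Result := 0;
  Offset := 0;
  while Offset < ByteCount do
  begin
    cp := DecodeUTF8(PByte(p) + Offset, ByteCount - Offset, Len);
    WriteLn('byte offset ', Offset, ': U+', HexStr(LongInt(cp), 6),
            ', ', Len, ' byte(s)');
    Inc(Offset, Len);
    Inc(Result);
  end;
end;

var
  s: RawByteString;
begin
  s := 'a'#195#161#240#159#144#187;   // 'a', U+00E1, U+1F43B
  WriteLn(ReadUTF8(PChar(s), Length(s)), ' code points');
end.
```

For this input the code points start at byte offsets 0, 1 and 3, and the function returns 3.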
Re: [fpc-pascal] Parse unicode scalar
On 03/07/2023 at 10:27, Hairy Pixels via fpc-pascal wrote:
> Right now I've just read the file into an AnsiString and indexing assuming a fixed character size, which breaks of course if non-1 byte characters exist
>
> I also need to know if I come across something like \u1F496 I need to convert that to a unicode character.

Hello,

You are mixing up a lot of concepts: ASCII, Unicode, grapheme, representation, content, etc. When talking about Unicode you must forget ASCII. The text is a sequence of bytes encoded in a specific format (UTF-8, UTF-16, UTF-32, ...), and it must be displayed on screen using Unicode rendering rules, which are not the same as for ASCII.

To keep this message fairly short, think of a text with only one "letter": "á"

This text (text, not one letter; Unicode is about texts) can be transmitted or stored using one of the Unicode encodings, each of which produces a sequence of bytes according to its own rules. In hexadecimal:

UTF-8:    C3 A1
UTF-16BE: 00 E1
UTF-32BE: 00 00 00 E1

(In the little-endian forms the bytes are reversed: E1 00 for UTF-16LE and E1 00 00 00 for UTF-32LE.)

You must know the encoding in advance to get the text back from the byte sequence. There is also the BOM (Byte Order Mark), which is sometimes used at the start of a file to indicate the encoding, but in general it is not used.

Decoding that sequence of bytes with the right encoding gives you a text that represents the letter "a" with an acute accent. But Unicode is *not* that *simple*: the same text can also be written as the letter "a" followed by a "combining acute accent". The byte sequence is then completely different, yet it renders identically.

So these two UTF-8 sequences, "C3 A1" and "61 CC 81", differ at the code point level and the encoding level but are identical when rendered.

As a final note, this is the UTF-8 byte sequence for one single "character" on screen:

F0 9F 8F B4 F3 A0 81 A7 F3 A0 81 A2 F3 A0 81 B3 F3 A0 81 A3 F3 A0 81 B4 F3 A0 81 BF

Unicode is far, far from easy.

Have a nice day.
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal > wrote: > >> What I'm really trying to do is improve a parser so it can read UTF-8 >> files and decode unicode literals in the grammar. > > First of all: Is it valid UTF-8 or do you have to check for broken or > malicious sequences? If they give the parser broken files that's their problem they need to fix? the user has control over the file so it's there responsibility I think. > > >> Right now I've just read the file into an AnsiString and indexing >> assuming a fixed character size, which breaks of course if non-1 byte >> characters exist > > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful: > > function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): > Cardinal; Not sure how this works. You need to advance by character so there return value should be the byte location of the next character or something like that. > > >> I also need to know if I come across something like \u1F496 I need >> to convert that to a unicode character. > > I guess you know how to convert a hex to a dword. Is there anything better than StrToInt? I wouldn't be able to do it myself though without that function. > Then > > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8 > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to > UTF8 > Ok I think this is basically what the other programmer submitted and what ChatGPT tried to do. Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
Hi Ryan,

I've created the attached unit, which takes a code point and returns the UTF-8 character as a string. It's based on the Wikipedia article on UTF-8.

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. The x characters are replaced by the bits of the code point:

[table copied from Wikipedia; not preserved in the archive]

Attachment: uencoding.pas (binary data)

Hope it's useful for you. If you improve the code, please let me know.

Best regards,
Jeroen

On 2 Jul 2023, at 15:30, Hairy Pixels via fpc-pascal wrote:
> I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖).
>
> Next I cheated and asked ChatGPT. :) Amazingly from my question it was able to tell me the scalar is comprised of these 4 bytes: 240 159 146 150
>
> I was able to correctly concatenate these characters and writeln printed the correct character.
>
> var
>   s: String;
> begin
>   s := char(240)+char(159)+char(146)+char(150);
>   writeln(s);
> end.
>
> The question is, how was 1F496 decomposed into 4 bytes?
>
> Regards,
> Ryan Joseph
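For reference, the standard UTF-8 bit layout that the message describes is (x marks the bits taken from the code point):

U+0000..U+007F      0xxxxxxx
U+0080..U+07FF      110xxxxx 10xxxxxx
U+0800..U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
U+10000..U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U+1F496 is 0 0001 1111 0100 1001 0110 in 21 bits, which splits as 000 | 011111 | 010010 | 010110 and gives 11110000 10011111 10010010 10010110, that is 240 159 146 150. The sketch below applies the same layout. CodePointToUTF8 is a hypothetical name and this is not the attached uencoding.pas, whose contents are not shown in the thread.

```pascal
program EncodeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: encodes one code point as UTF-8 bytes.
// Returns an empty string for surrogates and for values above U+10FFFF.
function CodePointToUTF8(cp: Cardinal): RawByteString;
begin
  Result := '';
  if cp <= $7F then
  begin
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(cp));
  end
  else if cp <= $7FF then
  begin
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (cp shr 6)));
    Result[2] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else if (cp >= $D800) and (cp <= $DFFF) then
    Exit                                   // surrogates are not scalar values
  else if cp <= $FFFF then
  begin
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (cp shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else if cp <= $10FFFF then
  begin
    SetLength(Result, 4);
    Result[1] := AnsiChar(Byte($F0 or (cp shr 18)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (cp and $3F)));
  end;
end;

var
  s: RawByteString;
  i: Integer;
begin
  s := CodePointToUTF8($1F496);
  for i := 1 to Length(s) do
    Write(Ord(s[i]), ' ');     // 240 159 146 150
  WriteLn;
end.
```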
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 15:27:10 +0700 Hairy Pixels via fpc-pascal wrote: >[...] > I was just curious how ChatGPTs implementation compared to other > programmer. Apparently the quality is often terrible. But it can be useful. > What I'm really trying to do is improve a parser so it can read UTF-8 > files and decode unicode literals in the grammar. First of all: Is it valid UTF-8 or do you have to check for broken or malicious sequences? > Right now I've just read the file into an AnsiString and indexing > assuming a fixed character size, which breaks of course if non-1 byte > characters exist Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful: function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal; > I also need to know if I come across something like \u1F496 I need > to convert that to a unicode character. I guess you know how to convert a hex to a dword. Then function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8 function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8 Mattias ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
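As a sketch of the "hex to dword" step for an escape such as \u1F496: the hex digits can be handed to SysUtils.TryStrToInt64 with a '$' prefix, which FPC reads as hexadecimal. HexToCodePoint is a hypothetical helper name, and returning -1 for invalid input is an assumption made for this example. The result could then be passed to one of the UnicodeToUTF8 overloads listed above.

```pascal
program EscapeDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: Hex holds only the digits after "\u".
// Returns the code point, or -1 if the digits are not valid hex or the
// value is a surrogate or lies above U+10FFFF.
function HexToCodePoint(const Hex: string): Int64;
begin
  if (Length(Hex) < 1) or (Length(Hex) > 8) or
     not TryStrToInt64('$' + Hex, Result) then
    Exit(-1);
  if (Result < 0) or (Result > $10FFFF) or
     ((Result >= $D800) and (Result <= $DFFF)) then
    Result := -1;
end;

begin
  WriteLn(HexToCodePoint('1F496'));   // 128150
  WriteLn(HexToCodePoint('D800'));    // -1, a surrogate
  WriteLn(HexToCodePoint('xyz'));     // -1, not hexadecimal
end.
```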
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 3:05 PM, Mattias Gaertner via fpc-pascal > wrote: > > I wonder, is this thread about testing ChatGPT or do you want to > implement something useful? > There are already plenty of optimized UTF-8 functions in the FPC and > Lazarus sources. Maybe too many, and you have trouble finding the right > one? Just ask what your function needs to do. I was just curious how ChatGPTs implementation compared to other programmer. What I'm really trying to do is improve a parser so it can read UTF-8 files and decode unicode literals in the grammar. Right now I've just read the file into an AnsiString and indexing assuming a fixed character size, which breaks of course if non-1 byte characters exist I also need to know if I come across something like \u1F496 I need to convert that to a unicode character. Regards, Ryan Joseph ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 12:01:11 +0700 Hairy Pixels via fpc-pascal wrote: > > On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal > > wrote: > > > > Useless array of. > > And it does not return the bytecount. > > it's an open array so what's the problem? >[...] > > Wrong for byteCount=1 > > really? How so? > > ChatGPT is risky because it will give wrong information with perfect > confidence and there's no way for the ignorant person to know. I wonder, is this thread about testing ChatGPT or do you want to implement something useful? There are already plenty of optimized UTF-8 functions in the FPC and Lazarus sources. Maybe too many, and you have trouble finding the right one? Just ask what your function needs to do. Mattias ___ fpc-pascal maillist - fpc-pascal@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 14:12:03 +0700 Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
> >
> > No - in this case, the "header" is the highest bit of that byte
> > being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.

The first byte of a UTF-8 codepoint is 0..127 or 192..247. The second, third and fourth bytes are in 128..191, so you can easily detect where a codepoint starts. And from the first byte you can derive the length of the codepoint.

If you just want to skip over n codepoints, then the function below does the job:

> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>   end;
> end;

Mattias
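As an illustration of stepping through a string with that lead-byte table, here is a small sketch that counts code points. CodepointSize mirrors the table quoted above; CountCodepoints is a hypothetical helper, and both assume the input is already valid UTF-8 (no validation is attempted).

```pascal
program CountSketch;
{$mode objfpc}{$H+}

// Same lead-byte table as the UTF8CodepointSizeFast quoted in the thread.
function CodepointSize(p: PChar): Integer;
begin
  case p^ of
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 1;   // ASCII, or a byte that is not a valid lead byte
  end;
end;

// Hypothetical helper: counts code points in a string assumed to be valid UTF-8.
function CountCodepoints(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    Inc(i, CodepointSize(@s[i]));   // jump straight to the next lead byte
    Inc(Result);
  end;
end;

var
  s: AnsiString;
begin
  // 'a' (1 byte), U+00E9 (2 bytes: 195 169), U+1F496 (4 bytes)
  s := 'a' + Chr(195) + Chr(169) + Chr(240) + Chr(159) + Chr(146) + Chr(150);
  WriteLn(Length(s));             // 7 bytes
  WriteLn(CountCodepoints(s));    // 3 code points
end.
```

The same loop, stopped after n iterations, gives the byte offset of the n-th code point.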
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 9:12:03 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>>
>> No - in this case, the "header" is the highest bit of that byte being 0.
>
>Oh it's the header BIT. Admittedly I don't understand how this function
>returns the highest bit using that case, which I think he was suggesting.
>
>function UTF8CodepointSizeFast(p: PChar): integer;
>begin
>  case p^ of
>    #0..#191   : Result := 1;
>    #192..#223 : Result := 2;
>    #224..#239 : Result := 3;
>    #240..#247 : Result := 4;
>    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>  end;
>end;

That's why I wrote "in this case". The "header" itself is not fixed size either, but the algorithm above shows how you can derive the length from the first byte.

Tomas
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal wrote:
>
> No - in this case, the "header" is the highest bit of that byte being 0.

Oh it's the header BIT. Admittedly I don't understand how this function returns the highest bit using that case, which I think he was suggesting.

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 3 July 2023 8:42:05 +0200, Hairy Pixels via fpc-pascal wrote:
>> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>>
>> No, the header of a codepoint to figure out the length.
>
>so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and
>1 for the character?
>
>ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2
>bytes?

No - in this case, the "header" is the highest bit of that byte being 0.

Tomas
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 12:04 PM, Mattias Gaertner via fpc-pascal wrote:
>
> No, the header of a codepoint to figure out the length.

so the smallest character UTF-8 can represent is 2 bytes? 1 for the header and 1 for the character?

ASCII #100 is the same character in UTF-8 but it needs a header byte, so 2 bytes?

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 11:58:33 +0700 Hairy Pixels via fpc-pascal wrote:
> > On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal wrote:
> >
> > There is a header byte.
> >
> > It depends, if you want to check for invalid UTF-8 sequences.
> >
> > From LazUTF8:
> >
> > function UTF8CodepointSizeFast(p: PChar): integer;
> > begin
> >   case p^ of
> >     #0..#191   : Result := 1;
> >     #192..#223 : Result := 2;
> >     #224..#239 : Result := 3;
> >     #240..#247 : Result := 4;
> >     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
> >   end;
> > end;
>
> This is a header for the file?

No, the header of a codepoint to figure out the length.

> Does that mean the file itself must
> have uniform character sizes?

No.

> I though the idea was to read the file
> one byte at a time but I don't understand how you would know if a 1
> byte character (like ascii) was part of a 4 byte character or not.

ASCII is #0..#127, which is the same character in UTF-8.

Mattias
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 11:36 AM, Mattias Gaertner via fpc-pascal wrote:
>
> Useless array of.
> And it does not return the bytecount.

it's an open array so what's the problem?

>
>> var
>>   i: Integer;
>>   byteCount: Integer;
>> begin
>>   // Number of bytes required to represent the Unicode scalar
>>   if unicodeScalar < $80 then
>>     byteCount := 1
>>   else if unicodeScalar < $800 then
>>     byteCount := 2
>>   else if unicodeScalar < $10000 then
>>     byteCount := 3
>>   else if unicodeScalar < $110000 then
>>     byteCount := 4
>>   else
>>     raise Exception.Create('Invalid Unicode scalar');
>>
>>   // Extract the individual bytes using bitwise operations
>>   for i := byteCount - 1 downto 0 do
>>   begin
>>     bytes[i] := $80 or (unicodeScalar and $3F);
>
> Wrong for byteCount=1

really? How so?

ChatGPT is risky because it will give wrong information with perfect confidence and there's no way for the ignorant person to know.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal wrote:
>
> There is a header byte.
>
> It depends, if you want to check for invalid UTF-8 sequences.
>
> From LazUTF8:
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>   end;
> end;

This is a header for the file? Does that mean the file itself must have uniform character sizes? I thought the idea was to read the file one byte at a time but I don't understand how you would know if a 1 byte character (like ascii) was part of a 4 byte character or not.

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 08:29:11 +0700 Hairy Pixels via fpc-pascal wrote:
> > On Jul 2, 2023, at 11:16 PM, Jer Haan wrote:
> >
> > This table is copied from Wikipedia. Hope it’s useful
> > for you. If you improve the code pls let me know.
>
> This is perfect, thanks! Much more complicated than I thought.
>
> I'm curious now, if you were going the other direction and parsing a
> string of different unicode characters with different code point
> sequence lengths how would you know which length it was? For example
> I started off know which unicode scalar to use by looking at a table
> but if I had to find the character is stream of text?
>
> I think UTF8 can have 1-4 byte characters so you could encounter 1
> byte character followed by 4 byte characters interleaved and there's
> no header or terminator for each character. How is this solved?

There is a header byte.

It depends, if you want to check for invalid UTF-8 sequences.

From LazUTF8:

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;

Mattias
Re: [fpc-pascal] Parse unicode scalar
On Mon, 3 Jul 2023 09:34:10 +0700 Hairy Pixels via fpc-pascal wrote:
>[...]
> Ok today I I just tried to ask ChatGPT and got an answer. I must have
> asked the wrong thing yesterday but it got it right today (with one
> syntax error using an inline "var" in the code section for some
> reason).
>
> How does this look?
>
> procedure SplitUTF8Bytes(unicodeScalar: Integer; var bytes: array of
> Byte);

Useless array of.
And it does not return the bytecount.

> var
>   i: Integer;
>   byteCount: Integer;
> begin
>   // Number of bytes required to represent the Unicode scalar
>   if unicodeScalar < $80 then
>     byteCount := 1
>   else if unicodeScalar < $800 then
>     byteCount := 2
>   else if unicodeScalar < $10000 then
>     byteCount := 3
>   else if unicodeScalar < $110000 then
>     byteCount := 4
>   else
>     raise Exception.Create('Invalid Unicode scalar');
>
>   // Extract the individual bytes using bitwise operations
>   for i := byteCount - 1 downto 0 do
>   begin
>     bytes[i] := $80 or (unicodeScalar and $3F);

Wrong for byteCount=1

>     unicodeScalar := unicodeScalar shr 6;
>   end;
>
>   // Set the leading bits of each byte
>   case byteCount of
>     2:
>       bytes[0] := $C0 or bytes[0];
>     3:
>       bytes[0] := $E0 or bytes[0];
>     4:
>       bytes[0] := $F0 or bytes[0];
>   end;
> end;

Well, it got the basic idea of UTF-8 multibytes right and it compiles, so maybe half the points?

Mattias
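For comparison, a sketch of how the points raised above might be addressed: a fixed-size buffer instead of an open array, the byte count as the function result, and no $80 marker on a single-byte (ASCII) value. EncodeUTF8 and TUTF8Bytes are hypothetical names chosen for this illustration; in real code the UnicodeToUTF8 routines from LazUTF8 mentioned elsewhere in the thread are the more likely choice.

```pascal
program EncodeSketch;
{$mode objfpc}{$H+}

type
  TUTF8Bytes = array[0..3] of Byte;

// Hypothetical corrected variant of the routine discussed above.
// Writes the UTF-8 bytes of CodePoint into Buf and returns the byte count,
// or 0 if CodePoint is a surrogate or lies beyond U+10FFFF.
function EncodeUTF8(CodePoint: Cardinal; out Buf: TUTF8Bytes): Integer;
var
  i: Integer;
begin
  if (CodePoint > $10FFFF) or
     ((CodePoint >= $D800) and (CodePoint <= $DFFF)) then
    Exit(0);
  if CodePoint < $80 then
  begin
    Buf[0] := Byte(CodePoint);   // ASCII is stored as-is, without a $80 marker
    Exit(1);
  end;
  if CodePoint < $800 then
    Result := 2
  else if CodePoint < $10000 then
    Result := 3
  else
    Result := 4;
  // Trailing bytes carry 6 payload bits each, marked as 10xxxxxx.
  for i := Result - 1 downto 1 do
  begin
    Buf[i] := Byte($80 or (CodePoint and $3F));
    CodePoint := CodePoint shr 6;
  end;
  // The lead byte holds the remaining bits plus the length marker.
  case Result of
    2: Buf[0] := Byte($C0 or CodePoint);
    3: Buf[0] := Byte($E0 or CodePoint);
    4: Buf[0] := Byte($F0 or CodePoint);
  end;
end;

var
  Buf: TUTF8Bytes;
  n, i: Integer;
begin
  n := EncodeUTF8($1F496, Buf);
  Write(n, ':');
  for i := 0 to n - 1 do
    Write(' ', Buf[i]);
  WriteLn;                       // 4: 240 159 146 150
end.
```

Returning 0 rather than raising an exception is only one possible convention; a parser may prefer to report the offending escape with its source position.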
Re: [fpc-pascal] Parse unicode scalar
> On Jul 3, 2023, at 12:20 AM, Nikolay Nikolov via fpc-pascal wrote:
>
> There's no such thing as "unicode scalar" in Unicode terminology:
>
> https://unicode.org/glossary/

I got it from here https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/

Ok today I just tried to ask ChatGPT and got an answer. I must have asked the wrong thing yesterday but it got it right today (with one syntax error using an inline "var" in the code section for some reason).

How does this look?

procedure SplitUTF8Bytes(unicodeScalar: Integer; var bytes: array of Byte);
var
  i: Integer;
  byteCount: Integer;
begin
  // Number of bytes required to represent the Unicode scalar
  if unicodeScalar < $80 then
    byteCount := 1
  else if unicodeScalar < $800 then
    byteCount := 2
  else if unicodeScalar < $10000 then
    byteCount := 3
  else if unicodeScalar < $110000 then
    byteCount := 4
  else
    raise Exception.Create('Invalid Unicode scalar');

  // Extract the individual bytes using bitwise operations
  for i := byteCount - 1 downto 0 do
  begin
    bytes[i] := $80 or (unicodeScalar and $3F);
    unicodeScalar := unicodeScalar shr 6;
  end;

  // Set the leading bits of each byte
  case byteCount of
    2:
      bytes[0] := $C0 or bytes[0];
    3:
      bytes[0] := $E0 or bytes[0];
    4:
      bytes[0] := $F0 or bytes[0];
  end;
end;

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
> On Jul 2, 2023, at 11:16 PM, Jer Haan wrote:
>
> This table is copied from Wikipedia. Hope it’s useful for you.
> If you improve the code pls let me know.
>

This is perfect, thanks! Much more complicated than I thought.

I'm curious now, if you were going the other direction and parsing a string of different unicode characters with different code point sequence lengths how would you know which length it was? For example I started off knowing which unicode scalar to use by looking at a table, but what if I had to find the character in a stream of text?

I think UTF8 can have 1-4 byte characters so you could encounter 1 byte characters followed by 4 byte characters interleaved and there's no header or terminator for each character. How is this solved?

Regards,
Ryan Joseph
Re: [fpc-pascal] Parse unicode scalar
On 7/2/23 20:38, Martin Frb via fpc-pascal wrote:
> On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote:
>> On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
>>> I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖).
>>
>> There's no such thing as "unicode scalar" in Unicode terminology:
>>
>> https://unicode.org/glossary/
>
> There seems to be
> https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G7404

Too bad it's not included in the Unicode glossary. :( So, it's basically a Unicode code point that is not a high-surrogate or low-surrogate. And if you want to know what "high-surrogate" and "low-surrogate" mean, you should read about UTF-16.

>>> Next I cheated and ask ChatGPT. :) Amazingly from my question it was able to tell me the scaler is comprised of these 4 bytes:
>>>
>>> 240 159 146 150
>
> That is an utf-8 encoded representation of such a value.
> You can find them on https://www.compart.com/en/unicode/U+0041
> (using the hex for whatever codepoint interests you)

Or just learn about Unicode encodings, such as UTF-8, UTF-16, etc.

https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/UTF-32

Both UTF-8 and UTF-16 are frequently used and are important to know. UTF-32 is rarely used, but is very simple and easy to understand as well. It's just not very efficient, hence its rarity. :)

Nikolay
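To make the surrogate remark concrete, here is a small sketch of how UTF-16 represents a code point above U+FFFF as a high/low surrogate pair, using the U+1F496 example from the thread. ToSurrogates is a hypothetical helper and assumes its argument is in the range $10000..$10FFFF.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: splits a code point in $10000..$10FFFF into its
// UTF-16 high and low surrogate. The range is assumed, not checked.
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
var
  v: Cardinal;
begin
  v := CodePoint - $10000;                 // 20 bits remain
  HighSur := Word($D800 + (v shr 10));     // upper 10 bits
  LowSur  := Word($DC00 + (v and $3FF));   // lower 10 bits
end;

var
  h, w: Word;
begin
  ToSurrogates($1F496, h, w);
  WriteLn(IntToHex(h, 4), ' ', IntToHex(w, 4));   // D83D DC96
end.
```

The values $D800..$DFFF are reserved for this purpose, which is why a scalar value is defined as any code point outside that range.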
Re: [fpc-pascal] Parse unicode scalar
On 02/07/2023 19:20, Nikolay Nikolov via fpc-pascal wrote:
> On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
>> I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖).
>
> There's no such thing as "unicode scalar" in Unicode terminology:
>
> https://unicode.org/glossary/

There seems to be
https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G7404

>> Next I cheated and ask ChatGPT. :) Amazingly from my question it was able to tell me the scaler is comprised of these 4 bytes:
>>
>> 240 159 146 150

That is an utf-8 encoded representation of such a value.
You can find them on https://www.compart.com/en/unicode/U+0041
(using the hex for whatever codepoint interests you)

>> The question is, how was 1F496 decomposed into 4 bytes?

https://en.wikipedia.org/wiki/UTF-8#Encoding
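As a worked answer to that last question, the sketch below slices the 21 payload bits of U+1F496 into the 3 + 6 + 6 + 6 groups of the four-byte UTF-8 layout and prints each resulting byte. It handles only this four-byte case and is meant as arithmetic made visible, not as a general encoder.

```pascal
program WorkedExample;
{$mode objfpc}{$H+}

const
  // $1F496 as 21 bits: 000 011111 010010 010110
  CP = $1F496;
begin
  // Four-byte layout: 11110aaa 10bbbbbb 10cccccc 10dddddd
  WriteLn($F0 or (CP shr 18));             // aaa    = 000    -> 240
  WriteLn($80 or ((CP shr 12) and $3F));   // bbbbbb = 011111 -> 159
  WriteLn($80 or ((CP shr 6) and $3F));    // cccccc = 010010 -> 146
  WriteLn($80 or (CP and $3F));            // dddddd = 010110 -> 150
end.
```

The four numbers match the bytes 240 159 146 150 quoted in the message.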
Re: [fpc-pascal] Parse unicode scalar
On 7/2/23 16:30, Hairy Pixels via fpc-pascal wrote:
> I'm interested in parsing unicode scalars (I think they're called) to byte sized values but I'm not sure where to start. First thing I did was choose the unicode scalar U+1F496 (💖).

There's no such thing as "unicode scalar" in Unicode terminology:

https://unicode.org/glossary/

So, what do you mean? A Unicode code point? An Extended Grapheme Cluster? Or something else? There are also several ways to encode Unicode into a byte sequence - UTF-8, UTF-16LE, UTF-16BE, UTF-32, etc.

> Next I cheated and ask ChatGPT. :) Amazingly from my question it was able to tell me the scaler is comprised of these 4 bytes:
>
> 240 159 146 150
>
> I was able to correctly concatenate these characters and writeln printed the correct character.
>
> var
>   s: String;
> begin
>   s := char(240)+char(159)+char(146)+char(150);
>   writeln(s);
> end.
>
> The question is, how was 1F496 decomposed into 4 bytes?

I guess you should ask ChatGPT, who gave you the answer ;-)

Nikolay