> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal 
> <fpc-pascal@lists.freepascal.org> wrote:
> 
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
> 
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?

If they give the parser broken files, that's their problem to fix. The user has 
control over the file, so it's their responsibility, I think.

> 
> 
>> Right now I've just read the file into an AnsiString and I'm indexing it
>> assuming a fixed character size, which of course breaks if multi-byte
>> characters exist
> 
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> 
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

Not sure how this works. You need to advance character by character, so the 
return value should be the byte location of the next character, or something 
like that.
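
Or maybe the CodepointLen out parameter is the byte length of the current 
codepoint, so that's what you advance by? Something like this rough sketch, 
assuming the LazUTF8 unit (DumpCodepoints is just a placeholder name I made up):

uses SysUtils, LazUTF8;

procedure DumpCodepoints(const S: AnsiString);
var
  P, PEnd: PChar;
  CP: Cardinal;
  Len: Integer;
begin
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    CP := UTF8CodepointToUnicode(P, Len); // decode the codepoint starting at P
    WriteLn('U+', IntToHex(CP, 4));       // do something with the codepoint
    Inc(P, Len);                          // advance by its byte length
  end;
end;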

> 
> 
>> I also need to handle the case where I come across something like \u1F496
>> and have to convert it to a Unicode character.
> 
> I guess you know how to convert a hex to a dword.

Is there anything better than StrToInt? I wouldn't be able to do it myself 
without that function, though.
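
At least StrToInt already understands hex when the string has a '$' prefix, so 
I'm assuming a one-liner like this is all that's needed (HexDigits being 
whatever follows the \u):

CodePoint := StrToInt('$' + HexDigits); // e.g. '1F496' -> $1F496 = 128150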

> Then
> 
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> 

OK, I think this is basically what the other programmer submitted and what 
ChatGPT tried to do.
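
Putting the two steps together, I imagine the whole escape decode looks roughly 
like this (DecodeUnicodeEscape is just a name I made up, and I'm assuming the 
escape carries plain hex digits):

uses SysUtils, LazUTF8;

function DecodeUnicodeEscape(const HexDigits: string): string;
var
  CodePoint: Cardinal;
begin
  CodePoint := StrToInt('$' + HexDigits); // e.g. '1F496' -> $1F496
  Result := UnicodeToUTF8(CodePoint);     // codepoint -> UTF-8 byte sequence
end;

So DecodeUnicodeEscape('1F496') should give back the UTF-8 bytes for U+1F496, 
which can be appended straight to the output string.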

Regards,
Ryan Joseph

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
