> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal
> <fpc-pascal@lists.freepascal.org> wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?
If they give the parser broken files, that's their problem to fix. The user has
control over the file, so it's their responsibility, I think.
>
>
>> Right now I've just read the file into an AnsiString and indexing
>> assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
>
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer):
> Cardinal;
I'm not sure how this works. You need to advance codepoint by codepoint, so the
return value should give the byte offset of the next character, or something
like that.
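For what it's worth, a minimal sketch of how such a loop could look, assuming the LazUTF8 unit from Lazarus (the UTF8CodepointToUnicode signature is the one quoted above; DumpCodepoints is a hypothetical helper name):

```pascal
uses
  SysUtils, LazUTF8;

// Hypothetical helper: walk a UTF-8 string one codepoint at a time.
procedure DumpCodepoints(const S: string);
var
  P: PChar;
  CPLen: Integer;
  CP: Cardinal;
begin
  P := PChar(S);
  while P^ <> #0 do
  begin
    // CPLen receives the byte length of the codepoint starting at P
    CP := UTF8CodepointToUnicode(P, CPLen);
    WriteLn('U+', IntToHex(CP, 4), ' (', CPLen, ' byte(s))');
    Inc(P, CPLen); // advance by the codepoint's byte length, not by 1
  end;
end;
```

So the out parameter, not the return value, is what tells you how far to advance.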
>
>
>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
>
> I guess you know how to convert a hex to a dword.
Is there anything better than StrToInt? I wouldn't be able to do it myself
without that function.
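For reference, StrToInt in SysUtils already handles this: a '$' prefix makes it parse the string as hexadecimal, so converting the digits after \u is a one-liner. A small sketch:

```pascal
uses
  SysUtils;

var
  CP: Cardinal;
begin
  // The '$' prefix tells StrToInt to parse the digits as hexadecimal.
  CP := StrToInt('$1F496');
  WriteLn(CP); // 128150, i.e. the codepoint U+1F496
end.
```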
> Then
>
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to
> UTF8
>
OK, I think this is basically what the other programmer submitted and what
ChatGPT tried to do.
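Putting the two pieces together, decoding a \uXXXX escape could look roughly like this. A sketch assuming the LazUTF8 unit; DecodeEscape is a hypothetical helper, and HexDigits stands for the hex digits the parser has already scanned out of the literal:

```pascal
uses
  SysUtils, LazUTF8;

// Hypothetical helper: turn the hex digits of a \u escape into UTF-8 bytes.
function DecodeEscape(const HexDigits: string): string;
var
  CP: Cardinal;
begin
  CP := StrToInt('$' + HexDigits); // e.g. '1F496' -> 128150
  Result := UnicodeToUTF8(CP);     // UTF-32 codepoint -> UTF-8 string
end;
```

The decoded string can then be appended to whatever buffer the parser is building for the literal.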
Regards,
Ryan Joseph
_______________________________________________
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal