Re: [fpc-pascal] Parse unicode scalar

Hairy Pixels via fpc-pascal Mon, 03 Jul 2023 21:41:03 -0700


> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
> <[email protected]> wrote:
> 
> For what grammar? What characters are allowed in a token? For example, Free 
> Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, 
> it doesn't need to understand Unicode characters, so it works on the byte 
> (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code 
> units). That's because UTF-8 has two nice properties:
> 
> 1)  ASCII characters are encoded as they are - by using bytes in the range 
> #0..#127
> 
> 2) non-ASCII characters will always use a sequence of bytes, that are all in 
> the range #128..#255 (they have their highest bit set), so they will never be 
> misinterpreted as ASCII.
> 
> So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
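
The two properties quoted above can be illustrated with a small byte-level scan. This is only a sketch, not how the Free Pascal tokenizer is actually written: FindAscii and src are hypothetical names, and the bear is spelled out as its four UTF-8 bytes so the example does not depend on the source file's encoding. Because every byte of a non-ASCII character has its highest bit set, searching for the ASCII quote can never match inside the bear.

```pascal
program ByteLevelScan;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the byte index of the first occurrence of the
// ASCII character delim at or after start, or 0 if there is none.
function FindAscii(const s: RawByteString; delim: AnsiChar; start: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := start to Length(s) do
    if s[i] = delim then
    begin
      Result := i;
      Exit;
    end;
end;

var
  src: RawByteString;
  openQuote, closeQuote: Integer;
begin
  // i := '<bear>';  with U+1F43B written as its UTF-8 bytes F0 9F 90 BB
  src := 'i := '''#$F0#$9F#$90#$BB''';';

  // Both searches compare single bytes only; no UTF-8 decoding is involved.
  openQuote := FindAscii(src, '''', 1);
  closeQuote := FindAscii(src, '''', openQuote + 1);

  WriteLn('string literal spans bytes ', openQuote, '..', closeQuote);
end.
```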


Yes, this works until you reach a non-ASCII character; from then on the 
character index no longer matches the byte index one to one. For example, 
suppose this were Pascal:

i := '🐻';

You can advance by index like:

 Inc(currentIndex);
 c := text[currentIndex];

but once you hit the bear, which takes 4 bytes in UTF-8, the offset is wrong, 
so you can't advance to the next character by doing +1.
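
One possible way to advance by whole characters instead of bytes is to read the sequence length from the lead byte. The following is a minimal sketch under stated assumptions: Utf8SeqLen, src and codePoints are hypothetical names, the input is assumed to be well-formed UTF-8, and malformed lead bytes are simply stepped over one byte at a time rather than reported.

```pascal
program Utf8Advance;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with lead byte b. Stray continuation bytes and invalid lead bytes yield 1
// so that a scanner using this function always makes progress.
function Utf8SeqLen(b: Byte): Integer;
begin
  if b < $80 then
    Result := 1                     // 0xxxxxxx: ASCII
  else if (b and $E0) = $C0 then
    Result := 2                     // 110xxxxx
  else if (b and $F0) = $E0 then
    Result := 3                     // 1110xxxx
  else if (b and $F8) = $F0 then
    Result := 4                     // 11110xxx
  else
    Result := 1;                    // malformed: skip a single byte
end;

var
  src: RawByteString;
  currentIndex, codePoints: Integer;
begin
  // i := '<bear>';  with U+1F43B written as its UTF-8 bytes F0 9F 90 BB
  src := 'i := '''#$F0#$9F#$90#$BB''';';

  currentIndex := 1;
  codePoints := 0;
  while currentIndex <= Length(src) do
  begin
    Inc(codePoints);
    // Step over the whole sequence instead of doing +1.
    Inc(currentIndex, Utf8SeqLen(Ord(src[currentIndex])));
  end;

  WriteLn(Length(src), ' bytes, ', codePoints, ' code points');
end.
```

With this approach currentIndex stays a byte index into the string, so src[currentIndex] always lands on the first byte of a character; only the step size changes.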

Regards,
Ryan Joseph

_______________________________________________
fpc-pascal maillist  -  [email protected]
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
