> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal
> <fpc-pascal@lists.freepascal.org> wrote:
>
> For what grammar? What characters are allowed in a token? For example, Free
> Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only,
> it doesn't need to understand Unicode characters, so it works on the byte
> (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code
> units). That's because UTF-8 has two nice properties:
>
> 1) ASCII character are encoded as they are - by using bytes in the range
> #0..#127
>
> 2) non-ASCII characters will always use a sequence of bytes, that are all in
> the range #128..#255 (they have their highest bit set), so they will never be
> misinterpreted as ASCII.
>
> So, the tokenizer just works with UTF-8 like with any other 8-bit code page.

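To make the two properties quoted above concrete, here is a minimal sketch of a byte-level scan over a short UTF-8 string. Utf8SeqLen is a hypothetical helper written for this mail, not something taken from the FPC sources, and it assumes well-formed UTF-8 input. The bear (U+1F43B) is spelled out as its four UTF-8 bytes so the example does not depend on the encoding of the source file.

```pascal
program Utf8Scan;
{$mode objfpc}{$H+}

// Hypothetical helper (an assumption for illustration, not FPC compiler code):
// returns the length in bytes of the UTF-8 sequence that starts with lead
// byte b. Continuation or invalid bytes yield 1 so the scan always advances.
function Utf8SeqLen(b: Byte): Integer;
begin
  if b < $80 then
    Result := 1                      // property 1: ASCII is #0..#127
  else if (b and $E0) = $C0 then
    Result := 2                      // lead byte 110xxxxx
  else if (b and $F0) = $E0 then
    Result := 3                      // lead byte 1110xxxx
  else if (b and $F8) = $F0 then
    Result := 4                      // lead byte 11110xxx
  else
    Result := 1;                     // stray continuation or invalid byte
end;

var
  src: RawByteString;
  i, len: Integer;
begin
  // The Pascal statement  i := '<bear>';  with the bear given as raw bytes.
  src := 'i := ''' + #$F0#$9F#$90#$BB + ''';';

  i := 1;
  while i <= Length(src) do
  begin
    len := Utf8SeqLen(Ord(src[i]));
    if len = 1 then
      WriteLn('byte ', i, ': single byte #', Ord(src[i]))
    else
      WriteLn('byte ', i, ': lead byte of a ', len, '-byte sequence');
    Inc(i, len);                     // step by code point, not by byte
  end;
end.
```

In this string the bear starts at byte 7 and the closing quote is at byte 11, so stepping with Inc(i, len) lands on the quote, while Inc(i) would land on a continuation byte at index 8.
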
Yes, this works until you reach a non-ASCII character; from then on the
character index no longer matches the byte index of the string 1 to 1. For
example, consider this Pascal source:

    i := '🐻';

You can advance by index like this:

    Inc(currentIndex);
    c := text[currentIndex];

but once you hit the bear, which takes 4 bytes in UTF-8, the offset is wrong,
so you can't advance to the next character by doing +1.

Regards,
Ryan Joseph

_______________________________________________
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal