On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:

On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal 
<fpc-pascal@lists.freepascal.org> wrote:

You need to understand all these terms and know exactly what you need to do. 
E.g. are you dealing with keyboard input, are you dealing with the low level 
parts of text display, are you searching for something in the text, are you 
just passing strings around and letting the GUI deal with it? These are all 
different use cases, and they require a careful understanding of what Unicode thing 
you need to iterate over.

Thanks for trying to help but this is more complicated than I thought and I 
don't have the patience for a deep dive right now :)

Unicode is complicated under the hood, but we should have some libraries to help, right? I mean, the 
user thinks of these things as "characters", be it "A" or the Unicode symbol 👍, 
so we should be able to operate on that basis as well. Something like an iterator that returns the 
character (wide char) and byte offset or writing would be a nice place to start.
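
As a rough illustration of the iterator idea above, here is a minimal Python sketch that walks UTF-8 bytes and yields each code point together with its byte offset. It is not an existing FPC library routine, and the name iter_code_points is hypothetical. It iterates over code points, which is only one possible meaning of "character": a user-perceived character can span several code points.

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, character) for each code point in UTF-8 data."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte alone tells us how long the sequence is.
        if lead < 0x80:
            length = 1          # 0xxxxxxx: plain ASCII
        elif lead >> 5 == 0b110:
            length = 2          # 110xxxxx
        elif lead >> 4 == 0b1110:
            length = 3          # 1110xxxx
        elif lead >> 3 == 0b11110:
            length = 4          # 11110xxx
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        # bytes.decode validates the continuation bytes and raises
        # UnicodeDecodeError (a ValueError) on malformed input.
        yield i, data[i:i + length].decode("utf-8")
        i += length
```

For example, list(iter_code_points("A👍b".encode("utf-8"))) gives [(0, 'A'), (1, '👍'), (5, 'b')]: the thumbs-up symbol occupies four bytes, so the next character starts at offset 5.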

I have a parser/tokenizer I want to update so I'm trying to find tokens by 
advancing one character at a time. That's why I have a requirement to know 
which character is next in the file and probably the byte offset also so it can 
be referenced later.

For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII-only, it doesn't need to understand Unicode characters, so it works at the byte level (Pascal's char type); for UTF-8 files, this means UTF-8 code units. That's because UTF-8 has two nice properties:

1) ASCII characters are encoded as they are, using bytes in the range #0..#127

2) Non-ASCII characters always use a sequence of bytes that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII.
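
The two properties above can be checked directly. The following is a small Python sketch, not part of any compiler; the function name utf8_properties_hold is hypothetical.

```python
def utf8_properties_hold(text: str) -> bool:
    """Check both UTF-8 properties for every character in text."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 128:
            # Property 1: ASCII is encoded as the same single byte.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # Property 2: every byte of a non-ASCII sequence has the
            # high bit set, so none can be mistaken for ASCII.
            if len(encoded) < 2 or any(b < 128 for b in encoded):
                return False
    return True
```

Calling utf8_properties_hold on a string that mixes ASCII, accented letters and symbols outside the Basic Multilingual Plane should return True in every case, since both properties follow from the encoding's design.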

So the tokenizer just works with UTF-8 as it would with any other 8-bit code page.
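
To make this concrete, here is a minimal byte-level tokenizer sketch in Python. It is not the Free Pascal scanner: the toy grammar, the tiny keyword set and the name tokenize are all assumptions made for illustration. It never decodes UTF-8, yet multi-byte characters inside a string literal pass through intact, because none of their bytes can look like a quote or any other ASCII delimiter.

```python
# Identifiers are ASCII-only in this toy grammar (an assumption).
IDENT_START = frozenset(b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_")
IDENT_CHARS = IDENT_START | frozenset(b"0123456789")
KEYWORDS = {b"begin", b"end", b"var"}  # tiny hypothetical keyword set


def tokenize(src: bytes):
    """Yield (kind, byte_offset, raw_bytes), scanning one byte at a time."""
    i, n = 0, len(src)
    while i < n:
        b = src[i]
        start = i
        if b in b" \t\r\n":
            i += 1
            continue
        if b in IDENT_START:
            while i < n and src[i] in IDENT_CHARS:
                i += 1
            word = src[start:i]
            kind = "keyword" if word.lower() in KEYWORDS else "ident"
        elif b == ord("'"):
            # Skip to the closing quote; bytes >= 128 are never a quote.
            i += 1
            while i < n and src[i] != ord("'"):
                i += 1
            i = min(i + 1, n)  # include the closing quote if present
            kind = "string"
        else:
            # Any other single byte (punctuation, or a stray non-ASCII byte).
            i += 1
            kind = "other"
        yield kind, start, src[start:i]
```

Because every token carries its starting byte offset, it can be referenced later without re-scanning, and the raw bytes of a string token can be decoded as UTF-8 only when the text is actually needed.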

Nikolay

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal
