I've been tinkering with a lexer/parser for lambda calculus expressions. I'm
trying to add a feature that highlights unexpected characters, rather than
making the user count columns. For example, the lexer will choke on this
string, because it has no rule that matches %: (λx.x %)
I'd like to get output like this:
    <Some kind of explanation message>:
    (λx.x %)
          ^
I've hit a problem, though: the srclocs returned from the lexer don't
correspond exactly to string indices when Unicode characters (such as λ) are
included in the expression.
Take this program for example:
(require parser-tools/lex)
(with-handlers ([exn:fail:read? (λ (e) (printf "ERROR: ~a~n" e))])
  ((lexer-src-pos [(eof) 'EOF]) (open-input-string "λ")))
This outputs:
ERROR: #(struct:exn:fail:read lexer: No match found in input starting with: λ
#<continuation-mark-set> (#(struct:srcloc #f #f #f 1 2)))
You can see in the srcloc that the position is 1 (as expected), but the span
is 2.
So let's imagine I have a lexer that understands λ, but nothing else. When I
try to lex the expression "λx", the lexer chokes on the x, and the srcloc
information has position 3 and span 1. However, (string-length "λx") evaluates
to 2.
I'm struggling to figure out how to solve the problem. I'm guessing that the λ
is two bytes, and the lexer counts 1 position per byte. So what's the 'right'
way to handle this? I could just sub1 from a position for each λ that appeared
in the expression beforehand, but that feels sick and wrong.
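The byte-counting guess can at least be checked outside the lexer. This is a
minimal sketch, assuming the input is encoded as UTF-8; the names expr,
char-count and byte-count are just for illustration:

```racket
#lang racket/base

;; Compare the length of "λx" in characters with its length in UTF-8 bytes.
(define expr "λx")
(define char-count (string-length expr))                      ; characters
(define byte-count (bytes-length (string->bytes/utf-8 expr))) ; UTF-8 bytes

(printf "chars: ~a, bytes: ~a\n" char-count byte-count)
```

If λ takes up bytes 1 and 2, then x starts at byte 3, which matches the
position in the srcloc but not the string index.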
Is there a way to tell the lexer to count positions on a character-by-character
basis, rather than bytes? (Or perhaps I'm misunderstanding how that works too.)
Any pointers would be appreciated. Thanks!
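One thing that may be worth trying: Racket ports track positions in bytes by
default, and switch to characters once line counting is enabled with
port-count-lines!. The sketch below shows that behaviour on a plain string
port only. Whether the lexer's srclocs follow the port's counting mode is an
assumption that still needs checking, and position-after-reading and
count-chars? are hypothetical names made up for the example:

```racket
#lang racket/base

;; Read every character from a fresh string port over `str`, then return
;; the port's next position (1-based, like the position in a srcloc).
;; When `count-chars?` is true, line counting is enabled first, which makes
;; the port track positions in characters rather than bytes.
(define (position-after-reading str count-chars?)
  (define in (open-input-string str))
  (when count-chars? (port-count-lines! in))
  (let loop ()
    (unless (eof-object? (read-char in))
      (loop)))
  (define-values (line col pos) (port-next-location in))
  pos)

(position-after-reading "λx" #f) ; default, byte-based counting
(position-after-reading "λx" #t) ; character-based counting
```

If the two calls give different answers for "λx", then calling
port-count-lines! on the input port before handing it to the lexer would be
the obvious thing to test next.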