[racket-users] Interpreting srclocs from lexer-src-loc with unicode characters

Ben Draut Wed, 13 Jan 2016 12:24:07 -0800

I've been tinkering with a lexer/parser for lambda calculus expressions. I'm 
trying to add a feature that highlights unexpected characters, so the user 
doesn't have to count columns. For example, the lexer will choke on this string 
because it has no rule that matches %: (λx.x %)


I'd like to get output like this:

<Some kind of explanation message>:

(λx.x %)
      ^
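
A minimal sketch of how such a marker line could be produced, assuming the 
input is a single line, the position is a 1-based character index, and every 
character occupies one column in the output. The name caret-marker is 
hypothetical and not part of parser-tools.

```racket
#lang racket/base

;; Hypothetical helper: return `src` followed by a second line holding a
;; caret under the character at 1-based character position `pos`.
(define (caret-marker src pos)
  (string-append src
                 "\n"
                 (make-string (sub1 pos) #\space)  ; pad up to the column
                 "^"))

;; The % in "(λx.x %)" is the 7th character.
(display (caret-marker "(λx.x %)" 7))
(newline)
```

This only lines up when the position passed in counts characters, not bytes.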

I've hit a problem, though: the srclocs returned from the lexer don't 
correspond to string indices when non-ASCII characters (such as λ) are 
included in the expression.

Take this program for example:

(require parser-tools/lex)
(with-handlers ([exn:fail:read? (λ (e) (printf "ERROR: ~a~n" e))])
  ((lexer-src-pos [(eof) 'EOF]) (open-input-string "λ")))

This outputs:

ERROR: #(struct:exn:fail:read lexer: No match found in input starting with: λ 
#<continuation-mark-set> (#(struct:srcloc #f #f #f 1 2)))

You can see in the srcloc that the position is 1 (as expected), but the span is 
2, even though the input is a single character.

So let's imagine I have a lexer that understands λ, but nothing else. When I 
try to lex the expression "λx", the lexer chokes on the x, and the srcloc 
information has position 3 and span 1. However, (string-length "λx") evaluates 
to 2. 
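
The mismatch can be checked directly by comparing the character count of the 
string with the byte count of its UTF-8 encoding. This is only a sketch of the 
arithmetic; it does not involve the lexer.

```racket
#lang racket/base

(define s "λx")

(string-length s)                         ; characters: 2
(bytes-length (string->bytes/utf-8 s))    ; UTF-8 bytes: 3
(bytes-length (string->bytes/utf-8 "λ"))  ; λ alone takes 2 bytes
```

With 1-based byte positions, λ covers positions 1 and 2, so the x lands on 
position 3, which matches the srcloc described above.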

I'm struggling to figure out how to solve the problem. I'm guessing that the λ 
is two bytes in UTF-8, and the lexer counts one position per byte. So what's 
the 'right' way to handle this? I could just sub1 from a position for each λ 
that appears earlier in the expression, but that feels sick and wrong. 
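
If the positions really are byte offsets, a more general form of that 
adjustment is to decode the bytes before the reported position and count the 
characters. The sketch below assumes the source text is available as a string, 
that the position is a 1-based byte offset into its UTF-8 encoding, and that it 
falls on a character boundary. The name byte-pos->char-pos is hypothetical.

```racket
#lang racket/base

;; Hypothetical helper: convert a 1-based byte position in the UTF-8
;; encoding of `src` to a 1-based character position.
(define (byte-pos->char-pos src byte-pos)
  (define bs (string->bytes/utf-8 src))
  ;; Decode only the bytes before the position and count the characters.
  ;; Raises an exception if the range ends in the middle of a character.
  (add1 (string-length (bytes->string/utf-8 bs #f 0 (sub1 byte-pos)))))

(byte-pos->char-pos "λx" 3)  ; the x: byte position 3, character position 2
```

This works for any multi-byte character, not just λ, but it is still a 
workaround applied after the fact.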

Is there a way to tell the lexer to count positions by character rather than 
by byte? (Or perhaps I'm misunderstanding how that works too.)
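
For reference, Racket ports track positions in bytes by default and switch to 
character positions (after UTF-8 decoding) once port-count-lines! has been 
called on them. The sketch below shows the difference on a plain string port 
without the lexer; position-after-reading is a hypothetical helper.

```racket
#lang racket/base

;; Hypothetical helper: read all of `str` from a string port and return the
;; port's next position. When `count-chars?` is true, character counting is
;; enabled before anything is read.
(define (position-after-reading str count-chars?)
  (define in (open-input-string str))
  (when count-chars? (port-count-lines! in))
  (let loop ()
    (unless (eof-object? (read-char in)) (loop)))
  (define-values (line col pos) (port-next-location in))
  pos)

(position-after-reading "λx" #f)  ; byte-based position
(position-after-reading "λx" #t)  ; character-based position
```

Assuming the lexer takes its positions from the port, which the byte-based 
numbers above suggest, calling port-count-lines! on the input port before 
handing it to the lexer should make the srclocs line up with string indices.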

Any pointers would be appreciated. Thanks!
