Re: [r6rs-discuss] Why lexers can be simpler when restricted to ASCII

Lars T Hansen Mon, 23 Apr 2007 12:08:11 -0700

On 4/23/07, Alan Watson <[EMAIL PROTECTED]> wrote:
> In formal comment 231, I stated:
>
> "Many current Schemes have lexers written for ASCII (or Latin-1)
> character sets. Conversion of these lexers to the new standard would be
> easier if the report allowed inline hex escapes to appear anywhere in
> Scheme code."
>
> The editors replied:
>
> "It is unclear why converting the lexers would be significantly simpler
> through this change"
>
> Let me explain my original opinion. Many Schemes currently have lexers
> written in C using "char". These need converting to "long" to handle
> Unicode. Furthermore, table-driven approaches are practical for ASCII
> (128 values), but not practical for Unicode (roughly 2^24 values).
>
> In case that isn't clear enough: My Scheme uses flex for its lexer. I
> cannot see how to simply convert it to accept Unicode. I think I will
> have to dump flex and implement a new lexer by hand.


Normally you can make Flex work on Unicode by converting the input to
UTF-8 before lexing it, having first rewritten the flex input to work
on UTF-8.  It's not exactly pretty, but (speaking from experience) if
you don't mind accepting a superset of the valid characters for
identifiers it's not bad at all.  State-dependent recognizers in the
flex input are very helpful here.
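
To illustrate the "superset" idea outside of flex, here is a minimal
sketch in C. It is not code from any actual Scheme implementation. The
helper names (is_ident_start, is_ident_subsequent, ident_length) and
the exact ASCII character sets are assumptions made for illustration.
The scanner works on UTF-8 bytes and treats every byte >= 0x80 as an
identifier constituent without decoding it, so it accepts more than
the report strictly allows but never needs a table larger than 256
entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if byte c may start an identifier.
 * Every byte >= 0x80 belongs to a multi-byte UTF-8 sequence and is
 * accepted without decoding. This is the deliberate superset: some
 * non-ASCII characters that are not valid in identifiers get through. */
static int is_ident_start(unsigned char c)
{
    if (c >= 0x80)
        return 1;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return 1;
    /* Assumed set of ASCII "special initial" characters. The NUL
     * check matters because strchr would match the terminator. */
    return c != '\0' && strchr("!$%&*/:<=>?^_~", c) != NULL;
}

/* Hypothetical helper: nonzero if byte c may continue an identifier. */
static int is_ident_subsequent(unsigned char c)
{
    if (is_ident_start(c))
        return 1;
    if (c >= '0' && c <= '9')
        return 1;
    return c != '\0' && strchr("+-.@", c) != NULL;
}

/* Returns the length in bytes of the identifier at the start of the
 * NUL-terminated UTF-8 string s, or 0 if s does not begin with one. */
static size_t ident_length(const char *s)
{
    size_t n;

    assert(s != NULL);
    if (!is_ident_start((unsigned char)s[0]))
        return 0;
    n = 1;
    while (is_ident_subsequent((unsigned char)s[n]))
        n++;
    return n;
}
```

With these definitions, ident_length("foo bar") is 3, and an input
that starts with a two-byte UTF-8 lambda followed by "-calc" yields 7,
since lengths are counted in bytes rather than characters. A later
pass can decode the accepted bytes and reject characters that are not
valid in identifiers, if strictness matters.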

--lars

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
