I have some suggestions, Alan, but of course they are fundamentally
greedy/self-interested ("Hey, give us some free labor!").

Flex, which you are using, is nicely polished but, ultimately, it
ain't rocket science. DFA minimization isn't that hard, and it's a fun
programming puzzle. Tables for Unicode Scheme lexers under the draft
aren't all that impractical. The key insight is that such tables are
mostly homogeneous -- a sparse array data structure can serve well.

You might consider writing a lexical analyzer generator from scratch.
It's not so crazy, really. If you make or take a flyweight R5
implementation, you can even have a nice bootstrapping path if you
build the analyzer in Scheme. I had some experience going down this
path a ways back. It's fun. My work then wasn't timely, but the
environment may be more receptive these days.

If you get into regular-expression hacking, it's just a
hop-skip-and-a-jump to interesting parsers. It could be interesting to
have a lexer/parser generator that can not only generate
C/C++/Java/... but can also drive Scheme tools. I'm sure there are
examples out there, but I think there's room for a tight, practical
expression of such tools.

One application I found for such tools was checking
mostly-statically-checkable invariants in a fork of SCM. For example,
in SCM, certain invariants about the heap data structure require that
certain macros and functions are invoked only from within explicitly
delimited critical sections. It turned out that, at the time, lots of
programmer mistakes violated those invariants. It also turned out that
even a crude C parser and crude flow analysis were "good enough" to
catch dozens of those errors automatically. Similar analysis would
have useful application in, for example, maintaining an OS kernel like
Linux.

Probably only take you a few days :-)

-t

Alan Watson wrote:
> In formal comment 231, I stated:
>
> "Many current Schemes have lexers written for ASCII (or Latin-1)
> character sets.
> Conversion of these lexers to the new standard would be
> easier if the report allowed inline hex escapes to appear anywhere in
> Scheme code."
>
> The editors replied:
>
> "It is unclear why converting the lexers would be significantly
> simpler through this change"
>
> Let me explain my original opinion. Many Schemes currently have
> lexers written in C using "char". These need converting to "long" to
> handle Unicode. Furthermore, table-driven approaches are practical
> for ASCII (128 values), but not practical for Unicode (roughly 2^24
> values).
>
> In case that isn't clear enough: My Scheme uses flex for its lexer. I
> cannot see how to simply convert it to accept Unicode. I think I will
> have to dump flex and implement a new lexer by hand.
>
> Regards,
>
> Alan
>
> _______________________________________________
> r6rs-discuss mailing list
> [email protected]
> http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
