On Wed, Mar 1, 2023 at 11:00 PM John R. Hogerhuis <jho...@pobox.com> wrote:
> Well, for ANSI C 99 lex is probably the best way. C doesn't (didn't?) have
> any regex engine. I don't know what's in modern C, it might have libraries
> for regex.

I've used the PCRE library in C, which works but is not as nice as a
language designed from the ground up to use regular expressions. While the
modern languages I know technically have regular expressions, most of them
treat a regex as a weird string passed to a function call, no different
from a library in C. Perl and Python are noteworthy here: Perl for being
surprisingly good at integrating regular expressions into the design of
the language, and Python for failing quite badly when they should have
known better. I don't know of any modern languages that put regular
expressions at the heart of the language the way lex, awk, and sed do. I'd
love to learn of one if anyone knows.

> If someone is using Python, Perl, C#, Java, Javascript, etc... my
> minimalist tendency would be to forgo the dependency on a lexer library

If you are speaking of how the compiled executable can literally depend
upon the lexer library to run, it turns out that is not necessary. But,
yes, I do see your broader point of wanting to write directly in a
language instead of using a meta-language.

> and just use the regex language feature to do the work. Since you can
> create one big expression with a named line number token, followed by a
> series of BASIC lexemes as one big list of "|"'s

If the language's regex features made doing so easy, I'd be all for it.
Unfortunately, as far as I know, they don't. Consider "mini-scanners",
where part of the input is syntactically different from the rest. For
example, how do you handle REM or double quotes? It is totally possible to
do it in a giant regex, but not by me. I mean, I could probably come up
with something that seems to work, but I'm not sure I have enough skill
(or patience) to do it right.
I would end up kludging it with extra code that might work, but would
certainly offend anyone's sense of minimalism. With lex, mini-scanners are
trivial. For example, here is the entirety of the code needed to handle
REM in my tokenizer:

    %x remark     /* remark is an exclusive start condition */
    REM           putchar(142); BEGIN(remark);
    <*>\n         putchar('\0'); BEGIN(INITIAL);

I simply defined an exclusive start condition named <remark> and have the
REM statement enter it. Lex automatically copies verbatim any text that
doesn't match any rule. The only rule that matches in <remark> is newline,
which returns the scanner to the normal start condition. Double quotes are
just as easy.

> For a given line you need to lex the line number first. That can be its
> own regex.
>
> Then you need to lex BASIC tokens one after another. Pretty much that's
> just one giant regex of alternative lexemes.
>
> As you extract them you may enter modes for handling expressions,
> strings, REM content depending on what you're doing (tokenizer, syntax
> checker, pretty printer, renumber, etc.). That's your own code and
> variables; lex doesn't really help with that anyway since it's not a
> parser.

You are correct about lex not handling expressions. The syntax for REM and
strings in BASIC is not recursive and doesn't need a parser. That makes
lex perfect as a BASIC tokenizer, but not so great for any of the other
examples you listed, at least not on its own. While my tokenizer can
~kinda~ pack the .BA file by removing comments and whitespace, the sort of
optimizations Brian is talking about (merging lines) or that I'm
considering (removing lines that contain only a remark) are beyond it.

> Now if you're into trying a proper parser/context free grammar then I
> can see using lex+yacc or equivalents, even in a higher level language.

I'm not at that level, yet. Maybe someday.

> That said I don't know how parseable BASIC is. It's not a structured
> language. It may be best suited to ad hoc methods.
You know a heckuva lot more than I do, John. I had just presumed any
computer language was parsable.

—b9
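P.S. Since I claimed above that double quotes are "just as easy" as REM,
here is a sketch of what such a string mini-scanner could look like. This
is only my illustration, not the actual tokenizer's code; it assumes
string contents should pass through verbatim and that the keyword rules
only fire in the INITIAL condition:

    %x string     /* exclusive start condition for quoted strings */
    \"            putchar('"'); BEGIN(string);
    <string>\"    putchar('"'); BEGIN(INITIAL);

As with <remark>, lex echoes any text inside <string> that matches no
rule, so keywords like PRINT inside a quoted string are left untokenized.
The <*>\n rule from the REM example would also close an unterminated
string at end of line, which matches BASIC's behavior.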