On Wed, Mar 1, 2023 at 11:00 PM John R. Hogerhuis <jho...@pobox.com> wrote:

> Well, for ANSI C 99 lex is probably the best way. C doesn't (didn't?) have
> any regex engine. I don't know what's in modern C, it might have libraries
> for regex.
>

I've used the PCRE library in C, which works but is not as nice as a
language which is designed from the ground up to use regular expressions.
While the modern languages that I know technically have regular
expressions, most of them treat regexes as a weird string which can be used
in a function call, no different than a library in C. Perl and Python are
noteworthy here: Perl for being surprisingly good at integrating regular
expressions into the design of the language and Python for failing quite
badly when they should have known better. I don't know if there are any
modern languages that put regular expressions at the heart in the way lex,
awk, and sed do. I'd love to learn if anyone knows.

> If someone is using Python, Perl, C#, Java, Javascript, etc... my
> minimalist tendency would be to forgo the dependency on a lexer library
>

If you are speaking of how the compiled executable can literally depend
upon the lexer library to run, it turns out that is not necessary. But,
yes, I do see your broader point of wanting to write directly in a language
instead of using a meta-language.


> and just use the regex language feature to do the work. Since you can
> create one big expression with a named line number token, followed by a
> series of BASIC lexemes as one big list of "|"'s
>

If the language's regex features made doing so easy, I'd be all for it.
Unfortunately, as far as I know, they don't. Consider "mini-scanners" where
part of the input is syntactically different from the rest. For example,
how do you handle REM or double quotes? It is totally possible to do it in
a giant regex, but not by me. I mean, I could probably come up with
something that seems to work, but I'm not sure I have enough skill (or
patience) to do it right. I would end up kludging it with extra code that
might work, but certainly would offend anyone's sense of minimalism.

With lex, mini-scanners are trivial. For example, here is the entirety of
the code needed to handle REM in my tokenizer:

%x remark       /* remark is an exclusive start condition */
REM             putchar(142);   BEGIN(remark);
<*>\n           putchar('\0');  BEGIN(INITIAL);

I simply defined an exclusive start condition named remark and have the
REM rule enter it. Lex automatically copies verbatim any text that
doesn't match any rule. The only rule active in <remark> is newline,
which returns the scanner to the normal start condition. Double quotes
are just as easy.
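
For anyone without lex at hand, the behaviour of those rules can be
approximated by a small hand-written state machine in C. This is only a
sketch of the idea, not what lex actually generates (lex emits a
table-driven scanner). The function name tokenize, the STRING state, and
the choice to copy quoted text through unchanged are my assumptions for
illustration; only the REM byte (142) and the NUL at end of line come
from the rules above.

```c
#include <stddef.h>
#include <string.h>

/* Scanner states, playing the role of lex start conditions. */
enum state { INITIAL, REMARK, STRING };

/*
 * Hypothetical helper: tokenize the text in `in` into `out` and return
 * the number of bytes written. `out` must hold at least strlen(in)
 * bytes, since the output is never longer than the input.
 */
size_t tokenize(const char *in, unsigned char *out)
{
    enum state st = INITIAL;
    size_t n = 0;

    while (*in) {
        if (*in == '\n') {
            /* Like <*>\n : end the line in every state. */
            out[n++] = '\0';
            st = INITIAL;
            in++;
        } else if (st == INITIAL && strncmp(in, "REM", 3) == 0) {
            /* Like the REM rule: emit the token, enter REMARK. */
            out[n++] = 142;
            st = REMARK;
            in += 3;
        } else if (st != REMARK && *in == '"') {
            /* Assumed string handling: a quote toggles STRING. */
            out[n++] = '"';
            st = (st == STRING) ? INITIAL : STRING;
            in++;
        } else {
            /* Like lex's default rule: copy the byte verbatim. */
            out[n++] = (unsigned char)*in++;
        }
    }
    return n;
}
```

Because keywords are only recognised in the INITIAL state, a REM inside
double quotes is copied through untouched, and a quote inside a remark
does not start a string. Even for two small modes the C version is
several times longer than the lex rules it imitates.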



> For a given line you need to lex the line number first. That can be its
> own regex.
> Then you need to lex BASIC tokens one after another. Pretty much that's
> just one giant regex of alternative lexemes.
>
> As you extract them you may enter modes for handling expressions, strings,
> REM content depending on what you're doing (tokenizer, syntax checker,
> pretty printer, renumber, etc.). That's your own code and variables, lex
> doesn't really help with that anyway since it's not a parser.
>

You are correct about lex not handling expressions. The syntax for REM and
strings in BASIC is not recursive and doesn't need a parser. That makes lex
perfect as a BASIC tokenizer, but not so great for any of the other
examples you listed, at least not on its own. While my tokenizer can
~kinda~ pack the .BA file by removing comments and whitespace, the sort of
optimizations Brian is talking about (merging lines) or that I'm
considering (removing lines that only contain a remark) are beyond it.


> Now if you're into trying a proper parser/context free grammar then I can
> see using lex+yacc or equivalents, even in a higher level language.
>

I'm not at that level, yet. Maybe someday.



> That said I don't know how parseable BASIC is. It's not a structured
> language. It may be best suited to ad hoc methods.
>

You know a heckuva lot more than I do, John. I had just presumed any
computer language was parsable.

—b9
