Well, for ANSI C99, lex is probably the best way. C doesn't (didn't?) have a regex engine built into the language; I don't know what's in modern C, it may have libraries for regex.
As you say, lex is a "friendly wrapper around regular expressions." But all it really does is build a giant regex/state machine with hooks for your code. If someone is using Python, Perl, C#, Java, JavaScript, etc., my minimalist tendency would be to forgo the dependency on a lexer library and just use the language's regex feature to do the work. You can create one big expression: a named line-number token followed by the BASIC lexemes as one long list of "|" alternatives.

For a given line you lex the line number first; that can be its own regex. Then you lex BASIC tokens one after another, which is pretty much just one giant regex of alternative lexemes (there's a rough sketch at the end of this message). As you extract them you may enter modes for handling expressions, strings, or REM content, depending on what you're doing (tokenizer, syntax checker, pretty printer, renumberer, etc.). That's your own code and variables; lex doesn't really help with that anyway, since it's not a parser.

Now, if you're into trying a proper parser/context-free grammar, then I can see using lex+yacc or equivalents, even in a higher-level language. That said, I don't know how parseable BASIC is. It's not a structured language. It may be best suited to ad hoc methods.

-- John

On Wed, Mar 1, 2023 at 5:50 PM B 9 <hacke...@gmail.com> wrote:
>
> On Wed, Mar 1, 2023 at 9:39 AM John R. Hogerhuis <jho...@pobox.com> wrote:
>
>> Parser systems are less work than they're worth. But lexer systems, not
>> so much.
>> Modern languages have advanced regular expression systems which are equal
>> in power to a lexer. Might as well just use a big regex to lex your tokens.
>
> You probably know more about this than me, John, but I'm confused by the
> idea of using regexes *instead* of lex. I think of lex as a friendly
> wrapper around regular expressions. At its heart, lex is a list of regular
> expressions with associated actions for each one. They are implicitly or'ed
> and the longest possible match is taken. Most of my tokenizer is just lines
> like this:
>
>     END      putchar(128);
>     FOR      putchar(129);
>     NEXT     putchar(130);
>     DATA     putchar(131);
>     INPUT    putchar(132);
>     DIM      putchar(133);
>     READ     putchar(134);
>     LET      putchar(135);
>
> The most "complicated" part is the regexp for matching line numbers, which
> has to print four bytes.
>
>     ^[[:space:]]*[0-9]+[ ]? { /* BASIC line number */
>         uint16_t line=atoi(yytext);
>         putchar(42); putchar(42); /* Dummy placeholder values */
>         putchar(line & 0xFF);
>         putchar(line >> 8);
>     }
>
> For me, lex makes connecting regular expressions to the actions they
> trigger straightforward and clean. But that could be because I'm ignorant,
> which is often the case. If there's a better way using advanced regular
> expressions, I'd love to learn it.
>
> —b9
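
For what it's worth, here is a rough sketch of the one-big-regex idea in
Python (one of the languages mentioned above). The keyword subset and byte
values come from the lex rules quoted above; everything else (the names, the
patterns, the string handling) is illustrative only, not a drop-in
replacement for the real tokenizer.

    import re

    # Token bytes for the keyword subset shown in the quoted lex rules.
    KEYWORDS = {"END": 128, "FOR": 129, "NEXT": 130, "DATA": 131,
                "INPUT": 132, "DIM": 133, "READ": 134, "LET": 135}

    # Lex the line number first, with its own regex.
    LINE_RE = re.compile(r"^[ \t]*([0-9]+) ?(.*)$")

    # Then one giant alternation of lexemes. Python's re takes the first
    # alternative that matches, so keywords must come before the catch-all.
    TOKEN_RE = re.compile(
        r'(?P<string>"[^"]*"?)'                       # leave strings alone
        r"|(?P<keyword>" + "|".join(KEYWORDS) + r")"  # BASIC keywords
        r"|(?P<other>.)"                              # everything else
    )

    def tokenize_line(line):
        m = LINE_RE.match(line)
        if not m:
            raise ValueError("no line number: %r" % line)
        number = int(m.group(1))
        out = bytes([42, 42,                    # dummy placeholder values
                     number & 0xFF, number >> 8])
        for tok in TOKEN_RE.finditer(m.group(2)):
            if tok.lastgroup == "keyword":
                out += bytes([KEYWORDS[tok.group()]])
            else:
                out += tok.group().encode("ascii")
        return out

    print(tokenize_line("10 FOR I=1 TO 10").hex(" "))
    # 2a 2a 0a 00 81 20 49 3d 31 20 54 4f 20 31 30

This only does the right thing because the keyword branch comes before the
catch-all; unlike lex there is no longest-match rule, so a real version would
need more care there, plus the full keyword table and modes for REM and the
other cases mentioned above.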