On Wed, Mar 1, 2023 at 9:39 AM John R. Hogerhuis <jho...@pobox.com> wrote:

> Parser systems are less work than they're worth. But lexer systems, not so
> much.
> Modern languages have advanced regular expression systems which are equal
> in power to a lexer. Might as well just use a big regex to lex your tokens.
>

You probably know more about this than me, John, but I'm confused by the
idea of using regexes *instead* of lex. I think of lex as a friendly
wrapper around regular expressions. At its heart, lex is a list of regular
expressions with associated actions for each one. They are implicitly or'ed
and the longest possible match is taken. Most of my tokenizer is just lines
like this:

END                putchar(128);
FOR                putchar(129);
NEXT               putchar(130);
DATA               putchar(131);
INPUT              putchar(132);
DIM                putchar(133);
READ               putchar(134);
LET                putchar(135);

The most "complicated" part is the regexp for matching line numbers, which
has to print four bytes.

^[[:space:]]*[0-9]+[ ]? {       /* BASIC line number */
    uint16_t line = atoi(yytext);
    putchar(42); putchar(42);   /* Dummy placeholder values */
    putchar(line & 0xFF);       /* Line number, low byte then high byte */
    putchar(line >> 8);
}

For me lex makes connecting regular expressions to the actions they trigger
straightforward and clean. But, that could be because I'm ignorant, which
is often the case. If there's a better way using advanced regular
expressions, I'd love to learn it.

—b9
