On Wed, Mar 1, 2023 at 9:39 AM John R. Hogerhuis <jho...@pobox.com> wrote:
> Parser systems are less work than they're worth. But lexer systems, not so
> much.
>
> Modern languages have advanced regular expression systems which are equal
> in power to a lexer. Might as well just use a big regex to lex your tokens.

You probably know more about this than I do, John, but I'm confused by the
idea of using regexes *instead* of lex. I think of lex as a friendly wrapper
around regular expressions. At its heart, lex is a list of regular
expressions with an associated action for each one. They are implicitly
or'ed together, and the longest possible match wins.

Most of my tokenizer is just lines like this:

END	putchar(128);
FOR	putchar(129);
NEXT	putchar(130);
DATA	putchar(131);
INPUT	putchar(132);
DIM	putchar(133);
READ	putchar(134);
LET	putchar(135);

The most "complicated" part is the regexp for matching line numbers, which
has to print four bytes:

^[[:space:]]*[0-9]+[ ]?	{
	/* BASIC line number */
	uint16_t line = atoi(yytext);
	putchar(42); putchar(42);	/* Dummy placeholder values */
	putchar(line & 0xFF);
	putchar(line >> 8);
}

For me, lex makes connecting regular expressions to the actions they
trigger straightforward and clean. But that could be because I'm ignorant,
which is often the case. If there's a better way using advanced regular
expressions, I'd love to learn it.

—b9
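P.S. To make sure I understand what the "big regex" approach would even look
like, here's my rough guess at it in Python (the names and the choice of
Python are mine, not John's). One alternation with named groups plays the
role of the lex rule list. Note one difference from lex: Python's re engine
takes the *first* alternative that matches, not the longest, so rule
ordering matters here in a way it doesn't in lex:

```python
import re

# Keyword -> token byte, same mapping as my lex rules.
KEYWORDS = {
    "END": 128, "FOR": 129, "NEXT": 130, "DATA": 131,
    "INPUT": 132, "DIM": 133, "READ": 134, "LET": 135,
}

# One big alternation with a named group per "rule". Unlike lex, re
# picks the first alternative that matches, so if one keyword were a
# prefix of another, the longer one would have to be listed first.
TOKEN_RE = re.compile(
    r"(?P<LINENO>^[ \t]*[0-9]+ ?)"
    r"|(?P<KEYWORD>" + "|".join(KEYWORDS) + r")"
    r"|(?P<OTHER>.)",
    re.MULTILINE,
)

def tokenize(source):
    out = bytearray()
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup == "LINENO":
            line = int(m.group())
            out += bytes([42, 42])                 # dummy placeholder values
            out += bytes([line & 0xFF, line >> 8]) # line number, little-endian
        elif m.lastgroup == "KEYWORD":
            out.append(KEYWORDS[m.group()])
        else:
            out += m.group().encode("latin-1")     # pass everything else through
    return bytes(out)
```

The action dispatch that lex gives you for free ends up as the if/elif
chain on `m.lastgroup`, which is exactly the part I find cleaner in lex.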