Well, for ANSI C99, lex is probably the best way. C doesn't (didn't?) have
a regex engine in the standard library; I don't know what's in modern C, it
may have libraries for regex by now.

As you say, lex is a "friendly wrapper around regular expressions."

But all it really does is build a giant regex/state machine with hooks for
your code. If someone is using Python, Perl, C#, Java, JavaScript, etc.,
my minimalist tendency would be to forgo the dependency on a lexer library
and just use the language's regex feature to do the work: you can create
one big expression with a named line-number token followed by a series of
BASIC lexemes, all strung together as one big list of "|"s.
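
A rough sketch of the idea in Python (the keyword list is abbreviated and
all the names here are just made up for illustration):

import re

# One big alternation: a named line-number group up front, then every BASIC
# keyword or'ed together. Longer keywords must come before shorter prefixes
# because Python's "|" takes the first alternative that matches, not the
# longest one the way lex does.
KEYWORDS = ["INPUT", "NEXT", "DATA", "READ", "END", "FOR", "DIM", "LET"]

BASIC_RE = re.compile(
    r"(?P<linenum>^\s*\d+ ?)"                        # BASIC line number
    + r"|(?P<keyword>" + "|".join(KEYWORDS) + r")"   # the big list of "|"s
    + r"|(?P<string>\"[^\"]*\")"                     # string literal
    + r"|(?P<other>.)"                               # anything else
)

Each match reports which named group fired, which is more or less what
lex's rule/action pairs buy you.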

For a given line you need to lex the line number first; that can be its own
regex. Then you lex the BASIC tokens one after another, which is pretty much
just one giant regex of alternated lexemes.
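
Something like this, to sketch the two-step version (again the keyword list
is abbreviated and the token handling is just a stub):

import re

LINENUM_RE = re.compile(r"^\s*(\d+) ?")              # the line number, on its own
LEXEME_RE = re.compile(
    r"INPUT|NEXT|DATA|READ|END|FOR|DIM|LET"          # keywords, longest first
    r"|\"[^\"]*\""                                   # string literal
    r"|\s+|."                                        # whitespace or anything else
)

def lex_line(line):
    tokens = []
    m = LINENUM_RE.match(line)                       # lex the line number first
    pos = 0
    if m:
        tokens.append(("LINENUM", int(m.group(1))))
        pos = m.end()
    while pos < len(line):                           # then one lexeme after another
        m = LEXEME_RE.match(line, pos)
        tokens.append(("LEXEME", m.group(0)))
        pos = m.end()
    return tokens

From there, mapping each keyword to its token byte is a dictionary lookup
rather than a lex action.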

As you extract them you may enter modes for handling expressions, strings,
or REM content, depending on what you're doing (tokenizer, syntax checker,
pretty printer, renumber, etc.). That's your own code and variables; lex
doesn't really help with that anyway, since it's not a parser.
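
For example, a REM "mode" in a hand-rolled lexer can be as little as this
(just a sketch; a real one would only trigger outside string literals):

import re

REM_RE = re.compile(r"\bREM\b")

def split_off_rem(line):
    # Once REM is seen, flip modes: everything after it is comment text
    # and gets handed back untokenized.
    m = REM_RE.search(line)
    if m:
        return line[:m.end()], line[m.end():]
    return line, None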

Now, if you're into trying a proper parser/context-free grammar, then I can
see using lex+yacc or their equivalents, even in a higher-level language.

That said, I don't know how parseable BASIC is. It's not a structured
language. It may be best suited to ad hoc methods.

-- John.


On Wed, Mar 1, 2023 at 5:50 PM B 9 <hacke...@gmail.com> wrote:

>
>
> On Wed, Mar 1, 2023 at 9:39 AM John R. Hogerhuis <jho...@pobox.com> wrote:
>
>> Parser systems are less work than they're worth. But lexer systems, not
>> so much.
>> Modern languages have advanced regular expression systems which are equal
>> in power to a lexer. Might as well just use a big regex to lex your tokens.
>>
>
> You probably know more about this than me, John, but I'm confused by the
> idea of using regexes *instead* of lex. I think of lex as a friendly
> wrapper around regular expressions. At its heart, lex is a list of regular
> expressions with associated actions for each one. They are implicitly or'ed
> and the longest possible match is taken. Most of my tokenizer is just lines
> like this:
>
> END                putchar(128);
> FOR                putchar(129);
> NEXT               putchar(130);
> DATA               putchar(131);
> INPUT              putchar(132);
> DIM                putchar(133);
> READ               putchar(134);
> LET                putchar(135);
>
> The most "complicated" part is the regexp for matching line numbers which
> has to print four bytes.
>
> ^[[:space:]]*[0-9]+[ ]?       {       /* BASIC line number */
>   uint16_t line=atoi(yytext);
>   putchar(42);  putchar(42);  /* Dummy placeholder values */
>   putchar(line & 0xFF);
>   putchar(line >> 8);
>  }
>
> For me lex makes connecting regular expressions to the actions they
> trigger straightforward and clean. But, that could be because I'm ignorant,
> which is often the case. If there's a better way using advanced regular
> expressions, I'd love to learn it.
>
> —b9
>
>
