lexer support?

Lex Spoon Thu, 29 Feb 2024 09:07:52 -0800

Greetings, Bison developers,

What would the sentiment be toward including lexer support in core Bison?


I've thought for a long time that, for common usage of Bison, it doesn't
really help developers to push them into a separate tool like Flex just to
enter their token patterns. While in theory this gives the developer the
flexibility to mix and match tools, in practice they are going to use
Bison+Flex as a combo, for the combined task of "implement a parser". That's
my own gut feeling, anyway, based on my professional experience in language
technology.

I've dabbled with this problem as a hobby project for about a year and a
half, and at this point I am ready to share a detailed proposal
<https://docs.google.com/document/d/1TuUDPB5RH842U-xbKut16q1Xl3gJYYm8dyYLWUO9G6Q/edit#>
as well as a prototype
<https://github.com/lexspoon/bison/compare/master...lex-lexer> for the C
language backend. I'm wondering what to do next.

To get an idea of what I'm thinking, below is the calculator example
rewritten to use a built-in lexer. Take a look at the new "%%tokens"
section, halfway through the file. It gives a lexical pattern for each
token type. This section replaces the old "%token" declarations: instead
of just declaring the tokens, it provides enough information for Bison to
generate yylex().

This design has a lot of little niceties other than the basic feature
itself. There is automatic location tracking, a modern approach to Unicode
based on UTF-8, a cleaned-up version of mode support, and incorporation of
line-oriented syntax, an area of syntax that seems pretty important
nowadays. See the proposal for details.

%code {
#include <stdio.h>
#include <stdlib.h>

void yyerror (const char *s);
}

%define api.value.type union
%type <int> expr term fact

%%

input:
  %empty
| input line
;

line:
  NL
| expr NL  { printf ("%d\n", $1); }
| error NL { yyerrok; }
;

expr:
  expr PLUS term { $$ = $1 + $3; }
| expr MINUS term { $$ = $1 - $3; }
| term
;

term:
  term TIMES fact { $$ = $1 * $3; }
| term DIVIDE fact { $$ = $1 / $3; }
| fact
;

fact:
  NUM { $$ = atoi($1); }
| LPAREN expr RPAREN { $$ = $2; }
;

%%tokens

DIVIDE: "/"
LPAREN: "("
MINUS: "-"
NL: "\n"
NUM: [0-9]+ ("." [0-9]+)?
PLUS: "+"
RPAREN: ")"
TIMES: "*"
WS: [ \t\r]+  -> skip

%%

void
yyerror (const char *s)
{
  fprintf (stderr, "%s\n", s);
}

int
main (int argc, char const* argv[])
{
  return yyparse ();
}
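
For comparison, here is a rough sketch of the kind of hand-written scanner
that the "%%tokens" section above is meant to make unnecessary. It is not
taken from the prototype. The names next_token, token_kind, TOK_EOF and
TOK_ERROR are made up for illustration, and a real yylex() would also have
to fill in yylval and the location, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds mirroring the names in the %%tokens section;
   TOK_EOF and TOK_ERROR are additions for this sketch only. */
enum token_kind {
  TOK_EOF, TOK_ERROR, DIVIDE, LPAREN, MINUS, NL, NUM, PLUS, RPAREN, TIMES
};

static int is_digit (char c) { return c >= '0' && c <= '9'; }

/* Scan one token starting at *cursor and advance *cursor past it.
   For NUM, the matched text is copied into buf (truncated to cap - 1
   characters); for every other kind, buf is set to the empty string. */
enum token_kind
next_token (const char **cursor, char *buf, size_t cap)
{
  assert (cursor != NULL && *cursor != NULL && buf != NULL && cap > 0);
  const char *p = *cursor;
  enum token_kind kind;
  buf[0] = '\0';

  /* WS: [ \t\r]+ -> skip */
  while (*p == ' ' || *p == '\t' || *p == '\r')
    p++;

  if (is_digit (*p))
    {
      /* NUM: [0-9]+ ("." [0-9]+)? */
      const char *start = p;
      while (is_digit (*p))
        p++;
      if (p[0] == '.' && is_digit (p[1]))
        {
          p++;
          while (is_digit (*p))
            p++;
        }
      size_t len = (size_t) (p - start);
      if (len >= cap)
        len = cap - 1;
      memcpy (buf, start, len);
      buf[len] = '\0';
      kind = NUM;
    }
  else
    {
      switch (*p)
        {
        case '\0': kind = TOK_EOF; break;
        case '/':  kind = DIVIDE;  break;
        case '(':  kind = LPAREN;  break;
        case '-':  kind = MINUS;   break;
        case '\n': kind = NL;      break;
        case '+':  kind = PLUS;    break;
        case ')':  kind = RPAREN;  break;
        case '*':  kind = TIMES;   break;
        default:   kind = TOK_ERROR; break;
        }
      /* Consume the character unless we are sitting on the terminator. */
      if (kind != TOK_EOF)
        p++;
    }

  *cursor = p;
  return kind;
}
```

Even for a grammar this small, the scanner is longer than the nine pattern
lines it corresponds to, and it still has no location tracking or Unicode
handling.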


That's what I'm thinking. What does the group think about where to go next?

If the sentiment is positive, then perhaps we can talk about what the right
kind of design review would be, and what checklist or process the
maintainers would want completed before it feels ready for inclusion in the
main Git branch. I've opened the document for comments, for anyone who
wants to interact in that way.

If the sentiment is not that great, then no hard feelings. I realize this
post is coming out of the blue. In that case, I'll leave my fork on GitHub,
and I'll regroup.

Lex Spoon