Kaz, are you disagreeing with the general shape of my proposal, or are you just taking issue with calling it "lexer support"? I often edit my emails down because everyone says they are too long. Maybe I should have said "better lexer support".

I agree about the capabilities of Bison that you describe, and I agree on most of your design points. I think you actually conceded the general point, though, that Bison is pretty spare in its existing lexer support. For example, you wrote:

> The grammar language isn't very expressive for that (e.g. for repeat
> matching similar to a regex Kleene star, you have to write recursive
> rules, and alternation is likewise clumsy). But it has the power.

The test of a good tool is not just that it provides power at all, but that it makes that power accessible. I think we agree that it would be nice for developers if Bison could support something like [a-z][a-z0-9_]+ directly?
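To make the contrast concrete, here is roughly what that one Lex pattern costs when you unroll it into grammar rules by hand. This is only a sketch; the nonterminal names are invented, and I'm assuming the individual characters arrive as character tokens.

    /* Lex: one line */
    [a-z][a-z0-9_]+      { return IDENT; }

    /* Bison today: the Kleene plus becomes a recursive rule */
    ident
      : letter ident_rest
      ;
    ident_rest
      : letter_or_digit
      | ident_rest letter_or_digit
      ;
    letter
      : 'a' | 'b' | 'c'   /* ... and so on through 'z' */
      ;
    letter_or_digit
      : letter
      | '0'               /* ... and so on through '9' */
      | '_'
      ;

Multiply that by every token in a real language and the appeal of the regex notation is clear.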
> Some things you can do in Lex are going to be difficult...

Quite. At the risk of being snarky, that is why I spent a year and a half on this project. In practice you end up needing to go to a separate tool, and as things stand, a lot of unnecessary little problems come up along the way.

> ...namely those that are made possible by start conditions. You can't
> easily put a Bison-generated parser into different states in which
> different rules are active or dormant.

I actually think you gave up too soon here. Bison parsers have their own way of managing modes, and I think there's some interesting research to be done on combining that with lexical modes. To see what I mean, consider this parse rule:

    stmt: "val" id "=" expr

Imagine the parser has just shifted the "=" token and is now going to try and gobble an expr. That's a mode! The parser knows that the next token has to be one that can start an expression. The lexer has no idea about this mode, though, so it has to track its own modes in its own way. A knock-on benefit of Bison having a unified grammar file would be to enable research on exactly this kind of thing.
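For what it's worth, the standard workaround today makes the duplication visible: a mid-rule action in the grammar pokes a mode that the lexer consults. A hedged sketch, assuming "val" and id are declared tokens and assuming hypothetical lexer_begin_expr_mode()/lexer_end_expr_mode() hooks:

    /* Grammar side: manually re-export what the LR automaton
       already knows, namely that an expr comes next. */
    stmt
      : "val" id "="
          { lexer_begin_expr_mode(); }   /* hypothetical hook */
        expr
          { lexer_end_expr_mode(); }     /* hypothetical hook */
      ;

The irony is that the parser's own state already encodes "we just shifted the =", so the mid-rule action is pure bookkeeping; that is exactly the kind of duplication a unified grammar file could eliminate.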
> This can all be done in Bison...

I am not sure this is true for larger examples, even though it holds for the toy calculator we always use as an example. In a larger grammar, a developer will run into harder problems than just unrolling a Kleene star. Here are some reasons to believe it's helpful to use a lexer generator such as my namesake:

1. The Lex+Yacc combo has been enormously successful in industry. It's taught in the majority of compiler classes, and, remarkably, people actually use that approach when they go out on the job. In my long career, I can't immediately recall a parser that didn't have a separate layer for the lexer. Most of the time, the lexer specifically uses Flex or a tool derived from it. Even hand-written parsers, such as the one in Scala, tend to have a separate layer <https://github.com/scala/scala/blob/2.13.x/src/compiler/scala/tools/nsc/ast/parser/Scanners.scala> for the lexer, and the general style of the Scala lexer is quite similar to a DFA-based lexer. Other things are possible, but the Lex+Yacc combo is *the* way people implement parsers right now. I'm enough of a geek to go there with you a little bit, but we would be way out on a limb to claim that Lex-style lexers are obsolete or unnecessary.

2. Language specifications often separate out the lexical tokens as their own thing, and when the spec is structured a certain way, it is often helpful for the implementation to be structured the same way. As a concrete example, rules for significant newlines are generally stated in terms of the tokens that occur before and after the newline. Such a rule can only be implemented easily if the implementation has some notion of a token stream at all.

3. Bison's design and mindshare are overwhelmingly oriented around the idea of a separate lex routine. For example, the manual states <https://www.gnu.org/software/bison/manual/html_node/Language-and-Grammar.html>: "(These tokens can be subdivided into characters, but that is a matter of lexicography, not grammar.)"

4. The developers of Yacc and Lex did not believe it would work well to use Yacc by itself. The first two sentences of the original Lex paper <https://epaperpress.com/lexandyacc/download/lex.pdf> talk about integration with "a parse tool", and the paper mentions Yacc by name over 20 times. The original Yacc paper <https://www.cs.utexas.edu/users/novak/yaccpaper.htm> mentions Lex by name, and furthermore says you wouldn't want to use Yacc by itself: "Such low-level rules tend to waste time and space, and may complicate the specification beyond Yacc's ability to deal with it."

5. I am groping here, but it seems like a problem if the 1 in LALR(1) meant one character instead of one token. A Bison parser looks ahead by one token to decide what to do, and sometimes it needs more than one character to know what kind of token it is looking at. Here's a concrete example of what I mean; there is also a small grammar sketch after this list. Suppose the parse stack has these symbols on it:

       ["begin" "if" exp "then"]

   If the next token is "else", then the parser should shift it and will eventually reduce an if-then-else expression. Now suppose that instead of a token, Bison sees only the single character "e". What does it do? It's facing a shift-reduce conflict. Suppose it shifts, and the input continues l s e. In that case the shift worked out well: Bison will reduce an "else" token, go a little further, and eventually reduce an entire if-then-else expression. However, the shift may not have been the correct thing to do. If the next characters are e n d, this strategy fails. The parser shifts e n d, reduces it to "end", and ends up with this parse stack:

       ["begin" "if" exp "then" "end"]

   Now it needs to reduce the three symbols in the middle of the stack, but Bison parsers aren't allowed to reduce in the middle. So we can't embed the lex rules in the parse rules without abandoning the general approach of LALR.
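To see the conflict fall out mechanically, here is a toy character-level grammar, invented purely for illustration, with the keywords spelled out one character at a time. Feed it to Bison and it reports a shift/reduce conflict:

    %%
    block : 'b' 'e' 'g' 'i' 'n' exp 'e' 'n' 'd' ;
    exp   : 'i' 'f' cond 't' 'h' 'e' 'n' cond
          | 'i' 'f' cond 't' 'h' 'e' 'n' cond 'e' 'l' 's' 'e' cond
          ;
    cond  : 'v' ;   /* stand-in for a real condition */

After the parser has seen i f cond t h e n cond, the lookahead 'e' could be the start of "else" (shift) or the start of the "end" that closes the block (reduce the shorter exp first), and one character is not enough to tell them apart. Rewrite the same grammar with whole tokens, say IF, THEN, ELSE, and END, and the conflict disappears, because one token of lookahead distinguishes ELSE from END.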
It's fun to meet you, Kaz, and to geek out about all of this!

Lex Spoon