Hi Peng,

> On 17 Feb 2019, at 17:36, Peng Yu <pengyu...@gmail.com> wrote:
>
> Sometimes, the best implementation may not be what it obviously
> should look like. I think that there can be cases in which more
> actions should be in the lexer instead of the parser.

Yes, I agree.

> Consider the parameter expansion (along with assignment) in bash.
>
> https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html
>
> Bash currently doesn't support nested parameter expansion. For
> example, both the following assignment commands are fine.
>
> - x=${parameter:offset}
> - x=${parameter/pattern/string}

This qualifies as a "simple language", in the sense I meant in my
previous message.

> But bash does not support something like the following (i.e.,
> extract the substring first, then do replacement on the substring):
>
> - x=${{parameter:offset}/pattern/string}

When nesting is allowed, that's a strong sign that a grammar is
better suited than something in the scanner. That said, a scanner
using start conditions can deal with nested structures, but I would
limit this to very simple cases (say, /* ... */ comments that allow
nesting, capturing nested strings à la $(...), etc.).
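For concreteness, here is what that simple nested-comment case can
look like in Flex. This is a minimal sketch of my own, not code taken
from any project; the only state it needs beyond the start condition
is a depth counter:

    %option noyywrap
    %x COMMENT
    %{
      static int depth = 0;   /* current comment nesting depth */
    %}
    %%
    "/*"              { depth = 1; BEGIN(COMMENT); }
    <COMMENT>"/*"     { ++depth; }                           /* nested opener */
    <COMMENT>"*/"     { if (--depth == 0) BEGIN(INITIAL); }  /* matching closer */
    <COMMENT>.|\n     ;                                      /* discard comment body */
    <COMMENT><<EOF>>  { fputs("unterminated comment\n", stderr); return 1; }
    .|\n              ECHO;                                  /* copy the rest through */
    %%
    int main(void) { return yylex(); }

Anything much past this amount of bookkeeping, and you are
reimplementing by hand what a grammar would give you for free.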
> Since bash does not support the nested parameter expansion, bash
> tokenizes the whole thing (e.g., `x=${parameter:offset}`, literally)
> as an ASSIGNMENT_WORD. Then, bash post-processes the extracted text
> with some handwritten code, not another pair of full-fledged
> lexer/parser. The post-processing code is still manageable due to
> the simplicity of not allowing nesting.

It would be interesting to know why it was written this way. Maybe
there is something I don't foresee, but that still seems like
something where more work could be pushed onto the parser.

There is one reason that could have led to this architecture: the
fact that the lexical structure inside strings is very different.
There are no comments, spaces matter, etc. So that's a case where the
parser state controls the type of scanning you need, which is called
a "lexical tie-in" in Bison's documentation. It's doable, but not
very elegant. It would be much better if Bison supported scannerless
parsing. I'd love that, but this is not going to happen in the near
future.
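To make "lexical tie-in" concrete, here is a sketch in the spirit of
the example in Bison's documentation. The toy language (bare words
plus double-quoted strings in which spaces are significant) and all
the names are made up for illustration:

    %{
    #include <ctype.h>
    #include <stdio.h>
    int scanning_string = 0;   /* tie-in flag: set by the parser, read by yylex */
    int yylex (void);
    void yyerror (const char *msg);
    %}
    %token WORD CHAR
    %%
    input:
      %empty
    | input item
    ;

    item:
      WORD
    | '"' { scanning_string = 1; }   /* from here on, scan in string mode */
      body
      '"' { scanning_string = 0; }
    ;

    body:
      %empty
    | body CHAR
    ;
    %%
    int
    yylex (void)
    {
      int c = getchar ();
      if (c == EOF)
        return 0;
      if (scanning_string)
        /* Inside a string: no blank skipping, every character counts. */
        return c == '"' ? c : CHAR;
      /* Outside strings: skip blanks, return quotes and WORDs. */
      while (isspace (c))
        c = getchar ();
      if (c == EOF || c == '"')
        return c == EOF ? 0 : c;
      while (!isspace (c = getchar ()) && c != EOF && c != '"')
        continue;
      if (c == '"')
        ungetc (c, stdin);
      return WORD;
    }

    void
    yyerror (const char *msg)
    {
      fprintf (stderr, "%s\n", msg);
    }

    int
    main (void)
    {
      return yyparse ();
    }

The usual caveat from the manual applies: because of the parser's
lookahead, a token can be fetched before the mid-rule action that
flips the flag has run, so this technique demands some care.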
> However, if one wants to allow arbitrary nesting, a handwritten
> post-processing approach will not be viable. A lexer/parser
> post-processing approach is more manageable.

Or deal with that in the primary parser. Is there a reason it cannot
be done?

> For example, the post-processing lexer can recognize basic tokens
> (once the whole assignment parameter expansion string is extracted
> by the first lexer) like the following. Then a parser can be built
> on top of those basic tokens.
>
> - VARNAME
> - `=`
> - DOLLAR_LPAR
> - `{`
> - `}`
> - `/`
> - `:`
> - ...

Absolutely. You might actually get better error messages.

> But how to recognize the nested parameter expansion assignment in
> the first place? The lexer should have built-in states to capture
> paired `{` `}`, and use states to remember whether it is in
> substring extraction or pattern replacement, in order to make sure
> to catch any errors at the level of the lexer.

Ok.

> But once such complex stuff is built into the first lexer, I don't
> see the point of using a second lexer and parser for
> post-processing, since the same states will be constructed at least
> twice, which is a waste.

Probably negligible. However, I agree with you that it would be
better to avoid it. For instance, location tracking would also be
more accurate for errors inside the pattern.

> In this sense, it seems nested parameter expansion assignment should
> be built into the first lexer rather than the first parser. And the
> parser should be left for other, more high-level language features.

This is not so clear to me. Why not lift all that processing to the
parser? Is there a problem I am missing?

But I don't claim you should never have several processing phases
either :) Bison, for instance, processes the user actions in a second
step, not during the initial parsing. It was just easier this way.
However, in our case, this second step is hardly more than a sed-like
substitution (of the $$, $1, etc.); there is no parsing, just a
transducer implemented in Flex.

> If instead, the first lexer doesn't spit out ASSIGNMENT_WORD, but
> rather more basic tokens (as listed above, such as VARNAME ...
> DOLLAR_LPAR), then it will need to remember when to allow spaces (as
> spaces are used as separators in other contexts but not allowed in
> the assignment).

Start conditions in the scanner are the proper way to do that.

> So the lexer basically needs to be aware of more high-level language
> information, which may not be good.

Choose your evil :( We don't have scannerless parsing, so you need
lexical tie-ins.

> In this sense, I think that the lexer should still spit out
> ASSIGNMENT_WORD but do more internal work to deal with the actions
> of the assignment and nested parameter expansions.
>
> What do you think the best implementation should be in this case?

I would still give a try to using a parser for the second analysis.
And I would actually first try to have a single parser. Is there a
public branch of bash where such experiments are conducted?
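To illustrate what the single-parser option could look like, here is
a hedged grammar sketch built on the token names from your list
(OFFSET, PATTERN, and STRING are additions of mine, and the scanner
and actions are elided). The recursion through `operand` is what
makes arbitrary nesting come for free:

    %token VARNAME OFFSET PATTERN STRING
    %token DOLLAR_LPAR   /* the "${" opener from the token list above */
    %%
    assignment:
      VARNAME '=' DOLLAR_LPAR expansion '}'
    ;

    /* What an expansion operates on: a plain name, or, recursively,
       the result of a braced inner expansion. */
    operand:
      VARNAME
    | '{' expansion '}'
    ;

    expansion:
      operand                          /* ${parameter} */
    | operand ':' OFFSET               /* ${parameter:offset} */
    | operand '/' PATTERN '/' STRING   /* ${parameter/pattern/string} */
    ;

With this shape, `x=${{parameter:offset}/pattern/string}` parses
naturally, and, as discussed above, errors inside a nested pattern
would be reported by the parser with accurate locations.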