Hi Peng,

> On 17 Feb 2019, at 17:36, Peng Yu <pengyu...@gmail.com> wrote:
>
> Sometimes, the best implementation may not be what it obviously
> should look like. I think that there can be cases in which more
> actions should be in the lexer instead of the parser.

Yes, I agree.

> Consider the parameter expansion (along with assignment) in bash.
>
> https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html
>
> Bash currently doesn't support nested parameter expansion. For
> example, both the following assignment commands are fine.
>
> - x=${parameter:offset}
> - x=${parameter/pattern/string}

This qualifies as a "simple language", in the sense I meant in my
previous message.

> But bash does not support something like the following (i.e.,
> extract the substring first, then do replacement on the substring):
>
> - x=${{parameter:offset}/pattern/string}

When nesting is allowed, that's a strong sign that a grammar is
better suited than something in the scanner. That said, a scanner
using start conditions can deal with nested structures, but I would
limit this to very simple cases (say, /* ... */ comments that allow
nesting, capturing nested strings à la $(...), etc.).
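For concreteness, here is what that simple nested-comment case can
look like in Flex. This is a minimal sketch of my own, not code taken
from any project; the only state it needs beyond the start condition
is a depth counter:

    %option noyywrap
    %x COMMENT
    %{
      static int depth = 0;   /* current comment nesting depth */
    %}
    %%
    "/*"              { depth = 1; BEGIN(COMMENT); }
    <COMMENT>"/*"     { ++depth; }                           /* nested opener */
    <COMMENT>"*/"     { if (--depth == 0) BEGIN(INITIAL); }  /* matching closer */
    <COMMENT>.|\n     ;                                      /* discard comment body */
    <COMMENT><<EOF>>  { fputs("unterminated comment\n", stderr); return 1; }
    .|\n              ECHO;                                  /* copy the rest through */
    %%
    int main(void) { return yylex(); }

Anything much past this amount of bookkeeping, and you are
reimplementing by hand what a grammar would give you for free.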
> Since bash does not support the nested parameter expansion, bash
> tokenizes the whole thing (e.g., `x=${parameter:offset}`, literally)
> as an ASSIGNMENT_WORD. Then, bash post-processes the extracted text
> with some handwritten code, not another pair of full-fledged
> lexer/parser. The post-processing code is still manageable due to
> the simplicity of not allowing nesting.

It would be interesting to know why it was written this way. Maybe
there is something I don't foresee, but that still seems like
something where more work could be pushed onto the parser.

There is one reason that could have led to this architecture: the
fact that the lexical structure inside strings is very different.
There are no comments, spaces matter, etc. So that's a case where the
parser state controls the type of scanning you need, which is called
a "lexical tie-in" in Bison's documentation. It's doable, but not
very elegant. It would be much better if Bison supported scannerless
parsing. I'd love that, but this is not going to happen in the near
future.
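To make "lexical tie-in" concrete, here is a sketch in the spirit of
the example in Bison's documentation. The toy language (bare words
plus double-quoted strings in which spaces are significant) and all
the names are made up for illustration:

    %{
    #include <ctype.h>
    #include <stdio.h>
    int scanning_string = 0;   /* tie-in flag: set by the parser, read by yylex */
    int yylex (void);
    void yyerror (const char *msg);
    %}
    %token WORD CHAR
    %%
    input:
      %empty
    | input item
    ;

    item:
      WORD
    | '"' { scanning_string = 1; }   /* from here on, scan in string mode */
      body
      '"' { scanning_string = 0; }
    ;

    body:
      %empty
    | body CHAR
    ;
    %%
    int
    yylex (void)
    {
      int c = getchar ();
      if (c == EOF)
        return 0;
      if (scanning_string)
        /* Inside a string: no blank skipping, every character counts. */
        return c == '"' ? c : CHAR;
      /* Outside strings: skip blanks, return quotes and WORDs. */
      while (isspace (c))
        c = getchar ();
      if (c == EOF || c == '"')
        return c == EOF ? 0 : c;
      while (!isspace (c = getchar ()) && c != EOF && c != '"')
        continue;
      if (c == '"')
        ungetc (c, stdin);
      return WORD;
    }

    void
    yyerror (const char *msg)
    {
      fprintf (stderr, "%s\n", msg);
    }

    int
    main (void)
    {
      return yyparse ();
    }

The usual caveat from the manual applies: because of the parser's
lookahead, a token can be fetched before the mid-rule action that
flips the flag has run, so this technique demands some care.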
> However, if one wants to allow arbitrary nesting, a handwritten
> post-processing approach will not be viable. A lexer/parser
> post-processing approach is more manageable.

Or deal with that in the primary parser. Is there a reason it cannot
be done?

> For example, the post-processing lexer can recognize basic tokens
> (once the whole assignment parameter expansion string is extracted
> by the first lexer) like the following. Then a parser can be built
> on top of those basic tokens.
>
> - VARNAME
> - `=`
> - DOLLAR_LPAR
> - `{`
> - `}`
> - `/`
> - `:`
> - ...

Absolutely. You might actually get better error messages.

> But how to recognize the nested parameter expansion assignment in
> the first place? The lexer should have built-in states to capture
> paired `{` `}`, and use states to remember whether it is in
> substring extraction or pattern replacement, in order to make sure
> to catch any errors at the level of the lexer.

Ok.

> But once such complex stuff is built into the first lexer, I don't
> see the point of using a second lexer and parser for
> post-processing, since the same states will be constructed at least
> twice, which is a waste.

Probably negligible. However, I agree with you that it would be
better to avoid it. For instance, location tracking would also be
more accurate for errors inside the pattern.

> In this sense, it seems nested parameter expansion assignment should
> be built into the first lexer rather than the first parser. And the
> parser should be left for other, more high-level language features.

This is not so clear to me. Why not lift all that processing to the
parser? Is there a problem I am missing?

But I don't claim you should never have several processing phases
either :) Bison, for instance, processes the user actions in a second
step, not during the initial parsing. It was just easier this way.
However, in our case, this second step is hardly more than a sed-like
substitution (of the $$, $1, etc.); there is no parsing, just a
transducer implemented in Flex.

> If instead, the first lexer doesn't spit out ASSIGNMENT_WORD, but
> rather more basic tokens (as listed above, such as VARNAME ...
> DOLLAR_LPAR), then it will need to remember when to allow spaces (as
> spaces are used as separators in other contexts but not allowed in
> the assignment).

Start conditions in the scanner are the proper way to do that.

> So the lexer basically needs to be aware of more high-level language
> information, which may not be good.

Choose your evil :( We don't have scannerless parsing, so you need
lexical tie-ins.

> In this sense, I think that the lexer should still spit out
> ASSIGNMENT_WORD but do more internal work to deal with the actions
> of the assignment and nested parameter expansions.
>
> What do you think the best implementation should be in this case?

I would still give a try to using a parser for the second analysis.
And I would actually first try to have a single parser. Is there a
public branch of bash where such experiments are conducted?
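To illustrate what the single-parser option could look like, here is
a hedged grammar sketch built on the token names from your list
(OFFSET, PATTERN, and STRING are additions of mine, and the scanner
and actions are elided). The recursion through `operand` is what
makes arbitrary nesting come for free:

    %token VARNAME OFFSET PATTERN STRING
    %token DOLLAR_LPAR   /* the "${" opener from the token list above */
    %%
    assignment:
      VARNAME '=' DOLLAR_LPAR expansion '}'
    ;

    /* What an expansion operates on: a plain name, or, recursively,
       the result of a braced inner expansion. */
    operand:
      VARNAME
    | '{' expansion '}'
    ;

    expansion:
      operand                          /* ${parameter} */
    | operand ':' OFFSET               /* ${parameter:offset} */
    | operand '/' PATTERN '/' STRING   /* ${parameter/pattern/string} */
    ;

With this shape, `x=${{parameter:offset}/pattern/string}` parses
naturally, and, as discussed above, errors inside a nested pattern
would be reported by the parser with accurate locations.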