%nterm directive incorrectly accepts character literals and quoted strings

Rici Lake Mon, 15 Oct 2018 09:05:24 -0700

(Modified from an email originally sent privately to Akim, who persuaded me
to make it public. It's really a minor point.)


I only just noticed that %nterm is a bison directive, although I still
don't fully understand the motivation. I can't find any reference in the
manual, except for its unexplained use in one example. But Akim assures me
that it has been accepted for quite some time, at least since 1993.

I noticed it more or less by accident when looking to see what the exact
syntax for %token, %type and %[precedence-level] declarations was. And
once I noticed it, I added it to a test suite and promptly discovered that
this input segfaults on bison 3.0.5:

    %union {int n;}
    %nterm <n> start "a"
    %%
    start: "a" | start "a"

With bison 3.1, the segfault is gone, but the error messages are a bit
mysterious:

    nterm.y:4.8-10: error: symbol "a" redefined
    start: "a" | start "a"
           ^^^
    nterm.y:4.1-5: error: rule given for start, which is a token
    start: "a" | start "a"
    ^^^^^

The %nterm grammar is essentially the same as the grammar for the %token
directive, and I suppose that it was considered to be the complement for
declaring non-terminals. But a non-terminal cannot be aliased to a quoted
name. Moreover, non-terminals do not have any equivalent to the token
enumeration, so there is no meaningful way to assign a number to a
non-terminal. The attempt to alias a non-terminal, as in the incorrect
bison snippet above, should be flagged as an error in the %nterm
declaration, not left to create havoc later on in the parse.
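
(For comparison, here is a minimal sketch of what the snippet above
presumably meant to say: the alias is attached to a token by a %token
declaration, and %nterm lists only the non-terminal. The token name A is
invented purely for illustration.)

    %union {int n;}
    /* A is a hypothetical token name; "a" is its alias. */
    %token <n> A "a"
    /* Only an identifier follows the tag in the %nterm declaration. */
    %nterm <n> start
    %%
    start: "a" | start "a"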

The misbehaviour of the above sample program is the result of not checking
to ensure that the alias syntax is only applied to tokens. I didn't attempt
to do a full analysis of the segfault in bison 3.0.5 since the bug was
fixed before it was noticed :-) but it happens at a point in which bison is
about to issue a different mysterious error message, alleging that a
token's token number had previously been assigned. (In the course of that
error report, bison tries to compare the source code locations for two
definitions, and segfaults because one of the locations is NULL.) In 3.1, a
new function is used which merges the attributes of a token and its alias;
in the case of the incorrect %nterm declaration, this seems to have the
effect of making the alias ("a") into a non-terminal. When "a" is
subsequently encountered in a rule and converted into a token alias, bison
complains that it was already defined (as a non-terminal). The redefinition
appears to then change `start` into a token, which makes it an invalid
left-hand-side for a production.

For the casual user (or even the not-so-casual user), the error would have
been much clearer if the report had been something like:

    nterm.y:2.18-20: error: Quoted strings cannot alias non-terminals
    %nterm <n> start "a"
                     ^^^

But it would probably suffice to just reject the declaration as a syntax
error, by changing the grammar so that only IDs can be listed in an %nterm
declaration. (The situation is actually unlikely to arise since %nterm is
undocumented and, I think, not particularly well-known.)

The various symbol declarations are a bit of a jumble, thanks
to backwards compatibility and some unfortunate decisions made before the
start of recorded time. For what it's worth, this is my understanding of
the different syntaxes (in a kind of EBNF):

    class-declaration      ::=  ( %token | %nterm ) tag? ( ID NUMBER? QUOTED? )+
                                ( tag ( ID NUMBER? QUOTED? )+ )*
    precedence-declaration ::=  ( %left | %right | %precedence | %nonassoc)
                                tag? ( ID NUMBER? | QUOTED | CHARACTER )+
    type-declaration       ::=  %type tag ( ID | QUOTED | CHARACTER ) +

The inconsistency between %token, which allows aliases to be declared, and
%precedence declarations, which treat aliases as new independent symbols,
is noted in the manual, and apparently is necessary for Posix compliance.
Posix does not require an %nterm declaration, and the Posix grammar for
declarations is much simpler, since it doesn't allow multiple tags in a
single declaration, and it doesn't allow quoted strings. (IDENTIFIER here
includes character literals):

    declaration      ::=  ( %token | %type | %left | %right | %nonassoc )
                          tag? ( IDENTIFIER NUMBER? )+

(Note that the Posix grammar allows %type declarations without a tag.)

The fact that bison's %token and %nterm declarations do allow multiple tags
is probably the only justification for the existence of %nterm, since there
is no need to predeclare non-terminals. (It could be considered superior to
%type because it explicitly states that the targets are non-terminals. On
the other hand, it is generally more useful IMHO to group terminals and
non-terminals with the same type tag together.) But it is many decades too
late to suggest removing it; my only suggestion is that its grammar be
limited to the semantically meaningful:

    token-declaration      ::=  %token tag? ( ID NUMBER? QUOTED? )+
                                ( tag ( ID NUMBER? QUOTED? )+ )*
    nterm-declaration      ::=  %nterm tag? ID+ ( tag ID+ )*
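
(As a rough illustration of the restricted form, the sketch below uses
two tags in a single %nterm declaration and nothing but identifiers
after each tag. The symbol names count, name, NUM and WORD are all
hypothetical, and the action is only there to use both values.)

    %union {int n; char *s;}
    %token <n> NUM
    %token <s> WORD
    /* Two tags in one declaration, each followed only by IDs. */
    %nterm <n> count <s> name
    %%
    count: NUM | count name { $$ = $1 + ($2 != 0); }
    name: WORD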

On the other hand, there is no real reason not to extend multiple-tag
syntax to the other declarations, which would make the syntaxes a little
less incoherent:

    token-declaration      ::=  %token tag? ( ID NUMBER? QUOTED? )+
                                ( tag ( ID NUMBER? QUOTED? )+ )*
    nterm-declaration      ::=  %nterm tag? ID+ ( tag ID+ )*
    precedence-declaration ::=  ( %left | %right | %precedence | %nonassoc)
                                tag? ( ID NUMBER? )+ ( tag ( ID NUMBER? )+ )*
    type-declaration       ::=  %type ( tag ( ID | QUOTED | CHARACTER )+ )+

Such a change would not make bison any more or less Posix-compliant than it
already is.

Rici.
