On at least one occasion when converting a grammar from a formal language specification to a Bison specification, I forgot to convert the single quotes around multi-character literals to double quotes. Initially, I was amazed at the massive number of parser table conflicts Bison reported, and I was amazed at how long IELR(1) took to generate the tables as a result. Eventually I realized the problem: Bison thinks literals like '+' and '++' are the same token. I wonder how many researchers have been led astray by this problem when using Bison to collect LALR statistics.
I'm surprised to see a comment in scan-gram.l that describes this behavior almost as if it's desired: /* Characters. We don't check there is only one. */ Is there some good reason why we shouldn't check? If not, then I'd like to push the following patch (written against branch-2.5) to master and branch-2.5. Any objections? I've thought about making it an error instead of a warning because it seems like a severe mistake to miss. However, I worry that could cause trouble for distro maintainers trying to build existing projects. In the new test group, I wish I could check the case where a character literal ends at eof without a newline. However, the documentation for AT_DATA demands that the file contents end with an "end of line", so I'm not sure what to do. I might write the Autoconf people later if no one around here already knows how to work around this limitation. >From 34689e4754fcac0154ba69622e241edeab1e6d7c Mon Sep 17 00:00:00 2001 From: Joel E. Denny <[email protected]> Date: Thu, 23 Jul 2009 13:29:22 -0400 Subject: [PATCH] Warn about character literals not of length one. * NEWS (2.5): Document. * src/scan-gram.l (INITIAL): Remove comment that we don't check the length. (SC_ESCAPED_CHARACTER): Warn if length is wrong. * tests/input.at (Bad character literals): New test group. diff --git a/NEWS b/NEWS index 7d401ad..e723542 100644 --- a/NEWS +++ b/NEWS @@ -98,6 +98,18 @@ Bison News about a missing semicolon where it did not before. Future releases of Bison will cease to append semicolons entirely. +** Character literals not of length one. + + Previously, Bison quietly converted all character literals to length + one. For example, without warning, Bison interpreted the operators in + the following grammar to be the same token: + + exp: exp '++' + | exp '+' exp + ; + + Bison now warns when a character literal is not of length one. + * Changes in version 2.4.2 (????-??-??): ** %code is now a permanent feature. diff --git a/src/scan-gram.l b/src/scan-gram.l index 60813bb..7017dbb 100644 --- a/src/scan-gram.l +++ b/src/scan-gram.l @@ -250,7 +250,7 @@ splice (\\[ \f\t\v]*\n)* complain_at (*loc, _("invalid identifier: %s"), quote (yytext)); } - /* Characters. We don't check there is only one. */ + /* Characters. */ "'" STRING_GROW; token_start = loc->start; BEGIN SC_ESCAPED_CHARACTER; /* Strings. */ @@ -465,24 +465,38 @@ splice (\\[ \f\t\v]*\n)* <SC_ESCAPED_CHARACTER> { "'"|"\n" { - if (yytext[0] == '\n') - unexpected_newline (token_start, "'"); STRING_GROW; STRING_FINISH; loc->start = token_start; val->character = last_string[1]; + { + size_t length = strlen (last_string); + if (strlen (last_string) < 3) + warn_at (*loc, _("empty character literal")); + else if (strlen (last_string) > 3) + warn_at (*loc, _("extra characters in character literal")); + } + if (yytext[0] == '\n') + unexpected_newline (token_start, "'"); STRING_FREE; BEGIN INITIAL; return CHAR; } <<EOF>> { - unexpected_eof (token_start, "'"); STRING_FINISH; loc->start = token_start; - if (strlen (last_string) > 1) - val->character = last_string[1]; - else - val->character = last_string[0]; + { + size_t length = strlen (last_string); + if (length < 2) + warn_at (*loc, _("empty character literal")); + else if (length > 2) + warn_at (*loc, _("extra characters in character literal")); + if (length > 1) + val->character = last_string[1]; + else + val->character = last_string[0]; + } + unexpected_eof (token_start, "'"); STRING_FREE; BEGIN INITIAL; return CHAR; diff --git a/tests/input.at b/tests/input.at index f62a68d..8ba2496 100644 --- a/tests/input.at +++ b/tests/input.at @@ -1162,3 +1162,50 @@ AT_CHECK_NAMESPACE_ERROR([[::]], [[namespace reference has a trailing "::"]]) AT_CLEANUP + +## ------------------------ ## +## Bad character literals. ## +## ------------------------ ## + +# Bison used to accept character literals that were empty or contained +# too many characters. + +AT_SETUP([[Bad character literals]]) + +AT_DATA([empty.y], +[[%% +start: ''; +start: ' +]]) + +AT_BISON_CHECK([empty.y], [1], [], +[[empty.y:2.8-9: warning: empty character literal +empty.y:3.8-4.0: warning: empty character literal +empty.y:3.8-4.0: missing `'' at end of line +]]) + +AT_DATA([two.y], +[[%% +start: 'ab'; +start: 'ab +]]) + +AT_BISON_CHECK([two.y], [1], [], +[[two.y:2.8-11: warning: extra characters in character literal +two.y:3.8-4.0: warning: extra characters in character literal +two.y:3.8-4.0: missing `'' at end of line +]]) + +AT_DATA([three.y], +[[%% +start: 'abc'; +start: 'abc +]]) + +AT_BISON_CHECK([three.y], [1], [], +[[three.y:2.8-12: warning: extra characters in character literal +three.y:3.8-4.0: warning: extra characters in character literal +three.y:3.8-4.0: missing `'' at end of line +]]) + +AT_CLEANUP -- 1.5.4.3
