I should mention that our use-case for ZetaSQL <https://github.com/google/zetasql> is basically identical to that of Tableau's as Adrian described it. We use it for an autocompleter (Google-internal only, for now, so no code links) and do the same IDENTIFIER search followed by pruning any keywords after a sentinel token. Our main parser <https://github.com/google/zetasql/blob/aac17f8352ab620afbf6124d7193ccedc79ec786/zetasql/parser/bison_parser.y> is also based off the lalr1.cc skeleton, but when I added Autocompletion I didn't have Adrian's skill/patience to add Look-Ahead Correction to that skeleton (thanks for doing that btw!), and instead created a parallel parser based off of the yacc.c skeleton, which did have LAC implemented.
On Fri, Jan 3, 2020 at 6:13 AM Adrian Vogelsgesang <[email protected]> wrote:

> Hi Akim,
>
> I am happy to see you are looking into improving error reporting.
> We at Tableau also struggled with this. I think your proposed solution,
> in particular "parse.error custom", would improve things a lot for us.
>
> Our parser uses the C++ skeleton, though. In case you are interested,
> I could take care of the C++ work, as soon as we know which exact
> interface we want to support.
>
> Also, I would like to explain our particular use case in this email,
> as it differs from the other use cases you mentioned.
> I am by no means implying that this use case is common or that it
> should be supported. I just wanted to share this particular use case
> as an example for more esoteric functionality people might be asking
> for.
>
> Our grammar is structured similarly to the Postgres grammar
> (https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y).
> If we take a look at ColId, type_function_name, NonReservedWord (lines 15064
> and following), we can see that in most places where IDENT (i.e. identifiers)
> is accepted, "unreserved_keyword" is also accepted. Now, if someone types
>
>     CREATE TABLE $1
>
> there is a syntax error near "$1" and the list of expected tokens would
> include the complete "unreserved_keyword" list, i.e. ~300 keywords.
> Obviously this is pointless.
>
> Hence, we have custom logic in place which checks if the IDENT token is
> among the expected tokens. If so, we suppress all tokens with an internal
> token number larger than the token number of "ABORT_P" - which happen to be
> all keywords.
>
> You can find our current "solution" attached to this email. You can run this
> Python script against a C++ parser generated by bison and it will inject our
> custom error handling code.
>
> Cheers,
> Adrian
>
> *From: *bison-patches <bison-patches-bounces+avogelsgesang=[email protected]> on behalf of Akim Demaille <[email protected]>
> *Date: *Friday, 3 January 2020 at 11:07
> *To: *Bison Patches <[email protected]>
> *Cc: *Derek Clegg <[email protected]>, Christian Schoenebeck <[email protected]>, Kevin Villela <[email protected]>, Rici Lake <[email protected]>
> *Subject: *RFC: custom error messages
>
> Hi all,
>
> There is quite some demand to provide the users with more freedom
> to customize the error messages. There's also demand for us to stop
> double-quoting the token names in the table of the symbol names. For
> instance in Bison's own parser we have
>
> | static const char *const yytname[] =
> | {
> |   "\"end of file\"", "error", "$undefined", "\"string\"", "\"%token\"",
> |   "\"%nterm\"", "\"%type\"", "\"%destructor\"", "\"%printer\"",
> |   "\"%left\"", "\"%right\"", "\"%nonassoc\"", "\"%precedence\"",
> |   "\"%prec\"", "\"%dprec\"", "\"%merge\"", "\"%code\"",
> |   "\"%default-prec\"", "\"%define\"", "\"%defines\"", "\"%error-verbose\"",
> |   "\"%expect\"", "\"%expect-rr\"", "\"%<flag>\"", "\"%file-prefix\"",
> |   "\"%glr-parser\"", "\"%initial-action\"", "\"%language\"",
> |   "\"%name-prefix\"", "\"%no-default-prec\"", "\"%no-lines\"",
> |   "\"%nondeterministic-parser\"", "\"%output\"", "\"%pure-parser\"",
> |   "\"%require\"", "\"%skeleton\"", "\"%start\"", "\"%token-table\"",
> |   "\"%verbose\"", "\"%yacc\"", "\"{...}\"", "\"%?{...}\"",
> |   "\"[identifier]\"", "\"character literal\"", "\":\"", "\"epilogue\"",
> |   "\"=\"", "\"identifier\"", "\"identifier:\"", "\"%%\"", "\"|\"",
> |   "\"%{...%}\"", "\";\"", "\"<tag>\"", "\"<*>\"", "\"<>\"", "\"integer\"",
> |   "\"%param\"", "\"%union\"", "\"%empty\"", "$accept", "input",
> |   "prologue_declarations", "prologue_declaration", "$@1", "params",
> |   "grammar_declaration", "code_props_type", "union_name",
> |   "symbol_declaration", "$@2", "$@3", "precedence_declarator", "tag.opt",
> |   "generic_symlist", "generic_symlist_item", "tag", "nterm_decls",
> |   "token_decls", "token_decl.1", "token_decl", "int.opt",
> |   "token_decls_for_prec", "token_decl_for_prec.1", "token_decl_for_prec",
> |   "symbol_decls", "symbol_decl.1", "grammar",
> |   "rules_or_grammar_declaration", "rules", "$@4", "rhses.1", "rhs",
> |   "named_ref.opt", "variable", "value", "id", "id_colon", "symbol",
> |   "string_as_id", "string_as_id.opt", "epilogue.opt", YY_NULLPTR
> | };
>
> This double-escaping wrecks non-ASCII string literals, such as
> mathematical symbols written in UTF-8. It also is a problem when you
> want to translate your symbols.
>
> About one year ago I made a first attempt to address this:
> https://lists.gnu.org/r/bison-patches/2018-12/msg00088.html. That
> series of patches completely changed the semantics of yytname (by
> removing this double-quotation). I was quite happy with the result (I
> liked a lot the fact that there was just one implementation, not a
> host of alternatives, as that's the kind of thing that makes Bison
> hard to maintain). I thought we were done, but there were a few
> problems.
>
> One severe issue brought to my attention by Rici Lake (unfortunately
> privately, although he had written a very nice and detailed mail with
> all the details) is that this would break several existing parsers
> that expect yytname to be this way. For instance he pointed to
>
> https://git.gnupg.org/cgi-bin/gitweb.cgi?p=libksba.git;a=blob;f=src/asn1-parse.y;h=5bff15cd8db64786f7c9e2ef000aeddd583cfdc0;hb=HEAD#l856
>
> currently not responding, but the code is:
>
> | for (k = 0; k < YYNTOKENS; k++)
> |   {
> |     if (yytname[k] && yytname[k][0] == '\"'
> |         && !strncmp (yytname[k] + 1, string, len)
> |         && yytname[k][len + 1] == '\"' && !yytname[k][len + 2])
> |       return yytoknum[k];
> |   }
>
> While I was okay that my changes would modify some error messages of
> some parsers, I did not want to break parsers.
> Note that the core
> problem here is that some people use yytname to write their scanners.
> In other words, they use the string literal not as a means to
> customize the error messages, but to encode in some way their scanner.
>
> We cannot satisfy both needs (better error messages and helping the
> definition of the scanner) at the same time. I fully support the use
> of string literals for better error messages, and will not help to use
> them for the scanner. I will pay attention not to break existing
> code, but I will not make any efforts to make it easier, nor to be
> portable across skeletons (in the future though, Bison will help these
> users, but by a totally different means).
>
> I looked at how people use yytnamerr on GitHub (see at the end of this
> message) and I would summarize that:
>
> - some users "fight" the double quotation there
>
> - some users want to translate the token names
>
> - some users want to customize their error messages and try to
>   shoehorn that into yytnamerr. The example of PHP is particularly
>   enlightening
>
> I think we can satisfy everybody by offering alternatives to "%define
> parse.error verbose".
>
> With "%define parse.error rich" (for lack of a better idea,
> suggestions most welcome), you get something equivalent to 'verbose'
> but (i) there is no double quotation, (ii) there is support for symbol
> translation (as in
> https://lists.gnu.org/r/bison-patches/2018-12/msg00088.html). This
> would _not_ support yytnamerr.
>
> With "%define parse.error custom", the user must define the
> yyreport_syntax_error function, which is fully responsible for emitting
> the syntax error message. In the branch yyreport_syntax_error (see
> https://github.com/akimd/bison/pull/16), the example looks like this:
>
> | int
> | yyreport_syntax_error (const yysyntax_error_context_t *ctx)
> | {
> |   enum { YYERROR_VERBOSE_ARGS_MAXIMUM = 10 };
> |   /* Arguments of yyformat: reported tokens (one for the "unexpected",
> |      one per "expected").  */
> |   char const *arg[YYERROR_VERBOSE_ARGS_MAXIMUM];
> |   int n = yysyntax_error_arguments (ctx, arg, sizeof arg / sizeof *arg);
> |   if (n == -2)
> |     return 2;
> |   fprintf (stderr, "SYNTAX ERROR on token [%s]", arg[0]);
> |   if (1 < n)
> |     {
> |       fprintf (stderr, " (expected:");
> |       for (int i = 1; i < n; ++i)
> |         fprintf (stderr, " [%s]", arg[i]);
> |       fprintf (stderr, ")");
> |     }
> |   fprintf (stderr, "\n");
> |   return 0;
> | }
>
> Again, this would not use yytnamerr; it's up to the user to implement
> and call whatever transformation she wants on the error message. And
> there would be no double quotation in yytname.
>
> Christian made a strong case in favor of what he called a "push" API
> rather than pull:
>
> May 11th 2019, Christian Schoenebeck <[email protected]>
> (https://lists.gnu.org/r/help-bison/2019-05/msg00041.html):
>
> > My old concerns still apply, that is I still find a push design
> > (that is you call Bison API functions you need explicitly in your
> > yyerror() implementation to get just the information you need for
> > the error message) more appropriate, flexible, powerful and API
> > stable than a pull design (where you would get a predefined list of
> > information passed to your yyerror_whatever() implementation; like
> > eat it or leave it).
> >
> > The push design has various advantages. E.g. Bison can easily be
> > extended with new API functions if somebody needs another piece of
> > information, without breaking the previous API and people's existing
> > yyerror() implementations. No matter what kind of arguments Bison
> > would pass to that suggested yyerror_*() function today, I am sure
> > later on people would ask about different information to be provided
> > by Bison as well, or in a different way. Some of them might even be
> > exotic from your point of view, but legitimate for certain use
> > cases, some might become replaced & deprecated, etc.
> > Plus developers
> > could duplicate and modify the parser state without touching the
> > current state to see potential future information, etc.
>
> I think he is right, hence the call to yysyntax_error_arguments, which
> returns the list of expected/unexpected tokens.
>
> I can't make up my mind on whether to return the list of expected
> tokens as strings (as exemplified above), or simply as their symbol
> numbers. Symbol numbers are more efficient, yet they are the
> *internal* symbol numbers, not the ones the user is exposed to.
>
> I'm not sure I really want the unexpected token to be in the same
> array. Maybe some day we would like to have $undefined carry a string
> semantic value, where we would write the unexpected string:
>
> $ echo "1+&" | ./_build/g9d/examples/c/calc/calc
> SYNTAX ERROR on string [&] (expected: ["number"] ['('])
>
> would look better than
>
> $ echo "1+&" | ./_build/g9d/examples/c/calc/calc
> SYNTAX ERROR on token [$undefined] (expected: ["number"] ['('])
>
> If you are familiar with the current implementation of 'verbose' error
> reporting in yacc.c, you'll see that you cannot emulate it with the
> custom API I propose. That's because the 'verbose' implementation
> goes to a lot of trouble so that the error message is built and
> stored in yyparse's stack frame (using alloca). That requires a
> two-step calling convention, where one would first ask "how much
> space do I need for the error message" and then say "I allocated
> this much space here, please strcpy the error message here". This is
> too complex; I don't think we should support such a protocol.
>
> I also plan to get rid of YYERROR_VERBOSE. Juggling with CPP and M4
> at the same time is a nightmare. YYERROR_VERBOSE was deprecated
> looong ago (Bison 2.6, 2012-07-19, claims that it's deprecated since
> 1.875:
> https://git.savannah.gnu.org/cgit/bison.git/commit/NEWS?id=258cddbc3).
>
> I would really appreciate comments on this proposal.
> Sorry for making
> it so long.
>
> Cheers!
>
>
> # hayate891/linuxsampler -- Parser for NKSP real-time instrument script language
>
> They have single quotes in their string literals, yet they don't want
> the outer quotes (which are not removed by the default yytnamerr if
> there are single quotes). IOW, this would not be needed with my changes.
>
> https://github.com/hayate891/linuxsampler/blob/831838989ea8066d3875057e84e7585a1fec4160/linuxsampler/trunk/src/scriptvm/parser.y#L806
> ```
> /// Custom implementation of yytnamerr() to ensure quotation is always
> /// stripped from token names before printing them to error messages.
> int InstrScript_tnamerr(char* yyres, const char* yystr) {
>     if (*yystr == '"') {
>         int yyn = 0;
>         char const *yyp = yystr;
>         for (;;)
>             switch (*++yyp)
>             {
>                 /*
>                 case '\'':
>                 case ',':
>                     goto do_not_strip_quotes;
>                 case '\\':
>                     if (*++yyp != '\\')
>                         goto do_not_strip_quotes;
>                 */
>                 /* Fall through. */
>                 default:
>                     if (yyres)
>                         yyres[yyn] = *yyp;
>                     yyn++;
>                     break;
>
>                 case '"':
>                     if (yyres)
>                         yyres[yyn] = '\0';
>                     return yyn;
>             }
>         /*
>         do_not_strip_quotes: ;
>         */
>     }
>
>     if (! yyres)
>         return (int) yystrlen (yystr);
>
>     return int( yystpcpy (yyres, yystr) - yyres );
> }
> ```
>
> # Ortelius maps your microservice configurations with their relationships to the application that use them
>
> https://github.com/ortelius/ortelius/blob/3c23d26b34d88b5e2398d341b5c9a2a9cd00a525/dmengine/dmapi/dm.y#L342
>
> Display literal tokens with quotes around them (e.g., as [unexpected
> "action"], not as [unexpected "\"action\""]).
>
> Would not be needed with straight display.
> ```
> %token <bval> T_BOOL    "boolean value"
> %token <ival> NUM       "integer value"
> %token <str>  STR STR2  "string literal"
> %token <str>  IDENT     "identifier"
> %token <str>  DOLIDENT  "$identifier"
> %token T_ACTION "\"action\""
> %token T_BREAK  "\"break\""
> %token T_CASE   "\"case\""
> %token T_CATCH  "\"catch\""
> ```
>
> ```
> case '\\':
>     ++yyp;
>     /* RHT mod - allow double-quotes to be escaped */
>     if((*yyp != '\\') && (*yyp != '"')) {
>         goto do_not_strip_quotes;
>     }
> ```
>
> # PHP
>
> https://github.com/php/php-src/blob/e1559b51cdde65400434653c55790279a15475d5/Zend/zend_language_parser.y#L1320
>
> ```
> %token <ast> T_LNUMBER   "integer number (T_LNUMBER)"
> %token <ast> T_DNUMBER   "floating-point number (T_DNUMBER)"
> %token <ast> T_STRING    "identifier (T_STRING)"
> %token <ast> T_VARIABLE  "variable (T_VARIABLE)"
> %token <ast> T_INLINE_HTML
> %token <ast> T_ENCAPSED_AND_WHITESPACE "quoted-string and whitespace (T_ENCAPSED_AND_WHITESPACE)"
> %token <ast> T_CONSTANT_ENCAPSED_STRING "quoted-string (T_CONSTANT_ENCAPSED_STRING)"
> %token <ast> T_STRING_VARNAME "variable name (T_STRING_VARNAME)"
> %token <ast> T_NUM_STRING "number (T_NUM_STRING)"
> ```
>
> They maintain state between the different calls to yytnamerr to know
> whether it's the unexpected token, or one of the expected. To this end
> they use `CG(parse_error)`, changed in yytnamerr.
>
> https://github.com/php/php-src/blob/2dd2dcaf9c8bcdda9c687660da887c5fddeb7448/Zend/zend_globals_macros.h#L34
> ```
> # define CG(v) (compiler_globals.v)
> ```
>
> ```
> /* Copy to YYRES the contents of YYSTR after stripping away unnecessary
>    quotes and backslashes, so that it's suitable for yyerror.  The
>    heuristic is that double-quoting is unnecessary unless the string
>    contains an apostrophe, a comma, or backslash (other than
>    backslash-backslash).  YYSTR is taken from yytname.  If YYRES is
>    null, do not copy; instead, return the length of what the result
>    would have been.  */
> static YYSIZE_T zend_yytnamerr(char *yyres, const char *yystr)
> {
>     /* CG(parse_error) states:
>      * 0 => yyres = NULL, yystr is the unexpected token
>      * 1 => yyres = NULL, yystr is one of the expected tokens
>      * 2 => yyres != NULL, yystr is the unexpected token
>      * 3 => yyres != NULL, yystr is one of the expected tokens
>      */
>     if (yyres && CG(parse_error) < 2) {
>         CG(parse_error) = 2;
>     }
>
>     if (CG(parse_error) % 2 == 0) {
>         /* The unexpected token */
>         char buffer[120];
>         const unsigned char *end, *str, *tok1 = NULL, *tok2 = NULL;
>         unsigned int len = 0, toklen = 0, yystr_len;
>
>         CG(parse_error)++;
>
>         if (LANG_SCNG(yy_text)[0] == 0 &&
>             LANG_SCNG(yy_leng) == 1 &&
>             strcmp(yystr, "\"end of file\"") == 0) {
>             if (yyres) {
>                 yystpcpy(yyres, "end of file");
>             }
>             return sizeof("end of file")-1;
>         }
>
>         str = LANG_SCNG(yy_text);
>         end = memchr(str, '\n', LANG_SCNG(yy_leng));
>         yystr_len = (unsigned int)yystrlen(yystr);
>
>         if ((tok1 = memchr(yystr, '(', yystr_len)) != NULL
>             && (tok2 = zend_memrchr(yystr, ')', yystr_len)) != NULL) {
>             toklen = (tok2 - tok1) + 1;
>         } else {
>             tok1 = tok2 = NULL;
>             toklen = 0;
>         }
>
>         if (end == NULL) {
>             len = LANG_SCNG(yy_leng) > 30 ? 30 : LANG_SCNG(yy_leng);
>         } else {
>             len = (end - str) > 30 ? 30 : (end - str);
>         }
>         if (yyres) {
>             if (toklen) {
>                 snprintf(buffer, sizeof(buffer), "'%.*s' %.*s", len, str, toklen, tok1);
>             } else {
>                 snprintf(buffer, sizeof(buffer), "'%.*s'", len, str);
>             }
>             yystpcpy(yyres, buffer);
>         }
>         return len + (toklen ? toklen + 1 : 0) + 2;
>     }
>
>     /* One of the expected tokens */
>     if (!yyres) {
>         return yystrlen(yystr) - (*yystr == '"' ? 2 : 0);
>     }
>
>     if (*yystr == '"') {
>         YYSIZE_T yyn = 0;
>         const char *yyp = yystr;
>
>         for (; *++yyp != '"'; ++yyn) {
>             yyres[yyn] = *yyp;
>         }
>         yyres[yyn] = '\0';
>         return yyn;
>     }
>     yystpcpy(yyres, yystr);
>     return strlen(yystr);
> }
> ```
>
> CG(parse_error) is reset when the error message is reported:
>
> https://github.com/php/php-src/blob/a40a69fdd058cdcb7da5d4527ea6c7dd261417b7/Zend/zend.c#L1115
> ```
> ZEND_COLD void zenderror(const char *error) /* {{{ */
> {
>     CG(parse_error) = 0;
>
>     if (EG(exception)) {
>         /* An exception was thrown in the lexer, don't throw another
>            in the parser. */
>         return;
>     }
>
>     zend_throw_exception(zend_ce_parse_error, error, 0);
> }
> /* }}} */
> ```
>
> # PolishReadabilityChecker
>
> They want i18n. I don't see where the other tokens are defined though.
>
> https://github.com/stasiekz/PolishReadabilityChecker/blob/86ec1caec23352ac170dc33e12de02a6ce47c044/third_party/corpus2/poliqarp-library/sakura/parser.y#L55
>
> ```
> static size_t yytnamerr(char *yyres, const char *yystr)
> {
>     const char *translation = _(yystr);
>     size_t length = strlen(translation);
>     if (yyres != NULL)
>         strcpy(yyres, translation);
>     return length;
> }
> #if 0
>     M_("$undefined");
>     M_("$end");
> #endif
> ```
>
> # Daruma BASIC for Raspberry Pi and Mac OS X
>
> They want i18n.
> https://github.com/toshinagata/darumabasic/blob/3979d68493eceafcd4cbc6410d7d52fbe6b8c9ba/common/transmessage.c
>
> ```
> const char *err_msg[] = {
>     /* BS_M_ONLY_ONE_DO_COND */
>     "only one WHILE/UNTIL condition may appear with DO or LOOP keyword",
>
>     /* BS_M_UNDEF_SYMBOL */
>     "undefined symbol %s",
> ```
>
> ```
> const char *translated_err_msg[] = {
>     /* BS_M_ONLY_ONE_DO_COND */
>     "DO...LOOPのWHILE/UNTIL条件は一つまでです",
>
>     /* BS_M_UNDEF_SYMBOL */
>     "%sは未定義です",
> ```
>
> ```
> unsigned int
> my_yytnamerr(char *yyres, const char *yystr)
> {
>     /* This is a patch to remove 'BS_' from the token name */
>     if (strncmp(yystr, "BS_", 3) == 0)
>         yystr += 3;
>     /* End patch */
> ```
>
> https://github.com/toshinagata/darumabasic/blob/3979d68493eceafcd4cbc6410d7d52fbe6b8c9ba/common/grammar.y#L24
> ```
> /* Reserved words */
> %token BS_IF BS_THEN BS_ELSE BS_ELSEIF BS_ENDIF
> %token BS_DO BS_WHILE BS_UNTIL BS_LOOP
> ```
>
> # Intel SPMD Program Compiler
>
> https://github.com/ispc/ispc/blob/0198f9f8e3b64716e65e56a3861972c71c6c6019/src/parse.yy#L2092
>
> yytnamerr starts with an addition to be able to rename some of the
> tokens. Otherwise, back to the default implementation.
>
> This is C++, yet they use yacc.c.
>
> ```
> static int
> lYYTNameErr (char *yyres, const char *yystr)
> {
>     extern std::map<std::string, std::string> tokenNameRemap;
>     Assert(tokenNameRemap.size() > 0);
>     if (tokenNameRemap.find(yystr) != tokenNameRemap.end()) {
>         std::string n = tokenNameRemap[yystr];
>         if (yyres == NULL)
>             return n.size();
>         else
>             return yystpcpy(yyres, n.c_str()) - yyres;
>     }
>
>     if (*yystr == '"')
>     {
>         YYSIZE_T yyn = 0;
>         char const *yyp = yystr;
>
>         for (;;)
>             switch (*++yyp)
>             {
>             case '\'':
>             case ',':
>                 goto do_not_strip_quotes;
>
>             case '\\':
>                 if (*++yyp != '\\')
>                     goto do_not_strip_quotes;
>                 /* Fall through. */
>             default:
>                 if (yyres)
>                     yyres[yyn] = *yyp;
>                 yyn++;
>                 break;
>
>             case '"':
>                 if (yyres)
>                     yyres[yyn] = '\0';
>                 return yyn;
>             }
>         do_not_strip_quotes: ;
>     }
>
>     if (! yyres)
>         return yystrlen (yystr);
>
>     return yystpcpy (yyres, yystr) - yyres;
> }
> ```
>
> https://github.com/ispc/ispc/blob/017e80e7c0a4187b44e37aed3c235520fccaf088/src/lex.ll#L98
> ```
> void ParserInit() {
>     tokenToName[TOKEN_ASSERT] = "assert";
>     tokenToName[TOKEN_BOOL] = "bool";
>     tokenToName[TOKEN_BREAK] = "break";
>     tokenToName[TOKEN_CASE] = "case";
>     tokenToName[TOKEN_CDO] = "cdo";
>     tokenToName[TOKEN_CFOR] = "cfor";
>     tokenToName[TOKEN_CIF] = "cif";
> ```
>
> https://github.com/ispc/ispc/blob/017e80e7c0a4187b44e37aed3c235520fccaf088/src/lex.ll#L218
> ```
>     tokenNameRemap["TOKEN_ASSERT"] = "\'assert\'";
>     tokenNameRemap["TOKEN_BOOL"] = "\'bool\'";
>     tokenNameRemap["TOKEN_BREAK"] = "\'break\'";
>     tokenNameRemap["TOKEN_CASE"] = "\'case\'";
>     tokenNameRemap["TOKEN_CDO"] = "\'cdo\'";
>     tokenNameRemap["TOKEN_CFOR"] = "\'cfor\'";
> ```
>
> # Ruby
>
> https://github.com/ruby/ruby/blob/0c2d731ef210c9121e2a97cc5b0d7594a80389f3/parse.y#L12687
>
> ```
> /*
>  * strip enclosing double-quotes, same as the default yytnamerr except
>  * for that single-quotes matching back-quotes do not stop stripping.
>  *
>  *   "\"`class' keyword\"" => "`class' keyword"
>  */
> RUBY_FUNC_EXPORTED size_t
> rb_yytnamerr(struct parser_params *p, char *yyres, const char *yystr)
> {
>     YYUSE(p);
>     if (*yystr == '"') {
>         size_t yyn = 0, bquote = 0;
>         const char *yyp = yystr;
> ```
