Hi Akim,

> Why not indeed. It will have to be implemented in all our skeletons
> anyway, D included.

Great to hear! I wasn't sure if you were planning to cover all skeletons or
would focus on C.

> My proposal (but returning an array of token numbers rather than token
> names) would work for such a case, but it would be a bit wasteful: we
> would first gather all the possible tokens in an array, and then remove
> the useless ones.
>
> Would that be ok with you?

That would certainly work for us. We are not optimizing for "syntax errors
per second" anyway. My main ask here would be to expose the token numbers
instead of the token names. One can always get the corresponding token name
using a lookup in "yytname", so I would consider an interface which provides
the token numbers as strictly superior.
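To make that concrete, here is a rough sketch of what our filtering would
look like against such an interface. Everything below is illustrative only:
the assumption is that Bison hands us the expected tokens as an array of
numbers, and the helper name and its parameters are made up for this email.

    #include <string>
    #include <vector>

    // 'expected' would come from the hypothetical interface (token numbers);
    // 'ident' and 'abort_p' are yytranslate_ (token::IDENT) and
    // yytranslate_ (token::ABORT_P); 'yytname' is the generated name table.
    std::vector<std::string>
    expected_names (const std::vector<int>& expected,
                    int ident, int abort_p,
                    const char* const* yytname)
    {
      // If IDENT is acceptable, every unreserved keyword is too: drop them.
      bool has_ident = false;
      for (int t : expected)
        has_ident = has_ident || (t == ident);

      std::vector<std::string> res;
      for (int t : expected)
        if (!(has_ident && abort_p <= t))
          res.push_back (yytname[t]);   // number -> name is a plain lookup
      return res;
    }

Going the other way (from names back to numbers) would require string
comparisons against yytname, which is why I consider the numbers the more
useful primitive.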
> Do you have an alternative to propose?

One could also tackle this particular use case from a different angle: we
could introduce the concept of "opaque" rules, i.e. rules which are not
expanded when reporting syntax errors. E.g., if I could define
"unreserved_keyword" as

    unreserved_keyword [opaque]: ABORT_P | ABSOLUTE_P | <...>

Bison should then create the error message

    expected: Identifier, unreserved_keyword

instead of

    expected: Identifier, <long list containing all unreserved keywords>

I have no idea how common such use cases are, though. Also, I am currently
not even sure opaque rules could be efficiently implemented. Probably not
worth it… Especially given that we would still use custom error reporting
for internationalization, even independent of the filtering of the token
list.
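For completeness, this is roughly why custom reporting stays with us no
matter how the filtering question is settled: the sentence structure of the
message itself has to go through our translation machinery. A minimal
sketch, assuming we already have the filtered token names (the gettext usage
is ours, nothing here is a proposed Bison interface):

    #include <libintl.h>
    #include <string>
    #include <vector>
    #define _(msgid) gettext (msgid)

    // Assemble a translated "syntax error" message from the unexpected
    // token and the (already filtered) expected token names.
    std::string
    error_message (const std::string& unexpected,
                   const std::vector<std::string>& expected)
    {
      std::string res = std::string (_("syntax error, unexpected ")) + unexpected;
      if (!expected.empty ())
        {
          res += _(", expecting ");
          for (size_t i = 0; i < expected.size (); ++i)
            {
              if (i != 0)
                res += _(" or ");
              res += expected[i];
            }
        }
      return res;
    }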
Cheers,
Adrian

From: Akim Demaille <[email protected]>
Date: Saturday, 4 January 2020 at 12:15
To: Adrian Vogelsgesang <[email protected]>
Cc: Bison Patches <[email protected]>, Derek Clegg <[email protected]>, Christian Schoenebeck <[email protected]>, Kevin Villela <[email protected]>, Rici Lake <[email protected]>
Subject: Re: custom error messages

Hi Adrian,

> On 3 Jan 2020, at 13:13, Adrian Vogelsgesang <[email protected]>
> wrote:
>
> Hi Akim,
>
> I am happy to see you are looking into improving error reporting.
> We at Tableau also struggled with this. I think your proposed solution,
> in particular "parse.error custom", would improve things a lot for us.

Good!

> Our parser uses the C++ skeleton, though. In case you are interested,
> I could take care of the C++ work, as soon as we know which exact
> interface we want to support.

Why not indeed. It will have to be implemented in all our skeletons
anyway, D included.

> Our grammar is structured similarly to the Postgres grammar
> (https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y).
> If we take a look at ColId, type_function_name, NonReservedWord (lines 15064
> and following), we can see that in most places where IDENT (i.e. identifiers)
> is accepted, "unreserved_keyword" is also accepted. Now, if someone types
>
>     CREATE TABLE $1
>
> there is a syntax error near "$1" and the list of expected tokens would
> include the complete "unreserved_keyword" list, i.e. ~300 keywords. Obviously
> this is pointless.

Indeed. I see your point. This is probably the place where what I have in
mind for Bison 4 would shine.

> Hence, we have custom logic in place which checks if the IDENT token is
> among the expected tokens. If so, we suppress all tokens with an internal
> token number larger than the token number of "ABORT_P" - which happen to be
> all keywords.
>
> You can find our current "solution" attached to this email. You can run this
> Python script against a C++ parser generated by bison and it will inject our
> custom error handling code.

If I read correctly, the core of your approach is here:

>     int ident = SQLParser::yytranslate_ (token::IDENT);
>     int abort_p = SQLParser::yytranslate_ (token::ABORT_P);
>     got = yytname_[yytoken]; if ((yytoken < abort_p) && (got[0] == '"')) ++got;
>     bool expectedIdent = false;
>     int yyn = yypact_[yystate];
>     if (!yy_pact_value_is_default_ (yyn)) {
>       int yyxbegin = yyn < 0 ? -yyn : 0;
>       int yychecklim = yylast_ - yyn + 1;
>       int yyxend = yychecklim < yyntokens_ ? yychecklim : yyntokens_;
>       for (int yyx = yyxbegin; yyx < yyxend; ++yyx) {
>         if (yycheck_[yyx + yyn] == yyx && yyx != yyterror_ &&
>             !yy_table_value_is_error_ (yytable_[yyx + yyn]))
>           if (yyx == ident) { expectedIdent = true; break; }
>       }
>     }
>     if (!yy_pact_value_is_default_ (yyn)) {
>       int yyxbegin = yyn < 0 ? -yyn : 0;
>       int yychecklim = yylast_ - yyn + 1;
>       int yyxend = yychecklim < yyntokens_ ? yychecklim : yyntokens_;
>       for (int yyx = yyxbegin; yyx < yyxend; ++yyx) {
>         if (yycheck_[yyx + yyn] == yyx && yyx != yyterror_ &&
>             !yy_table_value_is_error_ (yytable_[yyx + yyn])) {
>           if (expectedIdent && (yyx >= abort_p)) continue;
>           const char* p = yytname_[yyx]; if ((yyx < abort_p) && (p[0] == '"')) ++p;
>           possibilities.push_back (p);
>         }
>       }
>     }

in particular

>     int ident = SQLParser::yytranslate_ (token::IDENT);
>     int abort_p = SQLParser::yytranslate_ (token::ABORT_P);

and

>     if (yyx == ident) { expectedIdent = true; break; }

which notices when IDENT is there, and then

>     if (expectedIdent && (yyx >= abort_p)) continue;

which filters out the alternatives in that case.

Such a transformation is easier to implement with the symbol numbers, indeed.

My proposal (but returning an array of token numbers rather than token
names) would work for such a case, but it would be a bit wasteful: we
would first gather all the possible tokens in an array, and then remove
the useless ones.

Would that be ok with you?

Do you have an alternative to propose?

I'm not super excited at the idea of exposing all these gory details to the
users; I don't want to commit myself to having to support
yy_pact_value_is_default_ and all these internal variables. I want to stay
high-level.
