Akim and Bison fans, > I would really like to know more about your uses of Bison. > What kind of use can require these tables (yyrhs and yyprhs)?
The short answer is that we have cases where we construct special parse trees that contain raw Bison rule numbers and symbol numbers. We cannot decipher these parse trees without access to all of the following tables: yytname[], yyr1[], yyr2[], yyprhs[] and yyrhs[]. Here is a longer explanation of how this situation materializes within our domain. ---------------- We have several large applications, each of which digests many flat text input files, each of which has its own distinct grammar. We use Bison to parse all of these input files, regardless of whether their grammar is trivial or complex. In total, we have several hundred Bison grammars. The core parsing functionality resides within an application framework that is shared by the several applications. Attributes of this parsing infrastructure include: - Grammar include files. Grammars reside in .g files, and grammar headers (containing common grammatical idioms) reside in .gh files. A set of m4 macros combine them into a single .y file we feed to Bison. - One noteworthy example: we have a .gh file that provides a rather capable programming language (vaguely C-like). This is one reason many of our grammars are "large" and "complex". - The grammar in some of our .gh files is "templated" (they are included with arguments). This gives us the grammatical equivalent of "templated container classes" -- common syntactic scaffolding containing invoker-defined grammatical sub-phrases within them. - Each grammar defines a distinct C++ class, each having a yyparse() function. These all derive from a common abstract Parser class. This is vastly cleaner than trying to manage several hundred Bison "yy" style prefixes. - We use our own parser skeleton. We do not often upgrade to newer versions of Bison -- but when we do we are willing to adapt this skeleton as required by the new Bison version. This cost is amortized over several hundred grammars, and over intervals of several years between Bison upgrades. - All of the grammars use a single, common lexical analyzer. It natively recognizes C/C++ style comments, and the following "built-in" token types: - identifiers IDENT - double-quoted strings STRING - single-quoted strings CCONST - integer constants ICONST - floating-point constants FCONST but it is also smart enough to discover all of the remaining tokens (and the token symbol numbering) directly from Bison's generated parser tables. We just write grammar, and never think about lexical issues. - The grammars construct fairly conventional parse trees consisting of C++ structs. We keep them simple and completely independent from the application code so that we can use any of these parsers and parse trees within simple ad-hoc tools without having to link in any portion of the main application's code. - Our users generally prefer to work with "flat text" input files, since this allows using tools like "grep", "sed", "awk" and relational databases to gather needed data from various sources. - One exception to this is geometric or geographic data. For example, a road network can be represented as a graph, where each vertex has an associated latitude/longitude. Using "vi" to edit a long list of latitude/longitude pairs is tedious. In such cases it is better to use a GUI tool that can display it all on a map, and let user's click, drag and drop things. - Since the data files are complex, the users put lots of comments in them to help remember important assumptions and other info. - Most GUI tools that edit such "flat text" files use a standard parser to read the file (ignoring whitespace and comments), edit the structure, and then write the modified structure back out in some "aribitrary" format. This methodology completely loses all comments and the indentation structure of the file. Our users would never use any GUI tool that is this brutal to their data files. Thus we are faced with the problem of doing "structural" editing of flat text files in a manner that preserves whitespace, indentation and comments. This is the particular niche where we absolutely require the yyprhs[] and yyrhs[] tables. Our solution is a "generic parser", which produces a "generic parse tree". Generic parse trees simultaneously represent both the phrase-structure of an input file, and its exact character sequence, including all whitespace and comments. A single common generic_yyparse() function can produce a generic parse tree for any of our several hundred Bison grammars. Generic parse trees have a special "root" node. The rest of the generic parse tree contains only two kinds of nodes: one for tokens, and the other for nonterminals. These two kinds of nodes share a common base class: struct GenericParseTree { const ParserTables * pt; // Tables describing parser/grammar Symbol * root; // Top level rule (should always be // a non-terminal) Name * endWhite; // whitespace between final token // and EOF token }; struct Symbol { bool isToken; // True for tokens, false for nonterms int symbol; // Symbol this subtree represents int state; // State this symbol was recognized // in (i.e., before shift or goto // action) NonTerm * parent; // Parent parse tree }; struct Token : public Symbol { Name * whitespace; // White space preceeding this token Name * token; // Text of the token iteself }; struct NonTerm : public Symbol { int rule; // Rule number whose reduction // produced this non-terminal int nkids; // number of children Symbol ** kids; // Array of child parse trees }; The Name object is a hash table entry, and encapsulates an arbitrary character string (byte sequence). Only one copy of each such byte sequence exists, so that two Names are equal if-and-only-if they have the same address. Like the EQ function in LISP, our application can therefore just use pointer comparison to compare strings of variable/ arbitrary length. Each "shift" action produces a Token node, and each "reduce" action produces a NonTerm node. All {...} "action code" regions in the Bison grammar are ignored by the generic parser. Here is the key -- a simple traversal of this tree that (at each Token node) prints out (a) the text in the "whitespace" member, followed by (b) the text in the "token" member, and finally the "endWhite" member -- produces the exact character sequence of the original input file. Everything is preserved identically. But now we also have access to the grammatical phrase structure of the input file as well (although perhaps in a slightly cumbersom form). The GenericParseTree therefore simultaneously represents: a. The exact character sequence of the input file, b. The set of Bison parser tables / grammar rules used to parse it, c. The exact derivation (i.e., parse tree) of this string according to these grammar rules. Items b and c are a bit cumbersome, but with a few more tools they are not too difficult to work with. To find things in this generic parse tree, one could just "hack" it -- deal with hard-coded symbol numbers and/or hard-coded rule numbers. But even the smallest changes to the grammar will cause Bison to completely renumber everything. The better approach is to code things up using actual grammar symbol names. The pt->yytname field lets you convert these into symbol numbers correctly (until you change the name of that symbol in the grammar file or get rid of it). Given a rule number k, some fairly simple code such as: lhs_symbol = pt -> yyr1 [k]; nrhs = pt -> yyr2 [k]; base = pt -> prhs [k]; for (i = 0; i < nrhs; i++) { rhs_symbol [i] = pt -> rhs [base + i]; } gives you access to the definition of rule k, in terms of symbol numbers. Given some grammar fragment such as: latlon_list: /* empty */ | latlon_list latlon ; latlon: latitude longitude ; latitude : ... longitude : ... One can actually do a "pattern match" to (a) find the rule number that matches the pattern: latlon_list: * latlon * (b) make sure there is only one such rule, and (c) return its rule number. (The match is performed across all rules k using code similar to above.) Now it is a fairly simple matter to traverse the generic parse tree, looking for all occurrences of this rule k (as a NonTerm node) and return a list of all of them. Our infrastructure even knows how to examine the grammar rules and compute a "derives" relation (just like the one used inside the lalr logic in Bison) -- and it can use this relation to avoid recursing down into subtrees that can never include any occurrences of rule k during this search of the parse tree. This is where we depend crucially upon having access to both the yyprhs[] and yyrhs[] tables. We could never make sense of a generic parse tree without them. The yytname[], yyr1[] and yyr2[] tables are equally crucial. We have other tools for examining the comment and indentation structure at any point in the generic parse tree, including pattern recognition stuff that enable us to "continue the pattern" of surrounding indentation/comments, for example, when inserting a new element into such a latlon_list phrase. Once you have modified the structure of the generic parse tree as desired, it is a simple matter to write the new character sequence form of it back out to the flat text file. To round things out, here is the basic structure we use to encapsulate all of the parser tables that we require from each Bison grammar: struct ParserTables { int yyntokens; // Number of tokens in grammar int yynnts; // Number of non-terminal symbols int yynrules; // Number of rules in grammar int yynstates; // Number of states in automaton int yylast; // Index of last entry in yytable int yyfinal; // Final state of state machine int yypact_ninf; // Special flag value in yypact[] int yytable_ninf; // Special flag value in yytable[] int yyntbase; // Symbol number of first nonterm int nsyms; // Number of elements in yytname[] const short * yyprhs; // The yyprhs[] table const short * yyrhs; // The yyrhs[] table const char * const * yytname; // The yytname[] table const short * yyr1; // The yyr1[] table const short * yyr2; // The yyr2[] table const short * yyrline; // The yyrline[] table const short * yydefact; // The yydefact[] table const short * yydefgoto; // The yydefgoto[] table const short * yypact; // The yypact[] table const short * yypgoto; // The yypgoto[] table const short * yytable; // The yytable[] table const short * yycheck; // The yycheck[] table const char * parserName; // Name of the parser class that // defines this grammar. }; Our parsing skeleton includes the following snippet: #define yyquotify(x) yyquotify2(x) #define yyquotify2(x) #x static const ParserTables yyalltables = { YYNTOKENS, YYNNTS, YYNRULES, YYNSTATES, YYLAST, YYFINAL, YYPACT_NINF, YYTABLE_NINF, YYNTOKENS, (sizeof(yytname)/sizeof(yytname[0])) - 1, yyprhs, yyrhs, yytname, yyr1, yyr2, #if YYDEBUG != 0 yyrline, #else NULL, #endif yydefact, yydefgoto, yypact, yypgoto, yytable, yycheck, yyquotify(CLASS), }; #undef yyquotify2 #undef yyquotify const ParserTables * CLASS::getParserTables () { return &yyalltables; } This is a very small additional size increment added to each of our parsers (other than the obvious penalty of forcing all our tables to be "short" so that they conform to this mold). This gives us the ability to obtain "all of the parser tables" for any of our grammars in a uniform way. Given (1) the pathname of a file to parse, and (2) a pointer to the ParserTables object describing the grammar to use, our common generic parser routine can construct a generic parse tree for the file. The ParserTables object also provides access to the yytname[], yyr1[], yyr2[], yyprhs[] and yyrhs[] tables, all of which are all necessary to make sense of this generic parse tree. This is a very brief overview describing just a fraction of the many ways that we use and abuse Bison in our application. You should also be aware of the following facts regarding the new technique Bison uses (in the master branch) to print reduced rules (using yyssp and yystos[] instead of yyprhs[] and yyrhs[]): 1. This technique only works "in flight" while you are actually parsing a file, because only there do you have a parsing stack available that represents a valid automaton configuration from which you can "read off" the correct sequence of states that gets you the right-hand-side of the rule. Outside of this context, there is no simple way to statically determine what rule k looks like. The yyprhs[] and yyrhs[] tables provide this very directly. 2. There is at least one state machine optimization technique that I am aware of for which the accessing symbol of states is no longer guaranteed to be unique. (In fact, the set of symbols that access a state can be an arbitrary mix of both tokens and nonterminals.) In this context, it is not possible to produce a well-defined yystos[] table. If Bison ever adopts such optimizations in the future (e.g., bison -O2), it will have to fall back to a more reliable method of printing rules during reduction, such as with yyprhs[] and yyrhs[]. The optimization I am alluding to removes all "unit reductions" from the parsing automaton. These are reductions of the form: a: b ; and that have no action code (other than the default action of $$ = $1). The symbol b can be either a nonterminal or a token. Such unit reductions serve only to symbolically "rename" the symbol at the top of the stack. These "no-op" reduce operations can simply be omitted with no ill effect. There are lots of other good things that come from doing this, but that should be a separate discussion. David David M. Warme, Ph.D. Principal Computer Scientist Group W, Inc. Fairfax, Virginia, USA. _______________________________________________ help-bison@gnu.org http://lists.gnu.org/mailman/listinfo/help-bison