Hello, I have a design dilemma that will become real some time in the future, and considering how large it is, I thought it would be a good idea to take a quick look forward now.
I am building a Bison parser for a language, or, to be precise, for several very similar languages. I have a "main" language and four other languages which are all subsets of it. Concretely, I'm building a parser for the XPath language, and the flavours I need to be able to distinguish are:

* XPath 2.0. This is as broad as it gets.
* XPath 1.0. A subset of XPath 2.0; XPath 2.0 is an extension of XPath 1.0.
* XSL-T 2.0 Patterns. A small subset of XPath 2.0.
* XSL-T 1.0 Patterns. A small subset of XPath 1.0.
* W3C XML Schema Selectors. An even smaller subset of XPath 1.0.

My question is how I should practically modularize the code in order to support these different languages efficiently.

First of all, my thought is that the scanner (flex) is the same in every case (i.e., it supports all tokens in XPath 2.0), and that distinguishing the various "languages" is done at a higher level (the parser).

Distinguishing XPath 1.0 from 2.0 is, as far as I can tell, the easiest. Since XPath 2.0 is an extension of 1.0, one can pass the parser an argument which signifies whether 1.0 is being parsed, and in the actions for 2.0-only expressions error out if so. In other words, conditional checks on a per-action basis.

This approach, however, easily becomes complex when the other grammars are taken into account, because one needs to be "context" aware. For example, XSL-T Patterns is a subset, but the disallowed constructs are only disallowed in certain scenarios. Hence, if one continued with conditional tests ("What language am I parsing?") inside actions, it would require implementing "non-terminal awareness".

Another approach, which seems attractive to me if it's possible, is to modularize the grammar at the API/file level. For example, the tokens are declared in one file, the non-terminals are grouped in files, and a separate parser is constructed for each language.
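To illustrate the per-action conditional approach, here is a minimal sketch. All names (xpath_lang, LANG_XPATH10, TO_KW, RangeExpr, AdditiveExpr) are illustrative, not taken from a real XPath grammar:

```yacc
/* Hedged sketch: a single grammar with a run-time language flag
 * passed in via %parse-param.  Range expressions ("1 to 10") exist
 * only in XPath 2.0, so the action rejects them in 1.0 mode. */
%parse-param { int xpath_lang }   /* e.g. LANG_XPATH10 or LANG_XPATH20 */
%token TO_KW                      /* the XPath 2.0 "to" keyword */

%%
RangeExpr:
    AdditiveExpr
  | AdditiveExpr TO_KW AdditiveExpr
      {
        if (xpath_lang == LANG_XPATH10)
          YYERROR;  /* "to" is not valid XPath 1.0; report and bail */
      }
  ;
```

As the text above notes, this stays simple only as long as validity depends on the language alone; once it depends on *where* in the grammar a construct appears, every such action needs extra context.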
It would be preferred if it was also modularized at the object level, but I guess the disadvantage wouldn't be that big if it wasn't. In other words, if one could "select the start symbol depending on the language", that would solve my problems, it seems. I don't know how this "Bison modularization" would be done practically, though.

What are people's experiences with these kinds of problems? What are the approaches for solving them?

Cheers,

Frans

PS. For those interested, here are the EBNF productions for what I'm talking about:

XPath 2.0 (1.0 is merely a subset): http://www.w3.org/TR/xpath20/#nt-bnf
XSL-T Patterns: http://www.w3.org/TR/xslt20/#pattern-syntax
W3C XML Schema Selectors: http://www.w3.org/TR/xmlschema-1/#coss-identity-constraint

By the way, there's also an interesting document on parser/scanner construction for XPath, "Building a Tokenizer for XPath or XQuery": http://www.w3.org/TR/2005/WD-xquery-xpath-parsing-20050404/

_______________________________________________
Help-bison@gnu.org
http://lists.gnu.org/mailman/listinfo/help-bison
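On the "select the start symbol depending on the language" idea: since Bison allows only one %start symbol, a common yacc/Bison idiom is to simulate multiple start symbols with pseudo-tokens that the scanner emits exactly once, before any real input token. A hedged sketch (all token and non-terminal names are illustrative):

```yacc
/* Classic "pseudo start token" trick: one grammar, one parser,
 * but the first token the scanner returns selects which
 * sub-language is actually parsed this run. */
%token START_XPATH START_PATTERN START_SELECTOR

%start TopLevel
%%
TopLevel:
    START_XPATH    Expr       /* full XPath expression */
  | START_PATTERN  Pattern    /* XSL-T pattern subset */
  | START_SELECTOR Selector   /* W3C XML Schema selector subset */
  ;
```

The scanner would keep a flag (set from the same argument mentioned earlier) and return the chosen START_* token on its first call. This gives one parser object per language family while keeping the shared productions in a single grammar file, at the cost of the LALR tables covering all variants at once.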