On Monday, 23 January 2017 at 01:46:58 UTC, FatalCatharsis wrote:
On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis wrote:
The real issue is ambiguity. Any time you have a cycle you must be able to get out of it and so your rules must be organized so that one always checks to see if termination has occurred before checking for nesting. If you allow, say } as an element of a stringliteral then it will be ambiguous as the grammar will try to match it, when it is meant to be a literal.

This brings up another question for me. Isn't the token string production ambiguous by itself?

q{ tokens }

How does it know whether the input "}" is another token or the terminator on a token string?

You know how an if statement in many languages has the feature of Short-circuit evaluation?

if (X | Y)

X is first tested, and if it is true, there is no need to test Y.

Now lets say that X is true at some point but Y is "recursive" in some sense, while X is false, Y is evaluated but once X becomes true there is no need for Y to be evaluated.

This "analogy" is basically what is happening in the grammar for nested rules.

As long as we have a terminating rule that is checked first and bypasses the non-recursive rules, we will terminate.

So, to understand this easily we think about this way for any type of nested things.

1. Create the terminating instance(e.g., the X above) that has no recursive properties/non-terminating behavior.

2. Create the recursive instance(the Y).

Then effectively do the if(X | Y). This generally can be directly mapped in to a grammar because the if statements are implicit in the rules(since it is a decision tree).

So, to "create X", which is simply a string literal, it has the structure q{...}.

This structure must be unambiguous for .... This means that ... must not have q{ or } in it unless they are escaped in some way(which essentially just changes them in to something else. e.g., \n is not \ and n but a completely different character as far as the grammar is concerned.

Now, that is the terminal rule. It can't recurse because ... will be parsed as "whatever" and that whatever can't be, by design(we hope), be re-parsed as itself(in any way).


Once we have that, we create the non-terminal recursive rule:

q{...} in which ... can be itself(or something else which then is this, or whatever.

You basically have this stuff(you might have to make sure the first case actually is a terminal and can't recurse on itself.

Once you have that, it is easy to create a terminating grammar:

Y = q{X | Y}

First X is checked, if it is found then Y terminates, else we try Y.

Such a rule as the above can be expanded.

Y = q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | ... Y}

so, this only allows strings like

q{X}
q{q{X}}
q{q{q{X}}}
q{q{q{q{X}}}}
q{q{q{q{q{X}}}}}

etc...

But they always terminate as long as X terminates.

The compilers logic is this:

Check if the tokens look like q{ then some form of X, if no X then check if they match Y, which requires reapplying the whole check, then a }.

Suppose our rule is

Y = q{1X | 2Y}

q{1X}
q{2q{1X}}
q{2q{2q{1X}}}
q{2q{2q{2q{1X}}}}
q{2q{2q{2q{2q{1X}}}}}

would be the valid matches.


If we allows other alternates like

Y = q{X | (A | B) }
A = l{Y}
B = ($|%)A

then things like

q{X}

q{l{q{X}}
q{l{q{l{q{X}}}}} (basically A is Y as before but with an l
q{l{q{l{X}}}} <- not valid because l{X} is not in the rule above, it must be of type l{Y} and Y always starts with a q{. The rule actually fails at the inner most l{l{X}} because the l{X} is not matched(no rule)

q{l{q{$l{X}}}} is valid, This is the rules applied in this form Y->A->B->Y.


You can see, with the proper rules we can generate just about any form we want.

The things to keep track of is that:

X must never contain tokens/matches that allow the other non-terminals to fail, unless that is the goal and X must be a terminal. In the above example, if X could contain l, {, }, $, and/or % then we have no way of knowing if our tokens are really X or Y. e.g., with q{l{q{l{q{X}}}}}, X could just be l{q{l{q{X}}}} and we would terminate immediately. This would make our ability to parse the nesting impossible. We can say that X must be "disjoint" from Y for Y the grammar to function as intended.

Second, you must list your rules in the short circuit order so that the terminate is always checked first. This is because the Y rules may actually contain X(they do in fact, and must if it is recursive).

Hence, if we check the Y rules first, we will end up in a cycle because we will never be able to determine that the Y rule is really an X and X is what allows us to terminate the cycle in the first place.

It is not really difficult as it seems. It is just a way of thinking and expressing things that makes the difference. Remember that a grammar is effectively just a giant "if statement machine" and so similar logic holds. Just as we can do invalid things with if statements we can do similar things with gammars.

The grammars do exactly as we tell them. Maybe we want infinite cycles! Maybe want them to terminate only after 150 loops, e.g.,

Y = Y_(n<150) | X

which could be read to use the rule Y for 150 times and then switch to rule X. (this is just a sort of spinwait in the grammar). e.g., if our grammar was expressing how enemies think in video game, this could be seen as a sort of pause in their thinking.

If could be used to draw a fractal where we could have several parallel grammars workings at the same time and this burns up cycles while other grammars do things.

It is a contrived example, of course...

For programming languages, we do not want complexities that we can't wrap our mind around(since programming languages are the basis for building complex programs it would make the complexity too difficult to handle in the long run).

The rule to solve these problems is to use the short-circuit concept as then you can list your rules in order from least complex to most. The least complex, if a terminal, when then always allow your rules to terminate if it is properly matched.


So your job is to construct your terminating string literal by knowing what makes the tokens not terminate... then simply make sure they can't be in your terminating string literal.

Second, construct your non-terminating rules to be how you want the nesting to be.

This is a general procedure. Any "sequence" of things have these properties. A literal in a grammar is simply a terminal rule that allows all the other rules to terminate at some point. Without them, we can't build a structure than can terminate.

e.g., lexing an integer

integer = Digit | (Digit & (Digit | (...) = Digit*

Is a recursive rule. It's just that the recursion as 0 depth in some sense or that we use meta rules to help make the rule succinct(e.g., the *). Digit, a terminal, terminates the recursion.

We could write it as

integer = Digit | DigitSeq
DigitSeq = Digit | DigitSeq

or

integer = Digit | DigitSeq
DigitSeq = Digit | integer


or

integer = Digit | integer

(which, in this case is simple because any integer can be seen as two integers concatenated, except for a single digit)

but, of course

integer = integer | Digit

makes no sense because it is effectively an infinite loop(we have to try to match some terminal to to get somewhere)

Obviously things can get tricky, but programming languages are designed so that ambiguities and complexity is not desirable. Programmers are in the business of programming to get a job done. If they burn needless cycles(like the spin loop) for no benefit, then they are less efficient at their job.

Hopefully that helps some.







































Reply via email to