Re: Writing a JFlex lexer for D - have an issue with cycles

Profile Anaysis via Digitalmars-d Mon, 23 Jan 2017 15:36:59 -0800

On Monday, 23 January 2017 at 01:46:58 UTC, FatalCatharsis wrote:

On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis wrote:
The real issue is ambiguity. Any time you have a cycle you must be able to get out of it, and so your rules must be organized so that one always checks to see if termination has occurred before checking for nesting. If you allow, say, } as an element of a string literal then it will be ambiguous, as the grammar will try to match it when it is meant to be a literal.
This brings up another question for me. Isn't the token string production ambiguous by itself?
q{ tokens }
How does it know whether the input "}" is another token or the terminator on a token string?

You know how an if statement in many languages has the feature of short-circuit evaluation?


if (X || Y)

X is tested first, and if it is true, there is no need to test Y.

Now let's say that X becomes true at some point but Y is "recursive" in some sense: while X is false, Y is evaluated, but once X becomes true there is no need for Y to be evaluated.

This "analogy" is basically what is happening in the grammar for nested rules.

As long as we have a terminating rule that is checked first and bypasses the recursive rules, we will terminate.
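To make the analogy concrete, here is a minimal Python sketch of short-circuit evaluation. The functions x and y and the calls list are made up for illustration; they just record which branch actually gets evaluated.

```python
calls = []

def x(ready):
    """Terminating check: cheap and never recurses (hypothetical helper)."""
    calls.append("X")
    return ready

def y():
    """Stand-in for the recursive branch (hypothetical helper)."""
    calls.append("Y")
    return True

# X is true: `or` short-circuits and Y is never evaluated.
x(True) or y()
first = list(calls)

calls.clear()

# X is false: Y has to be evaluated.
x(False) or y()
second = list(calls)
```

After running this, first only records X, while second records X and then Y.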

So, to understand this easily, we think about it this way for any type of nested thing.

1. Create the terminating instance (e.g., the X above) that has no recursive properties or non-terminating behavior.

2. Create the recursive instance (the Y).

Then effectively do the if (X || Y). This can generally be mapped directly into a grammar because the if statements are implicit in the rules (since it is a decision tree).

So, to "create X", which is simply a string literal: it has the structure q{...}.

This structure must be unambiguous for the ... part. This means that ... must not have q{ or } in it unless they are escaped in some way (which essentially just changes them into something else; e.g., \n is not \ and n but a completely different character as far as the grammar is concerned).

Now, that is the terminal rule. It can't recurse because ... will be parsed as "whatever", and that whatever can't, by design (we hope), be re-parsed as itself in any way.
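As a rough sketch of such a terminal, here is one possible definition of X as a Python regex. The exact character set is an assumption on my part: it forbids every unescaped { and } (slightly stricter than forbidding only q{ and }), and treats a backslash plus any one character as an escape. The name is_terminal_literal is made up for this example.

```python
import re

# Assumed definition of X: any character other than '{', '}' or a backslash,
# or a backslash followed by any single character (an escape).
X = re.compile(r"(?:[^{}\\]|\\.)*")

def is_terminal_literal(text):
    """True if text is q{X} with no unescaped braces inside."""
    if not (text.startswith("q{") and text.endswith("}")):
        return False
    return X.fullmatch(text[2:-1]) is not None
```

With this definition q{abc} is accepted, an escaped brace inside the body is accepted, and q{q{a}} is rejected because the body contains unescaped braces.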



Once we have that, we create the non-terminal recursive rule:

q{...} in which ... can be itself (or something else which is then this, or whatever).

You basically have this stuff already (you might have to make sure the first case actually is a terminal and can't recurse on itself).


Once you have that, it is easy to create a terminating grammar:

Y = q{X | Y}

First X is checked; if it is found then Y terminates, else we try Y.


Such a rule as the above can be expanded.

Y = q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | ... Y}

so, this only allows strings like

q{X}
q{q{X}}
q{q{q{X}}}
q{q{q{q{X}}}}
q{q{q{q{q{X}}}}}

etc...

But they always terminate as long as X terminates.
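Here is a small recursive-descent sketch of Y = q{X | Y} in Python, trying the terminal alternative first. The definition of X (one or more non-brace characters) and the names match_y and is_y are assumptions for illustration, not how any real D lexer does it.

```python
import re

# Assumed terminal: one or more characters that are not braces.
X = re.compile(r"[^{}]+")

def match_y(s, i=0):
    """Return the index just past a Y starting at s[i], or None."""
    if not s.startswith("q{", i):
        return None
    i += 2
    # Terminating alternative first: X followed by the closing brace.
    m = X.match(s, i)
    if m and s.startswith("}", m.end()):
        return m.end() + 1
    # Only then the recursive alternative: a nested Y, then the closing brace.
    j = match_y(s, i)
    if j is not None and s.startswith("}", j):
        return j + 1
    return None

def is_y(s):
    """True if the whole string is a single Y."""
    return match_y(s) == len(s)
```

Every recursive call has consumed a q{ first, so the recursion can go no deeper than the input is long.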

The compiler's logic is this:

Check if the tokens look like q{ then some form of X; if there is no X, then check if they match Y, which requires reapplying the whole check; then a }.


Suppose our rule is

Y = q{1X | 2Y}

q{1X}
q{2q{1X}}
q{2q{2q{1X}}}
q{2q{2q{2q{1X}}}}
q{2q{2q{2q{2q{1X}}}}}

would be the valid matches.
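The same list can be generated mechanically. This is just a sketch of unrolling the rule: the function name expand and its parameters are made up, and X is kept as the literal placeholder "X".

```python
def expand(depth, x="X"):
    """Build the string obtained by taking the recursive branch `depth` times."""
    s = "q{1" + x + "}"          # the terminating alternative, 1X
    for _ in range(depth):
        s = "q{2" + s + "}"      # wrap it in the recursive alternative, 2Y
    return s

matches = [expand(d) for d in range(5)]
```

matches then holds the five strings listed above, from q{1X} up to the one nested four levels deep.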


If we allow other alternatives like

Y = q{X | (A | B) }
A = l{Y}
B = ($|%)A

then things like

q{X}

q{l{q{X}}}
q{l{q{l{q{X}}}}} (basically A is Y as before but with an l)

q{l{q{l{X}}}} <- not valid because l{X} is not in the rule above; it must be of type l{Y}, and Y always starts with a q{. The rule actually fails at the innermost q{l{X}} because the l{X} is not matched (no rule).

q{l{q{$l{q{X}}}}} is valid. Here the rules are applied in the order Y -> A -> Y -> B -> A -> Y.

You can see, with the proper rules we can generate just about any form we want.
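If you want to check strings against the three rules Y, A and B, a direct recursive-descent transcription in Python could look like the sketch below. It assumes X is just the literal letter X, and the function names are made up; each function returns the index after what it matched, or None.

```python
def parse_y(s, i):
    """Y = q{ X | (A | B) }"""
    if not s.startswith("q{", i):
        return None
    i += 2
    if s.startswith("X", i):          # terminal alternative is tried first
        j = i + 1
    else:
        j = parse_a(s, i)
        if j is None:
            j = parse_b(s, i)
    if j is not None and s.startswith("}", j):
        return j + 1
    return None

def parse_a(s, i):
    """A = l{ Y }"""
    if not s.startswith("l{", i):
        return None
    j = parse_y(s, i + 2)
    if j is not None and s.startswith("}", j):
        return j + 1
    return None

def parse_b(s, i):
    """B = ($ | %) A"""
    if i < len(s) and s[i] in "$%":
        return parse_a(s, i + 1)
    return None

def accepts(s):
    """True if the whole string is a single Y."""
    return parse_y(s, 0) == len(s)
```

With this, accepts("q{l{q{X}}}") and accepts("q{$l{q{X}}}") hold, while accepts("q{l{q{l{X}}}}") does not, because l{X} is not an A.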


The things to keep track of are:

X must never contain tokens/matches that allow the other non-terminals to fail, unless that is the goal, and X must be a terminal. In the above example, if X could contain l, {, }, $, and/or %, then we would have no way of knowing if our tokens are really X or Y. E.g., with q{l{q{l{q{X}}}}}, X could just be l{q{l{q{X}}}} and we would terminate immediately. This would make it impossible to parse the nesting. We can say that X must be "disjoint" from Y for the grammar to function as intended.
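A quick way to see the disjointness problem is to measure nesting depth under the simpler rule Y = q{X | Y} with two different definitions of X. This is only a sketch: the depth function and both regex patterns are assumptions picked for the demonstration.

```python
import re

def depth(s, x_pattern):
    """Nesting depth of s under Y = q{X | Y}, trying the terminal X first."""
    if not (s.startswith("q{") and s.endswith("}")):
        raise ValueError("not a q{...} string")
    inner = s[2:-1]
    if re.fullmatch(x_pattern, inner):    # terminal matched: stop here
        return 1
    return 1 + depth(inner, x_pattern)    # otherwise recurse into the nesting

disjoint = depth("q{q{q{a}}}", r"[^{}]*")  # X excludes braces
greedy = depth("q{q{q{a}}}", r".*")        # X may contain anything
```

With the disjoint X the three levels are seen; with the anything-goes X the whole body is swallowed as a literal at the first level and the nesting is lost.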

Second, you must list your rules in short-circuit order so that the terminating rule is always checked first. This is because the Y rules may actually contain X (they do in fact, and must if the grammar is recursive).

Hence, if we check the Y rules first, we will end up in a cycle, because we will never be able to determine that the Y rule is really an X, and X is what allows us to terminate the cycle in the first place.

It is not really as difficult as it seems. It is just a way of thinking and expressing things that makes the difference. Remember that a grammar is effectively just a giant "if statement machine", and so similar logic holds. Just as we can do invalid things with if statements, we can do similar things with grammars.

The grammars do exactly as we tell them. Maybe we want infinite cycles! Maybe we want them to terminate only after 150 loops, e.g.,


Y = Y_(n<150) | X

which could be read as: use the rule Y 150 times and then switch to rule X (this is just a sort of spin-wait in the grammar). E.g., if our grammar was expressing how enemies think in a video game, this could be seen as a sort of pause in their thinking.
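Read as code, the counted rule is just a guarded recursion. A toy Python sketch, where the counter n, the limit parameter and the function names are all made up for illustration:

```python
def y(n=0, limit=150):
    """Y = Y_(n<150) | X: re-enter Y until the counter reaches the limit."""
    if n < limit:
        return y(n + 1, limit)   # the "spin": take the Y branch again
    return x(n)                  # counter exhausted: switch to the terminal

def x(n):
    """Terminal: report how many times the Y branch was taken."""
    return n
```

Calling y() takes the Y branch 150 times before falling through to x.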

It could be used to draw a fractal, where we could have several parallel grammars working at the same time and this one burns up cycles while the other grammars do things.


It is a contrived example, of course...

For programming languages, we do not want complexities that we can't wrap our minds around (since programming languages are the basis for building complex programs, it would make the complexity too difficult to handle in the long run).

The rule for solving these problems is to use the short-circuit concept, as then you can list your rules in order from least complex to most. The least complex, if a terminal, will then always allow your rules to terminate if it is properly matched.

So your job is to construct your terminating string literal by knowing what makes the tokens not terminate... then simply make sure those can't be in your terminating string literal.

Second, construct your non-terminating rules to be how you want the nesting to be.

This is a general procedure. Any "sequence" of things has these properties. A literal in a grammar is simply a terminal rule that allows all the other rules to terminate at some point. Without them, we can't build a structure that can terminate.


e.g., lexing an integer

integer = Digit | (Digit & (Digit | (Digit & (...)))) = Digit+

is a recursive rule. It's just that the recursion has 0 depth in some sense, or that we use meta rules to help make the rule succinct (e.g., the +). Digit, a terminal, terminates the recursion.


We could write it as

integer = Digit | DigitSeq
DigitSeq = Digit | DigitSeq

or

integer = Digit | DigitSeq
DigitSeq = Digit | integer


or

integer = Digit | integer

(which, in this case, is simple because any integer can be seen as two integers concatenated, except for a single digit)
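One way to read these rules in code, assuming every recursive step first consumes a Digit (the concatenation written with & earlier), is the Python sketch below. The names match_integer and lex_integer are made up, and Digit is assumed to mean the ASCII digits 0-9.

```python
import re

DIGITS = "0123456789"  # assumed meaning of the Digit terminal

def match_integer(s, i=0):
    """integer = Digit | (Digit & integer): return the end index, or None."""
    if i >= len(s) or s[i] not in DIGITS:
        return None                    # no Digit here, so the rule fails
    rest = match_integer(s, i + 1)     # a Digit was consumed, so we made progress
    return i + 1 if rest is None else rest

def lex_integer(s):
    """Return the leading integer lexeme of s, or None if there is none."""
    end = match_integer(s)
    return None if end is None else s[:end]

# The succinct meta-rule form of the same thing.
same = re.match(r"[0-9]+", "123abc").group() == lex_integer("123abc")
```

Each call consumes exactly one Digit before recursing, which is why the recursion is bounded by the length of the input.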


but, of course

integer = integer | Digit

makes no sense because it is effectively an infinite loop (we have to try to match some terminal to get somewhere).
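You can watch this happen with a naive recursive-descent reading of the rule. In this Python sketch (the function name is made up, and Digit is assumed to be 0-9) the recursive alternative is tried first at the same input position, so nothing is ever consumed and Python eventually gives up with a RecursionError.

```python
def left_recursive_integer(s, i=0):
    """integer = integer | Digit, with the recursive alternative tried first."""
    end = left_recursive_integer(s, i)   # same position, nothing consumed yet
    if end is not None:
        return end
    # Never reached: the terminal alternative comes too late.
    return i + 1 if i < len(s) and s[i] in "0123456789" else None

try:
    left_recursive_integer("42")
    looped = False
except RecursionError:
    looped = True
```

Swapping the order so that a Digit is consumed before any recursion is what breaks the cycle.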

Obviously things can get tricky, but programming languages are designed so that ambiguities and complexity are avoided. Programmers are in the business of programming to get a job done. If they burn needless cycles (like the spin loop) for no benefit, then they are less efficient at their job.


Hopefully that helps some.
