Re: Writing a JFlex lexer for D - have an issue with cycles

Profile Anaysis via Digitalmars-d Sun, 22 Jan 2017 16:52:19 -0800

On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:

I'm writing a flex lexer for D and I've hit a roadblock. It is almost working EXCEPT for one specific production.
StringLiteral is cyclic and I don't know how to approach it. It is cyclic because:
     Token -> StringLiteral -> TokenString -> Token
To break the cycle, I was thinking I could just make a production which is Token sans StringLiteral and instead subbed with a production for StringLiteral that does not contain TokenString, but that fundamentally changes the language. Should the lexer really handle something like:
    q{blah1q{20q{"meh"q{20.1q{blah}}}}}
Lexically I don't know how this makes sense. To be clear, I'm wondering if this is acceptable:
    Token:
        Identifier
        StringLiteral
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteral:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
        TokenString

     TokenString:
        q{ TokenNonNestedTokenStrings }


     TokenNonNestedTokenStrings:
        TokenNonNestedTokenString
        TokenNonNestedTokenString TokenNonNestedTokenStrings

     TokenNonNestedTokenString:
        Identifier
        StringLiteralNonNestedTokenString
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteralNonNestedTokenString:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
Which basically disables nested token strings. Has anyone else run into this issue?


This is not really any different from any other nesting problem.

First, you must realize that it is more like a "spiral" than a circle. It is recursive in this sense:


      Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)

and eventually Token_(n-1) must terminate the chain (essentially, the rule must fail for some n).

But this is OK because parsers are recursive in nature, so you just have to make sure your rule can terminate in a logical way.

What we can say about a q{} string literal is that it either contains other q{} string literals or it doesn't.

If it doesn't, that is all that is required, because a nested set of string literals will have to contain at least one that is not nested. That collapses the whole chain.

So, the way I see it, you really need your TokenString to have a nested and a non-nested version. The nested version will cycle but also has an out.


e.g.,

    Token:
        Identifier
        StringLiteral
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteral:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
        TokenString

     TokenString:
        q{StringLiteralNonNestedTokenString}  <- Terminal
        q{NestedTokenString}

     NestedTokenString:
        Tokens
        TokenString
        Tokens


     StringLiteralNonNestedTokenString:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString

          !q{
          !}

So, a TokenString can be a "simple string" that doesn't nest, or it can have nesting. That nesting can go on forever if the source has it, and this grammar can handle it.

But if a token string, regardless of whether it is inside another token string, terminates (that is, it does not nest another token string), then you are fine too, because that catches the eventual termination and will propagate back through the nesting to determine the other tokens.

The real issue is ambiguity. Any time you have a cycle you must be able to get out of it, so your rules must be organized so that one always checks whether termination has occurred before checking for nesting. If you allow, say, } as an element of a string literal, then it will be ambiguous, as the grammar will try to match it as the closing brace when it is meant to be a literal.
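
To make the } problem concrete, here is a minimal sketch in Python (not JFlex, just to keep it short) of a scanner that finds the end of a q{...}. The name token_string_end is made up for illustration, and it assumes that only double-quoted strings with backslash escapes can hide a brace; the other D string forms, comments, and character literals are left out. The point is that a string literal is consumed whole, so a } inside quotes is never seen by the brace checks.

```python
def token_string_end(src: str, start: int = 0) -> int:
    """Return the index just past the '}' that closes the q{ at src[start:].

    Hypothetical helper: handles only "..." strings and bare braces.
    """
    if not src.startswith("q{", start):
        raise ValueError("expected q{ at position %d" % start)
    depth = 1          # the opening q{ counts as one level
    i = start + 2
    while i < len(src):
        ch = src[i]
        if ch == '"':
            # Consume the whole string literal so braces inside it are ignored.
            i += 1
            while i < len(src) and src[i] != '"':
                i += 2 if src[i] == "\\" else 1   # skip escaped characters
            if i >= len(src):
                raise ValueError("unterminated string literal")
            i += 1     # step past the closing quote
        elif ch == "{":
            depth += 1  # nested q{ or plain {
            i += 1
        elif ch == "}":
            depth -= 1
            i += 1
            if depth == 0:
                return i  # termination found
        else:
            i += 1
    raise ValueError("unterminated token string")


src = 'q{ "}" } rest'
print(src[:token_string_end(src)])
```

Here q{ "}" } is read as one token string. Drop the string-literal branch and the scan would stop at the quoted brace instead.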


I'm not 100% sure about your grammar expression but basically


q{blah1 <- q{NestedTokenString}/nonterminal
       q{20 <- q{NestedTokenString}/nonterminal
           q{"meh" <- q{NestedTokenString}/nonterminal
                  q{20.1 <- q{NestedTokenString}/nonterminal
                        q{blah} <- q{StringLiteralNonNestedTokenString}/terminal

Basically, as we parse the string, we encounter q{'s and this causes a "cycle" (but it really is a spiral ;).

When we get to the final one, we see that it is a terminal because it doesn't contain any q{ inside.

Hence at that point the grammar unwinds and builds the tree quite easily.
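
The unwinding can be sketched as a small recursive function. This is an illustration in Python rather than JFlex, and parse_token_string and depth are hypothetical names. It assumes every q{ opens a token string and every } closes one, so it ignores plain braces, string literals that contain braces, and maximal munch (a real lexer would read blah1q as one identifier). Each call handles one q{...}: it recurses when it meets another q{ and returns when it meets }, so the innermost, non-nested literal is the first call to finish.

```python
def parse_token_string(src, pos=0):
    """Parse the q{...} at src[pos:].

    Returns (node, end) where node is a list of plain-text runs (str) and
    nested token strings (list), and end is the index past the closing '}'.
    """
    if not src.startswith("q{", pos):
        raise ValueError("expected q{ at position %d" % pos)
    i = pos + 2
    node = []
    text_start = i
    while i < len(src):
        if src.startswith("q{", i):
            # Nesting: save any pending text, then recurse one level down.
            if i > text_start:
                node.append(src[text_start:i])
            child, i = parse_token_string(src, i)
            node.append(child)
            text_start = i
        elif src[i] == "}":
            # Termination: this level is done, unwind to the caller.
            if i > text_start:
                node.append(src[text_start:i])
            return node, i + 1
        else:
            i += 1
    raise ValueError("unterminated token string")


def depth(node):
    """Number of nested q{ levels in a parsed node."""
    return 1 + max((depth(c) for c in node if isinstance(c, list)), default=0)


tree, end = parse_token_string('q{blah1q{20q{"meh"q{20.1q{blah}}}}}')
print(tree)         # ['blah1', ['20', ['"meh"', ['20.1', ['blah']]]]]
print(depth(tree))  # 5
```

The recursion depth equals the nesting depth in the source, which is the "spiral": each level is a smaller problem than the one above it, and the level with no q{ inside ends the chain.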


Again, it is really no different than ()'s and such.

One just has to make sure that, say, ( and ) are not used for anything else in an ambiguous way.

e.g., what does (3+)) mean? Suppose ) also meant the same as 4. Someone might think (3+4) = 7, but the compiler cannot distinguish between the two readings.

(It could be made to in some ways, but the grammar is then not context-free.)

You shouldn't have to worry too much about those cases, but you do have to make sure your grammar understands that the tokens it uses to determine a rule cannot be interpreted in any other way along that parsed branch. Otherwise you end up with an ambiguous or context-sensitive grammar.
