On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
I'm writing a flex lexer for D and I've hit a roadblock. It is almost working EXCEPT for one specific production.

StringLiteral is cyclic and I don't know how to approach it. It is cyclic because:

     Token -> StringLiteral -> TokenString -> Token

To break the cycle, I was thinking I could just make a production which is Token sans StringLiteral and instead subbed with a production for StringLiteral that does not contain TokenString, but that fundamentally changes the language. Should the lexer really handle something like:

    q{blah1q{20q{"meh"q{20.1q{blah}}}}}

Lexically I don't know how this makes sense. To be clear, I'm wondering if this is acceptable:

    Token:
        Identifier
        StringLiteral
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteral:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
        TokenString

     TokenString:
        q{ TokenNonNestedTokenStrings }


     TokenNonNestedTokenStrings:
        TokenNonNestedTokenString
        TokenNonNestedTokenString TokenNonNestedTokenStrings

     TokenNonNestedTokenString:
        Identifier
        StringLiteralNonNestedTokenString
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteralNonNestedTokenString:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString

Which basically disables nested token strings. Has anyone else run into this issue?

This is not really any different than any nesting problem.

First, you must realize that it is more like a "spiral" than a circle. It is recursive in this sense:

      Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)


and eventually Token_(n-1) must terminate the chain(essentially the rule must fail for some n).

But this is ok because parsers are recursive in nature, so you just have to have your rule be able to terminate in a logical way.

What we can say about q{} string literal is that either contains other string literals or it doesn't.

If it doesn't, that is all that is required because a nested set of string literals will have to contain at least one that is not nested. That collapses the whole chain.


So, the way I see it is that you really need your TokenString to have a nested and non-nested version. The nested version will cycle but also as an out.

e.g.,
    Token:
        Identifier
        StringLiteral
        CharacterLiteral
        IntegerLiteral
        FloatLiteral
        Keyword
        Operator

     StringLiteral:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
        TokenString

     TokenString:
        q{StringLiteralNonNestedTokenString}  <- Terminal
          q{NestedTokenString}


     NestedTokenString:
          Tokens
          TokenString
          Tokens

     StringLiteralNonNestedTokenString:
        WysiwygString
        AlternateWysiwygString
        DoubleQuotedString
        HexString
        DelimitedString
          !q{
          !}





So, a TokenString can be a a "simple string" that doesn't nest or it can have nesting. That nesting can go on for every if the source has it, and this grammar can handle it.

But If a tokenstring, regardless if it is inside another token string, terminates/is not nesting another token string, then you are fine too because that catches the eventual termination and will back propagate through the nesting to determine the other tokens.

The real issue is ambiguity. Any time you have a cycle you must be able to get out of it and so your rules must be organized so that one always checks to see if termination has occurred before checking for nesting. If you allow, say } as an element of a stringliteral then it will be ambiguous as the grammar will try to match it, when it is meant to be a literal.

I'm not 100% sure about your grammar expression but basically


q{blah1 <- q{NestedTokenString}/nonterminal
       q{20 <- q{NestedTokenString}/nonterminal
           q{"meh" <- q{NestedTokenString}/nonterminal
                  q{20.1 <- q{NestedTokenString}/nonterminal
q{blah} <- q{StringLiteralNonNestedTokenString}/terminal
                    }
             }
        }
 }

Basically as we parse the string, we encounter q{'s and this cause a "cycle"(but it really is a spiral ;).

When we get to the final one we see that it is a terminal because it doesn't contain any q{ inside.

Hence at that point the grammar unwinds and builds the tree quite easily.

Again, it is really no different than ()'s and such.


One just has to make sure that, say, ( and ) not sued for anything else in an ambiguous way.

e.g., (3+)) means what? (suppose ) also was the same as 4. so someone things (3+4) = 7 but the compiler cannot distinguish between them.

(it could be made to in some ways but the grammar is then not context free)

You shouldn't have to worry too much about those cases but you do have to make sure your grammar understands that the tokens it uses to determine a rule cannot be interpreted in any other way along that parsed branch. Else you end up with an ambiguous or context sensitive grammar.




Reply via email to