Nick Sabalausky wrote:
"Walter Bright" <newshou...@digitalmars.com> wrote in message news:ia59si$1r0...@digitalmars.com...
Consider a string literal, say "abc\"def". With Goldie's method, I infer this string has to be scanned twice. Once to find its limits, and the second to convert it to the actual string.

Yea, that is true. With that string in the input, the value given to the user code will be:

assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"});

That's a consequence of the grammar being separated from the lexing/parsing implementation.
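So in practice, the user code has to make a second pass over that slice itself, along these lines (just a sketch; 'decodeStringLiteral' is a made-up helper, not anything Goldie provides, and it only handles backslash escapes):

    // Hypothetical second pass over the raw token text: strip the
    // surrounding quotes and resolve backslash escapes (\" and \\ only).
    string decodeStringLiteral(string raw)
    {
        assert(raw.length >= 2 && raw[0] == '"' && raw[$ - 1] == '"');
        string result;
        bool escaped = false;
        foreach (ch; raw[1 .. $ - 1])
        {
            if (escaped)         { result ~= ch; escaped = false; }
            else if (ch == '\\') escaped = true;
            else                 result ~= ch;
        }
        return result;
    }

    unittest
    {
        assert(decodeStringLiteral(q{"abc\"def"}) == `abc"def`);
    }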

You're right that that does seem less than ideal. Although I'm not sure how to remedy that without losing the independence between grammar and lex/parse implementation that is the main point of the GOLD-based style.

But there's something I don't quite understand about the approach you're suggesting: You seem to be saying that a terminal gets progressively converted into its final form *as* it's still in the process of being recognized by the DFA. Which means you don't know *what* you're supposed to be converting it into *while* you're converting it. Which means you have to speculatively convert it into every type of token the current DFA state could possibly be on its way towards accepting (and the DFA would also need to record, for each state, which terminals are still possible), and then throw the result away if it turns out to be a different terminal. Is this correct? If so, is there generally enough lexical difference between the terminals that need this treatment for the benefit to outweigh the extra processing in the closer-to-worst-case situations (that is, compared to Goldie's current approach)?

Probably that's why I don't use lexer generators. Building lexers is the simplest part of building a compiler, and I've always been motivated by trying to make it as fast as possible.

To specifically answer your question, yes, in the lexers I make, you know you're parsing a string, so you process it as you parse it.
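The general shape is something like this (a simplified sketch for illustration, not dmd's actual lexer; the escaped character is just taken literally here rather than decoded):

    // Once the opening quote is seen, we know it's a string literal, so the
    // decoded value is built in the same pass that finds the literal's end.
    string lexString(string src, ref size_t p)
    {
        assert(src[p] == '"');
        ++p;                              // skip the opening quote
        string value;
        while (p < src.length && src[p] != '"')
        {
            if (src[p] == '\\' && p + 1 < src.length)
                ++p;                      // take the char after '\' literally
                                          // (a real lexer decodes \n, \t, etc. here)
            value ~= src[p];
            ++p;
        }
        ++p;                              // skip the closing quote
        return value;
    }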



If all of that is so, then what would be your thoughts on this approach?:

Suppose Goldie had a way to associate an optional "simultaneous/lockstep conversion" with a type of terminal. For instance:

myLanguage.associateConversion("StringLiteral", new StringLiteralConverter());

Then, 'StringLiteralConverter' would be something that could be either user-provided or offered by Goldie (both ways would be supported). It would be some sort of class or something that had three basic functions:

import std.variant : Variant; // for emit()'s return type

class StringLiteralConverter : ITerminalConverter
{
    // Fed one character at a time while the DFA is consuming input:
    void process(dchar c) {...}

    // Or maybe this, to make it possible to minimize allocations
    // in certain circumstances by utilizing slices:
    void process(dchar c, size_t indexIntoSource, string fullOriginalSource) {...}

    // Called when the terminal is accepted, to get the finished value:
    Variant emit() {...}

    // Called afterwards to reset the converter for the next token:
    void clear() {...}
}

Each state in the lexer's DFA would know which terminals it could possibly be processing. For each of those terminals that has an associated converter, the lexer would call 'process()'. When a terminal is accepted, 'emit()' would be called to get the final result (and maybe do any needed finalization first), and then 'clear()' would be called on all converters that had been used.
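In rough code, I picture the lexer's inner machinery doing something like this (every name here, DfaState, possibleTerminals, and so on, is a placeholder for whatever the generated tables would actually expose, not real Goldie API; ITerminalConverter is the interface sketched above):

    import std.variant : Variant;

    // Placeholder for whatever the generated DFA tables would expose:
    struct DfaState
    {
        int[] possibleTerminals;  // terminals still reachable from this state
        bool  accepts;
        int   acceptedTerminal;
    }

    // Feed one character to every converter whose terminal is still a
    // possible match in the current state, and remember which ones ran:
    void feedConverters(DfaState state, dchar c,
                        ITerminalConverter[int] converters,
                        ref bool[int] usedTerminals)
    {
        foreach (id; state.possibleTerminals)
            if (auto conv = id in converters)
            {
                conv.process(c);
                usedTerminals[id] = true;
            }
    }

    // On acceptance: emit from the accepted terminal's converter (if any),
    // then clear every converter that was fed characters along the way:
    Variant acceptTerminal(DfaState state,
                           ITerminalConverter[int] converters,
                           ref bool[int] usedTerminals)
    {
        Variant value;
        if (auto conv = state.acceptedTerminal in converters)
            value = conv.emit();
        foreach (id; usedTerminals.byKey)
            converters[id].clear();
        usedTerminals = null;
        return value;
    }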

This feature would preclude the use of the actual "GOLD Parser Builder" program, but since I'm writing a tool to handle that functionality anyway, I'm not too concerned about that.

Do you think that would work? Would its benefits be killed by the overhead introduced? If so, could those overheads be sufficiently reduced without scrapping the general idea?

I don't know. I'd have to study the issue for a while. I suggest taking a look at dmd's lexer and compare. I'm not sure what Spirit's approach to this is.


What Goldie will be compared against is Spirit. Spirit is a reasonably successful add-on to C++. Goldie doesn't have to do things the same way as Spirit (expression templates - ugh), but it should be as easy to use and at least as powerful.


Understood.
