Nick Sabalausky wrote:
"Walter Bright" <newshou...@digitalmars.com> wrote in message
news:ia59si$1r0...@digitalmars.com...
Consider a string literal, say "abc\"def". With Goldie's method, I infer
this string has to be scanned twice: once to find its limits, and a second
time to convert it to the actual string.
Yea, that is true. With that string in the input, the value given to the
user code will be:
assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"});
That's a consequence of the grammar being separated from the lexing/parsing
implementation.
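So in practice the user code ends up doing that second scan itself, with
something roughly like this (a hypothetical helper just for illustration;
'unescapeStringLiteral' isn't anything Goldie provides):

import std.array : appender;

// Hypothetical second pass over the raw lexeme that Goldie hands back:
// strip the quotes and decode the escapes into the actual string value.
string unescapeStringLiteral(string rawLexeme)
{
    auto inner = rawLexeme[1 .. $-1];  // drop the surrounding quotes
    auto result = appender!string();
    for (size_t i = 0; i < inner.length; i++)
    {
        if (inner[i] == '\\' && i + 1 < inner.length)
        {
            i++;
            switch (inner[i])
            {
                case 'n':  result.put('\n'); break;
                case 't':  result.put('\t'); break;
                default:   result.put(inner[i]); break;  // \" , \\ , etc.
            }
        }
        else
            result.put(inner[i]);
    }
    return result.data;
}

unittest
{
    assert(unescapeStringLiteral(q{"abc\"def"}) == `abc"def`);
}

In other words, every character of the literal gets walked a second time
after the DFA has already passed over it.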
You're right that that does seem less than ideal, although I'm not sure how
to remedy it without losing the independence between the grammar and the
lexing/parsing implementation that is the main point of the GOLD-based style.
But there's something I don't quite understand about the approach you're
suggesting: it sounds like a terminal gets progressively converted into its
final form *as* it's still in the process of being recognized by the DFA.
Which means you don't know *what* you're supposed to be converting it into
*while* you're converting it. Which means you have to speculatively convert
it into every type of token the current DFA state could possibly be on its
way towards accepting (and the DFA would also need a record of the possible
terminals for each state), then throw the result away if it turns out to be
a different terminal. Is this correct? If so, is there generally enough
lexical difference between the terminals that need such treatment to make up
for the extra processing in situations closer to the worst case (that is,
compared to Goldie's current approach)?
Probably that's why I don't use lexer generators. Building lexers is the
simplest part of building a compiler, and I've always been motivated by trying
to make it as fast as possible.
To specifically answer your question, yes, in the lexers I make, you know you're
parsing a string, so you process it as you parse it.
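For example, a hand-written scanner in that style might look roughly like
the following. This is only a sketch with made-up names ('Tok',
'lexStringLiteral'), not dmd's actual code:

import std.array : appender;

enum TokType { stringLiteral /* , ... */ }

struct Tok
{
    TokType type;
    string value;   // already unescaped
}

// The unescaped value is built while the characters are being consumed,
// so the source is only walked once.
Tok lexStringLiteral(ref const(char)[] src)
{
    assert(src.length && src[0] == '"');
    src = src[1 .. $];                     // skip the opening quote

    auto value = appender!string();
    while (src.length && src[0] != '"')
    {
        if (src[0] == '\\' && src.length > 1)
        {
            switch (src[1])
            {
                case 'n':  value.put('\n'); break;
                case 't':  value.put('\t'); break;
                default:   value.put(src[1]); break;
            }
            src = src[2 .. $];
        }
        else
        {
            value.put(src[0]);
            src = src[1 .. $];
        }
    }
    if (src.length)
        src = src[1 .. $];                 // skip the closing quote

    return Tok(TokType.stringLiteral, value.data);
}

The token's value comes out already unescaped, so nothing ever has to walk
the literal again.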
If all of that is so, then what would be your thoughts on this approach?
Suppose Goldie had a way to associate an optional "simultaneous/lockstep
conversion" with a type of terminal. For instance:

myLanguage.associateConversion("StringLiteral", new StringLiteralConverter());
Then, 'StringLiteralConverter' would be something that could be either
user-provided or offered by Goldie (both ways would be supported). It would
be some sort of class or something that had three basic functions:
class StringLiteralConverter : ITerminalConverter
{
    void process(dchar c) {...}

    // Or maybe this, to make it possible to minimize allocations
    // in certain circumstances by utilizing slices:
    void process(dchar c, size_t indexIntoSource, string fullOriginalSource) {...}

    Variant emit() {...}
    void clear() {...}
}
Each state in the lexer's DFA would know which terminals it could possibly
be processing. For each of those terminals that has an associated converter,
the lexer would call 'process()'. If a terminal is accepted, 'emit()' is
called to get the final result (and maybe do any needed finalization first),
and then 'clear()' is called on all converters that had been used.
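For what it's worth, a string-literal converter under that scheme might
come out something like this (purely hypothetical, since neither
'ITerminalConverter' nor 'associateConversion' exists in Goldie at the
moment; it only assumes the interface sketched above):

import std.variant : Variant;

// Mirrors the interface sketched above (hypothetical, not part of Goldie).
interface ITerminalConverter
{
    void process(dchar c);
    Variant emit();
    void clear();
}

class StringLiteralConverter : ITerminalConverter
{
    private string value;
    private bool escaping = false;
    private bool sawOpeningQuote = false;

    // Called by the lexer for each character the DFA consumes while this
    // terminal is still a candidate; escapes are decoded as they arrive.
    void process(dchar c)
    {
        if (!sawOpeningQuote)
        {
            sawOpeningQuote = true;     // skip the opening quote
            return;
        }
        if (escaping)
        {
            switch (c)
            {
                case 'n': value ~= '\n'; break;
                case 't': value ~= '\t'; break;
                default:  value ~= c;    break;  // \" , \\ , etc.
            }
            escaping = false;
        }
        else if (c == '\\')
            escaping = true;
        else if (c != '"')              // closing quote isn't part of the value
            value ~= c;
    }

    // Called when the terminal is accepted.
    Variant emit() { return Variant(value); }

    // Called after emit(), or when this candidate is discarded.
    void clear()
    {
        value = null;
        escaping = false;
        sawOpeningQuote = false;
    }
}

The point being that each character would only get visited once, with the
cost shifted to running converters for candidates that end up rejected.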
This feature would preclude the use of the actual "GOLD Parser Builder"
program, but since I'm writing a tool to handle that functionality anyway,
I'm not too concerned about that.
Do you think that would work? Would its benefits be killed by the overhead
introduced? If so, could that overhead be sufficiently reduced without
scrapping the general idea?
I don't know. I'd have to study the issue for a while. I suggest taking a look
at dmd's lexer and comparing. I'm not sure what Spirit's approach to this is.
What Goldie will be compared against is Spirit. Spirit is a reasonably
successful add-on to C++. Goldie doesn't have to do things the same way as
Spirit (expression templates - ugh), but it should be as easy to use and
at least as powerful.
Understood.