On 08-Oct-2013 04:16, Andrei Alexandrescu wrote:
On 10/4/13 5:24 PM, Andrei Alexandrescu wrote:
To put my money where my mouth is, I have a proof-of-concept tokenizer
for C++ in working state.

http://dpaste.dzfl.pl/d07dd46d

It contains some rather unsavory bits (I'm sure a ctRegex would be nicer
for parsing numbers etc), but it works on a lot of code just swell.

No. ctRegex as it stands right now is too generic and conservative in the code it generates, so "\d+" would:
a) use the full Unicode definition of "Number";
b) keep track of positions to return to, as if there were ambiguity about how many characters '\d+' may eat (no maximal munch etc.). The reason is that said '\d+' could sit in the middle of a larger pattern (e.g. 9\d+0), and the fact that it is unambiguous on its own is not exploited.

Both are quite suboptimal, and there is a long road I am going to take toward a _general_ solution for both points. One day we will reach that goal. For the moment, just hack your way through if the pattern is that simple.
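To make the two points concrete, here is a minimal sketch (in Python for brevity; the real code would be D) contrasting a Unicode-aware "\d" with a hand-rolled ASCII-only maximal-munch scanner of the kind a tokenizer actually wants. The function name is hypothetical, not from the paste.

```python
import re

def scan_ascii_digits(text, pos):
    """Maximal munch: consume as many ASCII digits as possible
    starting at `pos`; return (lexeme, new_pos) or (None, pos)."""
    start = pos
    while pos < len(text) and '0' <= text[pos] <= '9':
        pos += 1
    return (text[start:pos], pos) if pos > start else (None, start)

# A Unicode-aware regex engine lets \d match, e.g., Arabic-Indic
# digits -- something a C++ tokenizer never wants.
assert re.match(r"\d+", "١٢٣") is not None
assert scan_ascii_digits("١٢٣", 0) == (None, 0)

# The hand-rolled scanner eats the whole run in one pass, with no
# bookkeeping for backtracking.
assert scan_ascii_digits("123abc", 0) == ("123", 3)
```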

Most importantly, there's a clear distinction between the generic core
and the C++-specific part. It should be obvious how to use the generic
matcher for defining a D tokenizer.



Token representation is minimalistic and expressive. Just write tk!"<<"
for left shift, tk!"int" for int etc. Typos will be detected during
compilation. One does NOT need to define and use TK_LEFTSHIFT or TK_INT;
all needed by the generic tokenizer is the list of tokens. In return, it
offers an efficient trie-based matcher for all tokens.
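The idea of a trie-based matcher over the raw token list can be sketched as follows (Python for illustration; the actual generated matcher in the paste is D, and these helper names are hypothetical). Walking the trie while remembering the last completed token gives longest-match behavior, so "<<=" wins over "<<" and "<".

```python
def build_trie(tokens):
    """Build a nested-dict trie; each node maps char -> child node,
    and the key "" marks that the path so far spells a full token."""
    root = {}
    for tok in tokens:
        node = root
        for ch in tok:
            node = node.setdefault(ch, {})
        node[""] = tok
    return root

def longest_match(trie, text, pos):
    """Walk the trie from `pos`, remembering the last complete token
    seen (maximal munch). Returns (token, new_pos) or (None, pos)."""
    node, best, i = trie, (None, pos), pos
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
        if "" in node:
            best = (node[""], i)
    return best

ops = build_trie(["<", "<<", "<<=", "int", "if"])
assert longest_match(ops, "<<=1", 0) == ("<<=", 3)
assert longest_match(ops, "<1", 0) == ("<", 1)
```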

(Keyword matching is unusual in that keywords are first found by the
trie matcher, and then a simple check figures whether more characters
follow, e.g. "if" vs. "iffy".

+1

Given that many tokenizers use a hashtable
anyway to look up all symbols, there's no net loss of speed with this
approach.)

Yup. The only benefit is a slimmer giant switch.
Another "hybrid" option is, instead of a hash table, to use a generated keyword trie searcher as a separate function, and then test each identifier against it. This is what std.d.lexer does, and it is quite fast (awaiting the latest benchmarks).
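The "scan the identifier first, then classify it" step resolves the "if" vs. "iffy" case by construction. A minimal sketch (Python; a plain set stands in for the generated keyword trie or hash table, and the function name is hypothetical):

```python
KEYWORDS = {"if", "int", "return", "while"}  # illustrative subset

def next_identifier_or_keyword(text, pos):
    """Scan a maximal identifier starting at `pos`, then decide
    whether the whole lexeme is a keyword. Because the identifier
    is consumed to its end before classification, "iffy" can never
    be mistaken for the keyword "if" followed by "fy"."""
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
        pos += 1
    lexeme = text[start:pos]
    kind = "keyword" if lexeme in KEYWORDS else "identifier"
    return kind, lexeme, pos

assert next_identifier_or_keyword("if (x)", 0) == ("keyword", "if", 2)
assert next_identifier_or_keyword("iffy = 1", 0) == ("identifier", "iffy", 4)
```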

The lexer generator compiles fast and should run fast. If not, it should
be easy to improve at the matcher level.

Now, what I'm asking for is that std.d.lexer builds on this design
instead of the traditional one. At a slight delay, we get the proverbial
fishing rod IN ADDITION TO the equally proverbial fish, FOR FREE. It
is quite evident there's a bunch of code sharing going on already
between std.d.lexer and the proposed design, so it shouldn't be hard to
effect the adaptation.

Agreed. Let us take a moment to incorporate a better design.

So with this I'm leaving it all within the hands of the submitter and
the review manager. I didn't count the votes, but we may have a "yes"
majority built up. Since additional evidence has been introduced, I
suggest at least a revote. Ideally, there would be enough motivation for
Brian to suspend the review and integrate the proposed design within
std.d.lexer.


Andrei



--
Dmitry Olshansky
