On 01-Aug-12 02:01, Philippe Sigaud wrote:
On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky
<dmitry.o...@gmail.com> wrote:

I guess creating a tree of symbol tables according to scope visibility
is then more the job of the parser, but I'm not sure.

The parser can use constant IDs for nested tables; the IDs point into a string table.
The string table is populated by the lexer.
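A minimal sketch of how that might fit together (hypothetical names, just to illustrate the idea):

// Hypothetical sketch: the lexer interns identifier strings into a string
// table; the parser then only stores the resulting constant IDs in its
// tree of nested scopes.
struct StringTable
{
    string[] strings;   // ID -> string
    uint[string] ids;   // string -> ID

    uint intern(string s)
    {
        if (auto p = s in ids)
            return *p;
        auto id = cast(uint) strings.length;
        strings ~= s;
        ids[s] = id;
        return id;
    }
}

struct Scope
{
    Scope* parent;       // enclosing scope, null at module level
    bool[uint] symbols;  // interned IDs declared in this scope
}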

The latter, I get. The former, not so much.


I've only glanced at your code. For most languages lexing is far more
expensive than parsing.


Is that so?


It usually is. Unless the parser is overly generic, as GLL/GLR are (not to claim
that they're very slow, but GLL/GLR are slower for obvious reasons).

I'd have thought that, seeing as 'the end result' is more complicated
(richer, if I may say so) for parsing, parsing would be more expensive.
I'm reading on GLR, and was wondering what the practical limits are.
Do people use GLR to parse thousands of pages of text?


Have to agree here. If anything, a better DFA generator is needed; the current
std.regex can't get as good in this field because of:

a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in
ctRegex via codegen)
b) full commitment to Unicode features (again, expect some improvement here
in the near future)
c) being designed to take multiple captures from the matched piece of text.

I'm not sure when (or even if) std.regex will ever special-case all of the
above.

Well,

- for a lexer, lookahead is sometimes useful (the Dragon book cites the
FORTRAN grammar, in which keywords are not reserved, so when you
encounter IF you don't know if (!) it's a function call or a 'real'
if)

Well, while lookahead would help, there are simpler ways, e.g.:
regex("IF\ZYOUR_LOOKAHEAD");

\Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but have chosen the ECMA-262 standard.


The critical difference is that lookahead can be used like this:

regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");

- Unicode is needed to lex D correctly, no?

Not that much of Unicode is needed; e.g. it could skip decoding most of the time.
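Roughly this kind of fast path (a hypothetical helper, assuming UTF-8 source):

import std.utf : decode;

// Hypothetical fast path: most D source is plain ASCII, so decode a
// code point only when the high bit is set (identifiers, strings or
// comments containing non-ASCII characters).
dchar nextChar(const(char)[] src, ref size_t i)
{
    immutable char c = src[i];
    if (c < 0x80)            // ASCII: no decoding needed
    {
        ++i;
        return c;
    }
    return decode(src, i);   // rare case: real UTF-8 decoding
}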

- multiple captures don't seem necessary *for a lexer*


How does the lexer know when to stop producing a 'comment'
token?


Call a special function skipComment when the lexer hits the comment_start pseudotoken.

Ah, and this special function must somehow maintain a stack, to
'count' the comment_start and comment_end pseudotokens. So in a way,
it quits the regular world to become temporarily more powerful.
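Something like this sketch for D's nesting /+ +/ comments (hypothetical helper; a single depth counter stands in for the stack, since every 'frame' is identical):

// Returns the index just past the closing "+/", assuming the opening
// "/+" has already been consumed by the lexer.
size_t skipNestingComment(const(char)[] src, size_t i)
{
    size_t depth = 1;
    while (i + 1 < src.length && depth > 0)
    {
        if (src[i] == '/' && src[i + 1] == '+')
        {
            ++depth;
            i += 2;
        }
        else if (src[i] == '+' && src[i + 1] == '/')
        {
            --depth;
            i += 2;
        }
        else
            ++i;
    }
    return i;   // error handling (unterminated comment) omitted
}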

Typically lexers are regular, as that allows them to be fast.

Hmm, but then it has to treat comments as one token. So no Ddoc love?



--
Dmitry Olshansky
