On 01-Aug-12 01:10, Philippe Sigaud wrote:
On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tob...@pankrath.net> wrote:
On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:

I know I'm a little late coming into this conversation. This seems like a
nice thread to toss myself into. I've started working on a generic lexer
that is based on a defined grammar.


Every helping hand is appreciated :-)

Hi Kai, welcome here!

As I read through the thread (I unfortunately don't have enough time to
read every post, but I skimmed through as many as I could, and read the ones
that seemed important), it seems like we need a parser in D that can lex D,
and provide a Range of tokens that could be consumed.


Yes. To make this statement more precise: we need a lexer that provides
a range of tokens, and we need a parser which makes it possible to build
an AST. Pegged combines both approaches but imposes an overhead if you
just need a token list. However, I'm not sure whether that overhead is a
problem in practice.
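
For concreteness, here is a minimal sketch of that first half: a lexer
exposed as an input range of tokens. ToyLexer and its whitespace-only
splitting are made up purely to show the empty/front/popFront interface
a real D lexer could present:

import std.ascii : isWhite;

struct Token { string text; }

// Toy lexer exposed as an input range of tokens; it merely splits
// on whitespace, the point is the empty/front/popFront interface.
struct ToyLexer
{
    string src;

    bool empty()
    {
        skipWhite();
        return src.length == 0;
    }

    Token front()
    {
        skipWhite();
        size_t i = 0;
        while (i < src.length && !isWhite(src[i])) ++i;
        return Token(src[0 .. i]);
    }

    void popFront()
    {
        skipWhite();
        while (src.length && !isWhite(src[0])) src = src[1 .. $];
    }

    private void skipWhite()
    {
        while (src.length && isWhite(src[0])) src = src[1 .. $];
    }
}

unittest
{
    import std.algorithm : equal, map;
    assert(ToyLexer("int x ;").map!(t => t.text).equal(["int", "x", ";"]));
}

Any range algorithm in Phobos could then consume the token stream directly.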

I think we need a tokenizer generator and a parser generator. They can
be grouped or not, but we need both, in Phobos.

We also need to define what's needed in a token. Kai's approach is OK,
but what's the _type field for an operator or an identifier?
Also, a lexer can fill a symbol table and produce identifier tokens
that refer to a particular entry in the symbol table.

Yeah.
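
For what it's worth, a hedged sketch of what such a token could carry
(all names hypothetical): giving every operator and keyword its own enum
member answers the _type question, and identifiers carry an index into
the symbol table:

enum TokenType : ushort
{
    identifier,
    intLiteral,
    plus,       // one member per operator...
    kwIf,       // ...and per keyword
    eof,
}

struct Token
{
    TokenType type;   // the _type field: a distinct enum member
                      // for each operator, keyword, literal kind
    uint line;
    uint column;
    uint symbolId;    // for identifiers: entry in the lexer's
                      // symbol table rather than a string slice
}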

I guess creating a tree of symbol tables according to scope visibility
is then more the job of the parser, but I'm not sure.

The parser can use constant IDs for nested tables; the IDs point into a
string table. The string table is populated by the lexer.
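
Something along these lines, assuming a simple interning scheme
(StringTable and intern are made-up names):

// Interning table: the lexer stores each distinct identifier once
// and hands out stable uint IDs; the parser's nested scope tables
// then key on those IDs instead of on strings.
struct StringTable
{
    string[] strings;
    uint[string] ids;

    uint intern(string s)
    {
        if (auto p = s in ids)
            return *p;
        auto id = cast(uint) strings.length;
        strings ~= s;
        ids[s] = id;
        return id;
    }

    string lookup(uint id) { return strings[id]; }
}

unittest
{
    StringTable t;
    assert(t.intern("foo") == 0);
    assert(t.intern("bar") == 1);
    assert(t.intern("foo") == 0);   // same ID on re-occurrence
    assert(t.lookup(1) == "bar");
}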

I've only glanced at your code. For most languages lexing is far more
expensive than parsing

Is that so?

It usually is, unless the parser is overly generic, as with GLL/GLR (not to claim that they're very slow, but GLL/GLR parsers are slower for obvious reasons).


and thus the lexer has to be very fast, so I wouldn't pursue your
approach; I'd instead use something like Ragel. It already has a D
output backend but needs a separate build step.

Have to agree here. If anything, a better DFA generator is needed; the
current std.regex can't get as good in this field because of:

a) irregularities like lookahead/lookbehind/etc. in patterns (avoided in ctRegex via codegen)
b) its full Unicode feature commitment (again, expect some improvement here in the near future)
c) being designed to take multiple captures from the matched piece of text.

I'm not sure when (or even if) std.regex will ever special-case all of the above.


Having std.lexer in Phobos would be quite good, with a pre-compiled lexer for D.

The only problem I see in Kai's approach (which is the standard one, a
prioritized list of regexes) is how to recognize tokens that are not
regular (I mean, ones that cannot be encoded as a regex), like nesting
comments.

See below.

How does the lexer know when to stop producing a 'comment'
token?

Call a special function, skipComment, when the lexer hits a comment_start pseudo-token (see the sketch below).

Typically lexers are regular, as that allows them to be fast.
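
To make the nesting case concrete, a sketch of such a function for D's
/+ +/ comments (skipNestedComment is a made-up name): a depth counter
does what no single regex can, matching balanced delimiters:

// Called when the lexer sees "/+" at index i; returns the index
// just past the matching "+/", tracking nesting with a counter.
size_t skipNestedComment(string src, size_t i)
{
    int depth = 0;
    while (i + 1 < src.length)
    {
        if (src[i] == '/' && src[i + 1] == '+')
        {
            ++depth;
            i += 2;
        }
        else if (src[i] == '+' && src[i + 1] == '/')
        {
            --depth;
            i += 2;
            if (depth == 0)
                return i;
        }
        else
            ++i;
    }
    throw new Exception("unterminated /+ +/ comment");
}

unittest
{
    auto s = "/+ outer /+ inner +/ still outer +/int";
    assert(s[skipNestedComment(s, 0) .. $] == "int");
}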


--
Dmitry Olshansky
