Sean Kelly <s...@invisibleduck.org> wrote:
> Sean Kelly <s...@invisibleduck.org> wrote:
>> Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:
>>> On 10/22/10 16:28 CDT, Sean Kelly wrote:
>>>> Andrei Alexandrescu Wrote:
>>>>>
>>>>> I have in mind the entire implementation of a simple design, but never
>>>>> had the time to execute on it. The tokenizer would work like this:
>>>>>
>>>>> alias Lexer!(
>>>>>     "+", "PLUS",
>>>>>     "-", "MINUS",
>>>>>     "+=", "PLUS_EQ",
>>>>>     ...
>>>>>     "if", "IF",
>>>>>     "else", "ELSE"
>>>>>     ...
>>>>> ) DLexer;
>>>>>
>>>>> Such a declaration generates numeric values DLexer.PLUS etc. and
>>>>> generates an efficient code that extracts a stream of tokens from a
>>>>> stream of text. Each token in the token stream has the ID and the
>>>>> text.
>>>>
>>>> What about, say, floating-point literals? It seems like the first
>>>> element of a pair might have to be a regex pattern.
>>>
>>> Yah, with regard to such regular patterns (strings, comments, numbers,
>>> identifiers) there are at least two possibilities that I see:
>>>
>>> 1. Go the full route of allowing regexen in the definition. This is
>>> very hard because you need to generate an efficient (N|D)FA during
>>> compilation.
>>>
>>> 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
>>> compile-time table matches, just call onUnrecognizedString(). In
>>> conjunction with a few simple specialized functions, that makes it
>>> very simple to define arbitrarily complex lexers where the bulk of the
>>> work (and the most tedious part) is done by the D compiler.
>>
>> For the second, that may push the work of recognizing some lexical
>> elements into the parser. For example, a comment may be defined as /**/,
>> which if there is no lexical definition of a comment means that it
>> parses as four distinct valid tokens, div mul mul div.
>
> Or maybe not. A /* could be CommentBegin. I'll have to think on it a bit
> more.

I still think it won't work. The stuff inside the comment would come through as a string of random tokens. Also, the // comment is EOL sensitive, and this info isn't normally communicated to the parser.
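
To make the concern concrete, here is a minimal sketch. It assumes a small hand-written scanner, not the proposed Lexer! template, and the names Tok, Rule, rules and lex are made up for illustration. With commentsInLexer set to false the table has no notion of a comment, so /**/ comes out as div mul mul div. With it set to true, the /* and // entries trigger a handler that swallows the comment body inside the lexer, which is roughly what a "fallthrough"-style routine would have to do.

```d
import std.string : indexOf;

enum Tok { div, mul, blockComment, lineComment, other }

struct Token { Tok id; string text; }

struct Rule { string text; Tok id; }

// Hypothetical fixed-token table; longer spellings come first so "/*" beats "/".
immutable Rule[] rules = [
    Rule("/*", Tok.blockComment),
    Rule("//", Tok.lineComment),
    Rule("/",  Tok.div),
    Rule("*",  Tok.mul),
];

// Assumes ASCII input: unmatched bytes are emitted one at a time as Tok.other.
Token[] lex(string src, bool commentsInLexer)
{
    Token[] result;
    size_t i = 0;
    scan: while (i < src.length)
    {
        foreach (r; rules)
        {
            immutable isComment = r.id == Tok.blockComment || r.id == Tok.lineComment;
            if (isComment && !commentsInLexer)
                continue; // pretend the table has no comment entries
            if (src.length - i < r.text.length || src[i .. i + r.text.length] != r.text)
                continue; // this rule does not match at position i

            size_t end = i + r.text.length;
            if (r.id == Tok.blockComment)
            {
                // Swallow everything up to and including the closing "*/".
                immutable close = src[end .. $].indexOf("*/");
                end = close < 0 ? src.length : end + close + 2;
            }
            else if (r.id == Tok.lineComment)
            {
                // Swallow up to, but not including, the end of the line.
                immutable nl = src[end .. $].indexOf('\n');
                end = nl < 0 ? src.length : end + nl;
            }
            result ~= Token(r.id, src[i .. end]);
            i = end;
            continue scan;
        }
        // Nothing in the table matched: fall through with a one-byte token.
        result ~= Token(Tok.other, src[i .. i + 1]);
        ++i;
    }
    return result;
}

void main()
{
    // Without comment handling, "/**/" is four operator tokens.
    assert(lex("/**/", false).length == 4);
    // With comment handling, it is a single token and the body never leaks out.
    assert(lex("/**/", true).length == 1);
}
```

Note that the // case only works because the scanner itself looks for the newline. A parser that sees only the token stream would never learn where the line ended unless newlines were also passed along as tokens.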