On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
> Is it based on a prioritized list of regexes?

I'm not using regexes at all. It's using string mixins to reduce code
duplication, but it's effectively hand-written. If I do it right, it should
be _very_ difficult to make it any faster than it's going to be. It even
specifically avoids decoding unicode characters and operates on ASCII
characters as much as possible.
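For anyone unfamiliar with the technique, the general idea looks something
like this - a toy sketch, not the lexer's actual code (makeCase and
tokenType are made-up names):

    import std.stdio;

    // Toy illustration of the string mixin technique: build the
    // repetitive case code as a string at compile time and paste it
    // into the switch with mixin(), instead of hand-writing each case.
    string makeCase(string ch, string type)
    {
        return `case '` ~ ch ~ `': return "` ~ type ~ `";`;
    }

    string tokenType(char c)
    {
        switch (c)
        {
            mixin(makeCase("+", "plus"));
            mixin(makeCase("-", "minus"));
            mixin(makeCase("*", "star"));
            default:
                return "other";
        }
    }

    void main()
    {
        writeln(tokenType('+')); // prints: plus
    }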
> > Well, it's still a work in progress, so it certainly can be adjusted as
> > necessary. I intend it to fully implement the spec (and make sure that
> > both it and the spec match what dmd is doing) as far as lexing goes. The
> > idea is that you should be able to build a fully compliant D parser on
> > top of it and build a fully compliant D compiler on top of that.
>
> My fear (having encoded the D grammar once) is that 99% of the code
> constructs are easily done (even new stuff like the => syntax), but that
> the remaining 1% is a nightmare. The many different kinds of strings with
> or without escaping, for example. Your lexer returns a string literal as a
> string field, right? If so, escaping chars like " and ` makes me nervous.

Well, if you want the original text, there's the str property, which
contains everything that was in the source code (including whatever escaping
there was), whereas the value property includes only the body, and it will
have already processed whatever escaped characters needed to be processed.
It's the actual value of the literal. So, \` would be ` in the value
property, named entities would be their actual code points, etc. I don't
really see a problem there. Sure, depending on what is done with the
processed string, it could cause further problems, but that always happens
when doing string processing.

> > It already supports all of the comment types and several of the string
> > literal types. I haven't sorted out q{} yet, but I will before I'm done,
> > and that may or may not affect how some things work, since I'm not quite
> > sure how to handle q{} yet (it may end up being done with tokens marking
> > the beginning and end of the token sequence encompassed by q{}, but
> > we'll see).
>
> I see. My question was because, as I said before, I found this to be a
> difficult and hairy part of the D grammar, and an IDE might want to
> colorize/manipulate the content of a string when it knows it's a string
> mixin.

Well, it could very well cause me trouble. I haven't gotten to it yet, and
I'll need to study it to fully understand what it does before crafting my
solution, but it'll probably end up being a sequence of tokens of some kind,
given that that's what it is - a string of tokens. Regardless, it'll be done
in a way that whatever is using the lexer will be able to know what it is
and process it accordingly.

> > str holds the exact text which was lexed (this includes the entire
> > comment for the various comment token types), which in the case of
> > lexing a string rather than another range type would normally (always? -
> > I don't remember) be a slice of the string being lexed, which should
> > help make lexing strings very efficient.
>
> Yeah, you may as well use this excellent feature. But that means the input
> must be held in memory always, no? If it's an input range (coming from a
> big file), can you keep slices from it?

Well, whatever is using the lexer is going to have to make sure that what it
passes to the lexer continues to exist while it uses the lexer. Given how
slicing works in D, it's pretty much a given that lexers and parsers are
going to use slices heavily, and any code using a lexer or parser is going
to have to take that into account. Slicing is one of the major reasons that
Tango's XML parser is so insanely fast.
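To illustrate the trade-off (a trivial sketch, nothing to do with the
lexer's actual API): a D slice is just a pointer and a length into the
original array, so a token can carry its exact source text without any
copying - but only for as long as the original memory stays alive.

    void main()
    {
        string source = "int x = 42;";

        // Slicing out a token's text allocates and copies nothing;
        // the slice points straight into the original string.
        string tok = source[0 .. 3]; // "int"

        assert(tok == "int");
        assert(tok.ptr is source.ptr); // same memory, not a copy
    }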
> > LiteralValue is a VariantN of the types that a literal can be (long,
> > ulong, real, and string IIRC) and is empty unless the token is a literal
> > type.
>
> You can make it an Algebraic!(long, ulong, real, string), as your universe
> of types is restricted. That can even catch a non-legal value.

I'll have to look at that to see whether using Algebraic is better. I'm not
super-familiar with std.variant, so I may have picked the wrong type.
However, VariantN already holds a specific set of types (unlike Variant), so
that part isn't a problem.

> Array literals are not considered literals, since they can contain
> arbitrary expressions, is that right?

There is no concept of array literal in the lexing portion of the spec.
There may be at the level of the parser, but that's not the lexer's problem.

> And it's a good idea to drop the complex literals once and for all.

I'm not supporting any symbols which are known to be scheduled for
deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along
those lines is to-be-deprecated keywords (e.g. volatile and delete), since
they still won't be legal to use as identifiers. And there's a function to
query a token as to whether it's using a deprecated or unused keyword, so
that the program using the lexer can flag it if it wants to.

> > (the various string postfixes - c, w, and d - are treated as different
> > token types rather than giving the literal value different string types -
> > the same with the integral and floating point literals).
>
> That seems reasonable enough, but can you really store a dstring literal
> in a string field?

Yeah. Why not? The string is the same in the source code regardless of the
type of the literal. The only difference is the letter tacked onto the end.
That will be turned into the appropriate string type by the semantic
analyzer, but the lexer doesn't care.

> Does syntax highlighting need more than a token stream? Without having
> thought a lot about it, it seems to me IDEs tend to highlight based just
> on the token type, not on a parse tree. So that means your lexer can be
> used directly by interested people, that's nice.

If all you want to do is highlight specific symbols and keywords, then you
don't need a parser. If you want to do fancier stuff related to templates
and the like, then you'll need a parser and maybe even a semantic analyzer.
But a lexer should be plenty for basic syntax highlighting.

> I really like the idea of having malformed tokens become error tokens,
> with lexing continuing. As I said before, that enables an IDE to continue
> to do syntax highlighting.

That was the idea. You need to be able to continue lexing beyond errors in
many cases, and that seemed like the natural way to do it.

> Did you achieve UFCS, wrt floating point values? Is 1.f seen as a float or
> as f called on 1 (an int)?

I haven't tackled floating point literals yet, but since the f in 1.f is
not part of the floating point literal, I'll have to handle it. I suspect
that that's one area where I'll need to update the spec though, since it
probably hasn't been updated for that yet.
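For reference, assuming the literal rules end up where the UFCS changes are
headed, the behavior in question would look something like this (f is a
made-up function, purely for illustration):

    import std.stdio;

    // Made-up function so that the UFCS call has something to resolve to.
    string f(int x)
    {
        return "f called on an int";
    }

    void main()
    {
        // The f here is not a float suffix: the lexer produces the
        // tokens 1, ., and f, so 1.f becomes the call f(1).
        writeln(1.f); // f called on an int

        // An actual float literal needs a digit after the decimal point:
        writeln(1.0f); // 1
    }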
Basically, the lexer that I'm writing needs to be 100% compliant with the D
spec and dmd (which means updating the spec or fixing dmd in some cases),
and it needs to be possible to build on top of it anything and everything
that dmd does that would use a lexer (even if it's not the way that dmd
currently does it), so that it's possible to build a fully compliant D
parser and compiler on top of it, as well as a ddoc generator and anything
else that you'd need to do with a lexer for D. So, if you have any questions
about what my lexer does (or is supposed to do) with regards to the spec,
that should answer it. If my lexer doesn't match the spec or dmd when I'm
done (aside from specific exceptions relating to stuff like deprecated
symbols), then I screwed up.

- Jonathan M Davis