On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisp...@gmx.com> wrote:
> It was never intended to be even vaguely generic. It's targeting D
> specifically. If someone can take it and make it generic when I'm done,
> then great. But its goal is to lex D as efficiently as possible, and
> it'll do whatever it takes to do that.

That's exactly what I had in mind. Anyway, we need a D lexer. We also need a
generic lexer generator, but that's a distant second priority, and we can
accept it being less efficient. Of course, any trick used in the D lexer can
most probably be reused for Algol-family lexers.

>> I don't get it. Say I have a literal with non-UTF-8 chars; how will
>> it be stored inside the .str field as a string?
>
> The literal is written in whatever encoding the range is in. If it's
> UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly
> the same set of characters that UTF-32 can. Your range could be UTF-32,
> but the string literal is supposed to be UTF-8 ultimately. Or the range
> could be UTF-8 when the literal is UTF-32. The characters themselves are
> in the encoding type of the range regardless. It's just the values that
> the compiler generates which change.
>
> "hello world"
> "hello world"c
> "hello world"w
> "hello world"d
>
> are absolutely identical as far as lexing goes save for the trailing
> character. It would be the same regardless of the characters in the
> strings or the encoding used in the source file.

Every time I think I understand D strings, you prove me wrong. So, I *still*
don't get how that works. Say I have

auto s = " - some Greek or Chinese chars, mathematical symbols, whatever - "d;

Then the "..." part is lexed as a string literal. How can the string field in
the Token magically contain UTF-32 characters? Or are they automatically cut
into four nonsense chars each?

What about comments containing non-ASCII chars? How can code coming after the
lexer make sense of them?

> As Jacob said, many people code in English.
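(If it helps to see the "same characters, different byte values" point concretely, here is a small sketch in Python rather than D, purely illustrative: the character set is identical in UTF-8 and UTF-32, and the two round-trip losslessly; only the encoded bytes differ.)

```python
# The same sequence of characters can be encoded as UTF-8 or UTF-32;
# nothing is lost either way, only the byte values differ.
s = "h\u00e9llo \u03b1\u4e2d"  # Latin, Greek, and Chinese characters

utf8 = s.encode("utf-8")
utf32 = s.encode("utf-32-le")

# Both encodings decode back to the identical sequence of code points.
assert utf8.decode("utf-8") == utf32.decode("utf-32-le") == s

# 8 code points: 12 bytes in UTF-8, 32 bytes (4 per char) in UTF-32.
print(len(utf8), len(utf32))  # prints: 12 32
```

The lexeme the lexer stores is just a slice of the source range, in whatever encoding that range uses; as I understand Jonathan's point, the 'c'/'w'/'d' suffix only tells the compiler which encoding to emit for the literal's *value*.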
That's right, but:

1- they most probably use their own language for internal documentation,
2- any i18n part of a code base will have non-ASCII chars, and
3- D is supposed to accept UTF-16 and UTF-32 source code.

So, wouldn't it make sense to at least provide an option on the lexer to
specifically store identifier lexemes and comments as a dstring?
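(Roughly what I mean by such an option, sketched in Python rather than D, with a hypothetical store_lexeme helper: decode the raw slice from whatever encoding the source range uses, and optionally re-encode it to one wide encoding, as storing a dstring would.)

```python
# Hypothetical sketch of the proposed lexer option: normalize stored
# lexemes (identifiers, comments) to one encoding, whatever the source
# range's encoding was -- roughly what storing them as dstring would do.
def store_lexeme(raw: bytes, source_encoding: str, as_utf32: bool = False) -> bytes:
    """Decode the raw source slice; optionally re-encode it as UTF-32."""
    text = raw.decode(source_encoding)
    return text.encode("utf-32-le") if as_utf32 else raw

# A UTF-8 source file and a UTF-16 one then yield the same stored
# lexeme, so passes after the lexer need not care about source encoding.
ident_utf8 = "donn\u00e9e".encode("utf-8")
ident_utf16 = "donn\u00e9e".encode("utf-16-le")
a = store_lexeme(ident_utf8, "utf-8", as_utf32=True)
b = store_lexeme(ident_utf16, "utf-16-le", as_utf32=True)
assert a == b
```

Without the option, the lexer stays maximally fast (it just slices the range); with it, downstream code gets one uniform representation.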