On 2012-08-01 14:44, Philippe Sigaud wrote:

Every time I think I understand D strings, you prove me wrong. So I
*still* don't get how this works:

say I have

auto s = " - some Greek or Chinese chars, mathematical symbols, whatever - "d;

Then, the "..." part is lexed as a string literal. How can the string
field in the Token magically contain UTF-32 characters? Or are they
automatically cut into four nonsense chars each? What about comments
containing non-ASCII chars? How can code coming after the lexer make
sense of it?
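
To make the question concrete, here's a small self-contained snippet
(plain D, nothing assumed about the lexer itself) showing that the
three suffixes give three different element types for the same
characters:

import std.stdio;

void main()
{
    // The same characters in each of D's three string types:
    string  s8  = "héllo";   // immutable(char)[],  UTF-8 code units
    wstring s16 = "héllo"w;  // immutable(wchar)[], UTF-16 code units
    dstring s32 = "héllo"d;  // immutable(dchar)[], UTF-32 code points

    writeln(s8.length);  // 6 -- 'é' takes two UTF-8 code units
    writeln(s16.length); // 5
    writeln(s32.length); // 5
}

So the question is really: the lexer reads UTF-8 code units, but a
"..."d literal must end up as dchars -- where does that conversion
happen?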

As Jacob says, many people code in English. That's true, but

1- they most probably use their own language for internal documentation
2- any i18n part of a code base will have non-ASCII chars
3- D is supposed to accept UTF-16 and UTF-32 source code.

So, wouldn't it make sense to at least provide an option on the lexer
to specifically store identifier lexemes and comments as a dstring?
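
Purely as illustration, the option could be as simple as a flag on
the lexer's configuration -- these names are made up, not an existing
API:

struct LexerConfig
{
    // When set, identifier lexemes and comment bodies are transcoded
    // to dstring up front instead of being kept as slices of the
    // source buffer.
    bool storeTokenTextAsDstring;
}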

I'm not quite sure how it works either. But I'm thinking like this:

The string representing what's in the source code can be either UTF-8 or whatever encoding the file uses. I'm not sure whether the lexer needs to re-encode the string if the two don't match.

Then there's another field/function that returns the processed token, e.g. for a token of type "int" it will return an actual int. For string literals, this function would return a different string type depending on the kind of literal the token represents.
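
A rough sketch of what I mean -- all the names here (Token, lexeme,
text) are invented for illustration, not the actual lexer API:

import std.conv : to;
import std.stdio;

struct Token
{
    // Raw slice of the source buffer, in the source's own encoding
    // (typically UTF-8); nothing has been decoded at this point.
    string lexeme;

    // The "processed token" step: transcode the raw text into
    // whatever string type the caller asks for. std.conv.to handles
    // UTF-8 <-> UTF-16 <-> UTF-32 transcoding.
    T text(T)() if (is(T == string) || is(T == wstring) || is(T == dstring))
    {
        return lexeme.to!T;
    }
}

void main()
{
    auto tok = Token("héllo");
    dstring d = tok.text!dstring;
    writeln(d.length); // 5 -- code points, non-ASCII intact
}

That way the raw slice stays cheap (no copying, no transcoding), and
the conversion to dstring only happens for the tokens where the
caller actually asks for it.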

--
/Jacob Carlborg
