On 2012-08-01 20:24, Jonathan M Davis wrote:
D source text can be in one of the following formats:
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE
So, yes, you can stick Unicode characters directly in D code. Though I wonder
about the correctness of the spec here: it claims that if there's no BOM, then
the source is ASCII. But unless vim inserts BOM markers into all of my .d
files, none of them have one, and yet I can put Unicode in a .d file just fine
with vim. I should probably study up on BOMs.
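That said, the BOM sniffing itself is simple enough. A minimal sketch, assuming
a made-up SourceEncoding enum and detectEncoding function (not from any real
lexer):

enum SourceEncoding { ascii, utf8, utf16be, utf16le, utf32be, utf32le }

// Check the longer BOMs first: the UTF-16LE BOM (FF FE) is a prefix
// of the UTF-32LE one (FF FE 00 00).
SourceEncoding detectEncoding(const(ubyte)[] src)
{
    if (src.length >= 4 && src[0] == 0x00 && src[1] == 0x00 &&
        src[2] == 0xFE && src[3] == 0xFF)
        return SourceEncoding.utf32be;
    if (src.length >= 4 && src[0] == 0xFF && src[1] == 0xFE &&
        src[2] == 0x00 && src[3] == 0x00)
        return SourceEncoding.utf32le;
    if (src.length >= 3 && src[0] == 0xEF && src[1] == 0xBB && src[2] == 0xBF)
        return SourceEncoding.utf8;
    if (src.length >= 2 && src[0] == 0xFE && src[1] == 0xFF)
        return SourceEncoding.utf16be;
    if (src.length >= 2 && src[0] == 0xFF && src[1] == 0xFE)
        return SourceEncoding.utf16le;
    // No BOM: per the spec wording quoted above, fall back to ASCII.
    return SourceEncoding.ascii;
}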
In any case, the source is read in whatever encoding it's in. String literals
then all become UTF-8 in the final object code unless they're explicitly marked
as another type via a postfix letter or are inferred as another type (e.g. when
you assign a string literal to a dstring). Regardless, what ends up in the
final object code is determined by the types that the type system gives the
strings, not by the encoding of the source code.
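Concretely (the variable names are just for illustration):

string  s = "hello";  // default: stored as UTF-8
wstring w = "hello"w; // 'w' postfix: stored as UTF-16
dstring d = "hello"d; // 'd' postfix: stored as UTF-32
dstring e = "hello";  // no postfix, but inferred as UTF-32 from the type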
So, a lexer shouldn't care about the encoding of the source beyond what it
takes to convert it to a format that it can deal with (and potentially being
written in a way which makes handling a particular encoding more efficient).
The values of literals and the like are completely unaffected regardless.
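For instance, one simple approach is to transcode everything to UTF-8 up front
and lex only that (std.conv.to handles the re-encoding between the string
types):

import std.conv : to;

void main()
{
    // Pretend this wstring came from decoding a UTF-16 source file.
    wstring utf16Source = "auto s = \"åäö\";"w;

    // Re-encode once, so the rest of the lexer only ever sees UTF-8.
    string utf8Source = utf16Source.to!string;

    // ... hand utf8Source to the lexer ...
}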
But if you read a source file which is encoded using UTF-16, you would need to
re-encode it to store it in the "str" field in your Token struct? If that's the
case, wouldn't it be better to make Token a template, so it can store all the
Unicode encodings without re-encoding? Although I don't know if that would
complicate the rest of the lexer.
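Something along these lines is what I have in mind (TokenType and the fields
are hypothetical, just to show the shape):

enum TokenType { identifier, stringLiteral, number /* ... */ }

struct Token(String)
    if (is(String == string) || is(String == wstring) || is(String == dstring))
{
    TokenType type;
    String str;  // a slice of the source, kept in its original encoding
    size_t line;
}

alias Token16 = Token!wstring; // tokens over a UTF-16 source, no re-encoding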
--
/Jacob Carlborg