On 2012-08-01 20:24, Jonathan M Davis wrote:
D source text can be in one of the following formats:
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE
So, yes, you can stick Unicode characters directly in D code. Though I wonder
about the correctness of the spec here: it claims that if there's no BOM, then
the source is ASCII. But unless vim inserts BOM markers into all of my .d
files, none of them have one, and yet I can put Unicode in a .d file just fine
with vim. I should probably study up on BOMs.
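That said, the BOM sniffing itself is simple enough. A minimal sketch, assuming
a made-up SourceEncoding enum and detectEncoding function (not from any real
lexer):

enum SourceEncoding { ascii, utf8, utf16be, utf16le, utf32be, utf32le }

// Check the longer BOMs first: the UTF-16LE BOM (FF FE) is a prefix
// of the UTF-32LE one (FF FE 00 00).
SourceEncoding detectEncoding(const(ubyte)[] src)
{
    if (src.length >= 4 && src[0] == 0x00 && src[1] == 0x00 &&
        src[2] == 0xFE && src[3] == 0xFF)
        return SourceEncoding.utf32be;
    if (src.length >= 4 && src[0] == 0xFF && src[1] == 0xFE &&
        src[2] == 0x00 && src[3] == 0x00)
        return SourceEncoding.utf32le;
    if (src.length >= 3 && src[0] == 0xEF && src[1] == 0xBB && src[2] == 0xBF)
        return SourceEncoding.utf8;
    if (src.length >= 2 && src[0] == 0xFE && src[1] == 0xFF)
        return SourceEncoding.utf16be;
    if (src.length >= 2 && src[0] == 0xFF && src[1] == 0xFE)
        return SourceEncoding.utf16le;
    // No BOM: per the spec wording quoted above, fall back to ASCII.
    return SourceEncoding.ascii;
}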
In any case, the source is read in whatever encoding it's in. String literals
then all become UTF-8 in the final object code unless they're explicitly marked
as another type via a postfix letter or are inferred as another type (e.g. when
you assign a string literal to a dstring). Regardless, what ends up in the
final object code is determined by the types that the type system gives the
strings, not by the encoding of the source code.
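Concretely (the variable names are just for illustration):

string  s = "hello";  // default: stored as UTF-8
wstring w = "hello"w; // 'w' postfix: stored as UTF-16
dstring d = "hello"d; // 'd' postfix: stored as UTF-32
dstring e = "hello";  // no postfix, but inferred as UTF-32 from the type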
So, a lexer shouldn't care about the encoding of the source beyond what it
takes to convert it to a format that it can deal with (and potentially being
written in a way which makes handling a particular encoding more efficient).
The values of literals and the like are completely unaffected regardless.
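For instance, one simple approach is to transcode everything to UTF-8 up front
and lex only that (std.conv.to handles the re-encoding between the string
types):

import std.conv : to;

void main()
{
    // Pretend this wstring came from decoding a UTF-16 source file.
    wstring utf16Source = "auto s = \"åäö\";"w;

    // Re-encode once, so the rest of the lexer only ever sees UTF-8.
    string utf8Source = utf16Source.to!string;

    // ... hand utf8Source to the lexer ...
}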
But if you read a source file which is encoded using UTF-16, you would need to
re-encode it to store it in the "str" field in your Token struct? If that's the
case, wouldn't it be better to make Token a template, so it can store all the
Unicode encodings without re-encoding? Although I don't know if that would
complicate the rest of the lexer.
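Something along these lines is what I have in mind (TokenType and the fields
are hypothetical, just to show the shape):

enum TokenType { identifier, stringLiteral, number /* ... */ }

struct Token(String)
    if (is(String == string) || is(String == wstring) || is(String == dstring))
{
    TokenType type;
    String str;  // a slice of the source, kept in its original encoding
    size_t line;
}

alias Token16 = Token!wstring; // tokens over a UTF-16 source, no re-encoding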
--
/Jacob Carlborg