On 8/1/2012 9:52 PM, Jonathan M Davis wrote:
1. The current design of Phobos is to have ranges of dchar, because it fosters
code correctness (though it can harm efficiency). It's arguably too late to do
otherwise. Certainly, doing otherwise now would break a lot of code. If the
lexer tried to operate on UTF-8 as part of its API rather than operating on
ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
at all.
For performance reasons, the lexer must use char, or it will not be acceptable as
anything but a toy.
2. The lexer does _not_ have to have its performance tank by accepting ranges
of dchar. It's true that the performance will be harmed for ranges which
_aren't_ strings, but for strings (as would be by far the most common use
case) it can be very efficient by special-casing them.
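(Concretely, the kind of special-casing meant here looks something like the
following rough sketch, with made-up names, not actual Phobos or std.d.lexer code:)

    import std.range, std.traits;

    // One generic entry point, with a fast path for narrow strings.
    auto skipSpaces(R)(R r)
        if (isInputRange!R && is(ElementType!R : dchar))
    {
        static if (isNarrowString!R)
        {
            // Strings: the characters tested for are ASCII, so scan
            // code units directly, with no decoding at all.
            size_t i;
            while (i < r.length && (r[i] == ' ' || r[i] == '\t'))
                ++i;
            return r[i .. $];
        }
        else
        {
            // Any other range of dchar: decode element by element.
            while (!r.empty && (r.front == ' ' || r.front == '\t'))
                r.popFront();
            return r;
        }
    }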
Somebody has to convert the input files into dchars, and then back into chars.
That blows for performance. Think billions and billions of characters going
through, not just a few random strings.
Always always think of the lexer as having a firehose of data shoved into its
maw, and it better be thirsty!
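To see what that round trip costs per character, compare what a dchar interface
has to do with what a char-based lexer does. This is just an illustrative sketch
with hypothetical helper names, not code from any real lexer:

    import std.utf : decode, toUTF8;

    // Decoding on the way in: every character runs through UTF-8 logic.
    dchar nextChar(string src, ref size_t i)
    {
        return decode(src, i);      // branches and validation, per character
    }

    // Re-encoding on the way out: a lexeme collected as dchars has to be
    // turned back into UTF-8 before it can be stored as a string.
    string keepLexeme(const(dchar)[] buf)
    {
        return toUTF8(buf);
    }

    // A char-based lexer does neither: the lexeme is just a slice.
    string keepLexemeFast(string src, size_t start, size_t end)
    {
        return src[start .. end];
    }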
And as much as there are potential performance issues with Phobos' choice of
treating strings as ranges of dchar, if it were instead to treat them as ranges
of code units, it's pretty much a guarantee that there would be a _ton_ of bugs
caused by it. Most programmers have absolutely no clue about how Unicode works
and don't really want to know. They want strings to just work.
Phobos' approach of defaulting to correct but making it possible to make the
code faster through specializations is very much in line with D's typical
approach of making things safe by default but allowing the programmer to do
unsafe things for optimizations when they know what they're doing.
I expect std.d.lexer to handle UTF-8 correctly, so I don't think this should be
an issue in this particular case. dmd's lexer does handle UTF-8 correctly.
Note also that the places where non-ASCII characters can appear in correct D
code are severely limited, and there are even fewer places where multibyte
characters need to be decoded at all; the lexer takes full advantage of this
to boost its speed.
For example, non-ASCII characters can appear in comments, but they DO NOT need
to be decoded there; even just adding a test for a non-ASCII character to the
comment scanner visibly slows down the lexer.
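Here is roughly what that looks like for a block comment (a sketch of the
technique, not dmd's actual code). Every byte of a multi-byte UTF-8 sequence has
its high bit set, so it can never be mistaken for the ASCII '*' or '/' the
scanner is looking for, and the hot loop needs neither decoding nor a non-ASCII
test:

    size_t skipBlockComment(const(char)[] src, size_t i)
    {
        while (i + 1 < src.length)
        {
            if (src[i] == '*' && src[i + 1] == '/')
                return i + 2;       // index just past the closing */
            ++i;                    // step over code units blindly
        }
        return src.length;          // unterminated; a real lexer reports an error
    }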
All identifiers are entered into a hashtable, and are referred to by
pointers into that hashtable for the rest of dmd. This makes symbol lookups
incredibly fast, as no string comparisons are done.
Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
else will care about it. At most, it should be an optional feature of the
lexer. But it certainly could be added that way.
I hate to say "trust me on this", but if you don't, have a look at dmd's lexer
and how it handles identifiers, then look at dmd's symbol table.
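For those who don't want to dig through the source, the idea is string
interning, roughly like this (a minimal sketch with hypothetical types, not
dmd's actual data structures):

    // Each distinct spelling maps to exactly one Identifier object.  After
    // lexing, the rest of the compiler passes these pointers around, so
    // "is this the same identifier?" is a pointer comparison, never a
    // string compare.
    final class Identifier
    {
        string name;
        this(string n) { name = n; }
    }

    Identifier[string] internTable;

    Identifier intern(string spelling)
    {
        if (auto p = spelling in internTable)
            return *p;
        auto id = new Identifier(spelling);
        internTable[spelling] = id;
        return id;
    }

    // intern("foo") is intern("foo")  // true: same object every time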