On 8/1/2012 10:44 PM, Jonathan M Davis wrote:
On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
The lexer must use char or it will not be acceptable as anything but a toy
for performance reasons.

Avoiding decoding can be done with strings and operating on ranges of dchar,
so you'd be operating almost entirely on ASCII. Are you saying that there's a
performance issue aside from decoding?

1. Encoding it into a dchar is a performance problem. Source code sits in files that are nearly always in UTF-8. So your input range MUST check every single char and convert it to UTF-32 as necessary. Plus, it puts an extra step between the file input buffer and the lexer's input, instead of handing the buffer to the lexer directly.

2. Storing strings as dchars is a performance and memory problem (4x as much data and hence time).

Remember, nearly ALL of D source will be ASCII. All performance considerations must be tilted towards the usual case.
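
To put a rough number on point 2, here is a minimal C++ sketch (not dmd code) that widens ASCII-only source text to UTF-32 code units and compares the storage. The helper names `widen_ascii`, `bytes_utf8` and `bytes_utf32` are made up for illustration, and the 4x ratio assumes a platform where `char32_t` occupies 4 bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widen ASCII-only source text to UTF-32 code units,
// which is what storing the text as dchars amounts to.
// Assumption: every input byte is < 0x80 (plain ASCII).
std::u32string widen_ascii(const std::string& src) {
    std::u32string out;
    out.reserve(src.size());
    for (unsigned char c : src) {
        out.push_back(static_cast<char32_t>(c)); // one 32-bit unit per byte
    }
    return out;
}

// Storage used by the UTF-8 form, in bytes.
std::size_t bytes_utf8(const std::string& s) {
    return s.size() * sizeof(char);
}

// Storage used by the UTF-32 form, in bytes.
std::size_t bytes_utf32(const std::u32string& s) {
    return s.size() * sizeof(char32_t);
}
```

For ASCII input, `bytes_utf32(widen_ascii(src))` is `sizeof(char32_t)` times `bytes_utf8(src)`, and the widening loop still has to touch every character once before the lexer sees any of it.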


Somebody has to convert the input files into dchars, and then back into
chars. That blows for performance. Think billions and billions of
characters going through, not just a few random strings.

Why is there any converting to dchar going on here?

Because your input range is a range of dchar?


I don't see why any would
be necessary. If you're reading in a file as a string or char[] (as would be
typical), then you're operating on a string, and then the only time that any
decoding will be necessary is when you actually need to operate on a unicode
character, which is very rare in D's grammar. It's only when operating on
something _other_ than a string that you'd have to actually deal with dchars.

That's what I've been saying. So why have an input range of dchars, which must be decoded in advance, otherwise it wouldn't be a range of dchars?
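
As a sketch of what operating on chars directly could look like, the C++ fragment below scans an identifier straight out of a UTF-8 buffer. ASCII bytes take the fast path, and a multi-byte sequence is only measured when a byte with the high bit set turns up. This is not dmd's lexer. The function names are hypothetical, and the rules are deliberately simplified: any multi-byte sequence is accepted as an identifier character, continuation bytes are not validated, and a leading digit is not rejected.

```cpp
#include <cstddef>
#include <string>

// True for ASCII characters that may appear in an identifier.
inline bool is_ascii_ident_char(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// lead byte c, or 0 if c is not a multi-byte lead byte.
inline std::size_t utf8_sequence_length(unsigned char c) {
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

// Returns the byte length of the identifier starting at src[pos],
// or 0 if no identifier character is found there.
std::size_t scan_identifier(const std::string& src, std::size_t pos) {
    std::size_t i = pos;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c < 0x80) {
            // Usual case: one comparison chain, no decoding at all.
            if (!is_ascii_ident_char(c)) break;
            ++i;
        } else {
            // Rare case: skip over one whole multi-byte sequence.
            std::size_t n = utf8_sequence_length(c);
            if (n == 0 || i + n > src.size()) break;
            i += n;
        }
    }
    return i - pos;
}
```

The resulting token is just a (pointer, length) slice of the original file buffer, so nothing is converted to UTF-32 and nothing is converted back.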


Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
else will care about it. At most, it should be an optional feature of the
lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's
lexer and how it handles identifiers, then look at dmd's symbol table.

My point is that it's the sort of thing that _only_ a parser would care about.
So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.
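
A minimal sketch of such an identifier-to-handle mapping, in C++ and under my own assumptions rather than copied from dmd: the class name `IdentifierTable`, the use of `std::unordered_map`, and integer handles are all illustrative choices.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical identifier table: maps each distinct identifier spelling
// to one unique handle. It holds no symbol information at all.
class IdentifierTable {
public:
    using Handle = std::size_t;

    // Returns the existing handle for name, or assigns the next free one.
    Handle intern(const std::string& name) {
        auto it = handles_.find(name);
        if (it != handles_.end()) return it->second;
        Handle h = names_.size();
        names_.push_back(name);
        handles_.emplace(name, h);
        return h;
    }

    // Recovers the spelling for a handle, e.g. for error messages.
    const std::string& name(Handle h) const { return names_.at(h); }

private:
    std::unordered_map<std::string, Handle> handles_;
    std::vector<std::string> names_;
};
```

The lexer calls `intern` once per identifier token, while the characters are already in hand. Later passes can then compare two identifiers by comparing their handles, without scanning the strings again.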
