On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
> Because your input range is a range of dchar?
> I think that we're misunderstanding each other here. A typical, well-written,
> range-based function which operates on ranges of dchar will use static if or
> overloads to special-case strings. This means that it will function with any
> range of dchar, but it _also_ will be as efficient with strings as if it just
> operated on strings.
It *still* must convert UTF8 to dchars before presenting them to the consumer of
the dchar elements.
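Walter's point is Phobos's auto-decoding behavior: a `string` is stored as UTF-8 code units, but iterated as a range it presents decoded `dchar` elements, so `front` pays the decode cost. A minimal sketch of that behavior (the string literal is illustrative only):

```d
import std.range.primitives : ElementType, front;

void main()
{
    string s = "héllo";   // 'é' is two UTF-8 code units

    // As a range, a string's element type is dchar, not char:
    static assert(is(ElementType!string == dchar));

    assert(s.length == 6);    // six code units...
    assert(s.front == 'h');   // ...but front yields whole code points
}
```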
> It won't decode anything in the string unless it has to.
> So, having a lexer which operates on ranges of dchar does _not_ make string
> processing less efficient. It just makes it so that it can _also_ operate on
> ranges of dchar which aren't strings.
> For instance, my lexer uses this whenever it needs to get at the first
> character in the range:
>
>     static if(isNarrowString!R)
>         Unqual!(ElementEncodingType!R) first = range[0];
>     else
>         dchar first = range.front;
You're requiring a random access input range that has random access to something
other than the range element type?? And you're requiring isNarrowString to
work on an arbitrary range?
> If I need to know the number of code units that make up the code point, I
> explicitly call decode in the case of a narrow string. In either case, code
> units are _not_ being converted to dchar unless they absolutely have to be.
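The decode call referred to here is `std.utf.decode`, which returns one code point and advances an index by however many code units it consumed. A minimal sketch (the string literal is illustrative only):

```d
import std.utf : decode, stride;

void main()
{
    string s = "é!";           // 'é' occupies two UTF-8 code units
    size_t i = 0;

    dchar c = decode(s, i);    // decodes one code point, advances i
    assert(c == 'é');
    assert(i == 2);            // two code units were consumed

    assert(stride(s, i) == 1); // '!' is a single code unit
}
```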
Or you could do away with requiring a special range type and just have it be a
UTF8 range.
What I wasn't realizing earlier was that you were positing a range type that has
two different kinds of elements. I don't think this is a proper component type.
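Lexing directly over a UTF-8 range, as suggested above, might look like the following sketch. `skipIdentifier` and its identifier rules are hypothetical, not taken from dmd or any actual lexer; the point is that ASCII code units are consumed without decoding, and `decode` runs only on non-ASCII lead bytes:

```d
import std.ascii : isAlphaNum;
import std.uni : isAlpha;
import std.utf : decode;

// Advance past an identifier starting at index i, touching dchar only
// when a non-ASCII lead byte forces a decode. (Hypothetical rules:
// ASCII [A-Za-z0-9_] plus any non-ASCII alphabetic code point.)
size_t skipIdentifier(string src, size_t i)
{
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)                    // ASCII: no decoding needed
        {
            if (!isAlphaNum(c) && c != '_')
                break;
            ++i;
        }
        else                             // non-ASCII: decode one code point
        {
            size_t j = i;
            immutable dchar d = decode(src, j);
            if (!isAlpha(d))
                break;
            i = j;
        }
    }
    return i;
}

void main()
{
    assert(skipIdentifier("foo+bar", 0) == 3);
    assert(skipIdentifier("größe x", 0) == 7); // ö and ß are 2 units each
}
```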
> Yes. I understand. It has a mapping of pointers to identifiers. My point is
> that nothing but parsers will need that. From the standpoint of functionality,
> it's a parser feature, not a lexer feature. So, if it can be done just fine in
> the parser, then that's where it should be. If on the other hand, it _needs_
> to be in the lexer for some reason (e.g. performance), then that's a reason to
> put it there.
If you take it out of the lexer, then:
1. the lexer must allocate storage for every identifier, rather than only for
unique identifiers;
2. the parser must then scan the identifier string *again*;
3. and there must be two hash lookups of each identifier rather than one.
It's a suboptimal design.
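The three points above amount to interning identifiers in the lexer. A hypothetical sketch of such a table (illustrative only, not dmd's implementation): each distinct spelling is hashed and allocated once, and repeated spellings come back as the same storage, so downstream code can compare identifiers by pointer instead of by contents.

```d
// Hypothetical identifier-interning table, not dmd's actual code.
struct IdentifierTable
{
    private string[string] pool;

    // One hash lookup per call; allocates only for new spellings.
    string intern(const(char)[] spelling)
    {
        if (auto p = spelling in pool)
            return *p;
        auto s = spelling.idup;
        pool[s] = s;
        return s;
    }
}

void main()
{
    IdentifierTable tab;
    auto a = tab.intern("foo");
    auto b = tab.intern("foo");
    assert(a.ptr is b.ptr);            // same storage: pointer compare
    assert(tab.intern("bar").ptr !is a.ptr);
}
```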