On 02-Aug-12 08:30, Walter Bright wrote:
On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)
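
As an illustration of that adapter idea, a minimal sketch follows; Utf16ToUtf8 and utf16ToUtf8 are hypothetical names, and the input is assumed to be a range of raw wchar code units rather than an auto-decoded wstring:

import std.range : ElementType, isInputRange;
import std.utf : decodeFront, encode;

// Hypothetical adapter: presents a range of UTF-16 code units (wchar)
// as an input range of UTF-8 code units (char).
struct Utf16ToUtf8(R) if (isInputRange!R && is(ElementType!R == wchar))
{
    private R src;
    private char[4] buf;    // one code point is at most 4 UTF-8 bytes
    private size_t len, pos;

    this(R src) { this.src = src; refill(); }

    @property bool empty() const { return pos == len; }
    @property char front() const { return buf[pos]; }

    void popFront()
    {
        if (++pos == len)
            refill();
    }

    private void refill()
    {
        pos = len = 0;
        if (!src.empty)
            len = encode(buf, src.decodeFront()); // one code point at a time
    }
}

auto utf16ToUtf8(R)(R r) { return Utf16ToUtf8!R(r); }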

But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar while still operating on strings efficiently.
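
In sketch form, that special-casing pattern looks something like this (countNewlines is a made-up example function, not Phobos API):

import std.range : ElementType, isInputRange;
import std.traits : isNarrowString;

// Generic path: any range of dchar, one decoded element at a time.
size_t countNewlines(R)(R r)
    if (isInputRange!R && is(ElementType!R : dchar) && !isNarrowString!R)
{
    size_t n;
    foreach (dchar c; r)
        if (c == '\n')
            ++n;
    return n;
}

// Special-cased fast path: walks string/wstring code units directly.
// Safe here because '\n' is ASCII and can never appear inside a
// multi-unit sequence in UTF-8 or UTF-16.
size_t countNewlines(S)(S s)
    if (isNarrowString!S)
{
    size_t n;
    foreach (i; 0 .. s.length)   // indexing bypasses auto-decoding
        if (s[i] == '\n')
            ++n;
    return n;
}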

I have argued against making ranges of characters ranges of dchar, for performance reasons. This will especially hurt the performance of the lexer.

The fact is, files are normally in UTF-8, and just about everything else is in UTF-8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is the premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8, as it just slices the input :)

Otherwise very true, especially for ctRegex, which used to receive quite a bit of hype even in its present state: about 33% of its time is spent doing and redoing UTF-8 decoding. (Note that regex does quite a bit of extra work on top of what a lexer does, e.g. a lexer is largely deterministic, while regex has some try-and-rollback.)
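
To illustrate the slicing point: since the engine tracks code-unit offsets into the original buffer, a match can be handed back as a zero-copy slice, so nothing is ever re-encoded. A toy sketch (Capture is a made-up struct, not the std.regex API):

// A match represented as offsets into the original UTF-8 buffer.
// Extracting it is just a slice; no conversion from dchar back to char.
struct Capture
{
    const(char)[] input;
    size_t begin, end;   // offsets in code units, not code points

    @property const(char)[] hit() const
    {
        return input[begin .. end];
    }
}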

If the lexer is required to accept dchar ranges, its performance will drop by at least half, and people are going to go reinvent their own lexers.


Yes, it slows things down. Decoding (if any) should kick in only where it's absolutely necessary, and it should be an integral part of the lexer automaton.
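
A rough sketch of what that could look like, assuming D source text where every delimiter is ASCII (skipIdentifier and isAsciiIdentChar are hypothetical helpers, not the proposed lexer's API):

import std.uni : isAlpha;
import std.utf : decode;

// Advance past an identifier, decoding only when a non-ASCII lead byte
// shows up. The hot path never touches UTF-8 decoding, because all of
// D's keywords, operators and delimiters are pure ASCII.
void skipIdentifier(const(char)[] src, ref size_t i)
{
    while (i < src.length)
    {
        immutable c = src[i];
        if (c < 0x80)                       // ASCII fast path
        {
            if (!isAsciiIdentChar(c))
                return;
            ++i;
        }
        else                                // rare path: real decoding
        {
            size_t j = i;
            immutable dchar d = decode(src, j);
            if (!isAlpha(d))
                return;
            i = j;
        }
    }
}

bool isAsciiIdentChar(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}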


--
Dmitry Olshansky
