On 14.09.2011 1:55, Vladimir Panteleev wrote:
On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky
<dmitry.o...@gmail.com> wrote:

However when it comes to some serious business like matching charsets
and doing ambiguous control flow it starts losing to more superior
design of the default engine.

Ah, okay. I thought compile-time regex was the same as the default
engine, but without the runtime parsing and bytecode interpretation(?).


Let me explain a bit. The default as chosen is fairly advanced NFA engine which features: - never backtracks and consequently decodes each UTF8/16/32 character only once; - as side effect doesn't support fully combinatorial backreferences (why would someone use 'em anyway);
        - guaranteed O(N*M) time on input with M being "size" of pattern;

This also opens up matching directly on buffered streams... but back to the point.

There is also a slower traditional backtracking engine, it still faster then many other engines out there but it ebbs out on some patterns like .*.*.*Z in text with no 'Z', precisely because of combinatorial nature of backtracking.

Then C-T is built on top of backtracking, just because it was much easier this way :) It's indeed replacing bytecode interpretations with hardcoded code which is exceptionally fast, but the prime time consumer in real patterns is usually this memoize/return. C-T regex is still a bit experimental though and going to be improved over time.

Last but not least there is also completely dumb (as in dead simple and fast) kickstart engine that does all of heavy lifting iff pattern starts with something more or less constant (e.g. it does 50% of job in regex-dna).

And you can just parse at compile time and use default run-time engine for matching:
enum r = regex("whatever");


--
Dmitry Olshansky

Reply via email to