On 14.09.2011 1:55, Vladimir Panteleev wrote:
On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky
<dmitry.o...@gmail.com> wrote:
However when it comes to some serious business like matching charsets
and doing ambiguous control flow it starts losing to more superior
design of the default engine.
Ah, okay. I thought compile-time regex was the same as the default
engine, but without the runtime parsing and bytecode interpretation(?).
Let me explain a bit. The default as chosen is fairly advanced NFA
engine which features:
- never backtracks and consequently decodes each UTF8/16/32 character
only once;
- as side effect doesn't support fully combinatorial backreferences
(why would someone use 'em anyway);
- guaranteed O(N*M) time on input with M being "size" of pattern;
This also opens up matching directly on buffered streams... but back to
the point.
There is also a slower traditional backtracking engine, it still faster
then many other engines out there but it ebbs out on some patterns like
.*.*.*Z in text with no 'Z', precisely because of combinatorial nature
of backtracking.
Then C-T is built on top of backtracking, just because it was much
easier this way :)
It's indeed replacing bytecode interpretations with hardcoded code which
is exceptionally fast, but the prime time consumer in real patterns is
usually this memoize/return. C-T regex is still a bit experimental
though and going to be improved over time.
Last but not least there is also completely dumb (as in dead simple and
fast) kickstart engine that does all of heavy lifting iff pattern starts
with something more or less constant (e.g. it does 50% of job in regex-dna).
And you can just parse at compile time and use default run-time engine
for matching:
enum r = regex("whatever");
--
Dmitry Olshansky