On Sun, 03 Dec 2006 20:10:34 -0500, Michel Fortin
<[EMAIL PROTECTED]> wrote:
My experience optimizing PHP Markdown, and building the custom mixed
Markdown/HTML-block pseudo-tokenizer of PHP Markdown Extra, tells me
that it'll probably stay very slow as long as the implementation is made
of PHP code.
Yeah, it is. I'm not much of a programmer, but I thought the algorithm
too useful not to try and implement.
Assuming you've implemented the algorithm in the spec as PHP code, you
could probably make it faster by using regular expressions in the
tokenization steps instead of iterating character by character. For
instance, you could implement many of the tokenizer states by matching
from the start of a string with a regex. And maybe then it'll also be
possible to combine a couple of states within the same regex too.
This is precisely what I've done. Before I made that optimization, the
parser would crash more often than not on a document larger than a few
kilobytes on my machine.
The more we replace PHP code with regular expressions, the faster it'll
go, but the further we deviate from the processing algorithm described in
the spec. I wonder how far we could go while keeping the exact same
behaviour.
My pattern optimization is pretty simple: when switching states the parser
first tries matching whatever range of characters will keep the machine in
the same state, and then acts as normal on the first character that
doesn't match. There is, effectively, next to no deviation from the spec
short of emitting one char token per unbroken string rather than one token
per character. Since the tokens are merged into one text node in the tree
builder anyway, the deviation is essentially nil.
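
To make the run-matching idea concrete, here is a minimal sketch. It is
written in Python rather than PHP purely for brevity, and it is not the
actual parser code: the names DATA_RUN and tokenize_data are hypothetical,
and the "data" state is simplified so that only "<" and "&" count as
characters that leave it.

```python
import re

# Assumption: in this simplified "data" state, only "<" (tag open) and
# "&" (character reference) force a state change. Every other character
# keeps the machine in the same state.
DATA_RUN = re.compile(r"[^<&]+")


def tokenize_data(text):
    """Yield (kind, value) tokens for a simplified data state."""
    pos = 0
    while pos < len(text):
        # First try to match the whole run that stays in this state.
        run = DATA_RUN.match(text, pos)
        if run:
            # Emit one character token for the unbroken run instead of
            # one token per character.
            yield ("chars", run.group())
            pos = run.end()
            continue
        # The first character that does not match is handled as normal.
        # A real tokenizer would switch states here; this sketch only
        # reports the character and carries on in the same state.
        yield ("special", text[pos])
        pos += 1


tokens = list(tokenize_data("Hello <b>world</b> &amp; more"))
```

Concatenating the token values gives back the original input, which is
why merging the character tokens into a single text node in the tree
builder should hide the difference from a per-character tokenizer.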
The true good solution would be to have a parser implemented in C and
available through every standard installation of PHP. It could be used
by other languages too.
I am keeping my fingers crossed, hoping that someone much more
knowledgeable than I will do this. :)
--
J. King
http://jking.dark-phantasy.com/