On Sun, 03 Dec 2006 20:10:34 -0500, Michel Fortin <[EMAIL PROTECTED]> wrote:

> My experience optimizing PHP Markdown, and building the custom mixed Markdown/HTML-block pseudo-tokenizer of PHP Markdown Extra, tells me that it'll probably stay very slow as long as the implementation is made of PHP code.

Yeah, it is. I'm not much of a programmer, but I thought the algorithm too useful not to try implementing it.

> Assuming you've implemented the algorithm in the spec as PHP code, you could probably make it faster by using regular expressions in the tokenization steps instead of iterating character by character. For instance, you could implement many of the tokenizer states by matching from the start of a string with a regex. Maybe it would even be possible to combine a couple of states within the same regex.

This is precisely what I've done. Before I made that optimization, the parser would, more often than not, crash on documents larger than a few kilobytes on my machine.
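
For illustration, collapsing a couple of states into one anchored pattern might look roughly like the sketch below. The token shapes and variable names are invented for the example, and real tag names admit more characters than this pattern allows:

    <?php
    // Simplified sketch: recognize the spec's tag-open and tag-name states
    // with a single anchored match rather than one character per loop pass.
    $input  = 'Hello <em>world</em>'; // example input
    $pos    = 6;                      // scanning position, sitting on '<'
    $tokens = [];

    // The A modifier (PCRE_ANCHORED) pins the match to the offset, so this
    // behaves exactly like inspecting the current character first.
    if (preg_match('%<(/?)([a-zA-Z][a-zA-Z0-9]*)%A', $input, $m, 0, $pos)) {
        $tokens[] = [
            'type' => $m[1] === '/' ? 'endTag' : 'startTag',
            'name' => strtolower($m[2]),
        ];
        $pos += strlen($m[0]);
        // Attributes, "/>", or ">" are still handled afterwards by the
        // ordinary per-character state transitions.
    }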

> The more we replace PHP code with regular expressions, the faster it'll go, but the further we deviate from the processing algorithm described in the spec. I wonder how far we could go while keeping the exact same behaviour.

My pattern optimization is pretty simple: when switching states, the parser first tries to match whatever run of characters will keep the machine in the same state, then acts as normal on the first character that doesn't match. The only deviation from the spec is that it emits one character token per unbroken run rather than one token per character, and since those tokens are merged into a single text node in the tree builder anyway, the deviation is effectively nil.
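
As a rough sketch of what that looks like in the data state (again, the token shape is invented for illustration, not lifted from my parser):

    <?php
    // Illustrative sketch of the run-matching optimization in the data state.
    $input  = 'plain text & <em>markup</em>';
    $pos    = 0;  // current scanning position
    $tokens = [];

    // Any character other than '<' or '&' keeps the machine in the data
    // state, so the whole run can be consumed in one anchored match and
    // emitted as a single character token.
    if (preg_match('/[^<&]+/A', $input, $m, 0, $pos)) {
        $tokens[] = ['type' => 'characters', 'data' => $m[0]]; // 'plain text '
        $pos += strlen($m[0]);
    }
    // $pos now rests on '&', the first character that broke the run; the
    // spec's ordinary per-character transitions take over from here. Since
    // the tree builder merges adjacent character tokens into one text node,
    // the resulting DOM is identical.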

> The truly good solution would be to have a parser implemented in C and available through every standard installation of PHP. It could be used by other languages too.

I am keeping my fingers crossed, hoping that someone much more knowledgeable than I will do this. :)

--
J. King
http://jking.dark-phantasy.com/
