Michel Fortin wrote:
> On 2007-08-13, at 21:56, Allan Odgaard wrote:
> Both. Implementations use look-aheads simply because the syntax often requires looking ahead to see whether a pattern is complete before deciding if something should be taken literally or not. For instance, an asterisk alone doesn't start emphasis unless it has a corresponding one in the same paragraph.

Okay, but the spec could easily be changed to continue emphasis until the end of the current paragraph if no closing * is found. This would IMO be much more logical for document readers and writers (as well as for parsers), since an asterisk 60 lines later would no longer affect how part of the current line is rendered.

Given that `*`s surrounded by spaces are left alone, I don't think this change would have any real downside in terms of unintentional emphasis, and it would make it possible to have syntax highlighting that accurately matches the expected output, which would make my life as a document author much easier.
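To make that concrete, here is an example of my own (not taken from the earlier thread). Take a single hard-wrapped paragraph:

    Take this *word with a grain of salt.
    (fifty-odd more hard-wrapped lines of the same paragraph)
    ending with a stray footnote marker* at the very end.

Under the current rule, the `*` in `*word` starts emphasis only because of the stray `*` after "marker" on the last line; delete that line and the first asterisk reverts to literal text. Under the proposed rule the first `*` would behave the same either way, so a highlighter could decide how to color line one without scanning ahead to line sixty.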

> In the case of the backtick in the other thread, it's clearly not an edge case to me: I'm pretty sure it was designed that way. First, it's much more usable, and second, the code explicitly checks for the surrounding backticks, so it can't really be an oversight. But that's for the other thread...

Well, the specification is clearly ambiguous if it can be interpreted (quite reasonably, AFAICT) in two completely different ways.

> I also agree that parsing Markdown is slower than it should be, but I wonder which unwritten rules you're talking about that make it so slow. When I wrote about looking to the end of the document, I was talking about reference-style links, which are pretty well documented as such. Maybe you had something else in mind?

You don't actually need to look to the end of the document while constructing the parse tree, however. Links that actually have a reference can easily be distinguished from those that don't in a quick pass at the end of parsing.
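Something along these lines is what I have in mind; a Python sketch of my own (the token shapes are invented for illustration, and this is not any existing implementation). The main pass emits provisional reference-link tokens and collects definitions as it goes; a final linear pass then downgrades unresolved references to plain text:

    def resolve_references(tokens, definitions):
        # tokens: (kind, data) pairs produced by the single forward pass.
        # definitions: {lowercased reference id: url}, collected during
        # that same pass.  One linear sweep at the end resolves everything.
        resolved = []
        for kind, data in tokens:
            if kind == 'ref_link':
                text, ref_id = data
                url = definitions.get(ref_id.lower())
                if url is not None:
                    resolved.append(('link', (text, url)))
                else:
                    # [foo][bar] with no definition for "bar": plain text.
                    resolved.append(('text', '[%s][%s]' % (text, ref_id)))
            else:
                resolved.append((kind, data))
        return resolved

No lookahead is needed during tokenization, and the fix-up pass visits each token exactly once.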

> Syntax highlighting isn't the same thing as "parsing" Markdown, not in my mind. It's more like "tokenizing" Markdown, and for that you don't really need to be aware of the whole document at once; you just need to identify things, not interpret them... although it can get complicated when you consider that some things, such as reference-style links, are only links when they have a corresponding definition.

Yes, but tokenizing the document into a parse tree is 90% of the battle… the rest can be handled quickly (i.e. in at worst linear time) at the end.

> I was referring to how HTML browsers mutate the DOM in strange ways while parsing improper HTML, if only to "fix" things such as this:

Yes, but that's not really within the scope of our discussion; I don't particularly care how clearly invalid Markdown is converted. It might be nice to specify (as the HTML5 spec is trying to do for HTML), but document authors don't need to know about it.

> Indeed, I was assuming the parser was doing the output too, or at least generating sequential events (à la SAX). If you're talking about creating the document tree in one pass, I suppose this can work if you allow mutations of the document (as opposed to just appending new elements at the end, as you would with an XML parser).

It requires one pass to create a document tree, and one more pass for a few other things, such as assigning links.

> OK, back to performance. How many times do you start a new Perl process when building the manual? My benchmarks indicate that starting the...

AFAIK it only takes one perl process to run Markdown on a large file...
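If the worry is interpreter startup cost, that is easy to measure directly. Here is a rough sketch of my own in Python (any timing harness would do), comparing many short-lived perl processes against a single one:

    import subprocess
    import time

    N = 50

    # N separate processes: the interpreter startup cost is paid N times.
    t0 = time.perf_counter()
    for _ in range(N):
        subprocess.run(['perl', '-e', '1'], check=True)
    print('%d perl processes: %.2fs' % (N, time.perf_counter() - t0))

    # One process doing the same trivial work: startup is paid once.
    t0 = time.perf_counter()
    subprocess.run(['perl', '-e', '1;' * N], check=True)
    print('1 perl process:  %.2fs' % (time.perf_counter() - t0))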

> I'm not at all convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful. In my experience, using regular expressions is much faster than executing equivalent PHP code.

This is a silly statement. The compiled-vs.-interpreted distinction has nothing to do with the big-O running time of an algorithm. Real parsers are optimized for this type of task and take linear time (given suitable changes to the Markdown spec), while regular expressions are not (it may even be possible to construct a document that takes exponential time with the current Markdown implementation).
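The exponential blow-up isn't hypothetical for backtracking regex engines in general. Here is the classic demonstration as a Python sketch of my own; Perl's engine backtracks the same way, though I haven't timed Markdown.pl's actual patterns:

    import re
    import time

    # Nested quantifiers like (a+)+ force a backtracking engine to try
    # exponentially many ways of splitting the 'a's before it can fail.
    pattern = re.compile(r'(a+)+b')

    for n in (18, 20, 22, 24):
        s = 'a' * n + 'c'      # no 'b' anywhere, so the match must fail
        t0 = time.perf_counter()
        pattern.match(s)       # anchored at the start of the string
        print('n = %d: %.3fs' % (n, time.perf_counter() - t0))

    # Each extra 'a' roughly doubles the time; a hand-written scanner
    # rejects the same input in a single linear pass.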

> That doesn't mean PHP Markdown and Markdown.pl can't be made faster, but I'd be surprised if they ever reached the speed of TextMate's dedicated parsing state machine.

TextMate's parser has to do quite a bit of extra work that a dedicated parser for a particular format would not have to do. I would not be at all surprised if a Perl or PHP implementation could be made extremely fast.

> You may not like the way Markdown.pl or PHP Markdown works, but we each have to work within the limits of our particular language. Between a pure-PHP state machine and a PHP-regex hybrid design, I'd choose the second if it means I get decent performance. And I don't think a formal grammar will help much in getting better performance out of a PHP or Perl environment, for the reasons outlined above.

I really don't understand where you're coming from on this point. Why would a PHP state machine be so terribly slow?
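To make the state-machine idea concrete, here is a toy single-pass scanner, a Python sketch of my own (deliberately simplified; real inline rules are fussier than this). It tokenizes backtick code spans and asterisk emphasis in one left-to-right pass, touching each character exactly once:

    def tokenize_spans(paragraph):
        # One forward pass over a single paragraph.  `state` is TEXT in
        # ordinary prose, CODE inside `...`, and EM inside *...*.  Every
        # character is handled exactly once, so the pass is linear.
        TEXT, CODE, EM = 'text', 'code', 'em'
        tokens, buf, state = [], [], TEXT

        def emit(kind):
            if buf:
                tokens.append((kind, ''.join(buf)))
                buf.clear()

        for ch in paragraph:
            if state == TEXT and ch == '`':
                emit(TEXT); state = CODE
            elif state == TEXT and ch == '*':
                emit(TEXT); state = EM
            elif state == CODE and ch == '`':
                emit(CODE); state = TEXT
            elif state == EM and ch == '*':
                emit(EM); state = TEXT
            else:
                buf.append(ch)

        if state == TEXT:
            emit(TEXT)
        else:
            # Unclosed span: fall back to literal text (or, under the rule
            # proposed above, emit it as emphasis running to the end of
            # the paragraph).  Either way, no lookahead was ever needed.
            tokens.append((TEXT, ('`' if state == CODE else '*') + ''.join(buf)))
        return tokens

For example, tokenize_spans('foo *bar* and `baz`') yields [('text', 'foo '), ('em', 'bar'), ('text', ' and '), ('code', 'baz')]. It makes no claim to handling Markdown's full emphasis rules, but it shows why a character-at-a-time scan needn't be slow: the work per character is constant.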

> There's a tricky case here, however: [foo][bar] isn't a link in Markdown unless "bar" is defined somewhere; if it isn't defined, it's plain text.

As previously mentioned, that's quite trivial to pick up after the document has been completely tokenized (it's exactly what the resolution pass sketched earlier does).

-Jacob Rus
