Michel Fortin wrote:
> On 2007-08-13, at 21:56, Allan Odgaard wrote:
> Both. Implementations use look-aheads simply because the syntax often requires looking ahead to see whether a pattern is complete before deciding if something should be taken literally or not. For instance, an asterisk alone doesn't start emphasis unless it has a corresponding one in the same paragraph.

Okay, but the spec could easily be changed to continue emphasis until the end of the current paragraph if no closing * is found. This would IMO be much more logical for document readers and writers (as well as for parsers), since an asterisk 60 lines later would no longer affect how part of the current line is rendered.

Given that `*`s surrounded by spaces are left alone, I don't think this change would have any real downside in terms of unintentional emphasis, and it would make it possible to have syntax highlighting that accurately matches the expected output, which would make my life as a document author much easier.
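To make that concrete, here is an example of my own (not taken from the earlier thread). Take a single hard-wrapped paragraph:

    Take this *word with a grain of salt.
    (fifty-odd more hard-wrapped lines of the same paragraph)
    ending with a stray footnote marker* at the very end.

Under the current rule, the `*` in `*word` starts emphasis only because of the stray `*` after "marker" on the last line; delete that line and the first asterisk reverts to literal text. Under the proposed rule the first `*` would behave the same either way, so a highlighter could decide how to color line one without scanning ahead to line sixty.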

> In the case of the backtick in the other thread, it's clearly not an edge case to me: I'm pretty sure it was designed that way. First, it's much more usable, and second, the code explicitly checks for the surrounding backticks, so it can't really be an oversight. But that's for the other thread...

Well, the specification is clearly ambiguous if it can be interpreted (quite reasonably, AFAICT) in two completely different ways.

> I also agree that parsing Markdown is slower than it should be, but I wonder which unwritten rules you're talking about that make it so slow. When I wrote about looking to the end of the document, I was talking about reference-style links, which are pretty well documented as such. Maybe you had something else in mind?

You don't actually need to look to the end of the document while constructing the parse tree, however. Links that actually have a reference can easily be distinguished from those that don't in a quick pass at the end of parsing.
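Something along these lines is what I have in mind; a Python sketch of my own (the token shapes are invented for illustration, and this is not any existing implementation). The main pass emits provisional reference-link tokens and collects definitions as it goes; a final linear pass then downgrades unresolved references to plain text:

    def resolve_references(tokens, definitions):
        # tokens: (kind, data) pairs produced by the single forward pass.
        # definitions: {lowercased reference id: url}, collected during
        # that same pass.  One linear sweep at the end resolves everything.
        resolved = []
        for kind, data in tokens:
            if kind == 'ref_link':
                text, ref_id = data
                url = definitions.get(ref_id.lower())
                if url is not None:
                    resolved.append(('link', (text, url)))
                else:
                    # [foo][bar] with no definition for "bar": plain text.
                    resolved.append(('text', '[%s][%s]' % (text, ref_id)))
            else:
                resolved.append((kind, data))
        return resolved

No lookahead is needed during tokenization, and the fix-up pass visits each token exactly once.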

> Syntax highlighting isn't the same thing as "parsing" Markdown, not in my mind. It's more like "tokenizing" Markdown, and for that you don't really need to be aware of the whole document at once; you just need to identify things, not interpret them... although it can get complicated when you consider that some things, such as reference-style links, are only links when they have a corresponding definition.

Yes, but tokenizing the document into a parse tree is 90% of the battle… the rest can be handled quickly (i.e. in at worst linear time) at the end.

> I was referring to how HTML browsers mutate the DOM in strange ways while parsing improper HTML, if only to "fix" things such as this:

Yes, but that's not really within the scope of our discussion; I don't particularly care how clearly invalid Markdown is converted. It might be nice to specify (as the HTML5 spec is trying to do for HTML), but document authors don't need to know about it.

> Indeed, I was assuming the parser was doing the output too, or at least generating sequential events (à la SAX). If you're talking about creating the document tree in one pass, I suppose this can work if you allow mutations of the document (as opposed to just appending new elements at the end, as you would with an XML parser).

It requires one pass to create a document tree, and one more pass for a few other things, such as assigning links.

> OK, back to performance. How many times do you start a new Perl process when building the manual? My benchmarks indicate that starting the...

AFAIK it only takes one perl process to run Markdown on a large file...
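If the worry is interpreter startup cost, that is easy to measure directly. Here is a rough sketch of my own in Python (any timing harness would do), comparing many short-lived perl processes against a single one:

    import subprocess
    import time

    N = 50

    # N separate processes: the interpreter startup cost is paid N times.
    t0 = time.perf_counter()
    for _ in range(N):
        subprocess.run(['perl', '-e', '1'], check=True)
    print('%d perl processes: %.2fs' % (N, time.perf_counter() - t0))

    # One process doing the same trivial work: startup is paid once.
    t0 = time.perf_counter()
    subprocess.run(['perl', '-e', '1;' * N], check=True)
    print('1 perl process:  %.2fs' % (time.perf_counter() - t0))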

> I'm not at all convinced that creating a byte-by-byte parser in Perl or PHP is going to be very useful. In my experience, using regular expressions is much faster than executing equivalent PHP code.

This is a silly statement. The compiled-vs.-interpreted distinction has nothing to do with the big-O running time of an algorithm. Real parsers are optimized for this type of task and take linear time (given suitable changes to the Markdown spec), while regular expressions are not (it may even be possible to construct a document that takes exponential time with the current Markdown implementation).
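The exponential blow-up isn't hypothetical for backtracking regex engines in general. Here is the classic demonstration as a Python sketch of my own; Perl's engine backtracks the same way, though I haven't timed Markdown.pl's actual patterns:

    import re
    import time

    # Nested quantifiers like (a+)+ force a backtracking engine to try
    # exponentially many ways of splitting the 'a's before it can fail.
    pattern = re.compile(r'(a+)+b')

    for n in (18, 20, 22, 24):
        s = 'a' * n + 'c'      # no 'b' anywhere, so the match must fail
        t0 = time.perf_counter()
        pattern.match(s)       # anchored at the start of the string
        print('n = %d: %.3fs' % (n, time.perf_counter() - t0))

    # Each extra 'a' roughly doubles the time; a hand-written scanner
    # rejects the same input in a single linear pass.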

> That doesn't mean PHP Markdown and Markdown.pl can't be made faster, but I'd be surprised if they ever reached the speed of TextMate's dedicated parsing state machine.

TextMate's parser has to do quite a bit of extra work that a dedicated parser for a particular format would not have to do. I would not be at all surprised if a Perl or PHP implementation could be made extremely fast.

> You may not like the way Markdown.pl or PHP Markdown works, but we each have to work within the limits of our particular language. Between a pure-PHP state machine and a PHP-regex hybrid design, I'd choose the second if it means I get decent performance. And I don't think a formal grammar will help much in getting better performance out of a PHP or Perl environment, for the reasons outlined above.

I really don't understand where you're coming from on this point. Why would a PHP state machine be so terribly slow?
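To make the state-machine idea concrete, here is a toy single-pass scanner, a Python sketch of my own (deliberately simplified; real inline rules are fussier than this). It tokenizes backtick code spans and asterisk emphasis in one left-to-right pass, touching each character exactly once:

    def tokenize_spans(paragraph):
        # One forward pass over a single paragraph.  `state` is TEXT in
        # ordinary prose, CODE inside `...`, and EM inside *...*.  Every
        # character is handled exactly once, so the pass is linear.
        TEXT, CODE, EM = 'text', 'code', 'em'
        tokens, buf, state = [], [], TEXT

        def emit(kind):
            if buf:
                tokens.append((kind, ''.join(buf)))
                buf.clear()

        for ch in paragraph:
            if state == TEXT and ch == '`':
                emit(TEXT); state = CODE
            elif state == TEXT and ch == '*':
                emit(TEXT); state = EM
            elif state == CODE and ch == '`':
                emit(CODE); state = TEXT
            elif state == EM and ch == '*':
                emit(EM); state = TEXT
            else:
                buf.append(ch)

        if state == TEXT:
            emit(TEXT)
        else:
            # Unclosed span: fall back to literal text (or, under the rule
            # proposed above, emit it as emphasis running to the end of
            # the paragraph).  Either way, no lookahead was ever needed.
            tokens.append((TEXT, ('`' if state == CODE else '*') + ''.join(buf)))
        return tokens

For example, tokenize_spans('foo *bar* and `baz`') yields [('text', 'foo '), ('em', 'bar'), ('text', ' and '), ('code', 'baz')]. It makes no claim to handling Markdown's full emphasis rules, but it shows why a character-at-a-time scan needn't be slow: the work per character is constant.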

> There's a tricky case here, however: [foo][bar] isn't a link in Markdown unless "bar" is defined somewhere; if it isn't defined, it's plain text.

As previously mentioned, that's quite trivial to pick up after the document has been completely tokenized (it's exactly what the resolution pass sketched earlier does).

-Jacob Rus
