Michel Fortin wrote:
On 2007-08-13 at 21:56, Allan Odgaard wrote:
Both. Implementations use look-aheads simply because the syntax often
requires looking ahead to see if the pattern is complete before deciding
if something should be taken literally or not. For instance, an asterisk
alone doesn't start emphasis unless it has a corresponding one in the
same paragraph.
Okay, but the spec could easily be changed to continue emphasis until
the end of the current paragraph, if no ending * is found. This would
IMO be much more logical for document readers/writers (as well as
computer parsers), as some asterisk 60 lines later would no longer
affect part of the document on the current line.
Given that `*`s surrounded by spaces are left alone, I don't think this
change would have any real downside, in terms of unintentional emphasis,
and it would make it possible to have syntax highlighting which actually
accurately matched the expected output, which would make my life as a
document author much easier.
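
To make the proposed rule concrete, here is a minimal sketch in Python (chosen only for illustration; none of this is how Markdown.pl or PHP Markdown actually works). It assumes a simplified rule set: a `*` followed by non-whitespace opens emphasis, a `*` preceded by non-whitespace closes it, any other `*` is literal, and emphasis still open at the end of a paragraph is closed there. The function names are hypothetical.

```python
def render_emphasis(paragraph):
    """Convert *...* to <em>...</em> in one left-to-right pass.

    Assumed rule (the proposal above, simplified): emphasis that is
    still open when the paragraph ends is closed at that point, so no
    look-ahead for a matching asterisk is ever needed.
    """
    out = []
    in_em = False
    n = len(paragraph)
    for i, ch in enumerate(paragraph):
        if ch != "*":
            out.append(ch)
            continue
        # Paragraph boundaries count as whitespace.
        space_before = i == 0 or paragraph[i - 1].isspace()
        space_after = i == n - 1 or paragraph[i + 1].isspace()
        if in_em and not space_before:
            out.append("</em>")      # closing asterisk
            in_em = False
        elif not in_em and not space_after:
            out.append("<em>")       # opening asterisk
            in_em = True
        else:
            out.append("*")          # e.g. an asterisk surrounded by spaces
    if in_em:
        out.append("</em>")          # unterminated: close at paragraph end
    return "".join(out)


def render_document(text):
    """Treat blank lines as paragraph breaks; each paragraph is independent."""
    return "\n\n".join(render_emphasis(p) for p in text.split("\n\n"))
```

Because each character is looked at once and no paragraph depends on a later one, an asterisk 60 lines further down cannot change how the current line is rendered, which is the property a syntax highlighter would need.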
In the case of the backtick in the other thread, it's clearly not an
edge case to me: I'm pretty sure that it has been designed that way.
First, it's much more useable, and second, the code explicitly checks
for the surrounding backticks so it can't really be an oversight. But
that's for the other thread...
Well, the specification is clearly ambiguous, if it can be interpreted
(pretty reasonably AFAICT) in two completely different ways.
I also agree that parsing Markdown is slower than it should be, but I
wonder which unwritten rules you're talking about that make it so
slow. When I wrote about looking until the end of the document, I was
talking about reference-style links, which are pretty well documented as
such. Maybe you had something else in mind?
You don't actually need to look to the end of the document when
constructing the parse tree, however. Links which actually have a
reference can easily be distinguished from those which don't in a quick
pass at the end of parsing.
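
A rough sketch of that two-pass idea, in Python for illustration only. It assumes a heavily simplified syntax (definitions of the form `[id]: url` on their own line, references of the form `[text][id]`), and the names `tokenize` and `resolve` are hypothetical. Regular expressions are used here just to keep the sketch short; the point is only that the second pass needs no look-ahead during the first.

```python
import re

# Assumed, simplified syntax; real Markdown allows titles, indentation, etc.
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$")
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def tokenize(text):
    """Pass 1: split the text into tokens and collect link definitions.

    Every [text][id] becomes a tentative "ref" token, whether or not
    a definition for id has been seen yet.
    """
    tokens, definitions = [], {}
    for line in text.splitlines():
        m = DEFINITION.match(line)
        if m:
            definitions[m.group(1).lower()] = m.group(2)
            continue
        pos = 0
        for ref in REFERENCE.finditer(line):
            if ref.start() > pos:
                tokens.append(("text", line[pos:ref.start()]))
            label = ref.group(1)
            ref_id = ref.group(2) or label   # [text][] reuses the label as id
            tokens.append(("ref", label, ref_id, ref.group(0)))
            pos = ref.end()
        tokens.append(("text", line[pos:] + "\n"))
    return tokens, definitions


def resolve(tokens, definitions):
    """Pass 2: turn defined refs into links, undefined ones back into plain text."""
    out = []
    for tok in tokens:
        if tok[0] == "ref":
            _, label, ref_id, raw = tok
            url = definitions.get(ref_id.lower())
            out.append(f'<a href="{url}">{label}</a>' if url else raw)
        else:
            out.append(tok[1])
    return "".join(out)
```

For example, with `[bar]` defined after its use and `[qux]` never defined, `resolve(*tokenize(text))` links the first reference and leaves the second as literal text. The second pass is a single walk over the token list, so it is linear in the size of the document.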
Syntax highlighting isn't the same thing as "parsing" Markdown, not in
my mind. It's more like "tokenizing" Markdown, and for that you don't
really need to be aware of the whole document at once; you just need to
identify things, not interpret them... Although it can get complicated
if you consider that some things, such as reference-style links, are only
links when they have a corresponding definition.
Yes, but tokenizing the document into a parse tree is 90% of the battle…
the rest can be quickly (i.e. at worst linear time) handled at the end.
I was referring to how HTML browsers mutate the DOM in strange ways
while parsing improper HTML, if only to "fix" things such as this:
Yes, but that's not really in the scope of our discussion; I don't
particularly care how clearly invalid markdown is converted. It might
be nice to specify (as the HTML5 spec is trying to do for HTML), but
document authors don't need to know about it.
Indeed, I was considering the parser was doing the output too, or at
least generating sequential events (à la SAX). If you're talking about
creating the document tree in one pass, I suppose this can work if you
allow mutations of the document (as opposed to just appending new
elements at the end as you'd do with an XML parser).
It requires one pass to create a document tree, and one more pass for a
few other things, such as assigning links.
Ok, back to performance. How many times do you start a new Perl process
when building the manual? My benchmarks indicate that starting the
AFAIK it only takes starting one perl process to run markdown on a large
file...
I'm totally not convinced that creating a byte-by-byte parser in Perl or
PHP is going to be very useful. Using regular expressions is much faster
than executing equivalent PHP code in my experience.
This is a silly statement. Whether a language is compiled or interpreted
has nothing to do with the big-O run-time of an algorithm. Real parsers are
optimized for this type of task and will take linear time (with suitable
changes in the markdown spec), while regular expressions are not (and
it's even possible that a document could be created that would take
exponential time with the current markdown implementation).
That doesn't mean PHP Markdown and Markdown.pl can't be made faster, but
I'd be surprised if it ever reached the speed of TextMate's dedicated
parsing state machine.
TextMate's parser has to do quite a bit of extra work that a dedicated
parser for a particular format would not have to do. I would not be at
all surprised if a Perl or PHP implementation could be made extremely fast.
You may not like the way Markdown.pl or PHP Markdown works, but we each
have to work within the limits of our particular language. Between a
PHP-made state machine or a PHP-regex hybrid design, I'd choose the
second one if it means I get decent performance. And I don't think a
formal grammar will help much in getting better performance in a PHP or
Perl environment for the reasons outlined above.
I really don't understand where you're coming from on this point. Why
would a PHP state machine be so terribly slow?
There's a tricky case here however: [foo][bar] isn't a link in Markdown
unless "bar" is defined somewhere; if it isn't defined, it's plain text.
As previously mentioned that's quite trivial to pick up after the
document has been completely tokenized.
-Jacob Rus