On 29 Jul 2006, at 17:54, A. Pagaltzis wrote:

> I wouldn’t go for a pure formal grammar. If you don’t, then it’s
> easy to tolerate ambiguity in the language by deferring
> disambiguation until possible. Just accumulate potential tokens
> and only assign meaning once it’s decidable.

Personally, I'd do it with multiple passes of tokenization. I'd first tokenize block-level elements and define a rendering procedure for each of these block-level tokens. Then, when span-level parsing is needed inside a block-level token, I'd tokenize the block's text content (with any leading indentation removed as needed) into span-level tokens. This means you'd have two grammars: one to separate block elements, and one to separate span elements.
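To make the two-pass idea concrete, here is a minimal sketch in Python (not PHP Markdown's actual code; the function names and the tiny two-rule grammars are mine, just enough to show the block pass feeding the span pass):

```python
import re

def parse_blocks(text):
    """First pass: split input into block-level tokens. This toy grammar
    knows only paragraphs and code blocks (lines indented four spaces)."""
    blocks = []
    for chunk in re.split(r"\n{2,}", text.strip()):
        if all(line.startswith("    ") for line in chunk.splitlines()):
            # Strip the indentation before storing the code block's content.
            code = "\n".join(line[4:] for line in chunk.splitlines())
            blocks.append(("codeblock", code))
        else:
            blocks.append(("paragraph", chunk))
    return blocks

def parse_spans(text):
    """Second pass, run only on blocks whose content is span-parsed:
    split the text into span tokens (here just `code` spans and text)."""
    tokens = []
    for i, part in enumerate(re.split(r"`([^`]+)`", text)):
        tokens.append(("code" if i % 2 else "text", part))
    return [t for t in tokens if t[1]]

doc = "A paragraph with `code`.\n\n    indented code block"
for kind, content in parse_blocks(doc):
    if kind == "paragraph":
        print(kind, parse_spans(content))
    else:
        # Code blocks are rendered verbatim; the span grammar never sees them.
        print(kind, repr(content))
```

Note how the code block's content never reaches `parse_spans`: each block-level token decides for itself whether the span grammar applies, which is exactly what keeps the two grammars separate.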

I'd like to point out that, in my view, John's implementation is already doing a form of tokenization. The most obvious example is the replacement of HTML blocks by md5 hashes. If you consider each hash as a token, and the text before and after it as text tokens too, you have, in a way, a string composed of tokens. It's not the usual way of working with tokens, but keeping all the tokens inside a single string makes it possible to pass the entire text through a single regular expression.

Markdown then separates blocks and renders them, replaces them in the text with the generated HTML, then reuses the HTML block parser to hash, or "tokenize", what it has just output. (It would be better to hash/tokenize blocks directly instead of relying on the HTML block parser to catch them all later, and this is what PHP Markdown Extra does.)

A similar strategy could be used for span-level elements too. PHP Markdown Extra already creates hashes for some kinds of inline-level tags, which prevents Markdown from interfering with the content of <script> or <math> or <code>. The same strategy could be used with emphasis, links, and other generated markup to prevent invalid nesting. For example, let's create a link the new "tokenized" way from this input:

    __some text [with a link__ oh!](somewhere)

When Markdown encounters the link, it'll use this markdown text:

    with a link__ oh!

When processed with doSpanGamut, the text is unchanged (the lone `__` has no closing marker, so no emphasis is applied). The link is then formed:

    <a href="somewhere">with a link__ oh!</a>

This HTML is then tokenized, i.e. replaced by its md5 hash:

    c168b0c687ed1c4696a41207dd654824

and inserted in the text:

    __some text c168b0c687ed1c4696a41207dd654824

When the actual HTML output is created, hash values are replaced by their corresponding valid HTML strings, and then you have this perfectly valid span-level HTML snippet:

    __some text <a href="somewhere">with a link__ oh!</a>

See? No invalid nesting anymore!

I recognize that md5 hashes are somewhat overkill for this process. In fact, any alphanumeric string which isn't present in the input text is suitable as a "token". You could, for instance, label tokens "x1x", "x2x", "x3x" in their order of insertion: it'd work beautifully, as long as you prevent any literal x<number>x sequence in the input from being mistaken for a token.
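One simple way to guarantee that, sketched below in Python (my own illustration, not anything from Markdown's code): grow the token prefix until no candidate token can possibly occur in the input, then hand out numbered tokens from that prefix.

```python
import itertools
import re

def make_tokenizer(text):
    """Return a function yielding token strings ("x1x", "x2x", ...)
    guaranteed not to collide with anything already in `text`."""
    # If the input contains something that looks like x<number>x,
    # lengthen the prefix (xx1x, xxx1x, ...) until it cannot clash.
    prefix = "x"
    while re.search(re.escape(prefix) + r"\d+x", text):
        prefix += "x"
    counter = itertools.count(1)
    return lambda: "%s%dx" % (prefix, next(counter))

next_token = make_tokenizer("input already containing x1x")
print(next_token())  # prints: xx1x
print(next_token())  # prints: xx2x
```

Such short tokens are cheaper to generate and substitute than md5 hashes, at the cost of this one pre-scan of the input.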

This is far from having a formal grammar, but it shows that a lot more could be done by reusing the current approach.


Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
