On 29 Jul 2006, at 17:54, A. Pagaltzis wrote:

> I wouldn’t go for a pure formal grammar. If you don’t, then it’s
> easy to tolerate ambiguity in the language by deferring
> disambiguation until possible. Just accumulate potential tokens
> and only assign meaning once it’s decidable.

Personally, I'd do it with multiple passes of tokenization. I'd first tokenize block-level elements and define a rendering procedure for each of these block-level tokens. Then, when span-level parsing is needed inside a block-level token, I'd tokenize the block's text content (with any leading indentation removed as needed) into span-level tokens. This means you'd have two grammars: one to separate block elements, and one to separate span elements.
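To make the two-pass idea concrete, here is a minimal sketch in Python (not PHP Markdown's actual code; the function names and the tiny two-rule grammars are mine, just enough to show the block pass feeding the span pass):

```python
import re

def parse_blocks(text):
    """First pass: split input into block-level tokens. This toy grammar
    knows only paragraphs and code blocks (lines indented four spaces)."""
    blocks = []
    for chunk in re.split(r"\n{2,}", text.strip()):
        if all(line.startswith("    ") for line in chunk.splitlines()):
            # Strip the indentation before storing the code block's content.
            code = "\n".join(line[4:] for line in chunk.splitlines())
            blocks.append(("codeblock", code))
        else:
            blocks.append(("paragraph", chunk))
    return blocks

def parse_spans(text):
    """Second pass, run only on blocks whose content is span-parsed:
    split the text into span tokens (here just `code` spans and text)."""
    tokens = []
    for i, part in enumerate(re.split(r"`([^`]+)`", text)):
        tokens.append(("code" if i % 2 else "text", part))
    return [t for t in tokens if t[1]]

doc = "A paragraph with `code`.\n\n    indented code block"
for kind, content in parse_blocks(doc):
    if kind == "paragraph":
        print(kind, parse_spans(content))
    else:
        # Code blocks are rendered verbatim; the span grammar never sees them.
        print(kind, repr(content))
```

Note how the code block's content never reaches `parse_spans`: each block-level token decides for itself whether the span grammar applies, which is exactly what keeps the two grammars separate.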

I'd like to point out that, in my view, John's implementation is already doing a form of tokenization. The most obvious example is the replacement of HTML blocks by md5 hashes. If you consider each hash as a token, and the text before and after it as text tokens too, you have, in a way, a string composed of tokens. It's not the usual way of working with tokens, but keeping all the tokens inside a single string makes it possible to pass the entire text through a single regular expression.

Markdown then separates blocks and renders them, replaces them in the text with the generated HTML, then reuses the HTML block parser to hash, or "tokenize", what it has just output. (It would be better to hash/tokenize blocks directly instead of relying on the HTML block parser to catch them all later, and this is what PHP Markdown Extra does.)

A similar strategy could be used for span-level elements too. PHP Markdown Extra already creates hashes for some kinds of inline-level tags, which prevents Markdown from interfering with the content of <script> or <math> or <code>. The same strategy could be used with emphasis, links, and other generated markup to prevent invalid nesting. For example, let's create a link the new "tokenized" way from this input:

    __some text [with a link__ oh!](somewhere)

When Markdown encounters the link, it'll use this markdown text:

    with a link__ oh!

When processed with doSpanGamut, the text is unchanged (the lone `__` has no closing marker, so no emphasis is applied). The link is then formed:

    <a href="somewhere">with a link__ oh!</a>

This HTML is then tokenized, i.e. replaced by its md5 hash:

    c168b0c687ed1c4696a41207dd654824

and inserted in the text:

    __some text c168b0c687ed1c4696a41207dd654824

When the actual HTML output is created, hash values are replaced by their corresponding valid HTML strings, and then you have this perfectly valid span-level HTML snippet:

    __some text <a href="somewhere">with a link__ oh!</a>

See? No invalid nesting anymore!

I recognize that md5 hashes are somewhat overkill for this process. In fact, any alphanumeric string which isn't present in the input text is suitable as a "token". You could, for instance, label tokens "x1x", "x2x", "x3x" in their order of insertion: it'd work beautifully, as long as you prevent any literal x<number>x sequence in the input from being mistaken for a token.
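One simple way to guarantee that, sketched below in Python (my own illustration, not anything from Markdown's code): grow the token prefix until no candidate token can possibly occur in the input, then hand out numbered tokens from that prefix.

```python
import itertools
import re

def make_tokenizer(text):
    """Return a function yielding token strings ("x1x", "x2x", ...)
    guaranteed not to collide with anything already in `text`."""
    # If the input contains something that looks like x<number>x,
    # lengthen the prefix (xx1x, xxx1x, ...) until it cannot clash.
    prefix = "x"
    while re.search(re.escape(prefix) + r"\d+x", text):
        prefix += "x"
    counter = itertools.count(1)
    return lambda: "%s%dx" % (prefix, next(counter))

next_token = make_tokenizer("input already containing x1x")
print(next_token())  # prints: xx1x
print(next_token())  # prints: xx2x
```

Such short tokens are cheaper to generate and substitute than md5 hashes, at the cost of this one pre-scan of the input.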

This is far from having a formal grammar, but it shows that a lot more could be done by reusing the current approach.


Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/


_______________________________________________
Markdown-Discuss mailing list
[email protected]
http://six.pairlist.net/mailman/listinfo/markdown-discuss
