On 3 Mar 2008, at 13:30, Michel Fortin wrote:

[...]
1. A regexp that makes the parser enter the context the rule
represents (e.g. block quote, list, raw, etc.).

2. A list of which rules are allowed in the context of this rule.

3. A regexp for leaving the context of this rule.

4. A regexp which is pushed onto a stack when entering the context of
this rule, and popped again when leaving this rule.

The fourth item here is really the interesting part, because it is
what made Markdown nesting work (99% of the time) despite this being
100% rule-driven.

I'm not sure what the regular expression in 4 does, besides being pushed onto and popped from the stack.

Yeah, I accidentally sent the letter w/o noticing I forgot to explain the fourth rule.

The regexps which end up on this stack are used to preprocess the current line, so for example the rule for code blocks is:

    RAW[1] = /\g {4}/          # Four spaces starts raw.

    RAW[2] = [ RAW_TEXT ]      # No other rules are active inside raw,
                               # RAW_TEXT is a dummy .+ rule.

    RAW[4] = /\g( {4}| {,3}$)/ # While in the raw context, we need to eat
                               # the first four spaces of each line, or
                               # the line must be empty.

Two things to notice here:

1. I don’t use an explicit ‘end’ rule since we automatically leave the context if RAW[4] doesn’t successfully match.

2. I use \g instead of ^ since we need to anchor to where the last block-rule stopped matching, not necessarily BOL.
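To make the role of RAW[4] concrete, here is a minimal Python sketch (not the actual implementation). It assumes \g can be emulated by passing the current offset to `Pattern.match`, and the helper name `strip_raw_prefix` is made up for illustration.

```python
import re

# RAW[4] from above; the \g anchor is emulated by matching at an explicit offset.
RAW_CONT = re.compile(r"( {4}| {,3}$)")

def strip_raw_prefix(line, pos=0):
    """Return the offset after the raw prefix, or None if we must leave RAW."""
    m = RAW_CONT.match(line, pos)
    return m.end() if m is not None else None

inside = strip_raw_prefix("    still raw")   # four spaces are eaten
blank = strip_raw_prefix("")                 # an empty line keeps us in raw
outside = strip_raw_prefix("plain text")     # no match: leave the raw context
```

A `None` result plays the part of the implicit ‘end’ rule: the context is left simply because the stacked regexp did not match.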

Now take the rule for block quote:

    BQ[1] = /\g {,3}> {,3}/    # We start it for lines with > allowing
                               # up to 3 spaces before/after.

    BQ[2] = [ BQ, RAW, PAR, … ] # Basically all block elements
                                # can go inside block quote.

    BQ[3] = /\g( *$|«hr»)/     # We leave block quote at empty lines or
                               # horizontal rulers¹. The actual pattern for
                               # «hr» is something like:
                               # [ ]{,3}(?<M>[-*_])([ ]{,2}\k<M>){2,}[ \t]*+$

    BQ[4] = /\g( {,3}> ?)?/    # While in BQ eat leading quote characters.

¹ I am actually not sure if this is “the spec” or just a bug. But placing a horizontal ruler just below a block quoted paragraph does not give the expected “lazy mode”, which would place the <hr> inside the block quote; instead it leaves the block quote.
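For anyone wanting to poke at the «hr» pattern, a rough Python translation follows. It assumes the `{2,}` quantifier applies to the whole group, writes the named group and back-reference in Python's `(?P<M>…)` / `(?P=M)` syntax, and replaces the possessive `*+` with a plain `*` (possessive quantifiers need Python 3.11 or later). The name `HR` is just for this sketch.

```python
import re

# Up to three leading spaces, a marker character, then at least two more
# copies of the same marker, each preceded by at most two spaces.
HR = re.compile(r"[ ]{,3}(?P<M>[-*_])(?:[ ]{,2}(?P=M)){2,}[ \t]*$")

def is_hr(line):
    """True if the whole line looks like a horizontal ruler."""
    return HR.match(line) is not None

rulers = [is_hr("---"), is_hr("* * *"), is_hr("   _  _  _  ")]
others = [is_hr("--"), is_hr("-*-"), is_hr("    ---"), is_hr("- - - text")]
```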

Just to make the example more complete, let us also have a paragraph rule:

    PAR[1] = /\g {,3}(?=[^ >])/          # Any non-special character with fewer
                                         # than 4 leading spaces starts a paragraph.

    PAR[2] = [ B, EM, LINK, TEXT, … ]    # All the inline stuff works in this
                                         # context.

    PAR[3] = /\g(?= {4}| {,3}>| {,3}$)/  # We exit the paragraph when the line
                                         # is starting raw, block quote, or is
                                         # empty. In practice paragraphs do end
                                         # with block quote, but not with raw.

Now we have 3 rules. Be aware that I typed all this just now without actually testing it, and the goal is not to replicate Markdown.pl 100%, just to give an example of how the rule system works.

So our ROOT rule looks like this:

    ROOT[1] = //
    ROOT[2] = [ RAW, BQ, PAR ]

So when we start to process a document, using this root rule, we will get a match (without actually advancing our position in the document, since zero characters were matched).

After this match we have RAW, BQ, and PAR as active rules. Say our document looks like this:

    > A normal paragaph
    >     Some raw text
    > Normal text again

    Out of the block quote

The first line is ‘> A normal paragaph’ and we have 3 rules to apply, BQ[1], RAW[1], and PAR[1].

Since all of these regexps start with \g, they are anchored to the first byte of the document, and only BQ[1] will match.

This “eats” the ‘> ’ prefix, pushes BQ[4] on our stack, and makes BQ, RAW, and PAR our new active rules (yeah, the same as before).

So we now have ‘A normal paragaph’ and again apply our 3 active rules. This time PAR[1] will match. It won’t actually eat any characters, and it won’t push additional rules onto our stack, but it will change the active rules to: B, EM, LINK, TEXT, …
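The two matching steps on the first line can be reproduced with a few lines of Python. This is only a sketch: \g is emulated by the `pos` argument of `Pattern.match`, and the names `START` and `try_rules` are invented for the example.

```python
import re

# Item-1 regexps of the three block rules defined above.
START = {
    "RAW": re.compile(r" {4}"),
    "BQ": re.compile(r" {,3}> {,3}"),
    "PAR": re.compile(r" {,3}(?=[^ >])"),
}

def try_rules(line, pos=0):
    """Map each rule that matches at `pos` to the offset where its match ends."""
    hits = {}
    for name, pattern in START.items():
        m = pattern.match(line, pos)
        if m is not None:
            hits[name] = m.end()
    return hits

line = "> A normal paragaph"
first = try_rules(line)                  # only BQ matches and eats '> '
second = try_rules(line, first["BQ"])    # then only PAR matches, eating nothing
```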

I didn’t define TEXT, but that is a fallback rule for non-special text runs. We apply these rules to the line, and TEXT will match the line.

Now comes the special part. When we move to the next line, which is ‘>     Some raw text’, we start by applying the rules from our stack to this line. We have BQ[4] on the stack, which will eat the leading ‘> ’. The line is now ‘    Some raw text’ and we have no more rules on the stack. Before we apply the active rules, though, we need to check whether to leave the current context, which is PAR. Thus we try to apply PAR[3], and we do get a match, so we leave PAR.

The active rules now revert to those active before we entered PAR, i.e. RAW, BQ, and PAR. Applying these will give a match for RAW, so we eat the match (the leading four spaces), push RAW[4] on the stack, and set the new active rules to RAW[2], i.e. RAW_TEXT.

The line is now ‘Some raw text’ which will be eaten by the RAW_TEXT rule.

Next line is ‘> Normal text again’ and we have both BQ[4] and RAW[4] on the stack. We apply these in FIFO order (oldest first), so first BQ[4], which eats ‘> ’, then RAW[4], which fails to match, instructing us to leave RAW, …

Okay, enough writing — I hope the above gives a better understanding of how the rules are used.
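For readers who prefer code, the walk-through above can be condensed into a small, untested-against-Markdown.pl Python sketch. Everything about its shape is an assumption: the `Rule` fields, the `parse` driver, emulating \g with the `pos` argument of `Pattern.match`, reading the first alternative of PAR[3] as four spaces, and collapsing the inline rules (TEXT, RAW_TEXT, …) into "the rest of the line is text".

```python
import re
from dataclasses import dataclass
from typing import Optional

HR = r"[ ]{,3}(?P<M>[-*_])(?:[ ]{,2}(?P=M)){2,}[ \t]*$"

@dataclass
class Rule:
    name: str
    enter: str                    # item 1: regexp that enters the context
    allowed: tuple = ()           # item 2: block rules allowed inside
    leave: Optional[str] = None   # item 3: regexp that leaves the context
    prefix: Optional[str] = None  # item 4: regexp replayed on every new line

RULES = {r.name: r for r in (
    Rule("ROOT", r"", ("RAW", "BQ", "PAR")),
    Rule("RAW", r" {4}", prefix=r"( {4}| {,3}$)"),
    Rule("BQ", r" {,3}> {,3}", ("BQ", "RAW", "PAR"),
         leave=r"( *$|" + HR + r")", prefix=r"( {,3}> ?)?"),
    Rule("PAR", r" {,3}(?=[^ >])", leave=r"(?= {4}| {,3}>| {,3}$)"),
)}

def at(pattern, line, pos):
    """Emulate \\g: the match must start exactly at `pos`."""
    return re.compile(pattern).match(line, pos)

def parse(lines):
    stack = [RULES["ROOT"]]       # entered contexts, outermost first
    events = []
    for line in lines:
        pos = 0
        # 1. Replay the item-4 regexps, oldest first. A failure leaves that
        #    context and everything nested inside it.
        for depth, rule in enumerate(stack):
            if rule.prefix is None:
                continue
            m = at(rule.prefix, line, pos)
            if m is None:
                events += [("leave", r.name) for r in reversed(stack[depth:])]
                del stack[depth:]
                break
            pos = m.end()
        # 2. Leave contexts whose item-3 regexp matches at this position.
        while stack[-1].leave is not None and at(stack[-1].leave, line, pos):
            events.append(("leave", stack.pop().name))
        # 3. Enter new contexts for as long as an allowed rule matches.
        entered = True
        while entered:
            entered = False
            for name in stack[-1].allowed:
                m = at(RULES[name].enter, line, pos)
                if m is not None:
                    stack.append(RULES[name])
                    events.append(("enter", name))
                    pos, entered = m.end(), True
                    break
        # 4. In a text-only context the remainder of the line is plain text.
        if not stack[-1].allowed and line[pos:]:
            events.append(("text", stack[-1].name, line[pos:]))
    return events

DOC = ["> A normal paragaph", ">     Some raw text", "> Normal text again",
       "", "Out of the block quote"]
events = parse(DOC)
```

Printing `events` gives one tuple per enter, leave, or text run, in the same order as the steps described above; a real implementation would of course build a tree and run the inline rules instead of recording flat events.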

[...] You also need a way for the regular expression in 3 to be variable depending on what you caught in 1 (to match the same number of backticks in a code span for instance; to catch a matching closing HTML tag, etc.).

I allow captures from the match done by 1 to be referenced in 3.
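One possible way to picture this, using the code span case mentioned above, is sketched below in Python. How the captures are actually substituted is not specified here, so building the leave pattern with `re.escape` and the function name `code_span` are assumptions made for illustration.

```python
import re

def code_span(text, pos=0):
    """Return (content, end_offset) of a code span starting at `pos`, or None."""
    opener = re.compile(r"(`+)").match(text, pos)     # the item-1 regexp
    if opener is None:
        return None
    ticks = opener.group(1)
    # The item-3 regexp is built from the capture: the same run of backticks,
    # not adjacent to any further backtick.
    closer = re.compile(r"(?<!`)" + re.escape(ticks) + r"(?!`)")
    m = closer.search(text, opener.end())
    if m is None:
        return None
    return text[opener.end():m.start()], m.end()

single = code_span("`x` and more")
double = code_span("``a ` b`` rest")
```

The single backtick inside the second example does not close the span, because the leave pattern was derived from the two backticks captured on entry.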

_______________________________________________
Markdown-Discuss mailing list
Markdown-Discuss@six.pairlist.net
http://six.pairlist.net/mailman/listinfo/markdown-discuss
