Re: evolving the spec (was: forking Markdown.pl?)

Allan Odgaard Mon, 03 Mar 2008 21:49:34 -0800

On 3 Mar 2008, at 13:30, Michel Fortin wrote:

[...]

1. A regexp that makes the parser enter the context the rule
represents (e.g. block quote, list, raw, etc.).


2. A list of which rules are allowed in the context of this rule.

3. A regexp for leaving the context of this rule.

4. A regexp which is pushed onto a stack when entering the context of
this rule, and popped again when leaving this rule.

The fourth item here is really the interesting part, because it is
what made Markdown nesting work (99% of the time) despite this being
100% rule-driven.
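A minimal sketch of how such a four-part rule could be represented, here in Python. The field names (enter, allowed, leave, cont) and the QUOTE example with its placeholder regexps are assumptions made for illustration; the mail describes the four items, not a concrete data structure.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rule:
    name: str
    enter: re.Pattern                                  # item 1: enters the context
    allowed: List[str] = field(default_factory=list)   # item 2: rule names allowed inside
    leave: Optional[re.Pattern] = None                 # item 3: leaves the context
    cont: Optional[re.Pattern] = None                  # item 4: pushed on entry, popped on exit


# Hypothetical rule, only to show the shape; these are not the real regexps.
quote = Rule(
    name="QUOTE",
    enter=re.compile(r">"),
    allowed=["QUOTE", "TEXT"],
    cont=re.compile(r">?"),
)
```

Items 3 and 4 are optional here because a rule may rely on only one of them to decide when its context ends.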

I'm not sure what the regular expression in 4 does, besides being pushed and popped from the stack

Yeah, I accidentally sent the letter w/o noticing I forgot to explain the fourth rule.

The regexps which end up on this stack are used to preprocess the current line, so for example the rule for code blocks is:


    RAW[1] = /\g {4}/           # Four spaces starts raw.

    RAW[2] = [ RAW_TEXT ]       # No other rules are active inside raw,
                                # RAW_TEXT is a dummy .+ rule.

    RAW[4] = /\g( {4}| {,3}$)/  # While in the raw context, we need to
                                # eat the first four spaces of each
                                # line, or the line must be empty.


Two things to notice here:

1. I don’t use an explicit ‘end’ rule since we automatically leave the context if RAW[4] doesn’t successfully match.

2. I use \g instead of ^ since we need to anchor to where the last block-rule stopped matching, not necessarily BOL.
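To illustrate the anchoring point: Python's re module has no \g, but passing a start position to match() gives a similar "anchor at the current position" behaviour. This is only an illustrative stand-in, not the regexp engine the rules were written for; the names RAW_ENTER and RAW_CONT, the sample line, and writing {,3} as {0,3} are my assumptions.

```python
import re

RAW_ENTER = re.compile(r" {4}")               # RAW[1] without the \g
RAW_CONT = re.compile(r"(?: {4}| {0,3}$)")    # RAW[4] without the \g

line = ">     Some raw text"
pos = 2                                       # just past a leading "> "

# match(string, pos) only tries the pattern exactly at pos.
anchored = RAW_ENTER.match(line, pos)

# Tried at the beginning of the line, the same pattern sees ">" and fails.
at_bol = RAW_ENTER.match(line, 0)

# ^ keeps meaning the real beginning of the line, so it cannot stand in
# for "where the previous block rule stopped matching".
caret = re.compile(r"^ {4}").match(line, pos)

# RAW[4] accepts an empty line but rejects unindented text.
empty_ok = RAW_CONT.match("", 0) is not None
text_ok = RAW_CONT.match("text", 0) is not None
```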


Now take the rule for block quote:

    BQ[1] = /\g {,3}> {,3}/    # We start it for lines with > allowing
                               # up to 3 spaces before/after.

    BQ[2] = [ BQ, RAW, PAR, … ] # Basically all block elements
                                # can go inside block quote.

    BQ[3] = /\g( *$|«hr»)/     # We leave block quote at empty lines or
                               # horizontal rulers¹. The actual pattern
                               # for «hr» is something like:
                               # [ ]{,3}(?<M>[-*_])([ ]{,2}\k<M>){2,}[ \t]*+$

    BQ[4] = /\g( {,3}> ?)?/    # While in BQ eat leading quote
                               # characters.

¹ I am actually not sure if this is “the spec” or just a bug. But placing a horizontal ruler just below a block quoted paragraph does not give the expected “lazy mode”, which would place the <hr> inside the block quote; instead it leaves the block quote.
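For readers who want to try the «hr» pattern, here is a hedged transcription into Python's re syntax. The named group becomes (?P<M>…), the back-reference becomes (?P=M), {,n} is written {0,n}, and the possessive *+ is replaced by a plain * (before a final $ this should accept the same lines). The helper name is_hr is an assumption.

```python
import re

# Up to 3 leading spaces, a marker character, then at least two more of
# the same marker, each preceded by at most 2 spaces; trailing blanks ok.
HR = re.compile(r"[ ]{0,3}(?P<M>[-*_])(?:[ ]{0,2}(?P=M)){2,}[ \t]*$")


def is_hr(line, pos=0):
    """Return True if a horizontal ruler starts at pos (hypothetical helper)."""
    return HR.match(line, pos) is not None
```

With this, is_hr("> ---", 2) tests the text that remains once a leading ‘> ’ has been eaten.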

Just to make the example more complete, let us also have a paragraph rule:

    PAR[1] = /\g {,3}(?=[^ >])/         # Any non-special character with
                                        # less than 4 leading spaces
                                        # starts a paragraph.

    PAR[2] = [ B, EM, LINK, TEXT, … ]   # All the inline stuff works in
                                        # this context.

    PAR[3] = /\g(?=    | {,3}>| {,3}$)/ # We exit the paragraph when the
                                        # line is starting raw,
                                        # blockquote, or is empty. In
                                        # practice paragraphs do end with
                                        # block quote, but not with raw.

Now we have 3 rules. Be aware I typed all this just now without actual testing, and the goal is not to replicate Markdown.pl 100%, just to give an example of how the rule-system works.


So our ROOT rule looks like this:

    ROOT[1] = //
    ROOT[2] = [ RAW, BQ, PAR ]

So when we start to process a document, using this root rule, we will get a match (without actually advancing our position in the document, since zero characters were matched).

After this match we have RAW, BQ, and PAR as active rules. Say our document looks like this:


    > A normal paragaph
    >     Some raw text
    > Normal text again

    Out of the block quote

The first line is ‘> A normal paragaph’ and we have 3 rules to apply, BQ[1], RAW[1], and PAR[1].

Since all of these regexps start with \g, they are anchored to the first byte of the document, and only BQ[1] will match.

This “eats” the ‘> ’ prefix, pushes BQ[4] on our stack, and makes BQ, RAW, and PAR our new active rules (yeah, the same as before).

So we now have ‘A normal paragaph’ and again apply our 3 active rules. This time PAR[1] will match; it won’t actually eat any characters, and it won’t push additional rules onto our stack, but it will change the active rules to: B, EM, LINK, TEXT, …
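The two entry steps just described can be sketched like this. It is an untested-against-Markdown.pl illustration: the enter() helper, trying the rules in list order with the first match winning, and anchoring via match(line, pos) instead of \g are all my assumptions.

```python
import re

# Entry regexps (item 1 of each rule), with {,3} written as {0,3}.
ENTER = [
    ("RAW", re.compile(r" {4}")),
    ("BQ", re.compile(r" {0,3}> {0,3}")),
    ("PAR", re.compile(r" {0,3}(?=[^ >])")),
]


def enter(line, pos):
    """Try each active rule at pos; return (rule name, new position) or None."""
    for name, pattern in ENTER:
        m = pattern.match(line, pos)
        if m:
            return name, m.end()
    return None


line = "> A normal paragaph"
first = enter(line, 0)           # BQ matches and eats the "> " prefix
second = enter(line, first[1])   # PAR matches without eating anything
```

Here second reports the same position it was given, which is the zero-width match of PAR[1]: the lookahead checks the next character but does not consume it.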

I didn’t define TEXT, but that is a fallback rule for non-special text-runs. We apply these rules to the line, and TEXT will match the line.

Now comes the special part. When we move to the next line, which is ‘>     Some raw text’, we start by applying the rules from our stack to this line. We have BQ[4] on the stack, which will eat the leading ‘> ’. The line is now ‘    Some raw text’ and we have no more rules on the stack. Before we apply the active rules though, we need to check if we need to leave the current context, which is PAR. Thus we try to apply PAR[3], and we do get a match, so we leave PAR.

The active rules now revert to those active before we entered PAR, i.e. RAW, BQ, and PAR. Applying these will give a match for RAW, so we eat the match (the leading four spaces), push RAW[4] on the stack, and set the new active rules to RAW[2], i.e. RAW_TEXT.

The line is now ‘Some raw text’ which will be eaten by the RAW_TEXT rule.

Next line is ‘> Normal text again’ and we have both BQ[4] and RAW[4] on the stack. We apply these in FIFO order, so first BQ[4], which eats ‘> ’, then RAW[4], which fails to match, instructing us to leave RAW, …
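The per-line preprocessing with the stack could look roughly like the sketch below. The function name apply_stack, its return value (position reached, number of stack entries that still matched), and the use of match(line, pos) in place of \g are assumptions for illustration only.

```python
import re

BQ_CONT = re.compile(r"(?: {0,3}> ?)?")      # BQ[4]
RAW_CONT = re.compile(r"(?: {4}| {0,3}$)")   # RAW[4]


def apply_stack(line, stack):
    """Apply the stacked regexps oldest-first to the start of a line.

    Returns (pos, kept): pos is where the remaining text starts, and kept
    is how many stack entries matched. If kept < len(stack), the contexts
    from index kept upwards must be left.
    """
    pos = 0
    for depth, (_name, pattern) in enumerate(stack):
        m = pattern.match(line, pos)
        if m is None:
            return pos, depth
        pos = m.end()
    return pos, len(stack)


# Second line of the example: only BQ[4] is on the stack.
pos2, kept2 = apply_stack(">     Some raw text", [("BQ", BQ_CONT)])

# Third line: BQ[4] then RAW[4]; RAW[4] fails, so RAW has to be left.
stack = [("BQ", BQ_CONT), ("RAW", RAW_CONT)]
pos3, kept3 = apply_stack("> Normal text again", stack)
stack = stack[:kept3]
```

Note that BQ[4] is entirely optional, so it never fails here; leaving the block quote is the job of BQ[3].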

Okay, enough writing. I hope the above gives a better understanding of how the rules are used.

[...] You also need a way for the regular expression in 3 to be variable depending on what you caught in 1 (to match the same number of backticks in a code span for instance; to catch a matching closing HTML tag, etc.).


I allow captures from the match done by 1 to be referenced in 3.
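One way to picture this for the code span case, as a sketch only: the ‘leave’ regexp is built from what the ‘enter’ regexp captured. The names CODE_ENTER and leave_pattern and the lookarounds that require exactly the same number of backticks are assumptions, not how any particular implementation does it.

```python
import re

CODE_ENTER = re.compile(r"(`+)")     # item 1: capture the opening backticks


def leave_pattern(enter_match):
    """Build item 3 from the capture of item 1 (hypothetical helper)."""
    ticks = re.escape(enter_match.group(1))
    # Same run of backticks, not part of a longer run.
    return re.compile(r"(?<!`)" + ticks + r"(?!`)")


text = "``a ` b`` tail"
opened = CODE_ENTER.match(text)
closed = leave_pattern(opened).search(text, opened.end())
code = text[opened.end():closed.start()]
```

The single backtick inside the span is skipped because the derived pattern only accepts a run of two.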

_______________________________________________
Markdown-Discuss mailing list
Markdown-Discuss@six.pairlist.net
http://six.pairlist.net/mailman/listinfo/markdown-discuss
