Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Sean Leonard

As I'm thinking about this, I have other questions:

Can a Markdown parser/processor fail? Is there a concept of Markdown 
validity--i.e., can Markdown content be invalid (from the perspective of 
Markdown, not (X)HTML)?


As I understand it:
A Markdown processor identifies Markdown control sequences (aka 
markdown, in lowercase) in a stream of text and converts these sequences 
to the target markup--namely (X)HTML.
A Markdown processor identifies (X)HTML in markdown and passes this 
content to the target markup.
 <-- Do Markdown processors (i.e., existing implementations) attempt to 
fix or normalize the markup (by deserializing and then reserializing the 
markup), or is it a straight pass? It sounds like whether or not a 
Markdown processor reserializes the markup is implementation-dependent; 
Gruber's syntax rules do not say. However, if you have Markdown in the 
HTML content with markdown="1" as with PHP Markdown Extra, it is 
necessary to parse the HTML with something other than a straight HTML 
parser since the straight HTML parser will misinterpret the Markdown 
(e.g., & will be a validation error).



Therefore:
Markdown has no concept of markdown validity. A Markdown processor never 
fails due to invalid markdown input. If a sequence of text is not 
recognized as markdown (i.e., control sequences), it is treated as text 
and passed accordingly to the target markup. (This property is directly 
related to the "degradation" feature of Markdown, namely, if your 
processor cannot understand the markdown, the output is "worse" than an 
author intended, but does not cause utter failure--the non-understood 
markdown is visible in the output. This is in contrast to HTML, where 
tags or attributes that are not understood have no effect on the 
presentation of the HTML.)


Markdown may have a concept of HTML validity. A Markdown processor that 
identifies HTML in Markdown content may determine that the HTML is valid 
or invalid. For example, it may identify  ... [end of document] as 
HTML that is invalid because it lacks a closing  tag. Then, it has 
five choices:
1. treat the invalid HTML as text--pass the text-as-text to the markup 
(i.e., turn & into &amp; , < into &lt; , etc.)
2. treat the invalid HTML as Markdown--keep on processing the input and 
look for markdown inside of it (thus *hello* inside the invalid HTML 
will get marked up...and href="http://www.example.com/";>hello[end of document] will become a 
real link with the literal text '' preceding it)
  <-- this is the same behavior as "not identifying the text as HTML in 
the first place"

3. pass the invalid HTML as HTML
4. attempt to fix the HTML...thus href="http://www.example.com/";>hello[end of document] might become 
http://www.example.com/";>hello

5. fail due to HTML invalidity

?

Sean

___
Markdown-Discuss mailing list
Markdown-Discuss@six.pairlist.net
http://six.pairlist.net/mailman/listinfo/markdown-discuss


Re: Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Aristotle Pagaltzis
* Sean Leonard  [2014-07-12 16:35]:
> However, if you have Markdown in the HTML content with markdown="1" as
> with PHP Markdown Extra, it is necessary to parse the HTML with
> something other than a straight HTML parser since the straight HTML
> parser will misinterpret the Markdown (e.g., & will be a validation
> error).

That parser is Markdown itself. You can already put Markdown inside
HTML tags, it’s just that normally Markdown will only parse the content
of inline tags like EM and SPAN, not block tags like P or DIV. This was
an explicit design choice. The markdown="1" attribute does nothing more
than turn off this distinction temporarily.
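
As a rough illustration of the inline/block distinction described above, here is a minimal Python sketch. The tag list, the `process` and `inline` names, and the single emphasis rule are assumptions made for this example; it is not how Markdown.pl or PHP Markdown Extra is actually implemented, and it skips paragraph wrapping entirely.

```python
import re

# Hypothetical, shortened list of block-level tags for this sketch.
BLOCK_TAG = re.compile(r"^<(p|div|table|pre|blockquote)\b[^>]*>", re.IGNORECASE)
MARKDOWN_ATTR = re.compile(r'\s+markdown="1"')
EMPHASIS = re.compile(r"\*(.+?)\*")


def inline(text):
    """Apply one stand-in inline rule: *text* becomes <em>text</em>."""
    return EMPHASIS.sub(r"<em>\1</em>", text)


def process(document):
    """Handle each blank-line-separated block in turn."""
    out = []
    for block in document.split("\n\n"):
        opening = BLOCK_TAG.match(block)
        if opening is None:
            # Plain text or inline tags such as <span>: contents are processed.
            out.append(inline(block))
        elif MARKDOWN_ATTR.search(opening.group(0)):
            # markdown="1" lifts the block rule: drop the attribute, process contents.
            out.append(inline(MARKDOWN_ATTR.sub("", block, count=1)))
        else:
            # Block-level HTML is passed through untouched.
            out.append(block)
    return "\n\n".join(out)
```

Under these assumptions, `*b*` inside a plain `<div>` stays literal, while the same text inside `<span>` or inside `<div markdown="1">` is converted.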

(The block tag rule allows you to write portions of your document as
plain old HTML when Markdown is insufficient, and also allows you to
pass stuff through Markdown several times (e.g. fragments in a CMS
getting passed through Markdown at various stages of page assembly)
without screwing up the document. I consider it the smartest choice in
the design of Markdown: the reason it has been adopted where other
syntaxes have remained confined to niches. It means almost any HTML
fragment is also a Markdown fragment, so it’s easy to add Markdown to
any publishing workflow that involves HTML somewhere even if it wasn’t
designed for that at all, and the content can then be ported piecemeal
instead of boil-the-ocean. Classic embrace-and-extend.)

> Therefore:
> Markdown has no concept of markdown validity.

Correct.

> Markdown may have a concept of HTML validity.

Not really. Individual processors may, but Markdown itself has nothing
to say about that. The original implementation of course is implemented
as a text substitution system, which means if you give it Markdown that
contains invalid HTML then you’ll simply get HTML that’s invalid in the
same way, to then be interpreted by the browser however the browser may.
My guess is that the majority of implementations behave equivalently to
this, though depending on their design they could differ completely.

Regards,
-- 
Aristotle Pagaltzis // 


Re: Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Michel Fortin
On 12 Jul 2014, at 10:32, Sean Leonard wrote:

> Markdown may have a concept of HTML validity. A Markdown processor that 
> identifies HTML in Markdown content may determine that the HTML is valid or 
> invalid. For example, it may identify  ... [end of document] as HTML 
> that is invalid because it lacks a closing  tag. Then, it has five 
> choices:
> 1. treat the invalid HTML as text--pass the text-as-text to the markup (i.e., 
> turn & into &amp; , < into &lt; , etc.)
> 2. treat the invalid HTML as Markdown--keep on processing the input and look 
> for markdown inside of it (thus *hello* inside the invalid HTML will get 
> marked up...and http://www.example.com/";>hello[end of 
> document] will become a real link with the literal text '' preceding it)
>  <-- this is the same behavior as "not identifying the text as HTML in the 
> first place"
> 3. pass the invalid HTML as HTML
> 4. attempt to fix the HTML...thus  href="http://www.example.com/";>hello[end of document] might become 
> http://www.example.com/";>hello
> 5. fail due to HTML invalidity
> 
> ?

Is that really a question?

1. Turning `&` and `<` into `&amp;` and `&lt;` is part of the official syntax 
rules. Hopefully every Markdown parser does that.

2. 3. 4. 5. We have implementations doing all of that, probably mixing a few of 
those solutions depending on the exact error.
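
The escaping rule in point 1 can be sketched in Python as follows. This is a minimal sketch, assuming that "already an entity" and "looks like a tag" can be approximated with two lookahead patterns; the `escape_text` name and the exact patterns are choices made for this example, and real implementations may differ in the corner cases.

```python
import re

# "&" that does not already begin an entity such as &amp; or &#169;
# (the entity pattern is an approximation chosen for this sketch).
BARE_AMP = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)")
# "<" that is not followed by a character that could start a tag.
BARE_LT = re.compile(r"<(?![a-zA-Z/?$!])")


def escape_text(text):
    """Escape stray & and < while leaving entities and tags alone."""
    text = BARE_AMP.sub("&amp;", text)
    return BARE_LT.sub("&lt;", text)
```

With these patterns, `AT&T` and `4 < 5` are escaped, while `&copy;` and `<em>x</em>` pass through unchanged.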

When you have a question like this, just try it in Babelmark 2:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E


-- 
Michel Fortin
michel.for...@michelf.ca
http://michelf.ca



Re: Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Waylan Limberg

> On Jul 12, 2014, at 2:52 PM, Michel Fortin  wrote:
> [snip]
> When you have a question like this, just try it in Babelmark 2:
> http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E

Yes, that's what we all do. And to answer your other question, notice that only 
two of the implementations on Babelmark2 failed. Remember, most of these 
implementations were written to be run on web servers. We can't have our web 
servers crashing just because a user submitted invalid markdown. What a parser 
doesn't understand, it just passes through. What it misunderstands, it garbles, 
but it is specifically designed to never choke.

As Michel alluded to, most parsers are simply a series of regular expression 
substitutions which are run in a predetermined order. If a regex never matches 
a part of the text, then that part passes through untouched. Yes, that means 
the HTML is parsed by regex - which we all know is a bad idea -- but it is not 
really parsed in the way that browsers parse HTML. The regex just finds 
anything surrounded by angle brackets and ignores it. With the exception of the 
limited block level stuff, we don't even care if there are opening and/or 
closing tags. Yes, that can result in improperly nested stuff, but that is the 
author's fault and the parser should not bring the whole server down for that. 
The author can (should?) preview in a browser and fix it before publishing.
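
The "series of regular expression substitutions" design can be sketched in a few lines of Python. The three rules and the `convert` name below are hypothetical and heavily reduced; they only illustrate the shape of such a processor (ordered substitutions, no error path) and are not taken from markdown.pl or any port of it.

```python
import re

# Hypothetical, heavily reduced rule set, applied in this fixed order.
# Strong emphasis must run before emphasis so that ** is not read as two *.
RULES = [
    (re.compile(r"\*\*(.+?)\*\*"), r"<strong>\1</strong>"),
    (re.compile(r"\*(.+?)\*"), r"<em>\1</em>"),
    (re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)"), r'<a href="\2">\1</a>'),
]


def convert(text):
    """Run every substitution in order; text that no rule matches is left as-is."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Nothing in this structure can reject its input: an unclosed `<div>` or a lone `*` simply matches no rule and appears unchanged in the output.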

However, I should point out that while the above describes most parsers (as 
most are more or less direct ports of markdown.pl - which works this way), 
there are a few that use other methods under the hood. For example, a few 
generate a parse tree which is then fed into a renderer (I believe Pandoc works 
like that, which allows it to output many more formats than just HTML), but 
they are the rare exception.
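
For contrast, the parse-tree approach can be sketched like this in Python. The `Text` and `Emph` node types, the single emphasis rule, and the two renderers are assumptions made for this example; they say nothing about how Pandoc is built internally, and the renderers do no escaping.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Emph:
    value: str


Node = Union[Text, Emph]
EMPHASIS = re.compile(r"\*(.+?)\*")


def parse(source: str) -> List[Node]:
    """Build a flat list of nodes; anything that is not emphasis becomes Text."""
    nodes: List[Node] = []
    pos = 0
    for match in EMPHASIS.finditer(source):
        if match.start() > pos:
            nodes.append(Text(source[pos:match.start()]))
        nodes.append(Emph(match.group(1)))
        pos = match.end()
    if pos < len(source):
        nodes.append(Text(source[pos:]))
    return nodes


def render_html(nodes: List[Node]) -> str:
    return "".join(
        "<em>" + n.value + "</em>" if isinstance(n, Emph) else n.value for n in nodes
    )


def render_latex(nodes: List[Node]) -> str:
    return "".join(
        "\\emph{" + n.value + "}" if isinstance(n, Emph) else n.value for n in nodes
    )
```

Because the renderers only see the tree, adding another output format means adding another render function, without touching the parser.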

Waylan


Re: Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Sean Leonard

On 7/12/2014 12:31 PM, Waylan Limberg wrote:

>> On Jul 12, 2014, at 2:52 PM, Michel Fortin  wrote:
>> [snip]
>> When you have a question like this, just try it in Babelmark 2:
>> http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
>
> Yes, that's what we all do. And to answer your other question, notice that only 
> two of the implementations on Babelmark2 failed. Remember, most of these 
> implementations were written to be run on web servers. We can't have our web 
> servers crashing just because a user submitted invalid markdown. What a parser 
> doesn't understand, it just passes through. What it misunderstands, it garbles, 
> but it is specifically designed to never choke.
>
> As Michel alluded to, most parsers are simply a series of regular expression 
> substitutions which are run in a predetermined order. If a regex never matches 
> a part of the text, then that part passes through untouched. Yes, that means 
> the HTML is parsed by regex - which we all know is a bad idea -- but it is not 
> really parsed in the way that browsers parse HTML. The regex just finds 
> anything surrounded by angle brackets and ignores it. With the exception of the 
> limited block level stuff, we don't even care if there are opening and/or 
> closing tags. Yes, that can result in improperly nested stuff, but that is the 
> author's fault and the parser should not bring the whole server down for that. 
> The author can (should?) preview in a browser and fix it before publishing.
>
> However, I should point out that while the above describes most parsers (as 
> most are more or less direct ports of markdown.pl - which works this way), 
> there are a few that use other methods under the hood. For example, a few 
> generate a parse tree which is then fed into a renderer (I believe Pandoc works 
> like that, which allows it to output many more formats than just HTML), but 
> they are the rare exception.


I see.

Here is a real-world example of what I was citing:
http://johnmacfarlane.net/babelmark2/?text=Hello+I+am+some+*text*.%0A%3Cdiv%3EHello+%3Ca+href%3D%22http%3A%2F%2Fwww.example.com%2F%22%3Ethat+is+nice%3C%2Fa%3E+chance+%26+circumstance%26hellip%3B%0A%0AThe+end.

Truly, it looks like there is great diversity in Markdown-land.

Ok, so any standard mentioning Historical Markdown cannot say that any 
particular behavior is normative when it comes to HTML validity. Some 
check for HTML (island) validity and behave differently; others don't. 
The end...I guess.


Sean



Re: Markdown validity Re: Agreeing on "Historical Markdown"

2014-07-12 Thread Waylan Limberg

> On Jul 12, 2014, at 6:23 PM, Sean Leonard  wrote:
> 
> On 7/12/2014 12:31 PM, Waylan Limberg wrote:
>>> On Jul 12, 2014, at 2:52 PM, Michel Fortin  wrote:
>>> [snip]
>>> When you have a question like this, just try it in Babelmark 2:
>>> http://johnmacfarlane.net/babelmark2/?normalize=1&text=%3Cdiv%3E
>> Yes, that's what we all do. And to answer your other question, notice that 
>> only two of the implementations on Babelmark2 failed. Remember, most of 
>> these implementations were written to be run on web servers. We can't have 
>> our web servers crashing just because a user submitted invalid markdown. 
>> What a parser doesn't understand, it just passes through. What it 
>> misunderstands, it garbles, but it is specifically designed to never choke.
>> 
>> As Michel alluded to, most parsers are simply a series of regular expression 
>> substitutions which are run in a predetermined order. If a regex never 
>> matches a part of the text, then that part passes through untouched. Yes, 
>> that means the HTML is parsed by regex - which we all know is a bad idea -- 
>> but it is not really parsed in the way that browsers parse HTML. The regex 
>> just finds anything surrounded by angle brackets and ignores it. With the 
>> exception of the limited block level stuff, we don't even care if there are 
>> opening and/or closing tags. Yes, that can result in improperly nested 
>> stuff, but that is the author's fault and the parser should not bring the 
>> whole server down for that. The author can (should?) preview in a browser 
>> and fix it before publishing.
>> 
>> However, I should point out that while the above describes most parsers (as 
>> most are more or less direct ports of markdown.pl - which works this way), 
>> there are a few that use other methods under the hood. For example, a few 
>> generate a parse tree which is then fed into a renderer (I believe Pandoc 
>> works like that, which allows it to output many more formats than just 
>> HTML), but they are the rare exception.
> 
> I see.
> 
> Here is a real-world example of what I was citing:
> http://johnmacfarlane.net/babelmark2/?text=Hello+I+am+some+*text*.%0A%3Cdiv%3EHello+%3Ca+href%3D%22http%3A%2F%2Fwww.example.com%2F%22%3Ethat+is+nice%3C%2Fa%3E+chance+%26+circumstance%26hellip%3B%0A%0AThe+end.
> 
> Truly, it looks like there is great diversity in Markdown-land.
> 
> Ok, so any standard mentioning Historical Markdown cannot say that any 
> particular behavior is normative when it comes to HTML validity. Some check 
> for HTML (island) validity and behave differently; others don't. The end...I 
> guess.

Yes, but select "normalize" (which normalizes insignificant white space in the 
output), and the number of variations decreases. Unfortunately, there is 
absolutely no standardization in how the various implementations handle white 
space (I don't think I've seen two that match exactly in every corner case). 
Either way though, hit the "preview" button (top right of output) to see how 
the browser renders the output and all but a couple render in the browser 
exactly the same.

And that is what makes markdown so great. You don't need to know or understand 
HTML to write it if you are using markdown. And if you have only an elementary 
knowledge of HTML, you can break into HTML on those few occasions when markdown 
won't do what you need.

Waylan