Eduard Pascual wrote:
(Note: I made a recurring typo in my previous e-mails: XML's CDATA
section is spelled <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The
"<!" sequence is a legacy of SGML's more obscure features. My apologies
if those mistakes caused any confusion, although I hope the idea behind
my posts was clear enough.)
On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <[email protected]> wrote:
<[CDATA[ ... ]]>. This is far easier to
sanitize (you just need to ensure that the input doesn't include the
"]]>" sequence), thus being more usable on user-provided content.
What makes ]]> easier to defend against than </code>?
As I said, with <![CDATA[ ... ]]> you only need to care about the
exact sequence "]]>": if it's found within an input, get rid of it or
somehow fix it (string replacement "]]>" => "]]]]><![CDATA[>" gets
the job done safely). With </code> (or even with Arthur's <cdata>
suggestion, to some degree), things are considerably more complex:
1) an instance of the "</code>" string may be legitimate within the
content (if it closes a matching <code ...> within the content).
2) due to HTML5's error-handling rules, something other than "</code>"
may end up closing the initial <code ...>, so a sanitizer would have
to implement the error-handling rules and play really smart to handle
those cases. I don't know the rules in detail, but IIRC, in something
like <div> <code> </div> the <code> element would be implicitly closed
just before the </div>.
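
To make the CDATA side of that comparison concrete, here is a minimal Python sketch of the single-sequence check. The function name wrap_cdata and the sample payload are made up for illustration. It uses the widely used split form "]]]]><![CDATA[>", which ends the section after "]]" and reopens it before ">", and it assumes the payload contains only characters that are legal in XML.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any "]]>" it contains.

    Hypothetical helper: the only sequence that needs attention is "]]>".
    """
    # Close the section after "]]" and reopen it before ">", so the
    # terminator never appears whole inside a single section.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# The payload contains both "]]>" and "</code>"; neither ends the element.
payload = "if (a[b[0]]>1) { </code> }"
doc = "<code>" + wrap_cdata(payload) + "</code>"

# An XML parser joins the adjacent sections back into the original text.
assert ET.fromstring(doc).text == payload
```

Note that the "</code>" inside the payload needs no treatment at all here, which is the point of the comparison: nothing but "]]>" can end the section.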
That's why I just use DOMDocument (libxml2) for all dynamically
generated code. I don't have to worry about that kind of thing.
User input where markup is allowed is sent through a filter first (HTML
Tidy in XML mode, followed by HTML Purifier) that makes it well-formed
XML. It is then imported into a DOM of its own before the node is
imported into the DOM that is served to the requesting client.
Code injection is a non-issue for me.
It's a little slower, but you can cache the result once it has been
built that way, so performance is an issue only the first time it is
assembled or modified.
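
The "parse into its own DOM, then import the node" step could look roughly like the sketch below. It is written in Python with xml.dom.minidom as a stand-in, not the PHP DOMDocument setup described above, and it leaves out the HTML Tidy and HTML Purifier stages entirely. The function name append_user_markup and the sample page are assumptions made for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Stand-in for the document that is served to the client.
page = minidom.parseString("<html><body><h1>Comments</h1></body></html>")


def append_user_markup(page: minidom.Document, filtered_xml: str) -> None:
    """Hypothetical helper: import already-filtered markup into the page."""
    # Parse the fragment into a DOM of its own. Input that is not
    # well-formed raises ExpatError here, before the page is touched.
    fragment = minidom.parseString(filtered_xml)
    # Copy the parsed node into the page's document, then attach it.
    node = page.importNode(fragment.documentElement, True)
    page.getElementsByTagName("body")[0].appendChild(node)


# Well-formed input: "</code>" inside the text is just character data.
append_user_markup(page, "<p>Hello <code>&lt;/code&gt;</code> world</p>")

# Input that relies on implicit closing is rejected instead of repaired.
try:
    append_user_markup(page, "<div><code></div>")
except ExpatError as err:
    print("rejected:", err)

print(page.documentElement.toxml())
```

Because the page is only ever modified through DOM nodes, user text is serialized as escaped character data and cannot close or open elements in the served markup.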