Tim Starling has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/287154

Change subject: In removeHTMLtags, silently normalize self-closing tags
......................................................................

In removeHTMLtags, silently normalize self-closing tags

HTML 5 requires that self-closing tags, other than those with a small
set of tag names, trigger a parse error when seen and be treated as
start tags. We don't want the parse error, and we don't want these tags
to be treated as start tags, since they have traditionally been used on
Wikipedia to create empty tags. So normalize them to valid HTML.

The non-tidy case is not patched here because it is broken beyond easy
repair in many ways. Its existing treatment of self-closing tags is
wrong but fairly harmless. I might patch it in a subsequent commit.

Also cleaned up an unnecessary dynamic modification to htmlsingle and
htmlsingleonly.

Change-Id: Ia2d97f8dbf21309ac403db2aec19c24e7fc4cc9e
---
M includes/Sanitizer.php
1 file changed, 16 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/core refs/changes/54/287154/1

diff --git a/includes/Sanitizer.php b/includes/Sanitizer.php
index d321e9f..9795be4 100644
--- a/includes/Sanitizer.php
+++ b/includes/Sanitizer.php
@@ -381,14 +381,17 @@
 				'kbd', 'samp', 'data', 'time', 'mark'
 			];
 			$htmlsingle = [
-				'br', 'wbr', 'hr', 'li', 'dt', 'dd'
-			];
-			$htmlsingleonly = [ # Elements that cannot have close tags
-				'br', 'wbr', 'hr'
+				'br', 'wbr', 'hr', 'li', 'dt', 'dd', 'meta', 'link'
 			];
 
-			$htmlsingle[] = $htmlsingleonly[] = 'meta';
-			$htmlsingle[] = $htmlsingleonly[] = 'link';
+			# Elements that cannot have close tags. This is (not coincidentally)
+			# also the list of tags for which the HTML 5 parsing algorithm
+			# requires you to "acknowledge the token's self-closing flag", i.e.
+			# a self-closing tag like <br/> is not an HTML 5 parse error only
+			# for this list.
+			$htmlsingleonly = [
+				'br', 'wbr', 'hr', 'meta', 'link'
+			];
 
 			$htmlnest = [ # Tags that can be nested--??
 				'table', 'tr', 'td', 'th', 'div', 'blockquote', 'ol', 'ul',
@@ -610,6 +613,13 @@
 
 					$newparams = Sanitizer::fixTagAttributes( $params, $t );
 					if ( !$badtag ) {
+						if ( $brace === '/>' && !isset( $htmlsingleonly[$t] ) ) {
+							# Interpret self-closing tags as empty tags even when
+							# HTML 5 would interpret them as start tags. Such input
+							# is commonly seen on Wikimedia wikis with this intention.
+							$brace = "></$t>";
+						}
+
 						$rest = str_replace( '>', '&gt;', $rest );
 						$text .= "<$slash$t$newparams$brace$rest";
 						continue;

-- 
To view, visit https://gerrit.wikimedia.org/r/287154
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia2d97f8dbf21309ac403db2aec19c24e7fc4cc9e
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/core
Gerrit-Branch: master
Gerrit-Owner: Tim Starling <tstarl...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits