Tim Starling has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/287154

Change subject: In removeHTMLtags, silently normalize self-closing tags
......................................................................

In removeHTMLtags, silently normalize self-closing tags

HTML 5 requires a parser to report a parse error for self-closing tags
other than those with a small set of tag names, and to treat them as
start tags. We want neither the parse error nor the start-tag
treatment, since these tags have traditionally been used on Wikipedia
to create empty elements. So normalize them to valid HTML.
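
As an illustration only, the normalization described above can be sketched
as a standalone function. The name normalizeSelfClosing() and its
parameters are hypothetical and do not exist in Sanitizer.php; the lookup
table mirrors $htmlsingleonly from this change and assumes tag names are
stored as array keys so that isset() can test membership.

```php
<?php
// Hypothetical helper, not part of Sanitizer.php. It only illustrates the
// rewrite applied to the closing brace of a sanitized tag.
function normalizeSelfClosing( string $tag, string $attribs, string $brace ): string {
	// Tags that cannot have close tags, mirroring $htmlsingleonly in the
	// patch. Keyed by tag name (an assumption) so isset() works as a lookup.
	$selfClosingAllowed = [
		'br' => true, 'wbr' => true, 'hr' => true, 'meta' => true, 'link' => true,
	];

	if ( $brace === '/>' && !isset( $selfClosingAllowed[$tag] ) ) {
		// HTML 5 would treat this as a start tag, so emit an explicit
		// close tag instead of "/>".
		$brace = "></$tag>";
	}

	return "<$tag$attribs$brace";
}

echo normalizeSelfClosing( 'div', ' class="x"', '/>' ), "\n";
echo normalizeSelfClosing( 'br', '', '/>' ), "\n";
```

Under these assumptions, a self-closing div is expanded into an explicit
empty element, while a self-closing br is passed through unchanged.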

The non-tidy case is not patched here because it is broken in many ways,
beyond easy repair. Its existing treatment of self-closing tags is
wrong but fairly harmless. I might patch it in a subsequent commit.

Also cleaned up an unnecessary dynamic modification to htmlsingle and
htmlsingleonly.

Change-Id: Ia2d97f8dbf21309ac403db2aec19c24e7fc4cc9e
---
M includes/Sanitizer.php
1 file changed, 16 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/core refs/changes/54/287154/1

diff --git a/includes/Sanitizer.php b/includes/Sanitizer.php
index d321e9f..9795be4 100644
--- a/includes/Sanitizer.php
+++ b/includes/Sanitizer.php
@@ -381,14 +381,17 @@
                                'kbd', 'samp', 'data', 'time', 'mark'
                        ];
                        $htmlsingle = [
-                               'br', 'wbr', 'hr', 'li', 'dt', 'dd'
-                       ];
-                       $htmlsingleonly = [ # Elements that cannot have close tags
-                               'br', 'wbr', 'hr'
+                               'br', 'wbr', 'hr', 'li', 'dt', 'dd', 'meta', 'link'
                        ];
 
-                       $htmlsingle[] = $htmlsingleonly[] = 'meta';
-                       $htmlsingle[] = $htmlsingleonly[] = 'link';
+                       # Elements that cannot have close tags. This is (not coincidentally)
+                       # also the list of tags for which the HTML 5 parsing algorithm
+                       # requires you to "acknowledge the token's self-closing flag", i.e.
+                       # a self-closing tag like <br/> is not an HTML 5 parse error only
+                       # for this list.
+                       $htmlsingleonly = [
+                               'br', 'wbr', 'hr', 'meta', 'link'
+                       ];
 
                        $htmlnest = [ # Tags that can be nested--??
                                'table', 'tr', 'td', 'th', 'div', 'blockquote', 'ol', 'ul',
@@ -610,6 +613,13 @@
 
                                                $newparams = Sanitizer::fixTagAttributes( $params, $t );
                                                if ( !$badtag ) {
+                                                       if ( $brace === '/>' && !isset( $htmlsingleonly[$t] ) ) {
+                                                               # Interpret self-closing tags as empty tags even when
+                                                               # HTML 5 would interpret them as start tags. Such input
+                                                               # is commonly seen on Wikimedia wikis with this intention.
+                                                               $brace = "></$t>";
+                                                       }
+
                                                        $rest = str_replace( '>', '&gt;', $rest );
                                                        $text .= "<$slash$t$newparams$brace$rest";
                                                        continue;

-- 
To view, visit https://gerrit.wikimedia.org/r/287154
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia2d97f8dbf21309ac403db2aec19c24e7fc4cc9e
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/core
Gerrit-Branch: master
Gerrit-Owner: Tim Starling <tstarl...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
