Subramanya Sastry has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/87019


Change subject: Improved handling of leading whitespace during serialization
......................................................................

Improved handling of leading whitespace during serialization

* "Normally", leading whitespace in wikitext trigger indent-pres.
  However, there are some exceptions to this rule. This behavior
  is captured in information about weak and strong indent-pre
  suppressing tags. Weak tags only suppress indent-pres in their
  immediate DOM children (caveat: while ignoring any zero-width
  wikitext nodes). Strong tags suppress indent-pres in the entire
  DOM subtree they generate.

  Ex: Whitespace before <td> wikitext (" |foo") will not generate
  pre-wraps. But, whitespace inside <td> will generate pre-wraps
  ("|\n foo"). So, <table>, <tbody>, <tr> are all weak indent-pre
  suppressing tags.

  Ex: <blockquote>, <ref> extension tag are examples of strong
  indent-pre suppressing tags. Indent-pres generation is entirely
  turned off in these tags.

  In addition, block tags suppress indent-pres on the lines
  they are present.

  This patch updates wikitext.constants.js with the right set of
  weak/strong indent-pre suppressing tags.

* makeSeparator in the serializer tries to account for leading
  white-space in separator text. However, it needs to know whether
  the separator is being generated between siblings or between
  a parent and child. The leading-whitespace-stripping behavior
  differs in these scenarios. This patch carries over this information
  to makeSeparator (and appropriately updates it in selser mode)
  to account for block tags.

* There are still gaps in our handling (there is a FIXME in the
  code as well as a new failing test that I've added to the
  blacklist). We dont need to address that immediately but as we
  continue to update our support for serialization of arbitrary
  HTML, we will have to tackel it.

* Added various tests for serialization of new content that will
  exercise these different code paths.

* Test result changes:
  - One html2wt failure is because of missing support for arbitrary
    HTML (see above).

  - One additional selser failure is because selser is more accurate
    than wt2wt (which strips leading whitespace unnecessarily).
    "Extra newlines: More paragraphs with indented comment [3,4,0,2,0]"

  - One additional passing selser test is because of improved
    indent-pre handling

  - Two new html2wt tests pass with this patch (but would have
    failed without it).

Change-Id I518237a2aec3c54281b3641d1a7e4a3c76331686

Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0
---
M js/lib/mediawiki.WikitextSerializer.js
M js/lib/mediawiki.wikitext.constants.js
M js/tests/parserTests-blacklist.js
M js/tests/parserTests.txt
4 files changed, 191 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Parsoid 
refs/changes/19/87019/1

diff --git a/js/lib/mediawiki.WikitextSerializer.js 
b/js/lib/mediawiki.WikitextSerializer.js
index 63613b7..62db9d0 100644
--- a/js/lib/mediawiki.WikitextSerializer.js
+++ b/js/lib/mediawiki.WikitextSerializer.js
@@ -3574,7 +3574,7 @@
  * Create a separator given a (potentially empty) separator text and newline
  * constraints
  */
-WSP.makeSeparator = function(sep, node, nlConstraints, state) {
+WSP.makeSeparator = function(sep, nlConstraints, state) {
        var origSep = sep;
 
                // TODO: Move to Util?
@@ -3634,21 +3634,74 @@
        //      sep.replace(/^([^\n<]*<!--(?:[^\-]|-(?!->))*-->)?[^\n<]+/g, 
'$1');
        //}
 
-       // Strip non-nl ws from last line, but preserve comments
-       // This avoids triggering indent-pres.
+       // SSS FIXME: 'nlConstraints.min > 0' only applies to nodes that have 
native wikitext
+       // equivalents. For HTML nodes that will be serialized as HTML tags, we 
have to check
+       // for indent-pre safety always. This is currently not handled by the 
code below and
+       // will require other fixes to handle arbitrary HTML. To be done later! 
See example
+       // below that fails.
        //
-       // 'node' has min-nl constraint, but we dont know that 'node' is 
pre-safe.
-       // SSS FIXME: The check for 'node.nodeName in preSafeTags' should be 
possible
-       // at a nested level rather than just 'node'.  If 'node' is an 
IEW/comment,
-       // we should find the "next" (at this and and ancestor levels), the 
non-sep
-       // sibling and check if that node is one of these types.
-       //
-               // SSS FIXME: how is it that parentNode can be null??  is body 
getting here?
-       var parentName = node.parentNode && node.parentNode.nodeName;
-       if (nlConstraints.min > 0 && !(node.nodeName in Consts.PreSafeTags)) {
-               sep = 
sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1');
+       // Ex: "<div>foo</div>\n <span>bar</span>"
+       if (nlConstraints.min > 0 && 
sep.match(/[^\n<>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g)) {
+               // 'sep' is the separator before 'nodeB' and it has leading 
spaces on a newline.
+               // We have to decide whether that leading space will trigger 
indent-pres in wikitext.
+               // The decision depends on where this separator will be emitted 
relative
+               // to 'nodeA' and 'nodeB'.
+
+               var isIndentPreSafe = false,
+                       constraintInfo = nlConstraints.constraintInfo || {};
+
+               // Example of sibling sepType scenario:
+               // <p>foo</p>
+               //  <span>bar</span>
+               // The span will be wrapped in an indent-pre if the leading 
space
+               // is not stripped since span is not a block tag
+               //
+               // Example of child-parent sepType scenario:
+               // <span>foo
+               //  </span>bar
+               // The " </span>bar" will be wrapped in an indent-pre if the
+               // leading space is not stripped since span is not a block tag
+               if ((constraintInfo.sepType === 'sibling' ||
+                       constraintInfo.sepType === 'child-parent') &&
+                       Util.isBlockTag(constraintInfo.nodeB.nodeName))
+               {
+                       isIndentPreSafe = true;
+               } else if (constraintInfo.sepType === 'parent-child') {
+                       // Separator between parent node and child node
+                       var node = constraintInfo.nodeA;
+
+                       // Walk up past zero-wikitext width ancestors
+                       while (node.nodeName in Consts.ZeroWidthWikitextTags) {
+                               node = node.parentNode;
+                       }
+
+                       // Deal with weak/strong indent-pre suppressing tags
+                       if (node.nodeName in 
Consts.WeakIndentPreSuppressingTags) {
+                               isIndentPreSafe = true;
+                       } else {
+                               // Strong indent-pre suppressing tags suppress 
indent-pres
+                               // in entire DOM subtree rooted at that node
+                               while (node.nodeName !== 'BODY') {
+                                       if (node.nodeName in 
Consts.StrongIndentPreSuppressingTags) {
+                                               isIndentPreSafe = true;
+                                       }
+                                       node = node.parentNode;
+                               }
+                       }
+               }
+
+               if (!isIndentPreSafe) {
+                       // Strip non-nl ws from last line, but preserve 
comments.
+                       // This avoids triggering indent-pres.
+                       sep = 
sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1');
+               }
        }
-       this.trace('makeSeparator', sep, origSep, minNls, sepNlCount, 
nlConstraints);
+
+       if (this.debugging) {
+               var constraints = nlConstraints;
+               constraints.constraintInfo = undefined;
+               this.trace('makeSeparator', sep, origSep, minNls, sepNlCount, 
constraints);
+       }
 
        return sep;
 };
@@ -3697,24 +3750,29 @@
  *     }
  * }
  */
-WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB, 
handlerB, dir) {
+WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB, 
handlerB) {
        var nlConstraints,
                sepHandlerA = handlerA && handlerA.sepnls || {},
-               sepHandlerB = handlerB && handlerB.sepnls || {};
+               sepHandlerB = handlerB && handlerB.sepnls || {},
+               sepType = null;
        if ( nodeA.nextSibling === nodeB ) {
                // sibling separator
+               sepType = "sibling";
                nlConstraints = this.getSepNlConstraints(state, nodeA, 
sepHandlerA.after,
                                                                                
        nodeB, sepHandlerB.before);
-       } else if ( nodeB.parentNode === nodeA || dir === 'prev' ) {
+       } else if ( nodeB.parentNode === nodeA ) {
+               sepType = "parent-child";
                // parent-child separator, nodeA parent of nodeB
                nlConstraints = this.getSepNlConstraints(state, nodeA, 
sepHandlerA.firstChild,
                                                                                
        nodeB, sepHandlerB.before);
-       } else if ( nodeA.parentNode === nodeB || dir === 'next') {
+       } else if ( nodeA.parentNode === nodeB ) {
+               sepType = "child-parent";
                // parent-child separator, nodeB parent of nodeA
                nlConstraints = this.getSepNlConstraints(state, nodeA, 
sepHandlerA.after,
                                                                                
        nodeB, sepHandlerB.lastChild);
        } else {
                // sibling separator
+               sepType = "sibling";
                nlConstraints = this.getSepNlConstraints(state, nodeA, 
sepHandlerA.after,
                                                                                
        nodeB, sepHandlerB.before);
        }
@@ -3725,6 +3783,7 @@
 
        if (this.debugging) {
                this.trace('hSep', nodeA.nodeName, nodeB.nodeName,
+                               sepType,
                                nlConstraints,
                                (nodeA.outerHTML || nodeA.nodeValue || 
'').substr(0,40),
                                (nodeB.outerHTML || nodeB.nodeValue || 
'').substr(0,40)
@@ -3741,6 +3800,13 @@
                state.sep.constraints = nlConstraints;
                //state.sep.lastSourceNode = state.sep.lastSourceNode || nodeA;
        }
+
+       state.sep.constraints.constraintInfo = {
+               sepType: sepType,
+               nodeA: nodeA,
+               nodeB: nodeB
+       };
+
        //console.log('nlConstraints', state.sep.constraints);
 };
 
@@ -3902,7 +3968,6 @@
                        // TODO: set modified flag if start or end node (but 
not both) are
                        // modified / new so that the selser can use the 
separator
                        sep = this.makeSeparator(state.sep.src || '',
-                                               origNode,
                                                state.sep.constraints || 
{a:{},b:{}, max:0},
                                                state);
                } else {
@@ -4013,6 +4078,23 @@
 
                                        state.currNodeUnmodified = true;
 
+                                       // If this HTML node will disappear in 
wikitext because of zero width,
+                                       // then the separator constraints will 
carry over to the node's children.
+                                       //
+                                       // Since we dont recurse into 'node' in 
selser mode, we update the
+                                       // separator constraintInfo to apply to 
'node' and its first child.
+                                       //
+                                       // We could clear constraintInfo 
altogether which would be correct (but
+                                       // could normalize separators and 
introduce dirty diffs unnecessarily).
+                                       if (node.nodeName in 
Consts.ZeroWidthWikitextTags &&
+                                               node.childNodes.length > 0 &&
+                                               
state.sep.constraints.constraintInfo.sepType === 'sibling')
+                                       {
+                                               
state.sep.constraints.constraintInfo.sepType = 'parent-child';
+                                               
state.sep.constraints.constraintInfo.nodeA = node;
+                                               
state.sep.constraints.constraintInfo.nodeB = node.firstChild;
+                                       }
+
                                        // console.warn("USED ORIG");
                                        this.trace("ORIG-src:", src, '; out:', 
out);
                                        cb(out, node);
diff --git a/js/lib/mediawiki.wikitext.constants.js 
b/js/lib/mediawiki.wikitext.constants.js
index d96c659..8b93146 100644
--- a/js/lib/mediawiki.wikitext.constants.js
+++ b/js/lib/mediawiki.wikitext.constants.js
@@ -90,8 +90,21 @@
        // These wikitext tags are composed with quote-chars
        WTQuoteTags: JSUtils.arrayToHash(['I', 'B']),
 
-       // Whitespace in these elements does not lead to indent-pre
-       PreSafeTags: JSUtils.arrayToHash(['BR', 'TABLE', 'TBODY', 'CAPTION', 
'TR', 'TD', 'TH']),
+       // Leading whitespace on new lines in these elements does not lead to 
indent-pre
+       // This only applies to immediate children (while skipping past 
zero-wikitext tags)
+       // (Ex: content in table-cells induce indent pres)
+       WeakIndentPreSuppressingTags: JSUtils.arrayToHash([
+               'TABLE', 'TBODY', 'TR'
+       ]),
+
+       // Leading whitespace on new lines in these elements does not lead to 
indent-pre
+       // This applies to all nested content in these tags
+       // (Ex: content in table-cells nested in blocktags do not induce indent 
pres)
+       // FIXME: <ref> extension tag is another such -- is it possible to fold 
them
+       // into this setup so we can get rid of the 'noPre' hack in token 
transformers?
+       StrongIndentPreSuppressingTags: JSUtils.arrayToHash([
+               'BLOCKQUOTE'
+       ]),
 
        // In the PHP parser, these block tags open block-tag scope
        // See doBlockLevels in the PHP parser (includes/parser/Parser.php)
@@ -213,7 +226,12 @@
                "i"     : [2,2],
                "br"    : [0,0],
                "figure": [2,2]
-       }
+       },
+
+       // Derived information from 'WT_TagWidths' and should be kept 
consistent with it
+       ZeroWidthWikitextTags: JSUtils.arrayToHash([
+               'P', 'META', 'TBODY', 'OL', 'UL', 'DL', 'BR'
+       ])
 };
 
 // Fill in reverse map of prefix options.
diff --git a/js/tests/parserTests-blacklist.js 
b/js/tests/parserTests-blacklist.js
index 79dbcb0..492ad02 100644
--- a/js/tests/parserTests-blacklist.js
+++ b/js/tests/parserTests-blacklist.js
@@ -2168,6 +2168,7 @@
 add("html2wt", "Empty TR followed by mixed-ws-comment line should RT 
correctly");
 add("html2wt", "Multi-line image caption generated by templates with/without 
trailing newlines");
 add("html2wt", "Improperly nested inline or quotes tags with whitespace in 
between");
+add("html2wt", "Strip leading whitespace when handling indent-pre inducing 
tags");
 
 
 // Blacklist for selser
@@ -2184,12 +2185,12 @@
 add("selser", "Paragraphs with newline spacing with non-empty white-space 
lines in between [0,2,3,0,[3],0,0,0,4,0,[0,4],4,4]");
 add("selser", "Paragraphs with newline spacing with non-empty white-space 
lines in between [3,2,4,4,0,0,0,4,[4],0,2,3,2]");
 add("selser", "Paragraphs with newline spacing with non-empty mixed comment 
and white-space lines in between 
[3,2,[0,0,3],0,0,0,1,0,3,0,2,0,0,0,3,0,3,4,2,3,[2],0,0,4,2,0,2,3,0]");
-add("selser", "Paragraphs with newline spacing with non-empty mixed comment 
and white-space lines in between 
[4,3,[4,2,4],3,3,3,[4,0,0,4],2,0,3,1,0,0,0,3,0,[2],0,0,3,3,0,0,0,3,3,[0,4],0,3]");
 add("selser", "Paragraphs with newline spacing with non-empty mixed comment 
and white-space lines in between 
[0,3,4,3,0,0,[4,0,0,3],0,2,0,0,0,3,3,4,2,4,4,2,3,4,0,0,2,0,0,0,0,2]");
 add("selser", "Paragraphs with newline spacing with non-empty mixed comment 
and white-space lines in between 
[0,0,[2,0,3],4,0,4,2,3,0,3,[2],0,0,3,0,0,0,3,0,3,0,0,3,0,3,0,2,3,0]");
 add("selser", "Paragraphs with newline spacing with non-empty mixed comment 
and white-space lines in between 
[0,0,2,0,0,3,4,0,0,0,1,3,0,0,3,3,[4],0,2,2,0,2,2,0,0,4,2,3,4]");
 add("selser", "Extra newlines: More paragraphs with indented comment 
[1,0,4,0,2]");
 add("selser", "Extra newlines: More paragraphs with indented comment 
[[2],3,4,2,2]");
+add("selser", "Extra newlines: More paragraphs with indented comment 
[3,4,0,2,0]");
 add("selser", "Extra newlines: More paragraphs with indented comment 
[0,3,0,4,2]");
 add("selser", "Italics and possessives (1) [1]");
 add("selser", "Italics and possessives (1) [2]");
diff --git a/js/tests/parserTests.txt b/js/tests/parserTests.txt
index b2919dc..7691bee 100644
--- a/js/tests/parserTests.txt
+++ b/js/tests/parserTests.txt
@@ -1390,11 +1390,21 @@
 !! input
 <blockquote>
  Blah
+{|
+|
+ indented cell (no pre-wrapping!)
+|}
 </blockquote>
 !! result
 <blockquote>
 <p> Blah
 </p>
+<table>
+<tr>
+<td>
+<p> indented cell (no pre-wrapping!)
+</p>
+</td></tr></table>
 </blockquote>
 
 !! end
@@ -17722,6 +17732,63 @@
 </ul>
 !! end
 
+!! test
+Dont strip leading whitespace when handling indent-pre suppressing tags
+!! options
+parsoid=html2wt
+!! input
+{|
+  | indented row
+|}
+<blockquote>
+ '''This is very bold of you!'''
+
+{|
+|
+ indented cell (no pre-wrapping!)
+|}
+</blockquote>
+foo
+ <div>bar</div>
+!! result
+<table>
+  <tr><td> indented row</td></tr>
+</table>
+<blockquote><p>
+ <b>This is very bold of you!</b>
+</p>
+<table><tr><td>
+ indented cell (no pre-wrapping!)
+</td></tr></table>
+</blockquote>
+<p>foo</p>
+ <div>bar</div>
+!! end
+
+!! test
+Strip leading whitespace when handling indent-pre inducing tags
+!! options
+parsoid=html2wt
+!! input
+foo
+<span>bar</span>
+
+<span>foo2
+</span>bar2
+
+<div>foo</div>
+<span>bar</span>
+!! result
+<p>foo</p>
+ <span>bar</span>
+
+<span>foo2
+ </span>bar2
+
+<div>foo</div>
+ <span>bar</span>
+!! end
+
 # Wacky -- the leading newline in input is required because
 # that is what the serializer emits. To be fixed. Not fixing
 # the test because this test is required to test serialization of

-- 
To view, visit https://gerrit.wikimedia.org/r/87019
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/Parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <ssas...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to