Subramanya Sastry has uploaded a new change for review. https://gerrit.wikimedia.org/r/87019
Change subject: Improved handling of leading whitespace during serialization ...................................................................... Improved handling of leading whitespace during serialization * "Normally", leading whitespace in wikitext trigger indent-pres. However, there are some exceptions to this rule. This behavior is captured in information about weak and strong indent-pre suppressing tags. Weak tags only suppress indent-pres in their immediate DOM children (caveat: while ignoring any zero-width wikitext nodes). Strong tags suppress indent-pres in the entire DOM subtree they generate. Ex: Whitespace before <td> wikitext (" |foo") will not generate pre-wraps. But, whitespace inside <td> will generate pre-wraps ("|\n foo"). So, <table>, <tbody>, <tr> are all weak indent-pre suppressing tags. Ex: <blockquote>, <ref> extension tag are examples of strong indent-pre suppressing tags. Indent-pres generation is entirely turned off in these tags. In addition, block tags suppress indent-pres on the lines they are present. This patch updates wikitext.constants.js with the right set of weak/strong indent-pre suppressing tags. * makeSeparator in the serializer tries to account for leading white-space in separator text. However, it needs to know whether the separator is being generated between siblings or between a parent and child. The leading-whitespace-stripping behavior differs in these scenarios. This patch carries over this information to makeSeparator (and appropriately updates it in selser mode) to account for block tags. * There are still gaps in our handling (there is a FIXME in the code as well as a new failing test that I've added to the blacklist). We dont need to address that immediately but as we continue to update our support for serialization of arbitrary HTML, we will have to tackel it. * Added various tests for serialization of new content that will exercise these different code paths. * Test result changes: - One html2wt failure is because of missing support for arbitrary HTML (see above). - One additional selser failure is because selser is more accurate than wt2wt (which strips leading whitespace unnecessarily). "Extra newlines: More paragraphs with indented comment [3,4,0,2,0]" - One additional passing selser test is because of improved indent-pre handling - Two new html2wt tests pass with this patch (but would have failed without it). Change-Id I518237a2aec3c54281b3641d1a7e4a3c76331686 Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0 --- M js/lib/mediawiki.WikitextSerializer.js M js/lib/mediawiki.wikitext.constants.js M js/tests/parserTests-blacklist.js M js/tests/parserTests.txt 4 files changed, 191 insertions(+), 23 deletions(-) git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Parsoid refs/changes/19/87019/1 diff --git a/js/lib/mediawiki.WikitextSerializer.js b/js/lib/mediawiki.WikitextSerializer.js index 63613b7..62db9d0 100644 --- a/js/lib/mediawiki.WikitextSerializer.js +++ b/js/lib/mediawiki.WikitextSerializer.js @@ -3574,7 +3574,7 @@ * Create a separator given a (potentially empty) separator text and newline * constraints */ -WSP.makeSeparator = function(sep, node, nlConstraints, state) { +WSP.makeSeparator = function(sep, nlConstraints, state) { var origSep = sep; // TODO: Move to Util? @@ -3634,21 +3634,74 @@ // sep.replace(/^([^\n<]*<!--(?:[^\-]|-(?!->))*-->)?[^\n<]+/g, '$1'); //} - // Strip non-nl ws from last line, but preserve comments - // This avoids triggering indent-pres. + // SSS FIXME: 'nlConstraints.min > 0' only applies to nodes that have native wikitext + // equivalents. For HTML nodes that will be serialized as HTML tags, we have to check + // for indent-pre safety always. This is currently not handled by the code below and + // will require other fixes to handle arbitrary HTML. To be done later! See example + // below that fails. // - // 'node' has min-nl constraint, but we dont know that 'node' is pre-safe. - // SSS FIXME: The check for 'node.nodeName in preSafeTags' should be possible - // at a nested level rather than just 'node'. If 'node' is an IEW/comment, - // we should find the "next" (at this and and ancestor levels), the non-sep - // sibling and check if that node is one of these types. - // - // SSS FIXME: how is it that parentNode can be null?? is body getting here? - var parentName = node.parentNode && node.parentNode.nodeName; - if (nlConstraints.min > 0 && !(node.nodeName in Consts.PreSafeTags)) { - sep = sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1'); + // Ex: "<div>foo</div>\n <span>bar</span>" + if (nlConstraints.min > 0 && sep.match(/[^\n<>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g)) { + // 'sep' is the separator before 'nodeB' and it has leading spaces on a newline. + // We have to decide whether that leading space will trigger indent-pres in wikitext. + // The decision depends on where this separator will be emitted relative + // to 'nodeA' and 'nodeB'. + + var isIndentPreSafe = false, + constraintInfo = nlConstraints.constraintInfo || {}; + + // Example of sibling sepType scenario: + // <p>foo</p> + // <span>bar</span> + // The span will be wrapped in an indent-pre if the leading space + // is not stripped since span is not a block tag + // + // Example of child-parent sepType scenario: + // <span>foo + // </span>bar + // The " </span>bar" will be wrapped in an indent-pre if the + // leading space is not stripped since span is not a block tag + if ((constraintInfo.sepType === 'sibling' || + constraintInfo.sepType === 'child-parent') && + Util.isBlockTag(constraintInfo.nodeB.nodeName)) + { + isIndentPreSafe = true; + } else if (constraintInfo.sepType === 'parent-child') { + // Separator between parent node and child node + var node = constraintInfo.nodeA; + + // Walk up past zero-wikitext width ancestors + while (node.nodeName in Consts.ZeroWidthWikitextTags) { + node = node.parentNode; + } + + // Deal with weak/strong indent-pre suppressing tags + if (node.nodeName in Consts.WeakIndentPreSuppressingTags) { + isIndentPreSafe = true; + } else { + // Strong indent-pre suppressing tags suppress indent-pres + // in entire DOM subtree rooted at that node + while (node.nodeName !== 'BODY') { + if (node.nodeName in Consts.StrongIndentPreSuppressingTags) { + isIndentPreSafe = true; + } + node = node.parentNode; + } + } + } + + if (!isIndentPreSafe) { + // Strip non-nl ws from last line, but preserve comments. + // This avoids triggering indent-pres. + sep = sep.replace(/[^\n>]+(<!--(?:[^\-]|-(?!->))*-->[^\n]*)?$/g, '$1'); + } } - this.trace('makeSeparator', sep, origSep, minNls, sepNlCount, nlConstraints); + + if (this.debugging) { + var constraints = nlConstraints; + constraints.constraintInfo = undefined; + this.trace('makeSeparator', sep, origSep, minNls, sepNlCount, constraints); + } return sep; }; @@ -3697,24 +3750,29 @@ * } * } */ -WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB, handlerB, dir) { +WSP.updateSeparatorConstraints = function( state, nodeA, handlerA, nodeB, handlerB) { var nlConstraints, sepHandlerA = handlerA && handlerA.sepnls || {}, - sepHandlerB = handlerB && handlerB.sepnls || {}; + sepHandlerB = handlerB && handlerB.sepnls || {}, + sepType = null; if ( nodeA.nextSibling === nodeB ) { // sibling separator + sepType = "sibling"; nlConstraints = this.getSepNlConstraints(state, nodeA, sepHandlerA.after, nodeB, sepHandlerB.before); - } else if ( nodeB.parentNode === nodeA || dir === 'prev' ) { + } else if ( nodeB.parentNode === nodeA ) { + sepType = "parent-child"; // parent-child separator, nodeA parent of nodeB nlConstraints = this.getSepNlConstraints(state, nodeA, sepHandlerA.firstChild, nodeB, sepHandlerB.before); - } else if ( nodeA.parentNode === nodeB || dir === 'next') { + } else if ( nodeA.parentNode === nodeB ) { + sepType = "child-parent"; // parent-child separator, nodeB parent of nodeA nlConstraints = this.getSepNlConstraints(state, nodeA, sepHandlerA.after, nodeB, sepHandlerB.lastChild); } else { // sibling separator + sepType = "sibling"; nlConstraints = this.getSepNlConstraints(state, nodeA, sepHandlerA.after, nodeB, sepHandlerB.before); } @@ -3725,6 +3783,7 @@ if (this.debugging) { this.trace('hSep', nodeA.nodeName, nodeB.nodeName, + sepType, nlConstraints, (nodeA.outerHTML || nodeA.nodeValue || '').substr(0,40), (nodeB.outerHTML || nodeB.nodeValue || '').substr(0,40) @@ -3741,6 +3800,13 @@ state.sep.constraints = nlConstraints; //state.sep.lastSourceNode = state.sep.lastSourceNode || nodeA; } + + state.sep.constraints.constraintInfo = { + sepType: sepType, + nodeA: nodeA, + nodeB: nodeB + }; + //console.log('nlConstraints', state.sep.constraints); }; @@ -3902,7 +3968,6 @@ // TODO: set modified flag if start or end node (but not both) are // modified / new so that the selser can use the separator sep = this.makeSeparator(state.sep.src || '', - origNode, state.sep.constraints || {a:{},b:{}, max:0}, state); } else { @@ -4013,6 +4078,23 @@ state.currNodeUnmodified = true; + // If this HTML node will disappear in wikitext because of zero width, + // then the separator constraints will carry over to the node's children. + // + // Since we dont recurse into 'node' in selser mode, we update the + // separator constraintInfo to apply to 'node' and its first child. + // + // We could clear constraintInfo altogether which would be correct (but + // could normalize separators and introduce dirty diffs unnecessarily). + if (node.nodeName in Consts.ZeroWidthWikitextTags && + node.childNodes.length > 0 && + state.sep.constraints.constraintInfo.sepType === 'sibling') + { + state.sep.constraints.constraintInfo.sepType = 'parent-child'; + state.sep.constraints.constraintInfo.nodeA = node; + state.sep.constraints.constraintInfo.nodeB = node.firstChild; + } + // console.warn("USED ORIG"); this.trace("ORIG-src:", src, '; out:', out); cb(out, node); diff --git a/js/lib/mediawiki.wikitext.constants.js b/js/lib/mediawiki.wikitext.constants.js index d96c659..8b93146 100644 --- a/js/lib/mediawiki.wikitext.constants.js +++ b/js/lib/mediawiki.wikitext.constants.js @@ -90,8 +90,21 @@ // These wikitext tags are composed with quote-chars WTQuoteTags: JSUtils.arrayToHash(['I', 'B']), - // Whitespace in these elements does not lead to indent-pre - PreSafeTags: JSUtils.arrayToHash(['BR', 'TABLE', 'TBODY', 'CAPTION', 'TR', 'TD', 'TH']), + // Leading whitespace on new lines in these elements does not lead to indent-pre + // This only applies to immediate children (while skipping past zero-wikitext tags) + // (Ex: content in table-cells induce indent pres) + WeakIndentPreSuppressingTags: JSUtils.arrayToHash([ + 'TABLE', 'TBODY', 'TR' + ]), + + // Leading whitespace on new lines in these elements does not lead to indent-pre + // This applies to all nested content in these tags + // (Ex: content in table-cells nested in blocktags do not induce indent pres) + // FIXME: <ref> extension tag is another such -- is it possible to fold them + // into this setup so we can get rid of the 'noPre' hack in token transformers? + StrongIndentPreSuppressingTags: JSUtils.arrayToHash([ + 'BLOCKQUOTE' + ]), // In the PHP parser, these block tags open block-tag scope // See doBlockLevels in the PHP parser (includes/parser/Parser.php) @@ -213,7 +226,12 @@ "i" : [2,2], "br" : [0,0], "figure": [2,2] - } + }, + + // Derived information from 'WT_TagWidths' and should be kept consistent with it + ZeroWidthWikitextTags: JSUtils.arrayToHash([ + 'P', 'META', 'TBODY', 'OL', 'UL', 'DL', 'BR' + ]) }; // Fill in reverse map of prefix options. diff --git a/js/tests/parserTests-blacklist.js b/js/tests/parserTests-blacklist.js index 79dbcb0..492ad02 100644 --- a/js/tests/parserTests-blacklist.js +++ b/js/tests/parserTests-blacklist.js @@ -2168,6 +2168,7 @@ add("html2wt", "Empty TR followed by mixed-ws-comment line should RT correctly"); add("html2wt", "Multi-line image caption generated by templates with/without trailing newlines"); add("html2wt", "Improperly nested inline or quotes tags with whitespace in between"); +add("html2wt", "Strip leading whitespace when handling indent-pre inducing tags"); // Blacklist for selser @@ -2184,12 +2185,12 @@ add("selser", "Paragraphs with newline spacing with non-empty white-space lines in between [0,2,3,0,[3],0,0,0,4,0,[0,4],4,4]"); add("selser", "Paragraphs with newline spacing with non-empty white-space lines in between [3,2,4,4,0,0,0,4,[4],0,2,3,2]"); add("selser", "Paragraphs with newline spacing with non-empty mixed comment and white-space lines in between [3,2,[0,0,3],0,0,0,1,0,3,0,2,0,0,0,3,0,3,4,2,3,[2],0,0,4,2,0,2,3,0]"); -add("selser", "Paragraphs with newline spacing with non-empty mixed comment and white-space lines in between [4,3,[4,2,4],3,3,3,[4,0,0,4],2,0,3,1,0,0,0,3,0,[2],0,0,3,3,0,0,0,3,3,[0,4],0,3]"); add("selser", "Paragraphs with newline spacing with non-empty mixed comment and white-space lines in between [0,3,4,3,0,0,[4,0,0,3],0,2,0,0,0,3,3,4,2,4,4,2,3,4,0,0,2,0,0,0,0,2]"); add("selser", "Paragraphs with newline spacing with non-empty mixed comment and white-space lines in between [0,0,[2,0,3],4,0,4,2,3,0,3,[2],0,0,3,0,0,0,3,0,3,0,0,3,0,3,0,2,3,0]"); add("selser", "Paragraphs with newline spacing with non-empty mixed comment and white-space lines in between [0,0,2,0,0,3,4,0,0,0,1,3,0,0,3,3,[4],0,2,2,0,2,2,0,0,4,2,3,4]"); add("selser", "Extra newlines: More paragraphs with indented comment [1,0,4,0,2]"); add("selser", "Extra newlines: More paragraphs with indented comment [[2],3,4,2,2]"); +add("selser", "Extra newlines: More paragraphs with indented comment [3,4,0,2,0]"); add("selser", "Extra newlines: More paragraphs with indented comment [0,3,0,4,2]"); add("selser", "Italics and possessives (1) [1]"); add("selser", "Italics and possessives (1) [2]"); diff --git a/js/tests/parserTests.txt b/js/tests/parserTests.txt index b2919dc..7691bee 100644 --- a/js/tests/parserTests.txt +++ b/js/tests/parserTests.txt @@ -1390,11 +1390,21 @@ !! input <blockquote> Blah +{| +| + indented cell (no pre-wrapping!) +|} </blockquote> !! result <blockquote> <p> Blah </p> +<table> +<tr> +<td> +<p> indented cell (no pre-wrapping!) +</p> +</td></tr></table> </blockquote> !! end @@ -17722,6 +17732,63 @@ </ul> !! end +!! test +Dont strip leading whitespace when handling indent-pre suppressing tags +!! options +parsoid=html2wt +!! input +{| + | indented row +|} +<blockquote> + '''This is very bold of you!''' + +{| +| + indented cell (no pre-wrapping!) +|} +</blockquote> +foo + <div>bar</div> +!! result +<table> + <tr><td> indented row</td></tr> +</table> +<blockquote><p> + <b>This is very bold of you!</b> +</p> +<table><tr><td> + indented cell (no pre-wrapping!) +</td></tr></table> +</blockquote> +<p>foo</p> + <div>bar</div> +!! end + +!! test +Strip leading whitespace when handling indent-pre inducing tags +!! options +parsoid=html2wt +!! input +foo +<span>bar</span> + +<span>foo2 +</span>bar2 + +<div>foo</div> +<span>bar</span> +!! result +<p>foo</p> + <span>bar</span> + +<span>foo2 + </span>bar2 + +<div>foo</div> + <span>bar</span> +!! end + # Wacky -- the leading newline in input is required because # that is what the serializer emits. To be fixed. Not fixing # the test because this test is required to test serialization of -- To view, visit https://gerrit.wikimedia.org/r/87019 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I49123699b7710737cfcb835143a2fb861c3ad1f0 Gerrit-PatchSet: 1 Gerrit-Project: mediawiki/extensions/Parsoid Gerrit-Branch: master Gerrit-Owner: Subramanya Sastry <ssas...@wikimedia.org> _______________________________________________ MediaWiki-commits mailing list MediaWiki-commits@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits