[MediaWiki-commits] [Gerrit] Temporarily tweak <nowiki/> removal heuristic a bit - change (mediawiki...parsoid)

jenkins-bot (Code Review) Wed, 17 Dec 2014 14:16:47 -0800

jenkins-bot has submitted this change and it was merged.

Change subject: Temporarily tweak <nowiki/> removal heuristic a bit
......................................................................



Temporarily tweak <nowiki/> removal heuristic a bit

* Looks like there are a lot of wikitext scenarios like this in
  roundtrip testing.

'<nowiki/>''foo'' and ''[[bar]]''

  The existing conservative heuristic won't strip the nowiki in this
  scenario. So, I've added another hacky heuristic for now.

  We really need a line-based heuristic that can examine wikitext chunks
  and their types (text, quote, link, etc.) to better determine what
  kind of escaping is necessary.

  That is coming later as part of what Scott is working on.

  For now, this should help us minimize regressions.
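
  As a rough illustration of the scenario above, the snippet below runs the
  old and new testRE patterns (both copied from the diff in this change)
  against the piece that follows the <nowiki/> in the example line. The
  variable names oldRE, newRE and piece are only for this sketch and do not
  appear in the patch.

```javascript
// Old, conservative pattern: no [, { or < allowed inside a quote segment.
var oldRE = /^[^']+$|^[^']*(('''''[^\[\{<']+'''''|'''[^\[\{<']+'''|''[^\[\{<']+''|')([^']+|$))+('|$)$/;

// New pattern from this change: additionally allows a simple [[word]] link
// inside a quote segment.
var newRE = /^[^']+$|^[^']*(('''''(\[\[\w+\]\]|[^\[\{<'])+'''''|'''(\[\[\w+\]\]|[^\[\{<'])+'''|''(\[\[\w+\]\]|[^\[\{<'])+''|')([^']+|$))+('|$)$/;

// The text after the <nowiki/> in the example line from the commit message.
var piece = "''foo'' and ''[[bar]]''";

console.log(oldRE.test(piece)); // false: the [[bar]] link blocks the match
console.log(newRE.test(piece)); // true: [[bar]] is now accepted

// The link allowance is narrow: \w+ does not cover targets with spaces.
console.log(newRE.test("''[[foo bar]]''")); // false
```

  The last line suggests why the message calls this heuristic hacky: only
  links whose target is a single run of word characters are covered.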

Change-Id: I2759e76d56703254d3907ac447644457bc007b4b
---
M lib/mediawiki.WikitextSerializer.js
1 file changed, 19 insertions(+), 8 deletions(-)

Approvals:
  Cscott: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/lib/mediawiki.WikitextSerializer.js b/lib/mediawiki.WikitextSerializer.js
index 79ba9bd..ae3b908 100644
--- a/lib/mediawiki.WikitextSerializer.js
+++ b/lib/mediawiki.WikitextSerializer.js
@@ -1217,27 +1217,38 @@
 //   before <i> or <b> does not need <nowiki/> protection.
 function stripUnnecessaryQuoteNowikis(wt) {
        // no-quotes OR matched quote segments with 5/3/2 quotes OR single-quote char
-       // Within the matched quote-segments, be conservative and don't match higher-priority
-       // parser characters like [{< -- used for links and templates. This should prevent
-       // inadvertent matching up across links/templates/tags.
-       var testRE = /^[^']+$|^[^']*(('''''[^\[\{<']+'''''|'''[^\[\{<']+'''|''[^\[\{<']+''|')([^']+|$))+('|$)$/;
+       //
+       // Within the matched quote-segments,
+       // - be conservative and don't match higher-priority parser characters
+       //   like [{< -- used for links and templates. This should prevent
+       //   inadvertent matching up across links/templates/tags.
+       // - allow [[..]]
+       var testRE = /^[^']+$|^[^']*(('''''(\[\[\w+\]\]|[^\[\{<'])+'''''|'''(\[\[\w+\]\]|[^\[\{<'])+'''|''(\[\[\w+\]\]|[^\[\{<'])+''|')([^']+|$))+('|$)$/;
 
        return wt.split(/\n|$/).map(function(line) {
+               // Optimization: skip test if there are no <nowiki/>s here to remove.
+               if (!/<nowiki\/>/.test(line)) {
+                       return line;
+               }
+
                // * Strip out nowiki-protected strings since we are only interested in
                //   quote sequences that correspond to <i>/<b> tags.
                // * Find segments separated by <nowiki/>s.
                // * If all the segments contain balanced i/b tags, and the <nowiki/>
                //   separated a quote and an i/b tag, we can remove all the <nowiki/>s
                var pieces = line.replace(/<nowiki>.*?<\/nowiki>/g, '').split(/<nowiki\/>/);
-
                var n = pieces.length;
                for (var i = 0; i < n; i++) {
-                       if (!testRE.test(pieces[i]) ||
+                       // Since we are okay with single quotes in the middle, strip those
+                       // out so as not to have to deal with them in the testRE regexp above.
+                       // We need to leave a trailing ' behind since we test for it below.
+                       var p = pieces[i].replace(/(^|[^'])'(?=[^'])/g, "$1");
+                       if (!testRE.test(p) ||
                                // All but the first piece should start with ''
-                               (i > 0 && !/^''/.test(pieces[i])) ||
+                               (i > 0 && !/^''/.test(p)) ||
                                // All but the last piece should end in a single ' char
                                // since that is the only scenario we are optimizing for here
-                               (i < n-1 && !/(^|[^'])'$/.test(pieces[i])))
+                               (i < n-1 && !/(^|[^'])'$/.test(p)))
                        {
                                return line;
                        }

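The hunk above stops before the end of the function, so the following is only a sketch of the per-line checks that are visible in it, not the actual Parsoid code. The helper name canStripQuoteNowikis is hypothetical; it reports whether a line passes every check shown in the diff and makes no claim about what the real function does after the loop.

```javascript
// testRE as introduced by this change (copied from the diff).
var testRE = /^[^']+$|^[^']*(('''''(\[\[\w+\]\]|[^\[\{<'])+'''''|'''(\[\[\w+\]\]|[^\[\{<'])+'''|''(\[\[\w+\]\]|[^\[\{<'])+''|')([^']+|$))+('|$)$/;

// Hypothetical helper (not in the patch): true when every check visible
// in the hunk passes for this line.
function canStripQuoteNowikis(line) {
    // Nothing to remove if the line has no <nowiki/>.
    if (!/<nowiki\/>/.test(line)) {
        return false;
    }
    // Drop <nowiki>...</nowiki> spans, then split on <nowiki/>.
    var pieces = line.replace(/<nowiki>.*?<\/nowiki>/g, '').split(/<nowiki\/>/);
    var n = pieces.length;
    for (var i = 0; i < n; i++) {
        // Remove isolated single quotes in the middle of a piece, but keep
        // a trailing one, since the last check below depends on it.
        var p = pieces[i].replace(/(^|[^'])'(?=[^'])/g, "$1");
        if (!testRE.test(p) ||
            // All but the first piece should start with ''
            (i > 0 && !/^''/.test(p)) ||
            // All but the last piece should end in a single ' char
            (i < n - 1 && !/(^|[^'])'$/.test(p))) {
            return false;
        }
    }
    return true;
}

console.log(canStripQuoteNowikis("'<nowiki/>''foo'' and ''[[bar]]''")); // true
console.log(canStripQuoteNowikis("'<nowiki/>''{{tpl}}''"));             // false
console.log(canStripQuoteNowikis("foo<nowiki/>bar"));                   // false
```

The second call fails because a template brace inside the quote segment is still rejected by testRE; the third fails because the first piece does not end in a single quote.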
-- 
To view, visit https://gerrit.wikimedia.org/r/180563
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I2759e76d56703254d3907ac447644457bc007b4b
Gerrit-PatchSet: 6
Gerrit-Project: mediawiki/services/parsoid
Gerrit-Branch: master
Gerrit-Owner: Subramanya Sastry <[email protected]>
Gerrit-Reviewer: Arlolra <[email protected]>
Gerrit-Reviewer: Cscott <[email protected]>
Gerrit-Reviewer: Subramanya Sastry <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits