[MediaWiki-commits] [Gerrit] mediawiki...Wikispeech[master]: Calculate correct offsets for unicode characters

2017-03-30 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( 
https://gerrit.wikimedia.org/r/344616 )

Change subject: Calculate correct offsets for unicode characters
..


Calculate correct offsets for unicode characters

Unicode characters are counted as multiple bytes, which manifested as
the highlighting of a sentence being longer than it should be. This
also affected the start of the following sentences.

In fixing this, the segmentation was reworked, which also resulted in a
more consistent handling of utterance boundaries: "offset" is used instead
of "position" and excludes the last character (which is more common when
getting substrings etc.), trailing and leading whitespace is no longer
included in utterances, and utterances that are completely wrapped in tags
are handled properly.
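
The gist of the bug, as a minimal PHP sketch (the example strings and values
are illustrative only, not code from this patch): PHP's byte-oriented string
functions count a multi-byte UTF-8 character as several units, so a byte
count used where a character count is expected makes the highlighted range
run past the end of the sentence and shifts where the next one starts.

  <?php
  // Minimal sketch of the problem, not code from the patch.
  $text = 'Hallå världen. Nästa mening.';
  $sentence = 'Hallå världen.';

  // "å" and "ä" are 2 bytes each in UTF-8, so byte and character counts differ.
  echo strlen( $sentence ), "\n";     // 16 (bytes)
  echo mb_strlen( $sentence ), "\n";  // 14 (characters)

  // A byte count handed to something that counts characters (such as the
  // client-side highlighting) runs two characters past the sentence and
  // shifts where the next sentence is assumed to start:
  echo mb_substr( $text, 0, strlen( $sentence ) ), "\n";     // "Hallå världen. N"
  echo mb_substr( $text, 0, mb_strlen( $sentence ) ), "\n";  // "Hallå världen."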

Bug: T159545
Bug: T159811
Bug: T159809
Bug: T159671
Change-Id: I8e32637a51857e383ed1fae5a83fa04b0a978deb
---
M Hooks.php
M includes/Cleaner.php
M includes/Segmenter.php
M tests/phpunit/SegmenterTest.php
M tests/qunit/ext.wikispeech.test.js
5 files changed, 302 insertions(+), 97 deletions(-)

Approvals:
  Lokal Profil: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/Hooks.php b/Hooks.php
index ea491cc..2fa1092 100644
--- a/Hooks.php
+++ b/Hooks.php
@@ -54,12 +54,12 @@
'Wikispeech',
'HTML from onParserAfterTidy(): ' . $text
);
-   $cleanedText = Cleaner::cleanHtml( $text );
+   $cleanedContents = Cleaner::cleanHtml( $text );
wfDebugLog(
'Wikispeech',
-   'Cleaned text: ' . var_export( $cleanedText, true )
+   'Cleaned text: ' . var_export( $cleanedContents, true )
);
-   $utterances = Segmenter::segmentSentences( $cleanedText );
+   $utterances = Segmenter::segmentSentences( $cleanedContents );
wfDebugLog(
'Wikispeech',
'Utterances: ' . var_export( $utterances, true )
diff --git a/includes/Cleaner.php b/includes/Cleaner.php
index a65e0d6..ab58f48 100644
--- a/includes/Cleaner.php
+++ b/includes/Cleaner.php
@@ -25,12 +25,12 @@
// Only add elements below the dummy element. These are the
// elements from the original HTML.
$top = $xpath->evaluate( '/meta/dummy' )->item( 0 );
-   $cleanedContent = [];
+   $cleanedContents = [];
self::addContent(
-   $cleanedContent,
+   $cleanedContents,
$top
);
-   return $cleanedContent;
+   return $cleanedContents;
}
 
/**
diff --git a/includes/Segmenter.php b/includes/Segmenter.php
index 249bbf5..ea98d09 100644
--- a/includes/Segmenter.php
+++ b/includes/Segmenter.php
@@ -13,30 +13,30 @@
 *
 * A segment is an array with the keys "content", "startOffset"
 * and "endOffset". "content" is an array of `CleanedText`s.
-
-* "startOffset" is the position of the first character of the
+* "startOffset" is the offset of the first character of the
 * segment, within the text node it appears. "endOffset" is the
-* position of the last character of the segment, within the text
+* offset of the last character of the segment, within the text
 * node it appears. These are used to determine start and end of a
 * segment in the original HTML.
 *
 * A sentence is here defined as a number of tokens ending with a
-* dot (full stop). Headings are also considered sentences.
+* dot (full stop).
 *
 * @since 0.0.1
-* @param array $cleanedContent An array of `CleanedText`s, as
+* @param array $cleanedContents An array of `CleanedText`s, as
 *  returned by `Cleaner::cleanHtml()`.
 * @return array An array of segments, each containing the
 *  `CleanedText`s in that segment.
 */
 
-   public static function segmentSentences( $cleanedContent ) {
+   public static function segmentSentences( $cleanedContents ) {
$segments = [];
$currentSegment = [
'content' => [],
-   'startOffset' => 0
+   'startOffset' => null,
+   'endOffset' => null
];
-   foreach ( $cleanedContent as $content ) {
+   foreach ( $cleanedContents as $content ) {
self::addSegments(
$segments,
$currentSegment,
@@ -53,21 +53,14 @@
/**
 * Add segments for a string.
  

[MediaWiki-commits] [Gerrit] mediawiki...Wikispeech[master]: Calculate correct offsets for unicode characters

2017-03-24 Thread Sebastian Berlin (WMSE) (Code Review)
Sebastian Berlin (WMSE) has uploaded a new change for review. ( 
https://gerrit.wikimedia.org/r/344616 )

Change subject: Calculate correct offsets for unicode characters
..

Calculate correct offsets for unicode characters

Unicode characters are counted as multiple bytes, which manifested as
the highlighting of a sentence being longer than it should be. This
also affected the start of the following sentences.

In fixing this, the segmentation was reworked, which also resulted in a
more consistent handling of utterance boundaries: "offset" is used instead
of "position" and excludes the last character (which is more common when
getting substrings etc.), trailing and leading whitespace is no longer
included in utterances, and utterances that are completely wrapped in tags
are handled properly.

Bug: T159545
Bug: T159811
Bug: T159809
Bug: T159671

Change-Id: I8e32637a51857e383ed1fae5a83fa04b0a978deb
---
M Hooks.php
M extension.json
M includes/Cleaner.php
M includes/Segmenter.php
M modules/ext.wikispeech.js
M tests/phpunit/SegmenterTest.php
M tests/qunit/ext.wikispeech.test.js
7 files changed, 300 insertions(+), 103 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Wikispeech refs/changes/16/344616/1

diff --git a/Hooks.php b/Hooks.php
index ea491cc..2fa1092 100644
--- a/Hooks.php
+++ b/Hooks.php
@@ -54,12 +54,12 @@
'Wikispeech',
'HTML from onParserAfterTidy(): ' . $text
);
-   $cleanedText = Cleaner::cleanHtml( $text );
+   $cleanedContents = Cleaner::cleanHtml( $text );
wfDebugLog(
'Wikispeech',
-   'Cleaned text: ' . var_export( $cleanedText, true )
+   'Cleaned text: ' . var_export( $cleanedContents, true )
);
-   $utterances = Segmenter::segmentSentences( $cleanedText );
+   $utterances = Segmenter::segmentSentences( $cleanedContents );
wfDebugLog(
'Wikispeech',
'Utterances: ' . var_export( $utterances, true )
diff --git a/extension.json b/extension.json
index 850c2b9..8b1b9d1 100644
--- a/extension.json
+++ b/extension.json
@@ -83,23 +83,23 @@
"WikispeechKeyboardShortcuts": {
"playStop": {
"key": 32,
-   "modifiers": [ "ctrl" ]
+   "modifiers": [ "alt", "shift" ]
},
"skipAheadSentence": {
"key": 39,
-   "modifiers": [ "ctrl" ]
+   "modifiers": [ "alt", "shift" ]
},
"skipBackSentence": {
"key": 37,
-   "modifiers": [ "ctrl" ]
+   "modifiers": [ "alt", "shift" ]
},
"skipAheadWord": {
"key": 40,
-   "modifiers": [ "ctrl" ]
+   "modifiers": [ "alt", "shift" ]
},
"skipBackWord": {
"key": 38,
-   "modifiers": [ "ctrl" ]
+   "modifiers": [ "alt", "shift" ]
}
},
"WikispeechSkipBackRewindsThreshold": 3.0
diff --git a/includes/Cleaner.php b/includes/Cleaner.php
index a65e0d6..ab58f48 100644
--- a/includes/Cleaner.php
+++ b/includes/Cleaner.php
@@ -25,12 +25,12 @@
// Only add elements below the dummy element. These are the
// elements from the original HTML.
$top = $xpath->evaluate( '/meta/dummy' )->item( 0 );
-   $cleanedContent = [];
+   $cleanedContents = [];
self::addContent(
-   $cleanedContent,
+   $cleanedContents,
$top
);
-   return $cleanedContent;
+   return $cleanedContents;
}
 
/**
diff --git a/includes/Segmenter.php b/includes/Segmenter.php
index 249bbf5..170e8e1 100644
--- a/includes/Segmenter.php
+++ b/includes/Segmenter.php
@@ -13,30 +13,30 @@
 *
 * A segment is an array with the keys "content", "startOffset"
 * and "endOffset". "content" is an array of `CleanedText`s.
-
-* "startOffset" is the position of the first character of the
+* "startOffset" is the offset of the first character of the
 * segment, within the text node it appears. "endOffset" is the
-*
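
As a closing illustration of the segment format described in the reworked
Segmenter.php doc comment above, here is a hedged sketch of how a consumer
might read a segment's text back out of the text node it refers to. The
helper name, the variables and the example strings are hypothetical, not part
of the extension; the sketch assumes startOffset and endOffset are character
offsets within a single text node, with endOffset pointing at the segment's
last character, as the doc comment describes.

  <?php
  // Hypothetical helper, not code from the extension: extract the text a
  // segment covers, given character-based offsets as in the doc comment.
  function getSegmentText( $nodeText, array $segment ) {
      $start = $segment['startOffset'];
      // endOffset points at the last character, so the length is end - start + 1.
      $length = $segment['endOffset'] - $start + 1;
      // mb_substr() counts characters, not bytes, so multi-byte characters
      // such as "å" and "ä" no longer skew the result.
      return mb_substr( $nodeText, $start, $length );
  }

  $nodeText = 'Hallå världen. Nästa mening.';
  $segment = [
      'content' => [],
      'startOffset' => 0,
      'endOffset' => 13
  ];
  echo getSegmentText( $nodeText, $segment ), "\n";  // "Hallå världen."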