[MediaWiki-commits] [Gerrit] mediawiki...Wikispeech[master]: Calculate correct offsets for unicode characters

Sebastian Berlin (WMSE) (Code Review) Fri, 24 Mar 2017 07:14:43 -0700

Sebastian Berlin (WMSE) has uploaded a new change for review. ( 
https://gerrit.wikimedia.org/r/344616 )


Change subject: Calculate correct offsets for unicode characters
......................................................................

Calculate correct offsets for unicode characters

Unicode characters are counted as multiple bytes, which manifested as
the highlighting of a sentence being longer than it should. This also
affects the start of following sentences.

In fixing this, the segmentation was reworked which also resulted in a
more consistent use of utterance boundaries ("offset" is used instead
of "position" and excludes the last character, which is more common
when getting substrings etc.), not including trailing or leading
whitespaces for utterances and properly handling utterances that are
completely wrapped in tags.

Bug: T159545
Bug: T159811
Bug: T159809
Bug: T159671

Change-Id: I8e32637a51857e383ed1fae5a83fa04b0a978deb
---
M Hooks.php
M extension.json
M includes/Cleaner.php
M includes/Segmenter.php
M modules/ext.wikispeech.js
M tests/phpunit/SegmenterTest.php
M tests/qunit/ext.wikispeech.test.js
7 files changed, 300 insertions(+), 103 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Wikispeech 
refs/changes/16/344616/1

diff --git a/Hooks.php b/Hooks.php
index ea491cc..2fa1092 100644
--- a/Hooks.php
+++ b/Hooks.php
@@ -54,12 +54,12 @@
                                'Wikispeech',
                                'HTML from onParserAfterTidy(): ' . $text
                        );
-                       $cleanedText = Cleaner::cleanHtml( $text );
+                       $cleanedContents = Cleaner::cleanHtml( $text );
                        wfDebugLog(
                                'Wikispeech',
-                               'Cleaned text: ' . var_export( $cleanedText, 
true )
+                               'Cleaned text: ' . var_export( 
$cleanedContents, true )
                        );
-                       $utterances = Segmenter::segmentSentences( $cleanedText 
);
+                       $utterances = Segmenter::segmentSentences( 
$cleanedContents );
                        wfDebugLog(
                                'Wikispeech',
                                'Utterances: ' . var_export( $utterances, true )
diff --git a/extension.json b/extension.json
index 850c2b9..8b1b9d1 100644
--- a/extension.json
+++ b/extension.json
@@ -83,23 +83,23 @@
                "WikispeechKeyboardShortcuts": {
                        "playStop": {
                                "key": 32,
-                               "modifiers": [ "ctrl" ]
+                               "modifiers": [ "alt", "shift" ]
                        },
                        "skipAheadSentence": {
                                "key": 39,
-                               "modifiers": [ "ctrl" ]
+                               "modifiers": [ "alt", "shift" ]
                        },
                        "skipBackSentence": {
                                "key": 37,
-                               "modifiers": [ "ctrl" ]
+                               "modifiers": [ "alt", "shift" ]
                        },
                        "skipAheadWord": {
                                "key": 40,
-                               "modifiers": [ "ctrl" ]
+                               "modifiers": [ "alt", "shift" ]
                        },
                        "skipBackWord": {
                                "key": 38,
-                               "modifiers": [ "ctrl" ]
+                               "modifiers": [ "alt", "shift" ]
                        }
                },
                "WikispeechSkipBackRewindsThreshold": 3.0
diff --git a/includes/Cleaner.php b/includes/Cleaner.php
index a65e0d6..ab58f48 100644
--- a/includes/Cleaner.php
+++ b/includes/Cleaner.php
@@ -25,12 +25,12 @@
                // Only add elements below the dummy element. These are the
                // elements from the original HTML.
                $top = $xpath->evaluate( '/meta/dummy' )->item( 0 );
-               $cleanedContent = [];
+               $cleanedContents = [];
                self::addContent(
-                       $cleanedContent,
+                       $cleanedContents,
                        $top
                );
-               return $cleanedContent;
+               return $cleanedContents;
        }
 
        /**
diff --git a/includes/Segmenter.php b/includes/Segmenter.php
index 249bbf5..170e8e1 100644
--- a/includes/Segmenter.php
+++ b/includes/Segmenter.php
@@ -13,30 +13,30 @@
         *
         * A segment is an array with the keys "content", "startOffset"
         * and "endOffset". "content" is an array of `CleanedText`s.
-
-        * "startOffset" is the position of the first character of the
+        * "startOffset" is the offset of the first character of the
         * segment, within the text node it appears. "endOffset" is the
-        * position of the last character of the segment, within the text
+        * offset of the last character of the segment, within the text
         * node it appears. These are used to determine start and end of a
         * segment in the original HTML.
         *
         * A sentence is here defined as a number of tokens ending with a
-        * dot (full stop). Headings are also considered sentences.
+        * dot (full stop).
         *
         * @since 0.0.1
-        * @param array $cleanedContent An array of `CleanedText`s, as
+        * @param array $cleanedContents An array of `CleanedText`s, as
         *  returned by `Cleaner::cleanHtml()`.
         * @return array An array of segments, each containing the
         *  `CleanedText's in that segment.
         */
 
-       public static function segmentSentences( $cleanedContent ) {
+       public static function segmentSentences( $cleanedContents ) {
                $segments = [];
                $currentSegment = [
                        'content' => [],
-                       'startOffset' => 0
+                       'startOffset' => null,
+                       'endOffset' => null
                ];
-               foreach ( $cleanedContent as $content ) {
+               foreach ( $cleanedContents as $content ) {
                        self::addSegments(
                                $segments,
                                $currentSegment,
@@ -53,21 +53,14 @@
        /**
         * Add segments for a string.
         *
-        * Looks for sentence final string (strings which a sentence ends
+        * Looks for sentence final strings (strings which a sentence ends
         * with). When a sentence final string is found, it's sentence is
-        * added to the $currentSegment, which in turn is added to
-        * $segments. An empty array is created as the new
-        * $currentSegment.
-        *
-        * When the end of string is reached, the remaining string is
-        * added to $currentSegment. Subsequent segment parts will be
-        * added to this semgent.
+        * added to the $currentSegment.
         *
         * @since 0.0.1
         * @param array $segments The segment array to add new segments to.
-        * @param array $currentSegment The segment under construction, to
-        *  which the first found string segment will be added.
-        * @param string $text The string to segment.
+        * @param array $currentSegment The segment under construction.
+        * @param CleanedText $text The text to segment.
         */
 
        private static function addSegments(
@@ -75,56 +68,122 @@
                &$currentSegment,
                $text
        ) {
-               // Find the indices of all characters that may be sentence 
final.
-               preg_match_all(
-                       "/\./",
-                       $text->string,
-                       $matches,
-                       PREG_OFFSET_CAPTURE
-               );
-               $offset = 0;
-               foreach ( $matches[0] as $match ) {
-                       $sentenceFinalPosition = $match[1];
-                       if (
-                               self::isSentenceFinal(
-                                       $text->string,
-                                       $sentenceFinalPosition
-                               )
-                       ) {
-                               $length = $sentenceFinalPosition - $offset + 1;
-                               $segmentText = substr( $text->string, $offset, 
$length );
-                               if ( trim( $segmentText ) != '' ) {
-                                       // Don't add segments with only 
whitespaces.
-                                       $newText = new CleanedText(
-                                               $segmentText,
-                                               $text->path
-                                       );
-                                       array_push( $currentSegment['content'], 
$newText );
-                                       $offset = $sentenceFinalPosition + 1;
-                                       $currentSegment['endOffset'] = $offset;
-                                       array_push( $segments, $currentSegment 
);
+               $nextStartOffset = 0;
+               do {
+                       $endOffset = self::addSegment(
+                               $segments,
+                               $currentSegment,
+                               $text,
+                               $nextStartOffset
+                       );
+                       // The earliest the next segments can start is one after
+                       // the end of the current one.
+                       $nextStartOffset = $endOffset + 1;
+               } while ( $nextStartOffset < mb_strlen( $text->string ) - 1 );
+       }
 
-                                       // Create the next segment, which 
should start at
-                                       // the position after the end position 
of the
-                                       // current one.
-                                       $currentSegment = [
-                                               'content' => [],
-                                               'startOffset' => $offset
-                                       ];
-                               }
+       /**
+        * Add a sentence, or part thereof, to a segment.
+        *
+        * Finds the next sentence by sentence final characters and adds
+        * them to the segment under construction. If no sentence final
+        * character was found, all the remaining text is added. Stores
+        * start offset when the first text of a segment is added and end
+        * offset when the last is.
+        *
+        * @since 0.0.1
+        * @param array $segments The segment array to add new segments to.
+        * @param array $currentSegment The segment under construction.
+        * @param CleanedText $text The text to segment.
+        * @param int $startOffset The offset where the next sentence can
+        *  start, at the earliest. If the sentence has leading
+        *  whitespaces, this will be moved forward.
+        * @return int The offset of the last character in the
+        *   sentence. If the sentence didn't end yet, this is the last
+        *   character of $text.
+        */
+
+       private static function addSegment(
+               &$segments,
+               &$currentSegment,
+               $text,
+               $startOffset=0
+       ) {
+               if ( $currentSegment['startOffset'] === null ) {
+                       // Move the start offset ahead by the number of leading
+                       // whitespaces. This means that whitespaces before or
+                       // between segments aren't included.
+                       $leadingWhitespacesLength = 
self::getLeadingWhitespacesLength(
+                               mb_substr( $text->string, $startOffset )
+                       );
+                       $startOffset += $leadingWhitespacesLength;
+               }
+               // Get the offset for the next sentence final character.
+               $endOffset = self::getSentenceFinalOffset(
+                       $text->string,
+                       $startOffset
+               );
+               // If no sentence final character is found, add the rest of
+               // the text and remember that this segment isn't ended.
+               $ended = true;
+               if ( $endOffset === null ) {
+                       $endOffset = mb_strlen( $text->string ) - 1;
+                       $ended = false;
+               }
+               $sentence = mb_substr(
+                       $text->string,
+                       $startOffset,
+                       $endOffset - $startOffset + 1
+               );
+               $sentenceText = new CleanedText(
+                       $sentence,
+                       $text->path
+               );
+               array_push( $currentSegment['content'], $sentenceText );
+               if ( $currentSegment['startOffset'] === null ) {
+                       // Record the start offset if this is the first text 
added
+                       // to the segment.
+                       $currentSegment['startOffset'] = $startOffset;
+               }
+               $currentSegment['endOffset'] = $endOffset;
+               if ( $ended ) {
+                       array_push( $segments, $currentSegment );
+                       // Create a fresh segment to add following text to.
+                       $currentSegment = [
+                               'content' => [],
+                               'startOffset' => null,
+                               'endOffset' => null
+                       ];
+               }
+               return $endOffset;
+       }
+
+       /**
+        * Get the offset of the first sentence final character in a string.
+        *
+        * @since 0.0.1
+        * @param string $string The string to look in.
+        * @param int $offset The offset to start looking from.
+        * @return int The offset of the first sentence final character
+        *  that was found, if any, else null.
+        */
+
+       private static function getSentenceFinalOffset( $string, $offset ) {
+               // This is needed to do this in a do-while loop.
+               $offset --;
+               do {
+                       // Find the next character that may be sentence final.
+                       $offset = mb_strpos( $string, '.', $offset + 1 );
+                       if ( $offset === false ) {
+                               // No character that can be sentence final was 
found.
+                               return null;
                        }
-               }
-               $remainder = substr( $text->string, $offset );
-               if ( $remainder ) {
-                       // Add any remaining part of the string.
-                       $remainderText = new CleanedText( $remainder, 
$text->path );
-                       array_push( $currentSegment['content'], $remainderText 
);
-                       $currentSegment['endOffset'] = $offset + strlen( 
$remainder );
-               } else {
-                       // If there was no remainder, the next segment will 
always
-                       // start at the beginning of the text.
-                       $currentSegment['startOffset'] = 0;
-               }
+                       // This is needed since mb_strops returns bytes rather
+                       // than index.
+                       $sentence = mb_substr( $string, 0, $offset );
+                       $offset = mb_strlen( $sentence );
+               } while ( !self::isSentenceFinal( $string, $offset ) );
+               return $offset;
        }
 
        /**
@@ -141,14 +200,14 @@
         */
 
        private static function isSentenceFinal( $string, $index ) {
-               $character = $string[$index];
+               $character = mb_substr( $string, $index, 1 );
                $nextCharacter = null;
-               if ( strlen( $string ) > $index + 1 ) {
-                       $nextCharacter = $string[$index + 1];
+               if ( mb_strlen( $string ) > $index + 1 ) {
+                       $nextCharacter = mb_substr( $string, $index + 1, 1 );
                }
                $characterAfterNext = null;
-               if ( strlen( $string ) > $index + 2 ) {
-                       $characterAfterNext = $string[$index + 2];
+               if ( mb_strlen( $string ) > $index + 2 ) {
+                       $characterAfterNext = mb_substr( $string, $index + 2, 1 
);
                }
                if (
                        $character == '.' &&
@@ -173,6 +232,20 @@
         */
 
        private static function isUpper( $string ) {
-               return mb_strtoupper( $string, 'UTF-8' ) == $string;
+               return mb_strtoupper( $string ) == $string;
+       }
+
+       /**
+        * Get the number of whitespaces at the start of a string.
+        *
+        * @since 0.0.1
+        * @param string $string The string to count leading whitespaces
+        *  for.
+        * @return int The number of whitespaces at the start of $string.
+        */
+
+       private static function getLeadingWhitespacesLength( $string ) {
+               $trimmedString = ltrim( $string );
+               return mb_strlen( $string ) - mb_strlen( $trimmedString );
        }
 }
diff --git a/modules/ext.wikispeech.js b/modules/ext.wikispeech.js
index 8423a4a..62c9ae0 100644
--- a/modules/ext.wikispeech.js
+++ b/modules/ext.wikispeech.js
@@ -603,7 +603,9 @@
                 */
 
                this.addKeyboardShortcuts = function () {
-                       var shortcuts = mw.config.get( 
'wgWikispeechKeyboardShortcuts' );
+                       var shortcuts, name, shortcut;
+
+                       shortcuts = mw.config.get( 
'wgWikispeechKeyboardShortcuts' );
                        $( document ).keydown( function ( event ) {
                                if ( self.eventMatchShortcut( event, 
shortcuts.playStop ) ) {
                                        self.playOrStop();
@@ -632,6 +634,18 @@
                                        return false;
                                }
                        } );
+                       // Prevent keyup events from triggering if there is
+                       // keydown event for the same key combination. This 
caused
+                       // buttons in focus to trigger if a shortcut had space 
as
+                       // key.
+                       $( document ).keyup( function ( event ) {
+                               for ( name in shortcuts ) {
+                                       shortcut = shortcuts[ name ];
+                                       if ( self.eventMatchShortcut( event, 
shortcut ) ) {
+                                               event.preventDefault();
+                                       }
+                               }
+                       } );
                };
 
                /**
diff --git a/tests/phpunit/SegmenterTest.php b/tests/phpunit/SegmenterTest.php
index 09800d0..150b881 100644
--- a/tests/phpunit/SegmenterTest.php
+++ b/tests/phpunit/SegmenterTest.php
@@ -12,18 +12,23 @@
 class SegmenterTest extends MediaWikiTestCase {
        public function testSegmentSentences() {
                $cleanedContent = [
-                       new CleanedText( 'Sentence 1. Sentence 2.' )
+                       new CleanedText( 'Sentence 1. Sentence 2. Sentence 3.' )
                ];
                $expectedSegments = [
                        [
                                'startOffset' => 0,
-                               'endOffset' => 11,
+                               'endOffset' => 10,
                                'content' => [ new CleanedText( 'Sentence 1.' ) 
]
                        ],
                        [
-                               'startOffset' => 11,
-                               'endOffset' => 23,
-                               'content' => [ new CleanedText( ' Sentence 2.' 
) ]
+                               'startOffset' => 12,
+                               'endOffset' => 22,
+                               'content' => [ new CleanedText( 'Sentence 2.' ) 
]
+                       ],
+                       [
+                               'startOffset' => 24,
+                               'endOffset' => 34,
+                               'content' => [ new CleanedText( 'Sentence 3.' ) 
]
                        ]
                ];
                $segments = Segmenter::segmentSentences( $cleanedContent );
@@ -72,25 +77,130 @@
                $cleanedContent = [ new CleanedText( 'Sentence. No sentence 
final' ) ];
                $segments = Segmenter::segmentSentences( $cleanedContent );
                $this->assertEquals(
-                       [ new CleanedText( ' No sentence final' ) ],
+                       [ new CleanedText( 'No sentence final' ) ],
                        $segments[1]['content']
                );
-               $this->assertEquals( 27, $segments[1]['endOffset'] );
+               $this->assertEquals( 26, $segments[1]['endOffset'] );
+       }
+
+       public function testTextFromMultipleNodes() {
+               $cleanedContent = [
+                       new CleanedText( 'Sentence split ' ),
+                       new CleanedText( 'by' ),
+                       new CleanedText( ' tags.' )
+               ];
+               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       0,
+                       $segments[0]['startOffset']
+               );
+               $this->assertEquals(
+                       5,
+                       $segments[0]['endOffset']
+               );
+               $this->assertEquals(
+                       'Sentence split ',
+                       $segments[0]['content'][0]->string
+               );
+               $this->assertEquals(
+                       'by',
+                       $segments[0]['content'][1]->string
+               );
+               $this->assertEquals(
+                       ' tags.',
+                       $segments[0]['content'][2]->string
+               );
+       }
+
+       public function testStartOffsetForMultipleTextNodes() {
+               $cleanedContent = [
+                       new CleanedText( 'First sentence. Split' ),
+                       new CleanedText( 'sentence. And other sentence.' ),
+               ];
+               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       16,
+                       $segments[1]['startOffset']
+               );
+               $this->assertEquals(
+                       10,
+                       $segments[2]['startOffset']
+               );
        }
 
        public function testTextOffset() {
                $cleanedContent = [ new CleanedText( 'Sentence.' ) ];
                $segments = Segmenter::segmentSentences( $cleanedContent );
                $this->assertEquals( 0, $segments[0]['startOffset'] );
+               $this->assertEquals( 8, $segments[0]['endOffset'] );
+       }
+
+       public function testSegmentTextWithUnicodeChars() {
+               $cleanedContent = [
+                       new CleanedText(
+                               'Normal sentence. Utterance with å. Another 
normal sentence.'
+                       )
+               ];
+               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       'Utterance with å.',
+                       $segments[1]['content'][0]->string
+               );
+               $this->assertEquals( 17, $segments[1]['startOffset'] );
+               $this->assertEquals( 33, $segments[1]['endOffset'] );
+               $this->assertEquals(
+                       'Another normal sentence.',
+                       $segments[2]['content'][0]->string
+               );
+               $this->assertEquals( 35, $segments[2]['startOffset'] );
+               $this->assertEquals( 58, $segments[2]['endOffset'] );
+       }
+
+       public function testDontCreateEmptyTextForWhitespaces() {
+               $cleanedContent = [
+                       new CleanedText( 'Sentence 1. ' ),
+                       new CleanedText( 'Sentence 2.' )
+               ];
+               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       'Sentence 1.',
+                       $segments[0]['content'][0]->string
+               );
+               $this->assertEquals(
+                       'Sentence 2.',
+                       $segments[1]['content'][0]->string
+               );
+       }
+
+       public function testRemoveLeadingAndTrailingWhitespaces() {
+               $cleanedContent = [ new CleanedText( ' Sentence. ' ) ];
+               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       'Sentence.',
+                       $segments[0]['content'][0]->string
+               );
+               $this->assertEquals( 1, $segments[0]['startOffset'] );
                $this->assertEquals( 9, $segments[0]['endOffset'] );
        }
 
-       public function testTextOffsetMultipleUtterances() {
-               $cleanedContent = [ new CleanedText( 'Sentence one. Sentence 
two.' ) ];
+       public function testLastTextIsOnlySentenceFinalCharacter() {
+               $cleanedContent = [
+                       new CleanedText( 'Sentence one' ),
+                       new CleanedText( '. ' ),
+                       new CleanedText( 'Sentence two.' )
+               ];
                $segments = Segmenter::segmentSentences( $cleanedContent );
-               $this->assertEquals( 0, $segments[0]['startOffset'] );
-               $this->assertEquals( 13, $segments[0]['endOffset'] );
-               $this->assertEquals( 13, $segments[1]['startOffset'] );
-               $this->assertEquals( 27, $segments[1]['endOffset'] );
+               $this->assertEquals(
+                       'Sentence one',
+                       $segments[0]['content'][0]->string
+               );
+               $this->assertEquals(
+                       '.',
+                       $segments[0]['content'][1]->string
+               );
+               $this->assertEquals(
+                       'Sentence two.',
+                       $segments[1]['content'][0]->string
+               );
        }
 }
diff --git a/tests/qunit/ext.wikispeech.test.js 
b/tests/qunit/ext.wikispeech.test.js
index 45d1458..e5fea89 100644
--- a/tests/qunit/ext.wikispeech.test.js
+++ b/tests/qunit/ext.wikispeech.test.js
@@ -869,18 +869,18 @@
                $( '#utterance-0' ).append(
                        $( '<content></content>' )
                                .append( $( '<text></text>' )
-                                       .text( ' Utterance two.' )
+                                       .text( 'Utterance two.' )
                                        .attr( 'path', './text()' )
                                )
                );
-               $( '#utterance-0' ).attr( 'start-offset', 14 );
+               $( '#utterance-0' ).attr( 'start-offset', 15 );
                $( '#utterance-0' ).attr( 'end-offset', 28 );
 
                wikispeech.highlightUtterance( $( '#utterance-0' ) );
 
                assert.strictEqual(
                        $( '#mw-content-text' ).prop( 'innerHTML' ),
-                       'Utterance one.<span 
class="ext-wikispeech-highlight-sentence"> Utterance two.</span> Utterance 
three.'
+                       'Utterance one. <span 
class="ext-wikispeech-highlight-sentence">Utterance two.</span> Utterance 
three.'
                );
        } );
 
@@ -895,7 +895,7 @@
                        $( '<content></content>' )
                                .append(
                                        $( '<text></text>' )
-                                               .text( 'Utterance with ' )
+                                               .text( 'Utterance with' )
                                                .attr( 'path', './p/text()[1]' )
                                )
                                .append(
@@ -905,7 +905,7 @@
                                )
                                .append(
                                        $( '<text></text>' )
-                                               .text( ' tag.' )
+                                               .text( 'tag.' )
                                                .attr( 'path', './p/text()[2]' )
                                )
                );

-- 
To view, visit https://gerrit.wikimedia.org/r/344616
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8e32637a51857e383ed1fae5a83fa04b0a978deb
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/Wikispeech
Gerrit-Branch: master
Gerrit-Owner: Sebastian Berlin (WMSE) <sebastian.ber...@wikimedia.se>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

[MediaWiki-commits] [Gerrit] mediawiki...Wikispeech[master]: Calculate correct offsets for unicode characters

Reply via email to