[MediaWiki-commits] [Gerrit] wikimedia/textcat[master]: Move ambiguity detection into main TextCat module.

jenkins-bot (Code Review) Mon, 19 Dec 2016 00:24:06 -0800

jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/327364 )


Change subject: Move ambiguity detection into main TextCat module.
......................................................................


Move ambiguity detection into main TextCat module.

Move ambiguity detection (using results ratio and max returned
languages) into the main TextCat module. These are still set with -u and
-a in the driver, catus.php.
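
The filtering described above can be sketched outside the class as
follows. This is a minimal illustration, not the code merged in this
change: the function name filterAmbiguous() and the sample scores are
made up, and it assumes (as the asort() call in classify() suggests)
that lower scores are better.

```php
<?php
/**
 * Hypothetical stand-alone helper (not part of TextCat).
 * Returns array( $languages, $status ); $status is '' on success.
 */
function filterAmbiguous( array $scores, $resultsRatio = 1.05, $maxReturnedLanguages = 10 ) {
	if ( count( $scores ) === 0 ) {
		return array( array(), 'No match found.' );
	}

	// Best (lowest) score first; keys (language codes) are preserved.
	asort( $scores );

	// Keep only candidates scoring within best * resultsRatio.
	$max = reset( $scores ) * $resultsRatio;
	$kept = array_filter( $scores, function ( $score ) use ( $max ) {
		return $score <= $max;
	} );

	// Too many near-ties means the result is too ambiguous to report.
	if ( count( $kept ) > $maxReturnedLanguages ) {
		return array( array(), 'Cannot determine language.' );
	}

	return array( $kept, '' );
}

// Made-up scores, for illustration only.
$scores = array( 'es' => 1150, 'pt' => 1000, 'fr' => 1290 );
list( $langs, $status ) = filterAmbiguous( $scores, 1.20, 10 );
echo implode( ' OR ', array_keys( $langs ) ), "\n";
```

With a ratio of 1.20 the sketch keeps pt and es; lowering the third
argument to 1 makes the same input return an empty list with the
"Cannot determine language." status.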

Moved status messages from catus.php to string constants in
TextCat.php; the current status is stored in the resultStatus variable.
Added test cases for status messages.

Changed -t in catus.php to -m. The Perl original has two separate flags,
-t and -m, to control model size; catus.php only has one; -m is more
mnemonic for "model size".

Added -w for word separator(s) to catus.php, plus tests.
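
As a rough illustration of what the -w value controls: TextCat.php
interpolates the separator string into a regex character class and
strips matching characters before the minimum-length check. The helper
name wordLength() below is hypothetical; the default separator is the
one shown in the catus.php help text.

```php
<?php
/**
 * Hypothetical helper (not a TextCat method): counts the characters
 * left after removing everything matched by the separator class.
 */
function wordLength( $text, $wordSeparator = '0-9\s\(\)' ) {
	// $wordSeparator is the *inside* of a character class, e.g. 'a-e\s'.
	$stripped = preg_replace( "/[{$wordSeparator}]+/", '', $text );

	// The encoding is passed explicitly here; this is an assumption of the sketch.
	return mb_strlen( $stripped, 'UTF-8' );
}

// Digits, whitespace and parentheses are ignored by default.
echo wordLength( 'cat (2016)' ), "\n";

// A deliberately odd separator set, like the one used in testWordSep().
echo wordLength( 'espanol', 'a-e\s' ), "\n";
```

An input made up only of separator characters yields a length of 0,
which is the case classify() reports as "Input is too short."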

General syntax tidying, esp. in TextCatTest.php.

Update README file.

Bug: T153105
Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
---
M README.md
M TextCat.php
M catus.php
M tests/TextCatTest.php
4 files changed, 310 insertions(+), 76 deletions(-)

Approvals:
  Smalyshev: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/README.md b/README.md
index a8d8be5..1f7f9bb 100644
--- a/README.md
+++ b/README.md
@@ -6,13 +6,17 @@
 
 This is a PHP port of the TextCat language guesser utility.
 
-Please see http://odur.let.rug.nl/~vannoord/TextCat/ for the original one.
+Please see also the [original Perl
+version](http://odur.let.rug.nl/~vannoord/TextCat/), and an [updated
+Perl version](https://github.com/Trey314159/TextCat).
 
 ## Contents
 
-The package contains the classifier class itself and two tools—for classifying the texts and for generating the ngram database.
-The code now assumes the text encoding is UTF-8, since it's easier to extract ngrams this way.
-Also, everybody uses UTF-8 now and I, for one, welcome our new UTF-8-encoded overlords.
+The package contains the classifier class itself and two tools—for
+classifying the texts and for generating the ngram database. The code
+now assumes the text encoding is UTF-8, since it's easier to extract
+ngrams this way. Also, everybody uses UTF-8 now and I, for one, welcome
+our new UTF-8-encoded overlords.
 
 ### Classifier
 
@@ -28,48 +32,128 @@
 
     fr OR ro
 
-Please note that the provided collection of language models includes a model for Oriya (ଓଡ଼ିଆ), which has the language code `or`, so results like `or OR sco OR ro OR nl` are possible.
+Please note that the provided collection of language models includes a
+model for Oriya (ଓଡ଼ିଆ), which has the language code `or`, so results
+like `or OR sco OR ro OR nl` are possible.
 
 ### Generator
 
-To generate the language model database from a set of texts, use the script `felis.php`. It can be run as:
+To generate the language model database from a set of texts, use the
+script `felis.php`. It can be run as:
 
     php felis.php INPUTDIR OUTPUTDIR
 
-And will read texts from `INPUTDIR` and generate ngrams files in `OUTPUTDIR`.
-The files in `INPUTDIR` are assumed to have names like `LANGUAGE.txt`, e.g. `english.txt`, `german.txt`, `klingon.txt`, etc.
+And will read texts from `INPUTDIR` and generate ngrams files in
+`OUTPUTDIR`. The files in `INPUTDIR` are assumed to have names like
+`LANGUAGE.txt`, e.g. `english.txt`, `german.txt`, `klingon.txt`, etc.
 
-If you are working with sizable corpora (e.g., millions of characters), you should set `$minFreq` in `TextCat.php` to a reasonably small value, like `10`, to trim the very long tail of infrequent ngrams before they are sorted. This reduces the CPU and memory requirements for generating the language models. When *evaluating* texts, `$minFreq` should be set back to `0` unless your input texts are fairly large.
+If you are working with sizable corpora (e.g., millions of characters),
+you should set `$minFreq` in `TextCat.php` to a reasonably small value,
+like `10`, to trim the very long tail of infrequent ngrams before they
+are sorted. This reduces the CPU and memory requirements for generating
+the language models. When *evaluating* texts, `$minFreq` should be set
+back to `0` unless your input texts are fairly large.
 
 ## Models
 
-The package comes with a default language model database in the `LM` directory and a query-based language model database in the `LM-query` directory. However, model performance will depend a lot on the text corpus it will be applied to, as well as specific modifications—e.g. capitalization, diacritics, etc. Currently the library does not modify or normalize either training texts or classified texts in any way, so usage of custom language models may be more appropriate for specific applications.
+The package comes with a default language model database in the `LM`
+directory and a query-based language model database in the `LM-query`
+directory. However, model performance will depend a lot on the text
+corpus it will be applied to, as well as specific modifications—e.g.
+capitalization, diacritics, etc. Currently the library does not modify
+or normalize either training texts or classified texts in any way, so
+usage of custom language models may be more appropriate for specific
+applications.
 
-Model names use [Wikipedia language codes](https://en.wikipedia.org/wiki/List_of_Wikipedias), which are often but not guaranteed to be the same as [ISO 639 language codes](https://en.wikipedia.org/wiki/ISO_639).
+Model names use [Wikipedia language
+codes](https://en.wikipedia.org/wiki/List_of_Wikipedias), which are
+often but not guaranteed to be the same as [ISO 639 language
+codes](https://en.wikipedia.org/wiki/ISO_639).
 
-When detecting languages, you will generally get better results when you can limit the number of language models in use. For example, if there is virtually no chance that your text could be in Irish Gaelic, including the Irish Gaelic language model (`ga`) only increases the likelihood of mis-identification. This is particularly true for closely related languages (e.g., the Romance languages, or English/`en` and Scots/`sco`).
+When detecting languages, you will generally get better results when you
+can limit the number of language models in use, especially with very
+short texts. For example, if there is virtually no chance that your text
+could be in Irish Gaelic, including the Irish Gaelic language model
+(`ga`) only increases the likelihood of mis-identification. This is
+particularly true for closely related languages (e.g., the Romance
+languages, or English/`en` and Scots/`sco`).
 
-Limiting the number of language models used also generally improves performance. You can copy your desired language models into a new directory (and use `-d` with `catus.php`) or specify your desired languages on the command line (use `-c` with `catus.php`).
+Limiting the number of language models used also generally improves
+performance. You can copy your desired language models into a new
+directory (and use `-d` with `catus.php`) or specify your desired
+languages on the command line (use `-c` with `catus.php`).
+
+You can also combine models in multiple directories (e.g., to use the
+query-based models with a fallback to Wiki-Text-based models) with a
+comma-separated list of directories (use `-d` with `catus.php`).
+Directories are scanned in order, and only the first model found with a
+particular name will be used.
 
 ### Wiki-Text models
 
-The 70 language models in `LM` are based on text extracted from randomly chosen articles from the Wikipedia for that language. The languages included were chosen based on a number of criteria, including the number of native speakers of the language, the number of queries to the various wiki projects in the language (not just Wikipedia), the list of languages supported by the original TextCat, and the size of Wikipedia in the language (i.e., the size of the collection from which to draw a training corpus).
+The 70 language models in `LM` are based on text extracted from randomly
+chosen articles from the Wikipedia for that language. The languages
+included were chosen based on a number of criteria, including the number
+of native speakers of the language, the number of queries to the various
+wiki projects in the language (not just Wikipedia), the list of
+languages supported by the original TextCat, and the size of Wikipedia
+in the language (i.e., the size of the collection from which to draw a
+training corpus).
 
-The training corpus for each language was originally made up of ~2.7 to ~2.8M million characters, excluding markup. The texts were then lightly preprocessed. Preprocessing steps taken include: HTML Tags were removed. Lines were sorted and `uniq`-ed (so that Wikipedia idiosyncrasies—like "References", "See Also", and "This article is a stub"—are not over-represented, and so that articles randomly selected more than once were reduced to one copy). For corpora in Latin character sets, lines containing no Latin characters were removed. For corpora in non-Latin character sets, lines containing only Latin characters, numbers, and punctuation were removed. This character-set-based filtering removed from dozens to thousands of lines from the various corpora. For corpora in multiple character sets (e.g., Serbo-Croatian/`sh`, Serbian/`sr`, Turkmen/`tk`), no such character-set-based filtering was done. The final size of the training corpora ranged from ~1.8M to ~2.8M characters.
+The training corpus for each language was originally made up of ~2.7 to
+~2.8M million characters, excluding markup. The texts were then lightly
+preprocessed. Preprocessing steps taken include: HTML Tags were removed.
+Lines were sorted and `uniq`-ed (so that Wikipedia idiosyncrasies—like
+"References", "See Also", and "This article is a stub"—are not
+over-represented, and so that articles randomly selected more than once
+were reduced to one copy). For corpora in Latin character sets, lines
+containing no Latin characters were removed. For corpora in non-Latin
+character sets, lines containing only Latin characters, numbers, and
+punctuation were removed. This character-set-based filtering removed
+from dozens to thousands of lines from the various corpora. For corpora
+in multiple character sets (e.g., Serbo-Croatian/`sh`, Serbian/`sr`,
+Turkmen/`tk`), no such character-set-based filtering was done. The final
+size of the training corpora ranged from ~1.8M to ~2.8M characters.
 
-These models have not been tested and are provided as-is. We may add new models or remove poorly-performing models in the future.
+These models have not been tested and are provided as-is. We may add new
+models or remove poorly-performing models in the future.
 
-These models have 4000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 4000 ngrams. The best number of ngrams to use for
+language identification is application-dependent. For larger texts
+(e.g., containing hundreds of words per sample), significantly smaller
+ngram sets may be best. You can set the number to be used by changing
+`$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with
+`catus.php`.
 
 ### Wiki Query Models.
 
-The 19 language models in `LM-query` are based on query data from Wikipedia which is less formal (e.g., fewer diacritics are used in languages that have them) and has a different distribution of words than general text. The original set of languages considered was based on the number of queries across all wiki projects for a particular week. The text has been preprocessed and many queries were removed from the training sets according to a process similar to that used on the Wiki-Text models above.
+The 19 language models in `LM-query` are based on query data from
+Wikipedia which is less formal (e.g., fewer diacritics are used in
+languages that have them) and has a different distribution of words than
+general text. The original set of languages considered was based on the
+number of queries across all wiki projects for a particular week. The
+text has been preprocessed and many queries were removed from the
+training sets according to a process similar to that used on the
+Wiki-Text models above.
 
-In general, query data is much messier than Wiki-Text—including junk text and queries in unexpected languages—but the overall performance on query strings, at least for English Wikipedia—is better.
+In general, query data is much messier than Wiki-Text—including junk
+text and queries in unexpected languages—but the overall performance on
+query strings, at least for English Wikipedia—is better.
 
-The final set of models provided is based in part on their performance on English Wikipedia queries (the first target for language ID using TextCat). For more details see our [initial report](https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat) on TextCat. More languages will be added in the future based on additional performance evaluations.
+The final set of models provided is based in part on their performance
+on English Wikipedia queries (the first target for language ID using
+TextCat). For more details see our [initial
+report](https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/
+Language_Detection_with_TextCat) on TextCat. More languages will be
+added in the future based on additional performance evaluations.
 
-These models have 5000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. For short query seen on English Wikipedia strings, a model size of 3000 ngrams has worked best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 5000 ngrams. The best number of ngrams to use for
+language identification is application-dependent. For larger texts
+(e.g., containing hundreds of words per sample), significantly smaller
+ngram sets may be best. For short query seen on English Wikipedia
+strings, a model size of 3000 ngrams has worked best. You can set the
+number to be used by changing `$maxNgrams` in `TextCat.php` or in
+`felis.php`, or using `-m` with `catus.php`.
 
 
 [![Build Status](https://travis-ci.org/smalyshev/textcat.svg?branch=master)](https://travis-ci.org/smalyshev/textcat)
diff --git a/TextCat.php b/TextCat.php
index 1ede010..7fa9540 100644
--- a/TextCat.php
+++ b/TextCat.php
@@ -6,6 +6,17 @@
  */
 class TextCat {
 
+       const STATUSTOOSHORT = 'Input is too short.';
+       const STATUSNOMATCH = 'No match found.';
+       const STATUSAMBIGUOUS = 'Cannot determine language.';
+
+       /**
+        * Minimum input length to be considered for
+        * classification
+        * @var string
+        */
+       private $resultStatus = '';
+
        /**
         * Number of ngrams to be used.
         * @var int
@@ -33,11 +44,34 @@
        private $langFiles = array();
 
        /**
-        * Minimum Input Length to be considered for
+        * Minimum input length to be considered for
         * classification
         * @var int
         */
        private $minInputLength = 0;
+
+       /**
+        * Maximum ratio of the score between a given
+        * candidate and the best candidate for the
+        * given candidate to be considered an alternative.
+        * @var float
+        */
+       private $resultsRatio = 1.05;
+
+       /**
+        * Maximum number of languages to return, within
+        * the resultsRatio. If there are more, the result
+        * is too ambiguous.
+        * @var int
+        */
+       private $maxReturnedLanguages = 10;
+
+       /**
+        * @param
+        */
+       public function getResultStatus() {
+               return $this->resultStatus;
+       }
 
        /**
         * @param int $maxNgrams
@@ -58,6 +92,27 @@
         */
        public function setMinInputLength( $minInputLength ) {
                $this->minInputLength = $minInputLength;
+       }
+
+       /**
+        * @param float $resultsRatio
+        */
+       public function setResultsRatio( $resultsRatio ) {
+               $this->resultsRatio = $resultsRatio;
+       }
+
+       /**
+        * @param int $maxReturnedLanguages
+        */
+       public function setMaxReturnedLanguages( $maxReturnedLanguages ) {
+               $this->maxReturnedLanguages = $maxReturnedLanguages;
+       }
+
+       /**
+        * @param string $wordSeparator
+        */
+       public function setWordSeparator( $wordSeparator ) {
+               $this->wordSeparator = $wordSeparator;
        }
 
        /**
@@ -170,10 +225,12 @@
         */
        public function classify( $text, $candidates = null ) {
                $results = array();
+               $this->resultStatus = '';
 
                // strip non-word characters before checking for min length, don't assess empty strings
                $wordLength = mb_strlen( preg_replace( "/[{$this->wordSeparator}]+/", "", $text ) );
                if ( $wordLength < $this->minInputLength || $wordLength == 0 ) {
+                       $this->resultStatus = self::STATUSTOOSHORT;
                        return $results;
                }
 
@@ -197,7 +254,25 @@
                        }
                        $results[$language] = $p;
                }
+
                asort( $results );
+
+               // ignore any item that scores higher than best * resultsRatio
+               $max = reset( $results ) * $this->resultsRatio;
+               $results = array_filter( $results, function ( $res ) use ( $max ) { return $res <= $max;
+               } );
+
+               // if more than maxReturnedLanguages remain, the result is too ambiguous, so bail
+               if ( count( $results ) > $this->maxReturnedLanguages ) {
+                       $this->resultStatus = self::STATUSAMBIGUOUS;
+                       return array();
+               }
+
+               if ( count( $results ) == 0 ) {
+                       $this->resultStatus = self::STATUSNOMATCH;
+                       return $results;
+               }
+
                return $results;
        }
 }
diff --git a/catus.php b/catus.php
index 7be34ef..6ff20a2 100644
--- a/catus.php
+++ b/catus.php
@@ -4,18 +4,19 @@
  */
 require_once __DIR__.'/TextCat.php';
 
-$options = getopt( 'a:c:d:f:j:l:t:u:h' );
+$options = getopt( 'a:c:d:f:j:l:m:u:w:h' );
 
 if ( isset( $options['h'] ) ) {
        $help = <<<HELP
-{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-f Int] [-j Int] [-l Text] [-t Int] [-u Float]
+{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-u Float] [-l Text]
+           [-f Int] [-j Int] [-m Int] [-w String]
 
     -a NUM  The program returns the best-scoring language together
             with all languages which are <N times worse (set by option -u).
             If the number of languages to be printed is larger than the value
-            of this option then no language is returned, but
-            instead a message that the input is of an unknown language is
-            printed. Default: 10.
+            of this option then no language is returned, but instead a
+            message that the input is of an unknown language is printed.
+            Default: 10.
     -c LANG,LANG,...
             Lists the candidate languages. Only languages listed will be
             considered for detection.
@@ -32,11 +33,13 @@
     -l TEXT Indicates that input is given as an argument on the command line,
             e.g. {$argv[0]} -l "this is english text"
             If this option is not given, the input is stdin.
-    -t NUM  Indicates the topmost number of ngrams that should be used.
+    -m NUM  Indicates the topmost number of ngrams that should be used.
             Default: 3000
     -u NUM  Determines how much worse result must be in order not to be
             mentioned as an alternative. Typical value: 1.05 or 1.1.
             Default: 1.05.
+    -w STRING
+            Regex for non-word characters. Default: '0-9\s\(\)'
 
 HELP;
        echo $help;
@@ -51,8 +54,8 @@
 
 $cat = new TextCat( $dirs );
 
-if ( !empty( $options['t'] ) ) {
-       $cat->setMaxNgrams( intval( $options['t'] ) );
+if ( !empty( $options['m'] ) ) {
+       $cat->setMaxNgrams( intval( $options['m'] ) );
 }
 if ( !empty( $options['f'] ) ) {
        $cat->setMinFreq( intval( $options['f'] ) );
@@ -60,6 +63,16 @@
 if ( isset( $options['j'] ) ) {
        $cat->setMinInputLength( intval( $options['j'] ) );
 }
+if ( !empty( $options['u'] ) ) {
+       $cat->setResultsRatio( floatval( $options['u'] ) );
+}
+if ( isset( $options['a'] ) ) {
+       $cat->setMaxReturnedLanguages( intval( $options['a'] ) );
+}
+if ( isset( $options['w'] ) ) {
+       $cat->setWordSeparator( $options['w'] );
+}
+
 
 $input = isset( $options['l'] ) ? $options['l'] : file_get_contents( "php://stdin" );
 if ( !empty( $options['c'] ) ) {
@@ -69,28 +82,9 @@
 }
 
 if ( empty( $result ) ) {
-       echo "No match found.\n";
+       echo $cat->getResultStatus() . "\n";
        exit( 1 );
 }
 
-if ( !empty( $options['u'] ) ) {
-       $max = reset( $result ) * $options['u'];
-} else {
-       $max = reset( $result ) * 1.05;
-}
-
-if ( !empty( $options['a'] ) ) {
-       $top = $options['a'];
-} else {
-       $top = 10;
-}
-$result = array_filter( $result, function ( $res ) use( $max ) { return $res < $max;
-
-} );
-if ( $result && count( $result ) <= $top ) {
-       echo join( " OR ", array_keys( $result ) ) . "\n";
-       exit( 0 );
-} else {
-       echo "Cannot determine language.\n";
-       exit( 1 );
-}
+echo join( " OR ", array_keys( $result ) ) . "\n";
+exit( 0 );
diff --git a/tests/TextCatTest.php b/tests/TextCatTest.php
index c6ce0e7..93e925b 100644
--- a/tests/TextCatTest.php
+++ b/tests/TextCatTest.php
@@ -11,10 +11,20 @@
 
        public function setUp()
        {
-               // initialze testcat with a string, and multicats with arrays
+               // initialize testcat with a string
                $this->testcat = new TextCat( __DIR__."/data/Models" );
-               $this->multicat1 = new TextCat( array(__DIR__."/../LM", __DIR__."/../LM-query" ) );
-               $this->multicat2 = new TextCat( array(__DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+               // initialize multicats with multi-element arrays
+               $this->multicat1 = new TextCat( array( __DIR__."/../LM", __DIR__."/../LM-query" ) );
+               $this->multicat2 = new TextCat( array( __DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+               // effectively disable RR-based filtering for these cats
+               $this->multicat1->setResultsRatio( 100 );
+               $this->multicat2->setResultsRatio( 100 );
+
+               // initialize ambiguouscat with a one-element array
+               $this->ambiguouscat = new TextCat( array( __DIR__."/../LM-query" ) );
+
        }
 
        public function testCreateLM()
@@ -66,7 +76,7 @@
                        if ( !$file->isFile() || $file->getExtension() != "txt" ) {
                                continue;
                        }
-                       $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename(".txt") . ".lm" );
+                       $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename( ".txt" ) . ".lm" );
                }
                return $data;
        }
@@ -81,7 +91,7 @@
                include $lmFile;
                $this->assertEquals(
                                $ngrams,
-                               $this->testcat->createLM( file_get_contents( $textFile ), 4000)
+                               $this->testcat->createLM( file_get_contents( $textFile ), 4000 )
                );
        }
 
@@ -106,21 +116,21 @@
     public function multiCatData()
     {
         return array(
-          array('this is english text français bisschen',
-                               array('sco', 'en', 'fr',  'de' ),
-                               array('fr',  'de', 'sco', 'en' ), ),
-          array('الاسم العلمي: Felis catu',
-                               array('ar', 'la', 'fa', 'fr' ),
-                               array('ar', 'fr', 'la', 'fa' ), ),
-          array('Кошка, или домашняя кошка A macska más néven házi macska',
-                               array('ru', 'uk', 'hu', 'fi' ),
-                               array('hu', 'ru', 'uk', 'fi' ), ),
-          array('Il gatto domestico Kucing disebut juga kucing domestik',
-                               array('id', 'it', 'pt', 'es' ),
-                               array('it', 'id', 'es', 'pt' ), ),
-          array('Domaća mačka Pisică de casă Hejma kato',
-                               array('hr', 'ro', 'eo', 'cs' ),
-                               array('hr', 'cs', 'ro', 'eo' ), ),
+          array( 'this is english text français bisschen',
+                               array( 'sco', 'en', 'fr',  'de' ),
+                               array( 'fr',  'de', 'sco', 'en' ), ),
+          array( 'الاسم العلمي: Felis catu',
+                               array( 'ar', 'la', 'fa', 'fr' ),
+                               array( 'ar', 'fr', 'la', 'fa' ), ),
+          array( 'Кошка, или домашняя кошка A macska más néven házi macska',
+                               array( 'ru', 'uk', 'hu', 'fi' ),
+                               array( 'hu', 'ru', 'uk', 'fi' ), ),
+          array( 'Il gatto domestico Kucing disebut juga kucing domestik',
+                               array( 'id', 'it', 'pt', 'es' ),
+                               array( 'it', 'id', 'es', 'pt' ), ),
+          array( 'Domaća mačka Pisică de casă Hejma kato',
+                               array( 'hr', 'ro', 'eo', 'cs' ),
+                               array( 'hr', 'cs', 'ro', 'eo' ), ),
         );
     }
 
@@ -165,13 +175,84 @@
                if ( !isset( $res ) ) {
                        $res = $lang;
                }
-               # should get results when min input len is 0
-               $minLength = $this->testcat->setMinInputLength(0);
+
+               // disable RR-based filtering
+               $this->testcat->setResultsRatio( 100 );
+
+               // should get results when min input len is 0
+               $this->testcat->setMinInputLength( 0 );
                $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
                                                         array_values( $res ) );
-        # should get no results when min input len is more than the length of the string
-        $minLength = $this->testcat->setMinInputLength(mb_strlen($testLine) + 1);
+               if ( !empty( $res ) ) {
+                       $this->assertEquals( $this->testcat->getResultStatus(), '' );
+               }
+
+        // should get no results when min input len is more than the length of the string
+        $this->testcat->setMinInputLength( mb_strlen( $testLine ) + 1 );
         $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
                              array() );
+               $this->assertEquals( $this->testcat->getResultStatus(), TextCat::STATUSTOOSHORT );
+
+               // reset to defaults
+               $this->testcat->setMinInputLength( 0 );
+               $this->testcat->setResultsRatio( 1.05 );
     }
+
+    public function ambiguityData()
+    {
+        return array(
+          array( 'espanol português', 1.05, 10, 3000, array( 'pt' ), '' ),
+          array( 'espanol português', 1.20, 10, 3000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.20,  2, 3000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.20,  1, 3000, array(), TextCat::STATUSAMBIGUOUS ),
+          array( 'espanol português', 1.30, 10, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+          array( 'espanol português', 1.30,  6, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+          array( 'espanol português', 1.30,  5, 3000, array(), TextCat::STATUSAMBIGUOUS ),
+          array( 'espanol português', 1.10, 20,  500,
+                       array( 'pt', 'es', 'it', 'fr', 'pl', 'cs', 'en', 'sv', 'de', 'id', 'nl' ), '' ),
+          array( 'espanol português', 1.10, 20,  700, array( 'pt', 'es', 'it', 'fr', 'en', 'de' ), '' ),
+          array( 'espanol português', 1.10, 20, 1000, array( 'pt', 'es', 'it', 'fr' ), '' ),
+          array( 'espanol português', 1.10, 20, 2000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.10, 20, 3000, array( 'pt' ), '' ),
+        );
+    }
+
+    /**
+     * @dataProvider ambiguityData
+        * @param string $testLine
+        * @param array $lang
+        * @param array $res
+     */
+    public function testAmbiguity( $testLine, $resRatio, $maxRetLang, $modelSize, $results, $errMsg )
+    {
+               $this->ambiguouscat->setMaxNgrams( $modelSize );
+               $this->ambiguouscat->setResultsRatio( $resRatio );
+               $this->ambiguouscat->setMaxReturnedLanguages( $maxRetLang );
+
+               $this->assertEquals( array_keys( $this->ambiguouscat->classify( $testLine ) ),
+                                                        array_values( $results ) );
+               $this->assertEquals( $this->ambiguouscat->getResultStatus(), $errMsg );
+    }
+
+       public function testNoMatch()
+       {
+               # no xxx.lm model exists, so get no match
+        $this->assertEquals( array_keys( $this->testcat->classify( "some string", array( "xxx" ) ) ),
+                                                        array() );
+               $this->assertEquals( $this->testcat->getResultStatus(), TextCat::STATUSNOMATCH );
+       }
+
+       public function testWordSep()
+       {
+               $this->testcat->setResultsRatio( 1.25 );
+               $this->testcat->setMaxReturnedLanguages( 20 );
+               $normalResults = $this->testcat->classify( "espanol português" );
+               $weirdResults = $this->testcat->classify( "sp nol português" );
+
+               // this is a non-sensical set of word separators, just for testing
+               $this->testcat->setWordSeparator( 'a-e\s' );
+               $this->assertNotEquals( $this->testcat->classify( "espanol português" ), $normalResults );
+               $this->assertEquals( $this->testcat->classify( "espanol português" ), $weirdResults );
+       }
+
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/327364
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
Gerrit-PatchSet: 4
Gerrit-Project: wikimedia/textcat
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>
Gerrit-Reviewer: Cindy-the-browser-test-bot <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>
Gerrit-Reviewer: Smalyshev <[email protected]>
Gerrit-Reviewer: Tjones <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
