[MediaWiki-commits] [Gerrit] mediawiki...CirrusSearch[master]: Deploy TextCat Improvements

2017-01-31 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/334728 )

Change subject: Deploy TextCat Improvements
..


Deploy TextCat Improvements

Update to TextCat 1.2.0.
Use multiple language model directories.
Add config for TextCat parameters and set at runtime.
Add TextCat tests that use parameters.
Fix misc typos, syntax, and EOL whitespace.

Bug: T149324
Change-Id: I20a82978aa7a046f885dfbdcbee93d4a13f71101
---
M CirrusSearch.php
M composer.json
M docs/settings.txt
M includes/LanguageDetector/TextCat.php
M tests/unit/LanguageDetectTest.php
5 files changed, 138 insertions(+), 38 deletions(-)

Approvals:
  Cindy-the-browser-test-bot: Looks good to me, but someone else must approve
  jenkins-bot: Verified
  DCausse: Looks good to me, approved
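
For readers configuring a wiki after this change, the fragment below is a
minimal sketch of how the settings touched by this commit might be set in
LocalSettings.php. The two model directory paths and all numeric values are
illustrative assumptions, not recommended or default values; only the variable
names and the parameter keys come from the change itself. See
vendor/wikimedia/textcat/TextCat.php for what each parameter does.

```php
<?php
// Sketch only: paths and numbers below are assumptions for illustration.

// List of directories searched for language models. The paths are
// hypothetical; point them at wherever your models actually live.
$wgCirrusSearchTextcatModel = [
	"$IP/vendor/wikimedia/textcat/LM-query",
	"$IP/vendor/wikimedia/textcat/LM",
];

// TextCat parameters set at runtime. Keys are the ones listed in this
// change; the values are placeholders, not tuned settings.
$wgCirrusSearchTextcatConfig = [
	'maxNgrams' => 9000,
	'maxReturnedLanguages' => 1,
	'resultsRatio' => 1.06,
	'minInputLength' => 3,
	'maxProportion' => 0.85,
	'langBoostScore' => 0.14,
	'numBoostedLangs' => 2,
];

// Optionally limit detection to languages with acceptable precision
// (example list taken from the documentation in this change).
$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
```

A single directory string for $wgCirrusSearchTextcatModel should still work,
since the patched TextCat.php wraps a non-array value in an array for backward
compatibility.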



diff --git a/CirrusSearch.php b/CirrusSearch.php
index 856db50..8ca2da4 100644
--- a/CirrusSearch.php
+++ b/CirrusSearch.php
@@ -957,14 +957,23 @@
 $wgCirrusSearchLanguageDetectors = [];
 
 /**
- * Directory where TextCat detector should look for language model
+ * List of directories where TextCat detector should look for language models
  */
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
+
+/**
+ * Configuration for specifying TextCat parameters.
+ * Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+ * minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+ * See vendor/wikimedia/textcat/TextCat.php
+ */
+
+$wgCirrusSearchTextcatConfig = [];
 
 /**
  * Limit the set of languages detected by Textcat.
- * Useful when some languages in the model have very bad precision, e.g.:
- * $wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+ * Useful when some languages in the model have too many false positives, e.g.:
+ * $wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
  */
 
 /**
diff --git a/composer.json b/composer.json
index a285c6b..ecbb259 100644
--- a/composer.json
+++ b/composer.json
@@ -5,6 +5,6 @@
"license": "GPL-2.0+",
"minimum-stability": "dev",
"require": {
-   "wikimedia/textcat": "1.1.3"
+   "wikimedia/textcat": "1.2.0"
}
 }
diff --git a/docs/settings.txt b/docs/settings.txt
index bfc0160..ad9b374 100644
--- a/docs/settings.txt
+++ b/docs/settings.txt
@@ -131,7 +131,7 @@
 Elasticsearch plugin that should produce better snippets for search results.
 Installation instructions are here: https://github.com/wikimedia/search-highlighter
 If you have the highlighter installed you can switch this on and off so long
-as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.  
+as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
 Setting it to true without the highlighter installed will break search.
 
 ; $wgCirrusSearchOptimizeIndexForExperimentalHighlighter
@@ -1269,9 +1269,19 @@
 ; $wgCirrusSearchTextcatModel
 
 Default:
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
 
-Directory where TextCat detector should look for language model.
+List of directories where TextCat detector should look for language models
+
+; $wgCirrusSearchTextcatConfig
+
+Default:
+$wgCirrusSearchTextcatConfig = null;
+
+Configuration for specifying TextCat parameters.
+Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+See vendor/wikimedia/textcat/TextCat.php
 
 ; $wgCirrusSearchTextcatLanguages
 
@@ -1281,7 +1291,7 @@
 Limit the set of languages detected by Textcat.
 Useful when some languages in the model have very bad precision, e.g.:
 
-$wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
 
 ; $wgCirrusSearchMasterTimeout
 
diff --git a/includes/LanguageDetector/TextCat.php b/includes/LanguageDetector/TextCat.php
index 94ef1c5..72c264d 100644
--- a/includes/LanguageDetector/TextCat.php
+++ b/includes/LanguageDetector/TextCat.php
@@ -22,22 +22,57 @@
// Should not happen
return null;
}
-   $dir = $config->getElement('CirrusSearchTextcatModel');
-   if( !$dir ) {
+   $dirs = $config->getElement('CirrusSearchTextcatModel');
+   if( !$dirs ) {
return null;
}
-   if( !is_dir( $dir ) ) {
-   LoggerFactory::getInstance( 'CirrusSearch' )->warning(
-   "Bad directory for TextCat model: {dir}",
-   [ "dir" => $dir ]
-   );
+   if ( !is_array( $dirs ) ) { // backward compatibility
+   $dirs = [ $dirs ];
+   }
+   foreach ($dirs as $dir) {
+   if( !is_dir( $dir ) ) {

[MediaWiki-commits] [Gerrit] mediawiki...CirrusSearch[master]: Deploy TextCat Improvements

2017-01-27 Thread Tjones (Code Review)
Tjones has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/334728 )

Change subject: Deploy TextCat Improvements
..

Deploy TextCat Improvements

Update to TextCat 1.2.0.
Use multiple language model directories.
Add config for TextCat parameters and set at runtime.
Add TextCat tests that use parameters.
Fix misc typos, syntax, and EOL whitespace.

Bug: T149324
Change-Id: I20a82978aa7a046f885dfbdcbee93d4a13f71101
---
M CirrusSearch.php
M composer.json
M docs/settings.txt
M includes/LanguageDetector/TextCat.php
M tests/unit/LanguageDetectTest.php
5 files changed, 119 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/28/334728/1

diff --git a/CirrusSearch.php b/CirrusSearch.php
index 856db50..8ca2da4 100644
--- a/CirrusSearch.php
+++ b/CirrusSearch.php
@@ -957,14 +957,23 @@
 $wgCirrusSearchLanguageDetectors = [];
 
 /**
- * Directory where TextCat detector should look for language model
+ * List of directories where TextCat detector should look for language models
  */
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
+
+/**
+ * Configuration for specifying TextCat parameters.
+ * Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+ * minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+ * See vendor/wikimedia/textcat/TextCat.php
+ */
+
+$wgCirrusSearchTextcatConfig = [];
 
 /**
  * Limit the set of languages detected by Textcat.
- * Useful when some languages in the model have very bad precision, e.g.:
- * $wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+ * Useful when some languages in the model have too many false positives, e.g.:
+ * $wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
  */
 
 /**
diff --git a/composer.json b/composer.json
index a285c6b..ecbb259 100644
--- a/composer.json
+++ b/composer.json
@@ -5,6 +5,6 @@
"license": "GPL-2.0+",
"minimum-stability": "dev",
"require": {
-   "wikimedia/textcat": "1.1.3"
+   "wikimedia/textcat": "1.2.0"
}
 }
diff --git a/docs/settings.txt b/docs/settings.txt
index bfc0160..ad9b374 100644
--- a/docs/settings.txt
+++ b/docs/settings.txt
@@ -131,7 +131,7 @@
 Elasticsearch plugin that should produce better snippets for search results.
 Installation instructions are here: https://github.com/wikimedia/search-highlighter
 If you have the highlighter installed you can switch this on and off so long
-as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.  
+as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
 Setting it to true without the highlighter installed will break search.
 
 ; $wgCirrusSearchOptimizeIndexForExperimentalHighlighter
@@ -1269,9 +1269,19 @@
 ; $wgCirrusSearchTextcatModel
 
 Default:
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
 
-Directory where TextCat detector should look for language model.
+List of directories where TextCat detector should look for language models
+
+; $wgCirrusSearchTextcatConfig
+
+Default:
+$wgCirrusSearchTextcatConfig = null;
+
+Configuration for specifying TextCat parameters.
+Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+See vendor/wikimedia/textcat/TextCat.php
 
 ; $wgCirrusSearchTextcatLanguages
 
@@ -1281,7 +1291,7 @@
 Limit the set of languages detected by Textcat.
 Useful when some languages in the model have very bad precision, e.g.:
 
-$wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
 
 ; $wgCirrusSearchMasterTimeout
 
diff --git a/includes/LanguageDetector/TextCat.php b/includes/LanguageDetector/TextCat.php
index 94ef1c5..e36cd76 100644
--- a/includes/LanguageDetector/TextCat.php
+++ b/includes/LanguageDetector/TextCat.php
@@ -22,22 +22,54 @@
// Should not happen
return null;
}
-   $dir = $config->getElement('CirrusSearchTextcatModel');
-   if( !$dir ) {
+   $dirs = $config->getElement('CirrusSearchTextcatModel');
+   if( !$dirs ) {
return null;
}
-   if( !is_dir( $dir ) ) {
-   LoggerFactory::getInstance( 'CirrusSearch' )->warning(
-   "Bad directory for TextCat model: {dir}",
-   [ "dir" => $dir ]
-   );
+   if ( !is_array( $dirs ) ) { // backward compatibility
+   $dirs = [ $dirs ];
+   }
+   foreach ($dirs as $dir) {
+   if( !is_dir( $dir ) ) {
+   LoggerFactory::getInstance(