[MediaWiki-commits] [Gerrit] mediawiki...CirrusSearch[master]: Deploy TextCat Improvements
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/334728 )

Change subject: Deploy TextCat Improvements
..

Deploy TextCat Improvements

Update to TextCat 1.2.0.
Use multiple language model directories.
Add config for TextCat parameters and set at runtime.
Add TextCat tests that use parameters.
Fix misc typos, syntax, and EOL whitespace.

Bug: T149324
Change-Id: I20a82978aa7a046f885dfbdcbee93d4a13f71101
---
M CirrusSearch.php
M composer.json
M docs/settings.txt
M includes/LanguageDetector/TextCat.php
M tests/unit/LanguageDetectTest.php
5 files changed, 138 insertions(+), 38 deletions(-)

Approvals:
  Cindy-the-browser-test-bot: Looks good to me, but someone else must approve
  jenkins-bot: Verified
  DCausse: Looks good to me, approved

diff --git a/CirrusSearch.php b/CirrusSearch.php
index 856db50..8ca2da4 100644
--- a/CirrusSearch.php
+++ b/CirrusSearch.php
@@ -957,14 +957,23 @@
 $wgCirrusSearchLanguageDetectors = [];

 /**
- * Directory where TextCat detector should look for language model
+ * List of directories where TextCat detector should look for language models
  */
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
+
+/**
+ * Configuration for specifying TextCat parameters.
+ * Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+ * minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+ * See vendor/wikimedia/textcat/TextCat.php
+ */
+
+$wgCirrusSearchTextcatConfig = [];

 /**
  * Limit the set of languages detected by Textcat.
- * Useful when some languages in the model have very bad precision, e.g.:
- * $wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+ * Useful when some languages in the model have too many false positives, e.g.:
+ * $wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
  */

 /**
diff --git a/composer.json b/composer.json
index a285c6b..ecbb259 100644
--- a/composer.json
+++ b/composer.json
@@ -5,6 +5,6 @@
 	"license": "GPL-2.0+",
 	"minimum-stability": "dev",
 	"require": {
-		"wikimedia/textcat": "1.1.3"
+		"wikimedia/textcat": "1.2.0"
 	}
 }
diff --git a/docs/settings.txt b/docs/settings.txt
index bfc0160..ad9b374 100644
--- a/docs/settings.txt
+++ b/docs/settings.txt
@@ -131,7 +131,7 @@
 Elasticsearch plugin that should produce better snippets for search results.
 Installation instructions are here: https://github.com/wikimedia/search-highlighter
 If you have the highlighter installed you can switch this on and off so long
-as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
+as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
 Setting it to true without the highlighter installed will break search.

 ; $wgCirrusSearchOptimizeIndexForExperimentalHighlighter
@@ -1269,9 +1269,19 @@
 ; $wgCirrusSearchTextcatModel

 Default:
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];

-Directory where TextCat detector should look for language model.
+List of directories where TextCat detector should look for language models
+
+; $wgCirrusSearchTextcatConfig
+
+Default:
+$wgCirrusSearchTextcatConfig = null;
+
+Configuration for specifying TextCat parameters.
+Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+See vendor/wikimedia/textcat/TextCat.php

 ; $wgCirrusSearchTextcatLanguages

@@ -1281,7 +1291,7 @@

 Limit the set of languages detected by Textcat.
 Useful when some languages in the model have very bad precision, e.g.:
-$wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];

 ; $wgCirrusSearchMasterTimeout

diff --git a/includes/LanguageDetector/TextCat.php b/includes/LanguageDetector/TextCat.php
index 94ef1c5..72c264d 100644
--- a/includes/LanguageDetector/TextCat.php
+++ b/includes/LanguageDetector/TextCat.php
@@ -22,22 +22,57 @@
 			// Should not happen
 			return null;
 		}
-		$dir = $config->getElement('CirrusSearchTextcatModel');
-		if( !$dir ) {
+		$dirs = $config->getElement('CirrusSearchTextcatModel');
+		if( !$dirs ) {
 			return null;
 		}
-		if( !is_dir( $dir ) ) {
-			LoggerFactory::getInstance( 'CirrusSearch' )->warning(
-				"Bad directory for TextCat model: {dir}",
-				[ "dir" => $dir ]
-			);
+		if ( !is_array( $dirs ) ) { // backward compatibility
+			$dirs = [ $dirs ];
+		}
+		foreach ($dirs as $dir) {
+			if( !is_dir( $dir ) ) {
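
For readers who want to see the three settings from this change side by side, here is a rough sketch of what a LocalSettings.php fragment might look like. The directory paths are hypothetical, and the parameter values are placeholders chosen only to show the shape of the array, not recommendations. The key names are the ones listed in the settings documentation in the diff above.

```php
<?php
// Illustrative LocalSettings.php fragment; not taken from the change itself.

// List of directories holding TextCat language models.
// Both paths are hypothetical; substitute wherever your models are installed.
$wgCirrusSearchTextcatModel = [
	'/srv/textcat/models-query', // hypothetical path
	'/srv/textcat/models-wiki',  // hypothetical path
];

// TextCat parameters. Only keys named in docs/settings.txt are used here;
// the values are placeholders to show the array shape.
$wgCirrusSearchTextcatConfig = [
	'maxReturnedLanguages' => 1, // placeholder value
	'minInputLength' => 3,       // placeholder value
];

// Restrict detection to a subset of languages (example from the docs).
$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
```

Per the backward-compatibility branch in TextCat.php, a single directory string for $wgCirrusSearchTextcatModel should still be accepted and treated as a one-element list.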
[MediaWiki-commits] [Gerrit] mediawiki...CirrusSearch[master]: Deploy TextCat Improvements
Tjones has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/334728 )

Change subject: Deploy TextCat Improvements
..

Deploy TextCat Improvements

Update to TextCat 1.2.0.
Use multiple language model directories.
Add config for TextCat parameters and set at runtime.
Add TextCat tests that use parameters.
Fix misc typos, syntax, and EOL whitespace.

Bug: T149324
Change-Id: I20a82978aa7a046f885dfbdcbee93d4a13f71101
---
M CirrusSearch.php
M composer.json
M docs/settings.txt
M includes/LanguageDetector/TextCat.php
M tests/unit/LanguageDetectTest.php
5 files changed, 119 insertions(+), 28 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/28/334728/1

diff --git a/CirrusSearch.php b/CirrusSearch.php
index 856db50..8ca2da4 100644
--- a/CirrusSearch.php
+++ b/CirrusSearch.php
@@ -957,14 +957,23 @@
 $wgCirrusSearchLanguageDetectors = [];

 /**
- * Directory where TextCat detector should look for language model
+ * List of directories where TextCat detector should look for language models
  */
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];
+
+/**
+ * Configuration for specifying TextCat parameters.
+ * Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+ * minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+ * See vendor/wikimedia/textcat/TextCat.php
+ */
+
+$wgCirrusSearchTextcatConfig = [];

 /**
  * Limit the set of languages detected by Textcat.
- * Useful when some languages in the model have very bad precision, e.g.:
- * $wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+ * Useful when some languages in the model have too many false positives, e.g.:
+ * $wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];
  */

 /**
diff --git a/composer.json b/composer.json
index a285c6b..ecbb259 100644
--- a/composer.json
+++ b/composer.json
@@ -5,6 +5,6 @@
 	"license": "GPL-2.0+",
 	"minimum-stability": "dev",
 	"require": {
-		"wikimedia/textcat": "1.1.3"
+		"wikimedia/textcat": "1.2.0"
 	}
 }
diff --git a/docs/settings.txt b/docs/settings.txt
index bfc0160..ad9b374 100644
--- a/docs/settings.txt
+++ b/docs/settings.txt
@@ -131,7 +131,7 @@
 Elasticsearch plugin that should produce better snippets for search results.
 Installation instructions are here: https://github.com/wikimedia/search-highlighter
 If you have the highlighter installed you can switch this on and off so long
-as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
+as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
 Setting it to true without the highlighter installed will break search.

 ; $wgCirrusSearchOptimizeIndexForExperimentalHighlighter
@@ -1269,9 +1269,19 @@
 ; $wgCirrusSearchTextcatModel

 Default:
-$wgCirrusSearchTextcatModel = false;
+$wgCirrusSearchTextcatModel = [];

-Directory where TextCat detector should look for language model.
+List of directories where TextCat detector should look for language models
+
+; $wgCirrusSearchTextcatConfig
+
+Default:
+$wgCirrusSearchTextcatConfig = null;
+
+Configuration for specifying TextCat parameters.
+Keys are maxNgrams, maxReturnedLanguages, resultsRatio,
+minInputLength, maxProportion, langBoostScore, and numBoostedLangs.
+See vendor/wikimedia/textcat/TextCat.php

 ; $wgCirrusSearchTextcatLanguages

@@ -1281,7 +1291,7 @@

 Limit the set of languages detected by Textcat.
 Useful when some languages in the model have very bad precision, e.g.:
-$wgCirrusSearchTextcatLanguages = array( 'ar', 'it', 'de' );
+$wgCirrusSearchTextcatLanguages = [ 'ar', 'it', 'de' ];

 ; $wgCirrusSearchMasterTimeout

diff --git a/includes/LanguageDetector/TextCat.php b/includes/LanguageDetector/TextCat.php
index 94ef1c5..e36cd76 100644
--- a/includes/LanguageDetector/TextCat.php
+++ b/includes/LanguageDetector/TextCat.php
@@ -22,22 +22,54 @@
 			// Should not happen
 			return null;
 		}
-		$dir = $config->getElement('CirrusSearchTextcatModel');
-		if( !$dir ) {
+		$dirs = $config->getElement('CirrusSearchTextcatModel');
+		if( !$dirs ) {
 			return null;
 		}
-		if( !is_dir( $dir ) ) {
-			LoggerFactory::getInstance( 'CirrusSearch' )->warning(
-				"Bad directory for TextCat model: {dir}",
-				[ "dir" => $dir ]
-			);
+		if ( !is_array( $dirs ) ) { // backward compatibility
+			$dirs = [ $dirs ];
+		}
+		foreach ($dirs as $dir) {
+			if( !is_dir( $dir ) ) {
+				LoggerFactory::getInstance(
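
The directory handling in the TextCat.php hunk is cut off in this message, so the following is only a standalone sketch of the same idea: accept either a single directory string (the old setting) or a list, and check each entry with is_dir(). The function name normalizeModelDirs is hypothetical, error_log() stands in for the LoggerFactory call, and skipping bad entries rather than aborting is an assumption, since the truncated diff does not show what the real code does after logging.

```php
<?php
/**
 * Hypothetical helper, not part of CirrusSearch: normalize the value of
 * $wgCirrusSearchTextcatModel into a list of usable directories.
 *
 * @param string|string[]|false|null $dirs Single directory or list of directories
 * @return string[] Entries that exist as directories
 */
function normalizeModelDirs( $dirs ): array {
	if ( !$dirs ) {
		// Nothing configured (false, null, or empty array)
		return [];
	}
	if ( !is_array( $dirs ) ) {
		// Backward compatibility: a single directory given as a string
		$dirs = [ $dirs ];
	}
	$valid = [];
	foreach ( $dirs as $dir ) {
		if ( is_dir( $dir ) ) {
			$valid[] = $dir;
		} else {
			// Assumption: log and skip; the real code may behave differently
			error_log( "Bad directory for TextCat model: $dir" );
		}
	}
	return $valid;
}
```

A caller could then pass whatever the configuration holds, for example `normalizeModelDirs( $wgCirrusSearchTextcatModel )`, without caring whether the site still uses the old single-string form.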