DCausse has uploaded a new change for review. https://gerrit.wikimedia.org/r/227424
Change subject: Allow customization of "Did you mean" suggestions
......................................................................

Allow customization of "Did you mean" suggestions

The settings used to build the suggest request are now configurable at runtime
with request URI parameters. These settings can be made persistent with the
'cirrussearch-didyoumean-settings' system message.

Bug: T106692
Change-Id: I10f5e6b1683d29e57d1a7ac23a48342bf26a64d3
---
M CirrusSearch.php
M i18n/en.json
M i18n/qqq.json
M includes/Hooks.php
M includes/Searcher.php
M tests/browser/features/did_you_mean_api.feature
M tests/browser/features/step_definitions/search_steps.rb
7 files changed, 206 insertions(+), 20 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/24/227424/1

diff --git a/CirrusSearch.php b/CirrusSearch.php
index bd69af8..18be4ca 100644
--- a/CirrusSearch.php
+++ b/CirrusSearch.php
@@ -240,10 +240,62 @@
 // See max_errors on http://www.elasticsearch.org/guide/reference/api/search/suggest/
 $wgCirrusSearchPhraseSuggestMaxErrors = 2;
 
+// Set the hard limit for $wgCirrusSearchPhraseSuggestMaxErrors. This prevents customizing
+// this setting in a way that could hurt the system performance.
+$wgCirrusSearchPhraseSuggestMaxErrorsHardLimit = 2;
+
 // Confidence level required to suggest new phrases.
 // See confidence on http://www.elasticsearch.org/guide/reference/api/search/suggest/
 $wgCirrusSearchPhraseSuggestConfidence = 2.0;
 
+// The suggest mode used by the phrase suggester
+// can be :
+// * missing: Only suggest terms in the suggest text that aren’t in the index.
+// * popular: Only suggest suggestions that occur in more docs than the original
+//   suggest text term.
+// * always: Suggest any matching suggestions based on terms in the suggest text.
+$wgCirrusSearchPhraseSuggestMode = 'always';
+
+// List of allowed values for the suggest mode
+$wgCirrusSearchPhraseSuggestAllowedMode = array( 'missing', 'popular', 'always' );
+
+
+// The max term freq used by the phrase suggester.
+// The maximum threshold in number of documents a suggest text token can exist in
+// order to be included. Can be a relative percentage number (e.g 0.4) or an absolute
+// number to represent document frequencies. If a value higher than 1 is specified
+// then fractional can not be specified. Defaults to 0.01f.
+// If a term appears in more than half the docs then don't try to correct it. This really
+// shouldn't kick in much because we're not looking for misspellings. We're looking for phrases
+// that might be off. Like "noble prize" -> "nobel prize". In any case, the default was
+// 0.01 which way too frequently decided not to correct some terms.
+$wgCirrusSearchPhraseMaxTermFreq = 0.5;
+
+// Set the hard limit for $wgCirrusSearchPhraseMaxTermFreq. This prevents customizing
+// this setting in a way that could hurt the system performance.
+$wgCirrusSearchPhraseMaxTermFreqHardLimit = 0.6;
+
+// The min doc freq (shard level) used by the phrase suggester
+// The minimal threshold in number of documents a suggestion should appear in.
+// This can be specified as an absolute number or as a relative percentage of
+// number of documents. This can improve quality by only suggesting high frequency
+// terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified
+// then the number cannot be fractional. The shard level document frequencies are
+// used for this option.
+// NOTE: this value is ignored if $wgCirrusSearchPhraseSuggestMode is always
+$wgCirrusSearchPhraseMinDocFreq = 0.0;
+
+// The prefix length used by the phrase suggester
+// The number of minimal prefix characters that must match in order to be a candidate
+// suggestion. Defaults to 1. Increasing this number improves spellcheck performance.
+// Usually misspellings don’t occur in the beginning of terms.
+$wgCirrusSearchPhrasePrefixLength = 2;
+
+// Set the hard limit for $wgCirrusSearchPhrasePrefixLength. This prevents customizing
+// this setting in a way that could hurt the system performance.
+// (This is the minimal value)
+$wgCirrusSearchPhrasePrefixLengthHardLimit = 2;
+
 // Look for suggestions in the article text? Changing this from false to true will
 // break search until you perform an in place index rebuild. Changing it from true
 // to false is ok and then you can change it back to true so long as you _haven't_
diff --git a/i18n/en.json b/i18n/en.json
index 5ec1c81..ba372c8 100644
--- a/i18n/en.json
+++ b/i18n/en.json
@@ -20,5 +20,6 @@
 	"apihelp-cirrus-mapping-dump-description": "Dump of CirrusSearch mapping for this wiki.",
 	"apihelp-cirrus-settings-dump-description": "Dump of CirrusSearch settings for this wiki.",
 	"cirrussearch-give-feedback": "Give us your feedback",
-	"cirrussearch-morelikethis-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message let you configure the settings of the \"more like this\" feature.\n# Changes to this take effect immediately.\n# Syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# Settings are :\n# * min_doc_freq (integer): Minimum number of documents (per shard) that need a term for it to be considered.\n# * max_doc_freq (integer): Maximum number of documents (per shard) that have a term for it to be considered.\n# High frequency terms are generally \"stop words\".\n# * max_query_terms (integer): Maximum number of terms to be considered. This value is limited to $wgCirrusSearchMoreLikeThisMaxQueryTermsLimit (100).\n# * min_term_freq (integer): Minimum number of times the term appears in the input to doc to be considered.
For small fields (title) this value should be 1.\n# * percent_terms_to_match (float 0 to 1): The percentage of terms to match on. Defaults to 0.3 (30 percent).\n# * min_word_len (integer): Minimal length of a term to be considered. Defaults to 0.\n# * max_word_len (integer): The maximum word length above which words will be ignored. Defaults to unbounded (0).\n# * fields (comma separated list of values): These are the fields to use. Allowed fields are title, text, auxiliary_text, opening_text, headings and all.\n# * use_fields (true|false) : Tell the \"more like this\" query to use only the field data. Defaults to false: the system will extract the content of the text field to build the query.\n# Examples of good lines:\n# min_doc_freq:2\n# max_doc_freq:20000\n# max_query_terms:25\n# min_term_freq:2\n# percent_terms_to_match:0.3\n# min_word_len:2\n# max_word_len:40\n# fields:text,opening_text\n# use_fields:true\n# </pre> <!-- leave this line exactly as it is -->"
+	"cirrussearch-morelikethis-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message let you configure the settings of the \"more like this\" feature.\n# Changes to this take effect immediately.\n# Syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# Settings are :\n# * min_doc_freq (integer): Minimum number of documents (per shard) that need a term for it to be considered.\n# * max_doc_freq (integer): Maximum number of documents (per shard) that have a term for it to be considered.\n# High frequency terms are generally \"stop words\".\n# * max_query_terms (integer): Maximum number of terms to be considered. This value is limited to $wgCirrusSearchMoreLikeThisMaxQueryTermsLimit (100).\n# * min_term_freq (integer): Minimum number of times the term appears in the input to doc to be considered.
For small fields (title) this value should be 1.\n# * percent_terms_to_match (float 0 to 1): The percentage of terms to match on. Defaults to 0.3 (30 percent).\n# * min_word_len (integer): Minimal length of a term to be considered. Defaults to 0.\n# * max_word_len (integer): The maximum word length above which words will be ignored. Defaults to unbounded (0).\n# * fields (comma separated list of values): These are the fields to use. Allowed fields are title, text, auxiliary_text, opening_text, headings and all.\n# * use_fields (true|false) : Tell the \"more like this\" query to use only the field data. Defaults to false: the system will extract the content of the text field to build the query.\n# Examples of good lines:\n# min_doc_freq:2\n# max_doc_freq:20000\n# max_query_terms:25\n# min_term_freq:2\n# percent_terms_to_match:0.3\n# min_word_len:2\n# max_word_len:40\n# fields:text,opening_text\n# use_fields:true\n# </pre> <!-- leave this line exactly as it is -->",
+	"cirrussearch-didyoumean-settings": " #<!-- leave this line exactly as it is --> <pre>\n# This message let you configure the settings of the \"Did you mean\" suggestions.\n# See also https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html\n# Changes to this take effect immediately.\n# Syntax is as follows:\n# * Everything from a \"#\" character to the end of the line is a comment.\n# * Every non-blank line is the setting name followed by a \":\" character followed by the setting value\n# Settings are :\n# * max_errors (integer): the maximum number of the terms that at most considered to be misspellings in order to form a correction. 1 or 2.\n# * confidence (float): The confidence level defines a factor applied to the input phrases score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result.
For instance a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the top N candidates are returned.\n# * min_doc_freq (float 0 to 1): The minimal threshold in number of documents a suggestion should appear in.\n# High frequency terms are generally \"stop words\".\n# * max_term_freq (float 0 to 1): The maximum threshold in number of documents a suggest text token can exist in order to be included.\n# * prefix_length (integer): The number of minimal prefix characters that must match in order be a candidate suggestions.\n# * suggest_mode (missing, popular, always): The suggest mode controls what suggestions are included or controls for what suggest text terms, suggestions should be suggested.\n# Examples of good lines:\n# max_errors:2\n# confidence:2.0\n# max_term_freq:0.5\n# min_doc_freq:0.01\n# prefix_length:2\n# suggest_mode:always\n#\n# </pre> <!-- leave this line exactly as it is -->"
 }
diff --git a/i18n/qqq.json b/i18n/qqq.json
index ba771c6..4793e70 100644
--- a/i18n/qqq.json
+++ b/i18n/qqq.json
@@ -27,5 +27,6 @@
 	"apihelp-cirrus-mapping-dump-description": "{{doc-apihelp-description|cirrus-mapping-dump}}",
 	"apihelp-cirrus-settings-dump-description": "{{doc-apihelp-description|cirrus-settings-dump}}",
 	"cirrussearch-give-feedback": "Used as text for an feedback link shown at the end of Special:Search result ([[mw:Extension:CirrusSearch|$wgCirrusSearchFeedbackLink]])",
-	"cirrussearch-morelikethis-settings": "Settings for the More Like This query.\n\n\"More Like This\" is the English name of the feature. The feature is described at [[:mw:Help:CirrusSearch#Special prefixes]].
The prefix \"morelike\" cannot be translated anywhere, but the full name of the feature \"More Like This\" can be translated.\n\nDon't translate technical names like min_doc_freq, max_query_terms, true|false, field names title, text, auxiliary_text, opening_text, headings, all etc.\n\nFor a definition of \"stopwords\" see [[:w:en:Stop words|Stop words in Wikipedia]]."
+	"cirrussearch-morelikethis-settings": "Settings for the More Like This query.\n\n\"More Like This\" is the English name of the feature. The feature is described at [[:mw:Help:CirrusSearch#Special prefixes]]. The prefix \"morelike\" cannot be translated anywhere, but the full name of the feature \"More Like This\" can be translated.\n\nDon't translate technical names like min_doc_freq, max_query_terms, true|false, field names title, text, auxiliary_text, opening_text, headings, all etc.\n\nFor a definition of \"stopwords\" see [[:w:en:Stop words|Stop words in Wikipedia]].",
+	"cirrussearch-didyoumean-settings": "Settings for the \"Did You Mean?\" suggestions.\n\n\"Did You Mean?\" is the English name of the feature and can be translated. This feature is described at [[:mw:Help:CirrusSearch#Did_you_mean]].\n\nDon't translate technical names like max_errors, confidence, max_term_freq, min_doc_freq and suggest_mode."
 }
diff --git a/includes/Hooks.php b/includes/Hooks.php
index c425c56..b163af0 100644
--- a/includes/Hooks.php
+++ b/includes/Hooks.php
@@ -92,6 +92,7 @@
 		}
 
 		self::overrideMoreLikeThisOptionsFromMessage();
+		self::overridePhraseSuggesterOptionsFromMessage();
 
 		if ( $request ) {
 			// Engage the experimental highlighter if a url parameter requests it
@@ -108,17 +109,24 @@
 			self::overrideYesNo( $wgCirrusSearchAllFieldsForRescore, $request, 'cirrusUseAllFieldsForRescore' );
 			self::overrideUseExtraPluginForRegex( $request );
 			self::overrideMoreLikeThisOptions( $request );
+			self::overridePhraseSuggesterOptions( $request );
 		}
 	}
 
 	/**
-	 * Set $dest to the numeric value from $request->getVal( $name ) if it is <= $limit.
+	 * Set $dest to the numeric value from $request->getVal( $name ) if it is <= $limit
+	 * or >= $limit if $upperLimit is false.
 	 */
-	private static function overrideNumeric( &$dest, $request, $name, $limit = null ) {
+	private static function overrideNumeric( &$dest, $request, $name, $limit = null, $upperLimit = true ) {
 		$val = $request->getVal( $name );
-		if ( $val !== null && is_numeric( $val )
-			&& ( ( isset ( $limit ) && $val <= $limit ) || ( !isset( $limit ) ) ) ) {
-			$dest = $val;
+		if ( $val !== null && is_numeric( $val ) ) {
+			if ( !isset( $limit ) ) {
+				$dest = $val;
+			} else if ( $upperLimit && $val <= $limit ) {
+				$dest = $val;
+			} else if ( !$upperLimit && $val >= $limit ) {
+				$dest = $val;
+			}
 		}
 	}
 
@@ -237,6 +245,94 @@
 				$wgCirrusSearchMoreLikeThisAllowedFields );
 		}
 	}
+
+	/**
+	 * Override Phrase suggester options ("Did you mean?"
suggestions)
+	 */
+	private static function overridePhraseSuggesterOptions( $request ) {
+		global $wgCirrusSearchPhraseSuggestMaxErrors,
+			$wgCirrusSearchPhraseSuggestConfidence,
+			$wgCirrusSearchPhraseSuggestMode,
+			$wgCirrusSearchPhraseMaxTermFreq,
+			$wgCirrusSearchPhraseMinDocFreq,
+			$wgCirrusSearchPhrasePrefixLength,
+			$wgCirrusSearchPhraseMaxTermFreqHardLimit,
+			$wgCirrusSearchPhraseSuggestMaxErrorsHardLimit,
+			$wgCirrusSearchPhrasePrefixLengthHardLimit,
+			$wgCirrusSearchPhraseSuggestAllowedMode;
+
+		self::overrideNumeric( $wgCirrusSearchPhraseSuggestMaxErrors, $request, 'cirrusSuggMaxErrors',
+			$wgCirrusSearchPhraseSuggestMaxErrorsHardLimit );
+		self::overrideNumeric( $wgCirrusSearchPhraseSuggestConfidence, $request, 'cirrusSuggConfidence' );
+		self::overrideNumeric( $wgCirrusSearchPhraseMaxTermFreq, $request, 'cirrusSuggMaxTermFreq',
+			$wgCirrusSearchPhraseMaxTermFreqHardLimit );
+		self::overrideNumeric( $wgCirrusSearchPhraseMinDocFreq, $request, 'cirrusSuggMinDocFreq' );
+		self::overrideNumeric( $wgCirrusSearchPhrasePrefixLength, $request, 'cirrusSuggPrefixLength',
+			$wgCirrusSearchPhrasePrefixLengthHardLimit, false );
+		$mode = $request->getVal( 'cirrusSuggMode' );
+		if( isset( $mode ) && in_array( $mode, $wgCirrusSearchPhraseSuggestAllowedMode ) ) {
+			$wgCirrusSearchPhraseSuggestMode = $mode;
+		}
+	}
+
+	/**
+	 * Override Phrase suggester options ("Did you mean?"
suggestions)
+	 */
+	private static function overridePhraseSuggesterOptionsFromMessage( ) {
+		global $wgCirrusSearchPhraseSuggestMaxErrors,
+			$wgCirrusSearchPhraseSuggestConfidence,
+			$wgCirrusSearchPhraseSuggestMode,
+			$wgCirrusSearchPhraseMaxTermFreq,
+			$wgCirrusSearchPhraseMinDocFreq,
+			$wgCirrusSearchPhrasePrefixLength,
+			$wgCirrusSearchPhraseMaxTermFreqHardLimit,
+			$wgCirrusSearchPhraseSuggestMaxErrorsHardLimit,
+			$wgCirrusSearchPhrasePrefixLengthHardLimit,
+			$wgCirrusSearchPhraseSuggestAllowedMode;
+
+		$source = wfMessage( 'cirrussearch-didyoumean-settings' )->inContentLanguage();
+		if ( !$source || $source->isDisabled() ) {
+			return;
+		}
+		$lines = Util::parseSettingsInMessage( $source->plain() );
+
+		foreach ( $lines as $line ) {
+			list( $k, $v ) = explode( ':', $line, 2 );
+			switch( $k ) {
+			case 'max_errors' :
+				if ( is_numeric( $v ) && $v >= 1 && $v <= $wgCirrusSearchPhraseSuggestMaxErrorsHardLimit ) {
+					$wgCirrusSearchPhraseSuggestMaxErrors = floatval( $v );
+				}
+				break;
+			case 'confidence' :
+				if ( is_numeric( $v ) && $v >= 0 ) {
+					$wgCirrusSearchPhraseSuggestConfidence = floatval( $v );
+				}
+				break;
+			case 'max_term_freq' :
+				if ( is_numeric( $v ) && $v >= 0 && $v <= $wgCirrusSearchPhraseMaxTermFreqHardLimit ) {
+					$wgCirrusSearchPhraseMaxTermFreq = floatval( $v );
+				}
+				break;
+			case 'min_doc_freq' :
+				if ( is_numeric( $v ) && $v >= 0 && $v < 1 ) {
+					$wgCirrusSearchPhraseMinDocFreq = floatval( $v );
+				}
+				break;
+			case 'prefix_length' :
+				if ( is_numeric( $v ) && $v >= 0 && $v <= $wgCirrusSearchPhrasePrefixLengthHardLimit ) {
+					$wgCirrusSearchPhrasePrefixLength = intval( $v );
+				}
+				break;
+			case 'suggest_mode' :
+				if ( in_array( $v, $wgCirrusSearchPhraseSuggestAllowedMode ) ) {
+					$wgCirrusSearchPhraseSuggestMode = $v;
+				}
+				break;
+			}
+		}
+	}
+
 	/**
 	 * Hook to call before an article is deleted
 	 * @param WikiPage $page The page we're deleting
diff --git a/includes/Searcher.php b/includes/Searcher.php
index 91748cc..c548331 100644
---
a/includes/Searcher.php
+++ b/includes/Searcher.php
@@ -1319,8 +1319,13 @@
 	 * @return array[] array of Elastica configuration
 	 */
 	private function buildSuggestConfig( $field ) {
-		global $wgCirrusSearchPhraseSuggestMaxErrors;
-		global $wgCirrusSearchPhraseSuggestConfidence;
+		global $wgCirrusSearchPhraseSuggestMaxErrors,
+			$wgCirrusSearchPhraseSuggestConfidence,
+			$wgCirrusSearchPhraseSuggestMode,
+			$wgCirrusSearchPhraseMaxTermFreq,
+			$wgCirrusSearchPhraseMinDocFreq,
+			$wgCirrusSearchPhrasePrefixLength;
+		print ($wgCirrusSearchPhraseMinDocFreq);
 		return array(
 			'phrase' => array(
 				'field' => $field,
@@ -1330,13 +1335,10 @@
 				'direct_generator' => array(
 					array(
 						'field' => $field,
-						'suggest_mode' => 'always', // Forces us to generate lots of phrases to try.
-						// If a term appears in more then half the docs then don't try to correct it. This really
-						// shouldn't kick in much because we're not looking for misspellings. We're looking for phrases
-						// that can be might off. Like "noble prize" -> "nobel prize". In any case, the default was
-						// 0.01 which way too frequently decided not to correct some terms.
-						'max_term_freq' => 0.5,
-						'prefix_length' => 2,
+						'suggest_mode' => $wgCirrusSearchPhraseSuggestMode,
+						'max_term_freq' => $wgCirrusSearchPhraseMaxTermFreq,
+						'min_doc_freq' => $wgCirrusSearchPhraseMinDocFreq,
+						'prefix_length' => $wgCirrusSearchPhrasePrefixLength,
 					),
 				),
 				'highlight' => array(
diff --git a/tests/browser/features/did_you_mean_api.feature b/tests/browser/features/did_you_mean_api.feature
index 8b5e51c..a124249 100644
--- a/tests/browser/features/did_you_mean_api.feature
+++ b/tests/browser/features/did_you_mean_api.feature
@@ -54,3 +54,23 @@
     | boost-templates:"prize\|150%" noble prize | boost-templates:"prize\|150%" *nobel* prize |
     | noble prize prefix:n | *nobel* prize prefix:n |
 
+  Scenario: Customize prefix length of did you mean suggestions
+    When I set did you mean suggester option cirrusSuggPrefixLength to 5
+    And I api search for noble prize
+    Then there is no api suggestion
+
+  Scenario: Customize max term freq did you mean suggestions
+    When I set did you mean suggester option cirrusSuggMaxTermFreq to 0.001
+    And I api search for noble prize
+    Then there is no api suggestion
+
+  Scenario: Customize min doc freq did you mean suggestions
+    When I set did you mean suggester option cirrusSuggMode to popular
+    And I set did you mean suggester option cirrusSuggMinDocFreq to 0.60
+    And I api search for noble prize
+    Then there is no api suggestion
+
+  Scenario: Customize prefix length of did you mean suggestions below the hard limit
+    When I set did you mean suggester option cirrusSuggPrefixLength to 1
+    And I api search for nabel prize
+    Then there is no api suggestion
diff --git a/tests/browser/features/step_definitions/search_steps.rb b/tests/browser/features/step_definitions/search_steps.rb
index afc1284..413bf22 100644
--- a/tests/browser/features/step_definitions/search_steps.rb
+++ b/tests/browser/features/step_definitions/search_steps.rb
@@ -17,18 +17,32 @@
 When(/^I locate the page id of (.*) and store it as (%.*%)$/) do |title, varname|
   @search_vars[varname] = page_id_of title
 end
+When(/^I reset did you mean suggester options$/) do
+  @didyoumean_options = {}
+end
+When(/^I set did you mean suggester option (.*) to (.*)$/) do |varname, value|
+  @didyoumean_options ||= {}
+  @didyoumean_options[varname] = value;
+end
 # rubocop:disable LineLength
 When(/^I api search( with disabled incoming link weighting)?(?: with offset (\d+))?(?: in the (.*) language)?(?: in namespaces? (\d+(?: \d+)*))? for (.*)$/) do |incoming_links, offset, lang, namespaces, search|
   begin
+    options = {
+      sroffset: offset,
+      srnamespace: (namespaces || "0").split(/ /),
+      uselang: lang,
+      cirrusBoostLinks: incoming_links ? "no" : "yes",
+    }
+    if(defined?@didyoumean_options)
+      options = options.merge(@didyoumean_options)
+    end
+
     @api_result = search_for(
       search.gsub(/%[^ {]+%/, @search_vars)
         .gsub(/%\{\\u([\dA-Fa-f]{4,6})\}%/) do # replace %{\uXXXX}% with the unicode code point
         [Regexp.last_match[1].hex].pack("U")
       end,
-      sroffset: offset,
-      srnamespace: (namespaces || "0").split(/ /),
-      uselang: lang,
-      cirrusBoostLinks: incoming_links ? "no" : "yes"
+      options
     )
   rescue MediawikiApi::ApiError => e
     @api_error = e

-- 
To view, visit https://gerrit.wikimedia.org/r/227424
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I10f5e6b1683d29e57d1a7ac23a48342bf26a64d3
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: DCausse <dcau...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
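
Note on the limit handling in overrideNumeric (Hooks.php hunk): the rule is "accept the request value only if it is numeric and on the permitted side of the hard limit, otherwise keep the configured value". Below is a minimal sketch of that rule in Ruby, the language of the step definitions in this change. It is an illustration only, not code from the patch: `override_numeric` and its keyword arguments are hypothetical names, it uses Ruby's `Float()` where the PHP uses `is_numeric`, and it returns a Float where the PHP stores the raw request string. The defaults and hard limits are the ones set in the CirrusSearch.php hunk.

```ruby
# Hypothetical Ruby transliteration of the bounded override in Hooks.php.
# Returns the new value when the raw request value is numeric and within
# the limit; otherwise returns the current setting unchanged.
def override_numeric(current, raw, limit: nil, upper_limit: true)
  return current if raw.nil?

  value = Float(raw, exception: false) # nil when raw does not parse as a number
  return current if value.nil?
  return value if limit.nil?

  # upper_limit: true  -> limit is a maximum (value must be <= limit)
  # upper_limit: false -> limit is a minimum (value must be >= limit)
  within = upper_limit ? value <= limit : value >= limit
  within ? value : current
end

# Defaults taken from the CirrusSearch.php hunk.
prefix_length = 2
max_term_freq = 0.5

# prefix_length has a lower bound of 2: "5" is accepted, "1" is ignored.
accepted_prefix = override_numeric(prefix_length, "5", limit: 2, upper_limit: false)
rejected_prefix = override_numeric(prefix_length, "1", limit: 2, upper_limit: false)

# max_term_freq has an upper bound of 0.6: "0.001" is accepted, "0.9" is ignored.
accepted_freq = override_numeric(max_term_freq, "0.001", limit: 0.6)
rejected_freq = override_numeric(max_term_freq, "0.9", limit: 0.6)

puts accepted_prefix, rejected_prefix, accepted_freq, rejected_freq
```

The two prefix_length calls mirror the scenarios added to did_you_mean_api.feature: a value of 5 takes effect, while a value of 1 falls below the hard limit and leaves the default in place.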