Manybubbles has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/129678

Change subject: Improved experimental highlighter settings
......................................................................

Improved experimental highlighter settings

These settings will improve quality of snippets. The tests will only pass
with the new 0.0.5 release of the highlighter though.

Bug: 64259
Change-Id: If970dea018eb631af13f561e33f11bbf1d2395b2
(cherry picked from commit ab93e0f2984f94d805c51eb04c89b2acd3a6ee1f)
---
M includes/ResultsType.php
M tests/browser/features/highlighting.feature
2 files changed, 19 insertions(+), 30 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/78/129678/1

diff --git a/includes/ResultsType.php b/includes/ResultsType.php
index 7e0b8ff..e8df16e 100644
--- a/includes/ResultsType.php
+++ b/includes/ResultsType.php
@@ -188,20 +188,20 @@
 					'top_scoring' => true,
 					'boost_before' => array(
 						// Note these values are super arbitrary right now.
-						'20' => 8,
-						'50' => 7,
-						'200' => 4,
-						'1000' => 2,
+						'20' => 2,
+						'50' => 1.8,
+						'200' => 1.5,
+						'1000' => 1.2,
 					),
-					// Since the best fragments are typically at the beginning of the article
-					// any way we can relatively safely stop searching for matches after 50
-					// fragments. This should help with really crazy documents, say 10MB of
-					// "d d". Without this we'll scan out all the "d"s on a search for "d".
-					// With it, only the first 50.
-					'max_fragments_scored' => 50,
+					// We should set a limit on the number of fragments we try because if we
+					// don't then we'll hit really crazy documents, say 10MB of "d d". This'll
+					// keep us from scanning more then the first couple thousand of them.
+					// Setting this too low (like 50) can bury good snippets if the search
+					// contains common words.
+					'max_fragments_scored' => 5000,
 				),
 			);
-			// If there isn't a match just return some of the the first few sentences .
+			// If there isn't a match just return some of the the first few characters.
 			$text = $singleFragment;
 			$text[ 'no_match_size' ] = 100;
 		} else {
diff --git a/tests/browser/features/highlighting.feature b/tests/browser/features/highlighting.feature
index d9c2d15..6b22b98 100644
--- a/tests/browser/features/highlighting.feature
+++ b/tests/browser/features/highlighting.feature
@@ -18,12 +18,9 @@
       | template:test pickle | Template:Template *Test* | *pickles* |
       # Verify highlighting the presence of accent squashing
       | Africa test | *África* | for *testing* |
-      # Verify highlighting on large pages (Bug 52680).
-      # Bug 64259
-      # | "discuss problems of social and cultural importance" | Rashidun Caliphate | the faithful gathered to *discuss problems of social and cultural importance*. During the caliphate of |
-      # | "discuss problems of social and cultural importance"~ | Rashidun Caliphate | the faithful gathered to *discuss problems of social and cultural importance*. During the caliphate of |
-      | "discuss problems of social and cultural importance" | Rashidun Caliphate | first Khalifa Rasul Allah (Successor *of* the Messenger *of* God), *and* embarked on campaigns to propagate |
-      | "discuss problems of social and cultural importance"~ | Rashidun Caliphate | first Khalifa Rasul Allah (Successor *of* the Messenger *of* God), *and* embarked on campaigns to propagate |
+      # Verify highlighting on large pages.
+      | "discuss problems of social and cultural importance" | Rashidun Caliphate | gathered to *discuss* *problems* *of* *social* *and* *cultural* *importance*. During the caliphate *of* Umar as many |
+      | "discuss problems of social and cultural importance"~ | Rashidun Caliphate | gathered to *discuss* *problems* *of* *social* *and* *cultural* *importance*. During the caliphate *of* Umar as many |
       # Auxiliary text
       | tallest alborz | Rashidun Caliphate | Mount Damavand, Iran's *tallest* mountain is located in *Alborz* mountain range. |

@@ -33,9 +30,7 @@

   Scenario: Found words are highlighted even if found by different analyzers
     When I search for "threatening the unity" community
-    # Bug 64259
-    # Then *threatening the unity* and stability of the new *community* is in the highlighted text of the first search result
-    Then *the* is in the highlighted text of the first search result
+    Then *threatening* *the* *unity* and stability of *the* new *community* is in the highlighted text of the first search result

   @headings
   Scenario: Found words are highlighted in headings
@@ -48,15 +43,11 @@

   Scenario: Found words are highlighted in text even in large documents
     When I search for Allowance to non-Muslims
-    # Bug 64259
-    # Then *Allowance to non-Muslims* is in the highlighted text of the first search result
-    Then *to* is in the highlighted text of the first search result
+    Then *Allowance* *to* *non*-*Muslims* is in the highlighted text of the first search result

   Scenario: Found words are highlighted in text even in large documents
     When I search for "Allowance to non-Muslims"
-    # Bug 64259
-    # Then *Allowance to non-Muslims* is in the highlighted text of the first search result
-    Then *to* is in the highlighted text of the first search result
+    Then *Allowance* *to* *non*-*Muslims* is in the highlighted text of the first search result

   Scenario: Words are not found in image captions
     When I search for The Rose Trellis Egg
@@ -79,9 +70,7 @@

   Scenario: Found words are highlighted in headings even in large documents when searching in a non-strict phrase
     When I search for "Allowance to non-Muslims"~
-    # Bug 64259
-    # Then *Allowance to non-Muslims* is in the highlighted text of the first search result
-    Then *to* is in the highlighted text of the first search result
+    Then *Allowance* *to* *non*-*Muslims* is in the highlighted text of the first search result

   @headings
   Scenario: The highest scoring heading is highlighted AND it doesn't contain html even if the heading on the page does
@@ -109,7 +98,7 @@
     When I search for Rashidun Caliphate
     Then Template:History of the Arab States The *Rashidun* *Caliphate* (Template:lang-ar al-khilafat ar-Rāshidīyah) is the highlighted text of the first search result
     When I search for caliphs
-    Then al-khilafat ar-Rāshidīyah), comprising the first four *caliphs* in Islam's history, was founded after Muhammad's is the highlighted text of the first search result
+    Then The first four *caliphs* are called the Rashidun, meaning the Rightly Guided *Caliphs*, because they are is the highlighted text of the first search result

   @references
   Scenario: References don't appear in highlighted section titles

--
To view, visit https://gerrit.wikimedia.org/r/129678

Gerrit-MessageType: newchange
Gerrit-Change-Id: If970dea018eb631af13f561e33f11bbf1d2395b2
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: wmf/1.24wmf2
Gerrit-Owner: Manybubbles <never...@wikimedia.org>