https://bugzilla.wikimedia.org/show_bug.cgi?id=41577
Web browser: ---
Bug #: 41577
Summary: Use normalized search key in term search index
Product: MediaWiki extensions
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: WikidataRepo
AssignedTo: wikidata-b...@lists.wikimedia.org
ReportedBy: daniel.kinz...@wikimedia.de
CC: wikibugs-l@lists.wikimedia.org, wikidata-b...@lists.wikimedia.org
Classification: Unclassified
Mobile Platform: ---

The term search index currently converts values to UTF-8 (and then to lower case) on the fly to perform comparisons. That means a full table scan followed by a filesort on a table that is likely to contain tens of millions of rows. That's likely to kill the DB server.

To avoid this, there should be a dedicated search key column holding the normalized key (similar to the way a search key column is used for category sorting and for finding external links).

The same normalization shall apply to the index term when it is inserted and to the search term when the query is generated. In particular, the following normalization shall apply:

* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace characters)
* lower case (ideally, using the implementation from the appropriate Language class)
* optionally, apply a configurable regular expression for stripping separators (e.g. by default stripping all internal whitespace and hyphens, so "foobar" will match "foo-bar" and "foo bar")

This will provide case-insensitive matches with some flexibility regarding whitespace, etc. If only exact matches are desired, the "soft" result could be filtered programmatically before returning it to the caller.
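
The normalization steps listed in this report could be sketched roughly as follows. This is only an illustration in Python, not the WikidataRepo implementation: the names `normalize_search_key`, `filter_exact` and `DEFAULT_SEPARATORS` are hypothetical, and Python's built-in `str.lower()` stands in for the language-aware lower-casing that the report asks for.

```python
import re
import unicodedata

# Hypothetical default for the configurable separator pattern:
# strip all internal whitespace and hyphens.
DEFAULT_SEPARATORS = re.compile(r"[\s\-]+")


def normalize_search_key(text, separators=DEFAULT_SEPARATORS):
    """Build a normalized search key from a term.

    The same function would be applied to the index term on insert
    and to the search term when building the query.
    """
    # 1. Unicode normalization (NFC).
    key = unicodedata.normalize("NFC", text)
    # 2. Trim leading and trailing whitespace. str.strip() without
    #    arguments also removes Unicode whitespace characters.
    key = key.strip()
    # 3. Lower case. Assumption: str.lower() is used here instead of a
    #    per-language implementation.
    key = key.lower()
    # 4. Optionally strip separators with a configurable pattern.
    if separators is not None:
        key = separators.sub("", key)
    return key


def filter_exact(term, candidates):
    """Reduce "soft" matches to exact matches after the lookup."""
    return [c for c in candidates if c == term]


if __name__ == "__main__":
    for term in ("foobar", "Foo-Bar", "  foo bar  "):
        print(repr(term), "->", repr(normalize_search_key(term)))
```

With the default pattern, "foobar", "Foo-Bar" and "  foo bar  " all map to the same key, so a plain indexed equality lookup on the key column would find all three. Passing `separators=None` keeps hyphens and internal whitespace, giving case-insensitive matching only.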