https://bugzilla.wikimedia.org/show_bug.cgi?id=41577

       Web browser: ---
             Bug #: 41577
           Summary: Use normalized search key in term search index
           Product: MediaWiki extensions
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: WikidataRepo
        AssignedTo: wikidata-b...@lists.wikimedia.org
        ReportedBy: daniel.kinz...@wikimedia.de
                CC: wikibugs-l@lists.wikimedia.org,
                    wikidata-b...@lists.wikimedia.org
    Classification: Unclassified
   Mobile Platform: ---


The term search index currently converts values to UTF-8 (and then to lower
case) on the fly to perform comparisons. That means a full table scan followed
by a filesort on a table that is likely to contain tens of millions of rows.
That's likely to kill the DB server.

To avoid this, there should be a dedicated search key column holding the
normalized key (similar to the way a search key column is used for category
sorting and finding external links). The same normalization shall apply to the
index term when inserted and the search term when generating the query. In
particular, the following normalization shall apply:

* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace chars)
* lower case (ideally, using the implementation from the appropriate Language
class)
* optionally, apply a configurable regular expression for stripping separators
(e.g. by default stripping all internal whitespace and hyphens, so "foobar"
will match "foo-bar" and "foo bar")
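
The steps listed above could be sketched roughly as follows. This is only an
illustration in Python, not the WikidataRepo implementation: the function name
normalize_search_key and the default separator pattern are made up for this
example, and Python's built-in str.lower() stands in for the language-aware
lower-casing that the Language class would provide.

```python
import re
import unicodedata

# Hypothetical default: strip runs of whitespace and hyphens inside the key.
DEFAULT_SEPARATORS = re.compile(r"[\s\-]+")


def normalize_search_key(text, separators=DEFAULT_SEPARATORS):
    """Build a normalized search key (illustrative sketch only)."""
    # 1. Unicode normalization (NFC).
    key = unicodedata.normalize("NFC", text)
    # 2. Trim leading and trailing whitespace; str.strip() without
    #    arguments also removes Unicode whitespace characters.
    key = key.strip()
    # 3. Lower case (assumption: generic lower-casing, not language-specific).
    key = key.lower()
    # 4. Optionally strip separators with a configurable regular expression.
    if separators is not None:
        key = separators.sub("", key)
    return key
```

The same function would be applied both when writing the search key column and
when building the query, so that "Foo-Bar", " foo bar " and "foobar" all map to
the same key.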

This will provide case-insensitive matches with some flexibility regarding
whitespace, etc. If only exact matches are desired, the "soft" result could be
filtered programmatically before returning it to the caller.
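
A minimal sketch of such a post-filter, in Python for illustration only. The
names soft_key and filter_exact are hypothetical, soft_key is a simplified
stand-in for the normalization described above, and "exact" is assumed here to
mean equality after NFC normalization; the real definition may differ.

```python
import unicodedata


def soft_key(text):
    # Simplified stand-in for the normalized search key: NFC, trim,
    # lower case, then drop whitespace and hyphens.
    key = unicodedata.normalize("NFC", text).strip().lower()
    return "".join(ch for ch in key if not ch.isspace() and ch != "-")


def filter_exact(candidates, search_term):
    # Keep only the "soft" matches that equal the search term exactly
    # (assumption: compared after NFC normalization only).
    wanted = unicodedata.normalize("NFC", search_term)
    return [c for c in candidates if unicodedata.normalize("NFC", c) == wanted]


terms = ["foo-bar", "Foo Bar", "foobar", "food bar"]
# "Soft" result: every term whose normalized key matches the search key.
soft = [t for t in terms if soft_key(t) == soft_key("foobar")]
# Exact result: filtered in code before returning to the caller.
exact = filter_exact(soft, "foobar")
```

In this sketch the database lookup is simulated by a list comprehension; in
practice the soft result would come from an indexed query on the search key
column.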
