[Firebird-devel] RFC: Fix for issue 6915

Pavel Cisar Tue, 02 Nov 2021 03:58:51 -0700

Problem:

This issue is related to INTL support for some languages: the use of collations with the LIKE and STARTING WITH predicates and the >= / <= operators. Specifically, when certain values are looked up in column data in certain languages and collations, performance may drop significantly (the scale depends on the actual query and data distribution).

Analysis revealed that when the last character of the lookup value is part of a "contraction" in the language used, the engine removes it from the key, so the lookup is done with a shortened key and may therefore include many more candidate rows, which are later eliminated from the result by expression evaluation. The removal is done in the Utf16Collation::stringToKey and LC_NARROW_string_to_key functions, and thus affects collations based on Utf16 and NARROW, but not the KSC, UCS2, BIG5, JIS or GB2312 charsets.
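The effect of the shortened key can be illustrated with a small sketch. It is only a toy model under stated assumptions: the names CONTRACTION_STARTS, lookup_prefix and starting_with are hypothetical, plain string prefixes stand in for real collation keys, and a list scan stands in for the index. It is not the code from Utf16Collation::stringToKey.

```python
# Toy assumption: "c" may begin the contraction "ch" in this collation.
CONTRACTION_STARTS = {"c"}

def lookup_prefix(value):
    """Drop a trailing character that could start a contraction (mimics the removal)."""
    if value and value[-1].lower() in CONTRACTION_STARTS:
        return value[:-1]
    return value

rows = ["pac", "pach", "pacient", "padat", "pak", "pata", "pavouk"]

def starting_with(value):
    prefix = lookup_prefix(value)
    # Stand-in for the index scan: it only sees the shortened key.
    candidates = [r for r in rows if r.startswith(prefix)]
    # Stand-in for the later expression evaluation on each candidate row.
    result = [r for r in candidates if r.startswith(value)]
    return candidates, result

candidates, result = starting_with("pac")
print(len(candidates), len(result))  # 7 3
```

Here the lookup for "pac" is done with the key for "pa", so all seven rows are fetched and four of them are discarded afterwards. With a one-character value such as "c" the shortened key is empty and every row becomes a candidate.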

The "contraction" is a facility to handle multi-character letters in certain languages, like Spanish or Czech "ch", which is composed of "c" and "h". The main purpose of contractions is to handle the special ordering rules for these letters, but the facility has a broader meaning and use. There are many languages that use contractions (although the issue was first reported for the Czech language), and contractions may consist of more than two characters (some contraction rules can be quite complex, for example there are "contractions with expansions" in the Hungarian language).
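A minimal sketch of how a contraction changes ordering, assuming a toy Czech-like collation in which "ch" is a single unit sorting after "h". The names UNITS, WEIGHT and sort_key are hypothetical, and real collations build multi-level keys that are considerably more involved.

```python
# Toy collation: Czech-like order where the contraction "ch" sorts after "h".
UNITS = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l", "m",
         "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
WEIGHT = {unit: i for i, unit in enumerate(UNITS)}

def sort_key(text):
    """Map text to a tuple of weights, treating "ch" as one collation unit."""
    text = text.lower()
    key, i = [], 0
    while i < len(text):
        if text.startswith("ch", i):
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[text[i]])
            i += 1
    return tuple(key)

words = ["chata", "cibule", "hrad", "cena", "ihned"]
print(sorted(words, key=sort_key))
# ['cena', 'cibule', 'hrad', 'chata', 'ihned']
```

A plain character-by-character sort would put "chata" between "cena" and "cibule"; with the contraction it lands after "hrad".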

The main reason why this "removal of trailing partial contraction" was done is to achieve behavior "consistent" with search/evaluation in other software (like text editors etc.), so (for example) STARTING WITH "C" or LIKE "C%" will return rows starting with "C" or "CH".

This behavior itself is questionable (more about that later), but the main issue is the performance degradation. It is caused by an increased amount of I/O, as the engine may process more candidate rows (selected by index) than necessary. The amount of excess I/O depends greatly on the data distribution and the character set used. For example, charset WIN1250 with the PDOX_CSY collation is affected (it has a contraction for "CH"), but WIN1250 is a NARROW charset, so data are dense and stored on fewer pages, while charset UTF8 with a collation for LOCALE=cs_CZ uses much more space (up to 2x-3x) and thus more I/O for the same set of data.

The (non-)solution to the performance issue was to use collations WITHOUT contractions, which is not acceptable to end users, as it would also lead to a wrong order of data in ORDER BY / GROUP BY that would not respect the ordering rules for multi-character letters.

So, a better fix for the performance issue is needed. Please note that although this issue needs very specific conditions to manifest, once a user is affected, the impact can be quite severe, especially for UTF8-based data. It's also important to consider that it's not possible to work around the issue, or to minimize the chance that it manifests, by database or application design.


Proposed solutions:

There are two "competing" solutions that BOTH solve the performance problem, but have different consequences.


1. Solution (proposed by Vlad and me):

Use the collation sortkey also in non-indexed lookups / comparisons and get rid of the trailing partial contraction removal code, so collation sortkeys would be used in both cases.

We tested a patched 3.0.7 engine that does not do trailing partial contraction removal, and can confirm that the performance problem is gone. It appears that only STARTING WITH (which covers LIKE 'X%' patterns) needs to be adapted to use sortkeys for non-indexed lookup to have consistent results. Ordinary expressions such as col >= 'C' work normally. Other pattern matching expression types (LIKE patterns that cannot be converted to STARTING WITH, regex) work on the character level and there isn't any type of "key" involved, so using the collation is neither expected nor desired.

However, there are some changes in behavior. STARTING WITH will not include contractable data that COULD be partially matched by the lookup key, i.e. STARTING WITH 'C' will not match a string starting with 'CH'. From the language point of view this is the CORRECT behavior, which is unfortunately not supported by ALL software. Users are simply used to the fact that in some cases they get apples + oranges while asking for apples, because that's how it works now. I'm not YET aware of any other case that would yield different results. There could be some strange collation with contractions that would sort the contracted unit before its first letter, so a query with a condition like col >= 'X' would return a different set, but I'm not aware of any.
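The behavior change can be sketched as a prefix comparison on sortkeys instead of characters. This is an illustration under assumptions, not the patched engine code: the toy collation (UNITS, WEIGHT, sort_key) and the helper starting_with are hypothetical, and real multi-level sortkeys cannot be compared by prefix this simply.

```python
import string

# Toy Czech-like collation: "ch" is one unit placed right after "h".
UNITS = list(string.ascii_lowercase)
UNITS.insert(UNITS.index("h") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(UNITS)}

def sort_key(text):
    """Map text to a tuple of weights, treating "ch" as one collation unit."""
    text = text.lower()
    key, i = [], 0
    while i < len(text):
        step = 2 if text.startswith("ch", i) else 1
        key.append(WEIGHT[text[i:i + step]])
        i += step
    return tuple(key)

def starting_with(value, prefix):
    """Non-indexed STARTING WITH evaluated on sortkeys rather than characters."""
    k, p = sort_key(value), sort_key(prefix)
    return k[:len(p)] == p

print(starting_with("Cena", "C"))    # True
print(starting_with("Chata", "C"))   # False: "Ch" is a different letter than "C"
print(starting_with("Chata", "Ch"))  # True
```

Because the same keys would drive both the index range and the non-indexed evaluation, the two access paths give the same answer.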

Although the new behavior is not supported by all software, it IS CONSISTENT with how MS SQL Server and Oracle handle these cases (which we verified). This fact was highlighted by the users we asked (who often build applications supporting multiple RDBMS) as another reason (besides it being how it should work in the first place) why this solution should be the preferred one.

IF we implement this solution, we will introduce a configuration option to switch between the new and old behavior. If this is done in point releases, the new behavior should be *disabled* (old way) by default and enabled explicitly via the config option. v5 may have it enabled unconditionally.


2. Solution (proposed by Adriano):

Replace the trailing partial contraction removal code with code that would make multiple lookups, one for each possible contraction the trailing character may start; for example, STARTING WITH "C" will perform lookups for "C" and "CH".
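A rough sketch of the multiple-lookup idea, assuming a sorted list of (sortkey, row) pairs as a stand-in for the index. The toy collation and the names CONTRACTIONS, range_scan and starting_with are hypothetical and do not reflect Adriano's actual implementation.

```python
import bisect
import string

# Toy Czech-like collation: "ch" is one unit placed right after "h".
UNITS = list(string.ascii_lowercase)
UNITS.insert(UNITS.index("h") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(UNITS)}
CONTRACTIONS = {"c": ["ch"]}  # contractions that a trailing character may start

def sort_key(text):
    key, i = [], 0
    while i < len(text):
        step = 2 if text.startswith("ch", i) else 1
        key.append(WEIGHT[text[i:i + step]])
        i += step
    return tuple(key)

def range_scan(index, prefix_key):
    """Return rows whose sortkey begins with prefix_key, using binary search."""
    if not prefix_key:
        return [row for _, row in index]
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, prefix_key)
    hi = bisect.bisect_left(keys, prefix_key[:-1] + (prefix_key[-1] + 1,))
    return [row for _, row in index[lo:hi]]

def starting_with(index, value):
    """One narrow lookup for the value plus one per possible trailing contraction."""
    lookups = [value] + [value[:-1] + c for c in CONTRACTIONS.get(value[-1:], [])]
    rows = []
    for v in lookups:
        rows += range_scan(index, sort_key(v))
    return rows

data = ["cena", "chata", "cibule", "hrad", "ihned", "chleba"]
index = sorted((sort_key(r), r) for r in data)
print(starting_with(index, "c"))  # ['cena', 'cibule', 'chata', 'chleba']
```

Each lookup touches only its own narrow key range, so "hrad" and "ihned" are never fetched, yet rows starting with "ch" are still returned for "c", which is the current behavior.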

Experiments show that this should also solve the performance problem while keeping the current behavior. However, the current behavior is not exactly what the affected users we asked want.


Here is the opinion from Dmitry Yemanov:

I'm not really happy with comparing only to Oracle/MSSQL. MySQL seems to be in the opposite camp and agrees on 'ch' like 'c%' is true in their utf_czech_ci collation. PostgreSQL is somewhat difficult to compare against, as AFAIR it still relies on the system locales and does not have any built-in collations.

And regardless, I give more priority to the "natural language" argument than to the "compatibility" one. However, it's good when they match each other.


For me two things are obvious:

1) Regular comparisons (as well as sorting) must use all collation rules, including contractions

2) There should be no difference (except performance) between indexed and non-indexed access

But the rest is really complicated for me. We had INTL support for kinda everything, but SIMILAR was implemented later and uses different rules. Maybe in the ideal world it could also take collations into account, but we're surely not in a position to do that ourselves, and given that no other DBMS seems to have collation-aware SIMILAR, let's just consider it a fact we live with.

LIKE and STARTING WITH are in the middle between comparisons/sorts and SIMILAR though. If we treat LIKE being a close friend for SIMILAR and STARTING WITH being a shorthand for LIKE, we may end with both ignoring the collation rules. If we treat STARTING WITH as a custom shorthand for greater-or-less comparisons, then it should be collation-aware. For me, the former thinking is more logical. But STARTING WITH is non-standard so we are free to implement it the way which is more useful for customers.

What is really questionable to me is whether it's OK to divorce STARTING from LIKE re. collations, given that internally they are closely related (LIKE may be backed by STARTING). Is it possible that internal STARTING (which thinks that 'ch' does not start with 'c') would break the user-specified LIKE which expects the opposite behaviour?

If both STARTING and LIKE would respect contractions, then the only problem remaining (*) is LIKE vs SIMILAR mismatch, but personally I consider it a lesser evil (see above).

(*) I see SUBSTRING/REPLACE/CHAR_LENGTH as different class citizens, thus intentionally ignore their inconsistency with LIKE/STARTING.

As for Adriano's suggestion with multiple lookups, I don't see it as a problem. It looks like hackery at first glance, but it solves the original performance issue and I suppose it could even be improved (with more effort, of course) to use a single scan for multiple matches. But it makes STARTING consider 'ch' like 'c%' to be true, which AFAIU our customers don't really want, and this is the problem.

---

As you can see, we were not able to reach a final decision on which approach should be used, as neither solution has majority or authoritative support. It's very unfortunate, as this issue is critical for (at least) one major Firebird user, so we need to select one and implement it as soon as possible. Hence we'd like to ask you for your feedback about the proposed solutions, so we can break this stalemate.


best regards
Pavel Cisar


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
