Problem:

This issue is related to international support for some languages, the use of collations with the LIKE and STARTING WITH predicates, and the >= / <= operators. Specifically, when certain lookup values are used against column data in certain languages & collations, performance may drop significantly (the scale depends on the actual query and data distribution).

Analysis revealed that when the last character of the lookup value is part of a "contraction" in the used language, it is removed from the key by the engine, so the lookup is done with a shortened key and thus may include many more candidate rows, which are later eliminated from the result by expression evaluation. The removal is done in the Utf16Collation::stringToKey and LC_NARROW_string_to_key functions, and thus affects collations based on Utf16 and NARROW, but not those in the KSC, UCS2, BIG5, JIS or GB2312 charsets.
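
To illustrate the mechanics, here is a hedged sketch; the collation, table and index names below are made up for the example, and the DDL only roughly follows Firebird's CREATE COLLATION syntax:

  -- Hypothetical setup for a UTF8 column with a Czech collation:
  CREATE COLLATION UNICODE_CS_CZ FOR UTF8 FROM UNICODE 'LOCALE=cs_CZ';
  CREATE TABLE towns (name VARCHAR(50) CHARACTER SET UTF8 COLLATE UNICODE_CS_CZ);
  CREATE INDEX idx_towns_name ON towns (name);

  -- 'c' can start the contraction 'ch', so the engine drops the trailing 'c'
  -- when building the index lookup key; the range scan then effectively uses
  -- the prefix 'Vr' and visits every 'Vr...' row, not just the 'Vrc...' ones.
  SELECT * FROM towns WHERE name STARTING WITH 'Vrc';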

The "contraction" is a facility to handle multi-character letters in certain languages, like Spanish or Czech "ch" that is composed from "c" and "h". The main purpose of contractions is handling of special ordering rules for these letters, but has more broader meaning and use. There are many languages that use contractions (but the issue was first reported for Czech language), and contractions may consists from more than two characters (some contraction rules could be quite complex, for example there are "contractions with expansions" in Hungarian language).

The main reason this "removal of the trailing partial contraction" was done is to achieve behavior "consistent" with search/evaluation in other software (text editors etc.), so that (for example) STARTING WITH "C" or LIKE "C%" returns rows starting with "C" or "CH".
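
For example, with the current behavior (same hypothetical table, illustrative values):

  -- Both rows below are returned today: the trailing 'C' of the lookup value
  -- is dropped, and the current evaluation accepts both
  --   'Cesta' (starts with 'C') and
  --   'Chata' (starts with the contraction 'CH').
  SELECT name FROM towns WHERE name STARTING WITH 'C';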

This behavior itself is questionable (more about that later), but the main issue was the performance degradation. It is caused by an increased amount of I/O, as the engine may process more candidate rows (selected via the index) than necessary. The amount of excess I/O depends greatly on data distribution and the character set used. For example, charset WIN1250 with the PDOX_CSY collation is affected (it has a contraction for "CH"), but WIN1250 is a NARROW charset, so data are dense and stored on fewer pages, while charset UTF8 with a collation for LOCALE=cs_CZ uses much more space (up to 2x-3x) and thus more I/O for the same set of data.

The (non-)solution to the performance issue was to use collations WITHOUT contractions, which is not acceptable to end users, as it would also lead to a wrong order of data in ORDER BY / GROUP BY that would not respect the ordering rules for multi-character letters.

So a better fix for the performance issue is needed. Please note that although this issue requires very specific conditions to manifest, once a user is affected the impact can be quite severe, especially for UTF8-based data. It is also important to consider that it is not possible to work around this issue, or minimize the chance of it manifesting, through database or application design.

Proposed solutions:

There are two "competing" solutions that BOTH solve the performance problem, but they have different consequences.

Solution 1 (proposed by Vlad and me):

Use the collation sortkey also in non-indexed lookups/comparisons and get rid of the trailing partial contraction removal code, so that collation sortkeys are used in both cases.

We tested a patched 3.0.7 engine that does not do the trailing partial contraction removal, and can confirm that the performance problem is gone. It appears that only STARTING WITH (which covers LIKE 'X%' patterns) needs to be adapted to use sortkeys for non-indexed lookup in order to have consistent results. Ordinary expressions such as col >= 'C' work normally. Other pattern matching expression types (LIKE patterns that cannot be converted to STARTING WITH, regex) work on the character level and no kind of "key" is involved, so using the collation there is neither expected nor desired.
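
A hedged illustration of that split, using the same hypothetical table as above:

  -- Ordinary comparisons already go through the collation and keep working:
  SELECT name FROM towns WHERE name >= 'C';

  -- Only STARTING WITH (and LIKE patterns rewritten to it, e.g. LIKE 'C%')
  -- would switch to sortkey-based evaluation in the non-indexed path, so the
  -- indexed and non-indexed plans return the same rows:
  SELECT name FROM towns WHERE name STARTING WITH 'C';

  -- LIKE patterns that cannot become STARTING WITH (and SIMILAR TO) keep
  -- their character-level matching and are not touched:
  SELECT name FROM towns WHERE name LIKE '%C%';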

However, there are some changes in behavior. STARTING WITH will no longer include contractable data that COULD be partially matched by the lookup key, i.e. STARTING WITH 'C' will not match a string starting with 'CH'. From the language point of view this is the CORRECT behavior, but it is unfortunately not what ALL software does. Users are simply used to getting apples + oranges in some cases when they ask for apples, because that is how it works now. I am not YET aware of any other case that would yield different results. There could be some strange collation with contractions that sorts the contracted unit before its first letter, so the output of a query with a condition like col >= 'X' would return a different set, but I am not aware of any.
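
To make the change concrete (a hedged sketch, same hypothetical data as above):

  -- With solution 1 only 'Cesta' is returned; 'Chata' no longer matches:
  SELECT name FROM towns WHERE name STARTING WITH 'C';

  -- A user who still wants the old apples + oranges result can ask for it
  -- explicitly:
  SELECT name FROM towns
    WHERE name STARTING WITH 'C' OR name STARTING WITH 'CH';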

Although the new behavior is not what all software does, it IS CONSISTENT with how MS SQL Server and Oracle handle these cases (which we verified). This fact was highlighted by the users we asked (who often build applications supporting multiple RDBMS) as another reason (besides the fact that it is how it should work in the first place) why this solution should be the preferred one.

IF we implement this solution, we will introduce a configuration option to switch between the new and old behavior. If this is done in point releases, it should be *disabled* (old way) by default and enabled explicitly via the config option. v5 may have it enabled unconditionally.

Solution 2 (proposed by Adriano):

Replace the trailing partial contraction removal code with code that makes multiple lookups for all parts of a possible contraction; for example, STARTING WITH "C" would perform lookups for "C" and "CH".
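
Conceptually (a hedged sketch of what the engine would do internally, not user-visible syntax; same hypothetical table as above):

  -- The user writes:
  SELECT name FROM towns WHERE name STARTING WITH 'C';

  -- Instead of trimming the trailing 'C' and scanning one wide key range,
  -- the engine would perform two narrower index lookups and merge them:
  --   lookup 1: rows whose key starts with the key of 'C'
  --   lookup 2: rows whose key starts with the key of 'CH'
  -- The visible result stays the same as today ('Cesta' and 'Chata').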

Experiments show that this should also solve the performance problem while keeping the current behavior. However, the current behavior is not exactly what the affected users we asked want.

Here is the opinion from Dmitry Yemanov:

I'm not really happy with comparing only to Oracle/MSSQL. MySQL seems to be in the opposite camp and agrees that 'ch' LIKE 'c%' is true in their utf8_czech_ci collation. PostgreSQL is somewhat difficult to compare against, as AFAIR it still relies on the system locales and does not have any built-in collations.

And regardless, I give more priority to the "natural language" argument than to the "compatibility" one. However, it's good when they match each other.

For me two things are obvious:

1) Regular comparisons (as well as sorting) must use all collation rules, including contractions

2) There should be no difference (except performance) between indexed and non-indexed access

But the rest is really complicated for me. We had INTL support for kinda everything, but SIMILAR was implemented later and uses different rules. Maybe in an ideal world it could also take collations into account, but we're surely not in a position to do that ourselves, and given that no other DBMS seems to have a collation-aware SIMILAR, let's just consider it a fact we live with.

LIKE and STARTING WITH are in the middle between comparisons/sorts and SIMILAR, though. If we treat LIKE as a close friend of SIMILAR and STARTING WITH as a shorthand for LIKE, we may end up with both ignoring the collation rules. If we treat STARTING WITH as a custom shorthand for greater-or-less comparisons, then it should be collation-aware. For me, the former thinking is more logical. But STARTING WITH is non-standard, so we are free to implement it the way that is more useful for customers.

What is really questionable to me is whether it's OK to divorce STARTING from LIKE regarding collations, given that internally they are closely related (LIKE may be backed by STARTING). Is it possible that the internal STARTING (which thinks that 'ch' does not start with 'c') would break a user-specified LIKE which expects the opposite behaviour?
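
A hedged illustration of that concern (same hypothetical table as above):

  -- LIKE with a constant prefix may be backed internally by STARTING WITH:
  SELECT name FROM towns WHERE name LIKE 'C%';

  -- If the internal STARTING is collation-aware ('Chata' does not start
  -- with 'C') and narrows the index range accordingly, while the user-level
  -- LIKE still matches 'Chata' character by character, the indexed plan
  -- could miss rows the user expects the LIKE to return.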

If both STARTING and LIKE respect contractions, then the only remaining problem (*) is the LIKE vs SIMILAR mismatch, but personally I consider it the lesser evil (see above).

(*) I see SUBSTRING/REPLACE/CHAR_LENGTH as different-class citizens, thus I intentionally ignore their inconsistency with LIKE/STARTING.

As for Adriano's suggestion with multiple lookups, I don't see it as a problem. It looks hackish at first glance, but it solves the original performance issue, and I suppose it could even be improved (with more effort, of course) to use a single scan for multiple matches. But it makes STARTING consider 'ch' LIKE 'c%' to be true, which AFAIU our customers don't really want, and that is the problem.

---

As you can see, we were not able to reach a final decision on which approach should be used, as neither solution has majority or authoritative support. This is very unfortunate, as the issue is critical for (at least) one major Firebird user, so we need to select one approach and implement it as soon as possible. Hence we would like to ask you for your feedback on the proposed solutions, so we can break this stalemate.

best regards
Pavel Cisar

