Problem:
This issue is related to intl. support for some languages, use of
collations and LIKE and STARTING WITH predicates, and >= / <= operators.
Specifically, when certain values are used on columns data in certain
languages & collations, the performance may drop significantly (scale
depends on actual query and data distribution).
Analysis revealed, that when last character of the lookup value is part
of "contraction" in used language, it's removed from the key by engine,
so the lookup is done for shortened key and thus may include much more
candidate rows - that are later eliminated from result by expression
evaluation. The removal is done in Utf16Collation::stringToKey and
LC_NARROW_string_to_key functions, and thus affects collations based on
Utf16 and NARROW, but not in KSC, UCS2, BIG5, JIS or GB2312 charsets.
The "contraction" is a facility to handle multi-character letters in
certain languages, like Spanish or Czech "ch" that is composed from "c"
and "h". The main purpose of contractions is handling of special
ordering rules for these letters, but has more broader meaning and use.
There are many languages that use contractions (but the issue was first
reported for Czech language), and contractions may consists from more
than two characters (some contraction rules could be quite complex, for
example there are "contractions with expansions" in Hungarian language).
The main reason why this "removal of trailing partial contraction" was
done is to achieve behavior "consistent" with search/evaluation in other
software (like text editors etc.), so (for example) STARTING WITH "C" or
LIKE "C%" will return rows starting with "C" or "CH".
This behavior itself is questionable (but more about that later), but
the main issue was the performance degradation. It's caused by increased
amount of I/O, as engine may process more candidate rows (selected by
index) than necessary. The amount of excess I/O depends greatly on data
distribution and used character set. For example, charset WIN1250 with
PDOX_CSY collation is affected (has contraction for "CH"), but WIN1250
is NARROW charset so data are dense and stored on less pages, while
charset UTF8 with collation for LOCALE=cs_CZ uses much more space (up to
2x-3x times) and thus more I/O for the same set of data.
The (non-)solution to performance issue was to use collations WITHOUT
contractions, which is not acceptable by end users, as it would also
lead in wrong order of data in ORDER BY / GROUP BY that would not
respect ordering rules for multi-character letters.
So, a better fix for performance issue is needed. Please note that
although this issue has very specific conditions to manifest, once user
is affected, the impact could be quite severe, especially for UTF8-based
data. It's also important to consider here that it's not possible to
work around or minimize the chance that this issue manifests by database
or application design.
Proposed solutions:
There are two "competing" solutions that BOTH solve the performance
problem, but have different consequences.
1. Solution (proposed by Vlad and me):
Use collation sortkey also in non-indexed lookup / comparisons and get
rid of the trailing partial contraction removal code, so collation
sortkeys would be used in both cases.
We tested patched 3.0.7 engine that does not do trailing partial
contraction removal, and can confirm that the performance problem is
gone. It appears that only the STARTING WITH (that covers LIKE 'X%'
patterns) needs to be adapted to use sortkeys for non-indexed lookup to
have consistent results. Casual expressions as col >= 'C' works
normally. Other pattern matching expression types (LIKE patterns that
couldn't be converted to STARTING WITH, regex) work on character level
and there isn't any type of "key" involved, so using collation is not
expected nor desired.
However, there are some changes in behavior. The STARTING WITH will not
include contractable data that COULD be partially matched by lookup key,
i.e. STARTING WITH 'C' will not match string starting with 'CH'. From
language sense this is the CORRECT behavior, that is unfortunately not
supported by ALL software. Users are just used that there are some cases
when they get apples + oranges while asking for apples, because it's how
it works now. I'm not YET aware of any other case that would yield
different results. There could be some strange collation with
contractions that would sort contracted unit before it's first letter,
so output from query with condition like col >= 'X' would return
different set, but I'm not aware of any.
Although the new behavior is not supported by all software, it IS
CONSISTENT with how MS SQLServer and Oracle handle these cases (which we
verified). This fact was highlighted by asked users (that often build
applications supporting multiple RDBMS) as another reason (beside that
it's how it has to work in first place) why this solution should be the
preferred one.
IF we will implement this solution, we will introduce configuration
option to switch between new and old behavior. If this will be done in
point releases, it should be *disabled* (old way) by default and enabled
explicitly via config option. v5 may have this enabled unconditionally.
2. Solution (proposed by Adriano):
Replace the trailing partial contraction removal code with code that
would make multiple lookups for all parts of
possible contraction, for example STARTING WITH "C" will perform lookup
for "C" and "CH".
Experiments show that this should also solve the performance problem,
while keeping the current behavior. However, the current behavior is not
exactly what asked affected users want.
Here is the opinion from Dmitry Yemanov:
I'm not really happy with comparing only to Oracle/MSSQL. MySQL seems to
be in the opposite camp and agrees on 'ch' like 'c%' is true in their
utf_czech_ci collation. PostgreSQL is somewhat difficult to compare
against, as AFAIR it still relies on the system locales and does not
have any built-in collations.
And regardless, I give more priority to the "natural language" argument
than to the "compatibility" one. However, it's good when they match each
other.
For me two things are obvious:
1) Regular comparisons (as well as sorting) must use all collation
rules, including contractions
2) There should be no difference (except performance) between indexed
and non-indexed access
But the rest is really complicated for me. We had INTL support for kinda
everything, but SIMILAR was implemented later and uses a different
rules. Maybe in the ideal world it could also take collations into
account, but we're surely not in position to do that ourselves, and
given that no other DBMS seem to have collation-aware SIMILAR, let's
just consider it a fact we live with.
LIKE and STARTING WITH are in the middle between comparisons/sorts and
SIMILAR though. If we treat LIKE being a close friend for SIMILAR and
STARTING WITH being a shorthand for LIKE, we may end with both ignoring
the collation rules. If we treat STARTING WITH as a custom shorthand for
greater-or-less comparisons, then it should be collation-aware. For me,
the former thinking is more logical. But STARTING WITH is non-standard
so we are free to implement it the way which is more useful for customers.
What is really questionable to me is whether it's OK to divorce STARTING
from LIKE re. collations, given that internally they are closely related
(LIKE may be backed by STARTING). Is it possible that internal STARTING
(which thinks that 'ch' does not start with 'c') would break the
user-specified LIKE which expects the opposite behaviour?
If both STARTING and LIKE would respect contractions, then the only
problem remaining (*) is LIKE vs SIMILAR mismatch, but personally I
consider it a lesser evil (see above).
(*) I see SUBSTRING/REPLACE/CHAR_LENGTH as different class citizens,
thus intentionally ignore their inconsistency with LIKE/STARTING.
As for Adriano's suggestion with multiple lookups, I don't see it as a
problem. It looks hackery at the first glance, but it solves the
original performance issue and I suppose it could even be improved (with
more efforts, of course) to use a single scan for multiple matches. But
it makes STARTING to consider 'ch' like 'c%' is true what AFAIU our
customers don't really want and this is the problem.
---
As you can see, we were not able to reach final decision which approach
should be used as either solution does not have majority or
authoritative support. It's very unfortunate, as this issue is critical
for (at least) one major Firebird user, so we need to select one and
implement it as soon as possible. Hence we'd like ask you for your
feedback about proposed solutions, so we could break this stalemate.
best regards
Pavel Cisar
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel