On 09/01/2026 14:06, David Geier wrote:
On 06.01.2026 18:00, Heikki Linnakangas wrote:
On 05/01/2026 17:01, David Geier wrote:
v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.
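
For illustration only, a minimal standalone sketch of what such a fast path could look like. It is not the code from the patch: ascii_tolower_sketch() is a hypothetical stand-in for pg_ascii_tolower(), and lower_ascii_prefix() is a made-up helper name. Whether folding ASCII this way matches the rules of the collation in use is a separate question.

```c
#include <stddef.h>

/* Hypothetical stand-in for pg_ascii_tolower(): folds only 'A'..'Z'. */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Lower-case bytes from src into dst for as long as the MSB of each byte
 * is unset, i.e. the byte is plain ASCII.
 *
 * Returns the number of bytes handled. A result smaller than len means a
 * non-ASCII byte was found at that offset, and the caller would have to
 * continue from there on the locale-aware slow path.
 */
static size_t
lower_ascii_prefix(const char *src, size_t len, char *dst)
{
    size_t      i;

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        if (ch & 0x80)
            break;              /* MSB set: leave the fast path */
        dst[i] = (char) ascii_tolower_sketch(ch);
    }
    return i;
}
```

The caller is assumed to provide a dst buffer of at least len bytes.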

This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation would
do something more complicated than what pg_ascii_tolower() does.

Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So specifying a locale with a COLLATE clause, as in
the following example, behaves differently from creating a database with
the same locale as its default.

postgres=# select lower('III' COLLATE "tr_TR");
  lower
-------
  ııı

postgres=# select show_trgm('III' COLLATE "tr_TR");
         show_trgm
-------------------------
  {"  i"," ii","ii ",iii}
(1 row)

But when using tr_TR as the default locale of the database, the following
happens:

postgres=# select lower('III' COLLATE "tr_TR");
  lower
-------
  ııı

postgres=# select show_trgm('III');
                show_trgm
---------------------------------------
  {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}

I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.

Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column collation support, so I guess we never really thought through how it should interact with COLLATE clauses.

Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one character at a time. In ICU locales, lower-casing a character can depend on the surrounding characters, so you cannot just test the conversion of every ascii character individually. It would make sense for libc locales though, and I hope the ICU functions are a little faster anyway.
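
A rough sketch of the per-character check discussed above, assuming a locale whose lower-casing works one character at a time (the libc case). All names here are hypothetical; the locale's lower-casing is passed in as a plain function pointer so the example stays self-contained, whereas real code would call the locale provider's function and could cache the result when the locale is created.

```c
#include <stdbool.h>

/* Lower-casing callback operating on a single code point (assumed shape). */
typedef unsigned int (*lower_fn) (unsigned int codepoint);

/* The ASCII-only rule: fold 'A'..'Z', leave everything else alone. */
static unsigned int
ascii_lower_rule(unsigned int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/*
 * Returns true if the given per-character lower-casing function agrees
 * with the ASCII-only rule for every code point below 128. Only then
 * would an ASCII fast path give the same result as the locale.
 */
static bool
locale_lowers_ascii_like_ascii(lower_fn locale_lower)
{
    for (unsigned int c = 0; c < 128; c++)
    {
        if (locale_lower(c) != ascii_lower_rule(c))
            return false;
    }
    return true;
}

/* Toy Turkish-style rule for demonstration: 'I' maps to dotless i (U+0131). */
static unsigned int
turkish_like_lower(unsigned int c)
{
    if (c == 'I')
        return 0x0131;
    return ascii_lower_rule(c);
}
```

With these definitions, the check passes for ascii_lower_rule and fails for turkish_like_lower. It says nothing about locales where the result depends on neighbouring characters, which is why it would not carry over to ICU.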

Although, we probably should be using case-folding rather than lower-casing with ICU locales anyway. Case-folding is designed for string matching. It'd be a backwards-compatibility breaking change, though.

- Heikki


