Re: Reduce build times of pg_trgm GIN indexes

Heikki Linnakangas Mon, 12 Jan 2026 14:10:34 -0800

On 09/01/2026 14:06, David Geier wrote:

On 06.01.2026 18:00, Heikki Linnakangas wrote:

On 05/01/2026 17:01, David Geier wrote:

v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
of text is actually ASCII. Hence, we provide a fast path for this case
which is exercised if the MSB of the current character is unset.
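
For illustration, the fast-path idea described above can be sketched roughly as follows. This is a minimal sketch and not the code from the patch: is_ascii_byte(), ascii_only_lower() and lower_ascii_prefix() are hypothetical names, and the lower-casing rule is assumed to be the plain A-Z to a-z mapping. Whether that mapping is valid for a given collation is exactly the question raised in the reply below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the most significant bit is unset (7-bit ASCII). */
static bool
is_ascii_byte(unsigned char c)
{
    return (c & 0x80) == 0;
}

/* Assumed ASCII-only rule: maps A-Z to a-z, leaves every other byte alone. */
static unsigned char
ascii_only_lower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c += 'a' - 'A';
    return c;
}

/*
 * Lower-case the leading run of ASCII bytes of src into dst and return how
 * many bytes were handled. The loop stops at the first byte with the MSB
 * set; a caller would hand the rest to the locale-aware slow path.
 */
static size_t
lower_ascii_prefix(const char *src, size_t len, char *dst)
{
    size_t      i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];

        if (!is_ascii_byte(c))
            break;              /* multibyte sequence starts here */
        dst[i] = (char) ascii_only_lower(c);
    }
    return i;
}
```

The function returns the length of the ASCII prefix, so a caller can tell whether the whole input took the fast path (return value equals len) or whether a slow-path fallback is needed for the remainder.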


This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation
would do something more complicated than what pg_ascii_tolower() does.


Oh, that's evil. I had tested that specifically. But it only worked
because the code in master uses str_tolower() with
DEFAULT_COLLATION_OID. So specifying a different locale with COLLATE, as
in the following example, behaves differently from creating a database
with that same locale as its default.

postgres=# select lower('III' COLLATE "tr_TR");
  lower
-------
  ııı

postgres=# select show_trgm('III' COLLATE "tr_TR");
         show_trgm
-------------------------
  {"  i"," ii","ii ",iii}
(1 row)

But when using tr_TR as the default locale of the database, the following
happens:

postgres=# select lower('III' COLLATE "tr_TR");
  lower
-------
  ııı

postgres=# select show_trgm('III');
                show_trgm
---------------------------------------
  {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}

I'm wondering if that's intentional to begin with. Shouldn't the code
instead pass PG_GET_COLLATION() to str_tolower()? Might require some
research to see how other index types handle locales.

Coming back to the original problem: the lengthy comment at the top of
pg_locale_libc.c suggests that in some cases ASCII characters are
handled the pg_ascii_tolower() way for the default locale. See for
example tolower_libc_mb(). So a character by character conversion using
that function will yield a different result than strlower_libc_mb(). I'm
wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.

Anyways, we could limit the optimization to only kick in when the used
locale follows the same rules as pg_ascii_tolower(). We could test that
when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ASCII character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
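
To make the libc case concrete, such a check could look roughly like the sketch below. It is only an illustration under assumptions: it uses the standard tolower() on the process's current locale, whereas real code would presumably test the specific locale object (and possibly the wide-character variants), and the names ascii_rule_tolower() and libc_tolower_matches_ascii() are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Assumed ASCII-only rule: A-Z map to a-z, everything else is unchanged. */
static int
ascii_rule_tolower(int c)
{
    assert(c >= 0 && c < 128);

    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/*
 * Return true if the current libc locale lower-cases every 7-bit ASCII
 * character exactly like the ASCII-only rule above. The result could be
 * computed once when the locale is set up and cached as a flag.
 */
static bool
libc_tolower_matches_ascii(void)
{
    for (int c = 0; c < 128; c++)
    {
        if (tolower(c) != ascii_rule_tolower(c))
            return false;
    }
    return true;
}
```

A per-character test like this only makes sense because the libc functions convert one character at a time; it says nothing about ICU, where the result can depend on context.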

Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.


- Heikki
