On 05/01/2026 17:01, David Geier wrote:
The script I used for testing is attached. I ran CREATE INDEX three
times and took the fastest run. I'm getting the following results on my
i9-13950HX dev laptop in release build:
Data set            | Patched (ms) | Master (ms) | Speedup
--------------------|--------------|-------------|--------
movies(plot)        |        3,409 |      10,311 |   3.02x
lineitem(l_comment) |      161,569 |     256,986 |   1.59x
Impressive speedup!
The attached patches do the following:
- v1-0001-Inline-ginCompareAttEntries.patch: Inline
ginCompareAttEntries(), which is called very frequently by the GIN code.
Looks good.
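
For readers not following the patches: ginCompareAttEntries() is a tiny
two-level comparator, so the change presumably amounts to moving it into a
header as static inline, something like this (untested sketch, based on the
current shape of the function in ginutil.c):

static inline int
ginCompareAttEntries(GinState *ginstate,
                     OffsetNumber attnuma, Datum a, GinNullCategory categorya,
                     OffsetNumber attnumb, Datum b, GinNullCategory categoryb)
{
    /* attribute number is the first sort key */
    if (attnuma != attnumb)
        return (attnuma < attnumb) ? -1 : 1;

    return ginCompareEntries(ginstate, attnuma, a, categorya, b, categoryb);
}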
- v1-0002-Optimized-comparison-functions.patch: Use FunctionCallInvoke()
instead of FunctionCall2Coll(). This saves a bunch of per-comparison
setup code, such as calling InitFunctionCallInfoData().
You lose the check for a NULL result with this. That's probably still
worth checking.
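
For context, the pattern presumably looks roughly like this (a sketch with
illustrative variable placement, not the actual patch):

    /* one-time setup, hoisted out of the per-comparison path */
    LOCAL_FCINFO(fcinfo, 2);
    Datum       result;

    InitFunctionCallInfoData(*fcinfo, &ginstate->compareFn[attnum - 1], 2,
                             ginstate->supportCollation[attnum - 1],
                             NULL, NULL);

    /* per comparison: only reload the arguments and invoke */
    fcinfo->args[0].value = a;
    fcinfo->args[0].isnull = false;
    fcinfo->args[1].value = b;
    fcinfo->args[1].isnull = false;

    result = FunctionCallInvoke(fcinfo);

    /* the check that FunctionCall2Coll() would otherwise have done for us */
    if (fcinfo->isnull)
        elog(ERROR, "function %u returned NULL", fcinfo->flinfo->fn_oid);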
- v1-0003-Use-sort_template.h.patch: Use sort_template.h instead of
qsort(), to inline calls to the sort comparator. This is an interim step
that is further improved on by patch 0006.
ok
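
As an aside for anyone unfamiliar with sort_template.h: it is instantiated
by defining a handful of macros and including the header, which lets the
compiler inline the comparator into the specialized sort. A minimal,
hypothetical example on plain int32 keys (the patch's element type and
comparator are of course the GIN ones):

static inline int
compare_int32_keys(int32 a, int32 b)
{
    return (a > b) - (a < b);
}

#define ST_SORT sort_int32_keys
#define ST_ELEMENT_TYPE int32
#define ST_COMPARE(a, b) compare_int32_keys(*(a), *(b))
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/* usage: sort_int32_keys(keys, nkeys) */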
- v1-0004-Avoid-dedup-and-sort-in-ginExtractEntries.patch:
ginExtractEntries() deduplicates and sorts the entries returned from the
extract value function. In the case of pg_trgm, that is completely redundant
because the trigrams are already deduplicated and sorted. The current
version of this patch is just to demonstrate the potential. We need to
think about what we want here. Ideally, we would require the extraction
function to provide the entries deduplicated and sorted. Alternatively,
we could indicate to ginExtractEntries() if the entries are already
deduplicated and sorted. If we don't want to alter the signature of the
extract value function, we could e.g. use the MSB of the nentries argument.
Yeah, this seems wrong as it is. You're assuming that if the extract
function returns nullFlags == NULL, the array is already sorted and deduped.
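
Just to make the MSB-of-nentries alternative above concrete, the hypothetical
protocol could look something like this (flag name and encoding invented for
illustration; as noted, v1-0004 doesn't actually carry such an explicit flag):

#define GIN_ENTRIES_PRESORTED   ((int32) 0x80000000)    /* hypothetical */

    /* in the opclass extractValue function */
    *nentries = trglen | GIN_ENTRIES_PRESORTED;

    /* in ginExtractEntries() */
    if (nentries < 0)
    {
        /* extractValue promised sorted, de-duplicated entries */
        nentries &= ~GIN_ENTRIES_PRESORTED;
        /* ... skip the sort and de-dup pass below ... */
    }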
- v1-0005-Make-btint4cmp-branchless.patch: Removes branches from
btint4cmp(), which is called heavily from the GIN code. This might also
benefit other parts of the code base.
Seems reasonable.
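
For illustration, the usual branchless idiom is the following (a sketch;
whether the patch does exactly this I haven't checked):

Datum
btint4cmp(PG_FUNCTION_ARGS)
{
    int32       a = PG_GETARG_INT32(0);
    int32       b = PG_GETARG_INT32(1);

    /* three-way comparison yielding -1/0/1 without branches */
    PG_RETURN_INT32((a > b) - (a < b));
}

(A plain a - b wouldn't be safe here because of int32 overflow.)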
- v1-0006-Use-radix-sort.patch: Replace the sort_template.h-based qsort()
with radix sort. For the purpose of demonstrating the possible gains,
I've only replaced the signed variant for now. I've also tried using
simplehash.h for deduplicating followed by a sort_template.h-based sort.
But that was slower.
Ok.
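
In case it helps others reading along, the core of an LSD radix sort for
signed 32-bit keys is roughly the following (untested sketch; the patch's
actual element type, scratch-buffer handling and small-array cutoffs will
differ):

static void
radix_sort_int32(int32 *keys, size_t n)
{
    uint32     *src = (uint32 *) keys;
    uint32     *tmp;
    size_t      counts[256];
    size_t      i;
    int         pass;

    if (n < 2)
        return;
    tmp = (uint32 *) palloc(n * sizeof(uint32));

    /* flip the sign bit so signed order matches unsigned order */
    for (i = 0; i < n; i++)
        src[i] ^= 0x80000000;

    /* four byte-wide counting-sort passes, least significant byte first */
    for (pass = 0; pass < 4; pass++)
    {
        int         shift = pass * 8;
        size_t      total = 0;
        uint32     *swap;

        memset(counts, 0, sizeof(counts));
        for (i = 0; i < n; i++)
            counts[(src[i] >> shift) & 0xFF]++;
        for (i = 0; i < 256; i++)
        {
            size_t      c = counts[i];

            counts[i] = total;
            total += c;
        }
        for (i = 0; i < n; i++)
            tmp[counts[(src[i] >> shift) & 0xFF]++] = src[i];

        swap = src;
        src = tmp;
        tmp = swap;
    }

    /* after an even number of passes the result is back in 'keys' */
    for (i = 0; i < n; i++)
        src[i] ^= 0x80000000;   /* undo the sign-bit flip */

    pfree(tmp);
}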
- v1-0007-Faster-qunique-comparator.patch: qunique() doesn't require a
full sort comparator (-1 = less, 0 = equal, 1 = greater) but only an
equal/unequal comparator (e.g. 0 = equal and 1 = unequal). The same
optimization can be done in plenty of other places in our code base.
Likely, in most of them the gains are insignificant.
Makes sense. I'm a little disappointed the compiler won't do that
optimization for us...
Perhaps we should introduce a new qunique_eq() function with a different
callback signature:
/* like qunique(), but the callback function returns true/false rather
than int */
static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
           bool (*equal) (const void *, const void *))
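
The body would then presumably just mirror the existing qunique() (untested
sketch):

static inline size_t
qunique_eq(void *array, size_t elements, size_t width,
           bool (*equal) (const void *, const void *))
{
    char       *bytes = (char *) array;
    size_t      i,
                j;

    if (elements <= 1)
        return elements;

    for (i = 1, j = 0; i < elements; i++)
    {
        /* keep the element if it differs from the last kept one, moving it
         * down unless it's already in place */
        if (!equal(bytes + i * width, bytes + j * width) && ++j != i)
            memcpy(bytes + j * width, bytes + i * width, width);
    }
    return j + 1;
}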
- v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically, lots
of text is actually ASCII. Hence, we provide a fast path for this case,
which is taken when the MSB of the current character is unset.
This uses pg_ascii_tolower() for ASCII characters when built with
IGNORECASE. I don't think that's correct if the proper collation would do
something more complicated than what pg_ascii_tolower() does.
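
To make that concern concrete: the fast path presumably boils down to
something like this per-byte helper (name invented, illustrative only), and
the worry is that for a non-C collation even an ASCII letter may need more
than pg_ascii_tolower():

static inline bool
ascii_fastpath_char(char c, char *out)
{
    if (((unsigned char) c & 0x80) != 0)
        return false;           /* multi-byte sequence: take the slow path */
#ifdef IGNORECASE
    *out = (char) pg_ascii_tolower((unsigned char) c);
#else
    *out = c;
#endif
    return true;
}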
Did you measure how big the impact of each individual patch is?
Patches 1 and 2 seem pretty much ready to be committed, but I wonder if
they make any difference on their own.
- Heikki