Speed up eqjoinsel() with lots of MCV entries.

If both sides of the operator have most-common-value statistics,
eqjoinsel wants to check which MCVs have matches on the other side.
Formerly it did this with a dumb compare-all-the-entries loop,
which had O(N^2) behavior for long MCV lists.  When that code was
written, twenty-plus years ago, that seemed tolerable; but nowadays
people frequently use much larger statistics targets, so that the
O(N^2) behavior can hurt quite a bit.

To add insult to injury, when asked for semijoin semantics, the
entire comparison loop was done over, even though we frequently
know that it will yield exactly the same results.

To improve matters, switch to using a hash table to perform the
matching.  Testing suggests that depending on the data type, we may
need up to about 100 MCVs on each side to amortize the extra costs
of setting up the hash table and performing hash-value computations;
so continue to use the old looping method when there are fewer MCVs
than that.

Also, refactor so that we don't repeat the matching work unless
we really need to, which occurs only in the uncommon case where
eqjoinsel_semi decides to truncate the set of inner MCVs it
considers.  The refactoring also got rid of the need to use the
presented operator's commutator.  Real-world operators that are
using eqjoinsel should pretty much always have commutators, but
at the very least this saves a few syscache lookups.

Author: Ilia Evdokimov <[email protected]>
Co-authored-by: David Geier <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Discussion: 
https://postgr.es/m/[email protected]

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/057012b205a082ec930cd7bb7f6663516be011e2

Modified Files
--------------
src/backend/utils/adt/selfuncs.c | 554 ++++++++++++++++++++++++++++++---------
src/tools/pgindent/typedefs.list |   3 +
2 files changed, 436 insertions(+), 121 deletions(-)

Reply via email to