Does anyone else have an opinion on this patch? It looks promising.
---------------------------------------------------------------------------

On Wed, Jun 28, 2023 at 04:53:06PM +0300, Alena Rybakina wrote:
> Hi!
>
> On 15.06.2023 11:30, Andrey Lepikhov wrote:
>
> Hi, all.
>
> Some of my clients use JOINs with three or four clauses. Quite
> frequently, I see complaints about an unreasonable switch of the JOIN
> algorithm to Merge Join instead of Hash Join. Quick research has shown
> one weak place: the estimation of the average bucket size in
> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
> conservative strategy.
> Unlike the estimation of groups, here we use the smallest ndistinct
> value across all buckets instead of multiplying them (or trying to do
> a multivariate analysis).
> It works fine for the case of one clause. But if we have many clauses,
> and each has a high ndistinct value, we will overestimate the average
> size of a bucket and, as a result, prefer Merge Join. As the example
> in the attachment shows, this leads to a worse plan than possible,
> sometimes drastically worse.
> I assume this is done out of fear of functional dependencies between
> the hash clause components. But in my opinion, we should go the same
> way here as in the estimation of groups.
> The attached patch shows a sketch of the solution.
>
> This problem is very important.
>
> Honestly, I'm still learning your code and looking for cases where
> your patch changes things for the worse or for the better. But I have
> already found several interesting cases where your patch helps choose
> a more correct plan, with a focus on memory consumption.
>
> To make the evaluation easier, I added a GUC to your patch that makes
> it easy to switch between your way and the original way of estimating
> bucket sizes (diff_estimate.diff).
>
> Here are other cases where your fix improves the query plan.
>
> First of all, I changed the way the tables are created, to look at the
> behavior of the query plan in terms of planning and execution time:
>
> DROP TABLE IF EXISTS a,b CASCADE;
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,1e5) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                                 QUERY PLAN
> ---------------------------------------------------------------------------
>  Hash Join (actual time=200.872..200.879 rows=0 loops=1)
>    Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
>    ->  Seq Scan on b (actual time=0.029..15.946 rows=100000 loops=1)
>    ->  Hash (actual time=97.645..97.649 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 5612kB
>          ->  Seq Scan on a (actual time=0.024..17.153 rows=100000 loops=1)
>  Planning Time: 2.910 ms
>  Execution Time: 201.949 ms
> (8 rows)
>
> SET
>                                 QUERY PLAN
> ---------------------------------------------------------------------------
>  Merge Join (actual time=687.415..687.416 rows=0 loops=1)
>    Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
>    ->  Sort (actual time=462.022..536.716 rows=100000 loops=1)
>          Sort Key: b.y, b.x, b.z
>          Sort Method: external merge  Disk: 3328kB
>          ->  Seq Scan on b (actual time=0.017..12.326 rows=100000 loops=1)
>    ->  Sort (actual time=111.295..113.196 rows=16001 loops=1)
>          Sort Key: a.y, a.x, a.z
>          Sort Method: external sort  Disk: 2840kB
>          ->  Seq Scan on a (actual time=0.020..10.129 rows=100000 loops=1)
>  Planning Time: 0.752 ms
>  Execution Time: 688.829 ms
> (12 rows)
>
> Secondly, I found another case, not related to the planner preferring
> merge join over hash join: here we can see that the plan has become
> better due to lower memory consumption, and planning also takes less
> time.
>
> With the same query, planning time was reduced about 5-fold and the
> number of buckets 128-fold, so memory consumption also decreased:
>
> DROP TABLE IF EXISTS a,b CASCADE;
>
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                                                    QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------
>  Hash Join (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651 rows=0 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
>    ->  Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.027..17.980 rows=100000 loops=1)
>    ->  Hash (cost=10.00..10.00 rows=600 width=12) (actual time=2.046..2.047 rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.022..0.315 rows=600 loops=1)
>  Planning Time: 0.631 ms
>  Execution Time: 95.730 ms
> (8 rows)
>
> SET
>                                                    QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Hash Join (cost=3387.00..8621.58 rows=8 width=32) (actual time=102.873..102.877 rows=0 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
>    ->  Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.014..0.131 rows=600 loops=1)
>    ->  Hash (cost=1637.00..1637.00 rows=100000 width=20) (actual time=101.920..101.921 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.013..16.349 rows=100000 loops=1)
>  Planning Time: 0.153 ms
>  Execution Time: 103.518 ms
> (8 rows)
>
> I also show an improvement for a left or right outer join:
>
> DROP TABLE IF EXISTS a,b CASCADE;
>
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a RIGHT JOIN b
>   ON a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a RIGHT JOIN b
>   ON a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                                                    QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------
>  Hash Left Join (cost=20.50..3157.58 rows=100000 width=32) (actual time=1.846..102.264 rows=100000 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
>    ->  Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.041..15.328 rows=100000 loops=1)
>    ->  Hash (cost=10.00..10.00 rows=600 width=12) (actual time=1.780..1.781 rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.031..0.252 rows=600 loops=1)
>  Planning Time: 0.492 ms
>  Execution Time: 107.609 ms
> (8 rows)
>
> SET
>                                                    QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Hash Right Join (cost=3387.00..8500.08 rows=100000 width=32) (actual time=80.919..101.613 rows=100000 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
>    ->  Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.017..0.084 rows=600 loops=1)
>    ->  Hash (cost=1637.00..1637.00 rows=100000 width=20) (actual time=80.122..80.123 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.015..11.819 rows=100000 loops=1)
>  Planning Time: 0.194 ms
>  Execution Time: 104.662 ms
> (8 rows)
>
> --
> Regards,
> Alena Rybakina
> Postgres Professional
>
> diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
> index ef475d95a18..31771dfba46 100644
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -153,6 +153,7 @@ bool		enable_parallel_hash = true;
>  bool		enable_partition_pruning = true;
>  bool		enable_presorted_aggregate = true;
>  bool		enable_async_append = true;
> +bool		enable_cost_size = true;
>  
>  typedef struct
>  {
> @@ -4033,11 +4034,22 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>  				thismcvfreq = restrictinfo->left_mcvfreq;
>  			}
>  
> +			if (enable_cost_size)
> +			{
> +				innerbucketsize *= thisbucketsize;
> +				innermcvfreq *= thismcvfreq;
> +			}
> +			else
> +			{
>  			if (innerbucketsize > thisbucketsize)
>  				innerbucketsize = thisbucketsize;
>  			if (innermcvfreq > thismcvfreq)
>  				innermcvfreq = thismcvfreq;
> +			}
>  		}
> +
> +		if (enable_cost_size && innerbucketsize < 1.0 / virtualbuckets)
> +			innerbucketsize = 1.0 / virtualbuckets;
>  	}
>  
>  	/*
> diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
> index 71e27f8eb05..ded9ba3b7a9 100644
> --- a/src/backend/utils/misc/guc_tables.c
> +++ b/src/backend/utils/misc/guc_tables.c
> @@ -1007,6 +1007,19 @@ struct config_bool ConfigureNamesBool[] =
>  		true,
>  		NULL, NULL, NULL
>  	},
> +	{
> +		{"enable_cost_size", PGC_USERSET, QUERY_TUNING_OTHER,
> +			gettext_noop("Enables multiplying the estimated hash bucket "
> +						 "sizes across all hash clauses, instead of "
> +						 "taking the minimum of them, when costing a "
> +						 "hash join."),
> +			NULL,
> +			GUC_EXPLAIN
> +		},
> +		&enable_cost_size,
> +		true,
> +		NULL, NULL, NULL
> +	},
>  	{
>  		{"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
>  			gettext_noop("Enables the planner's use of async append plans."),
> diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
> index 6cf49705d3a..c79ec12e6d5 100644
> --- a/src/include/optimizer/cost.h
> +++ b/src/include/optimizer/cost.h
> @@ -71,6 +71,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
>  extern PGDLLIMPORT bool enable_presorted_aggregate;
>  extern PGDLLIMPORT bool enable_async_append;
>  extern PGDLLIMPORT int	constraint_exclusion;
> +extern PGDLLIMPORT bool enable_cost_size;
>  
>  extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
> 								  double index_pages, PlannerInfo *root);

-- 
  Bruce Momjian  <br...@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.