Re: Propagate pathkeys from CTEs up to the outer query
On 29/1/2024 10:18, Richard Guo wrote: In [1] we've reached a conclusion that for a MATERIALIZED CTE it's okay to 'allow our statistics or guesses for the sub-query to subsequently influence what the upper planner does'. Commit f7816aec23 exposes column statistics to the upper planner. In light of that, here is a patch that exposes info about the ordering of the CTE result to the upper planner. This patch was initially posted in that same thread and received some comments from Tom in [2]. The presence of multiple patches in that thread led to confusion, so I'm forking a new thread here specifically dedicated to discussing the patch that exposes pathkeys from CTEs to the upper planner. I like this approach (a sketch of the kind of query it helps is below). It looks good initially, but such features need more opinions/views/time to analyse corner cases. It goes alongside my current back-burner project: pulling parameterisation through the GROUP BY and the query-block fence up to the join-search code of the parent query. -- regards, Andrei Lepikhov Postgres Professional
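A hand-made sketch of the class of queries the patch helps (my own illustration; the table and column names are assumptions, not taken from the patch's regression tests). The CTE result is already ordered by ts, so once its pathkeys are exposed to the upper planner, the outer ORDER BY should not need a second Sort:

CREATE TABLE events (id int, ts timestamptz);
EXPLAIN
WITH w AS MATERIALIZED (
    SELECT * FROM events ORDER BY ts
)
SELECT * FROM w ORDER BY ts;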
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On 30/1/2024 12:44, Ashutosh Bapat wrote: Thanks Vignesh. PFA patches rebased on the latest HEAD. The patch addressing Amit's comments is still a separate patch for him to review. Thanks for this improvement. Working with partitions, I frequently see peaks of memory consumption during planning, so maybe one more such case can be resolved here. Patch 0001 looks good. I'm not sure about free_child_sjinfo_members: do we really need it as a separate routine? It might be better to inline this code. Patch 0002 adds valuable comments, and I'm OK with that. Also, as I remember, some extensions, such as pg_hint_plan, call build_child_join_sjinfo. It is OK to break the interface in a major version, but what if they need the child sjinfo a bit longer and keep links to this structure? I don't think it is a real stopper, but it is worth additional analysis. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 17:03, Andrei Lepikhov wrote: On 13/2/2024 07:00, jian he wrote: The time is the last result of the 10 iterations. I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. Having written that, I left it on the back burner. To close the issue, I explored get_restriction_qual_cost(). A close look shows us that "x IN (..)" is cheaper than its equivalent "x=N1 OR ...". Just the numbers: ANY: startup_cost = 0.0225, per-tuple cost = 0.005; OR: startup_cost = 0, per-tuple cost = 0.0225. The per-tuple part of an expression's cost is charged for every tuple. In your example, we have many tuples, so the low per-tuple cost dominates over the significant startup cost (see the arithmetic below). According to the above, SAOP adds 6250 to the cost of SeqScan; OR adds 13541. So, the total cost of the query with SAOP is less than with OR, and the optimizer doesn't resort to parallel workers. That is the answer. So, this example is more about the subtle balance between parallel/sequential execution, which can vary from one platform to another. -- regards, Andrei Lepikhov Postgres Professional
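A back-of-the-envelope check using the numbers above (my own arithmetic, not from the thread): the full cost of evaluating the filter expression over $N$ tuples is the startup cost plus the per-tuple cost times $N$,

\[ \mathrm{cost}_{\mathrm{ANY}}(N) = 0.0225 + 0.005\,N, \qquad \mathrm{cost}_{\mathrm{OR}}(N) = 0.0225\,N, \]

so the SAOP form wins as soon as $0.0225 + 0.005\,N < 0.0225\,N$, i.e. for $N > 1.3$ or so, which is effectively always on a table of any size.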
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 17:51, Alena Rybakina wrote: On 12.02.2024 12:01, Andrei Lepikhov wrote: On 12/2/2024 15:55, Alena Rybakina wrote: As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictinfo procedure; otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? I got it; I think I was confused by the "else" block, where we can process the index; otherwise we move on to the next element. I think maybe the "else" block, which creates the restrictinfo and builds the indexpaths list, should be moved to a separate function, or the "else" should just be removed? IMO, it is a matter of taste. But if you are really confused, maybe it will make understanding simpler for someone else. So, changed. I think we need to check that rinfo->clause is not empty, because if it is, we can miss calling the build_paths_for_OR function. We should add it there: restriction_is_saop_clause(RestrictInfo *restrictinfo) { if (IsA(restrictinfo->clause, ScalarArrayOpExpr)) I wonder if we should add an assertion here rather than a NULL check. In what case could we get a NULL clause here? But I added it for surety. By the way, I think we need to add a check that the clauseset is not empty (if (!clauseset.nonempty)), otherwise we could get an error. The same check I noticed in the build_paths_for_OR function. I don't think so. Feel free to provide a counterexample. -- regards, Andrei Lepikhov Postgres Professional From 2b088a1bd662c614ee491a797d2fcb67e2fb2391 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. --- src/backend/optimizer/path/indxpath.c | 301 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 689 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..56b04541db 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,174 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. + */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? 
+*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel->indexlist) + { + IndexOptInfo *index = (IndexOptInfo *) lfirst(lc); + + /* Take into consideration partial indexes supporting bitmap scans */ + if (!index->amhasgetbitmap || index->indpred == NIL || index->predOK) + continue; + + pd = palloc0(sizeof(PredicatesData)); + pd->id = foreach_current_index(lc); + /* The trick with reducing recursion is stolen from predicate_implied_by */ + pd->predicate = list_length(index->indpred) == 1 ? + (Node *) linitial(index->indpred) : + (Node *) index->indpred; + predicate_lists = lappend(predicate_lis
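To make the intent of build_paths_for_SAOP above concrete, here is a hand-made illustration of the class of queries it targets (my own sketch; the schema is an assumption, not taken from the patch's regression tests):

CREATE TABLE t (x int);
CREATE INDEX t_small ON t (x) WHERE x < 100;
CREATE INDEX t_large ON t (x) WHERE x >= 100;
-- With the patch, the planner may split the array elements between the
-- two partial indexes and build a BitmapOr over them:
EXPLAIN SELECT * FROM t WHERE x = ANY ('{1, 2, 150}'::int[]);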
Re: POC, WIP: OR-clause support for indexes
On 13/2/2024 07:00, jian he wrote: + newa = makeNode(ArrayExpr); + /* array_collid will be set by parse_collate.c */ + newa->element_typeid = scalar_type; + newa->array_typeid = array_type; + newa->multidims = false; + newa->elements = aexprs; + newa->location = -1; I am confused by the comment `array_collid will be set by parse_collate.c`, can you further explain it? I wonder if the second paragraph of comments on commit b310b6e will be enough to dive into details. if the OR expression's right arm is not a plain Const but has a collation specification, e.g. `where a = 'a' collate "C" or a = 'b' collate "C";`, then the rightop is not a Const but a CollateExpr, and it will not be used in the transformation. Yes, it is done for simplicity right now. I'm not sure about corner cases of merging such expressions. set enable_or_transformation to on; explain(timing off, analyze, costs off) select count(*) from test where (x = 1 or x = 2 or x = 3 or x = 4 or x = 5 or x = 6 or x = 7 or x = 8 or x = 9 ) \watch i=0.1 c=10 35.376 ms The time is the last result of the 10 iterations. The reason here is parallel workers. If you look into the plans, you will find parallel workers without the optimization and none with it: Gather (cost=1000.00..28685.37 rows=87037 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: ((x = 1) OR (x = 2) OR (x = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9)) Seq Scan on test (cost=0.02..20440.02 rows=90600 width=12) (actual rows=90363 loops=1) Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) With 90363 tuples actually returned, we estimate 87037 tuples (less precisely) without the transformation and 90600 (more precisely) with it. But if you play with parallel_tuple_cost and parallel_setup_cost, you will end up having these parallel workers: Gather (cost=0.12..11691.03 rows=90600 width=12) (actual rows=90363 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on test Filter: (x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[])) Rows Removed by Filter: 303212 And there is some profit, about 25%, on my laptop (a way to reproduce this is sketched below). I'm not sure about the origins of such behavior, but it seems to be an issue of parallel workers, not this specific optimization. -- regards, Andrei Lepikhov Postgres Professional
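The experiment described above can be reproduced like this (my own sketch; the 'test' table is the same hypothetical one used in the quoted example):

-- make parallel plans artificially cheap so the SAOP scan gets workers too
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM test
WHERE x = ANY ('{1,2,3,4,5,6,7,8,9}'::integer[]);
RESET parallel_setup_cost;
RESET parallel_tuple_cost;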
Re: POC, WIP: OR-clause support for indexes
On 12/2/2024 15:55, Alena Rybakina wrote: Hi! I can't understand this part of the code: /* Time to generate index paths */ MemSet(&clauseset, 0, sizeof(clauseset)); match_clauses_to_index(root, list_make1(rinfo1), index, &clauseset); As I understand it, match_clauses_to_index is necessary if you have a RestrictInfo (rinfo1) variable, so maybe we should run it after the make_restrictinfo procedure; otherwise calling it, I think, is useless. I think you must explain your note in more detail. Before this call, we already called make_restrictinfo() and built rinfo1, haven't we? -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
Thanks for the review! It was the first version for discussion. Of course, refactoring and polishing cycles will be needed; even so, we can discuss the general idea now. On 10/2/2024 12:00, jian he wrote: On Thu, Feb 8, 2024 at 1:34 PM Andrei Lepikhov wrote: 1235 | PredicatesData *pd; Thanks + if (!predicate_implied_by(index->indpred, list_make1(rinfo1), true)) + elog(ERROR, "Logical mistake in OR <-> ANY transformation code"); the error message seems not clear? Yeah, I have rewritten it. static List * build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, List *other_clauses) I am not sure what `other_clauses` and `rinfo` refer to; adding some comments would be great. struct PredicatesData needs some comments, I think. Added, though not so much. +bool +saop_covered_by_predicates(ScalarArrayOpExpr *saop, List *predicate_lists) +{ + ListCell *lc; + PredIterInfoData clause_info; + bool result = false; + bool isConstArray; + + Assert(IsA(saop, ScalarArrayOpExpr)); is this Assert necessary? Not sure. I moved it into another routine. For the function build_paths_for_SAOP, I think I understand the first part of the code. But I am not 100% sure of the second part of the `foreach(lc, predicate_lists)` code. more comments in `foreach(lc, predicate_lists)` would be helpful. Done do you need to add `PredicatesData` to src/tools/pgindent/typedefs.list? Done I also did some minor refactoring of generate_saop_pathlist. Partially agree. instead of letting it go to `foreach (lc, entries)`, we can reject the Const array at `foreach(lc, expr->args)` Yeah, I just think we can go further and transform two const arrays into a new one if we have the same clause and operator. In that case, we should allow it to pass through this cycle down to the classification stage. also `foreach(lc, expr->args)` do we need to reject cases like `contain_subplans((Node *) nconst_expr)`? maybe requiring nconst_expr to be a Var node would be far easier. It's contradictory: on the one hand, we simplify the comparison procedure and can avoid expression jumbling at all; on the other hand, we restrict the feature. IMO, it would be better to unite clauses like complex_clause1 IN (..) OR complex_clause1 IN (..) into complex_clause1 IN (.., ..) and not do the duplicated work of computing complex clauses twice. In the attachment is the next version of the second patch. -- regards, Andrei Lepikhov Postgres Professional From c1e5fc35778acd01cca67e63fdf80c5605af03f2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 24 Jan 2024 14:07:17 +0700 Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths over SAOP clauses. Likewise OR clauses, discover SAOP array and try to split its elements between smaller sized arrays to fit a set of partial indexes. 
--- src/backend/optimizer/path/indxpath.c | 304 ++ src/backend/optimizer/util/predtest.c | 46 src/backend/optimizer/util/restrictinfo.c | 13 + src/include/optimizer/optimizer.h | 16 ++ src/include/optimizer/restrictinfo.h | 1 + src/test/regress/expected/select.out | 282 src/test/regress/sql/select.sql | 82 ++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 692 insertions(+), 53 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 32c6a8bbdc..5383cb76dc 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,6 +32,7 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" @@ -1220,11 +1221,177 @@ build_paths_for_OR(PlannerInfo *root, RelOptInfo *rel, return result; } +/* + * Building index paths over SAOP clause differs from the logic of OR clauses. + * Here we iterate across all the array elements and split them to SAOPs, + * corresponding to different indexes. We must match each element to an index. + */ +static List * +build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, +List *other_clauses) +{ + List *result = NIL; + List *predicate_lists = NIL; + ListCell *lc; + PredicatesData *pd; + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause; + + Assert(IsA(saop, ScalarArrayOpExpr) && saop->useOr); + + if (!IsA(lsecond(saop->args), Const)) + /* +* Has it practical outcome to merge arrays which couldn't constantified +* before that step? +*/ + return NIL; + + /* Collect predicates */ + foreach(lc, rel-&g
Re: pg_stat_advisor extension
On 6/2/2024 22:27, Ilia Evdokimov wrote: I welcome your insights, feedback, and evaluations regarding the necessity of integrating this new extension into PostgreSQL. Besides other issues already raised during initial review of the extension, let me emphasize two: 1. In the case of parallel workers, the plan_rows value has different semantics from the predicted number of rows: just explore get_parallel_divisor() (the relationship is sketched below). 2. The extension recommends new statistics immediately upon finding an estimation error. But what if the reason for the error is stale statistics? Or the error may be raised for only one specific set of constants, while estimation is done well in the other 99.9% of cases for the same expression. Regarding point 2, it might make sense to collect and track the clause combinations and cardinality errors found, and let the DBA make decisions on their own. -- regards, Andrei Lepikhov Postgres Professional
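For reference, the relationship get_parallel_divisor() implements, as I read costsize.c (paraphrased, assuming the default parallel_leader_participation = on): a partial path's plan_rows is a per-worker estimate,

\[ \mathrm{plan\_rows} \;=\; \frac{\text{estimated total rows}}{\,w + \max(0,\; 1 - 0.3\,w)\,}, \]

where $w$ is the number of planned workers. So an advisor comparing plan_rows against the actual row count must first multiply it back by this divisor.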
Re: POC, WIP: OR-clause support for indexes
On 3/2/2024 02:06, Alena Rybakina wrote: On 01.02.2024 08:00, jian he wrote: I added your code to the patch. Thanks Alena and Jian for the detailed scrutiny! A couple of questions: 1. As I see, transformAExprIn uses the same logic as we invented but allows composite and domain types. Could you add a comment explaining why we forbid row types in general, in contrast to the transformAExprIn routine? 2. Could you provide the tests to check issues covered by the recent (in v.15) changes? Patch 0001-* in the attachment incorporates changes induced by Jian's notes from [1]. Patch 0002-* contains a transformation of the SAOP clause, which allows the optimizer to utilize partial indexes if they cover all values in this array. Also, it is an answer to Alexander's note [2] on performance degradation. This first version may be a bit raw, but I need your opinion: does it resolve the issue? Skimming through the thread, I see that, in general, all issues have been covered for now. Only Robert's note on the lack of documentation still needs to be resolved. [1] https://www.postgresql.org/message-id/CACJufxGXhJ823cdAdp2Ho7qC-HZ3_-dtdj-myaAi_u9RQLn45g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional From 0ac511ab94959e87d1525d5ea8c4855643a6ccbe Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Fri, 2 Feb 2024 22:01:09 +0300 Subject: [PATCH 1/2] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) on the preliminary stage of optimization when we are still working with the tree expression. Here C is a constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. 
Authors: Alena Rybakina , Andrey Lepikhov Reviewed-by: Peter Geoghegan , Ranier Vilela Reviewed-by: Alexander Korotkov , Robert Haas Reviewed-by: jian he --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 27 ++ src/backend/parser/parse_expr.c | 327 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/optimizer/optimizer.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 ++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 876 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index b5a38aeb21..a07aefc9e5 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3105,7 +3105,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5
Re: POC: GROUP BY optimization
On 2/2/2024 11:06, Richard Guo wrote: On Fri, Feb 2, 2024 at 11:32 AM Richard Guo wrote: On Fri, Feb 2, 2024 at 10:02 AM Tom Lane wrote: One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-01%2021%3A28%3A04 and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, it's not a good practice to use pg_class in this place. While looking through the test cases added by this commit, I noticed some other minor issues that are not great. Such as * The table 'btg' is inserted with 10000 tuples, which seems a bit expensive for a test. I don't think we need such a big table to test what we want. * I don't see why we need to manipulate the GUCs max_parallel_workers and max_parallel_workers_per_gather. * I think we'd better write the tests with the keywords being all upper or all lower. A mixed use of upper and lower is not great. Such as in explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w; * Some comments for the test queries are not easy to read. * For this statement CREATE INDEX idx_y_x_z ON btg(y,x,w); I think the index name would cause confusion. It creates an index on columns y, x and w, but the name indicates an index on y, x and z. I'd like to write a draft patch for the fixes. Here is the draft patch that fixes the issues I complained about upthread. I looked through the patch. It looks like it doesn't break anything. Why do you prefer to use count(*) in EXPLAIN instead of a plain targetlist, like "SELECT x,y,..."? Also, according to the test mentioned by Tom: 1. I see that PG uses an IndexScan on (x,y), so column x will already be sorted before the MergeJoin. Why not use an Incremental Sort on (x,z,w) instead of a full sort? 2. For the record: IMO, this test shows the near-future potential of this feature: it is cheaper to use grouping order (w,z) instead of (z,w). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 2/2/2024 09:02, Tom Lane wrote: Alexander Korotkov writes: I'm going to push this if there are no objections. One of the test cases added by this commit has not been very stable in the buildfarm. Latest example is here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-01%2021%3A28%3A04 and I've seen similar failures intermittently on other machines. I'd suggest building this test atop a table that is more stable than pg_class. You're just waving a red flag in front of a bull if you expect stable statistics from that during a regression run. Nor do I see any particular reason for pg_class to be especially suited to the test. Yeah, It is my fault. Please, see in the attachment the patch fixing that. -- regards, Andrei Lepikhov Postgres Professional From 11a049d95ee48e38ad569aab7663d8de91f946ad Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Feb 2024 10:39:55 +0700 Subject: [PATCH] Replace the GROUP-BY optimization test with the same based on something less volatile when the pg_class relation. --- src/test/regress/expected/aggregates.out | 32 +++- src/test/regress/sql/aggregates.sql | 9 +++ 2 files changed, 18 insertions(+), 23 deletions(-) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 7a73c19314..c2e1b8c9ed 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2873,7 +2873,6 @@ SELECT y,x,array_agg(distinct w) FROM btg WHERE y < 0 GROUP BY x,y; (6 rows) RESET enable_incremental_sort; -DROP TABLE btg; -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -2901,32 +2900,31 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; - QUERY PLAN -- +SELECT b1.x,b1.w FROM btg b1 JOIN btg b2 ON (b1.z=b2.z AND b1.w=b2.w) +GROUP BY b1.x,b1.z,b1.w ORDER BY b1.z, b1.w, b1.x*b1.x; +QUERY PLAN +--- Incremental Sort - Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, ((b1.x * b1.x)) + Presorted Key: b1.z, b1.w -> Group - Group Key: c1.relpages, c1.relname, c1.reltuples + Group Key: b1.z, b1.w, b1.x -> Incremental Sort - Sort Key: c1.relpages, c1.relname, c1.reltuples - Presorted Key: c1.relpages, c1.relname + Sort Key: b1.z, b1.w, b1.x + Presorted Key: b1.z, b1.w -> Merge Join - Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + Merge Cond: ((b1.z = b2.z) AND (b1.w = b2.w)) -> Sort - Sort Key: c1.relpages, c1.relname - -> Seq Scan on pg_class c1 + Sort Key: b1.z, b1.w + -> Seq Scan on btg b1 -> Sort - Sort Key: c2.relpages, c2.relname - -> Seq Scan on pg_class c2 + Sort Key: b2.z, b2.w + -> Seq Scan on btg b2 (16 rows) RESET enable_hashjoin; RESET enable_nestloop; +DROP TABLE btg; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 916dbf908f..3548fbb8db 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1229,8 +1229,6 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT y,x,array_agg(distinct w) FROM 
btg WHERE y < 0 GROUP BY x,y; RESET enable_incremental_sort; -DROP TABLE btg; - -- The case, when scanning sort order correspond to aggregate sort order but -- can not be found in the group-by list CREATE TABLE agg_sort_order (c1 int PRIMARY KEY, c2 int); @@ -1245,13 +1243,12 @@ DROP TABLE agg_sort_order CASCADE; SET enable_hashjoin = off; SET enable_nestloop = off; explain (COSTS OFF) -SELECT c1.relname,c1.relpages -FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) -GROUP BY c1.reltuples,c1.relpages,c1.relname -ORDER BY c1.relpages, c1.relname, c1.relpages*c1.r
Re: POC: GROUP BY optimization
Just forgotten second attachment. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 1095b73dac..b612420547 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -432,6 +432,21 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, return n; } +static bool +duplicated_pathkey_combination(List *infos, List *pathkeys) +{ + ListCell *lc; + + foreach (lc, infos) + { + PathKeyInfo *info = lfirst_node(PathKeyInfo, lc); + + if (compare_pathkeys(pathkeys, info->pathkeys) == PATHKEYS_EQUAL) + return true; + } + return false; +} + /* * get_useful_group_keys_orderings * Determine which orderings of GROUP BY keys are potentially interesting. @@ -491,7 +506,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) if (n > 0 && (enable_incremental_sort || n == root->num_groupby_pathkeys) && - compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; @@ -514,8 +529,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(root->sort_pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == list_length(root->sort_pathkeys)) && + !duplicated_pathkey_combination(infos, pathkeys)) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys;
Re: POC: GROUP BY optimization
On 16/1/2024 22:05, Alexander Korotkov wrote: On Tue, Jan 16, 2024 at 4:48 AM Richard Guo wrote: * When trying to match the ordering of GROUP BY to that of ORDER BY in get_useful_group_keys_orderings, this patch checks against the length of path->pathkeys. This does not make sense. I guess this is a typo and the real intention is to check against root->sort_pathkeys. Thanks! It is really my blunder; a fresh look helps. --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) root->num_groupby_pathkeys); if (n > 0 && - (enable_incremental_sort || n == list_length(path->pathkeys))) + (enable_incremental_sort || n == list_length(root->sort_pathkeys))) However, I think this is also incorrect. When incremental sort is disabled, it is reasonable to consider reordering the GROUP BY keys only if the number of matching pathkeys is equal to the length of root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'. Hmm... Why should this be list_length(root->group_pathkeys) while we're targeting root->sort_pathkeys? I have changed that to list_length(root->sort_pathkeys). I think, in the first case, when we are trying to arrange group-by keys according to the underlying pathkeys with incremental sort = off, it makes sense to do so only if we can fetch all group-by keys, regardless of whether the incoming path contains more pathkeys or an equal number. The code and test case are in step1.txt. * When the original ordering of GROUP BY keys matches the path's pathkeys or ORDER BY keys, get_useful_group_keys_orderings would generate duplicate PathKeyInfos. Consequently, this duplication would lead to the creation and addition of identical paths multiple times. This is not great. Consider the query below: create table t (a int, b int); create index on t (a, b); set enable_hashagg to off; explain select count(*) from t group by a, b; QUERY PLAN -- GroupAggregate (cost=0.15..97.27 rows=226 width=16) Group Key: a, b -> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8) (3 rows) The same path with group keys {a, b} is created and added twice. I tried to address that. * Part of the work performed in this patch overlaps with that of preprocess_groupclause. They are both trying to adjust the ordering of the GROUP BY keys to match ORDER BY. I wonder if it would be better to perform this work only once. Andrei, could you take a look? As I see, the PathKeyInfo list often contains duplicated pathkeys, coming from either the sort_pathkeys or the path->pathkeys orderings. So, I propose to check for duplicates each time (see step2.txt in the attachment). * When reordering the GROUP BY keys, I think the following checks need some improvements. + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. +*/ + sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, + *group_clauses); + if (!sgc) + /* The grouping clause does not cover this pathkey */ + break; I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is valid before calling get_sortgroupref_clause_noerr with it. Also, how can we guarantee that the returned GROUP BY item is sortable? Should we check that with OidIsValid(sgc->sortop)? Added. Reviewed it, looks good. 
-- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 919d54dd79..1095b73dac 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -489,8 +489,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) n = group_keys_reorder_by_pathkeys(path->pathkeys, , , root->num_groupby_pathkeys); - if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL && - (enable_incremental_sort || n == list_length(path->pathkeys))) + if (n > 0 && + (enable_incremental_sort || n == root->num_groupby_pathkeys) && + compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL) { info = makeNode(PathKeyInfo); info->pathkeys = pathkeys; diff --git a/src/test/regress/expected/ag
Re: POC: GROUP BY optimization
On 15/1/2024 13:42, Richard Guo wrote: On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov wrote: Thank you for providing the test case relevant for this code change. The revised patch incorporating this change is attached. Now the patchset looks good to me. I'm going to push it if there are no objections. Seems I'm late for the party. Can we hold for several more days? I'd like to have a review on this patch. Get on board! It looks like this feature needs as much review as possible (like SJE). -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 13/1/2024 22:00, Alexander Korotkov wrote: On Sat, Jan 13, 2024 at 11:09 AM Andrei Lepikhov wrote: On 11/1/2024 18:30, Alexander Korotkov wrote: On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. Makes sense. I've incorporated your changes into the patchset. One more improvement. To underpin the code change: - return cur_ec; /* Match! */ + { + /* +* Match! +* +* Copy the sortref if it wasn't set yet. That may happen if +* the ec was constructed from WHERE clause, i.e. it doesn't +* have a target reference at all. +*/ + if (cur_ec->ec_sortref == 0 && sortref > 0) + cur_ec->ec_sortref = sortref; + return cur_ec; + } I propose the test (see attachment). It shows why we introduce this change: GROUP-BY should juggle not only pathkeys generated by explicit sort operators but also planner-made ones, as in this example, produced by a MergeJoin. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 0d46e096e5..ca38e78f21 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2879,6 +2879,37 @@ FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; (10 rows) DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. +SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; + QUERY PLAN +- + Incremental Sort + Sort Key: c1.relpages, c1.relname, ((c1.relpages * c1.relpages)) + Presorted Key: c1.relpages, c1.relname + -> Group + Group Key: c1.relpages, c1.relname, c1.reltuples + -> Incremental Sort + Sort Key: c1.relpages, c1.relname, c1.reltuples + Presorted Key: c1.relpages, c1.relname + -> Merge Join + Merge Cond: ((c1.relpages = c2.relpages) AND (c1.relname = c2.relname)) + -> Sort + Sort Key: c1.relpages, c1.relname + -> Seq Scan on pg_class c1 + -> Sort + Sort Key: c2.relpages, c2.relname + -> Seq Scan on pg_class c2 +(16 rows) + +RESET enable_hashjoin; +RESET enable_nestloop; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index f99167ac9e..cf87b5d5dd 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1233,6 +1233,18 @@ SELECT array_agg(c1 ORDER BY c2),c2 FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; DROP TABLE t1 CASCADE; +-- Check, that GROUP-BY reordering optimization can operate with pathkeys, built +-- by planner itself. For example, by MergeJoin. 
+SET enable_hashjoin = off; +SET enable_nestloop = off; +explain (COSTS OFF) +SELECT c1.relname,c1.relpages +FROM pg_class c1 JOIN pg_class c2 ON (c1.relname=c2.relname AND c1.relpages=c2.relpages) +GROUP BY c1.reltuples,c1.relpages,c1.relname +ORDER BY c1.relpages, c1.relname, c1.relpages*c1.relpages; +RESET enable_hashjoin; +RESET enable_nestloop; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: POC: GROUP BY optimization
On 11/1/2024 18:30, Alexander Korotkov wrote: Hi! On Tue, Jan 9, 2024 at 1:14 PM Pavel Borisov wrote: Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. AFAIK, cfbot does not seek old versions of patchset parts in previous messages. So for it to run correctly, a new version of the whole patchset should be sent even if most patches are unchanged. Please, find the revised patchset with some refactoring and comments improvement from me. I'll continue looking into this. The patch looks better, thanks to your refactoring. I propose additional comments and tests to make the code more understandable (see attachment). I intended to look into this part of the code more, but the tests show a difference between PostgreSQL v.15 and v.16, which causes changes in the code of this feature. -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index f4b7dcac21..5aac6d6677 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -397,6 +397,11 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys, !list_member_ptr(*group_pathkeys, pathkey)) break; + /* +* Since 1349d27 pathkey coming from underlying node can be in the +* root->group_pathkeys but not in the processed_groupClause. So, we +* should be careful here. +*/ sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref, *group_clauses); if (!sgc) diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 423c8ec3b6..0d46e096e5 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -2857,6 +2857,28 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; (8 rows) DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; + QUERY PLAN + + Sort + Sort Key: c2 + -> GroupAggregate + Group Key: c1 + -> Sort + Sort Key: c1, c2 + -> Bitmap Heap Scan on t1 + Recheck Cond: (c2 < 100) + -> Bitmap Index Scan on t1_c2_idx + Index Cond: (c2 < 100) +(10 rows) + +DROP TABLE t1 CASCADE; RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather; diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index b9fcceedd7..f99167ac9e 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -1224,6 +1224,15 @@ explain (COSTS OFF) SELECT x,w FROM btg GROUP BY w,x,y,z ORDER BY x*x,z; DROP TABLE btg; +-- The case, when scanning sort order correspond to aggregate sort order but +-- can not be found in the group-by list +CREATE TABLE t1 (c1 int PRIMARY KEY, c2 int); +CREATE UNIQUE INDEX ON t1(c2); +explain (costs off) +SELECT array_agg(c1 ORDER BY c2),c2 +FROM t1 WHERE c2 < 100 GROUP BY c1 ORDER BY 2; +DROP TABLE t1 CASCADE; + RESET enable_hashagg; RESET max_parallel_workers; RESET max_parallel_workers_per_gather;
Re: introduce dynamic shared memory registry
On 9/1/2024 00:16, Nathan Bossart wrote: On Mon, Jan 08, 2024 at 10:53:17AM +0530, Bharath Rupireddy wrote: 1. I think we need to add some notes about this new way of getting shared memory for external modules in the Shared Memory and LWLocks section in xfunc.sgml? This will at least tell there's another way for external modules to get shared memory, not just with the shmem_request_hook and shmem_startup_hook. What do you think? +1. Maybe even more: in the section related to extensions, this approach to using shared data could be mentioned, too. 2. FWIW, I'd like to call this whole feature "Support for named DSM segments in Postgres". Do you see anything wrong with this? Why do you feel it should be renamed? I don't see anything wrong with it, but I also don't see any particular advantage with that name compared to "dynamic shared memory registry." It is not a big issue, I suppose. But for me personally (as a non-native English speaker), the label "Named DSM segments" seems more straightforward to understand. 3. IIUC, this feature eventually makes both shmem_request_hook and shmem_startup_hook pointless, no? Or put another way, what's the significance of shmem request and startup hooks in lieu of this new feature? I think it's quite possible to get rid of the shmem request and startup hooks (of course, not now but at some point in the future, to not break the external modules), because all the external modules can allocate and initialize the same shared memory via dsm_registry_init_or_attach and its init_callback. All the external modules will then need to call dsm_registry_init_or_attach in their _PG_init callbacks and/or in their bg worker's main functions in case the modules intend to start up bg workers. Am I right? Well, modules might need to do a number of other things (e.g., adding hooks) that can presently only be done when preloaded, in which case I doubt there's much benefit from switching to the DSM registry. I don't really intend for it to replace the existing request/startup hooks, but you're probably right that most, if not all, could use the registry instead. IMHO this is well beyond the scope of this thread, though. +1, there may be many reasons to use these hooks. >> 3. Use NAMEDATALEN instead of 64? >> + char key[64]; > I kept this the same, as I didn't see any need to tie the key size to > NAMEDATALEN. IMO, we should avoid magic numbers whenever possible. The current logic, according to which the first users of this feature will be extensions, naturally bonds this size to the size of the 'name' type. And one more point: I think the commit already deserves a more detailed commit message. -- regards, Andrei Lepikhov Postgres Professional
Re: Custom explain options
On 10/1/2024 20:27, Konstantin Knizhnik wrote: On 10/01/2024 8:46 am, Michael Paquier wrote: On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote: What do you think about this really useful feature? Do you wish to develop it further? I am biased here. This seems like a lot of code for something we've been delegating to the explain hook for ages. Even if I can see the appeal of pushing that more into explain.c to get more data on a per-node basis depending on the custom options given by the caller of an EXPLAIN entry point, I cannot get really excited about the extra maintenance this facility would involve compared to the potential gains, knowing that there's a hook. -- Michael Well, I am not sure that the proposed patch is flexible enough to handle all possible scenarios. I just wanted to make it as simple as possible to leave some chance for it to be merged. But it is easy to answer the question of why the existing explain hook is not enough: 1. It doesn't allow adding extra options to EXPLAIN. My intention was to be able to do something like "explain (analyze,buffers,prefetch) ..." (see the sketch below). That is completely impossible with the explain hook. I agree. Designing mostly planner-related extensions, I also wanted to add some information to the explain output of nodes. For example, pg_query_state could add the status of the node at the time execution was interrupted: started, stopped, or loop closed. Maybe we should gather some statistics on how developers of extensions deal with that issue ... -- regards, Andrei Lepikhov Postgres Professional
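For the record, here is the shape of the syntax Konstantin is after (the option name 'prefetch' is his example, not an existing option):

-- what the patch would let an extension register:
EXPLAIN (ANALYZE, BUFFERS, prefetch) SELECT * FROM t;
-- stock PostgreSQL rejects this:
-- ERROR:  unrecognized EXPLAIN option "prefetch"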
Re: Multidimensional Histograms
On 8/1/2024 16:21, Alexander Cheshev wrote: Hi Andrei, Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. An Equi-Depth Histogram in the 6-dimensional case requires 6 times more iterations. Postgres already uses the Equi-Depth Histogram. Even if you increase the number of buckets from 100 to 1000, there will be no overhead, as the time complexity of the Equi-Depth Histogram does not depend on the number of buckets. So, no overhead at all! Maybe. For three columns, we have 9 combinations (passes) for building dependency statistics and 4 combinations for ndistincts; for six columns, we have 186 and 57 combinations respectively (the combinatorics are sketched below). Even remembering that a dependency is just one number for one combination, building the dependency statistics is still massive work. So, in the multicolumn case, having something like a histogram may be more effective. -- regards, Andrei Lepikhov Postgres Professional
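A quick check of those combination counts (my own arithmetic, not from the thread): for $n$ columns, the dependencies statistic considers every (determinant set, dependent column) pair, and ndistinct considers every subset of at least two columns:

\[ N_{\mathrm{dep}}(n) = \sum_{k=1}^{n-1} \binom{n}{k}\,(n-k), \qquad N_{\mathrm{ndistinct}}(n) = 2^{n} - n - 1. \]

This gives $N_{\mathrm{dep}}(3) = 9$ and $N_{\mathrm{ndistinct}}(3) = 4$, versus $N_{\mathrm{dep}}(6) = 186$ and $N_{\mathrm{ndistinct}}(6) = 57$, matching the numbers above.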
Re: Custom explain options
On 30/11/2023 22:40, Konstantin Knizhnik wrote: In all these cases we are using an array of `Instrumentation`, and if it contains a varying part, then it is not clear where to place it. Yes, there is also code which serializes and sends instrumentation between worker processes, and I have updated this code in my PR to send the actual amount of custom instrumentation data. But it cannot help with the cases above. What do you think about this really useful feature? Do you wish to develop it further? -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 9/1/2024 16:45, vignesh C wrote: On Tue, 9 Jan 2024 at 14:31, Andrei Lepikhov wrote: Here is a new version of GROUP-BY optimization without sort model. On 21/12/2023 17:53, Alexander Korotkov wrote: I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Done. See patch 0001-*. Unfortunately, extraction of whole cycle isn't practical, because it blows out the interface of the routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially patching every index. This should give us an understanding on whether we need a separate GUC to control this feature. In current implementation I don't anticipate any significant overhead. GUC is needed here to allow users adhere their own ordering and to disable feature in the case of problems. 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only full match of group-by output to the required order. Hm, is it really make sense in current implementation? CFBot shows the following errors at [1] with: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: In function ‘estimate_num_groups’: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: warning: implicit declaration of function ‘estimate_num_groups_incremental’ [-Wimplicit-function-declaration] [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c: At top level: [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: warning: no previous prototype for ‘estimate_num_groups_incremental’ [-Wmissing-prototypes] [08:33:28.813] 3400 | estimate_num_groups_incremental(PlannerInfo *root, List *groupExprs, [08:33:28.813] | ^~~ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3400:1: error: conflicting types for ‘estimate_num_groups_incremental’ [08:33:28.813] ../src/backend/utils/adt/selfuncs.c:3389:9: note: previous implicit declaration of ‘estimate_num_groups_incremental’ was here [08:33:28.813] 3389 | return estimate_num_groups_incremental(root, groupExprs, Hmm, I don't see this old code in these patches. Resend 0002-* because of trailing spaces. -- regards, Andrei Lepikhov Postgres Professional From 45cfff5731b81c0df2af00f5e4212fc598f6a231 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 9 Jan 2024 12:32:15 +0700 Subject: [PATCH 2/2] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or sort order, which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative ordering of the keys, trying to find a cheaper one. 
The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. --- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 222 ++ src/backend/optimizer/plan/planner.c | 214 +++-- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/paths.h | 3 + src/test/regress/expected/aggregates.out | 132 +++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/aggregates.sql | 47 10 files changed, 586 insertions(+), 69 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index e86dfeaecd..7dd14d0a43 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652
Re: POC: GROUP BY optimization
Here is a new version of the GROUP-BY optimization without the sort model. On 21/12/2023 17:53, Alexander Korotkov wrote: I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Done. See patch 0001-*. Unfortunately, extraction of the whole cycle isn't practical, because it would bloat the interface of the routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially matching every index. This should give us an understanding on whether we need a separate GUC to control this feature. In the current implementation, I don't anticipate any significant overhead. The GUC is needed here to let users keep their own ordering and to disable the feature in case of problems. 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only full match of group-by output to the required order. Hm, does it really make sense in the current implementation? -- regards, Andrei Lepikhov Postgres Professional From ea068e221a754a463142575f6e0360d3118475f8 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 9 Jan 2024 12:00:27 +0700 Subject: [PATCH 1/2] Generalize common code of adding sort before generation of grouping paths. --- src/backend/optimizer/plan/planner.c | 209 --- 1 file changed, 61 insertions(+), 148 deletions(-) diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 667723b675..fcf647940e 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -6809,6 +6809,52 @@ done: return parallel_workers; } +static Path * +try_presorting(PlannerInfo *root, RelOptInfo *grouped_rel, + Path *path, Path *cheapest_path) +{ + bool is_sorted; + int presorted_keys; + + is_sorted = pathkeys_count_contained_in(root->group_pathkeys, + path->pathkeys, + &presorted_keys); + + if (!is_sorted) + { + /* +* Try at least sorting the cheapest path and also try +* incrementally sorting any path which is partially sorted +* already (no need to deal with paths which have presorted +* keys when incremental sort is disabled unless it's the +* cheapest input path). +*/ + if (path != cheapest_path && + (presorted_keys == 0 || !enable_incremental_sort)) + return NULL; + + /* +* We've no need to consider both a sort and incremental sort. +* We'll just do a sort if there are no presorted keys and an +* incremental sort when there are presorted keys. +*/ + if (presorted_keys == 0 || !enable_incremental_sort) + path = (Path *) create_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + -1.0); + else + path = (Path *) create_incremental_sort_path(root, + grouped_rel, + path, + root->group_pathkeys, + presorted_keys, + -1.0); + } + return path; +} + /* * add_paths_to_grouping_rel * @@ -6840,45 +6886,11 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel, foreach(lc, input_rel->pathlist) {
Re: Multidimensional Histograms
On 8/1/2024 01:36, Tomas Vondra wrote: On 1/7/24 18:26, Andrei Lepikhov wrote: On 7/1/2024 17:51, Tomas Vondra wrote: On 1/7/24 11:22, Andrei Lepikhov wrote: On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram, or just a histogram, which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics machinery to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats No idea, I haven't thought about that very much. Presumably the existing hooks are insufficient because they're per-attnum? I guess it would make sense to have a hook for all the attnums of the relation, but I'm not sure it'd be enough to introduce a new extended statistics kind ... I got stuck on the same problem Alexander mentioned: we usually have large tables with many uniformly distributed values. In this case, MCV doesn't help a lot. Usually, I face problems scanning a table with a filter containing 3-6 ANDed quals. Here, Postgres multiplies selectivities and ends up with a selectivity of less than one tuple. But such scans, in reality, mostly have some physical sense and return a bunch of tuples. It looks like the set of columns represents some value of a composite type. Understood. That's a fairly common scenario, and it can easily end up with a rather terrible plan due to the underestimate. Sometimes extended statistics on dependencies help well, but they are expensive for multiple columns. Expensive in what sense? Also, how would a multidimensional histogram be any cheaper? Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. And sometimes I see that even a trivial histogram on a ROW(x1,x2,...) could predict a much more adequate value (a kind of conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND ..." if somewhere in an extension we transformed it to ROW(x1,x2,...) = ROW(N1, N2,...). Are you guessing the histogram would help, or do you know it'd help? I have no problem believing that for range queries, but I think it's far less clear for simple equalities. I'm not sure histograms would work for that ... I added Teodor Sigaev to the CC of this email; he has much more user feedback on this technique. As I see it, having a histogram over a set of columns gives us an upper-bound selectivity estimate for the filter. I'm not sure how good a simple histogram is in that case either, but in my practice it works, preventing overly optimistic query plans. 
Maybe it'd be possible to track more stuff for each bucket, not just the frequency, but also ndistinct for combinations of columns, and then use that to do better equality estimates. Or maybe we could see how the "expected" and "actual" bucket frequency compare, and use that to correct the equality estimate? Not sure. Yes, that is on my mind, too. Having such experimental stuff as extensions on GitHub, we could get some community feedback. But perhaps you have some data to demonstrate it'd actually help? That should be redirected to Teodor, but I will consider translating messy real-life reports into a clear example. For such cases we don't have an in-core solution, and introducing a hook on clause list estimation (paired with maybe a hook on statistics generation) could help invent an extension that would deal with that problem. Also, it would open a way for experiments with different types of extended statistics ... I really don't know how to do that, or what it would take. AFAICS the statistics system is not very extensible from external code. Even if we could add new types of attribute/extended stats, I'm not sure the code calculating the estimates could use that. I imagine we can add an analysis routine and directly store statistics in an extension for demonstration and experimental purposes, no problem. -- regards, Andrei Lepikhov Postgres Professional
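For reference, the underestimation pattern discussed above is easy to reproduce; a minimal sketch with perfectly correlated columns (data invented for illustration):

CREATE TABLE points (x1 int, x2 int, x3 int);
INSERT INTO points SELECT i % 100, i % 100, i % 100 FROM generate_series(1, 100000) AS i;
ANALYZE points;
-- Each qual alone has selectivity ~1/100. Multiplying them gives
-- 100000 * (1/100)^3 = 0.1 rows, clamped to 1, while the scan actually
-- returns 1000 tuples.
EXPLAIN ANALYZE SELECT * FROM points WHERE x1 = 0 AND x2 = 0 AND x3 = 0;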
Re: Multidimensional Histograms
On 7/1/2024 17:51, Tomas Vondra wrote: On 1/7/24 11:22, Andrei Lepikhov wrote: On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram or just a histogram which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats No idea, I haven't thought about that very much. Presumably the existing hooks are insufficient because they're per-attnum? I guess it would make sense to have a hook for all the attnums of the relation, but I'm not sure it'd be enough to introduce a new extended statistics kind ... I got stuck on the same problem Alexander mentioned: we usually have large tables with many uniformly distributed values. In this case, MCV doesn't help a lot. Usually, I face problems scanning a table with a filter containing 3-6 ANDed quals. Here, Postgres multiplies selectivities and ends up with a selectivity of less than one tuple. But such scans, in reality, mostly have some physical sense and return a bunch of tuples. It looks like the set of columns represents some value of a composite type. Sometimes extended statistics on dependencies help well, but they are expensive for multiple columns. And sometimes I see that even a trivial histogram on a ROW(x1,x2,...) could predict a much more adequate value (a kind of conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) = ROW(N1, N2,...). For such cases we don't have an in-core solution, and introducing a hook on clause list estimation (paired with maybe a hook on statistics generation) could help invent an extension that would deal with that problem. Also, it would open a way for experiments with different types of extended statistics ... -- regards, Andrei Lepikhov Postgres Professional
Re: Multidimensional Histograms
On 7/1/2024 06:54, Tomas Vondra wrote: It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine. But it's better to be aware it may not end with a commit. For the multi-dimensional case, I propose we first try to experiment with the various algorithms, and figure out what works etc. Maybe implementing them in python or something would be easier than C. Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from index definitions [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram or just a histogram which could help to perform estimation over a set of columns/expressions. I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns. In your opinion, is it possible to add a hook into the extended statistics to allow an extension to propose an alternative estimation? [1] https://github.com/danolivo/pg_index_stats -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 29/12/2023 12:00, Alexander Lakhin wrote: Hi Alexander, 23.10.2023 14:29, Alexander Korotkov wrote: Fixed all of the above. Thank you for catching this! I've discovered that starting from d3d55ce57 the following query: CREATE TABLE t(a int PRIMARY KEY); WITH tt AS (SELECT * FROM t) UPDATE t SET a = tt.a + 1 FROM tt WHERE tt.a = t.a RETURNING t.a; triggers an error "variable not found in subplan target lists". (Commits 8a8ed916f and b5fb6736e don't fix this, unfortunately.) Thanks for the report! The problem is with the resultRelation field: we forgot to replace the relid there. Could you check your issue with the patch in the attachment? Does it resolve this case? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 6c02fe8908..f79c673afd 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -1861,6 +1861,8 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, /* Replace varno in all the query structures */ query_tree_walker(root->parse, replace_varno_walker, &ctx, QTW_EXAMINE_SORTGROUP); + if (root->parse->resultRelation == toRemove->relid) + root->parse->resultRelation = toKeep->relid; /* Replace links in the planner info */ remove_rel_from_query(root, toRemove, toKeep->relid, NULL, NULL); @@ -1870,6 +1872,9 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, toRemove->relid, toKeep->relid); replace_varno((Node *) root->processed_groupClause, toRemove->relid, toKeep->relid); + replace_relid(root->all_result_relids, toRemove->relid, toKeep->relid); + replace_relid(root->leaf_result_relids, toRemove->relid, toKeep->relid); + /* * There may be references to the rel in root->fkey_list, but if so,
Re: POC: GROUP BY optimization
On 28/12/2023 18:29, Alexander Korotkov wrote: On Thu, Dec 28, 2023 at 10:22 AM Andrei Lepikhov wrote: But the arrangement with an ORDER BY clause doesn't work: DROP INDEX abc; explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w); I think the reason is that the sort_pathkeys and group_pathkeys are physically different structures, and we can't just compare pointers here. I haven't yet looked into the code. But this looks strange to me. Somehow, the optimizer currently matches index pathkeys to ORDER BY pathkeys. If GROUP BY pathkeys can be matched to index pathkeys, then it should be possible to match them to ORDER BY pathkeys too. Oh, I found the mistake: I am used to writing GROUP BY and ORDER BY on many columns with round brackets. In the case of the grouping list, it doesn't change anything. But ORDER BY treats such a list as a single ROW() expression and breaks the group-by arrangement. Look:

explain (COSTS OFF) SELECT relname,reltuples FROM pg_class GROUP BY relname,reltuples ORDER BY reltuples,relname;
 Group
   Group Key: reltuples, relname
   ->  Sort
         Sort Key: reltuples, relname
         ->  Seq Scan on pg_class

But:

explain (COSTS OFF) SELECT relname,reltuples FROM pg_class GROUP BY relname,reltuples ORDER BY (reltuples,relname);
 Sort
   Sort Key: (ROW(reltuples, relname))
   ->  Group
         Group Key: relname, reltuples
         ->  Sort
               Sort Key: relname, reltuples
               ->  Seq Scan on pg_class

So, let's continue to work. -- regards, Andrei Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 27/12/2023 12:07, Tom Lane wrote: Andrei Lepikhov writes: To be clear. In [1], I mentioned we can perform micro-benchmarks and structure the costs of operators. At least for fixed-length operators, it is relatively easy. I repeat what I said: this is a fool's errand. You will not get trustworthy results even for the cases you measured, let alone all the rest. I'd go as far as to say I would not believe your microbenchmarks, because they would only apply for one platform, compiler, backend build, phase of the moon, etc. Thanks for the explanation. I removed all cost-related code. It still needs to be finished; I will smooth the code further and rewrite the regression tests - many of them look silly without cost-dependent reorderings. Also, remember Alexander's remarks, which must be implemented, too. But it already works well here. Look:

Preliminaries:
CREATE TABLE t(x int, y int, z text, w int);
INSERT INTO t SELECT gs%100,gs%100, 'abc' || gs%10, gs FROM generate_series(1,1) AS gs;
CREATE INDEX abc ON t(x,y);
ANALYZE t;
SET enable_hashagg = 'off';

This patch eliminates an unneeded Sort operation:
explain SELECT x,y FROM t GROUP BY (x,y);
explain SELECT x,y FROM t GROUP BY (y,x);

Engages incremental sort:
explain SELECT x,y FROM t GROUP BY (x,y,z,w);
explain SELECT x,y FROM t GROUP BY (z,y,w,x);
explain SELECT x,y FROM t GROUP BY (w,z,x,y);
explain SELECT x,y FROM t GROUP BY (w,x,z,y);

Works with subqueries:
explain SELECT x,y FROM (SELECT * FROM t ORDER BY x,y,w,z) AS q1 GROUP BY (w,x,z,y);
explain SELECT x,y FROM (SELECT * FROM t ORDER BY x,y,w,z LIMIT 100) AS q1 GROUP BY (w,x,z,y);

But the arrangement with an ORDER BY clause doesn't work:
DROP INDEX abc;
explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w);

I think the reason is that the sort_pathkeys and group_pathkeys are physically different structures, and we can't just compare pointers here. -- regards, Andrei Lepikhov Postgres Professional
From 6183555c2c56d7cbbc44f213f6c5bc4cbcaef1ec Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Sep 2023 11:20:03 +0700 Subject: [PATCH] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or the sort order which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative orderings of the keys, trying to find a cheaper one. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. 
--- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 224 + src/backend/optimizer/plan/planner.c | 464 ++ src/backend/utils/adt/selfuncs.c | 38 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/paths.h | 3 + src/test/regress/expected/aggregates.out | 235 + .../regress/expected/incremental_sort.out | 20 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/aggregates.sql | 99 src/test/regress/sql/incremental_sort.sql | 2 +- 13 files changed, 912 insertions(+), 210 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 7fa502d6e2..07edd4f38e 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -652,7 +652,18 @@ get_eclass_for_sort_expr(PlannerInfo *root, if (opcintype == cur_em->em_datatype && equal(expr, cur_em->em_expr)) - return cur_ec; /* Match! */ + { + /* +* Match! +* +* Copy the sortref if it wasn't set yet. That may happen if +* the ec was constructed from WHERE clause, i.e. it doesn't +* have a
Re: POC: GROUP BY optimization
On 27/12/2023 11:15, Alexander Korotkov wrote: On Wed, Dec 27, 2023 at 5:23 AM Tom Lane wrote: Alexander Korotkov writes: 2) An accurate estimate of the sorting cost is quite a difficult task. Indeed. What if we make a simple rule of thumb that sorting integers and floats is cheaper than sorting numerics and strings with collation C, which in turn is cheaper than sorting collation-aware strings (probably more groups)? Within a group, we could keep the original order of items. I think it's a fool's errand to even try to separate different sort column orderings by cost. We simply do not have sufficiently accurate cost information. The previous patch in this thread got reverted because of that (well, also some implementation issues, but mostly that), and nothing has happened to make me think that another try will fare any better. To be clear: in [1], I mentioned we can perform micro-benchmarks and structure the costs of operators. At least for fixed-length operators, it is relatively easy. So, the main blocker here is an accurate prediction of ndistinct for different combinations of columns. Does it make sense to continue designing the feature in the direction of enabling the choice between different sort column orderings when we have extended statistics on the columns? [1] https://www.postgresql.org/message-id/e3602ccb-e643-2e79-ed2c-1175a8053...@postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
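As a reference point for that question, the ndistinct kind of extended statistics already stores exactly the per-combination numbers such a reordering would need; a small sketch (table and statistics names invented for illustration):

CREATE TABLE t (a int, b int, c int);
INSERT INTO t SELECT i % 2, i % 1000, i FROM generate_series(1, 100000) AS i;
CREATE STATISTICS t_nd (ndistinct) ON a, b, c FROM t;
ANALYZE t;
-- stxdndistinct holds ndistinct for {a,b}, {a,c}, {b,c} and {a,b,c}
SELECT d.stxdndistinct
FROM pg_statistic_ext e JOIN pg_statistic_ext_data d ON d.stxoid = e.oid
WHERE e.stxname = 't_nd';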
Re: POC: GROUP BY optimization
On 21/12/2023 17:53, Alexander Korotkov wrote: On Sun, Oct 1, 2023 at 11:45 AM Andrei Lepikhov wrote: New version of the patch. Fixed minor inconsistencies and rebased onto current master. Thank you (and other authors) for working on this subject. Indeed, GROUP BY clauses are order-agnostic. Reordering them into the most suitable order could yield significant query planning benefits. I went through the thread: I see significant work has already been done on this patch, and the code is quite polished. Maybe, but the issues mentioned in [1] are still not resolved. That is the only reason this thread hasn't been active. I'd like to make some notes. 1) As already mentioned, there is clearly a repetitive pattern for the code following after get_useful_group_keys_orderings() calls. I think it would be good to extract it into a separate function. Please, do this as a separate patch coming before the group-by patch. That would simplify the review. Yeah, these parts of the code are a bit different. I will try to make a common routine. 2) I wonder what planning overhead this patch could introduce? Could you try to measure the worst case? What if we have a table with a lot of indexes and a long list of group-by clauses partially matching every index. This should give us an understanding on whether we need a separate GUC to control this feature. OK. 3) I see that get_useful_group_keys_orderings() makes 3 calls to the get_cheapest_group_keys_order() function. Each time, get_cheapest_group_keys_order() performs the cost estimate and reorders the free keys. However, cost estimation implies system catalog lookups (which are quite expensive). I wonder if we could change the algorithm. Could we just sort the group-by keys by cost once, save this ordering and then just re-use it? So, every time we need to reorder a group by, we can just pull the required keys to the top and use the saved ordering for the rest. I also wonder if we could do this once for the add_paths_to_grouping_rel() and create_partial_grouping_paths() calls. So, it probably should be somewhere in create_ordinary_grouping_paths(). Thanks for the idea! 4) I think we can do some optimizations when enable_incremental_sort == off. Then in get_useful_group_keys_orderings() we should only deal with input_path fully matching the group-by clause, and try only a full match of the group-by output to the required order. Oh, this was designed before incremental sort was invented. I will see what we can do here. [1] https://www.postgresql.org/message-id/60610df1-c32f-ebdf-e58c-7a664431f452%40enterprisedb.com -- regards, Andrei Lepikhov Postgres Professional
Specify description of the SpecialJoinInfo structure
Hi, Working on Asymmetric Join, I found a slight inconsistency in the description of SpecialJoinInfo: join type JOIN_ANTI can be accompanied by a zero value of ojrelid if the join was created by the transformation of a NOT EXISTS subquery. -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h index ed85dc7414..189b59f127 100644 --- a/src/include/nodes/pathnodes.h +++ b/src/include/nodes/pathnodes.h @@ -2791,7 +2791,8 @@ typedef struct PlaceHolderVar * * ojrelid is the RT index of the join RTE representing this outer join, * if there is one. It is zero when jointype is INNER or SEMI, and can be - * zero for jointype ANTI (if the join was transformed from a SEMI join). + * zero for jointype ANTI (if the join was transformed from a SEMI join or + * converted from a sublink). * One use for this field is that when constructing the output targetlist of a * join relation that implements this OJ, we add ojrelid to the varnullingrels * and phnullingrels fields of nullable (RHS) output columns, so that the
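For illustration, the case the comment fix covers can be seen with a sublink-generated anti-join (tables invented; the plan shape shown is the typical one):

CREATE TABLE a (x int);
CREATE TABLE b (x int);
EXPLAIN (COSTS OFF)
SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x);
-- Hash Anti Join
--   Hash Cond: (a.x = b.x)
--   ->  Seq Scan on a
--   ->  Hash
--         ->  Seq Scan on b
-- No join RTE exists for this join, so its SpecialJoinInfo carries ojrelid = 0.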
Re: Optimization outcome depends on the index order
On 25/12/2023 18:36, Alexander Korotkov wrote: On Fri, Dec 22, 2023 at 7:24 PM Andrei Lepikhov wrote: On 22/12/2023 11:48, Alexander Korotkov wrote: > Because we must trust all predictions made by the planner, we just > choose the most trustworthy path. According to the planner logic, it is > a path with a smaller selectivity. We can make mistakes anyway just > because of the nature of estimation. Even if we need to take selectivity into account here, it's still not clear why this should be on top of other logic later in add_path(). I got your point now, thanks for pointing it out. In the next version of the patch, selectivity is used as a criterion only in the case of COSTS_EQUAL. It looks better now. But it's hard for me to judge these heuristics in add_path(). Tom, what do you think about this? Just food for thought: another approach I considered was to initialize the relation index list according to some consistent rule: place unique indexes at the head of the list, arrange indexes according to the number of columns involved, and sort in some lexical order. But IMO, the implemented approach looks better. -- regards, Andrei Lepikhov Postgres Professional
Re: Optimization outcome depends on the index order
On 22/12/2023 11:48, Alexander Korotkov wrote: > Because we must trust all predictions made by the planner, we just > choose the most trustworthy path. According to the planner logic, it is > a path with a smaller selectivity. We can make mistakes anyway just > because of the nature of estimation. Even if we need to take selectivity into account here, it's still not clear why this should be on top of other logic later in add_path(). I got your point now, thanks for pointing it out. In the next version of the patch, selectivity is used as a criterion only in the case of COSTS_EQUAL. -- regards, Andrei Lepikhov Postgres Professional
From 45bda9784d28dc9cec90c5b33285023a49850800 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when the optimizer predicts only one row, prefer choosing UNIQUE indexes. In other cases, if the optimizer treats indexes as equal, make a last attempt by selecting the index with the smaller selectivity - this decision removes the dependency on the order of indexes in the index list (good for reproducing some issues) and provides one more objective argument for choosing a specific index. --- src/backend/optimizer/util/pathnode.c | 32 +++ src/test/regress/expected/functional_deps.out | 39 +++ src/test/regress/sql/functional_deps.sql | 32 +++ 3 files changed, 103 insertions(+) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..984c974b57 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,38 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (costcmp == COSTS_EQUAL) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. +*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity. +* This option makes the decision more conservative and looks +* debatable. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + } + /* * If the two paths compare differently for startup and total cost, * then we want to keep both, and we can skip comparing pathkeys and diff --git a/src/test/regress/expected/functional_deps.out b/src/test/regress/expected/functional_deps.out index 32381b8ae7..7057254278 100644 --- a/src/test/regress/expected/functional_deps.out +++ b/src/test/regress/expected/functional_deps.out @@ -230,3 +230,42 @@ EXECUTE foo; ALTER TABLE articles DROP CONSTRAINT articles_pkey RESTRICT; EXECUTE foo; -- fail ERROR: column "articles.keywords" must appear in the GROUP BY clause or be used in an aggregate function +/* + * Corner case of the PostgreSQL optimizer: + * + * ANDed clauses selectivity multiplication increases total selectivity error. 
+ * If such an inaccurate selectivity is so tiny that row estimation predicts the + * absolute minimum number of tuples (1), the optimizer can't choose between + * different indexes and picks the first one from the index list (last created). + */ +CREATE TABLE t A
Optimization outcome depends on the index order
On 21/12/2023 12:10, Alexander Korotkov wrote: > I took a closer look at the patch in [9]. I should drop my argument > about breaking the model, because add_path() already considers other > aspects than just costs. But I have two more notes about that patch: > > 1) It seems that you're determining the fact that the index path > should return strictly one row by checking path->rows <= 1.0 and > indexinfo->unique. Is it really guaranteed that in this case the quals > are matching the unique constraint? path->rows <= 1.0 could be just an > estimation error. Or one row could be correctly estimated, but it's > going to be selected by some quals matching the unique constraint and > other quals in recheck. So, it seems there is a risk to select a > suboptimal index due to this condition. Operating inside the optimizer, we consider all estimations to be the truth. This patch modifies only one place: having two equal assumptions, we just choose the one that generally looks more stable. Filtered tuples should be calculated and included in the cost of the path. The decision on the equality of paths has been made in view of the estimation of these filtered tuples. > 2) Even for non-unique indexes this patch is putting new logic on top > of the subsequent code. How we can prove it's going to be a win? > That could lead, for instance, to dropping parallel-safe paths in > cases we didn't do so before. Because we must trust all predictions made by the planner, we just choose the most trustworthy path. According to the planner logic, it is a path with a smaller selectivity. We can make mistakes anyway just because of the nature of estimation. > Anyway, please start a separate thread if you're willing to put more > work into this. Done. > 9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1- > b8a4-16c66149731b%40postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
From 7b044de1449a5fdc450cb629caafb4e15ded7a93 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when the optimizer predicts only one row, prefer choosing UNIQUE indexes. In other cases, if the optimizer treats indexes as equal, make a last attempt by selecting the index with the smaller selectivity - this decision removes the dependency on the order of indexes in the index list (good for reproducing some issues) and provides one more objective argument for choosing a specific index. --- src/backend/optimizer/util/pathnode.c | 42 +++ .../expected/drop-index-concurrently-1.out| 16 +++ src/test/regress/expected/functional_deps.out | 39 + src/test/regress/expected/join.out| 40 +- src/test/regress/sql/functional_deps.sql | 32 ++ 5 files changed, 143 insertions(+), 26 deletions(-) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..4b5aedd579 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,48 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (new_path->rows <= 1.0 && old_path->rows <= 1.0) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. 
+*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity
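The instability itself is easy to demonstrate with a symmetric pair of indexes (a minimal sketch; data invented for illustration): both index paths below get a one-row estimate and fuzzily equal costs, so without an extra tie-break the winner may depend on the order of the relation's index list.

CREATE TABLE t (a int, b int, c int);
INSERT INTO t SELECT i, i, i FROM generate_series(1, 10000) AS i;
CREATE INDEX t_a_b ON t (a, b);
CREATE INDEX t_b_c ON t (b, c);
ANALYZE t;
-- Either index can serve two of the three quals; the third becomes a filter.
EXPLAIN (COSTS OFF) SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1;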
Re: Postgres picks suboptimal index after building of an extended statistics
On 18/12/2023 15:29, Alexander Korotkov wrote: Also, there is a set of patches [7], [8], and [9], which makes the optimizer consider path selectivity as well as path costs during path selection. I've rechecked that none of these patches could resolve the original problem described in [1]. It is true. We accidentally mixed two different problems in one thread. Also, I think they are quite tricky. The model of our optimizer assumes that paths in the list should be different ways of getting the same result. If we choose paths by their selectivity, that breaks this model. I don't say there is no way for this. But if we do this, that would require significant rethinking of our optimizer model and possible revision of a significant part of it. I can't understand that. In [9] we just elaborate the COSTS_EQUAL case and establish the final decision on a more stable basis than the casual order of indexes in the list. Anyway, I think if there is still interest in this, it should be moved into a separate thread to keep this thread focused on the problem described in [1]. Agreed. IMO, the problem of the optimizer's dependency on the order of indexes in the relation index list is more urgent for now. Finally, I'd like to note that the issue described in [1] is mostly a selectivity estimation problem. It could be solved by adding multi-column MCV statistics. The patches published so far look more like hacks for particular use cases rather than appropriate solutions. It still looks promising to me to use the knowledge of unique constraints during selectivity estimation [10]. Even though it's hard to implement and possibly implies some overhead, it fits the current model. I also think unique constraints could probably be used in some way to improve estimates even when there is no full match. I have tried to use the knowledge about unique indexes in the selectivity estimation routine. But it looks invasive and adds a lot of overhead. -- regards, Andrei Lepikhov Postgres Professional
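A sketch of the in-core alternative mentioned in the last paragraph - multi-column MCV statistics fixing the ANDed-quals estimate (data invented for illustration):

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 50, i % 50 FROM generate_series(1, 100000) AS i;
CREATE STATISTICS t_mcv (mcv) ON a, b FROM t;
ANALYZE t;
-- With the MCV list the planner estimates ~2000 rows; the independence
-- assumption would give 100000 * (1/50) * (1/50) = 40.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;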
Re: introduce dynamic shared memory registry
On 20/12/2023 17:33, Nathan Bossart wrote: On Wed, Dec 20, 2023 at 11:02:58AM +0200, Andrei Lepikhov wrote: In that case, maybe change the test case to make it closer to real-life usage - with locks and concurrent access (see attachment)? I'm not following why we should make this test case more complicated. It is only intended to test the DSM registry machinery, and setting/retrieving an atomic variable seems like a realistic use-case to me. I can give you at least two reasons here: 1. A more complicated example would serve as a tutorial on using the feature correctly. It would reduce the number of questions on the mailing lists. 2. Looking into existing extensions, I see that the most common case of using a shared memory segment is maintaining some hash table or state structure that needs at least one lock. Try to rewrite pg_prewarm to use this new feature, and you will realize how difficult it is. -- regards, Andrei Lepikhov Postgres Professional
Re: introduce dynamic shared memory registry
On 20/12/2023 07:04, Michael Paquier wrote: On Tue, Dec 19, 2023 at 10:14:44AM -0600, Nathan Bossart wrote: On Tue, Dec 19, 2023 at 10:49:23AM -0500, Robert Haas wrote: On Mon, Dec 18, 2023 at 3:32 AM Andrei Lepikhov wrote: 2. I think a separate file for this feature looks too expensive. According to the gist of that code, it is a part of the DSA module. -1. I think this is a totally different thing than DSA. More files aren't nearly as expensive as the confusion that comes from smushing unrelated things together. Agreed. I think there's a decent chance that more functionality will be added to this registry down the line, in which case it will be even more important that this stuff stays separate from the tools it is built with. +1 for keeping a clean separation between both. Thanks, I got the reasoning. In that case, maybe change the test case to make it closer to real-life usage - with locks and concurrent access (see attachment)? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl b/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl index 4ad6097d1c..762489f3dd 100644 --- a/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl +++ b/src/test/modules/test_dsm_registry/t/001_test_dsm_registry.pl @@ -13,12 +13,27 @@ $node->init; $node->start; $node->safe_psql('postgres', "CREATE DATABASE test;"); -$node->safe_psql('postgres', "CREATE EXTENSION test_dsm_registry;"); -$node->safe_psql('postgres', "SELECT set_val_in_shmem(1236);"); - $node->safe_psql('test', "CREATE EXTENSION test_dsm_registry;"); -my $result = $node->safe_psql('test', "SELECT get_val_in_shmem();"); -is($result, "1236", "check shmem val"); + +my ($initial_value, $result, $custom_script); + +$initial_value = $node->safe_psql('test', "SELECT get_val_in_shmem();"); + +$custom_script = File::Temp->new(); +append_to_file($custom_script, q{ + \set val random(0, 1999) + SELECT increment_val_in_shmem(:val); + SELECT increment_val_in_shmem(-(:val)); +}); + +$node->command_ok([ 'pgbench', '-i', 'test' ], 'Init database'); +$node->command_ok([ 'pgbench', '-t', '31', '-c', '3', '-j', '3', + '-f', "$custom_script", 'test' ], + 'Change shared value simultaneously'); + +$result = $node->safe_psql('test', "SELECT get_val_in_shmem();"); +print("res $initial_value, $result"); +is($result, $initial_value, "Check concurrent access"); $node->stop; diff --git a/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql b/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql index 8c55b0919b..0144845afa 100644 --- a/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql +++ b/src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql @@ -3,7 +3,7 @@ -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION test_dsm_registry" to load this file. 
\quit -CREATE FUNCTION set_val_in_shmem(val INT) RETURNS VOID +CREATE FUNCTION increment_val_in_shmem(val INT) RETURNS VOID AS 'MODULE_PATHNAME' LANGUAGE C; CREATE FUNCTION get_val_in_shmem() RETURNS INT diff --git a/src/test/modules/test_dsm_registry/test_dsm_registry.c b/src/test/modules/test_dsm_registry/test_dsm_registry.c index 8f78012e52..eed29a9f34 100644 --- a/src/test/modules/test_dsm_registry/test_dsm_registry.c +++ b/src/test/modules/test_dsm_registry/test_dsm_registry.c @@ -13,33 +13,48 @@ #include "postgres.h" #include "fmgr.h" +#include "common/pg_prng.h" #include "port/atomics.h" #include "storage/dsm_registry.h" +#include "storage/lwlock.h" PG_MODULE_MAGIC; -static pg_atomic_uint32 *val; +typedef struct +{ + LWLock lock; + uint32 value; +} SharedState; + +static SharedState *extstate; static void -init_val(void *ptr) +init_dsm(void *ptr) { - pg_atomic_init_u32(ptr, 0); + SharedState *state = (SharedState *) ptr; + + LWLockInitialize(&state->lock, LWLockNewTrancheId()); + state->value = pg_prng_uint32(&pg_global_prng_state); } static void dsm_registry_attach(void) { - dsm_registry_init_or_attach("test_dsm_registry", (void **) &val, - sizeof(pg_atomic_uint32), init_val); + dsm_registry_init_or_attach("test_dsm_registry", (void **) &extstate, + sizeof(SharedState), init_dsm); } -PG_FUNCTION_INFO_V1(set_val_in_shmem); +PG_FUNCTION_INFO_V1(increment_val_in_shmem); Datum -set_val_in_shmem(PG_FUNCTION_ARGS) +increment_val_in_shmem(PG_FUNCTION_ARGS) { + uint32 increment = PG_GETARG_UINT32(0); +
Re: introduce dynamic shared memory registry
On 18/12/2023 13:39, Andrei Lepikhov wrote: On 5/12/2023 10:46, Nathan Bossart wrote: I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. I am delighted that you started this thread. When designing extensions, I feel pain every time I introduce a single shared value or some global stat: the extension then must be loadable only at startup. This reduces the flexibility of even very lightweight extensions, which looks harmful for cloud usage. After looking into the code, I have some comments: 1. The code looks good; I didn't find any possible mishaps. Some proposed changes are in the attachment. 2. I think a separate file for this feature looks too expensive. According to the gist of that code, it is a part of the DSA module. 3. The dsm_registry_init_or_attach routine allocates a DSM segment. Why not create a dsa_area for the requestor and return it? -- regards, Andrei Lepikhov Postgres Professional
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c index ea80f45716..0343ce987f 100644 --- a/src/backend/storage/ipc/dsm_registry.c +++ b/src/backend/storage/ipc/dsm_registry.c @@ -45,8 +45,8 @@ static const dshash_parameters dsh_params = { LWTRANCHE_DSM_REGISTRY_HASH }; -static dsa_area *dsm_registry_dsa; -static dshash_table *dsm_registry_table; +static dsa_area *dsm_registry_dsa = NULL; +static dshash_table *dsm_registry_table = NULL; static void init_dsm_registry(void); @@ -83,13 +83,20 @@ init_dsm_registry(void) { /* Quick exit if we already did this. */ if (dsm_registry_table) + { + Assert(dsm_registry_dsa != NULL); return; + } + + Assert(dsm_registry_dsa == NULL); /* Otherwise, use a lock to ensure only one process creates the table. */ LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE); if (DSMRegistryCtx->dshh == DSHASH_HANDLE_INVALID) { + Assert(DSMRegistryCtx->dsah == DSHASH_HANDLE_INVALID); + /* Initialize dynamic shared hash table for registry. */ dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA); dsa_pin(dsm_registry_dsa); @@ -102,6 +109,8 @@ init_dsm_registry(void) } else { + Assert(DSMRegistryCtx->dsah != DSHASH_HANDLE_INVALID); + /* Attach to existing dynamic shared hash table. */ dsm_registry_dsa = dsa_attach(DSMRegistryCtx->dsah); dsa_pin_mapping(dsm_registry_dsa); diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h index 665d471418..e0e7b3b765 100644 --- a/src/include/storage/lwlock.h +++ b/src/include/storage/lwlock.h @@ -209,7 +209,7 @@ typedef enum BuiltinTrancheIds LWTRANCHE_LAUNCHER_HASH, LWTRANCHE_DSM_REGISTRY_DSA, LWTRANCHE_DSM_REGISTRY_HASH, - LWTRANCHE_FIRST_USER_DEFINED + LWTRANCHE_FIRST_USER_DEFINED, } BuiltinTrancheIds; /*
Re: introduce dynamic shared memory registry
On 5/12/2023 10:46, Nathan Bossart wrote: I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. I am delighted that you started this thread. When designing extensions, I feel pain every time I introduce a single shared value or some global stat: the extension then must be loadable only at startup. This reduces the flexibility of even very lightweight extensions, which looks harmful for cloud usage. -- regards, Andrei Lepikhov Postgres Professional
Re: Statistics Import and Export
On 13/12/2023 17:26, Corey Huinker wrote: > 4. I don't yet have a complete vision for how these tools will be used by pg_upgrade and pg_dump/restore, the places where these will provide the biggest win for users. Some issues with the docs here: func.sgml:28465: parser error : Opening and ending tag mismatch: sect1 line 26479 and sect2 ^ Also, as I remember, we already had some attempts to invent dump/restore of statistics [1,2]. They were stopped by the problem of type verification. What if the definition of the type has changed between the dump and the restore? As I see in the code, when importing statistics you only check the column name and don't look into the type. [1] Backup and recovery of pg_statistic https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook [2] Re: Ideas about a better API for postgres_fdw remote estimates https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru -- regards, Andrei Lepikhov Postgres Professional
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 6/12/2023 14:30, Richard Guo wrote: I've self-reviewed this patch again and I think it's now in a committable state. I'm wondering if we can mark it as 'Ready for Committer' now, or whether we need more review comments/feedback. To recap, this patch postpones reparameterization of paths until createplan.c, which would help avoid building the reparameterized expressions we might not use. More importantly, it makes it possible to modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos (e.g. baserestrictinfo) for reparameterization. Failing to do that can lead to executor crashes, wrong results, or planning errors, as we have already seen. As I see it, this patch contains the following modifications: 1. Changes in create_nestloop_path: here, it bases all the tests on the value of top_parent_relids, if any. That is OK with me. 2. Changes in reparameterize_path_by_child and the coupled new function path_is_reparameterizable_by_child. I compared them, and it looks good. One thing here: you use an assertion trap in the case of an inconsistency in the behaviour of these two functions. How disastrous would it be if someone hit such an inconsistency in production? Maybe it is better to use an elog(PANIC, ...) report? 3. The ADJUST_CHILD_ATTRS() macro. While understanding the necessity of this change, I have to say it looks a bit ugly. 4. The SampleScan reparameterization part. It looks OK. It can help us in the future with asymmetric join and other things. 5. Tests. All of them are related to partitioning and the SampleScan issue. Maybe that is enough, but remember, this change now impacts the parameterization feature in general. 6. I can't judge how this switch of the overall approach could impact other parts of the planner. I am only wary that, from the first reparameterization up to plan creation, we have some inconsistency in expressions (partition paths refer to expressions with the parent relid). What if an extension changes the parent expression in the middle? By and large, this patch is in a good state and may be committed. -- regards, Andrei Lepikhov Postgres Professional
Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'
On 11/12/2023 09:31, Richard Guo wrote: On Fri, Dec 8, 2023 at 3:13 PM Alexander Pyhalov mailto:a.pyha...@postgrespro.ru>> wrote: Andrei Lepikhov wrote on 2023-12-08 07:37: > I'd already clashed with Tom on copying the required_relids field and > voluntarily made unnecessary copies in the project [1]. > And ... ran into huge memory consumption. The reason was in > Bitmapsets: > When we have 1E3-1E4 partitions and try to reparameterize a join, one > bitmapset field can have a size of about 1kB. Having a bitmapset > referencing a relation with a large index value, we had a lot of (for > example, 1E4 * 1kB) copies on each reparameterization of such a field. > Alexander Pyhalov should remember that case. Yes. If it matters, this happened during reparameterization when two partitioned tables with 1000 partitions each were joined. Then asymmetric partition-wise join managed to eat lots of memory for bitmapsets (by lots of memory I mean all available on the test VM). By reparameterization did you mean the work done in reparameterize_path_by_child()? If so maybe you'd be interested in the patch [1] which postpones reparameterization of paths until createplan.c and thus can help avoid unnecessary reparameterization work. Yeah, I have already discovered it. It is a promising solution and only needs a bit more review. But here, I covered some corner cases with the idea that we may not see the other cases right now. And also, sometimes the Bitmapset field is significant - that is not a corner case. -- regards, Andrei Lepikhov Postgres Professional
Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'
On 28/11/2023 01:37, Alexander Korotkov wrote: On Mon, Nov 27, 2023 at 8:07 PM Andres Freund wrote: Sorry for the late answer, I missed this thread because of vacation. On 2023-11-27 11:29:48 +0530, Ashutosh Bapat wrote: How do we ensure that we are not making unnecessary copies of Bitmapsets? We don't - but that's not specific to this patch. Bitmapsets typically aren't very large, I doubt that it's a significant proportion of the memory usage. Adding refcounts or such would likely add more overhead than it'd save, both in time and memory. I'd already clashed with Tom on copying the required_relids field and voluntarily made unnecessary copies in the project [1]. And ... ran into huge memory consumption. The reason was in Bitmapsets: when we have 1E3-1E4 partitions and try to reparameterize a join, one bitmapset field can have a size of about 1kB. With a bitmapset referencing a relation with a large index value, we get a lot of copies (for example, 1E4 * 1kB) on each reparameterization of such a field. Alexander Pyhalov should remember that case. I don't claim we will certainly catch such an issue here, but it is a reason why we should look at this option carefully. I am a bit worried about the maintainability of remove_rel_from_query() et al. Is there any infrastructure for detecting that some PlannerInfo field that needs updating wasn't updated? There's not even a note in PlannerInfo that documents that that needs to happen. Thank you for highlighting this issue. That makes sense, thank you. We need at least a comment about this. I'll write a patch adding this comment. BTW, what do you think about the patches upthread [1]? Links 1. https://www.postgresql.org/message-id/CAPpHfdtLgCryACcrmLv=Koq9rAB3=tr5y9d84dggvuhscvj...@mail.gmail.com 0001 - Looks good and can be applied. 0002 - I am afraid the problems with expanded range table entries are similar to those described above. The patch makes sense, but it requires time to reproduce corner cases. Maybe we can do it separately from the current hotfix? 0003 - I think it is really what we need right now: SJE is quite a rare optimization and executes before the entry expansion procedure. So it looks less risky. [1] Asymmetric partition-wise JOIN https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
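A back-of-envelope check of the numbers above, following the BITMAPSET_SIZE() arithmetic (assuming 64-bit bitmapwords and an 8-byte header):

SELECT pg_size_pretty((8 + (10000 / 64 + 1) * 8)::bigint) AS one_set,
       pg_size_pretty(((8 + (10000 / 64 + 1) * 8) * 10000)::bigint) AS copies_1e4;
--    one_set   | copies_1e4
-- -------------+------------
--  1264 bytes  | 12 MB

So a single ~1 kB field copied 1E4 times really does scale to megabytes, and several such fields per structure multiply it further.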
Re: POC, WIP: OR-clause support for indexes
Here is a fresh version with the pg_dump.pl regex fixed. Now it should pass the buildfarm. Under development: 1. Explanation of the general idea in comments (Robert's note) 2. Issue with hiding some optimizations (Alexander's note and example with overlapping clauses on two partial indexes) -- regards, Andrei Lepikhov Postgres Professional
From 71746caae3eb0c771792b563fd53244fc1ac575b Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) at the preliminary stage of optimization, when we are still working with the expression tree. Here C is a constant expression, 'expr' is a non-constant expression, 'op' is an operator which returns a boolean result and has a commutator (for the case of the reverse order of the constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to a suboptimal plan. But we think it is better to have an array of elements instead of a lot of OR clauses. There is room for further optimization here, decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 310 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 6 +- src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 862 insertions(+), 70 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 0a5bdf8bcc..ff69b2cd3b 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3112,7 +3112,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER 
BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 rows) select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1; @@ -3130,7 +3130,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5) ORDER BY (t1.c1 % 5))), ((t2.c1 % 3)) Relations: Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5) ORDER BY ((r1.c1 % 5)) ASC NULLS LAST), (r2.c1 % 3) FROM ("S 1&q
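The transformation itself is easy to observe with a minimal sketch (table invented for illustration; the GUC controlling the feature is not shown above, so it is omitted here):

CREATE TABLE t (x int);
INSERT INTO t SELECT i FROM generate_series(1, 100000) AS i;
CREATE INDEX ON t (x);
ANALYZE t;
-- Without the patch the plan is typically a BitmapOr of three index scans;
-- with it, the quals collapse into one ScalarArrayOpExpr:
--   Index Cond: (x = ANY ('{1,2,3}'::integer[]))
EXPLAIN (COSTS OFF) SELECT * FROM t WHERE x = 1 OR x = 2 OR x = 3;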
Re: POC, WIP: OR-clause support for indexes
Hi, Here is the next version of the patch where, I think, all of Robert's claims related to the code have been fixed. For example, the expression 'x < 1 OR x < 2' is transformed to 'x < ANY (1,2)'. Here, we still need to deal with the architectural issues. I like the approach mentioned by Peter: try to transform the expression tree into some 'normal' form, which is more compact and simple, and delay the search for optimizations to the following stages. Also, it doesn't pass the pg_dump test. At first glance, it is a problem with a regular expression, which should be corrected further. -- regards, Andrei Lepikhov Postgres Professional
From 73031b7acae68494ddd0f9b1faf4c94aae3bd6b0 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (expr op C1) OR (expr op C2) ... with expr op ANY(C1, C2, ...) at the preliminary stage of optimization, when we are still working with the expression tree. Here C is a constant expression, 'expr' is a non-constant expression, 'op' is an operator which returns a boolean result and has a commutator (for the case of the reverse order of the constant and non-constant parts of the expression, like 'CX op expr'). Sometimes it can lead to a suboptimal plan. But we think it is better to have an array of elements instead of a lot of OR clauses. There is room for further optimization here, decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 16 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 310 +- src/backend/utils/misc/guc_tables.c | 11 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 156 - src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 219 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 35 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 20 files changed, 859 insertions(+), 67 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 0a5bdf8bcc..ff69b2cd3b 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -1349,7 +1349,7 @@ SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE Foreign Scan Output: t1.c1, t1.c2, ft5.c1, ft5.c2 Relations: (public.ft4 t1) LEFT JOIN (public.ft5) - Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 < 10) OR (r4.c1 IS NULL))) AND ((r1.c1 < 10)) + Remote SQL: SELECT r1.c1, r1.c2, r4.c1, r4.c2 FROM ("S 1"."T 3" r1 LEFT JOIN "S 1"."T 4" r4 ON (((r1.c1 = r4.c1)) AND ((r4.c1 < 10 WHERE (((r4.c1 IS NULL) OR (r4.c1 < 10))) AND ((r1.c1 < 10)) (4 rows) SELECT t1.c1, t1.c2, t2.c1, t2.c2 FROM ft4 t1 LEFT JOIN (SELECT * FROM ft5 WHERE c1 < 10) t2 ON (t1.c1 = t2.c1) @@ -3112,7 +3112,7 @@ select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2 Foreign Scan Output: (array_agg(DISTINCT (t1.c1 % 5))), ((t2.c1 % 3)) Relations: 
Aggregate on ((public.ft4 t1) FULL JOIN (public.ft5 t2)) - Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE (((r1.c1 < 20) OR ((r1.c1 IS NULL) AND (r2.c1 < 5 GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST + Remote SQL: SELECT array_agg(DISTINCT (r1.c1 % 5)), (r2.c1 % 3) FROM ("S 1"."T 3" r1 FULL JOIN "S 1"."T 4" r2 ON (((r1.c1 = r2.c1 WHERE r1.c1 IS NULL) AND (r2.c1 < 5)) OR (r1.c1 < 20))) GROUP BY 2 ORDER BY array_agg(DISTINCT (r1.c1 % 5)) ASC NULLS LAST (4 rows) select array_agg(distinct (t1.c1)%5) from ft4 t1 full join ft5 t2 on (t1.c1 = t2.c1) where t1.c1 < 20 or (t1.c1 is null and t2.c1 < 5) group by (t2.c1)%3 order by 1; @@ -3130,7 +3130,7 @@ select array_agg(distinct (t1.c1)%5 order by (t1.c1)%5) from ft4 t1 full join ft Foreign Scan Output: (array_agg(D
Re: Custom explain options
On 30/11/2023 22:40, Konstantin Knizhnik wrote: On 30/11/2023 5:59 am, Andrei Lepikhov wrote: On 21/10/2023 19:16, Konstantin Knizhnik wrote: EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COSTS, ...) which help to provide useful details of query execution. In Neon we have added a PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for the Neon architecture because of the separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is the local file cache (an extra caching layer above shared buffers which is used to somehow replace the file system cache in standalone Postgres). I think that it will be nice to have a generic mechanism which allows extensions to add their own options to EXPLAIN. Generally, I welcome this idea: extensions can already do a lot of work, and they should have a tool to report their state, not only into the log. But I think your approach needs to be elaborated. First, it would be better to allow registering extended instruments for specific node types to avoid unneeded calls. Second, looking into the Instrumentation usage, I don't see a reason to limit the size: as far as I can see, it exists everywhere either locally or in the DSA, where its size is calculated on the fly. So, by registering an extended instrument, we can reserve a slot for the extension. The actual size of the underlying data can be provided by the extension routine. Thank you for the review. I agree that support for extended instruments is desirable. I just tried to minimize the number of changes to make this patch smaller. I got it. But with a substantial number of extensions to support, I think the extension part of instrumentation could have advantages and be worth elaborating on. In all these cases we are using an array of `Instrumentation`, and if it contains a varying part, it is not clear where to place it. Yes, there is also code which serializes and sends instrumentation between worker processes, and I have updated this code in my PR to send the actual amount of custom instrumentation data. But it cannot help with the cases above. I see the following basic instrumentation structures in the code:
- Instrumentation (which should be named NodeInstrumentation)
- MemoizeInstrumentation
- JitInstrumentation
- AggregateInstrumentation
- HashInstrumentation
- TuplesortInstrumentation
As a variant, extensibility can be designed with a parent 'AbstractInstrumentation' node containing the node type and a link to the extensible part. sizeof(Instr) calls should be replaced with getInstrSize() calls - there are not so many places in the code; memcpy() can also be replaced with a copy_instr() routine. -- regards, Andrei Lepikhov Postgres Professional
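To make the last paragraph concrete, here is a purely hypothetical C sketch (not from any patch; AbstractInstrumentation, getInstrSize and copy_instr are the names proposed above, nothing more):

/* assumes postgres.h (NodeTag, Size) and <string.h> for memcpy */
typedef struct AbstractInstrumentation
{
	NodeTag		type;	/* which concrete instrumentation kind follows */
	Size		size;	/* full struct size, filled in by its owner */
} AbstractInstrumentation;

/* would replace bare sizeof(Instrumentation) at the call sites */
static inline Size
getInstrSize(const AbstractInstrumentation *instr)
{
	return instr->size;
}

/* would replace bare memcpy() of instrumentation entries */
static inline void
copy_instr(AbstractInstrumentation *dst, const AbstractInstrumentation *src)
{
	memcpy(dst, src, src->size);
}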
Re: Report planning memory in EXPLAIN ANALYZE
On 30/11/2023 18:40, Alvaro Herrera wrote: Well, SUMMARY is enabled by default with ANALYZE, and I'd rather not have planner memory consumption displayed by default with all EXPLAIN ANALYZEs. So yeah, I still think this deserves its own option. But let's hear others' opinions on this point. I'm only one vote here. I agree; it should be disabled by default. The fact that memory consumption is reported with byte precision (quite uncomfortable to read) suggests that the primary audience is developers and that the output is intended for debugging purposes. -- regards, Andrei Lepikhov Postgres Professional
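For reference, a sketch of the interface shape being argued for - a dedicated option that must be requested explicitly; the option name MEMORY and the output line are assumptions based on this discussion, not a settled interface:

-- Hypothetical: planner memory reported only when explicitly requested.
EXPLAIN (MEMORY) SELECT * FROM pg_class;
--  Seq Scan on pg_class  (cost=0.00..18.13 rows=413 width=273)
--  Planner Memory: used=23072 bytes allocated=32768 bytes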
Re: POC, WIP: OR-clause support for indexes
On 30/11/2023 15:00, Alena Rybakina wrote: 2. The second patch is my version of the patch, where I moved the OR transformation to the index-path generation stage: So, I got the best query plan despite the possible OR to ANY transformation: If the user uses a clause like "x IN (1,2) AND y=100", it will break your 'good' solution. In my opinion, the general approach here is to stay with the OR->ANY transformation at the parsing stage and invent one more way of picking an index: look into the array and attempt to find a compound index. Having a shorter list of expressions, where uniform ORs are grouped into arrays, the optimizer will do such work with less overhead. -- regards, Andrei Lepikhov Postgres Professional
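A minimal sketch of the compound-index case mentioned above (the table and index names are illustrative):

CREATE TABLE t (x int, y int);
CREATE INDEX t_x_y_idx ON t (x, y);
-- After the parse-stage transformation the qual becomes
--   x = ANY ('{1,2}'::integer[]) AND y = 100
-- so index matching has to look inside the array to recognize that
-- the compound index t_x_y_idx fits, possibly decomposing the array.
SELECT * FROM t WHERE (x = 1 OR x = 2) AND y = 100;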
Re: Custom explain options
On 21/10/2023 19:16, Konstantin Knizhnik wrote: The EXPLAIN statement has a list of options (e.g. ANALYZE, BUFFERS, COSTS, ...) which help to provide useful details of query execution. In Neon we have added a PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for the Neon architecture because of the separation of compute and storage, so it is implemented not only for bitmap heap scans, as in vanilla Postgres, but also for sequential, index and index-only scans). Another possible candidate for EXPLAIN options is the local file cache (an extra caching layer above shared buffers which is used to somehow replace the file system cache in standalone Postgres). I think it would be nice to have a generic mechanism which allows extensions to add their own options to EXPLAIN. Generally, I welcome this idea: extensions can already do a lot of work, and they should have a tool to report their state, not only into the log. But I think your approach needs to be elaborated. First, it would be better to allow registering extended instruments for specific node types to avoid unneeded calls. Second, looking into the Instrumentation usage, I don't see a reason to limit the size: as far as I can see, it everywhere exists locally or in the DSA, where its size is calculated on the fly. So, by registering an extended instrument, we can reserve a slot for the extension. The actual size of the underlying data can be provided by the extension routine. -- regards, Andrei Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 28/11/2023 04:03, Robert Haas wrote: On Mon, Nov 27, 2023 at 3:02 AM Andrei Lepikhov wrote: On 25/11/2023 08:23, Alexander Korotkov wrote: I think the patch certainly gets better in this aspect. One thing I can't understand is why we use home-grown code for resolving hash collisions. You can just define custom hash and match functions in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we can still save the pre-calculated hash value into the hash entry and use custom hash and match functions. This doesn't require us to write our own collision-resolving code. Thanks, it was an insightful suggestion. I implemented it, and the code has become shorter (see attachment). Neither the code comments nor the commit message really explain the design idea here. That's unfortunate, principally because it makes review difficult. Yeah, it is still an issue. We will think about how to improve this; any suggestions are welcome. I'm very skeptical about the idea of using JumbleExpr for any part of this. It seems fairly expensive, and it might produce false matches. If expensive is OK, then why not just use equal()? If it's not, then this probably isn't really OK either. But in any case there should be comments explaining why this strategy was chosen. We used the equal() routine without hashing in earlier versions. Hashing resolves issues with many different OR clauses. Is it expensive? Maybe, but we assume this transformation should be applied only to simple enough expressions. The use of op_mergejoinable() seems pretty random to me. Why should we care about that? If somebody writes a<1 or a<2 or a<3 or a<4, you can transform that to a < ANY ('{1,2,3,4}'). You are right. The only reason was to obtain a working patch to benchmark and look for corner cases. We will rewrite that place but still live with the equivalence operator. The reader shouldn't be left to guess whether a rule like this was made for reasons of correctness or for reasons of efficiency or something else. Looking further, I see that the reason for this is likely that the operator for the transformation result is constructed using list_make1(makeString((char *) "=")), but trying to choose an operator based on the operator name is, I think, pretty clearly unacceptable. Yes, it was a big mistake. It is fixed in the new version (I guess). I am extremely dubious about the use of select_common_type() here. Why not do this only when the types already match exactly? Maybe the concern is unknown literals, but perhaps that should be handled in some other way. If you do this kind of thing, you need to justify why it can't fail or produce wrong answers. Perhaps. We implemented your approach in the next version. At least we can see the consequences. Honestly, it seems very hard to avoid the conclusion that this transformation is being done at too early a stage. Parse analysis is not the time to try to do query optimization. I can't really believe that there's a way to produce a committable patch along these lines. Ideally, a transformation like this should be done after we know what plan shape we're using (or considering using), so that we can make cost-based decisions about whether to transform or not. But at the very least it should happen somewhere in the planner. There's really no justification for parse analysis rewriting the SQL that the user entered. Here, we assume that an array operation is generally better than many ORs. 
As a result, it should be more effective to make OR->ANY transformation in the parser (it is a relatively lightweight operation here) and, as a second phase, decompose that in the optimizer. We implemented earlier prototypes in different places of the optimizer, and I'm convinced that only this approach resolves the issues we found. Does this approach look weird? Maybe. We can debate it in this thread. -- regards, Andrei Lepikhov Postgres Professional From a5f8eba522ff907f7a3fffb0635b61d38933e385 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 289 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc
Re: POC, WIP: OR-clause support for indexes
On 25/11/2023 08:23, Alexander Korotkov wrote: I think patch certainly gets better in this aspect. One thing I can't understand is why do we use home-grown code for resolving hash-collisions. You can just define custom hash and match functions in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we still can save pre-calculated hash value into hash entry and use custom hash and match. This doesn't imply us to write our own collision-resolving code. Thanks, it was an insightful suggestion. I implemented it, and the code has become shorter (see attachment). -- regards, Andrei Lepikhov Postgres Professional From 8a43f5b50be6cb431046ab352fbcb9bdd3e4376c Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 283 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 257 +--- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 830 insertions(+), 83 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 22cae37a1e..fdfecd51c7 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8537,18 +8537,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' 
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; +
Re: Postgres picks suboptimal index after building of an extended statistics
Second version of the patch - resolve non-symmetrical decision, thanks to Teodor Sigaev's review. -- regards, Andrei Lepikhov Postgres Professional From 604899b6afe70eccbbdbf69ce254f37808c598db Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Nov 2023 11:23:48 +0700 Subject: [PATCH] Choose an index path with the best selectivity estimation. In the case when optimizer predicts only one row prefer choosing UNIQUE indexes In other cases, if optimizer treats indexes as equal, make a last attempt selecting the index with less selectivity - this decision takes away dependency on the order of indexes in an index list (good for reproduction of some issues) and proposes one more objective argument to choose specific index. --- src/backend/optimizer/util/pathnode.c | 42 ++ .../expected/drop-index-concurrently-1.out| 16 --- src/test/regress/expected/functional_deps.out | 43 +++ src/test/regress/expected/join.out| 40 + src/test/regress/sql/functional_deps.sql | 36 5 files changed, 151 insertions(+), 26 deletions(-) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 0b1d17b9d3..4b5aedd579 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -454,6 +454,48 @@ add_path(RelOptInfo *parent_rel, Path *new_path) costcmp = compare_path_costs_fuzzily(new_path, old_path, STD_FUZZ_FACTOR); + /* +* Apply some heuristics on index paths. +*/ + if (IsA(new_path, IndexPath) && IsA(old_path, IndexPath)) + { + IndexPath *inp = (IndexPath *) new_path; + IndexPath *iop = (IndexPath *) old_path; + + if (new_path->rows <= 1.0 && old_path->rows <= 1.0) + { + /* +* When both paths are predicted to produce only one tuple, +* the optimiser should prefer choosing a unique index scan +* in all cases. +*/ + if (inp->indexinfo->unique && !iop->indexinfo->unique) + costcmp = COSTS_BETTER1; + else if (!inp->indexinfo->unique && iop->indexinfo->unique) + costcmp = COSTS_BETTER2; + else if (costcmp != COSTS_DIFFERENT) + /* +* If the optimiser doesn't have an obviously stable choice +* of unique index, increase the chance of avoiding mistakes +* by choosing an index with smaller selectivity. +* This option makes decision more conservative and looks +* debatable. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + else if (costcmp == COSTS_EQUAL) + /* +* The optimizer can't differ the value of two index paths. +* To avoid making a decision that is based on only an index +* order in the list, use some rational strategy based on +* selectivity: prefer touching fewer tuples on the disk to +* filtering them after. +*/ + costcmp = (inp->indexselectivity < iop->indexselectivity) ? + COSTS_BETTER1 : COSTS_BETTER2; + } + /* * If the two paths compare differently for startup and total cost, * then we want to keep both, and we can skip comparing pathkeys and diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out index 1cb2250891..2392cdb033 100644 --- a/src/test/isolation/expected/drop-index-concurrently-1.out +++ b/src/test/isolation/expected/drop-index-concurrently-1.out @@ -12,13 +12,15 @@ step preps: PREPARE getrow_seqscan AS SELECT * FROM test_dc WHERE data = 34 ORDE ste
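A sketch of the instability this patch addresses: two indexes cover the same one-row predicate with fuzzily equal costs, and without the heuristic the winner depends only on the order of the relation's index list (the schema is illustrative):

CREATE TABLE t (a int, b int NOT NULL);
CREATE INDEX t_a_b_idx ON t (a, b);   -- non-unique
CREATE UNIQUE INDEX t_b_idx ON t (b); -- unique
-- Both indexes can serve this qual and both are estimated to return
-- a single row; the patch makes add_path() prefer the unique index
-- instead of relying on the order of rel->indexlist.
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1;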
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 14/11/2023 17:10, Damir Belyalov wrote: Here is a very straw-man-level sketch of what I think might work. The option to COPY FROM looks something like ERRORS TO other_table_name (item [, item [, ...]]) I tried to implement the patch using a table and came across a number of questions. What kind of table should we use for this feature: a system catalog table, a table stored as a file, or a newly created regular table? In all these cases, security and user rights management issues arise. It is better for other users not to see error lines from another user. It is also not clear how access rights to this table should be inherited and granted. Previous reviews have given helpful ideas about storing errors in a new table. It should be trivial code - use the current table name plus an 'err' suffix, as we already do in the case of conflicting auto-generated index names. The 'errors table' must inherit any access policies from the table to which we do the copy. -- regards, Andrei Lepikhov Postgres Professional
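A sketch of the straw-man syntax in use; none of this exists yet, and the error-table name and its columns are assumptions following the naming idea above:

-- Hypothetical: failed lines go to a separate table instead of
-- aborting the whole load.
COPY orders FROM '/tmp/orders.csv' (FORMAT csv)
    ERRORS TO orders_err (lineno, rawline, errmsg);
-- The 'orders_err' table would inherit the access policies of
-- 'orders', so other users cannot see failed lines that are not theirs.
SELECT lineno, errmsg FROM orders_err;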
Re: POC, WIP: OR-clause support for indexes
On 23/11/2023 16:23, Andrei Lepikhov wrote: This code changes tests in many places. But, as I see it, it mostly demonstrates the positive effect of the transformation. I found out that the postgres_fdw tests were impacted by the feature. Fix it, because the patch is on the commitfest and passes buildfarm. Taking advantage of this, I suppressed the expression evaluation procedure to make regression test changes more clear. -- regards, Andrei Lepikhov Postgres Professional From e29b35d3445d9852f3ecb9cdaa4f231c72334973 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- .../postgres_fdw/expected/postgres_fdw.out| 8 +- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 284 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 215 +++-- src/test/regress/expected/rules.out | 4 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 810 insertions(+), 62 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 64bcc66b8d..a8442f08aa 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8537,18 +8537,18 @@ insert into utrtest values (2, 'qux'); -- Check case where the foreign partition is a subplan target rel explain (verbose, costs off) update utrtest set a = 1 where a = 1 or a = 2 returning *; - QUERY PLAN - + QUERY PLAN +-- Update on public.utrtest Output: utrtest_1.a, utrtest_1.b Foreign Update on public.remp utrtest_1 Update on public.locp utrtest_2 -> Append -> Foreign Update on public.remp utrtest_1 - Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b + Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b -> Seq Scan on public.locp utrtest_2 Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record - Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2)) + Filter: (utrtest_2.a = ANY ('{1,2}'::integer[])) (10 rows) -- The new values are concatenated with ' triggered !' 
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; + jstate->highest_extern_param_id = 0; + + /* Compute query ID */ + _jumbleNode(jstate, (Node *) expr); + *queryId = DatumGe
Re: POC, WIP: OR-clause support for indexes
On 21/11/2023 18:31, Alena Rybakina wrote: Sorry, I lost your changes during the revision process. I returned them. I raised the patch version just in case to run ci successfully. I think the usage of nodeToString for the generation of clause hash is too expensive and buggy. Also, in the code, you didn't resolve hash collisions. So, I've rewritten the patch a bit (see the attachment). One more thing: I propose to enable transformation by default at least for quick detection of possible issues. This code changes tests in many places. But, as I see it, it mostly demonstrates the positive effect of the transformation. -- regards, Andrei Lepikhov Postgres Professional From 5071d02426ac3430f4dd61a8ad32c2847ba6f8a5 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 23 Nov 2023 16:00:13 +0700 Subject: [PATCH] Transform OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the preliminary stage of optimization when we are still working with a tree expression. Sometimes it can lead to not optimal plan. But we think it is better to have array of elements instead of a lot of OR clauses. Here is a room for further optimizations on decomposing that array into more optimal parts. --- src/backend/nodes/queryjumblefuncs.c | 30 ++ src/backend/parser/parse_expr.c | 285 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/queryjumble.h | 1 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 141 +++-- src/test/regress/expected/create_view.out | 2 +- src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/inherit.out | 2 +- src/test/regress/expected/join.out| 62 +++- src/test/regress/expected/partition_prune.out | 215 +++-- src/test/regress/expected/rules.out | 18 +- src/test/regress/expected/stats_ext.out | 12 +- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 23 +- src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 2 + 21 files changed, 815 insertions(+), 66 deletions(-) diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index 281907a4d8..99207a8670 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -135,6 +135,36 @@ JumbleQuery(Query *query) return jstate; } +JumbleState * +JumbleExpr(Expr *expr, uint64 *queryId) +{ + JumbleState *jstate = NULL; + + Assert(queryId != NULL); + + jstate = (JumbleState *) palloc(sizeof(JumbleState)); + + /* Set up workspace for query jumbling */ + jstate->jumble = (unsigned char *) palloc(JUMBLE_SIZE); + jstate->jumble_len = 0; + jstate->clocations_buf_size = 32; + jstate->clocations = (LocationLen *) + palloc(jstate->clocations_buf_size * sizeof(LocationLen)); + jstate->clocations_count = 0; + jstate->highest_extern_param_id = 0; + + /* Compute query ID */ + _jumbleNode(jstate, (Node *) expr); + *queryId = DatumGetUInt64(hash_any_extended(jstate->jumble, + jstate->jumble_len, + 0)); + + if (*queryId == UINT64CONST(0)) + *queryId = UINT64CONST(1); + + return jstate; +} + /* * Enables query identifier computation. 
* diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 64c582c344..d782642771 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -22,6 +22,7 @@ #include "miscadmin.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" +#include "nodes/queryjumble.h" #include "optimizer/optimizer.h" #include "parser/analyze.h" #include "parser/parse_agg.h" @@ -43,6 +44,7 @@ /* GUC parameters */ bool Transform_null_equals = false; +bool enable_or_transformation = true; static Node *transformExprRecurse(ParseState *pstate, Node *expr); @@ -99,6 +101,287 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname, static Node *make_nulltest_from_distinct(ParseState *pstate, A_Expr *distincta, Node *arg); +typedef struct OrClauseGroupEntry +{ + Node
Re: Postgres picks suboptimal index after building of an extended statistics
Thanks for the detailed answer, On 3/11/2023 00:37, Tomas Vondra wrote: On 9/25/23 06:30, Andrey Lepikhov wrote: ... I can't stop thinking about this issue. It is bizarre when Postgres chooses a non-unique index if a unique index gives us proof of a minimum scan. That's true, but no one has implemented this heuristic. So the "proof of minimum scan" is merely hypothetical - the optimizer is unaware of it. See the simple patch in the attachment. There, I have attempted to resolve situations of uncertainty to avoid making decisions based solely on the order of indexes in the list. I don't see a reason to teach the clauselist_selectivity() routine to estimate UNIQUE indexes. We would add some cycles, but it would work with btree indexes only. I'm not sure I understand what this is meant to say. Can you elaborate? We only allow UNIQUE for btree indexes anyway, so what exactly is the problem here? Partly, you already answered yourself below: we have unique-index estimation in a few estimation calls but go through the list of indexes each time. Also, for this sake, we would have to add some input parameters, usually NULL, because many estimations don't involve indexes at all. Maybe change compare_path_costs_fuzzily() and add some heuristic, for example: "If the selectivity of both paths gives us no more than 1 row, prefer to use a unique index or an index with the least selectivity." I don't understand how this would work. What do you mean by "selectivity of a path"? AFAICS the problem here is that we estimate a path to return more rows (while we know there can only be 1, thanks to the UNIQUE index). Oops, I meant cardinality. See the patch in the attachment. So how would you know the path does not give us more than 1 row? Surely you're not proposing compare_path_costs_fuzzily() to do something expensive like analyzing the paths, or something. I solely propose to make the optimizer more consistent in its decisions: if we have one row for both path candidates, use the uniqueness of the index or the value of selectivity as one more parameter. Also, how would it work in upper levels? If you just change which path we keep, but leave the inaccurate row estimate in place, that may fix that level, but it's certainly going to confuse later planning steps. It is designed only for the scan level. IMHO the problem here is that we produce a wrong estimate, so we better fix that, instead of adding band-aids to other places. Agreed. I am looking for a solution to help users somehow resolve such problems. As an alternative solution, I can propose a selectivity hook or (maybe even better) the pathlist approach: add indexes into the index list in some predefined order - in the first positions, place unique indexes with more columns, etc. This happens because functional dependencies are a very simple type of statistics - it has some expectations about the input data and also the queries executed on it. For example, it assumes the data is reasonably homogeneous, so that we can calculate a global "degree". But the data in the example directly violates that - it has 1000 rows that are very random (so we'd find no dependencies), and 1000 rows with perfect dependencies. Hence we end up with degree=0.5, which approximates the dependencies over all the data. Not great, true, but that's the price for the simplicity of this statistics kind. So the simplest solution is to disable dependencies on such data sets. It's a bit inconvenient/unfortunate that we build dependencies by default, and people may not be aware of their assumptions. 
Perhaps we could/should make dependency_degree() more pessimistic when we find some "violations" of the rule (we intentionally are not strict about it, because data are imperfect). I don't have a good idea how to change the formulas, but I'm open to the idea in principle. Thanks for the explanation! The other thing we could do is checking for unique indexes in clauselist_selectivity, and if we find a match we can just skip the extended statistics altogether. Not sure how expensive this would be, but for typical cases (with modest number of indexes) perhaps OK. It wouldn't work without a unique index, but I don't have a better idea. It looks a bit expensive for me. But I am ready to try, if current solution doesn't look applicable. -- regards, Andrei Lepikhov Postgres Professional From 21677861e101eed22c829e0b14290a56786a12c2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 22 Nov 2023 12:01:39 +0700 Subject: [PATCH] Use an index path with the best selectivity estimation --- src/backend/optimizer/util/pathnode.c | 40 + .../expected/drop-index-concurrently-1.out| 16 --- src/test/regress/expected/functional_deps.out | 43 +++ src/test/regress/expected/join.out| 40 + src/test/regress/sql/functional_deps.sql | 36 5 files changed, 149 insertions(
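For readers who want to reproduce the degree=0.5 situation described above, a minimal sketch of such a data set (the concrete values are assumptions; only the half-dependent, half-random shape matters):

CREATE TABLE deps (a int, b int);
-- 1000 rows with a perfect dependency a -> b ...
INSERT INTO deps SELECT i, i FROM generate_series(1, 1000) i;
-- ... and 1000 rows where b is random with respect to a.
INSERT INTO deps SELECT i % 10, (random() * 100000)::int
FROM generate_series(1, 1000) i;
CREATE STATISTICS deps_stat (dependencies) ON a, b FROM deps;
ANALYZE deps;
-- dependency_degree() averages the two halves to roughly 0.5 and then
-- applies that degree to estimates over the whole table.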
Re: POC, WIP: OR-clause support for indexes
On 10/11/2023 16:20, Alena Rybakina wrote: I added a log message like this: ERROR: unrecognized node type: 0. I fixed this issue and added some cosmetic refactoring. The changes are presented in the or_patch_changes.diff file. Looking into the patch, I found some trivial improvements (see attachment). Also, it is not obvious that using a string representation of the clause as a hash table key is needed here. Besides, by making a copy of the node in get_key_nconst_node(), you replace the location field, but in the case of a complex expression, you don't do the same with the other nodes. I propose to generate an expression hash instead and prove the equality of two expressions by calling equal(). -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 25a4235dbd..de27d2646c 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -178,6 +178,10 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) HTAB *or_group_htab = NULL; int len_ors = list_length(expr_orig->args); + /* If this is not an 'OR' expression, skip the transformation */ + if (!or_transform_limit || expr_orig->boolop != OR_EXPR || len_ors < 2) + return transformBoolExpr(pstate, (BoolExpr *) expr_orig); + MemSet(&info, 0, sizeof(info)); info.keysize = sizeof(char *); info.entrysize = sizeof(OrClauseGroupEntry); @@ -188,10 +192,6 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig) &info, HASH_ELEM | HASH_FUNCTION | HASH_COMPARE); - /* If this is not an 'OR' expression, skip the transformation */ - if (expr_orig->boolop != OR_EXPR || !or_transform_limit || len_ors == 1 || !or_group_htab) - return transformBoolExpr(pstate, (BoolExpr *) expr_orig); - foreach(lc, expr_orig->args) { Node *arg = lfirst(lc);
Re: [PoC] Reducing planning time when tables have many partitions
On 27/9/2023 14:28, Yuya Watari wrote: Thank you for pointing it out. The ec_source_indexes and ec_derive_indexes are just picked up from the previous patch, and I have not changed their design. I think a similar approach to EquivalenceMembers may be applied to RestrictInfos. I will experiment with them and share a new patch. During the work on committing the SJE feature [1], Alexander Korotkov pointed out a silver lining in this work [2]: he proposed that we shouldn't remove the RelOptInfo from simple_rel_array at all but replace it with an 'Alias', which refers to the kept relation. It can simplify further optimizations that remove redundant parts of the query. [1] https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru [2] https://www.postgresql.org/message-id/capphfdsnabg8cak+nj8akig_+_tt07ecstkb1loa50f0ust...@mail.gmail.com -- regards, Andrei Lepikhov Postgres Professional
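The scale of the problem is easy to observe with a generated partition set; a quick sketch for psql (the table shape and partition count are illustrative):

CREATE TABLE pt (id int, payload text) PARTITION BY RANGE (id);
SELECT format('CREATE TABLE pt_%s PARTITION OF pt FOR VALUES FROM (%s) TO (%s)',
              i, i * 1000, (i + 1) * 1000)
FROM generate_series(0, 999) i \gexec
-- Almost all partitions are pruned, but planner bookkeeping (such as
-- the EquivalenceMember lists discussed here) still grows with the
-- total number of partitions; watch the "Planning Time" line.
EXPLAIN (SUMMARY) SELECT * FROM pt WHERE id = 42;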
Re: A performance issue with Memoize
On 30/10/2023 14:55, Richard Guo wrote: On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov wrote: Have you thought about the case fixed by commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of reusing a PARAM_EXEC slot break that case? Hm, I don't think so. The issue fixed by commit 1db5667 was caused by sharing PARAM_EXEC slots between different levels of NestLoop. AFAICS it's safe to share PARAM_EXEC slots within the same level of NestLoop. The change here is about sharing PARAM_EXEC slots between a subquery's subplan_params and outer-relation variables, which happens within the same level of NestLoop. ... Did you notice a case that the change here breaks? Hi Tom, could you share your insights on this issue and the proposed fix? I think your patch works correctly so far. I mentioned commit 1db5667 because, as I see it, the origin of the problem was parallel workers. I have thought about pushing Memoize down to a parallel worker and couldn't convince myself that such a solution would be correct. Sorry if I disturbed you in vain. -- regards, Andrei Lepikhov Postgres Professional
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 25/10/2023 20:00, Andrei Zubkov wrote: Hi Andrei, On Wed, 2023-10-25 at 13:59 +0700, Andrei Lepikhov wrote: But minmax_stats_since and the changes in the UI of the reset routine look like syntactic sugar here. I can't convince myself that it is really needed. Do you have some set of cases that can justify the proposed changes? Yes, it looks strange, but it is needed in some way. The main purpose of this patch is to provide means for sampling solutions that collect statistics data in a series of samples. The first goal here is to be sure that the statement was not evicted and inserted back between samples (especially between rare samples). This is the purpose of the stats_since field. The second goal is the ability to get all statistics increments for the interval between samples. All statistics provided by pg_stat_statements, with the exception of the min/max values, can be calculated for the interval since the last sample, knowing the values in the last and current samples. We need a way to get the min/max values too. This is achieved by a min/max reset performed by the sampling solution. The minmax_stats_since field is here for the same purpose - the sampling solution needs to be sure that no one has done a min/max reset between samples. And if such a reset was performed, at least we know when it was performed. That is the gist of my question. If needed, you can remove the record by (userid, dbOid, queryId). As I understand it, this extension is usually used by an administrator. Who can reset these parameters except you and the DBMS? -- regards, Andrei Lepikhov Postgres Professional
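A sketch of how a sampling tool would use these fields; the column names and the minmax_only reset argument follow the patch under discussion and are not a committed interface:

-- Between two samples: check that the entry survived (stats_since)
-- and that nobody else reset min/max (minmax_stats_since).
SELECT queryid, stats_since, minmax_stats_since,
       min_exec_time, max_exec_time, calls
FROM pg_stat_statements;
-- After taking a sample, reset only the min/max counters so the next
-- interval gets fresh extremes (signature as proposed by the patch).
SELECT pg_stat_statements_reset(0, 0, 0, minmax_only => true);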
Re: A performance issue with Memoize
On 20/10/2023 17:40, Richard Guo wrote: I noticed $subject with the query below. set enable_memoize to off; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN Nested Loop Left Join (actual time=0.050..59578.053 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.027..2.703 rows=1 loops=1) -> Subquery Scan on s (actual time=0.004..4.819 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.002..3.834 rows=1 loops=1) Planning Time: 0.666 ms Execution Time: 60937.899 ms (8 rows) set enable_memoize to on; explain (analyze, costs off) select * from tenk1 t1 left join lateral (select t1.two as t1two, * from tenk1 t2 offset 0) s on t1.two = s.two; QUERY PLAN -- Nested Loop Left Join (actual time=0.061..122684.607 rows=5000 loops=1) -> Seq Scan on tenk1 t1 (actual time=0.026..3.367 rows=1 loops=1) -> Memoize (actual time=0.011..9.821 rows=5000 loops=1) Cache Key: t1.two, t1.two Cache Mode: binary Hits: 0 Misses: 1 Evictions: Overflows: 0 Memory Usage: 1368kB -> Subquery Scan on s (actual time=0.008..5.188 rows=5000 loops=1) Filter: (t1.two = s.two) Rows Removed by Filter: 5000 -> Seq Scan on tenk1 t2 (actual time=0.004..4.081 rows=1 loops=1) Planning Time: 0.607 ms Execution Time: 124431.388 ms (12 rows) The execution time (best of 3) is 124431.388 VS 60937.899 with and without memoize. The Memoize runtime stats 'Hits: 0 Misses: 1 Evictions: ' seems suspicious to me, so I've looked into it a little bit, and found that the MemoizeState's keyparamids and its outerPlan's chgParam are always different, and that makes us have to purge the entire cache each time we rescan the memoize node. But why are they always different? Well, for the query above, we have two NestLoopParam nodes, one (with paramno 1) is created when we replace outer-relation Vars in the scan qual 't1.two = s.two', the other one (with paramno 0) is added from the subquery's subplan_params, which was created when we replaced uplevel vars with Param nodes for the subquery. That is to say, the chgParam would be {0, 1}. When it comes to replace outer-relation Vars in the memoize keys, the two 't1.two' Vars are both replaced with the NestLoopParam with paramno 1, because it is the first NLP we see in root->curOuterParams that is equal to the Vars in memoize keys. That is to say, the memoize node's keyparamids is {1}. ... Any thoughts? Do you've thought about the case, fixed with the commit 1db5667? As I see, that bugfix still isn't covered by regression tests. Could your approach of a PARAM_EXEC slot reusing break that case? -- regards, Andrei Lepikhov Postgres Professional
Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements
On 19/10/2023 19:40, Andrei Zubkov wrote: Hi hackers, New version 23 attached. It contains a rebase onto the current master. I have examined the patch and the parameters you've proposed. In my opinion, the stats_since parameter adds valuable information and should definitely be included in the stats data, because a statement can be silently removed from the list and inserted again. But minmax_stats_since and the changes in the UI of the reset routine look like syntactic sugar here. I can't convince myself that it is really needed. Do you have some set of cases that can justify the proposed changes? Maybe we should work intensively on adding the 'stats_since' parameter and continue the discussion of the necessity of the other one? -- regards, Andrei Lepikhov Postgres Professional
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 20/10/2023 19:39, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: The only issue I worry about is the uncertainty and clutter that can be created by this feature. In the worst case, when we have a complex error stack (including the extension's CATCH sections, exceptions in stored procedures, etc.), the backend will throw the memory limit error repeatedly. I'm not seeing what additional uncertainty or clutter there is - this is, again, exactly the same as what happens today on a system with overcommit disabled, and I don't feel like we get a lot of complaints about this today. Maybe I missed something, or I see this feature from an alternate point of view (as an extension developer), but the overcommit behaviour is more convenient so far: it kills the process. It means that after a restart, the backend/background worker will have its initial internal state. With this limit enabled, we need to remember that each function call can cause an error, and we have to handle it in static PG_CATCH sections, where we must restore local variables to their initial (?) state. So, it complicates development. Of course, this limit is a good feature, but from my point of view, it would be better to kill a memory-consuming backend instead of throwing an error. At least for now, we don't have a technique to repeat query planning with a chance of building a more efficient plan. -- regards, Andrei Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 22/10/2023 05:01, Alexander Korotkov wrote: On Thu, Oct 19, 2023 at 6:16 AM Andrei Lepikhov wrote: On 19/10/2023 01:50, Alexander Korotkov wrote: This query took 3778.432 ms with self-join removal disabled, and 3756.009 ms with self-join removal enabled. So, no measurable overhead. Similar to the higher number of joins. Can you imagine some extreme case when self-join removal could introduce significant overhead in comparison with other optimizer parts? If not, should we remove self_join_search_limit GUC? Thanks, It was Zhihong Yu who worried about that case [1]. And my purpose was to show a method to avoid such a problem if it would be needed. I guess the main idea here is that we have a lot of self-joins, but only few of them (or no one) can be removed. I can't imagine a practical situation when we can be stuck in the problems here. So, I vote to remove this GUC. I've removed the self_join_search_limit. Anyway there is enable_self_join_removal if the self join removal algorithm causes any problems. I also did some grammar corrections for the comments. I think the patch is getting to the committable shape. I noticed some failures on commitfest.cputube.org. I'd like to check how this version will pass it. I have observed the final patch. A couple of minor changes can be made (see attachment). Also, I see room for improvement, but it can be done later. For example, we limit the optimization to only ordinary tables in this patch. It can be extended at least with partitioned and foreign tables soon. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 6b848aadad..b84197dadb 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -23,7 +23,6 @@ #include "postgres.h" #include "catalog/pg_class.h" -#include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/joininfo.h" @@ -50,7 +49,7 @@ typedef struct UniqueRelInfo } UniqueRelInfo; /* - * The context for replace_varno_walker() containing source and target relids2 + * The context for replace_varno_walker() containing source and target relids. */ typedef struct {
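For context, the class of queries the feature removes; a minimal sketch (the schema is illustrative):

CREATE TABLE emp (id int PRIMARY KEY, dept int, salary int);
-- A self-join on a unique key: every outer row joins exactly to
-- itself, so the join can be eliminated and the quals merged into a
-- single scan of emp.
EXPLAIN SELECT e1.id, e2.salary
FROM emp e1 JOIN emp e2 ON e1.id = e2.id
WHERE e1.dept = 10;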
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 20/10/2023 05:06, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 19/10/2023 02:00, Stephen Frost wrote: * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea; I have doubts about the implementation. I have a question. I see the feature triggers an ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see it, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment, I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocate some memory) before aborting the transaction. So, the allocation error can throw us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. While it's an interesting idea to consider making exceptions to the limit, and perhaps we'll do that (or have some kind of 'reserve' for such cases), this isn't really any different than today, is it? We might have a malloc() failure in the main path, end up in PG_CATCH() and then try to do a CopyErrorData() and have another malloc() failure. If we can rearrange the code to make this less likely to happen, by doing a bit more work to free() resources used in the main path before trying to do new allocations, then, sure, let's go ahead and do that, but that's independent from this effort. I agree that rearranging efforts can be made independently. The code in the letter above was shown just as a demo of the case I'm worried about. IMO, the thing that should be implemented here is a recursion level for the memory limit. If, while processing the error, we fall into recursion with this limit, we should ignore it. I imagine custom extensions that use PG_CATCH() and allocate some data there. At least we can raise the error level to FATAL. Ignoring such would defeat much of the point of this effort- which is to get to a position where we can say with some confidence that we're not going to go over some limit that the user has set and therefore not allow ourselves to end up getting OOM killed. These are all the same issues that already exist today on systems which don't allow overcommit too, there isn't anything new here in regards to these risks, so I'm not really keen to complicate this to deal with issues that are already there. Perhaps once we've got the basics in place then we could consider reserving some space for handling such cases.. but I don't think it'll actually be very clean and what if we have an allocation that goes beyond what that reserved space is anyway? Then we're in the same spot again where we have the choice of either failing the allocation in a less elegant way than we might like to handle that error, or risk getting outright kill'd by the kernel. Of those choices, sure seems like failing the allocation is the better way to go. I've got your point. The only issue I worry about is the uncertainty and clutter that can be created by this feature. 
In the worst case, when we have a complex error stack (including the extension's CATCH sections, exceptions in stored procedures, etc.), the backend will throw the memory limit error repeatedly. Of course, one failed backend looks better than a surprisingly killed postmaster, but the mix of different error reports and details looks terrible and challenging to debug in the case of trouble. So, may we throw a FATAL error if we reach this limit while handling an exception? -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 18/10/2023 16:59, Ashutosh Bapat wrote: On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov wrote: But the clauses of A parameterized by P will produce different translations for each of the partitions. I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolve this issue? Maybe. There are other problematic areas like EvalPlanQual, rescans and reparameterised paths, which can blow up if we use the same RelOptInfo for different scans of the same relation. It will be good to test those. Yeah, now I got it. It is already the second place where I see a reference to a kind of hidden rule that an RTE entry (or RelOptInfo) must correspond to only one plan node. I don't have a quick answer for now - maybe it is a kind of architectural agreement - and I will consider this issue during development. Also, A need not be a simple relation; it could be a join as well. For a join RelOptInfo, as well as for any subtree, we have the same logic: the pathlist of this subtree is already formed during the previous level of the search and will not be changed. The relid is also used to track the scans at the executor level. Since we have so many scans on A, each may be using a different plan, so we will need different ids for those. I don't understand this sentence. In which way does the executor use this index of RelOptInfo? See Scan::scanrelid -- regards, Andrey Lepikhov Postgres Professional
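Restating the discussed shape in SQL terms, a sketch (A is the non-partitioned side, P the partitioned one; names are illustrative):

CREATE TABLE a (id int, val int);
CREATE TABLE p (id int, val int) PARTITION BY HASH (id);
CREATE TABLE p0 PARTITION OF p FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 2, REMAINDER 1);
-- Asymmetric partition-wise join would plan this as
--   Append(JOIN(a, p0), JOIN(a, p1))
-- reusing the same RelOptInfo for "a" in every child join, which is
-- exactly where the one-RelOptInfo-per-scan concern above arises.
EXPLAIN SELECT * FROM a JOIN p USING (id);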
Re: Removing unneeded self joins
On 19/10/2023 01:50, Alexander Korotkov wrote: On Mon, Oct 16, 2023 at 11:28 AM Andrei Lepikhov wrote: On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote: On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points that infrastructure from [2] might > > be useful. The patchset from [2] seems committed mow. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding IS NOT NULL clause for the columns, that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes to achieve the desired behaviour. Some explains in regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of walker to handle such clauses. But It doesn't process the RestrictInfo node ... I'm inclined to put a solution of this issue off for a while. OK. I think it doesn't worth to eliminate IS NULL quals with this complexity (at least at this stage of work). I made improvements over the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Also, I found that the suggestion from David Rowley [1] to qsort array of relations to faster find duplicates is still unaddressed. I've implemented it. That helps to evade quadratic complexity with large number of relations. Also I've incorporated improvements from Alena Rybakina except one for skipping SJ removal when no SJ quals is found. It's not yet clear for me if this check fix some cases. But at least optimization got skipped in some useful cases (as you can see in regression tests). I would like to propose one more minor improvement (see in attachment). The idea here is that after removing a self-join and changing clauses we should re-probe the set of relids with the same Oid, because we can find more removable self-joins (see the demo test in join.sql). Thank you, I've integrated this into the patch. BTW, the patch introduces two new GUC variables: enable_self_join_removal, self_join_search_limit. enable_self_join_removal variable turns on/off optimization at all. self_join_search_limit variable limits its usage by the number of joins. AFICS, self_join_search_limit is intended to protect us from quadratic complexity self-join removal has. I tried to reproduce the extreme case. SELECT count(*) FROM pgbench_accounts a0, pgbench_accounts a1, ..., pgbench_accounts a100 WHERE a0.aid = 1 AND a1.aid = a0.aid + 1 AND ... 
AND a100.aid = a99.aid + 1; This query took 3778.432 ms with self-join removal disabled, and 3756.009 ms with self-join removal enabled. So, no measurable overhead. Similar to the higher number of joins. Can you imagine some extreme case when self-join removal could introduce significant overhead in comparison with other optimizer parts? If not, should we remove self_join_search_limit GUC? Thanks, It was Zhihong Yu who worried about that case [1]. And my purpose was to show a method to avoid such a problem if it would be needed. I guess the main idea here is that we have a lot of self-joins, but only few of them (or no one) can be removed. I can't imagine a practical situation when we can be stuck in the problems here. So, I vote to remove this GUC. [1] https://www.postgresql.org/message-id/CALNJ-vTyL-LpvSOPZxpC63Et3LJLUAFZSfRqGEhT5Rj7_EEj7w%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 19/10/2023 02:00, Stephen Frost wrote: Greetings, * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote: On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea; I have doubts about the implementation. I have a question. I see the feature triggers an ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see it, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment, I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocate some memory) before aborting the transaction. So, the allocation error can throw us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. While it's an interesting idea to consider making exceptions to the limit, and perhaps we'll do that (or have some kind of 'reserve' for such cases), this isn't really any different than today, is it? We might have a malloc() failure in the main path, end up in PG_CATCH() and then try to do a CopyErrorData() and have another malloc() failure. If we can rearrange the code to make this less likely to happen, by doing a bit more work to free() resources used in the main path before trying to do new allocations, then, sure, let's go ahead and do that, but that's independent from this effort. I agree that rearranging efforts can be made independently. The code in the letter above was shown just as a demo of the case I'm worried about. IMO, the thing that should be implemented here is a recursion level for the memory limit. If, while processing the error, we fall into recursion with this limit, we should ignore it. I imagine custom extensions that use PG_CATCH() and allocate some data there. At least we can raise the error level to FATAL. -- regards, Andrey Lepikhov Postgres Professional
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 18/10/2023 13:39, Richard Guo wrote: On Fri, Oct 13, 2023 at 6:18 PM Andrei Lepikhov wrote: On 23/8/2023 12:37, Richard Guo wrote: > To fix it we may need to modify RelOptInfos for Path, BitmapHeapPath, > ForeignPath and CustomPath, and modify IndexOptInfos for IndexPath. It > seems that that is not easily done without postponing reparameterization > of paths until createplan.c. > > Attached is a patch which is v5 + fix for this new issue. Having looked into the patch for a while, I couldn't answer a few (possibly naive) questions for myself: Thanks for reviewing this patch! I think these are great questions. 1. If we postpone parameterization until plan creation, why do we still copy the path node in the FLAT_COPY_PATH macro? Do we really need it? Good point. The NestPath's original inner path should not be referenced any more after the reparameterization, so it seems safe to adjust the path itself, without the need for a flat copy. I've done that in the v8 patch. 2. I see big switches on path nodes. Might it be time to create a path_walker function? I recall some thread where such a suggestion was declined, but I don't remember why. I'm not sure. But this seems a separate topic, so maybe it's better to discuss it separately? Agree. 3. Clause changing is postponed. Why does it not influence calc_joinrel_size_estimate? We will use statistics on the parent table here. Am I wrong? Hmm, I think the reparameterization does not change the estimated size/costs. Quoting the comment ... Agree. I have looked at the code and figured it out - you're right. But it seems strange: maybe I don't understand something. Why not estimate selectivity for parameterized clauses based on the leaf partition statistics, not the parent ones? -- regards, Andrey Lepikhov Postgres Professional
Re: Allow ALTER SYSTEM SET on unrecognized custom GUCs
On 18/10/2023 12:15, Tom Lane wrote: Andrei Lepikhov writes: "SET foo.bar TO 'smth'" can immediately alter the placeholder's value. But what is the reason that "ALTER SYSTEM SET foo.bar TO 'smth'" doesn't do the same? Because it's not supposed to take effect until you issue a reload command (and maybe not even then, depending on which GUC we're talking about). I certainly think it wouldn't make sense for your own session to adopt the value ahead of others. Thanks for the answer. Introducing the assignable_custom_variable_name can be helpful. The code looks good. I think it deserves to be committed - after the indentation fix, of course. -- regards, Andrey Lepikhov Postgres Professional
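For the record, the asymmetry boils down to this sequence; pg_reload_conf() is the reload step Tom refers to:

SET foo.bar TO 'smth';               -- takes effect immediately, in this session
ALTER SYSTEM SET foo.bar TO 'smth';  -- only writes postgresql.auto.conf
SELECT pg_reload_conf();             -- other sessions pick the value up now
SHOW foo.bar;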
Re: Asymmetric partition-wise JOIN
On 17/10/2023 17:09, Ashutosh Bapat wrote: On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov wrote: On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation. Normally we do that using different Relids but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems are you afraid of here? Why do we need a range table entry for each scan of the non-partitioned relation? Not an RTE but a RelOptInfo. Using the same example as Alexander Korotkov, let's say A is the non-partitioned table and P is a partitioned table with partitions P1, P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ... APn. Each of these joins may have different properties and thus will require creating paths. In order to save these paths, we need RelOptInfos, which are identified by relids. Let's assume that the relids of these join RelOptInfos are created by the union of the relid of A and the relid of Px (the partition being joined). This is notionally misleading but doable. Ok, now I see your concern. In the current patch we build a RelOptInfo for each JOIN(A, Pi) with the build_child_join_rel() routine. And of course, they all have different sets of cheapest paths (it is one more point of optimality). At this point the RelOptInfo of relation A is fully formed and upper joins use the pathlist "as is", without changes. But the clauses of A parameterized by P will produce different translations for each of the partitions (see the sketch below). I think we will need different RelOptInfos (for A) to store these translations. Does the answer above resolve this issue? The relid is also used to track the scans at the executor level. Since we have so many scans on A, each may use a different plan, so we will need different ids for those. I don't understand this sentence. In what way does the executor use this index of the RelOptInfo? -- regards, Andrey Lepikhov Postgres Professional
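To make the point about per-partition translations concrete, a rough sketch of the loop shape; find_appinfos_by_relids(), adjust_appendrel_attrs() and build_child_join_rel() are existing planner routines (signatures abbreviated from memory), but the skeleton around them and the variable names are invented for illustration:

/* Hypothetical sketch: a separate join rel per (A, Pi) pair. */
for (int i = 0; i < nparts; i++)
{
	RelOptInfo *child = part_rels[i];
	int			nappinfos;
	AppendRelInfo **appinfos = find_appinfos_by_relids(root,
													   child->relids,
													   &nappinfos);

	/* Clauses written against the parent P are translated to child Pi... */
	List	   *child_restrictlist = (List *)
		adjust_appendrel_attrs(root, (Node *) parent_restrictlist,
							   nappinfos, appinfos);

	/* ...while the non-partitioned side A keeps its RelOptInfo as is. */
	(void) build_child_join_rel(root, rel_A, child, parent_joinrel,
								child_restrictlist, sjinfo);
	pfree(appinfos);
}

The open question in the discussion is whether the translated clauses that parameterize A also need per-partition storage on A's side, which a single RelOptInfo for A cannot provide.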
Re: Allow ALTER SYSTEM SET on unrecognized custom GUCs
On 17/10/2023 07:19, Tom Lane wrote: Currently we have this odd behavior (for a superuser): regression=# ALTER SYSTEM SET foo.bar TO 'baz'; ERROR: unrecognized configuration parameter "foo.bar" regression=# SET foo.bar TO 'baz'; SET regression=# ALTER SYSTEM SET foo.bar TO 'baz'; ALTER SYSTEM That is, you can't ALTER SYSTEM SET a random custom GUC unless there is already a placeholder GUC for it, because the find_option call in AlterSystemSetConfigFile fails. This is surely pretty inconsistent. Either the first ALTER SYSTEM SET ought to succeed, or the second one ought to fail too, because we don't have any more knowledge about the custom GUC than we did before. In the original discussion about this [1], I initially leaned towards "they should both fail", but I reconsidered: there doesn't seem to be any harm in allowing ALTER SYSTEM SET to succeed for any custom GUC name, as long as you're superuser. Hence, attached is a patch for that. Much of it is refactoring to avoid duplicating the code that checks for a reserved GUC name, which I think should still be done here --- otherwise, we're losing a lot of the typo detection that that check was intended to provide. (That is, if you have loaded an extension that defines "foo" as a prefix, we should honor the extension's opinion about whether "foo.bar" is valid.) I also fixed the code for GRANT ON PARAMETER so that it follows the same rules and throws the same errors for invalid cases. There's a chunk of AlterSystemSetConfigFile that now needs indenting one more tab stop, but I didn't do that yet for ease of review. Thoughts? I have reviewed this patch. It looks good in general. Now, we can change the placeholder value with the SET command and have one more (perhaps unusual) way to pass some data through the session. Setting aside the reason why the DBMS allows such behaviour, I have one question: "SET foo.bar TO 'smth'" can immediately alter the placeholder's value. But what is the reason that "ALTER SYSTEM SET foo.bar TO 'smth'" doesn't do the same? -- regards, Andrey Lepikhov Postgres Professional
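For readers following along, the shared name check being described boils down to something like the following sketch; the function and helper names are invented for illustration, and the committed code's shape differs:

/* Hypothetical sketch of the shared validity rule for custom GUC names. */
static bool
custom_name_is_assignable(const char *name)
{
	const char *dot = strchr(name, '.');

	/* Must be a qualified two-part name with non-empty components. */
	if (dot == NULL || dot == name || dot[1] == '\0')
		return false;

	/*
	 * If a loaded extension has reserved the prefix, only names the
	 * extension already defined are acceptable; for unreserved prefixes
	 * any name is allowed (a placeholder will be created).
	 */
	if (prefix_is_reserved_by_extension(name, dot - name))	/* invented */
		return guc_name_is_defined(name);					/* invented */
	return true;
}

The point of the refactoring is that SET, ALTER SYSTEM SET and GRANT ON PARAMETER all apply this one rule instead of three divergent checks.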
Re: Asymmetric partition-wise JOIN
On 16/10/2023 23:21, Ashutosh Bapat wrote: On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov Whenever I visited this idea, I hit one issue prominently - how would we differentiate different scans of the non-partitioned relation? Normally we do that using different Relids, but in this case we wouldn't be able to know the number of such relations involved in the query unless we start planning such a join. It's late to add new base relations and assign them new Relids. Of course I haven't thought hard about it. I haven't looked at the patch to see whether this problem is solved and how. I'm curious, which type of problems are you afraid of here? Why do we need a range table entry for each scan of the non-partitioned relation? -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote: On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes needed to achieve the desired behaviour. Some explains in the regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. OK. I think it isn't worth eliminating these IS NOT NULL quals with this complexity (at least at this stage of work). I made improvements to the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Also, I found that the suggestion from David Rowley [1] to qsort the array of relations to find duplicates faster is still unaddressed. I've implemented it. That helps to avoid quadratic complexity with a large number of relations (a simplified sketch of the idea follows the attached diff). Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). I would like to propose one more minor improvement (see the attachment). The idea here is that after removing a self-join and changing clauses we should re-probe the set of relids with the same Oid, because we can find more removable self-joins (see the demo test in join.sql). 
-- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 7b8dc7a2b7..f7ccda5231 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -2298,6 +2298,7 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove) { /* Create group of relation indexes with the same oid */ Relids group = NULL; + Relids removed; while (i < j) { @@ -2306,8 +2307,21 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove) } relids = bms_del_members(relids, group); - toRemove = bms_add_members(toRemove, - remove_self_joins_one_group(root, group)); + + /* +* Try to remove self-joins from a group of identical entries. +* Make next attempt iteratively - if something is deleted from +* a group, changes in clauses and equivalence classes can give +* us a chance to find more candidates. +*/ + do { + Assert(!bms_overlap(group, toRemove)); + removed = remove_self_joins_one_group(root, group); + toRemove = bms_add_members(toRemove, removed); + group = bms_del_members(group, removed); + } while (!bms_is_empty(removed) && +bms_membership(group) == BMS_MULTIPLE); + bms_free(
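As promised above, a simplified sketch of the qsort-based duplicate search; SelfJoinCandidate is the type the patch introduces (it shows up in typedefs.list), but the field names and the surrounding fragments here are guesses for illustration only:

/* Hypothetical sketch: group base rels by table Oid in O(N log N). */
typedef struct SelfJoinCandidate
{
	Oid			reloid;			/* pg_class OID of the base relation
								 * (assumed field) */
	int			relid;			/* index of the relation in the range table */
} SelfJoinCandidate;

static int
self_join_candidates_cmp(const void *a, const void *b)
{
	const SelfJoinCandidate *ca = (const SelfJoinCandidate *) a;
	const SelfJoinCandidate *cb = (const SelfJoinCandidate *) b;

	if (ca->reloid != cb->reloid)
		return (ca->reloid < cb->reloid) ? -1 : 1;
	return 0;
}

/* After filling candidates[0..n-1] from the joinlist: */
qsort(candidates, n, sizeof(SelfJoinCandidate), self_join_candidates_cmp);

/*
 * Adjacent entries with equal reloid now form the groups that feed
 * remove_self_joins_one_group(), instead of comparing every pair of
 * relations.
 */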
Re: Asymmetric partition-wise JOIN
On 15/10/2023 17:25, Alexander Korotkov wrote: On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov wrote: Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique for some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model the cases where an asymmetric join would give significant performance gains. Great! I'm looking forward to the revised patch. Before preparing a new patch, it would be better to find common ground on the following issue: so far, this optimization stands aside, proposing an alternative path for a join RelOptInfo if we have an underlying append path on the outer side. My back-burner idea is redesigning the approach: an asymmetric join doesn't change the partitioning scheme or the bounds of the partitioned side. So, it seems natural to make it a part of the partitionwise-join machinery and implement it as part of the try_partitionwise_join / generate_partitionwise_join_paths routines. -- regards, Andrey Lepikhov Postgres Professional
Re: Asymmetric partition-wise JOIN
On 15/10/2023 07:18, Alexander Korotkov wrote: Hi Alexander, Hi Andrey, Thank you for your work on this subject. On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov wrote: The patch no longer applies cleanly, so I rebased it. Attaching the rebased version. Not surprising that the patch doesn't apply after 1.5 years since the last message. Could you please rebase it? I read the thread and the patch. The patch improves the joining of partitioned tables with non-partitioned relations. Let's denote the non-partitioned relation as A, and the partitions as P1 ... PN. The patch allows Append(Join(A, P1), ..., Join(A, PN)) instead of Join(A, Append(P1, ..., PN)). That could be cheaper because it's generally cheaper to join small pieces rather than do one big join. The drawback is the need to scan A multiple times. But is this really necessary and acceptable? Let's consider multiple options. 1) A is not a table. For instance, A is a function scan. In this case, doing multiple scans of A is not just expensive, but could lead to unexpected side effects. When the user includes a function once in the FROM clause, she expects this function to be evaluated once. I propose that we should materialize a scan of non-table relations (a sketch follows after this message). So, the materialized representation will be scanned multiple times, but the source only once. That would be similar to a CTE. 2) A is a table to be scanned with a parameterized path in the inner part of a nested loop join. In this case, there is no big scan of A and nothing to materialize. 3) A is a table to be used in a merge join or the outer part of a nested loop join. In this case, it would be nice to consider materialization. It's not always good to materialize, because materialization has its additional costs. I think that could be a cost-based decision. 4) A is used in a hash join. Could we re-use the hashed representation of A between multiple joins? I read upthread that it was proposed to share a hash table between multiple background workers via shared memory. But the first step would be to just share it between multiple join nodes within the same process. As we consider joining with each partition individually, different join methods could be chosen. As I understand it, the current patch considers joining with each of the partitions as a separate isolated optimization task. However, if we share resources between the multiple joins, then a need arises for some global optimization. For instance, a join type could be expensive when applied to an individual partition, but cheap when applied to all the partitions thanks to saving the common work. My idea is to consider generated common resources (such as materialized scans) as a property of the path. For instance, if the nested loop join is cheaper than the hash join, but the hash join generates a common hash map of table A, we don't drop the hash join immediately from consideration and leave it to see how it could help join other partitions. What do you think? Thanks for such detailed feedback! The rationale for this patch was to give the optimizer additional ways to push down more joins into foreign servers. And, because of asynchronous append, the benefit of that optimization was obvious. Unfortunately, we hadn't found other applications for this feature, which was why this patch was postponed in the core. You have brought new ideas about applying this idea locally. Moreover, the main issue of the patch was massive memory consumption in the case of many joins and partitions - because of reparameterization. 
But now, postponing the reparameterization proposed in the thread [1] resolves that problem and gives some insights into the reparameterization technique for some fields, like lateral references. Hence, I think we can restart this work. The first thing here (after rebase, of course) is to figure out and implement in the cost model the cases where an asymmetric join would give significant performance gains. [1] Oversight in reparameterize_path_by_child leading to executor crash https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
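As for the materialization of a non-table outer side mentioned in option 1 above, a minimal sketch of the idea; create_material_path() is the existing planner routine, while the condition and the surrounding fragment are invented for illustration:

/*
 * Hypothetical sketch: evaluate the non-partitioned side once and rescan
 * the materialized result for every per-partition join.
 */
Path	   *outer_path = outer_rel->cheapest_total_path;

if (outer_rel->rtekind == RTE_FUNCTION)		/* illustrative condition */
	outer_path = (Path *) create_material_path(outer_rel, outer_path);

/*
 * outer_path then joins with each of P1 ... Pn; the function scan itself
 * runs only once, similar to a CTE.
 */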
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 23/8/2023 12:37, Richard Guo wrote: To fix it we may need to modify RelOptInfos for Path, BitmapHeapPath, ForeignPath and CustomPath, and modify IndexOptInfos for IndexPath. It seems that that is not easily done without postponing reparameterization of paths until createplan.c. Attached is a patch which is v5 + fix for this new issue. Having looked into the patch for a while, I couldn't answer some naive questions for myself: 1. If we postpone parameterization until the plan creation, why do we still copy the path node in the FLAT_COPY_PATH macros? Do we really need it? 2. I see big switches on path nodes. Might it be time to create a path_walker function? I recall some thread where such a suggestion was declined, but I don't remember why. 3. Clause changing is postponed. Why does it not influence the calc_joinrel_size_estimate? We will use statistics on the parent table here. Am I wrong? -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 13/10/2023 15:56, a.rybakina wrote: Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). Agree. I can't say I like it either. But also, I suggest skipping some unnecessary assertions proposed in that patch: Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all the negative numbers, at least? Assert(is_opclause(orinfo->clause)); - above we skip clauses with rinfo->mergeopfamilies == NIL. Each mergejoinable clause is already checked as is_opclause. All these changes (see the attachment) are optional. I don't mind about the asserts, maybe I misunderstood something in the patch. About skipping SJ removal when no SJ quals are found, I assume it is about this: split_selfjoin_quals(root, restrictlist, , , inner->relid, outer->relid); + if (list_length(selfjoinquals) == 0) + { + /* + * XXX: + * we would detect a self-join without quals like 'x==x' if we had + * a foreign key constraint on some of the other quals and this join + * didn't have any columns from the outer in the target list. + * But it is still a complex task. + */ + continue; + } As far as I remember, this is the place where we check that the SJ list is empty, and it is logical, in my opinion, that no transformations should be performed if no elements are found for them. You forget we have the "degenerate" case, as Alexander mentioned above. What if you have something like this: SELECT ... FROM A a1, A a2 WHERE a1.id=1 AND a2.id=1; In this case, uniqueness can be achieved by the baserestrictinfo "A.id=1", if we have a unique index on this column. -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. OK. I think it isn't worth eliminating these IS NOT NULL quals with this complexity (at least at this stage of work). Yeah. I think it would be meaningful in the case of also replacing nested x IS NOT NULL with nothing. But it requires using a mutator instead of the walker and may be done more accurately next time. I made improvements to the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Great! Also, I found that the suggestion from David Rowley [1] to qsort the array of relations to find duplicates faster is still unaddressed. I've implemented it. That helps to avoid quadratic complexity with a large number of relations. I see. The thread is too long so far; thanks for the catch. Also I've incorporated improvements from Alena Rybakina except the one for skipping SJ removal when no SJ quals are found. It's not yet clear to me if this check fixes some cases. But at least the optimization got skipped in some useful cases (as you can see in the regression tests). Agree. I can't say I like it either. But also, I suggest skipping some unnecessary assertions proposed in that patch: Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all the negative numbers, at least? Assert(is_opclause(orinfo->clause)); - above we skip clauses with rinfo->mergeopfamilies == NIL. Each mergejoinable clause is already checked as is_opclause. All these changes (see the attachment) are optional. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index f0746f35a3..7b8dc7a2b7 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -1710,8 +1710,6 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, List *binfo_candidates = NIL; ReplaceVarnoContext ctx = {.from = toRemove->relid,.to = toKeep->relid}; - Assert(toKeep->relid != -1); - /* * Replace index of removing table with the keeping one. The technique of * removing/distributing restrictinfo is used here to attach just appeared @@ -2017,8 +2015,6 @@ match_unique_clauses(PlannerInfo *root, RelOptInfo *outer, List *uclauses, /* Don't consider clauses which aren't similar to 'F(X)=G(Y)' */ continue; - Assert(is_opclause(orinfo->clause)); - oclause = bms_is_empty(orinfo->left_relids) ? get_rightop(orinfo->clause) : get_leftop(orinfo->clause); c2 = (bms_is_empty(orinfo->left_relids) ? 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 2ff4881fdf..96ebd6eed3 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -367,7 +367,6 @@ CatalogId CatalogIdMapEntry CatalogIndexState ChangeVarNodes_context -ReplaceVarnoContext CheckPoint CheckPointStmt CheckpointStatsData @@ -2341,6 +2340,7 @@ ReorderBufferUpdateProgressTxnCB ReorderTuple RepOriginId ReparameterizeForeignPathByChild_function +ReplaceVarnoContext ReplaceVarsFromTargetList_context ReplaceVarsNoMatchOption ReplicaIdentityStmt @@ -2474,6 +2474,7 @@ SeenRelsEntry SelectLimit SelectStmt Selectivity +SelfJoinCandidate SemTPadded SemiAntiJoinFactors SeqScan
Re: Removing unneeded self joins
On 11/10/2023 02:29, Alena Rybakina wrote: I have reviewed your patch and I noticed a few things. Thanks for your efforts. I have looked at the latest version of the code. I assume that the error lies in the replace_varno_walker function, especially in the place where we check the node by type Var and do not form any NullTest node. It's not a bug, it's an optimization we discussed with Alexander above. Secondly, I added some code in some places to catch erroneous cases, and added a condition for when we should not try to apply the self-join-removal transformation: when the self-join list turns out to be empty after searching for it, and in general when there are no joins in the query. Besides, I added a query for testing and wrote about it above. I have attached my diff file. Ok, I will look at this. In addition, I found a comment for myself that was not clear to me. I would be glad if you could explain it to me. You mentioned a superior outer join in the comment; unfortunately, I didn't find anything about it in the PostgreSQL code, and this explanation remained unclear to me. Could you explain in more detail what you meant? I meant here that only clauses pushed by reconsider_outer_join_clauses() can be found in the joininfo list, and they are not relevant, as you can understand. Having written that, I realized that it was a false statement. ;) - joininfo can also contain results of previous SJE iterations, look: CREATE TABLE test (oid int PRIMARY KEY); CREATE UNIQUE INDEX ON test((oid*oid)); explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c1.oid=c2.oid AND c1.oid*c2.oid=c3.oid*c3.oid; explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c1.oid=c3.oid AND c1.oid*c3.oid=c2.oid*c2.oid; explain SELECT count(*) FROM test c1, test c2, test c3 WHERE c3.oid=c2.oid AND c3.oid*c2.oid=c1.oid*c1.oid; Having executed this SQL code, you can see that in the last query the SJE feature didn't delete one of the JOINs, for the reason I wrote above. It's not a one-minute fix - I will try to propose a solution a bit later. -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 4/10/2023 14:34, Alexander Korotkov wrote: > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. To be more precise, in the attachment, you can find a diff to the main patch, which shows the volume of changes needed to achieve the desired behaviour. Some explains in the regression tests shifted. So, I've made additional tests: DROP TABLE test CASCADE; CREATE TABLE test (a int, b int not null); CREATE UNIQUE INDEX abc ON test(b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; CREATE UNIQUE INDEX abc1 ON test(a,b); explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); DROP INDEX abc1; explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of the walker to handle such clauses. But it doesn't process the RestrictInfo node ... I'm inclined to put off a solution to this issue for a while. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index a127239d30..c12aa15fc9 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -73,7 +73,7 @@ static bool is_innerrel_unique_for(PlannerInfo *root, List *restrictlist, List **extra_clauses); static Bitmapset *replace_relid(Relids relids, int oldId, int newId); -static void replace_varno(Node *node, int from, int to); +static void replace_varno(PlannerInfo *root, Node *node, int from, int to); /* @@ -388,7 +388,7 @@ remove_rel_from_query_common(PlannerInfo *root, RelOptInfo *rel, } /* Update lateral references. */ - replace_varno((Node *) otherrel->lateral_vars, relid, subst); + replace_varno(root, (Node *) otherrel->lateral_vars, relid, subst); } /* @@ -425,7 +425,7 @@ remove_rel_from_query_common(PlannerInfo *root, RelOptInfo *rel, sjinf->commute_below_l = replace_relid(sjinf->commute_below_l, ojrelid, subst); sjinf->commute_below_r = replace_relid(sjinf->commute_below_r, ojrelid, subst); - replace_varno((Node *) sjinf->semi_rhs_exprs, relid, subst); + replace_varno(root, (Node *) sjinf->semi_rhs_exprs, relid, subst); } /* @@ -1399,6 +1399,7 @@ is_innerrel_unique_for(PlannerInfo *root, typedef struct ReplaceVarnoContext { + PlannerInfo *root; int from; int to; } ReplaceVarnoContext; @@ -1420,6 +1421,11 @@ replace_varno_walker(Node *node, ReplaceVarnoContext *ctx) } return false; } + + /* +* Expression walker doesn't know about RestrictInfo node. Do recursive pass +* into the clauses manually. 
+*/ if (IsA(node, RestrictInfo)) { RestrictInfo *rinfo = (RestrictInfo *) node; @@ -1429,20 +1435,26 @@ replace_varno_walker(Node *node, ReplaceVarnoContext *ctx) if (bms_is_member(ctx->from, rinfo->clause_relids)) { - replace_varno((Node *) rinfo->clause, ctx->from, ctx->to); - replace_varno((Node *) rinfo->orclause, ctx->from, ctx->to); - rinfo->clause_relids = replace_relid(rinfo->clause_relids, ctx->from, ctx->to); - rinfo->left_relids = replace_relid(rinfo->left_relids, ctx->from, ctx->to); - rinfo->right_relids = replace_relid(rinfo->right_relids, ctx->from, ctx->to); + replace_varno(ctx->root, (Node *) rinfo->clause, ctx->from, ctx->to); + replace_varno(ctx->root, (Node *) rinfo->orclause, ctx->from, ctx->to); + rinfo->clause_relids = + replace_relid(rinfo->clause_relids, ctx->from, ctx->to); + rinfo->left_relids = + replace_relid(rinfo->left_relids, ctx->from, ctx->to); +
Re: Removing unneeded self joins
On 4/10/2023 14:34, Alexander Korotkov wrote: Hi! On Wed, Oct 4, 2023 at 9:56 AM Andrei Lepikhov wrote: > On 4/10/2023 07:12, Alexander Korotkov wrote: > > On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov > > wrote: > >> On 5/7/2023 21:28, Andrey Lepikhov wrote: > >>> During the significant code revision in v.41 I lost some replacement > >>> operations. Here is the fix and extra tests to check this in the future. > >>> Also, Tom added the JoinDomain structure five months ago, and I added > >>> code to replace relids for that place too. > >>> One more thing, I found out that we didn't replace SJs, defined by > >>> baserestrictinfos, if not a single self-join clause existed for the join. > >>> Now, it is fixed, and the test has been added. > >>> To understand the changes readily, see the delta file in the attachment. > >> Here is a new patch in the attachment. Rebased on current master and some > >> minor gaffes are fixed. > > > > I went through the thread and I think the patch is getting into better shape. A > > couple of notes from my side. > > 1) Why does replace_relid() make a copy of relids only on insert/replace of > > a member, but perform deletion in place? > > Shortly speaking, it was done according to the 'Paranoid' strategy. > The main reason for copying before deletion was the case with the rinfo > required_relids and clause_relids. They both point to the same Bitmapset > in some cases. And we feared such things for other fields. > Right now, it may be redundant because we resolved the issue mentioned > above in replace_varno_walker. OK, but my point is still that you should be paranoid in all the cases or none of the cases. Right now the (newId < 0) branch doesn't copy the source relids, but the bms_is_member(oldId, relids) branch does copy. Also, I think whether we copy or not should be reflected in the function comment. /* * Substitute oldId with newId in relids. */ static Bitmapset * replace_relid(Relids relids, int oldId, int newId) { if (oldId < 0) return relids; if (newId < 0) /* Delete relid without substitution. */ return bms_del_member(relids, oldId); if (bms_is_member(oldId, relids)) return bms_add_member(bms_del_member(bms_copy(relids), oldId), newId); return relids; } We tried to use replace_relid() for both cases of JOIN deletion: unneeded outer join and self-join, and the relid deletion is used only in the first case. Such an approach was used there for a long time, and we just didn't change it. I am inclined to remove the copy operation in the relid replacement code. > Relid replacement machinery is the most contradictory code here. We used > a utilitarian approach and implemented a simplistic variant. > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > they are not necessary. [1] points out that infrastructure from [2] might > > be useful. The patchset from [2] seems committed now. However, I > > can't see it is directly helpful in this matter. Could we just skip > > adding an IS NOT NULL clause for the columns that have > > pg_attribute.attnotnull set? > Thanks for the links, I will look into that case. Thanks for the curious issue. The new field Var::varnullingrels introduced in [2] doesn't make sense here, as I see: we operate with plain relations only, and I don't know how it can be applied to an arbitrary subtree containing OUTER JOINs. The second option, the attnotnull flag, can be used in this code. 
We haven't implemented it because the process_equivalence routine doesn't check attnotnull before creating a NullTest. In general, it is not a difficult operation - we just need to add a trivial get_attnotnull() routine to lsyscache.c, like get_attgenerated() and other functions (a sketch follows below). But replace_varno uses the walker to change the relid. The mentioned replacement, like X=X --> X IS NOT NULL, can be applied at different levels of the expression, look: A a1 JOIN A a2 ON (a1.id=a2.id) WHERE (a1.x AND (a1.y=a2.y)) Here, we can replace id=id and y=y. It may need some 'unwanted clauses' collection procedure and a second pass through the expression tree to remove them. It may add some unpredictable overhead. We can replace such a clause with a trivial 'TRUE' clause, of course. But is it the feature you have requested? -- regards, Andrey Lepikhov Postgres Professional
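For what it's worth, the lsyscache.c routine really is trivial; a sketch modeled on get_attgenerated() could look like this (an illustration, not code from any posted patch):

/*
 * get_attnotnull
 *
 *		Given the relation id and the attribute number,
 *		return the "attnotnull" field from the attribute relation.
 */
bool
get_attnotnull(Oid relid, AttrNumber attnum)
{
	HeapTuple	tp;
	Form_pg_attribute att_tup;
	bool		result;

	tp = SearchSysCache2(ATTNUM,
						 ObjectIdGetDatum(relid),
						 Int16GetDatum(attnum));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
			 attnum, relid);
	att_tup = (Form_pg_attribute) GETSTRUCT(tp);
	result = att_tup->attnotnull;
	ReleaseSysCache(tp);
	return result;
}

With that in place, the SJE code could consult attnotnull and skip generating a NullTest for columns already declared NOT NULL.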
Re: Removing unneeded self joins
On 4/10/2023 07:12, Alexander Korotkov wrote: Hi! Thanks for the review! I think this is a neat optimization. On Tue, Sep 12, 2023 at 4:58 PM Andrey Lepikhov wrote: On 5/7/2023 21:28, Andrey Lepikhov wrote: During the significant code revision in v.41 I lost some replacement operations. Here is the fix and extra tests to check this in the future. Also, Tom added the JoinDomain structure five months ago, and I added code to replace relids for that place too. One more thing, I found out that we didn't replace SJs, defined by baserestrictinfos, if not a single self-join clause existed for the join. Now, it is fixed, and the test has been added. To understand the changes readily, see the delta file in the attachment. Here is a new patch in the attachment. Rebased on current master and some minor gaffes are fixed. I went through the thread and I think the patch is getting into better shape. A couple of notes from my side. 1) Why does replace_relid() make a copy of relids only on insert/replace of a member, but perform deletion in place? Shortly speaking, it was done according to the 'Paranoid' strategy. The main reason for copying before deletion was the case with the rinfo required_relids and clause_relids. They both point to the same Bitmapset in some cases. And we feared such things for other fields. Right now, it may be redundant because we resolved the issue mentioned above in replace_varno_walker. Relid replacement machinery is the most contradictory code here. We used a utilitarian approach and implemented a simplistic variant (a consistently-copying alternative is sketched after this message). 2) It would be nice to skip the insertion of IS NOT NULL checks when they are not necessary. [1] points out that infrastructure from [2] might be useful. The patchset from [2] seems committed now. However, I can't see it is directly helpful in this matter. Could we just skip adding an IS NOT NULL clause for the columns that have pg_attribute.attnotnull set? Thanks for the links, I will look into that case. Links 1. https://www.postgresql.org/message-id/2375492.jE0xQCEvom%40aivenronan 2. https://www.postgresql.org/message-id/flat/830269.1656693747%40sss.pgh.pa.us -- regards, Andrey Lepikhov Postgres Professional
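To make the "paranoid in all branches" point concrete, a sketch of the consistently-copying variant discussed above (not the patch code):

/*
 * Substitute oldId with newId in relids.
 *
 * Sketch: never modify the input set, so callers that share a Bitmapset
 * between fields (e.g. required_relids and clause_relids) stay safe in
 * every branch, not only on insert/replace.
 */
static Bitmapset *
replace_relid(Relids relids, int oldId, int newId)
{
	if (oldId < 0)
		return relids;

	/* Delete relid without substitution, on a copy. */
	if (newId < 0)
		return bms_del_member(bms_copy(relids), oldId);

	if (bms_is_member(oldId, relids))
		return bms_add_member(bms_del_member(bms_copy(relids), oldId),
							  newId);

	return relids;
}

The price is an extra bms_copy() per deletion; whether that matters is a separate question, since the join-removal path is not especially hot.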
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 29/9/2023 09:52, Andrei Lepikhov wrote: On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea, have doubts on the implementation. I have a question. I see the feature triggers ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit, too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? With the patch in the attachment I try to show which sort of problems I'm worrying about. In some PG_CATCH() sections we do CopyErrorData() (allocating some memory) before aborting the transaction. So, the allocation error can move us out of this section before aborting. We expect a soft ERROR message but will face harder consequences. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index 33975687b3..3f992b8d92 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -291,10 +291,7 @@ _SPI_commit(bool chain) { ErrorData *edata; - /* Save error info in caller's context */ MemoryContextSwitchTo(oldcontext); - edata = CopyErrorData(); - FlushErrorState(); /* * Abort the failed transaction. If this fails too, we'll just @@ -302,6 +299,10 @@ _SPI_commit(bool chain) */ AbortCurrentTransaction(); + /* Save error info in caller's context */ + edata = CopyErrorData(); + FlushErrorState(); + /* ... and start a new one */ StartTransactionCommand(); if (chain) @@ -383,10 +384,7 @@ _SPI_rollback(bool chain) { ErrorData *edata; - /* Save error info in caller's context */ MemoryContextSwitchTo(oldcontext); - edata = CopyErrorData(); - FlushErrorState(); /* * Try again to abort the failed transaction. If this fails too, @@ -395,6 +393,10 @@ _SPI_rollback(bool chain) */ AbortCurrentTransaction(); + /* Save error info in caller's context */ + edata = CopyErrorData(); + FlushErrorState(); + /* ... and start a new one */ StartTransactionCommand(); if (chain) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 12edc5772a..f9cf599026 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -2565,7 +2565,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, PG_CATCH(); { MemoryContext ecxt = MemoryContextSwitchTo(ccxt); - ErrorData *errdata = CopyErrorData(); + ErrorData *errdata; /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */ if (iterstate) @@ -2579,6 +2579,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, */ AbortCurrentTransaction(); + errdata = CopyErrorData(); + /* make sure there's no cache pollution */ ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
Re: POC: GROUP BY optimization
New version of the patch. Fixed minor inconsistencies and rebased onto current master. -- regards, Andrey Lepikhov Postgres Professional From 2f5a42c8a53286f830e3376ff4d3f8b7d4215b4b Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Sep 2023 11:20:03 +0700 Subject: [PATCH] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause using sort, the cost may depend heavily on the order in which the keys are compared when building the groups. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative orderings of the keys, trying to find a cheaper one. In principle, we might generate grouping paths for all permutations of the keys and leave the rest to the optimizer. But that might get very expensive, so we try to pick only a couple of interesting orderings based on both local and global information. When planning the grouping path, we explore statistics (number of distinct values, cost of the comparison function) for the keys and reorder them to minimize comparison costs. Intuitively, it may be better to perform more expensive comparisons (for complex data types, etc.) last because maybe the cheaper comparisons will be enough. Similarly, the higher the cardinality of a key, the lower the probability we'll need to compare more keys. The patch generates and costs various orderings, picking the cheapest ones. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch generates orderings and picks those minimizing the comparison cost (for various path keys), and then adds orderings that might be useful for operations higher up in the plan (ORDER BY, etc.). Finally, it always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. 
--- .../postgres_fdw/expected/postgres_fdw.out| 36 +- src/backend/optimizer/path/costsize.c | 363 ++- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 600 ++ src/backend/optimizer/plan/planner.c | 465 -- src/backend/optimizer/util/pathnode.c | 4 +- src/backend/utils/adt/selfuncs.c | 38 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/cost.h | 4 +- src/include/optimizer/paths.h | 7 + src/include/utils/selfuncs.h | 5 + src/test/regress/expected/aggregates.out | 244 ++- .../regress/expected/incremental_sort.out | 2 +- src/test/regress/expected/join.out| 51 +- src/test/regress/expected/merge.out | 15 +- .../regress/expected/partition_aggregate.out | 44 +- src/test/regress/expected/partition_join.out | 75 +-- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/union.out | 60 +- src/test/regress/sql/aggregates.sql | 99 +++ src/test/regress/sql/incremental_sort.sql | 2 +- 23 files changed, 1769 insertions(+), 382 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 144c114d0f..63af7feabe 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -2319,18 +2319,21 @@ SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM -- join with pseudoconstant quals EXPLAIN (VERBOSE, COSTS OFF) SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1 AND CURRENT_USER = SESSION_USER) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10; - QUERY PLAN - + QUERY PLAN
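As a toy illustration of the local reordering heuristic described in the commit message above; the structures and the scoring rule here are invented, while the real patch works with PathKey lists, estimate_num_groups() and operator costs, and evaluates several candidate orderings:

/*
 * Hypothetical sketch: order GROUP BY keys so that comparisons that are
 * cheap and likely to distinguish rows come first.
 */
typedef struct GroupKeyInfo
{
	Index		sortgroupref;	/* identifies the key */
	double		ndistinct;		/* estimated number of distinct values */
	Cost		cmp_cost;		/* estimated cost of one comparison */
} GroupKeyInfo;

static int
group_key_cmp(const void *a, const void *b)
{
	const GroupKeyInfo *ka = (const GroupKeyInfo *) a;
	const GroupKeyInfo *kb = (const GroupKeyInfo *) b;
	double		sa = ka->ndistinct / ka->cmp_cost;
	double		sb = kb->ndistinct / kb->cmp_cost;

	/* Higher selectivity per unit of comparison cost sorts earlier. */
	if (sa != sb)
		return (sa > sb) ? -1 : 1;
	return 0;
}

/*
 * qsort(keys, nkeys, sizeof(GroupKeyInfo), group_key_cmp) then yields one
 * candidate ordering; the patch costs it against the query's original
 * ordering and against orderings useful for ORDER BY higher up, keeping
 * the cheapest.
 */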
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote: Attached patches updated to master. Pulled from patch 2 back to patch 1 a change that was also pertinent to patch 1. +1 to the idea, have doubts on the implementation. I have a question. I see the feature triggers ERROR on exceeding the memory limit. The superior PG_CATCH() section will handle the error. As I see, many such sections use memory allocations. What if some routine, like CopyErrorData(), exceeds the limit, too? In this case, we could repeat the error until the top PG_CATCH(). Is this correct behaviour? Maybe check for recursion in exceeds_max_total_bkend_mem() and allow error handlers to slightly exceed this hard limit? Also, the patch needs to be rebased. -- regards, Andrey Lepikhov Postgres Professional