Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-04 Thread Andrey V. Lepikhov

On 4/3/22 15:29, Etsuro Fujita wrote:

On Sun, Mar 13, 2022 at 6:39 PM Etsuro Fujita  wrote:

On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
 wrote:

The patch looks good to me and seems to work as expected.


I’m planning to commit the patch.


I polished the patch a bit:

* Reordered a bit of code in create_append_plan() in logical order (no
functional changes).
* Added more comments.
* Added/Tweaked regression test cases.

Also, I added the commit message.  Attached is a new version of the
patch.  Barring objections, I’ll commit this.


Sorry for the late answer - I was on vacation.
I looked through this patch - it looks much more stable now.
But, as far as I remember, some problems were found with the previous 
version on the TPC-H test. I want to play a bit with TPC-H and with 
parameterized plans.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Removing unneeded self joins

2022-04-04 Thread Andrey V. Lepikhov

On 4/1/22 20:27, Greg Stark wrote:

Sigh. And now there's a patch conflict in a regression test expected
output: sysviews.out

Please rebase. Incidentally, make sure to check the expected output is
actually correct. It's easy to "fix" an expected output to
accidentally just memorialize an incorrect output.

Btw, it's the last week before feature freeze so time is of the essence.

Thanks, the patch in the attachment is rebased on current master.
Sorry for the late answer.
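The detection logic described in the patch's commit message - collect mergejoinable quals of the form a.x = b.x and prove inner-side uniqueness over them - can be sketched as a toy model. All names below are illustrative, not PostgreSQL APIs; the real check is done by innerrel_is_unique().

```python
# Toy model of self-join detection: a join of a table to itself is
# removable when the join quals compare the same column on both sides
# (a.x = b.x) and those columns cover a unique key of the inner side.
def is_removable_self_join(outer_rel, inner_rel, join_quals, unique_keys):
    if outer_rel != inner_rel:          # only joins of a relation to itself
        return False
    # keep only quals that look like a.x = b.x (same column on both sides)
    cols = {outer_col for outer_col, inner_col in join_quals
            if outer_col == inner_col}
    # removable if some unique key is fully covered by those columns
    return any(key <= cols for key in unique_keys)

# SELECT ... FROM t a JOIN t b ON a.id = b.id, with t.id unique:
print(is_removable_self_join('t', 't', [('id', 'id')], [{'id'}]))  # True
# a.x = b.y does not guarantee the same physical row:
print(is_removable_self_join('t', 't', [('x', 'y')], [{'id'}]))    # False
```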

--
regards,
Andrey Lepikhov
Postgres Professional

From d5fab52bd7e7124d0e557f1eec075a9543c67d29 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 15 Jul 2021 15:26:13 +0300
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if we can prove that the join
can be replaced with a scan. We can prove the uniqueness
using the existing innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x.
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced
by a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 888 +-
 src/backend/optimizer/plan/planmain.c |   5 +
 src/backend/optimizer/util/joininfo.c |   3 +
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/include/optimizer/pathnode.h  |   4 +
 src/include/optimizer/planmain.h  |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out| 426 +++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/equivclass.sql   |  16 +
 src/test/regress/sql/join.sql | 197 +
 12 files changed, 1583 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 337f470d58..c5ac8e2bd4 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -32,10 +33,12 @@
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
 
+bool enable_self_join_removal;
+
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
-  Relids joinrelids);
+  Relids joinrelids, int subst_relid);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
@@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
+static Bitmapset* change_relid(Relids relids, Index oldId, Index newId);
+static void change_varno(Expr *expr, Index oldRelid, Index newRelid);
 
 
 /*
@@ -86,7 +92,7 @@ restart:
 
 		remove_rel_from_query(root, innerrelid,
 			  bms_union(sjinfo->min_lefthand,
-		sjinfo->min_righthand));
+		sjinfo->min_righthand), 0);
 
 		/* We verify that exactly one reference gets removed from joinlist */
 		nremoved = 0;
@@ -300,7 +306,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query, or having
+ * decided to replace it with another relid.
+ * For reusability, this routine works in two modes: delete a relid from a
+ * plan or replace it. Replace mode is used by the self-join removal process.
  *
  * We are not terribly thorough here.  We must make sure that the rel is
  * no longer treated as a baserel, and that attributes of other baserels
@@ -309,13 +318,16 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
  * lists, but only if they belong to the outer join identified by joinrelids.
  */
 static void
-remove_rel_from_query(PlannerInfo *root, int relid, Relids joinrelids)
+remove_rel_from_q

Re: Fast COPY FROM based on batch insert

2022-03-24 Thread Andrey V. Lepikhov

On 3/22/22 06:54, Etsuro Fujita wrote:

On Fri, Jun 4, 2021 at 5:26 PM Andrey Lepikhov
 wrote:

We still have slow 'COPY FROM' operation for foreign tables in current
master.
Now we have a foreign batch insert operation, and I tried to rewrite the
patch [1] with this machinery.


The patch has been rewritten to something essentially different, but
no one reviewed it.  (Tsunakawa-san gave some comments without looking
at it, though.)  So the right status of the patch is “Needs review”,
rather than “Ready for Committer”?  Anyway, here are a few review
comments from me:

* I don’t think this assumption is correct:

@@ -359,6 +386,12 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
  (resultRelInfo->ri_TrigDesc->trig_insert_after_row ||
   resultRelInfo->ri_TrigDesc->trig_insert_new_table))
 {
+   /*
+* AFTER ROW triggers aren't allowed with the foreign bulk insert
+* method.
+*/
+   Assert(resultRelInfo->ri_RelationDesc->rd_rel->relkind !=
RELKIND_FOREIGN_TABLE);
+

In postgres_fdw we disable foreign batch insert when the target table
has AFTER ROW triggers, but the core allows it even in that case.  No?

Agreed.


* To allow foreign multi insert, the patch made an invasive change to
the existing logic to determine whether to use multi insert for the
target relation, adding a new member ri_usesMultiInsert to the
ResultRelInfo struct, as well as introducing a new function
ExecMultiInsertAllowed().  But I’m not sure we really need such a
change.  Isn’t it reasonable to *adjust* the existing logic to allow
foreign multi insert when possible?
Of course, such an approach would look much better if we implemented it. 
I'll ponder how to do it.



I didn’t finish my review, but I’ll mark this as “Waiting on Author”.
I rebased the patch onto current master. Now it works correctly. I'll 
mark it as "Waiting for review".
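The batching idea behind the patch - accumulate incoming rows per target relation and flush them to the foreign server in groups of batch_size, instead of one INSERT per row - can be sketched as a toy model. Names below are illustrative, not the actual copyfrom.c code:

```python
# Toy model of multi-insert buffering for COPY FROM into a foreign table:
# rows accumulate until batch_size is reached, then one flush sends the
# whole group (in postgres_fdw, one multi-row INSERT instead of N
# single-row ones).
class CopyMultiInsertBuffer:
    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn  # callback standing in for the FDW insert
        self.rows = []
        self.flush_count = 0

    def add_row(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.flush_count += 1
            self.rows = []

sent = []
buf = CopyMultiInsertBuffer(batch_size=100, flush_fn=sent.extend)
for i in range(250):
    buf.add_row(i)
buf.flush()               # final partial batch at end of COPY
print(buf.flush_count)    # 3 round trips instead of 250
```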


--
regards,
Andrey Lepikhov
Postgres Professional

From 2d51d0f5d94a3e4b3400714b5841228d1896fb56 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Fri, 4 Jun 2021 13:21:43 +0500
Subject: [PATCH] Implementation of a Bulk COPY FROM operation into foreign
 table.

---
 .../postgres_fdw/expected/postgres_fdw.out|  45 +++-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  47 
 src/backend/commands/copyfrom.c   | 210 --
 src/backend/executor/execMain.c   |  45 
 src/backend/executor/execPartition.c  |   8 +
 src/include/commands/copyfrom_internal.h  |  10 -
 src/include/executor/executor.h   |   1 +
 src/include/nodes/execnodes.h |   5 +-
 8 files changed, 237 insertions(+), 134 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index f210f91188..a803029f2f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8415,6 +8415,7 @@ drop table loct2;
 -- ===
 -- test COPY FROM
 -- ===
+alter server loopback options (add batch_size '2');
 create table loc2 (f1 int, f2 text);
 alter table loc2 set (autovacuum_enabled = 'false');
 create foreign table rem2 (f1 int, f2 text) server loopback options(table_name 'loc2');
@@ -8437,7 +8438,7 @@ copy rem2 from stdin; -- ERROR
 ERROR:  new row for relation "loc2" violates check constraint "loc2_f1positive"
 DETAIL:  Failing row contains (-1, xyzzy).
 CONTEXT:  remote SQL command: INSERT INTO public.loc2(f1, f2) VALUES ($1, $2)
-COPY rem2, line 1: "-1	xyzzy"
+COPY rem2, line 2
 select * from rem2;
  f1 | f2  
 +-
@@ -8448,6 +8449,19 @@ select * from rem2;
 alter foreign table rem2 drop constraint rem2_f1positive;
 alter table loc2 drop constraint loc2_f1positive;
 delete from rem2;
+create table foo (a int) partition by list (a);
+create table foo1 (like foo);
+create foreign table ffoo1 partition of foo for values in (1)
+	server loopback options (table_name 'foo1');
+create table foo2 (like foo);
+create foreign table ffoo2 partition of foo for values in (2)
+	server loopback options (table_name 'foo2');
+create function print_new_row() returns trigger language plpgsql as $$
+	begin raise notice '%', new; return new; end; $$;
+create trigger ffoo1_br_trig before insert on ffoo1
+	for each row execute function print_new_row();
+copy foo from stdin;
+NOTICE:  (1)
 -- Test local triggers
 create trigger trig_stmt_before before insert on rem2
 	for each statement execute procedure trigger_func();
@@ -8556,6 +8570,34 @@ drop trigger rem2_trig_row_before on rem2;
 drop trigger rem2_trig_row_after on rem2;
 drop trigger loc2_trig_row_before_insert on loc2;
 delete from rem2;
+alter table loc2 drop column f1;
+alter table loc2 drop column f2;
+copy 

Re: Removing unneeded self joins

2022-03-23 Thread Andrey V. Lepikhov

On 3/22/22 05:58, Andres Freund wrote:

Hi,

On 2022-03-04 15:47:47 +0500, Andrey Lepikhov wrote:

Also, in new version of the patch I fixed one stupid bug: checking a
self-join candidate expression operator - we can remove only expressions
like F(arg1) = G(arg2).


This CF entry currently fails tests: 
https://cirrus-ci.com/task/4632127944785920?logs=test_world#L1938

Looks like you're missing an adjustment of postgresql.conf.sample

Marked as waiting-on-author.

Thanks, I fixed it.

--
regards,
Andrey Lepikhov
Postgres Professional

From 620dea31ce19965beefe545f08dcc5c8b319c434 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 15 Jul 2021 15:26:13 +0300
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if we can prove that the join
can be replaced with a scan. We can prove the uniqueness
using the existing innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x.
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced
by a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 888 +-
 src/backend/optimizer/plan/planmain.c |   5 +
 src/backend/optimizer/util/joininfo.c |   3 +
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/include/optimizer/pathnode.h  |   4 +
 src/include/optimizer/planmain.h  |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out| 426 +++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/equivclass.sql   |  16 +
 src/test/regress/sql/join.sql | 197 +
 12 files changed, 1583 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 337f470d58..c5ac8e2bd4 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -32,10 +33,12 @@
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
 
+bool enable_self_join_removal;
+
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
-  Relids joinrelids);
+  Relids joinrelids, int subst_relid);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
@@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
+static Bitmapset* change_relid(Relids relids, Index oldId, Index newId);
+static void change_varno(Expr *expr, Index oldRelid, Index newRelid);
 
 
 /*
@@ -86,7 +92,7 @@ restart:
 
 		remove_rel_from_query(root, innerrelid,
 			  bms_union(sjinfo->min_lefthand,
-		sjinfo->min_righthand));
+		sjinfo->min_righthand), 0);
 
 		/* We verify that exactly one reference gets removed from joinlist */
 		nremoved = 0;
@@ -300,7 +306,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 /*
  * Remove the target relid from the planner's data structures, having
- * determined that there is no need to include it in the query.
+ * determined that there is no need to include it in the query, or having
+ * decided to replace it with another relid.
+ * For reusability, this routine works in two modes: delete a relid from a
+ * plan or replace it. Replace mode is used by the self-join removal process.
  *
  * We are not terribly thorough here.  We must make sure that the rel is
  * no longer treated as a baserel, and that attributes of other baserels
@@ -309,13 +318,16 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
  * lists, but only if they belong to the outer join identified by joinrelids.
  */
 static void
-remove_rel_from_query(PlannerInfo *root, int relid, Relids joinre

Re: POC: GROUP BY optimization

2022-03-18 Thread Andrey V. Lepikhov

On 3/15/22 13:26, Tomas Vondra wrote:

Thanks for the rebase. The two proposed changes (tweaked costing and
simplified fake_var handling) seem fine to me. I think the last thing
that needs to be done is cleanup of the debug GUCs, which I added to
allow easier experimentation with the patch.

Thanks, I'm waiting for the last step.


I probably won't remove the GUCs entirely, though. I plan to add a
single GUC that would enable/disable this optimization. I'm not a huge
fan of adding more and more GUCs, but in this case it's probably the
right thing to do given the complexity of estimating cost with
correlated columns etc.
Agreed. Because it is a kind of automation, we should allow the user to 
switch it off in case of problems or for manual tuning.


Also, I looked through this patch. It has some minor problems:
1. Multiple typos in the patch comment.
2. The term 'cardinality of a key' - maybe replace it with 'number of 
duplicates'?


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Merging statistics from children instead of re-sampling everything

2022-02-18 Thread Andrey V. Lepikhov

On 2/14/22 20:16, Tomas Vondra wrote:



On 2/14/22 11:22, Andrey V. Lepikhov wrote:

On 2/11/22 20:12, Tomas Vondra wrote:



On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when only a small subset of partitions 
is actively changing.


For (1) I can imagine a solution like multiplexing: at the stage of 
defining which relations to scan, group them and prepare scan parameters 
so that multiple samples are collected in one shot.
I'm not sure I understand what you mean by multiplexing. The term 
usually means "sending multiple signals at once" but I'm not sure how 
that applies to this issue. Can you elaborate?


I propose to make a set of samples in one scan: one sample for the plain 
table, another for its parent, and so on, up the inheritance 
tree. And cache these samples in memory. We can calculate all 
parameters of the reservoir method needed to do it.




I doubt keeping the samples just in memory is a good solution. Firstly, 
there's the question of memory consumption. Imagine a large partitioned 
table with 1-10k partitions. If we keep a "regular" sample (30k rows) 
per partition, that's 30M-300M rows. If each row needs 100B, that's 
3-30GB of data.
I'm talking about caching a sample only for the time it is needed within 
this ANALYZE operation. Imagine three levels of partitioned tables. On each 
partition you would create and keep three different samples (we can do 
it in one scan). The sample for the plain table we can use immediately and 
then destroy it.
For the partition sample at the second level of the hierarchy: we can save 
a copy of the sample for future use (maybe a repeated analyze) to disk. 
The in-memory data is used to form a reservoir, which has a limited size and 
can be destroyed immediately. At the third level we can use the same logic.
So, at any moment we only use as many samples as there are levels in the 
hierarchy. IMO, that isn't a large number.
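The one-scan idea can be sketched with Algorithm R reservoirs, one per hierarchy level, filled from a single pass over a partition. This is a simplified illustration under my own assumptions: real ANALYZE uses Vitter's two-stage sampling, and the per-level reservoirs would later be merged up the hierarchy.

```python
import random

def sample_partition_one_scan(rows, n_levels, k, seed=0):
    """Build one size-k reservoir (Algorithm R) per hierarchy level in a
    single scan of the partition: level 0 for the plain table, level 1
    for its parent, and so on."""
    rng = random.Random(seed)
    reservoirs = [[] for _ in range(n_levels)]
    for n, row in enumerate(rows):
        for reservoir in reservoirs:
            if len(reservoir) < k:
                reservoir.append(row)        # fill phase
            else:
                j = rng.randrange(n + 1)     # keep row with probability k/(n+1)
                if j < k:
                    reservoir[j] = row
    return reservoirs

levels = sample_partition_one_scan(range(10_000), n_levels=3, k=300)
print([len(r) for r in levels])  # [300, 300, 300] - one sample per level
```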


> the trouble is partitions may be detached, data may be deleted from
> some partitions, etc.
Because statistics don't have a strong relation to the data, we can use two 
strategies: in the case of an explicit ANALYZE of the table we can 
recalculate all samples for all partitions, but in the autovacuum case or 
for implicit analysis we can use not-so-old versions of the samples, and 
samples of detached (but not destroyed) partitions, on the optimistic 
assumption that this doesn't change the statistics drastically.



So IMHO the samples need to be serialized, in some way.

Agreed

Well, a separate catalog is one of the options. But I don't see how that 
deals with large samples, etc.
I think we can design a fallback to the previous approach in the case of 
very large tuples, like the switch from HashJoin to NestedLoop when we 
estimate that we don't have enough memory.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Merging statistics from children instead of re-sampling everything

2022-02-14 Thread Andrey V. Lepikhov

On 2/11/22 20:12, Tomas Vondra wrote:



On 2/11/22 05:29, Andrey V. Lepikhov wrote:

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when only a small subset of partitions 
is actively changing.


For (1) I can imagine a solution like multiplexing: at the stage of 
defining which relations to scan, group them and prepare scan parameters 
so that multiple samples are collected in one shot.
I'm not sure I understand what you mean by multiplexing. The term 
usually means "sending multiple signals at once" but I'm not sure how 
that applies to this issue. Can you elaborate?


I propose to make a set of samples in one scan: one sample for the plain 
table, another for its parent, and so on, up the inheritance tree. And 
cache these samples in memory. We can calculate all parameters 
of the reservoir method needed to do it.



sample might be used for estimation of clauses directly.
You mean, to use them in difficult cases, such as the estimation of 
grouping over an Append?


But it requires storing the sample somewhere, and I haven't found a good 
and simple way to do that. We could serialize that into bytea, or we 
could create a new fork, or something, but what should that do with 
oversized attributes (how would TOAST work for a fork) and/or large 
samples (which might not fit into 1GB bytea)? 
This feature looks like meta-information about a database. It can be stored 
in a separate relation. It is not obvious that we need to use it for each 
relation - for example, with large samples. I think it can be controlled 
by a table parameter.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Merging statistics from children instead of re-sampling everything

2022-02-10 Thread Andrey V. Lepikhov

On 2/11/22 03:37, Tomas Vondra wrote:

That being said, this thread was not really about foreign partitions,
but about re-analyzing inheritance trees in general. And sampling
foreign partitions doesn't really solve that - we'll still do the
sampling over and over.

IMO, to solve the problem we should do two things:
1. Avoid repeated partition scans in the case of an inheritance tree.
2. Avoid re-analyzing everything when only a small subset of partitions is 
actively changing.


For (1) I can imagine a solution like multiplexing: at the stage of 
defining which relations to scan, group them and prepare scan parameters 
so that multiple samples are collected in one shot.
It looks like we need separate logic for the analysis of partitioned 
tables - we should form and cache samples on each partition before an 
analysis.
It requires a prototype to understand the complexity of such a solution, 
and it can be done separately from (2).


Task (2) is more difficult to solve. Here we can store samples from each 
partition in the values[] field of pg_statistic, or in a specific table 
which stores a 'most probable values' snapshot of each table.
The most difficult problem here, as you mentioned, is the ndistinct value. 
Is it possible to store not an exactly calculated ndistinct, but an 
'expected value', based on analysis of samples and histograms on the 
partitions? Such a value could also solve the problem of estimating the 
grouping of a SETOP result (joining them, etc.), where we have statistics 
only on the sources of the union.
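The reason ndistinct is the hard part can be shown in a few lines: per-partition ndistinct values cannot simply be added, because partitions may share values, so any merged value is necessarily an estimate. A minimal illustration:

```python
# ndistinct of a union is not the sum of per-partition ndistincts:
# it can be anywhere between max(nd1, nd2) and nd1 + nd2, depending on
# how many values the partitions share.
part1 = [1, 2, 3, 3]
part2 = [3, 4, 4, 5]
nd1 = len(set(part1))
nd2 = len(set(part2))
nd_union = len(set(part1) | set(part2))
print(nd1, nd2, nd_union)  # 3 3 5  (not 3 + 3 = 6: value 3 is shared)
```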


--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC: GROUP BY optimization

2022-02-10 Thread Andrey V. Lepikhov

On 1/22/22 01:34, Tomas Vondra wrote:




I rebased (with minor fixes) this patch onto current master.

Also, the second patch is dedicated to the problem of "varno 0" (fake_var).
I think this case should produce the same estimations as the case of 
varno != 0 with no stats found. So I propose to restrict the number of 
groups to the minimum of the number of incoming tuples and the 
DEFAULT_NUM_DISTINCT value.


--
regards,
Andrey Lepikhov
Postgres Professional

From d5fd0f8f981d9e457320c1007f21f2b9b74aab9e Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 10 Feb 2022 13:51:51 +0500
Subject: [PATCH 2/2] Use default restriction for number of groups.

---
 src/backend/optimizer/path/costsize.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 68a32740d7..b9e975df10 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1756,8 +1756,8 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 
 /*
  * is_fake_var
- * Workaround for generate_append_tlist() which generates fake Vars with
- * varno == 0, that will cause a fail of estimate_num_group() call
+ * Workaround for generate_append_tlist() which generates fake Vars for the
+ * case of "varno 0", that will cause a fail of estimate_num_group() call
  *
  * XXX Ummm, why would estimate_num_group fail with this?
  */
@@ -1978,21 +1978,13 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,

  tuplesPerPrevGroup, NULL, NULL,

  _varinfos,

  list_length(pathkeyExprs) - 1);
-		else if (tuples > 4.0)
+		else
 			/*
-			 * Use geometric mean as estimation if there is no any stats.
-			 * Don't use DEFAULT_NUM_DISTINCT because it used for only one
-			 * column while here we try to estimate number of groups over
-			 * set of columns.
-			 *
-			 * XXX Perhaps this should use DEFAULT_NUM_DISTINCT at least to
-			 * limit the calculated values, somehow?
-			 *
-			 * XXX What's the logic of the following formula?
+			 * In case of full uncertainty use a default defensive approach.
+			 * It means that any permutations of such vars are equivalent.
+			 * Also, see comments for the cost_incremental_sort routine.
 			 */
-			nGroups = ceil(2.0 + sqrt(tuples) * (i + 1) / list_length(pathkeys));
-		else
-			nGroups = tuples;
+			nGroups = Min(tuplesPerPrevGroup, DEFAULT_NUM_DISTINCT);
 
/*
	 * Presorted keys aren't participated in comparison but still checked
-- 
2.25.1

From 7357142a0f45313aa09014c4413c011b61aafe5f Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 21 Jan 2022 20:22:14 +0100
Subject: [PATCH 1/2] GROUP BY reordering

---
 .../postgres_fdw/expected/postgres_fdw.out|  15 +-
 src/backend/optimizer/path/costsize.c | 369 +-
 src/backend/optimizer/path/equivclass.c   |  13 +-
 src/backend/optimizer/path/pathkeys.c | 580 
 src/backend/optimizer/plan/planner.c  | 653 ++
 src/backend/optimizer/util/pathnode.c |   2 +-
 src/backend/utils/adt/selfuncs.c  |  37 +-
 src/backend/utils/misc/guc.c  |  32 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/pathnodes.h |  10 +
 src/include/optimizer/cost.h  |   4 +-
 src/include/optimizer/paths.h |  11 +
 src/include/utils/selfuncs.h  |   5 +
 src/test/regress/expected/aggregates.out  | 244 ++-
 src/test/regress/expected/guc.out |   9 +-
 .../regress/expected/incremental_sort.out |   2 +-
 src/test/regress/expected/join.out|  51 +-
 .../regress/expected/partition_aggregate.out  | 136 ++--
 src/test/regress/expected/partition_join.out  |  75 +-
 src/test/regress/expected/union.out   |  60 +-
 src/test/regress/sql/aggregates.sql   |  99 +++
 src/test/regress/sql/incremental_sort.sql |   2 +-
 22 files changed, 1913 insertions(+), 497 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b2e02caefe..2331d13858 100644
--- 

Re: POC: GROUP BY optimization

2022-02-01 Thread Andrey V. Lepikhov

On 7/22/21 3:58 AM, Tomas Vondra wrote:
I've simplified the costing a bit, and the attached version actually
undoes all the "suspicious" plan changes in postgres_fdw. It changes one
new plan, but that seems somewhat reasonable, as it pushes sort to the
remote side.


I tried to justify the heap-sort part of the compute_cpu_sort_cost() 
routine and realized that we may have a mistake here.
After a week of effort, I haven't found any research papers on the 
dependence of bounded heap-sort time complexity on the number of duplicates.


So, I propose a self-made formula, based on simple logical constructions:

1. We should start from the initial formula: cost ~ N*LOG2(M), where M is 
the number of output_tuples.

2. Realize that the full representation of this formula is:

cost ~ N*LOG2(min{N,M})

3. In the multicolumn case, the number of comparisons for each next 
column can be estimated by the same formula, but applied to the number of 
tuples per group:

comparisons ~ input * LOG2(min{input,M})

4. Realize that for the case when M > input, we should change this 
formula a bit:

comparisons ~ max{input,M} * LOG2(min{input,M})

Remember that in our case M << tuples.
So, the general formula for bounded heap sort can be written as:

cost ~ N * sum(max{n_i,M}/N * LOG2(min{n_i,M})), i=1..ncols

where n_1 == N, and n_i is the number of tuples per group, estimated from 
the previous iteration.
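A minimal numeric sketch of this formula, under my reading of the construction above (n_i are the estimated tuples-per-group after sorting by the first i-1 columns, with n_1 == N):

```python
from math import log2

def bounded_heapsort_comparisons(N, M, groups_per_column):
    """Evaluate cost ~ N * sum(max(n_i, M)/N * log2(min(n_i, M))),
    following the formula proposed above. groups_per_column holds the
    n_i values, starting with n_1 == N."""
    return N * sum(max(n_i, M) / N * log2(min(n_i, M))
                   for n_i in groups_per_column)

# One sort column degenerates to the base case N * log2(min(N, M)):
print(bounded_heapsort_comparisons(1024, 16, [1024]))        # 4096.0
# Each extra column adds work proportional to the remaining group sizes:
print(bounded_heapsort_comparisons(1024, 16, [1024, 64]))    # 4352.0
```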


In attachment - an implementation of this approach.

--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 68a32740d7..2c3cce57aa 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,6 +1872,15 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  *
  * N * sum( Fk * log(Gk) )
  *
+ * For bounded heap sort we don't have such duplicates-related research. So we
+ * invent a formula based on simple logic (M - number of output tuples):
+ * For one column we can estimate the number of comparisons as:
+ * N * log(min{N,M})
+ * For the case of many columns we can naively estimate the number of comparisons
+ * by the formula:
+ * sum(max{n_i,M} * log(min{N,M})),
+ * where n_0 == N, n_i - number of tuples per group, estimated on the previous step.
+ *
  * Note: We also consider column width, not just the comparator cost.
  *
  * NOTE: some callers currently pass NIL for pathkeys because they
@@ -1881,7 +1890,7 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
 static Cost
 compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 					  Cost comparison_cost, double tuples, double output_tuples,
- bool heapSort)
+ bool bounded_sort)
 {
Costper_tuple_cost = 0.0;
ListCell*lc;
@@ -1907,7 +1916,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 	 * of this function, but I'm not sure. I suggest we introduce some simple
 	 * constants for that, instead of magic values.
 */
-   output_tuples = (heapSort) ? 2.0 * output_tuples : tuples;
+   output_tuples = (bounded_sort) ? 2.0 * output_tuples : tuples;
per_tuple_cost += 2.0 * cpu_operator_cost * LOG2(output_tuples);
 
/* add cost provided by caller */
@@ -1925,8 +1934,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
{
 		PathKey    *pathkey = (PathKey*) lfirst(lc);
EquivalenceMember   *em;
-		double		nGroups,
-					correctedNGroups;
+		double		nGroups;
 
/*
 * We believe that equivalence members aren't very different, 
so, to
@@ -2000,24 +2008,16 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 */
if (i >= nPresortedKeys)
{
-			if (heapSort)
-			{
-				double heap_tuples;
-
-				/* have to keep at least one group, and a multiple of group size */
-				heap_tuples = Max(ceil(output_tuples / tuplesPerPrevGroup) * tuplesPerPrevGroup,
-								  tuplesPerPrevGroup);
-
-				/* so how many (whole) groups is that? */
-				correctedNGroups = ceil(heap_tuples / tuplesPerPrevGroup);
-			}
+			/*
+			 * Quick sort and 'top-N' sorting (bounded heap sort) algorithms
+			 * have different formulas for time complexity estimation.
+			 */
+   

Re: Multiple Query IDs for a rewritten parse tree

2022-01-31 Thread Andrey V. Lepikhov

On 1/28/22 9:51 PM, Dmitry Dolgov wrote:

On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
Registration of an queryId generator implemented by analogy with extensible
methods machinery.


Why not more like suggested with stakind and slots in some data
structure? All of those generators have to be iterated anyway, so not
sure if a hash table makes sense.
Maybe. But it is not obvious. We don't really know how many extensions 
could set a queryId.
For example, adaptive planning extensions definitely want to set a 
unique id (for example, a simplistic counter) to trace a specific 
{query, plan} pair across all executions (remember the plancache too). And 
they would register a personal generator for that purpose.



Also, I switched queryId to int64 type and renamed to
'label'.


A name with "id" in it would be better I believe. Label could be think
of as "the query belongs to a certain category", while the purpose is
identification.
I think that is not entirely true. The current jumbling intentionally (I 
hope) generates a non-unique queryId, and pg_stat_statements uses the 
queryId to group queries into classes.
To track a specific query along its execution path, it makes additional 
efforts (remembering the query nesting level, for example).
BTW, before [1], I tried to improve the queryId so that it is stable 
under permutations of tables in the 'FROM' section and so on. It would 
allow reducing the number of pg_stat_statements entries (a critical 
factor when you use an ORM, like 1C for example).

So I think a queryId is both an id and a category.



2. We need a custom queryId, that is based on a generated queryId (according
to the logic of pg_stat_statements).


Could you clarify?
pg_stat_statements takes the original queryId and changes it for a 
reason (sometimes zeroing it, sometimes not). So another extension can't 
use this value and be confident that it sees the original value 
generated by JumbleQuery(). A custom queryId solves this problem.
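A minimal sketch of the idea under discussion, in the spirit of the stakind slots in pg_statistic: each query carries a list of (tag, hash) slots, so pg_stat_statements can mangle its own slot without clobbering the value other consumers rely on. All names here (QueryIds, CORE_TAG, PGSS_TAG) are hypothetical, not the patch's API.

```python
# Sketch of per-query (tag, hash) slots: one consumer can zero or replace
# its own id without touching the others.  Names are hypothetical.

CORE_TAG = 0        # reserved for the in-core jumbling
PGSS_TAG = -1       # pg_stat_statements' private slot

class QueryIds:
    def __init__(self):
        self._slots = {}                 # tag -> 64-bit hash

    def set(self, tag, value):
        self._slots[tag] = value & 0xFFFFFFFFFFFFFFFF

    def get(self, tag):
        return self._slots.get(tag)

ids = QueryIds()
ids.set(CORE_TAG, 0xDEADBEEF)            # computed once by the core generator
ids.set(PGSS_TAG, ids.get(CORE_TAG))     # pgss takes a private copy ...
ids.set(PGSS_TAG, 0)                     # ... and may zero it (utility stmts)

assert ids.get(CORE_TAG) == 0xDEADBEEF   # the original value survives
assert ids.get(PGSS_TAG) == 0
```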



4. We should reserve position of default in-core generator


 From the discussion above I was under the impression that the core
generator should be distinguished by a predefined kind.
Yes, but I think we should reserve a range of values large enough for 
use in third-party extensions.



5. We should add an EXPLAIN hook, to allow an extension to print this custom
queryId.


Why? It would make sense if custom generation code will be generating
some complex structure, but the queryId itself is still a hash.

An extension can print not only the queryId but also an explanation of 
its kind, and maybe additional logic.
Moreover, why shouldn't an extension be able to show some useful 
monitoring data, collected during query execution, in verbose mode?


[1] 
https://www.postgresql.org/message-id/flat/e50c1e8f-e5d6-5988-48fa-63dd992e9565%40postgrespro.ru

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Multiple Query IDs for a rewritten parse tree

2022-01-20 Thread Andrey V. Lepikhov

On 1/9/22 5:49 AM, Tom Lane wrote:

The idea I'd been vaguely thinking about is to allow attaching a list
of query-hash nodes to a Query, where each node would contain a "tag"
identifying the specific hash calculation method, and also the value
of the query's hash calculated according to that method.  We could
probably get away with saying that all such hash values must be uint64.
The main difference from your function-OID idea, I think, is that
I'm envisioning the tags as being small integers with well-known
values, similarly to the way we manage stakind values in pg_statistic.
In this way, an extension that wants a hash that the core knows how
to calculate doesn't need its own copy of the code, and similarly
one extension could publish a calculation method for use by other
extensions.


To move forward, I have made a patch that implements this idea (see 
attachment). It is a POC, but it passes all regression tests.
Registration of a queryId generator is implemented by analogy with the 
extensible methods machinery. Also, I switched queryId to the int64 type 
and renamed it to 'label'.


Some lessons learned:
1. The single-queryId implementation is deeply tangled with the core 
code (the stats reporting machinery and parallel workers, for example).
2. We need a custom queryId that is based on a generated queryId 
(according to the logic of pg_stat_statements).
3. We should think about the safety of the de-registration procedure.
4. We should reserve the position of the default in-core generator and 
think about the logic of enabling/disabling it.
5. We should add an EXPLAIN hook to allow an extension to print this 
custom queryId.


--
regards,
Andrey Lepikhov
Postgres Professional
>From f54bb60bd71b49ac8e1b85cd2ad86332a8a81e84 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 20 Jan 2022 16:05:17 +0500
Subject: [PATCH] Initial commit

---
 .../pg_stat_statements/pg_stat_statements.c   |  57 --
 src/backend/commands/explain.c|  11 +-
 src/backend/executor/execMain.c   |   4 +-
 src/backend/executor/execParallel.c   |   3 +-
 src/backend/nodes/copyfuncs.c |  21 ++-
 src/backend/nodes/outfuncs.c  |  17 +-
 src/backend/nodes/readfuncs.c |  17 +-
 src/backend/optimizer/plan/planner.c  |   2 +-
 src/backend/parser/analyze.c  |  15 +-
 src/backend/rewrite/rewriteHandler.c  |   4 +-
 src/backend/tcop/postgres.c   |  11 +-
 src/backend/utils/misc/guc.c  |   2 +-
 src/backend/utils/misc/queryjumble.c  | 164 --
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|  10 +-
 src/include/nodes/plannodes.h |   2 +-
 src/include/parser/analyze.h  |   3 +-
 src/include/utils/queryjumble.h   |  12 +-
 18 files changed, 294 insertions(+), 62 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 082bfa8f77..ebfc3331df 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -307,8 +307,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_info);
 
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
-static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-	JumbleState *jstate);
+static void pgss_post_parse_analyze(ParseState *pstate, Query *query);
 static PlannedStmt *pgss_planner(Query *parse,
  const char *query_string,
  int cursorOptions,
@@ -813,13 +812,29 @@ error:
 }
 
 /*
- * Post-parse-analysis hook: mark query with a queryId
+ * Post-parse-analysis hook: create a label for the query.
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query)
 {
+	JumbleState	   *jstate;
+	QueryLabel	   *label = get_query_label(query->queryIds, 0);
+
 	if (prev_post_parse_analyze_hook)
-		prev_post_parse_analyze_hook(pstate, query, jstate);
+		prev_post_parse_analyze_hook(pstate, query);
+
+	if (label)
+	{
+		add_custom_query_label(&query->queryIds, -1, label->hash);
+		jstate = (JumbleState *) label->context;
+	}
+	else
+	{
+		add_custom_query_label(&query->queryIds, -1, UINT64CONST(0));
+		jstate = NULL;
+	}
+
+	label = get_query_label(query->queryIds, -1);
 
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(exec_nested_level))
@@ -833,7 +848,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (query->utilityStmt)
 	{
 		if (pgss_track_utility && !PGSS_HANDLED_UTILITY(query->utilityStmt))
-			query->queryId = UINT64CONST(0);
+			label->hash = UINT64CONST(0);
 		return;
 	}
 
@@ -846,7 +861,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	 */
 	if (jstate && jstate->clocations_count > 0)
 		pgss_store(pstate->p_sourcetext,

Re: POC: GROUP BY optimization

2022-01-19 Thread Andrey V. Lepikhov

I keep working on this patch. Here are intermediate results.

On 7/22/21 3:58 AM, Tomas Vondra wrote:

in the first loop. Which seems pretty bogus - why would there be just
two groups? When processing the first expression, it's as if there was
one big "prev group" with all the tuples, so why not to just use nGroups
as it is?


I think the heapsort code seems very strange. Look into the fallback 
case: it is based on the output_tuples value. Maybe we should use an 
nGroups value here, but one derived from the number of output_tuples?
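For reference, the fallback being discussed distinguishes a bounded ('top-N') heap sort from a full sort roughly as follows. This is a sketch of the formula only, with hypothetical names, not the actual compute_cpu_sort_cost() code:

```python
from math import log2

def per_tuple_compare_cost(tuples, output_tuples, bounded_sort,
                           cpu_operator_cost=0.0025):
    # Bounded ('top-N') heap sort keeps a heap of output_tuples entries, so
    # each input tuple costs about log2(2 * output_tuples) comparisons;
    # a full quicksort costs about log2(tuples) comparisons per tuple.
    n = 2.0 * output_tuples if bounded_sort else tuples
    return 2.0 * cpu_operator_cost * log2(n)

full = per_tuple_compare_cost(1_000_000, 1_000_000, bounded_sort=False)
topn = per_tuple_compare_cost(1_000_000, 100, bounded_sort=True)
assert topn < full      # with a small LIMIT, top-N sort is cheaper per tuple
```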


> 1) I looked at the resources mentioned as sources the formulas came
> from, but I've been unable to really match the algorithm to them. The
> quicksort paper is particularly "dense", the notation seems to be very
> different, and none of the theorems seem like an obvious fit. Would be
> good to make the relationship clearer in comments etc.

Fixed (See attachment).


3) I'm getting a bit skeptical about the various magic coefficients that
are meant to model higher costs with non-uniform distribution. But
consider that we do this, for example:

tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);

but then in the next loop we call estimate_num_groups_incremental and
pass this "tweaked" tuplesPerPrevGroup value to it. I'm pretty sure this
may have various strange consequences - we'll calculate the nGroups
based on the inflated value, and we'll calculate tuplesPerPrevGroup from
that again - that seems susceptible to "amplification".

We inflate tuplesPerPrevGroup by 50%, which means we'll get a higher
nGroups estimate in the next loop - but not linearly. An then we'll
calculate the inflated tuplesPerPrevGroup and estimated nGroup ...


The weighting coefficient '1.5' reflects our desire to minimize the 
number of comparison operations on each subsequent attribute of a 
pathkeys list. By increasing this coefficient we increase the chance 
that the planner will order pathkeys by decreasing uniqueness.

I think it's OK.
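The effect of the coefficient can be sketched like this (hypothetical helper name; the real code lives in compute_cpu_sort_cost()):

```python
from math import ceil

def shrink(tuples_per_prev_group, n_groups, fudge=1.5):
    # After adding one more sort column, each group of the prefix splits into
    # n_groups subgroups; the 1.5 factor deliberately overestimates subgroup
    # size, so columns with few distinct values are penalized when leading.
    return ceil(fudge * tuples_per_prev_group / n_groups)

# Starting from 10,000 tuples, a leading column with 1000 distinct values
# leaves far smaller groups for the next comparison than one with 10:
assert shrink(10_000, 1000) < shrink(10_000, 10)   # 15 vs 1500
```

Since later columns are only compared within the groups left by earlier ones, a high-cardinality leading column reduces the total comparison count, which is what the costing rewards.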

--
regards,
Andrey Lepikhov
Postgres Professional
From fbc5e6709550f485a2153dda97ef805700717f23 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 24 Dec 2021 13:01:48 +0500
Subject: [PATCH] My fixes.

---
 src/backend/optimizer/path/costsize.c | 33 ---
 src/backend/optimizer/path/pathkeys.c |  9 
 src/backend/optimizer/plan/planner.c  |  8 ---
 3 files changed, 20 insertions(+), 30 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index afb8ba54ea..70af9c91d5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1845,19 +1845,17 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  * groups in the current pathkey prefix and the new pathkey), and the comparison
  * costs (which is data type specific).
  *
- * Estimation of the number of comparisons is based on ideas from two sources:
+ * Estimation of the number of comparisons is based on ideas from:
  *
- * 1) "Algorithms" (course), Robert Sedgewick, Kevin Wayne [https://algs4.cs.princeton.edu/home/]
+ * "Quicksort Is Optimal", Robert Sedgewick, Jon Bentley, 2002
+ * [https://www.cs.princeton.edu/~rs/talks/QuicksortIsOptimal.pdf]
  *
- * 2) "Quicksort Is Optimal For Many Equal Keys" (paper), Sebastian Wild,
- * arXiv:1608.04906v4 [cs.DS] 1 Nov 2017. [https://arxiv.org/abs/1608.04906]
- *
- * In term of that paper, let N - number of tuples, Xi - number of tuples with
- * key Ki, then the estimate of number of comparisons is:
+ * In term of that paper, let N - number of tuples, Xi - number of identical
+ * tuples with value Ki, then the estimate of number of comparisons is:
  *
  * log(N! / (X1! * X2! * ..))  ~  sum(Xi * log(N/Xi))
  *
- * In our case all Xi are the same because now we don't have any estimation of
+ * We assume all Xi the same because now we don't have any estimation of
  * group sizes, we have only know the estimate of number of groups (distinct
  * values). In that case, formula becomes:
  *
@@ -1865,7 +1863,7 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  *
  * For multi-column sorts we need to estimate the number of comparisons for
  * each individual column - for example with columns (c1, c2, ..., ck) we
- * can estimate that number of comparions on ck is roughly
+ * can estimate that number of comparisons on ck is roughly
  *
  * ncomparisons(c1, c2, ..., ck) / ncomparisons(c1, c2, ..., c(k-1))
  *
@@ -1874,10 +1872,10 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  *
  * N * sum( Fk * log(Gk) )
  *
- * Note: We also consider column witdth, not just the comparator cost.
+ * Note: We also consider column width, not just the comparator cost.
  *
  * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  In this case, it will fallback to
+ * can't conveniently supply the sort keys. In this case, it will fallback to
  * simple comparison cost estimate.
  */
 static Cost
@@ -1925,13 

Re: Multiple Query IDs for a rewritten parse tree

2022-01-09 Thread Andrey V. Lepikhov

On 1/10/22 9:51 AM, Julien Rouhaud wrote:

On Mon, Jan 10, 2022 at 09:10:59AM +0500, Andrey V. Lepikhov wrote:

I can add one more use case.
Our extension for freezing query plan uses query tree comparison technique
to prove, that the plan can be applied (and we don't need to execute
planning procedure at all).
The procedure of a tree equality checking is expensive and we use cheaper
queryId comparison to identify possible candidates. So here, for the better
performance and queries coverage, we need to use query tree normalization -
queryId should be stable to some modifications in a query text which do not
change semantics.
As an example, query plan with external parameters can be used to execute
constant query if these constants correspond by place and type to the
parameters. So, queryId calculation technique returns also pointers to all
constants and parameters found during the calculation.


I'm also working on a similar extension, and yes you can't accept any
fingerprinting approach for that.  I don't know what are the exact heuristics
of your cheaper queryid calculation are, but is it reasonable to use it with
something like pg_stat_statements?  If yes, you don't really need two queryid
approach for the sake of this single extension and therefore don't need to
store multiple jumble state or similar per statement.  Especially since
requiring another one would mean a performance drop as soon as you want to use
something as common as pg_stat_statements.

I think pg_stat_statements can live with the queryId generator of the 
sr_plan extension. But it replaces all constants with a $XXX parameter 
in the query string. In our extension, the user defines which plan is 
optimal and which constants can be used as parameters in the plan.
One drawback I see here: creating or dropping my extension changes the 
behavior of pg_stat_statements, which distorts the DB load profile. 
Also, we have no guarantee that another extension will work correctly 
(or optimally) with such a queryId.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Multiple Query IDs for a rewritten parse tree

2022-01-09 Thread Andrey V. Lepikhov

On 1/9/22 5:13 PM, Julien Rouhaud wrote:

For now the queryid mixes two different things: fingerprinting and query text
normalization.  Should each calculation method be allowed to do a different
normalization too, and if yes where should be stored the state data needed for
that?  If not, we would need some kind of primary hash for that purpose.


Do you mean JumbleState?
I think, when registering a queryId generator, we should also store a 
pointer (void **args) to an additional data entry, as usual.



Looking at Andrey's use case for wanting multiple hashes, I don't think that
adaptive optimization needs a normalized query string.  The only use would be
to output some statistics, but this could be achieved by storing a list of
"primary queryid" for each adaptive entry.  That's probably also true for
anything that's not monitoring intended.  Also, all monitoring consumers should
probably agree on the same queryid, both fingerprint and normalized string, as
otherwise it's impossible to cross-reference metric data.


I can add one more use case.
Our extension for freezing query plans uses a query tree comparison 
technique to prove that a plan can be applied (so we don't need to 
execute the planning procedure at all).
The tree equality check is expensive, so we use a cheaper queryId 
comparison to identify possible candidates. Here, for better performance 
and query coverage, we need query tree normalization: the queryId should 
be stable under modifications of the query text that do not change its 
semantics.
As an example, a query plan with external parameters can be used to 
execute a constant query if those constants correspond by position and 
type to the parameters. So the queryId calculation technique also 
returns pointers to all constants and parameters found during the 
calculation.
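The two-step matching described above can be sketched as follows. The tree encoding (nested tuples) and both functions are hypothetical stand-ins, not the extension's actual code:

```python
# Two-step matching: a cheap, collision-tolerant queryId narrows the set of
# frozen plans; an expensive full tree comparison confirms the match.

def query_id(tree):
    tag = tree[0]
    if tag == 'Const':
        return hash(('Const',))     # constant values are ignored on purpose
    return hash((tag,) + tuple(query_id(c) if isinstance(c, tuple) else hash(c)
                               for c in tree[1:]))

def trees_equal(a, b):
    return a == b                   # stands in for the expensive full walk

frozen = {}                         # queryId -> list of stored query trees
t1 = ('SELECT', ('ScanRel', 't'), ('Const', 1))
t2 = ('SELECT', ('ScanRel', 't'), ('Const', 2))
frozen.setdefault(query_id(t1), []).append(t1)

candidates = frozen.get(query_id(t2), [])
assert candidates                                       # cheap filter hits
assert not any(trees_equal(t2, c) for c in candidates)  # full check decides
```

Because the hash ignores constants, queries differing only in constant values fall into the same bucket, and only then does the expensive comparison (or parameter matching by position and type) run.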


--
regards,
Andrey Lepikhov
Postgres Professional




Re: pg_stat_statements and "IN" conditions

2022-01-04 Thread Andrey V. Lepikhov

On 1/5/22 4:02 AM, Tom Lane wrote:

Dmitry Dolgov <9erthali...@gmail.com> writes:

And now for something completely different, here is a new patch version.
It contains a small fix for one problem we've found during testing (one
path code was incorrectly assuming find_const_walker results).


I've been saying from day one that pushing the query-hashing code into the
core was a bad idea, and I think this patch perfectly illustrates why.
We can debate whether the rules proposed here are good for
pg_stat_statements or not, but it seems inevitable that they will be a
disaster for some other consumers of the query hash.  In particular,
dropping external parameters from the hash seems certain to break
something for somebody

+1.

In a couple of extensions I use different query jumbling logic: the hash 
value is more stable in some cases than in the default implementation. 
For example, it should be stable under permutations in the 'FROM' 
section of a query.
And if anyone subtly changes the jumbling logic while the extension is 
active, the instance could run into huge performance issues.

Let me suggest that the core should at least allow an extension to 
detect such interference between extensions. Maybe the hook could be 
replaced with a callback, so that an extension can see whether a queryId 
was produced by the generation logic it expects.
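One way to make such interference detectable is to record which generator produced the value alongside the hash itself. A minimal sketch, with all names hypothetical:

```python
# Sketch: tag each queryId with the generator that produced it, so a
# consumer can detect that another extension's logic has taken over.

state = {'generator': 'core-jumble'}

def compute_query_id(query_text):
    return (state['generator'], hash(query_text) & 0xFFFFFFFFFFFFFFFF)

gen, qid = compute_query_id('SELECT 1')
assert gen == 'core-jumble'           # consumer sees the logic it expects

state['generator'] = 'my-ext-jumble'  # another extension replaced the logic
gen, qid = compute_query_id('SELECT 1')
assert gen != 'core-jumble'           # interference is now detectable
```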


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Clarify planner_hook calling convention

2022-01-04 Thread Andrey V. Lepikhov

On 1/3/22 8:59 PM, Tom Lane wrote:

"Andrey V. Lepikhov"  writes:

planner hook is frequently used in monitoring and advising extensions.


Yeah.


The call to this hook is implemented in the way, that the
standard_planner routine must be called at least once in the hook's call
chain.
But, as I see in [1], it should allow us "... replace the planner
altogether".
In such situation it haven't sense to call standard_planner at all.


That's possible in theory, but who's going to do it in practice?


We use it in an extension that freezes a plan for a specific 
parameterized query (using the plancache plus shared storage) - exactly 
the same technique as the extended query protocol, but spread across all 
backends.
As far as I know, the community doesn't like such features, and we use 
it in enterprise code only.



But, maybe more simple solution is to describe requirements to such kind
of extensions in the code and documentation (See patch in attachment)?
+ * 2. If your extension implements some planning activity, write in the 
extension
+ * docs a requirement to set the extension at the begining of shared libraries 
list.


This advice seems pretty unhelpful.  If more than one extension is
getting into the planner_hook, they can't all be first.


I want to check planner_hook on startup, log an error if it isn't NULL, 
and give the user advice on how to fix it. I want to legalize this 
logic, if permissible.




(Also, largely the same issue applies to very many of our other
hooks.)


Agreed. Interference between extensions is a very annoying issue now.

--
regards,
Andrey Lepikhov
Postgres Professional




Clarify planner_hook calling convention

2022-01-02 Thread Andrey V. Lepikhov

Hi,

The planner hook is frequently used in monitoring and advising 
extensions. The call to this hook is implemented in such a way that the 
standard_planner routine must be called at least once in the hook's call 
chain.

But, as I see in [1], it should allow us to "... replace the planner 
altogether".
In that situation it makes no sense to call standard_planner at all. 
Moreover, if an extension performs some expensive planning activity, 
monitoring tools like pg_stat_statements can produce different results 
depending on the hook calling order.
I thought about additional hooks, explicit hook priorities, and so on. 
But maybe a simpler solution is to describe the requirements for such 
extensions in the code and documentation (see the patch in the 
attachment)?
It would allow an extension developer to legally check and log the 
situation when the extension is not last in the call chain.
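The chaining convention, and the failure mode when an extension replaces the planner outright, can be sketched like this (a simplified model of the C pattern, not PostgreSQL code):

```python
# Each extension saves the previous hook at load time; a well-behaved hook
# calls it (or the standard planner).  A hook that replaces the planner
# outright silently cuts off every hook installed before it.

def standard_planner(query):
    return 'standard plan for ' + query

planner_hook = None                     # like the global in planner.c

def install(make_hook):
    global planner_hook
    planner_hook = make_hook(planner_hook)

def monitoring_hook(prev):              # well-behaved: chains onward
    def hook(query):
        plan = prev(query) if prev else standard_planner(query)
        return plan + ' [timed]'
    return hook

def replacing_hook(prev):               # replaces the planner: never chains
    def hook(query):
        return 'frozen plan for ' + query
    return hook

install(monitoring_hook)
install(replacing_hook)                 # installed last, so it runs first
plan = planner_hook('q1')
assert 'timed' not in plan              # the monitoring hook was skipped
```

This is why the proposed doc note asks planner-replacing extensions to sit at the beginning of shared_preload_libraries: loaded first, they end up last in the chain, so hooks loaded after them still run.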



[1] 
https://www.postgresql.org/message-id/flat/27516.1180053940%40sss.pgh.pa.us


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e3..79a5602850 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9401,6 +9401,7 @@ SET XML OPTION { DOCUMENT | CONTENT };
 double quotes if you need to include whitespace or commas in the name.
 This parameter can only be set at server start.  If a specified
 library is not found, the server will fail to start.
+Libraries are loaded in the order in which they appear in the list.

 

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bd01ec0526..7251b88ad1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -261,8 +261,11 @@ static int	common_prefix_cmp(const void *a, const void *b);
  * after the standard planning process.  The plugin would normally call
  * standard_planner().
  *
- * Note to plugin authors: standard_planner() scribbles on its Query input,
- * so you'd better copy that data structure if you want to plan more than once.
+ * Notes to plugin authors:
+ * 1. standard_planner() scribbles on its Query input, so you'd better copy that
+ * data structure if you want to plan more than once.
+ * 2. If your extension implements some planning activity, write in the extension
+ * docs a requirement to set the extension at the begining of shared libraries list.
  *
  */
 PlannedStmt *


Re: Look at all paths?

2021-12-28 Thread Andrey V. Lepikhov

On 12/29/21 5:07 AM, Chris Cleveland wrote:
I'm developing a new index access method. Sometimes the planner uses it 
and sometimes it doesn't. I'm trying to debug the process to understand 
why the index does or doesn't get picked up.


Is there a way to dump all of the query plans that the planner 
considered, along with information on why they were rejected? EXPLAIN 
only gives info on the plan that was actually selected.


You can enable the OPTIMIZER_DEBUG option. Also, the gdbpg scripts [1] 
sometimes make this work much easier.


[1] https://github.com/tvondra/gdbpg

--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC: GROUP BY optimization

2021-12-28 Thread Andrey V. Lepikhov

On 9/2/20 9:12 PM, Tomas Vondra wrote:
> We could simply use the input "tuples" value here, and then divide the
> current and previous estimate to calculate the number of new groups.

While reviewing this patch I made a number of changes (see cleanup.txt). 
Maybe they will be useful.
As I see it, the code implementing the main idea is quite stable. The 
doubts are localized in the cost estimation routine. Maybe we could 
finish this work by implementing a conservative strategy for the sort 
cost estimation?


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e617e2ce0e..211ae66b33 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1865,7 +1865,7 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  *
  * For multi-column sorts we need to estimate the number of comparisons for
  * each individual column - for example with columns (c1, c2, ..., ck) we
- * can estimate that number of comparions on ck is roughly
+ * can estimate that number of comparisons on ck is roughly
  *
  * ncomparisons(c1, c2, ..., ck) / ncomparisons(c1, c2, ..., c(k-1))
  *
@@ -1874,10 +1874,10 @@ get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
  *
  * N * sum( Fk * log(Gk) )
  *
- * Note: We also consider column witdth, not just the comparator cost.
+ * Note: We also consider column width, not just the comparator cost.
  *
  * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  In this case, it will fallback to
+ * can't conveniently supply the sort keys. In this case, it will fallback to
  * simple comparison cost estimate.
  */
 static Cost
@@ -1925,13 +1925,13 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 	 */
 	foreach(lc, pathkeys)
 	{
-		PathKey				*pathkey = (PathKey*)lfirst(lc);
+		PathKey				*pathkey = (PathKey*) lfirst(lc);
 		EquivalenceMember	*em;
 		double				 nGroups,
 							 correctedNGroups;
 
 		/*
-		 * We believe than equivalence members aren't very  different, so, to
+		 * We believe than equivalence members aren't very different, so, to
 		 * estimate cost we take just first member
 		 */
 		em = (EquivalenceMember *) linitial(pathkey->pk_eclass->ec_members);
@@ -1964,7 +1964,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 
 		totalFuncCost += funcCost;
 
-		/* remeber if we have a fake var in pathkeys */
+		/* Remember if we have a fake var in pathkeys */
 		has_fake_var |= is_fake_var(em->em_expr);
 		pathkeyExprs = lappend(pathkeyExprs, em->em_expr);
 
@@ -1974,7 +1974,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 		 */
 		if (has_fake_var == false)
 			/*
-			 * Recursively compute number of group in group from previous step
+			 * Recursively compute number of groups in a group from previous step
 			 */
 			nGroups = estimate_num_groups_incremental(root, pathkeyExprs,
 													  tuplesPerPrevGroup, NULL, NULL,
@@ -1992,8 +1992,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 		 *
 		 * XXX What's the logic of the following formula?
 		 */
-			nGroups = ceil(2.0 + sqrt(tuples) *
-							list_length(pathkeyExprs) / list_length(pathkeys));
+			nGroups = ceil(2.0 + sqrt(tuples) * (i + 1) / list_length(pathkeys));
 		else
 			nGroups = tuples;
 
@@ -2033,7 +2032,7 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys,
 
 		/*
 		 * We could skip all following columns for cost estimation, because we
-		 * believe that tuples are unique by set ot previous columns
+		 * believe that tuples are unique by the set of previous columns
 		 */
 		if (tuplesPerPrevGroup <= 1.0)
 			break;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 82831a4fa4..707b5ba75b 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -544,9 +544,10 @@ PathkeyMutatorNext(PathkeyMutatorState *state)
 	return state->elemsList;
 }
 
-typedef struct 

Re: Global snapshots

2021-11-19 Thread Andrey V. Lepikhov

The patch in the previous letter is full of faults. Please use the new 
version.
Also, here we fixed the problem of losing the CSN value in a parallel 
worker (TAP test 003_parallel_safe.pl). Thanks to a.pyhalov for 
detecting the problem and for the bugfix.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 7aa57724fc42b8ca7054f9b6edfa33c0cffb24bf Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 17 Nov 2021 11:13:37 +0500
Subject: [PATCH] Add Commit Sequence Number (CSN) machinery into MVCC
 implementation for a timestamp-based resolving of visibility conflicts.

It allows to achieve proper snapshot isolation semantics in the case
of distributed transactions involving more than one Postgres instance.

Authors: K.Knizhnik, S.Kelvich, A.Sher, A.Lepikhov, M.Usama.

Discussion:
(2020/05/21 -)
https://www.postgresql.org/message-id/flat/CA%2Bfd4k6HE8xLGEvqWzABEg8kkju5MxU%2Bif7bf-md0_2pjzXp9Q%40mail.gmail.com#ed1359340871688bed2e643921f73365
(2018/05/01 - 2019/04/21)
https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
---
 doc/src/sgml/config.sgml  |  50 +-
 src/backend/access/rmgrdesc/Makefile  |   1 +
 src/backend/access/rmgrdesc/csnlogdesc.c  |  95 +++
 src/backend/access/rmgrdesc/xlogdesc.c|   6 +-
 src/backend/access/transam/Makefile   |   2 +
 src/backend/access/transam/csn_log.c  | 748 ++
 src/backend/access/transam/csn_snapshot.c | 687 
 src/backend/access/transam/rmgr.c |   1 +
 src/backend/access/transam/twophase.c | 154 
 src/backend/access/transam/varsup.c   |   2 +
 src/backend/access/transam/xact.c |  32 +
 src/backend/access/transam/xlog.c |  23 +-
 src/backend/access/transam/xloginsert.c   |   2 +
 src/backend/commands/vacuum.c |   3 +-
 src/backend/replication/logical/snapbuild.c   |   4 +
 src/backend/storage/ipc/ipci.c|   6 +
 src/backend/storage/ipc/procarray.c   |  85 ++
 src/backend/storage/lmgr/lwlock.c |   2 +
 src/backend/storage/lmgr/lwlocknames.txt  |   2 +
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/storage/sync/sync.c   |   5 +
 src/backend/utils/misc/guc.c  |  37 +
 src/backend/utils/probes.d|   2 +
 src/backend/utils/time/snapmgr.c  | 183 -
 src/bin/initdb/initdb.c   |   3 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/bin/pg_upgrade/pg_upgrade.c   |   5 +
 src/bin/pg_upgrade/pg_upgrade.h   |   2 +
 src/bin/pg_waldump/rmgrdesc.c |   1 +
 src/include/access/csn_log.h  |  98 +++
 src/include/access/csn_snapshot.h |  54 ++
 src/include/access/rmgrlist.h |   1 +
 src/include/access/xlog_internal.h|   2 +
 src/include/catalog/pg_control.h  |   1 +
 src/include/catalog/pg_proc.dat   |  17 +
 src/include/datatype/timestamp.h  |   3 +
 src/include/fmgr.h|   1 +
 src/include/portability/instr_time.h  |  10 +
 src/include/storage/lwlock.h  |   1 +
 src/include/storage/proc.h|  14 +
 src/include/storage/procarray.h   |   7 +
 src/include/storage/sync.h|   1 +
 src/include/utils/snapmgr.h   |   7 +-
 src/include/utils/snapshot.h  |  11 +
 src/test/modules/Makefile |   1 +
 src/test/modules/csnsnapshot/Makefile |  22 +
 .../csnsnapshot/expected/csnsnapshot.out  |   1 +
 src/test/modules/csnsnapshot/t/001_base.pl| 100 +++
 src/test/modules/csnsnapshot/t/002_standby.pl |  68 ++
 .../csnsnapshot/t/003_parallel_safe.pl|  67 ++
 src/test/modules/snapshot_too_old/sto.conf|   1 +
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  28 +
 src/test/regress/expected/sysviews.out|   4 +-
 53 files changed, 2660 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/csnlogdesc.c
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/backend/access/transam/csn_snapshot.c
 create mode 100644 src/include/access/csn_log.h
 create mode 100644 src/include/access/csn_snapshot.h
 create mode 100644 src/test/modules/csnsnapshot/Makefile
 create mode 100644 src/test/modules/csnsnapshot/expected/csnsnapshot.out
 create mode 100644 src/test/modules/csnsnapshot/t/001_base.pl
 create mode 100644 src/test/modules/csnsnapshot/t/002_standby.pl
 create mode 100644 src/test/modules/csnsnapshot/t/003_parallel_safe.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5..f4f6c83fd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9682,8 +9682,56 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
  
 
  

Re: Global snapshots

2021-11-17 Thread Andrey V. Lepikhov
Here is the next version of the CSN implementation in snapshots, to 
achieve proper snapshot isolation in the case of a cross-instance 
distributed transaction.
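The core visibility rule behind the patch can be sketched in a few lines. This is a conceptual model only (hypothetical names, no WAL, no in-progress states), not the patch's C code:

```python
# CSN-based visibility: a committed transaction is visible to a snapshot
# iff its commit sequence number is at or below the snapshot's CSN.  With
# one CSN ordering shared by all nodes, every participant resolves a
# distributed commit identically.

commit_csn = {}                     # xid -> CSN assigned at commit time

def commit(xid, csn):
    commit_csn[xid] = csn

def visible(xid, snapshot_csn):
    csn = commit_csn.get(xid)
    return csn is not None and csn <= snapshot_csn

commit(100, csn=10)
commit(101, csn=20)
snap = 15                           # snapshot taken between the two commits
assert visible(100, snap)
assert not visible(101, snap)       # identical answer on every node
```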


--
regards,
Andrey Lepikhov
Postgres Professional
>From bbb7dd1d7621c091f11e697d3d894fe7a36918a6 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 17 Nov 2021 11:13:37 +0500
Subject: [PATCH] Add Commit Sequence Number (CSN) machinery into MVCC
 implementation for a timestamp-based resolving of visibility conflicts.

It allows to achieve proper snapshot isolation semantics in the case
of distributed transactions involving more than one Postgres instance.

Authors: K.Knizhnik, S.Kelvich, A.Sher, A.Lepikhov, M.Usama.

Discussion:
(2020/05/21 -)
https://www.postgresql.org/message-id/flat/CA%2Bfd4k6HE8xLGEvqWzABEg8kkju5MxU%2Bif7bf-md0_2pjzXp9Q%40mail.gmail.com#ed1359340871688bed2e643921f73365
(2018/05/01 - 2019/04/21)
https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
---
 doc/src/sgml/config.sgml  |  50 +-
 src/backend/access/rmgrdesc/Makefile  |   1 +
 src/backend/access/rmgrdesc/csnlogdesc.c  |  95 +++
 src/backend/access/rmgrdesc/xlogdesc.c|   6 +-
 src/backend/access/transam/Makefile   |   2 +
 src/backend/access/transam/csn_log.c  | 748 ++
 src/backend/access/transam/csn_snapshot.c | 687 
 src/backend/access/transam/rmgr.c |   1 +
 src/backend/access/transam/twophase.c | 154 
 src/backend/access/transam/varsup.c   |   2 +
 src/backend/access/transam/xact.c |  32 +
 src/backend/access/transam/xlog.c |  23 +-
 src/backend/access/transam/xloginsert.c   |   2 +
 src/backend/commands/vacuum.c |   3 +-
 src/backend/storage/ipc/ipci.c|   6 +
 src/backend/storage/ipc/procarray.c   |  85 ++
 src/backend/storage/lmgr/lwlock.c |   2 +
 src/backend/storage/lmgr/lwlocknames.txt  |   2 +
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/storage/sync/sync.c   |   5 +
 src/backend/utils/misc/guc.c  |  37 +
 src/backend/utils/probes.d|   2 +
 src/backend/utils/time/snapmgr.c  | 149 +++-
 src/bin/initdb/initdb.c   |   3 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/bin/pg_upgrade/pg_upgrade.c   |   5 +
 src/bin/pg_upgrade/pg_upgrade.h   |   2 +
 src/bin/pg_waldump/rmgrdesc.c |   1 +
 src/include/access/csn_log.h  |  98 +++
 src/include/access/csn_snapshot.h |  54 ++
 src/include/access/rmgrlist.h |   1 +
 src/include/access/xlog_internal.h|   2 +
 src/include/catalog/pg_control.h  |   1 +
 src/include/catalog/pg_proc.dat   |  17 +
 src/include/datatype/timestamp.h  |   3 +
 src/include/fmgr.h|   1 +
 src/include/portability/instr_time.h  |  10 +
 src/include/storage/lwlock.h  |   1 +
 src/include/storage/proc.h|  14 +
 src/include/storage/procarray.h   |   7 +
 src/include/storage/sync.h|   1 +
 src/include/utils/snapmgr.h   |   7 +-
 src/include/utils/snapshot.h  |  11 +
 src/test/modules/Makefile |   1 +
 src/test/modules/csnsnapshot/Makefile |  25 +
 .../modules/csnsnapshot/csn_snapshot.conf |   1 +
 .../csnsnapshot/expected/csnsnapshot.out  |   1 +
 src/test/modules/csnsnapshot/t/001_base.pl| 103 +++
 src/test/modules/csnsnapshot/t/002_standby.pl |  66 ++
 .../modules/csnsnapshot/t/003_time_skew.pl| 214 +
 .../csnsnapshot/t/004_read_committed.pl   |  97 +++
 .../csnsnapshot/t/005_basic_visibility.pl | 181 +
 src/test/modules/snapshot_too_old/sto.conf|   1 +
 src/test/regress/expected/sysviews.out|   4 +-
 54 files changed, 3024 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/csnlogdesc.c
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/backend/access/transam/csn_snapshot.c
 create mode 100644 src/include/access/csn_log.h
 create mode 100644 src/include/access/csn_snapshot.h
 create mode 100644 src/test/modules/csnsnapshot/Makefile
 create mode 100644 src/test/modules/csnsnapshot/csn_snapshot.conf
 create mode 100644 src/test/modules/csnsnapshot/expected/csnsnapshot.out
 create mode 100644 src/test/modules/csnsnapshot/t/001_base.pl
 create mode 100644 src/test/modules/csnsnapshot/t/002_standby.pl
 create mode 100644 src/test/modules/csnsnapshot/t/003_time_skew.pl
 create mode 100644 src/test/modules/csnsnapshot/t/004_read_committed.pl
 create mode 100644 src/test/modules/csnsnapshot/t/005_basic_visibility.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5..f4f6c83fd0 100644

Make query ID more portable

2021-10-12 Thread Andrey V. Lepikhov

Hi,

QueryID is a good tool for query analysis. I want to improve the core 
jumbling machinery in two ways:
1. The QueryID value should survive a dump/restore of a database (use the 
fully qualified table name instead of the relid).
2. QueryID could represent a more general class of queries: for example, 
it can be independent of the permutation of tables in a FROM clause.


See the attached patch as a POC. The main idea here is to break 
JumbleState down into a 'clocations' part, which is what a post-parse hook 
is really interested in, and a 'context data' part, which is needed to build 
a query or subquery signature (hash) and, I guess, isn't really needed by 
any extensions.


I think it adds little complexity and overhead. It still does not 
guarantee equality of queryid on two instances with an identical schema, 
but the value survives an instance upgrade and allows some query 
analysis on a replica node.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 714111f82569ba827d6387ca3e01e5f364a2c8dd Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Tue, 12 Oct 2021 11:53:50 +0500
Subject: [PATCH] Make queryid more portable. 1. Extract local context from a
 JumbleState. 2. Make a hash value for each range table entry. 3. Make a hash
 signature for each subquery. 4. Use hash instead of rti. 5. Sort hashes of
 range table entries before adding to a context.

TODO:
- Use attnames instead of varattno.
- Use sort of argument hashes at each level of expression jumbling.
---
 contrib/pg_stat_statements/Makefile   |   1 +
 .../t/001_queryid_portability.pl  |  61 
 src/backend/utils/adt/regproc.c   |  25 +-
 src/backend/utils/misc/queryjumble.c  | 319 +++---
 src/include/utils/queryjumble.h   |  20 +-
 src/include/utils/regproc.h   |   1 +
 6 files changed, 299 insertions(+), 128 deletions(-)
 create mode 100644 contrib/pg_stat_statements/t/001_queryid_portability.pl

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38..bef304e7d4 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -17,6 +17,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = pg_stat_statements oldextversions
+TAP_TESTS = 1
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/t/001_queryid_portability.pl b/contrib/pg_stat_statements/t/001_queryid_portability.pl
new file mode 100644
index 00..80f8bb4e93
--- /dev/null
+++ b/contrib/pg_stat_statements/t/001_queryid_portability.pl
@@ -0,0 +1,61 @@
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+
+use Test::More tests => 3;
+
+my ($node1, $node2, $result1, $result2);
+
+$node1 = PostgresNode->new('node1');
+$node1->init;
+$node1->append_conf('postgresql.conf', qq{
+	shared_preload_libraries = 'pg_stat_statements'
+	pg_stat_statements.track = 'all'
+});
+$node1->start;
+
+$node2 = PostgresNode->new('node2');
+$node2->init;
+$node2->append_conf('postgresql.conf', qq{
+	shared_preload_libraries = 'pg_stat_statements'
+	pg_stat_statements.track = 'all'
+});
+$node2->start;
+$node2->safe_psql('postgres', qq{CREATE TABLE a(); DROP TABLE a;});
+
+$node1->safe_psql('postgres', q(CREATE EXTENSION pg_stat_statements));
+$node2->safe_psql('postgres', q(CREATE EXTENSION pg_stat_statements));
+
+$node1->safe_psql('postgres', "
+	SELECT pg_stat_statements_reset();
+	CREATE TABLE a (x int, y varchar);
+	CREATE TABLE b (x int);
+	SELECT * FROM a;"
+);
+$node2->safe_psql('postgres', "
+	SELECT pg_stat_statements_reset();
+	CREATE TABLE a (y varchar, x int);
+	CREATE TABLE b (x int);
+	SELECT * FROM a;
+");
+
+$result1 = $node1->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT * FROM a';");
+$result2 = $node2->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT * FROM a';");
+is($result1, $result2);
+
+$node1->safe_psql('postgres', "SELECT x FROM a");
+$node2->safe_psql('postgres', "SELECT x FROM a");
+$result1 = $node1->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT x FROM a';");
+$result2 = $node2->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT x FROM a';");
+is(($result1 != $result2), 1); # TODO
+
+$node1->safe_psql('postgres', "SELECT * FROM a,b WHERE a.x = b.x;");
+$node2->safe_psql('postgres', "SELECT * FROM b,a WHERE a.x = b.x;");
+$result1 = $node1->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT * FROM a,b WHERE a.x = b.x;'");
+$result2 = $node2->safe_psql('postgres', "SELECT queryid FROM pg_stat_statements WHERE query LIKE 'SELECT * FROM b,a WHERE a.x = b.x;'");
+diag("$result1, \n 

Re: Asymmetric partition-wise JOIN

2021-09-14 Thread Andrey V. Lepikhov

On 9/9/21 8:38 PM, Jaime Casanova wrote:

On Thu, Sep 09, 2021 at 09:50:46AM +, Aleksander Alekseev wrote:

It looks like this patch needs to be updated. According to 
http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the 
status to save time for reviewers.

The new status of this patch is: Waiting on Author


Just to give some more info to work on I found this patch made postgres
crash with a segmentation fault.

"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at 
bitmapset.c:126
126 if (shorter->words[i] != longer->words[i])
"""

attached are the query that triggers the crash and the backtrace.



Thank you for this good catch!
The problem was in the adjust_child_relids_multilevel routine: the 
tmp_result variable sometimes points to the original required_outer.
This patch adds new ways in which the optimizer can generate plans. One 
possible way is that the optimizer reparameterizes an inner path by a 
plain relation from the outer side (perhaps as a result of a join of the 
plain relation and a partitioned relation). In this case we have to 
compare tmp_result with the original pointer to determine whether it was 
changed.
The attached patch fixes this problem. An additional regression test is 
added.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 6976e463e950f91a6a18e9f2630af1c4cb73b94b Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.

Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.

Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.

Change logic of the multilevel child relids adjustment, because this
feature allows the optimizer to plan in new way.
---
 src/backend/optimizer/path/joinpath.c|   9 +
 src/backend/optimizer/path/joinrels.c| 187 +
 src/backend/optimizer/plan/setrefs.c |  17 +-
 src/backend/optimizer/util/appendinfo.c  |  51 ++-
 src/backend/optimizer/util/pathnode.c|   9 +-
 src/backend/optimizer/util/relnode.c |  19 +-
 src/include/optimizer/paths.h|   7 +-
 src/test/regress/expected/partition_join.out | 378 +++
 src/test/regress/sql/partition_join.sql  | 167 
 9 files changed, 808 insertions(+), 36 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6407ede12a..32618ebbd5 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
 			   jointype, );
+
+	/*
+	 * 7. If outer relation is delivered from partition-tables, consider
+	 * distributing inner relation to every partition-leaf prior to
+	 * append these leafs.
+	 */
+	try_asymmetric_partitionwise_join(root, joinrel,
+	  outerrel, innerrel,
+	  jointype, );
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 8b69870cf4..9453258f83 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 }
 
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+		 RelOptInfo *joinrel,
+		 AppendPath *append_path,
+		 RelOptInfo *inner_rel,
+		 JoinType jointype,
+		 JoinPathExtraData *extra)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, append_path->subpaths)
+	{
+		Path			*child_path = lfirst(lc);
+		RelOptInfo		*child_rel = child_path->parent;
+		Relids			child_joinrelids;
+		Relids			parent_relids;
+		RelOptInfo		*child_joinrel;
+		SpecialJoinInfo	*child_sjinfo;
+		List			*child_restrictlist;
+
+		child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+		parent_relids = bms_union(append_path->path.parent->relids,
+  inner_rel->relids);
+
+		child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+			   child_rel->relids,
+			   inner_rel->relids);
+		child_restrictlist = (List *)
+			adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+			  child_joinrelids, 

Re: Increase value of OUTER_VAR

2021-09-14 Thread Andrey V. Lepikhov

On 9/11/21 10:37 PM, Tom Lane wrote:

Aleksander Alekseev  writes:
(v2 below is a rebase up to HEAD; no actual code changes except
for adjusting the definition of IS_SPECIAL_VARNO.)

I have looked at this code. No problems found.
Also, as a test, I used two tables with 1E5 partitions each. I tried a 
plain SELECT, a JOIN, and a join with a plain table. No errors were 
found, only performance issues, but that is a subject for separate research.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Defer selection of asynchronous subplans until the executor initialization stage

2021-08-30 Thread Andrey V. Lepikhov

On 8/23/21 2:18 PM, Etsuro Fujita wrote:

To just execute what was planned at execution time, I think we should
return to the patch in [1].  The patch was created for Horiguchi-san’s
async-execution patch, so I modified it to work with HEAD, and added a
simplified version of your test cases.  Please find attached a patch.
[1] 
https://www.postgresql.org/message-id/7fe10f95-ac6c-c81d-a9d3-227493eb9...@postgrespro.ru
I agree, this way is safer. I tried to find another approach, because 
there is no general solution here: for each plan node we would have to 
implement support for asynchronous behaviour.
But for practical use, with a small set of nodes, it will work well. I 
have no objections to this patch.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Postgres picks suboptimal index after building of an extended statistics

2021-08-13 Thread Andrey V. Lepikhov

On 8/12/21 4:26 AM, Tomas Vondra wrote:

On 8/11/21 2:48 AM, Peter Geoghegan wrote:

On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
 wrote:

Ivan Frolkov reported a problem with choosing a non-optimal index during
a query optimization. This problem appeared after building of an
extended statistics.


Any thoughts on this, Tomas?



Thanks for reminding me, I missed / forgot about this thread.

I agree the current behavior is unfortunate, but I'm not convinced the 
proposed patch is fixing the right place - doesn't this mean the index 
costing won't match the row estimates displayed by EXPLAIN?
I think it is not a problem: in EXPLAIN you will see only 1 row both 
with and without this patch.


I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
and improve the cardinality estimates directly, not just costing for 
index scans.

This idea looks better. I will try to implement it.


Also, is it correct that the patch calculates num_sa_scans only when 
(numIndexTuples >= 0.0)?

Thanks, fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
>From 8a4ad08d61a5d14a45ef5e182f002e918f0eaccc Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 23 Jun 2021 12:05:24 +0500
Subject: [PATCH] In the case of an unique one row btree index scan only one
 row can be returned. In the genericcostestimate() routine we must arrange the
 index selectivity value in accordance with this knowledge.

---
 src/backend/utils/adt/selfuncs.c | 78 
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..9538c4a5b4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6369,15 +6369,9 @@ genericcostestimate(PlannerInfo *root,
 	double		num_scans;
 	double		qual_op_cost;
 	double		qual_arg_cost;
-	List	   *selectivityQuals;
 	ListCell   *l;
 
-	/*
-	 * If the index is partial, AND the index predicate with the explicitly
-	 * given indexquals to produce a more accurate idea of the index
-	 * selectivity.
-	 */
-	selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+	numIndexTuples = costs->numIndexTuples;
 
 	/*
 	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
@@ -6398,36 +6392,59 @@ genericcostestimate(PlannerInfo *root,
 		}
 	}
 
-	/* Estimate the fraction of main-table tuples that will be visited */
-	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-			  index->rel->relid,
-			  JOIN_INNER,
-			  NULL);
+	if (numIndexTuples >= 0.0)
+	{
+		List		*selectivityQuals;
 
-	/*
-	 * If caller didn't give us an estimate, estimate the number of index
-	 * tuples that will be visited.  We do it in this rather peculiar-looking
-	 * way in order to get the right answer for partial indexes.
-	 */
-	numIndexTuples = costs->numIndexTuples;
-	if (numIndexTuples <= 0.0)
+		/*
+		 * If the index is partial, AND the index predicate with the explicitly
+		 * given indexquals to produce a more accurate idea of the index
+		 * selectivity.
+		 */
+		selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+
+		/* Estimate the fraction of main-table tuples that will be visited */
+		indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+  index->rel->relid,
+  JOIN_INNER,
+  NULL);
+
+		/*
+		 * If caller didn't give us an estimate, estimate the number of index
+		 * tuples that will be visited.  We do it in this rather peculiar-looking
+		 * way in order to get the right answer for partial indexes.
+		 */
+		if (numIndexTuples == 0.0)
+		{
+			numIndexTuples = indexSelectivity * index->rel->tuples;
+
+			/*
+			 * The above calculation counts all the tuples visited across all
+			 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
+			 * average per-indexscan number, so adjust.  This is a handy place to
+			 * round to integer, too.  (If caller supplied tuple estimate, it's
+			 * responsible for handling these considerations.)
+			 */
+			numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		}
+	}
+	else
 	{
-		numIndexTuples = indexSelectivity * index->rel->tuples;
+		Assert(numIndexTuples == -1.0);
 
 		/*
-		 * The above calculation counts all the tuples visited across all
-		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-		 * average per-indexscan number, so adjust.  This is a handy place to
-		 * round to integer, too.  (If caller supplied tuple estimate, it's
-		 * responsible for handling these considerations.)
+		 * Unique one row scan can select no more than one row. It needs to
+		 * manually set the selectivity of the index. The value of numIndexTuples
+		 * will be corrected later.
 		 */
-		numIndexTuples = rint(numIndexTuples / num_sa_scans);
+		indexSelectivity = 1.0 / index->rel->tuples;
 	}
 
 	/*
 	 * We can bound the numb

Extra code in commit_ts.h

2021-08-03 Thread Andrey V. Lepikhov

Hi,

I found two unneeded code lines in commit_ts.h (see the attached patch).
They confused me while exploring the code. If they are still needed, 
maybe some comments should be added?


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index e045dd416f..a1538978c6 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -15,14 +15,10 @@
 #include "datatype/timestamp.h"
 #include "replication/origin.h"
 #include "storage/sync.h"
-#include "utils/guc.h"
 
 
 extern PGDLLIMPORT bool track_commit_timestamp;
 
-extern bool check_track_commit_timestamp(bool *newval, void **extra,
-		 GucSource source);
-
 extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
 		   TransactionId *subxids, TimestampTz timestamp,
 		   RepOriginId nodeid);


Re: Removing unneeded self joins

2021-07-26 Thread Andrey V. Lepikhov

On 7/16/21 12:28 AM, Zhihong Yu wrote:



On Thu, Jul 15, 2021 at 8:25 AM Zhihong Yu <mailto:z...@yugabyte.com>> wrote:

bq. We can proof the uniqueness

proof -> prove

Fixed


1. Collect all mergejoinable join quals looks like a.x = b.x

  quals looks like -> quals which look like

For update_ec_sources(), the variable cc is not used.

Fixed

+   *otherjoinquals = rjoinquals;

Maybe rename rjoinquals as ojoinquals to align with the target variable 
name.

Agreed, fixed


+   int k; /* Index of kept relation */
+   int r = -1; /* Index of removed relation */

Naming k as kept, r as removed would make the code more readable (remain 
starts with r but has opposite meaning).
I think it is correct now: k is the index of the inner (kept) relation, 
and r of the outer (removed) relation.


+               if (bms_is_member(r, info->syn_righthand) &&
+                   !bms_is_member(k, info->syn_righthand))
+                   jinfo_check = false;
+
+               if (!jinfo_check)
+                   break;

There are 4 if statements where jinfo_check is assigned false. Once 
jinfo_check is assigned, we can break out of the loop - instead of 
checking the remaining conditions.

Fixed


+           else if (!innerrel_is_unique(root, joinrelids, outer->relids,

nit: the 'else' is not needed since the if block above it goes to next 
iteration of the loop.

Fixed


+           /* See for row marks. */
+           foreach (lc, root->rowMarks)

It seems once imark and omark are set, we can come out of the loop.

Maybe you are right. Fixed.

--
regards,
Andrey Lepikhov
Postgres Professional
>From e8b4047aa71c808fa5799b2739b2ae0ab7b6d7e3 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 15 Jul 2021 15:26:13 +0300
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if we can prove that the join
can be replaced with a scan. We can prove the uniqueness
using the existing innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then outer row matches only the same row from the inner
relation. So proved, that this join is self-join and can be replaced by
a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 886 +-
 src/backend/optimizer/plan/planmain.c |   5 +
 src/backend/optimizer/util/joininfo.c |   3 +
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/include/optimizer/pathnode.h  |   4 +
 src/include/optimizer/planmain.h  |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out| 399 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/equivclass.sql   |  16 +
 src/test/regress/sql/join.sql | 189 +
 12 files changed, 1546 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 37eb64bcef..eb9d83b424 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -32,10 +33,12 @@
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
 
+bool enable_self_join_removal;
+
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
-  Relids joinrelids);
+  Relids joinrelids, int subst_relid);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
@@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
+static Bitmapset* change_relid(Relids relids, Index oldId, Index newId);
+static void change_varno(Expr *expr, Index oldRelid, Index newRelid);
 
 
 /*
@@ -86,7 +92,7 @@ restart:
 
 		remove_rel_from_query(root, innerrel

Postgres picks suboptimal index after building of an extended statistics

2021-06-23 Thread Andrey V. Lepikhov

Hi,

Ivan Frolkov reported a problem with choosing a non-optimal index during 
query optimization. This problem appeared after extended statistics 
were built.


I prepared a test case (see t.sql in the attachment).
To reproduce this case we need a composite primary key index and one 
other index.
Before the extended statistics are created, a SELECT from the table 
chooses the PK index and returns only one row. But afterwards, the same 
SELECT picks the alternative index and fetches and filters many tuples.


The problem is related to a corner case in the btree cost estimation
procedure: if Postgres detects a unique one-row index scan, it sets
numIndexTuples to 1.0.

But the selectivity is calculated as usual, by the 
clauselist_selectivity() routine, and can have a value much larger than 
the one corresponding to a single tuple. This selectivity value is used 
later in the code to calculate the number of fetched tuples and can lead 
to choosing a suboptimal index.


The attached patch is my suggestion to fix this problem.

--
regards,
Andrey Lepikhov
Postgres Professional


t.sql
Description: application/sql
From 810eb4691acec26a07d8fa67bedb7a7381c31824 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 23 Jun 2021 12:05:24 +0500
Subject: [PATCH] In the case of an unique one row btree index scan only one
 row can be returned. In the genericcostestimate() routine we must arrange the
 index selectivity value in accordance with this knowledge.

---
 src/backend/utils/adt/selfuncs.c | 106 ++-
 1 file changed, 62 insertions(+), 44 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..91160960aa 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6369,65 +6369,82 @@ genericcostestimate(PlannerInfo *root,
double  num_scans;
double  qual_op_cost;
double  qual_arg_cost;
-   List   *selectivityQuals;
-   ListCell   *l;
-
-   /*
-* If the index is partial, AND the index predicate with the explicitly
-* given indexquals to produce a more accurate idea of the index
-* selectivity.
-*/
-   selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
 
-   /*
-* Check for ScalarArrayOpExpr index quals, and estimate the number of
-* index scans that will be performed.
-*/
+   numIndexTuples = costs->numIndexTuples;
num_sa_scans = 1;
-   foreach(l, indexQuals)
+
+   if (numIndexTuples >= 0.0)
{
-   RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+   List*selectivityQuals;
+   ListCell*l;
 
-   if (IsA(rinfo->clause, ScalarArrayOpExpr))
+   /*
+* If the index is partial, AND the index predicate with the explicitly
+* given indexquals to produce a more accurate idea of the index
+* selectivity.
+*/
+   selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
+
+   /*
+* Check for ScalarArrayOpExpr index quals, and estimate the number of
+* index scans that will be performed.
+*/
+   foreach(l, indexQuals)
	{
-   ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
-   int alength = estimate_array_length(lsecond(saop->args));
+   RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
 
-   if (alength > 1)
-   num_sa_scans *= alength;
+   if (IsA(rinfo->clause, ScalarArrayOpExpr))
+   {
+   ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+   int alength = estimate_array_length(lsecond(saop->args));
+
+   if (alength > 1)
+   num_sa_scans *= alength;
+   }
	}
-   }

-   /* Estimate the fraction of main-table tuples that will be visited */
-   indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-     index->rel->relid,
-     JOIN_INNER,
-     NULL);
+   /* Estimate the fraction of main-table tuples that will be visited */
+   indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+     index->rel->relid,
+   

Re: Removing unneeded self joins

2021-05-27 Thread Andrey V. Lepikhov

On 5/8/21 2:00 AM, Hywel Carver wrote:
On Fri, May 7, 2021 at 8:23 AM Andrey Lepikhov 
mailto:a.lepik...@postgrespro.ru>> wrote:

Here I didn't work on 'unnecessary IS NOT NULL filter'.

I've tested the new patch, and it is giving the same improved behaviour 
as the old patch.

Thank you for these efforts!

I cleaned up the code of the previous version, improved the regression 
tests, and rebased onto the current master.


Also, I see that we could do additional optimizations for an 
EC-generated self-join clause (see equivclass.patch for the necessary 
changes). Example:

explain (costs off)
select * from sj t1, sj t2 where t1.a = t1.b and t1.b = t2.b and t2.b = 
t2.a;

 QUERY PLAN
-
 Seq Scan on sj t2
   Filter: ((a IS NOT NULL) AND (b = a) AND (a = b))
(2 rows)

But I'm not sure that this patch needs to be part of the self-join 
removal feature, because of the code complexity.


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 6f1abbe47d..12a1d390b7 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1623,8 +1623,21 @@ generate_join_implied_equalities_normal(PlannerInfo *root,
 		EquivalenceMember *best_inner_em = NULL;
 		Oid			best_eq_op = InvalidOid;
 		int			best_score = -1;
+		int			max_score = 3;
 		RestrictInfo *rinfo;
 
+		/* The case of possible self-join */
+		if (bms_num_members(outer_relids) == 1 &&
+			bms_num_members(inner_relids) == 1)
+		{
+			int orel = bms_next_member(outer_relids, -1);
+			int irel = bms_next_member(inner_relids, -1);
+
+			if (root->simple_rte_array[irel]->relid ==
+root->simple_rte_array[orel]->relid)
+max_score = 4;
+		}
+
 		foreach(lc1, outer_members)
 		{
 			EquivalenceMember *outer_em = (EquivalenceMember *) lfirst(lc1);
@@ -1653,17 +1666,31 @@ generate_join_implied_equalities_normal(PlannerInfo *root,
 if (op_hashjoinable(eq_op,
 	exprType((Node *) outer_em->em_expr)))
 	score++;
+if (score == 3 && score < max_score)
+{
+	/* Look for self-join clause */
+	Var *outer_var = (Var *) (IsA(outer_em->em_expr, Var) ?
+	outer_em->em_expr :
+	((RelabelType *) outer_em->em_expr)->arg);
+	Var *inner_var = (Var *) (IsA(inner_em->em_expr, Var) ?
+	inner_em->em_expr :
+	((RelabelType *) inner_em->em_expr)->arg);
+
+	if (outer_var->varattno == inner_var->varattno)
+		score++;
+}
+
 if (score > best_score)
 {
 	best_outer_em = outer_em;
 	best_inner_em = inner_em;
 	best_eq_op = eq_op;
 	best_score = score;
-	if (best_score == 3)
+	if (best_score == max_score)
 		break;	/* no need to look further */
 }
 			}
-			if (best_score == 3)
+			if (best_score == max_score)
 break;			/* no need to look further */
 		}
 		if (best_score < 0)
>From 836049b1467ded2f257ffe1844e5656b3f273d6c Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 28 Apr 2021 18:27:53 +0500
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if we can prove that the join
can be replaced with a scan. We can prove the uniqueness
using the existing innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals that look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced by
a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 890 +-
 src/backend/optimizer/plan/planmain.c |   5 +
 src/backend/optimizer/util/joininfo.c |   3 +
 src/backend/optimizer/util/relnode.c  |  26 +-
 src/backend/utils/misc/guc.c  |  10 +
 src/include/optimizer/pathnode.h  |   4 +
 src/include/optimizer/planmain.h  |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out| 399 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/equivclass.sql   |  16 +
 src/test/regress/sql/join.sql | 189 +
 12 files changed, 1550 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 37eb64bcef..a8e638f6e7 100644
--- a/src/backend/opt

Re: Asymmetric partition-wise JOIN

2021-04-29 Thread Andrey V. Lepikhov

On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest. 
I have returned the patch to the commitfest. My current reason differs from 
that of the original author.
This patch can open the door to more complex optimizations in the 
partitionwise join push-down technique.
I mean, we can push down a join not only of two partitioned tables with 
the same partition schema, but also of a partitioned (sharded) table with an 
arbitrary subplan that is provably independent of local resources.


Example:

CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);

SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s
WHERE p.a=s.a;

 Hash Join
   Hash Cond: (p.a = a.a)
   ->  Append
 ->  Seq Scan on p1 p_1
 ->  Seq Scan on p2 p_2
 ->  Seq Scan on p3 p_3
   ->  Hash
 ->  Function Scan on generate_series a

But with asymmetric join feature we have the plan:

 Append
   ->  Hash Join
 Hash Cond: (p_1.a = a.a)
 ->  Seq Scan on p1 p_1
 ->  Hash
   ->  Function Scan on generate_series a
   ->  Hash Join
 Hash Cond: (p_2.a = a.a)
 ->  Seq Scan on p2 p_2
 ->  Hash
   ->  Function Scan on generate_series a
   ->  Hash Join
 Hash Cond: (p_3.a = a.a)
 ->  Seq Scan on p3 p_3
 ->  Hash
   ->  Function Scan on generate_series a

In the case of FDW sharding, this means that if we can prove that the inner 
relation is independent of the execution server, we can push down 
these joins and execute them in parallel.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Asynchronous Append on postgres_fdw nodes.

2021-04-27 Thread Andrey V. Lepikhov

On 4/23/21 8:12 AM, Etsuro Fujita wrote:

On Thu, Apr 22, 2021 at 12:30 PM Etsuro Fujita  wrote:
I have committed the patch.


One more question. Append chooses async subplans at Append plan 
creation time.
Later, the planner performs some optimizations, such as eliminating 
trivial Subquery nodes. So an async Append is impossible in some 
situations, for example:


(SELECT * FROM f1 WHERE a < 10)
  UNION ALL
(SELECT * FROM f2 WHERE a < 10);

But it works for the query:

SELECT *
  FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
WHERE a < 10;

As far as I understand, this is not a hard limit. We can choose async 
subplans at the beginning of the execution stage.

For a demo, I prepared the patch (see in attachment).
It solves the problem and passes the regression tests.
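The idea can be sketched with a toy model (assumptions mine, not executor code): each subplan carries an async_capable flag set at plan-creation time, and ExecInitAppend-style logic at executor startup clears the flags whenever async execution cannot help — here, the degenerate single-subplan case. The dictionary-based "subplan" representation is purely illustrative.

```python
def init_append(subplans):
    # Deferred decision: only keep async execution when the Append
    # actually has more than one subplan to interleave.
    consider_async = len(subplans) > 1
    if not consider_async:
        for sp in subplans:
            sp["async_capable"] = False
    # Return the subplans that will actually run asynchronously.
    return [sp for sp in subplans if sp["async_capable"]]
```

Because the check runs after all planner simplifications (such as trivial-subquery elimination), the UNION ALL form of the query above would still qualify for async execution.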

--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a960ada441..655e743c6e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1246,6 +1246,7 @@ postgresGetForeignPlan(PlannerInfo *root,
 	bool		has_final_sort = false;
 	bool		has_limit = false;
 	ListCell   *lc;
+	ForeignScan *fsplan;
 
 	/*
 	 * Get FDW private data created by postgresGetForeignUpperPaths(), if any.
@@ -1430,7 +1431,7 @@ postgresGetForeignPlan(PlannerInfo *root,
 	 * field of the finished plan node; we can't keep them in private state
 	 * because then they wouldn't be subject to later planner processing.
 	 */
-	return make_foreignscan(tlist,
+	fsplan = make_foreignscan(tlist,
 			local_exprs,
 			scan_relid,
 			params_list,
@@ -1438,6 +1439,13 @@ postgresGetForeignPlan(PlannerInfo *root,
 			fdw_scan_tlist,
 			fdw_recheck_quals,
 			outer_plan);
+
+	/* If appropriate, consider participation in async operations */
+	fsplan->scan.plan.async_capable = (enable_async_append &&
+	   best_path->path.pathkeys == NIL &&
+	   !fsplan->scan.plan.parallel_safe &&
+	   is_async_capable_path((Path *)best_path));
+	return fsplan;
 }
 
 /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b3726a54f3..4e70f4eb54 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -524,6 +524,9 @@ ExecSupportsBackwardScan(Plan *node)
 	if (node->parallel_aware)
 		return false;
 
+	if (node->async_capable)
+		return false;
+
 	switch (nodeTag(node))
 	{
 		case T_Result:
@@ -536,10 +539,6 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 ListCell   *l;
 
-/* With async, tuples may be interleaved, so can't back up. */
-if (((Append *) node)->nasyncplans > 0)
-	return false;
-
 foreach(l, ((Append *) node)->appendplans)
 {
 	if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 3c1f12adaf..363cf9f4a5 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -117,6 +117,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	int			firstvalid;
 	int			i,
 j;
+	ListCell   *l;
+	bool		consider_async = false;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & EXEC_FLAG_MARK));
@@ -197,6 +199,23 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendplanstates = (PlanState **) palloc(nplans *
 			 sizeof(PlanState *));
 
+	/* If appropriate, consider async append */
+	consider_async = (list_length(node->appendplans) > 1);
+
+	if (!consider_async)
+	{
+		foreach(l, node->appendplans)
+		{
+			Plan *subplan = (Plan *) lfirst(l);
+
+			/* Check to see if subplan can be executed asynchronously */
+			if (subplan->async_capable)
+			{
+subplan->async_capable = false;
+			}
+		}
+	}
+
 	/*
 	 * call ExecInitNode on each of the valid plans to be executed and save
 	 * the results into the appendplanstates array.
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 632cc31a04..f7302ccf28 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,7 +242,6 @@ _copyAppend(const Append *from)
 	 */
 	COPY_BITMAPSET_FIELD(apprelids);
 	COPY_NODE_FIELD(appendplans);
-	COPY_SCALAR_FIELD(nasyncplans);
 	COPY_SCALAR_FIELD(first_partial_plan);
 	COPY_NODE_FIELD(part_prune_info);
 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c723f6d635..665cdf3add 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -432,7 +432,6 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_BITMAPSET_FIELD(apprelids);
 	WRITE_NODE_FIELD(appendplans);
-	WRITE_INT_FIELD(nasyncplans);
 	WRITE_INT_FIELD(first_partial_plan);
 	WRITE_NODE_FIELD(part_prune_info);
 }
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3746668f52..9e3822f7db 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1716,7 +1716,6 @@ _readAppend(void)
 
 	

Re: Asynchronous Append on postgres_fdw nodes.

2021-04-26 Thread Andrey V. Lepikhov

On 4/23/21 8:12 AM, Etsuro Fujita wrote:

I have committed the patch.
I found a small mistake: if no tuple was received from a foreign 
partition, EXPLAIN shows that the node was never executed. For example,

if we have 0 tuples in f1 and 100 tuples in f2:

Query:
EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
LIMIT 101;

Explain:
 Limit (actual rows=100 loops=1)
   Output: f1.a
   ->  Append (actual rows=100 loops=1)
 ->  Async Foreign Scan on public.f1 (never executed)
   Output: f1.a
   Remote SQL: SELECT a FROM public.l1
 ->  Async Foreign Scan on public.f2 (actual rows=100 loops=1)
   Output: f2.a
   Remote SQL: SELECT a FROM public.l2

The patch in the attachment fixes this.
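A toy illustration (not PostgreSQL internals) of why EXPLAIN printed "(never executed)": a node's instrumentation records an execution only when the node is actually entered, so the fix pulls one (empty) tuple through the child node even when the remote scan returned no rows, bumping its loop counter. The class below is a made-up stand-in for the real Instrumentation struct.

```python
class NodeInstr:
    def __init__(self):
        self.nloops = 0

    def exec_proc(self):
        # Entering the node counts as an execution, even if it
        # immediately returns an empty slot (no more tuples).
        self.nloops += 1
        return None

    def explain(self):
        return "never executed" if self.nloops == 0 else f"loops={self.nloops}"
```

After the fix, the empty f1 scan would report "actual rows=0 loops=1" instead of "(never executed)".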

--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e201b5404e..a960ada441 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6857,8 +6857,13 @@ produce_tuple_asynchronously(AsyncRequest *areq, bool fetch)
 		}
 		else
 		{
-			/* There's nothing more to do; just return a NULL pointer */
-			result = NULL;
+			/*
+			 * There's nothing more to do; just check it and get an empty slot
+			 * from the child node.
+			 */
+			result = ExecProcNode((PlanState *) node);
+			Assert(TupIsNull(result));
+
 			/* Mark the request as complete */
 			ExecAsyncRequestDone(areq, result);
 		}


Re: Asynchronous Append on postgres_fdw nodes.

2021-04-26 Thread Andrey V. Lepikhov

On 4/23/21 8:12 AM, Etsuro Fujita wrote:

I have committed the patch.
While studying the capabilities of AsyncAppend, I noticed an 
inconsistency with the cost model of the optimizer:


async_capable = off:

Append  (cost=100.00..695.00 ...)
   ->  Foreign Scan on f1 part_1  (cost=100.00..213.31 ...)
   ->  Foreign Scan on f2 part_2  (cost=100.00..216.07 ...)
   ->  Foreign Scan on f3 part_3  (cost=100.00..215.62 ...)

async_capable = on:
-------------------
Append  (cost=100.00..695.00 ...)
   ->  Async Foreign Scan on f1 part_1  (cost=100.00..213.31 ...)
   ->  Async Foreign Scan on f2 part_2  (cost=100.00..216.07 ...)
   ->  Async Foreign Scan on f3 part_3  (cost=100.00..215.62 ...)


Here I see two problems:
1. The cost of an async Append is the same as the cost of a plain Append, 
but the execution time of the async Append over three remote partitions is 
less than half.

2. The cost of an async Append is simply the sum of the child ForeignScan costs.

I have no idea whether this is a problem right now. But I can imagine 
it becoming a problem in the future if we have alternative paths: a complex 
pushdown in synchronous mode (a few rows to return) versus a simple 
asynchronous append with a large set of rows to return.
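For illustration only (the formulas and the overhead constant are my assumptions, not the optimizer's): today the Append cost is essentially the sum of the child costs whether or not the children run asynchronously, while an async-aware model might be closer to the cost of the slowest child plus some per-child overhead.

```python
def sync_append_cost(child_costs):
    # Current behaviour: children run one after another.
    return sum(child_costs)

def hypothetical_async_cost(child_costs, per_child_overhead=10.0):
    # Hypothetical model: children overlap, so total time is dominated
    # by the slowest child plus some bookkeeping per child.
    return max(child_costs) + per_child_overhead * len(child_costs)
```

With the three child costs from the plan above (~213-216 each), the first model gives roughly the 695 shown, while the second would be well under half of that — matching the observed run-time gap.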


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

2021-04-09 Thread Andrey V. Lepikhov

On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned 
with feedback. Feel free to resubmit an updated version to a future 
commitfest.


The attached version is rebased on current master and fixes problems with 
complex parameterized plans (the 'reparameterize by child' feature).
Problems with the reparameterization machinery can be demonstrated with the 
TPC-H benchmark.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 6a15a52bfb90659c51b3a918d48037c474ffe9dd Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.

Teach the optimizer to consider a partitionwise join of a non-partitioned
table with each partition of a partitioned table.
This technique causes changes to the 'reparameterize by child' machinery.
---
 src/backend/optimizer/path/joinpath.c|   9 +
 src/backend/optimizer/path/joinrels.c| 151 ++
 src/backend/optimizer/util/appendinfo.c  |  28 ++-
 src/backend/optimizer/util/pathnode.c|   9 +-
 src/backend/optimizer/util/relnode.c |  14 +-
 src/include/optimizer/paths.h|   7 +-
 src/test/regress/expected/partition_join.out | 209 +++
 src/test/regress/sql/partition_join.sql  |  99 +
 8 files changed, 509 insertions(+), 17 deletions(-)

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index e9b6968b1d..6ba6d32ae4 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
 			   jointype, );
+
+	/*
+	 * 7. If the outer relation is derived from a partitioned table, consider
+	 * distributing the inner relation to every partition leaf prior to
+	 * appending these leaves.
+	 */
+	try_asymmetric_partitionwise_join(root, joinrel,
+	  outerrel, innerrel,
+	  jointype, );
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..6f900475bb 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1551,6 +1552,156 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 }
 
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+		 RelOptInfo *joinrel,
+		 AppendPath *append_path,
+		 RelOptInfo *inner_rel,
+		 JoinType jointype,
+		 JoinPathExtraData *extra)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, append_path->subpaths)
+	{
+		Path			*child_path = lfirst(lc);
+		RelOptInfo		*child_rel = child_path->parent;
+		Relids			child_join_relids;
+		RelOptInfo		*child_join_rel;
+		SpecialJoinInfo	*child_sjinfo;
+		List			*child_restrictlist;
+		AppendRelInfo	**appinfos;
+		intnappinfos;
+
+		child_join_relids = bms_union(child_rel->relids,
+	  inner_rel->relids);
+		appinfos = find_appinfos_by_relids(root, child_join_relids,
+		   );
+		child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+			   child_rel->relids,
+			   inner_rel->relids);
+		child_restrictlist = (List *)
+			adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+   nappinfos, appinfos);
+		pfree(appinfos);
+
+		child_join_rel = find_join_rel(root, child_join_relids);
+		if (!child_join_rel)
+		{
+			child_join_rel = build_child_join_rel(root,
+  child_rel,
+  inner_rel,
+  joinrel,
+  child_restrictlist,
+  child_sjinfo,
+  jointype);
+			if (!child_join_rel)
+			{
+/*
+ * If can't build JOIN between inner relation and one of the outer
+ * partitions - return immediately.
+ */
+return NIL;
+			}
+		}
+		else
+		{
+			/*
+			 * TODO:
+			 * Can't imagine situation when join relation already exists. But in
+			 * the 'partition_join' regression test it happens.
+			 * It may be an indicator of possible problems.
+			 */
+		}
+
+		populate_joinrel_with_paths(root,
+	child_rel,
+	inner_rel,
+	child_join_rel,
+	child_sjinfo,
+	child_restrictlist);
+
+		/* Give up if asymmetric partition-wise join is not available */
+		if (child_join_rel->pathlist == NIL)
+			return NIL;
+
+		set_cheapest(child_join_rel);
+		result = lappend(result, child_join_rel);
+	}
+	return result;
+}
+
+void

Re: Increase value of OUTER_VAR

2021-04-07 Thread Andrey V. Lepikhov

On 4/8/21 8:13 AM, Tom Lane wrote:

I wrote:

Peter Eisentraut  writes:

Can we move forward with this?



We could just push the change and see what happens.  But I was thinking
more in terms of doing that early in the v15 cycle.  I remain skeptical
that we need a near-term fix.


To make sure we don't forget, I added an entry to the next CF for this.

Thanks for your efforts.

I tried to dive deeper: replace ROWID_VAR with -4 and explicitly change 
types of varnos in the description of functions that can only work with 
special varnos.
Use cases of OUTER_VAR looks simple (i guess). Use cases of INNER_VAR is 
more complex because of the map_variable_attnos(). It is needed to 
analyze how negative value of INNER_VAR can affect on this function.


INDEX_VAR causes a potential problem:
in ExecInitForeignScan() and ExecInitCustomScan() we do
tlistvarno = INDEX_VAR;

here, tlistvarno has an unsigned type.


ROWID_VAR caused two problems in the check-world tests:
set_pathtarget_cost_width():
if (var->varno < root->simple_rel_array_size)
{
RelOptInfo *rel = root->simple_rel_array[var->varno];
...

and

replace_nestloop_params_mutator():
if (!bms_is_member(var->varno, root->curOuterRels))

I skipped these problems to look for other weak points, but check-world 
couldn't find any others.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 6ba9441cc43a2ccf868ca271494bf5b9950692e6 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 8 Apr 2021 08:43:04 +0500
Subject: [PATCH] Remove 64k rangetable limit.

---
 src/backend/nodes/bitmapset.c   |  1 +
 src/backend/nodes/outfuncs.c|  2 +-
 src/backend/nodes/readfuncs.c   |  2 +-
 src/backend/optimizer/path/costsize.c   |  3 ++-
 src/backend/optimizer/plan/createplan.c |  3 ++-
 src/backend/optimizer/plan/setrefs.c| 30 +
 src/include/nodes/primnodes.h   | 12 +-
 7 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 649478b0d4..c0d50c85da 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -23,6 +23,7 @@
 #include "common/hashfn.h"
 #include "nodes/bitmapset.h"
 #include "nodes/pg_list.h"
+#include "nodes/primnodes.h"
 #include "port/pg_bitutils.h"
 
 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4a8dc2d86d..4c3de615d1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1115,7 +1115,7 @@ _outVar(StringInfo str, const Var *node)
 {
 	WRITE_NODE_TYPE("VAR");
 
-	WRITE_UINT_FIELD(varno);
+	WRITE_INT_FIELD(varno);
 	WRITE_INT_FIELD(varattno);
 	WRITE_OID_FIELD(vartype);
 	WRITE_INT_FIELD(vartypmod);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9924727851..d084eee6ef 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -577,7 +577,7 @@ _readVar(void)
 {
 	READ_LOCALS(Var);
 
-	READ_UINT_FIELD(varno);
+	READ_INT_FIELD(varno);
 	READ_INT_FIELD(varattno);
 	READ_OID_FIELD(vartype);
 	READ_INT_FIELD(vartypmod);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 05686d0194..ac72bddfae 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -5934,7 +5934,8 @@ set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target)
 			Assert(var->varlevelsup == 0);
 
 			/* Try to get data from RelOptInfo cache */
-			if (var->varno < root->simple_rel_array_size)
+			if (!IS_SPECIAL_VARNO(var->varno) &&
+var->varno < root->simple_rel_array_size)
 			{
 RelOptInfo *rel = root->simple_rel_array[var->varno];
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 22f10fa339..defd179bca 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4809,7 +4809,8 @@ replace_nestloop_params_mutator(Node *node, PlannerInfo *root)
 		/* Upper-level Vars should be long gone at this point */
 		Assert(var->varlevelsup == 0);
 		/* If not to be replaced, we can just return the Var unmodified */
-		if (!bms_is_member(var->varno, root->curOuterRels))
+		if (IS_SPECIAL_VARNO(var->varno) ||
+			!bms_is_member(var->varno, root->curOuterRels))
 			return node;
 		/* Replace the Var with a nestloop Param */
 		return (Node *) replace_nestloop_param_var(root, var);
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 70c0fa07e6..6009eabaf2 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -66,7 +66,7 @@ typedef struct
 {
 	PlannerInfo *root;
 	indexed_tlist *subplan_itlist;
-	Index		newvarno;
+	int			refvarno;
 	int			rtoffset;
 	double		num_exec;
 } fix_upper_expr_context;
@@ -147,7 +147,7 @@ static Var *search_indexed_tlist_for_var(Var *var,
 		 int rtoffset);
 static Var *search_indexed_tlist_for_non_var(Expr *node,
 	

Re: Global snapshots

2021-02-25 Thread Andrey V. Lepikhov

On 1/1/21 8:14 AM, tsunakawa.ta...@fujitsu.com wrote:

--
11. A method comprising:
receiving information relating to a distributed database transaction operating 
on data in data stores associated with respective participating nodes 
associated with the distributed database transaction;
requesting commit time votes from the respective participating nodes, the 
commit time votes reflecting local clock values of the respective participating 
nodes;
receiving the commit time votes from the respective participating nodes in 
response to the requesting;
computing a global commit timestamp for the distributed database transaction 
based at least in part on the commit time votes, the global commit timestamp 
reflecting a maximum value of the commit time votes received from the 
respective participating nodes; and
synchronizing commitment of the distributed database transaction at the 
respective participating nodes to the global commit timestamp,
wherein at least the computing is performed by a computing device.


Thank you for this analysis of the patent.
After researching it in depth, I think this is the real problem.
My idea was that we are not using real clocks; we only use clock ticks 
to measure time intervals. But that can also be interpreted as a kind of clock.


What we can do:
1. Use global clocks at the start of a transaction.
2. Use CSN-based snapshots as the machinery and create an extension to 
allow user-defined commit protocols.
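For reference, the voting step in the quoted claim reduces to a very small computation — this sketch is my reading of the claim text, not an implementation of any actual system: the coordinator collects local clock values from the participating nodes and takes their maximum as the global commit timestamp, to which all participants then synchronize.

```python
def compute_global_commit_ts(votes):
    # votes: local clock values reported by the participating nodes.
    if not votes:
        raise ValueError("no participating nodes")
    return max(votes)
```

The breadth of that step is exactly why the clock-tick interpretation above is worrying: any scheme that maximizes over per-node "clock" values at commit risks reading on the claim.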


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-16 Thread Andrey V. Lepikhov

On 2/15/21 1:31 PM, Amit Langote wrote:

Tsunakawa-san, Andrey,
+static void
+postgresBeginForeignCopy(ModifyTableState *mtstate,
+  ResultRelInfo *resultRelInfo)
+{
...
+   if (resultRelInfo->ri_RangeTableIndex == 0)
+   {
+   ResultRelInfo *rootResultRelInfo = resultRelInfo->ri_RootResultRelInfo;
+
+   rte = exec_rt_fetch(rootResultRelInfo->ri_RangeTableIndex, estate);

It's better to add an Assert(rootResultRelInfo != NULL) here.
Apparently, there are cases where ri_RangeTableIndex == 0 without
ri_RootResultRelInfo being set.  The Assert will ensure that
BeginForeignCopy() is not mistakenly called on such ResultRelInfos.


+1


I can't parse what the function's comment says about "using list of
parameters".  Maybe it means to say "list of columns" specified in the
COPY FROM statement.  How about writing this as:

/*
  * Deparse remote COPY FROM statement
  *
  * Note that this explicitly specifies the list of COPY's target columns
  * to account for the fact that the remote table's columns may not match
  * exactly with the columns declared in the local definition.
  */

I'm hoping that I'm interpreting the original note correctly.  Andrey?


Yes, this is a good option.



+
+ mtstate is the overall state of the
+ ModifyTable plan node being executed;
global data about
+ the plan and execution state is available via this structure.
...
+typedef void (*BeginForeignCopy_function) (ModifyTableState *mtstate,
+  ResultRelInfo *rinfo);

Maybe a bit late realizing this, but why does BeginForeignCopy()
accept a ModifyTableState pointer whereas maybe just an EState pointer
will do?  I can't imagine why an FDW would want to look at the
ModifyTableState.  Case in point, I see that
postgresBeginForeignCopy() only uses the EState from the
ModifyTableState passed to it.  I think the ResultRelInfo that's being
passed to the Copy APIs contains most of the necessary information.
Also, EndForeignCopy() seems fine with just receiving the EState.


+1


If the intention is to only prevent this error, maybe the condition
above could be changed as this:

 /*
  * Check whether we support copying data out of the specified relation,
  * unless the caller also passed a non-NULL data_dest_cb, in which case,
  * the callback will take care of it
  */
 if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION &&
 data_dest_cb == NULL)


Agreed. This is an atavism. In the first versions, I did not use the 
data_dest_cb routine. But now this is a redundant parameter.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-09 Thread Andrey V. Lepikhov

On 2/9/21 12:47 PM, tsunakawa.ta...@fujitsu.com wrote:

From: Andrey V. Lepikhov 
I guess you used many hash partitions.  Sadly, the current COPY implementation 
only accumulates either 1,000 rows or 64 KB of input data (very small!) before 
flushing all CopyMultiInsertBuffers.  One CopyMultiInsertBuffer corresponds to 
one partition.  Flushing a CopyMultiInsertBuffer calls ExecForeignCopy() once, 
which connects to a remote database, runs COPY FROM STDIN, and disconnects.  
Here, the flushing trigger (1,000 rows or 64 KB input data, whichever comes 
first) is so small that if there are many target partitions, the amount of data 
for each partition is small.
I tried to use 1E4 - 1E8 rows in a tuple buffer. But the results weren't 
impressive.

We could add one more GUC instead of the hard-coded constant.
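The arithmetic behind the quoted observation can be modelled back-of-the-envelope style (constants taken from the description above; the function and its parameters are my own illustration): COPY flushes all CopyMultiInsertBuffers after at most 1,000 rows or 64 KB of input, so the average batch sent per partition — and hence per remote COPY round-trip — shrinks as the number of hash partitions grows.

```python
MAX_BUFFERED_TUPLES = 1000
MAX_BUFFERED_BYTES = 64 * 1024

def avg_rows_per_partition_flush(n_partitions, row_width_bytes):
    # Rows accumulated across ALL partitions before a flush is forced,
    # whichever limit is hit first.
    rows_per_flush = min(MAX_BUFFERED_TUPLES,
                         MAX_BUFFERED_BYTES // row_width_bytes)
    # Hash partitioning spreads them roughly evenly.
    return rows_per_flush / n_partitions
```

With 100-byte rows and 100 partitions, each flush sends fewer than 10 rows per partition, which explains why growing the buffer by a constant factor alone gave unimpressive results.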


Why don't we focus on committing the basic part and addressing the extended 
part (0003 and 0004) separately later?

I focused only on the 0001 and 0002 patches.

 As Tang-san and you showed, the basic part already demonstrated impressive 
improvement.  If there's no objection, I'd like to make this ready for 
committer in a few days.

Good.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-08 Thread Andrey V. Lepikhov

On 2/9/21 9:35 AM, tsunakawa.ta...@fujitsu.com wrote:

From: tsunakawa.ta...@fujitsu.com 

From: Andrey Lepikhov 
Also, I might defer working on the extended part (v9 0003 and 0004) and further
separate them in a different thread, if it seems to take longer.


I reviewed them but haven't rebased them (it seems to take more labor.)
Andrey-san, could you tell us:

* Why is a separate FDW connection established for each COPY?  To avoid using 
the same FDW connection for multiple foreign table partitions in a single COPY 
run?
With a separate connection, you can initiate a 'COPY FROM' session for each 
foreign partition just once, at partition initialization.


* In what kind of test did you get 2-4x performance gain?  COPY into many 
foreign table partitions where the input rows are ordered randomly enough that 
many rows don't accumulate in the COPY buffer?
I used 'INSERT INTO .. SELECT * FROM generate_series(1, N)' to generate 
test data and HASH partitioning to avoid skew.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-01-11 Thread Andrey V. Lepikhov

On 1/11/21 11:16 PM, Tomas Vondra wrote:

Hi Andrey,

Unfortunately, this no longer applies :-( I tried to apply this on top
of c532d15ddd (from 2020/12/30) but even that has non-trivial conflicts.

Can you send a rebased version?

regards


Applied on 044aa9e70e.

--
regards,
Andrey Lepikhov
Postgres Professional
>From f8e0cd305c691108313c2365cc4576e4d5e0bd38 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Tue, 12 Jan 2021 08:54:45 +0500
Subject: [PATCH 2/2] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert
is possible and the foreign table has a non-zero number of columns.

FDWAPI was extended by next routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free
the CopyState of a bulk COPY. The ExecForeignCopy routine sends a
'COPY ... FROM STDIN' command to the foreign server, sends tuples in an
iterative manner via the CopyTo() machinery, and finally sends EOF on this
connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine is moved into deparseRelColumnList().
It is reused in deparseCopyFromSql().

Added TAP-tests on the specific corner cases of COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure was extended
with a data_dest_cb callback. It is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with the cstate field.
It is needed to avoid repeated initialization of the CopyState. Also for this
reason the CopyTo() routine was split into the set of routines CopyToStart()/
CopyTo()/CopyToFinish().

The CopyInsertMethod enum was removed. This logic is now implemented via the
ri_usesMultiInsert field of the ResultRelInfo structure.

Discussion:
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 ++--
 .../postgres_fdw/expected/postgres_fdw.out|  46 ++-
 contrib/postgres_fdw/postgres_fdw.c   | 130 ++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  45 ++
 doc/src/sgml/fdwhandler.sgml  |  73 ++
 src/backend/commands/copy.c   |   4 +-
 src/backend/commands/copyfrom.c   | 126 ++---
 src/backend/commands/copyto.c |  84 ---
 src/backend/executor/execMain.c   |   8 +-
 src/backend/executor/execPartition.c  |  27 +++-
 src/include/commands/copy.h   |   8 +-
 src/include/foreign/fdwapi.h  |  15 ++
 13 files changed, 533 insertions(+), 94 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 3cf7b4eb1e..b1ca479a65 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1763,6 +1765,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2066,6 +2082,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parentheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2074,10 +2114,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List 

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-01-11 Thread Andrey V. Lepikhov

On 1/11/21 4:59 PM, Tang, Haiying wrote:

Hi Andrey,

I had a general look at this extension feature; I think it's beneficial for
some PostgreSQL application scenarios. So I ran 7 performance test cases on
your patch (v13). The results are really good: as you can see below, we get a
7-10x improvement with this patch.

PSA test_copy_from.sql, which shows my test cases in detail (I didn't attach
my data file since it's too big).

Below are the test results.
'Test No' corresponds to the number (0, 1, ..., 6) in the attached test_copy_from.sql.
%reg = (Patched - Unpatched)/Unpatched; the unit is milliseconds.

|Test No|Test Case|Patched(ms)|Unpatched(ms)|%reg|
|---|---|---|---|---|
|0|COPY FROM insertion into the partitioned table (partition is foreign table)|102483.223|1083300.907|-91%|
|1|COPY FROM insertion into the partitioned table (partition is foreign partition)|104779.893|1207320.287|-91%|
|2|COPY FROM insertion into the foreign table (without partition)|100268.730|1077309.158|-91%|
|3|COPY FROM insertion into the partitioned table (part of foreign partitions)|104110.620|1134781.855|-91%|
|4|COPY FROM insertion into the partitioned table with constraint (part of foreign partition)|136356.201|1238539.603|-89%|
|5|COPY FROM insertion into the foreign table with constraint (without partition)|136818.262|1189921.742|-89%|
|6|\copy insertion into the partitioned table with constraint|140368.072|1242689.924|-89%|

If there is any question on my tests, please feel free to ask.

Best Regard,
Tang

Thank you for this work.
Some time ago I suggested an additional optimization [1] which can
speed up COPY by another 2-4 times. Maybe you can run the benchmark
for this solution too?


[1] 
https://www.postgresql.org/message-id/da7ed3f5-b596-2549-3710-4cc2a602ec17%40postgrespro.ru


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Removing unneeded self joins

2021-01-10 Thread Andrey V. Lepikhov

On 1/7/21 7:08 PM, Masahiko Sawada wrote:

On Mon, Nov 30, 2020 at 2:51 PM Andrey V. Lepikhov
 wrote:

Thanks, it is my fault. I tried to extend this patch with foreign key
references and made a mistake.
Currently I have rolled back this new option (see the attached patch), but I
will be working for a while to simplify this patch.


Are you working to simplify the patch? This patch has been "Waiting on
Author" for 1 month and doesn't seem to pass cfbot tests. Please
update the patch.


Yes, I'm working to improve this feature.
The attached version is fixed and rebased on ce6a71fa53.

--
regards,
Andrey Lepikhov
Postgres Professional
>From 3caeb297320af690be71b367329d86c49564e231 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 11 Jan 2021 09:01:11 +0500
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if it can be proven that such
a join can be replaced with a scan. We can build the required proofs of
uniqueness using the existing innerrel_is_unique machinery.

We can remove a self-join when, for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, then the inner row
is (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals that look like a.x = b.x.
2. Collect all other join quals.
3. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced by
a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 1186 +
 src/backend/optimizer/plan/planmain.c |5 +
 src/backend/optimizer/util/relnode.c  |   26 +-
 src/backend/utils/misc/guc.c  |   10 +
 src/include/optimizer/pathnode.h  |4 +
 src/include/optimizer/planmain.h  |2 +
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  331 ++
 src/test/regress/expected/sysviews.out|3 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  166 +++
 11 files changed, 1765 insertions(+), 16 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 90460a69bd..d631e95f89 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -29,8 +30,12 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
+
+bool enable_self_join_removal;
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
@@ -47,6 +52,7 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
 
 
 /*
@@ -1118,3 +1124,1183 @@ is_innerrel_unique_for(PlannerInfo *root,
 	/* Let rel_is_distinct_for() do the hard work */
 	return rel_is_distinct_for(root, innerrel, clause_list);
 }
+
+typedef struct
+{
+	Index oldRelid;
+	Index newRelid;
+} ChangeVarnoContext;
+
+
+static bool
+change_varno_walker(Node *node, ChangeVarnoContext *context)
+{
+	if (node == NULL)
+		return false;
+
+	if (IsA(node, Var))
+	{
+		Var* var = (Var*)node;
+		if (var->varno == context->oldRelid)
+		{
+			var->varno = context->newRelid;
+			var->varnosyn = context->newRelid;
+			var->location = -1;
+		}
+		else if (var->varno == context->newRelid)
+			var->location = -1;
+
+		return false;
+	}
+	if (IsA(node, RestrictInfo))
+	{
+		change_rinfo((RestrictInfo*)node, context->oldRelid, context->newRelid);
+		return false;
+	}
+	return expression_tree_walker(node, change_varno_walker, context);
+}
+
+/*
+ * For all Vars in the expression that have varno = oldRelid, set
+ * varno = newRelid.
+ */
+static void
+change_varno(Expr *expr, Index oldRelid, Index newRelid)
+{
+	ChangeVarnoContext context;
+
+	context.oldRelid = oldRelid;
+	context.newRelid = newRelid;
+	change_varno_walker((Node *) expr, );
+}
+
+/*
+ * Substitute newId for oldId in relids.
+ */
+static void
+change_relid(Relids *relids, Index oldId, Index newId)
+{
+	if (bms_is_member(oldId, *relids))
+		*re

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-12-23 Thread Andrey V. Lepikhov

On 12/22/20 12:04 PM, Tang, Haiying wrote:

Hi Andrey,

There is an error report in your patch as follows. Please take a look.

https://travis-ci.org/github/postgresql-cfbot/postgresql/jobs/750682857#L1519


copyfrom.c:374:21: error: ‘save_cur_lineno’ is used uninitialized in this 
function [-Werror=uninitialized]


Regards,
Tang




Thank you,
see new version in attachment.

--
regards,
Andrey Lepikhov
Postgres Professional
>From e2bc0980f05061afe199de63b76b00020208510a Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 14 Dec 2020 13:37:40 +0500
Subject: [PATCH 2/2] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert mode
is possible and the foreign table has a non-zero number of columns.

The FDW API was extended with the following routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free
the CopyState of a bulk COPY. The ExecForeignCopy routine sends a
'COPY ... FROM STDIN' command to the foreign server, iteratively
sends tuples via the CopyTo() machinery, and sends EOF on this connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine is split out into deparseRelColumnList(),
which is reused in deparseCopyFromSql().

Added TAP tests for specific corner cases of the COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure was extended
with a data_dest_cb callback. It is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with the cstate field,
which is needed to avoid repeated initialization of CopyState. Also for this
reason the CopyTo() routine was split into the set of routines CopyToStart()/
CopyTo()/CopyToFinish().

Enum CopyInsertMethod was removed. This logic is implemented by the
ri_usesMultiInsert field of the ResultRelInfo structure.

Discussion:
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 ++--
 .../postgres_fdw/expected/postgres_fdw.out|  46 +-
 contrib/postgres_fdw/postgres_fdw.c   | 137 ++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  45 ++
 doc/src/sgml/fdwhandler.sgml  |  73 ++
 src/backend/commands/copy.c   |   4 +-
 src/backend/commands/copyfrom.c   | 126 +---
 src/backend/commands/copyto.c |  84 ---
 src/backend/executor/execMain.c   |   8 +-
 src/backend/executor/execPartition.c  |  27 +++-
 src/include/commands/copy.h   |   8 +-
 src/include/foreign/fdwapi.h  |  15 ++
 13 files changed, 540 insertions(+), 94 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ca2f9f3215..b2a71faabc 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1763,6 +1765,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2066,6 +2082,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parentheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = 

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-12-14 Thread Andrey V. Lepikhov

On 12/1/20 2:02 PM, Amit Langote wrote:

On Tue, Dec 1, 2020 at 2:40 PM tsunakawa.ta...@fujitsu.com
 wrote:

From: Amit Langote 

>> The code appears to require both BeginForeignCopy and EndForeignCopy,
>> while the following documentation says they are optional.  Which is
>> correct?  (I suppose the latter is correct just like other existing
>> Begin/End functions are optional.)

Fixed.

> Anyway, one thing we could do is rename
> ExecRelationAllowsMultiInsert() to ExecSetRelationUsesMultiInsert(

Renamed.

>> I agree with your idea of adding multi_insert argument to 
ExecFindPartition() to request a multi-insert-capable partition.  At 
first, I thought ExecFindPartition() is used for all operations, 
insert/delete/update/select, so I found it odd to add multi_insert 
argument.  But ExecFindPartion() is used only for insert, so 
multi_insert argument seems okay.

>
> Good.  Andrey, any thoughts on this?

I have no serious technical arguments against this, other than code
readability and reducing the number of routine parameters. Maybe we
can rethink it later?


The new version rebased on commit 525e60b742 is attached.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 98a6f077cd3b694683ec0e4a3250c040cc33cb39 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 14 Dec 2020 11:29:03 +0500
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo, which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly better readability.
---
 src/backend/commands/copyfrom.c  | 142 ++-
 src/backend/executor/execMain.c  |  52 +
 src/backend/executor/execPartition.c |   7 ++
 src/include/commands/copyfrom_internal.h |  10 --
 src/include/executor/execPartition.h |   2 +
 src/include/executor/executor.h  |   2 +
 src/include/nodes/execnodes.h|   8 +-
 7 files changed, 108 insertions(+), 115 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1b14e9a6eb..6d4f6cb80d 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -535,12 +535,10 @@ CopyFrom(CopyFromState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 	Assert(list_length(cstate->range_table) == 1);
@@ -650,6 +648,30 @@ CopyFrom(CopyFromState cstate)
 	resultRelInfo = target_resultRelInfo = makeNode(ResultRelInfo);
 	ExecInitResultRelation(estate, resultRelInfo, 1);
 
+	Assert(target_resultRelInfo->ri_usesMultiInsert == false);
+
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion, and insert them in bulk, for example, with one
+	 * table_multi_insert() call than call table_tuple_insert() separately for
+	 * every tuple. However, there are a number of reasons why we might not be
+	 * able to do this.  For example, if there any volatile expressions in the
+	 * table's default values or in the statement's WHERE clause, which may
+	 * query the table we are inserting into, buffering tuples might produce
+	 * wrong results.  Also, the relation we are trying to insert into itself
+	 * may not be amenable to buffered inserts.
+	 *
+	 * Note: For partitions, this flag is set considering the target table's
+	 * flag that is being set here and partition's own properties which are
+	 * checked by calling ExecSetRelationUsesMultiInsert().  It does not matter
+	 * whether partitions have any volatile default expressions as we use the
+	 * defaults from the target of the COPY command.
+	 */
+	if (!cstate->volatile_defexprs &&
+		!contain_volatile_functions(cstate->whereClause))
+		target_resultRelInfo->ri_usesMultiInsert =
+	ExecSetRelationUsesMultiInsert(target_resultRelInfo, NULL);
+
 	/* Verify the named relation is a valid target for INSERT */
 	CheckValidResultRel(resultRelInfo, CMD_INSERT);
 
@@ -665,6 +687,10 @@ CopyFrom(CopyFromState cstate)
 	mtstate->operation = CMD_INSERT;
 	mtstate->resultRelInfo = resultRelInfo;
 
+	/*
+	 * Init copying process into foreign table. Initialization of copying into
+	 * foreign 

Re: Asynchronous Append on postgres_fdw nodes.

2020-12-09 Thread Andrey V. Lepikhov

On 11/17/20 2:56 PM, Etsuro Fujita wrote:

On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita  wrote:
Comments welcome!  The attached is still WIP and maybe I'm missing
something, though.
I reviewed your patch and used it in my TPC-H benchmarks. It is still
WIP. Will you continue improving this patch?


I also want to say that, in my opinion, Horiguchi-san's version seems
preferable: it is more structured, simpler to understand, executor-native,
and reduces the FDW interface changes. That code really only needs
one procedure: IsForeignPathAsyncCapable.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Cost overestimation of foreign JOIN

2020-12-02 Thread Andrey V. Lepikhov

On 12/1/20 6:17 PM, Ashutosh Bapat wrote:

On Mon, Nov 30, 2020 at 11:56 PM Andrey Lepikhov
 wrote:


On 30.11.2020 22:38, Tom Lane wrote:

Andrey Lepikhov  writes:
If you're unhappy with the planning results you get for this,
why don't you have use_remote_estimate turned on?


I have a mixed load model. Large queries can afford additional
estimate queries, but for many simple SELECTs that touch a small
portion of the data, the latency has increased significantly. And I
don't know how to switch the use_remote_estimate setting in such a case.


You may disable use_remote_estimates for given table or a server. So
if tables participating in short queries are different from those in
the large queries, you could set use_remote_estimate at table level to
turn it off for the first set. Otherwise, we need a FDW level GUC
which can be turned on/off for a given session or a query.


Currently I have implemented another technique:
- By default, use_remote_estimate is off.
- In estimate_path_cost_size(), some estimation criteria are checked.
If they hold, we force remote estimation for this JOIN.
This approach solves the push-down problem in my case (a TPC-H test with
6 servers/instances), but it is not as scalable as I want.


Generally use_remote_estimate isn't scalable and there have been
discussions about eliminating the need of it. But no concrete proposal
has come yet.

Above I suggested using the results of the local JOIN cost calculation,
assuming that, in the case of the postgres_fdw wrapper, the foreign
server will very likely use the same type of join (or an even better one,
for example if it has a suitable index).

If this approach is of interest, I can investigate it.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Removing unneeded self joins

2020-11-29 Thread Andrey V. Lepikhov

On 11/29/20 10:10 PM, Heikki Linnakangas wrote:

On 28/11/2020 19:21, Andrey Lepikhov wrote:

On 27.11.2020 21:49, Heikki Linnakangas wrote:
CREATE TABLE a(x int, y int);
CREATE UNIQUE INDEX ON a(x);
SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.x;  -- self-join
CREATE UNIQUE INDEX ON a(y);
SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.y;  -- self-join too


The latter join is not "useless". The patch returns an incorrect
result for that query:



postgres=# insert into a values (1, 2);
INSERT 0 1
postgres=# insert into a values (2, 1);
INSERT 0 1
postgres=# SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.y; -- WRONG RESULT
 x | y
---+---
(0 rows)

postgres=# set enable_self_join_removal=off;
SET
postgres=# SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.y; -- CORRECT RESULT
 x | y
---+---
 1 | 2
 2 | 1
(2 rows)


Thanks, it is my fault. I tried to extend this patch with foreign key
references and made a mistake.
Currently I have rolled back this new option (see the attached patch), but I
will be working for a while to simplify this patch.


--
regards,
Andrey Lepikhov
Postgres Professional
>From 7be9cc9790b51b6afaabe2fbcf293f1b649265ea Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 30 Oct 2020 10:24:40 +0500
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if it can be proven that such
a join can be replaced with a scan. We can build the required proofs of
uniqueness using the existing innerrel_is_unique machinery.

We can remove a self-join when, for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, then the inner row
is (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals that look like a.x = b.x.
2. Collect all other join quals.
3. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced by
a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 1186 +
 src/backend/optimizer/plan/planmain.c |5 +
 src/backend/optimizer/util/relnode.c  |   26 +-
 src/backend/utils/misc/guc.c  |   10 +
 src/include/optimizer/pathnode.h  |4 +
 src/include/optimizer/planmain.h  |2 +
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  331 ++
 src/test/regress/expected/sysviews.out|3 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  174 +++
 11 files changed, 1773 insertions(+), 16 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 806629fff2..0e92245116 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -29,8 +30,12 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
+
+bool enable_self_join_removal;
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
@@ -47,6 +52,7 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
 
 
 /*
@@ -1118,3 +1124,1183 @@ is_innerrel_unique_for(PlannerInfo *root,
 	/* Let rel_is_distinct_for() do the hard work */
 	return rel_is_distinct_for(root, innerrel, clause_list);
 }
+
+typedef struct
+{
+	Index oldRelid;
+	Index newRelid;
+} ChangeVarnoContext;
+
+
+static bool
+change_varno_walker(Node *node, ChangeVarnoContext *context)
+{
+	if (node == NULL)
+		return false;
+
+	if (IsA(node, Var))
+	{
+		Var* var = (Var*)node;
+		if (var->varno == context->oldRelid)
+		{
+			var->varno = context->newRelid;
+			var->varnosyn = context->newRelid;
+			var->location = -1;
+		}
+		else if (var->varno == context->newRelid)
+			var->location = -1;
+
+		return false;
+	}
+	if (IsA(node, RestrictInfo))
+	{
+		change_rinfo((RestrictInfo*)node, context->oldRelid, context->newRelid);
+		return false;
+	}
+	return expression_tree_walker(node, change_varno_walker, context);
+}
+
+/*
+ * For all Vars in the expression that have varno = oldRelid, set
+ * varno = newRelid.
+ */
+static void
+change_varno(Expr *expr, Index oldRelid, Index newRelid)
+{
+	

Re: Removing unneeded self joins

2020-10-31 Thread Andrey V. Lepikhov

Thank you for this partial review; I included your changes:

On 9/23/20 9:23 AM, David Rowley wrote:

On Fri, 3 Apr 2020 at 17:43, Andrey Lepikhov  wrote:
Doing thing the way I describe will allow you to get rid of all the
UniqueRelInfo stuff.

Thanks for the review and sorry for the late reply.
I fixed the small mistakes mentioned in your letter.
Also, I rewrote this patch at your suggestion [1].
Because of the many changes, this patch can be viewed as a sketch.

To change the self-join detection algorithm I used your delta patch from
[2]. I added complex expression handling to the split_selfjoin_quals
routine for demonstration. But I think it is not very useful with the
current infrastructure.


Also I implemented one additional way to detect a self-join: if the
join target list doesn't contain vars from the inner relation, then we
can detect a self-join with only quals like a1.x = a2.y, provided that
innerrel_is_unique returns true.
The analysis of the target list is contained in a new routine,
tlist_contains_rel_exprs, a rewritten version of the build_joinrel_tlist
routine.


Also, the changes to the join_is_removable() routine are removed from the
patch; I couldn't understand why they were needed here.


Note that this patch changes one join.sql regression test output.
It is not a bug, but it could perhaps be fixed.


Applied over commit 4a071afbd0.

> [1] 
https://www.postgresql.org/message-id/CAKJS1f8p-KiEujr12k-oa52JNWWaQUjEjNg%2Bo1MGZk4mHBn_Rg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAKJS1f8cJOCGyoxi7a_LG7eu%2BWKF9%2BHTff3wp1KKS5gcUg2Qfg%40mail.gmail.com


--
regards,
Andrey Lepikhov
Postgres Professional
>From b3d69fc66f9d9ecff5e43842ca71b5aeb8f7e92b Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 30 Oct 2020 10:24:40 +0500
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if it can be proven that such
a join can be replaced with a scan. We can build the required proofs of
uniqueness using the existing innerrel_is_unique machinery.

We can remove a self-join when, for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, then the inner row
is (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals that look like a.x = b.x.
2. Collect all other join quals.
3. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then each outer row matches only the same row from the inner
relation. This proves that the join is a self-join and can be replaced by
a scan.
4. If the list from (1) is NIL, check whether the vars from the inner
and outer relations fall into the join target list.
5. If vars from the inner relation can't fall into the target list,
check innerrel_is_unique() for the qual list from (2). If it returns
true, then each outer row matches only one inner row, not necessarily the
same one. But this is no longer a problem here. This proves that the
self-join is removable.

Some regression tests change due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 1185 +
 src/backend/optimizer/plan/planmain.c |5 +
 src/backend/optimizer/util/relnode.c  |   26 +-
 src/backend/utils/misc/guc.c  |   10 +
 src/include/optimizer/pathnode.h  |4 +
 src/include/optimizer/planmain.h  |2 +
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  339 +-
 src/test/regress/expected/sysviews.out|3 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  166 +++
 11 files changed, 1769 insertions(+), 19 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index d0ff660284..1221bf4599 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -29,8 +30,12 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
+
+bool enable_self_join_removal;
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
@@ -47,6 +52,7 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
    RelOptInfo *innerrel,
    JoinType jointype,
    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
 
 
 /*
@@ -1118,3 +1124,1182 @@ is_innerrel_unique_for(PlannerInfo *root,
 	/* Let rel_is_distinct_for() do the hard work */
 	

Re: Asynchronous Append on postgres_fdw nodes.

2020-10-08 Thread Andrey V. Lepikhov

On 10/5/20 11:35 AM, Etsuro Fujita wrote:
Hi,
I found a small problem: if we have a mix of async and sync subplans,
we hit an assertion on a busy connection. Just for example:


PLAN

Nested Loop  (cost=100.00..174316.95 rows=975 width=8) (actual time=5.191..9.262 rows=9 loops=1)
  Join Filter: (frgn.a = l.a)
  Rows Removed by Join Filter: 8991
  ->  Append  (cost=0.00..257.20 rows=11890 width=4) (actual time=0.419..2.773 rows=1000 loops=1)
        Async subplans: 4
        ->  Async Foreign Scan on f_1 l_2  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.381..0.585 rows=211 loops=1)
        ->  Async Foreign Scan on f_2 l_3  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.005..0.206 rows=195 loops=1)
        ->  Async Foreign Scan on f_3 l_4  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.003..0.282 rows=187 loops=1)
        ->  Async Foreign Scan on f_4 l_5  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.003..0.316 rows=217 loops=1)
        ->  Seq Scan on l_0 l_1  (cost=0.00..2.90 rows=190 width=4) (actual time=0.017..0.057 rows=190 loops=1)
  ->  Materialize  (cost=100.00..170.94 rows=975 width=4) (actual time=0.001..0.002 rows=9 loops=1000)
        ->  Foreign Scan on frgn  (cost=100.00..166.06 rows=975 width=4) (actual time=0.766..0.768 rows=9 loops=1)


The reproduction script 'test1.sql' is attached. Here I force the
problem to reproduce by setting enable_hashjoin and enable_mergejoin
to off.


'asyncmix.patch' contains my solution to this problem.

--
regards,
Andrey Lepikhov
Postgres Professional


test1.sql
Description: application/sql
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 14824368cc..613d406982 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -455,7 +455,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 	  void *arg);
 static void create_cursor(ForeignScanState *node);
 static void request_more_data(ForeignScanState *node);
-static void fetch_received_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node, bool vacateconn);
 static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
@@ -1706,15 +1706,19 @@ postgresIterateForeignScan(ForeignScanState *node)
 		{
 			/*
 			 * finish the running query before sending the next command for
-			 * this node
+			 * this node.
+			 * When the plan contains both asynchronous and non-async subplans,
+			 * the backend could request more data in async mode and then want
+			 * to get data in sync mode over the same connection. Here it must
+			 * wait for the async data before requesting more.
 			 */
-			if (!fsstate->s.commonstate->busy)
-				vacate_connection((PgFdwState *)fsstate, false);
+			if (fsstate->s.commonstate->busy)
+				vacate_connection(&fsstate->s, false);
 
 			request_more_data(node);
 
 			/* Fetch the result immediately. */
-			fetch_received_data(node);
+			fetch_received_data(node, false);
 		}
 		else if (!fsstate->s.commonstate->busy)
 		{
@@ -1749,7 +1753,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 			/* fetch the leader's data and enqueue it for the next request */
 			if (available)
 			{
-				fetch_received_data(leader);
+				fetch_received_data(leader, false);
 add_async_waiter(leader);
 			}
 		}
@@ -3729,7 +3733,7 @@ request_more_data(ForeignScanState *node)
  * Fetches received data and automatically send requests of the next waiter.
  */
 static void
-fetch_received_data(ForeignScanState *node)
+fetch_received_data(ForeignScanState *node, bool vacateconn)
 {
 	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
@@ -3817,7 +3821,8 @@ fetch_received_data(ForeignScanState *node)
 	waiter = move_to_next_waiter(node);
 
 	/* send the next request if any */
-	if (waiter)
+	if (waiter && (!vacateconn ||
+		GetPgFdwScanState(node)->s.conn != GetPgFdwScanState(waiter)->s.conn))
 		request_more_data(waiter);
 
 	MemoryContextSwitchTo(oldcontext);
@@ -3843,7 +3848,7 @@ vacate_connection(PgFdwState *fdwstate, bool clear_queue)
 	 * query
 	 */
 	leader = commonstate->leader;
-	fetch_received_data(leader);
+	fetch_received_data(leader, true);
 
 	/* let the first waiter be the next leader of this connection */
 	move_to_next_waiter(leader);


Re: Adding Support for Copy callback functionality on COPY TO api

2020-09-30 Thread Andrey V. Lepikhov

On 7/2/20 2:41 AM, Sanaba, Bilva wrote:

Hi hackers,

Currently, the COPY TO API does not support callback functions, while 
the COPY FROM API does. The COPY TO code does, however, include 
placeholders for supporting callbacks in the future.


Rounding out the support of callback functions to both could be very 
beneficial for extension development. In particular, supporting 
callbacks for COPY TO will allow developers to utilize the preexisting 
command in order to create tools that give users more support for moving 
data for storage, backup, analytics, etc.


We are aiming to get the support into core PostgreSQL and add COPY TO 
callback support in the next commitfest. The attached patch contains a 
change to the COPY TO API to support callbacks.


Your code is almost exactly the same as that proposed in [1] as part of 
the 'Fast COPY FROM' command. But it seems there are differences.


[1] 
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru


--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-10 Thread Andrey V. Lepikhov

On 9/9/20 5:51 PM, Amit Langote wrote:

On Wed, Sep 9, 2020 at 6:42 PM Alexey Kondratov wrote:

On 2020-09-09 11:45, Andrey V. Lepikhov wrote:

This does not seem very convenient and will lead to errors in the
future. So, I agree with Amit.


And InitResultRelInfo() may set ri_usesMultiInsert to false by default,
since it's used only by COPY now. Then you won't need this in several
places:

+   resultRelInfo->ri_usesMultiInsert = false;

While the logic of turning multi-insert on with all the validations
required could be factored out of InitResultRelInfo() to a separate
routine.


Interesting idea.  Maybe better to have a separate routine like Alexey says.

OK. I rewrote patch 0001 following Alexey's suggestion.
Patch 0002 required minor changes (the new version is in the attachment).

Also, I added some optimizations (see patches 0003 and 0004). Here we 
execute 'COPY ... FROM STDIN' at the foreign server only once, in the 
BeginForeignCopy routine. These are proof-of-concept patches.


Also, I see that error message processing needs to be rewritten. Unlike 
the INSERT operation applied to each row, here we find out about copy 
errors only after sending the end of the copy. Currently, implementations 
0002 and 0004 provide uninformative error messages in some cases.
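For context, the multi-insert decision that patch 0001 consolidates can be summarized roughly as follows (a simplified Python sketch; the parameter names are illustrative, and the real checks live in CopyFrom() and InitResultRelInfo()/checkMultiInsertMode()):

```python
def copy_uses_multi_insert(volatile_defexprs, n_columns,
                           where_has_volatile_funcs,
                           has_before_insert_row_trig,
                           is_foreign_without_copy_api=False):
    """Decide whether COPY FROM may buffer tuples for bulk insertion."""
    # Conditions checked once per COPY statement in CopyFrom():
    if volatile_defexprs or n_columns == 0:
        return False  # default expressions may query the target table
    if where_has_volatile_funcs:
        return False  # a volatile WHERE clause may query the target table
    # Conditions checked per target relation (including each partition),
    # i.e. the part factored into InitResultRelInfo():
    if has_before_insert_row_trig:
        return False  # the trigger may change or reroute the tuple
    if is_foreign_without_copy_api:
        return False  # the FDW offers no BeginForeignCopy/ExecForeignCopy
    return True
```

The per-relation checks at the bottom are the reason the decision is pushed into InitResultRelInfo(): each partition must re-evaluate them with its own properties.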


--
regards,
Andrey Lepikhov
Postgres Professional
From 2053ac530db87ae4617aa953142c447e0b27e3a2 Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/4] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
proprties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
---
 src/backend/commands/copy.c  | 190 ++-
 src/backend/executor/execMain.c  |   3 +
 src/backend/executor/execPartition.c |  47 +++
 src/include/executor/execPartition.h |   2 +
 src/include/nodes/execnodes.h|   9 +-
 5 files changed, 131 insertions(+), 120 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..2119db4213 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,10 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2833,6 +2821,58 @@ CopyFrom(CopyState cstate)
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
+	Assert(target_resultRelInfo->ri_usesMultiInsert == false);
+
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion, and insert them in bulk, for example, with one
+	 * table_multi_insert() call than call table_tuple_insert() separately
+	 * for every tuple. However, there are a number of reasons why we might
+	 * not be able to do this.  We check some conditions below while some
+	 * other target relation properties are checked in InitResultRelInfo().
+	 * Partition initialization will use result of this check implicitly as
+	 * the ri_usesMultiInsert value of the parent relation.
+	 */
+	if (!checkMultiInsertMode(target_resultRelInfo, NULL))
+	{
+		/*
+		 * Do nothing. Can't allow multi-insert mode if previous conditions
+		 * checking disallow this.
+		 */
+	}
+	else if (cstate->volatile_defexprs || list_length(cstate->attnumlist) == 0)
+	{
+		/*
+		 * Can't support bufferization of copy into foreign tables without any
+		 * defined columns or if there are any volatile default expressions in the
+		 * table. Similarly to the trigger case above, such expressions may query
+		 * the table we're inserting into.
+		 *
+		 * Note: 

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Andrey V. Lepikhov

Version 8 is split into two patches (in accordance with Amit's suggestion).
Also, I eliminated a naming inconsistency (thanks to Alexey).
Based on master, f481d28232.

--
regards,
Andrey Lepikhov
Postgres Professional
From 21b11f4ec0bec71bc7226014ef15c58dee9002da Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
---
 src/backend/commands/copy.c  | 186 ---
 src/backend/commands/tablecmds.c |   1 +
 src/backend/executor/execMain.c  |  49 ++
 src/backend/executor/execPartition.c |   3 +-
 src/backend/replication/logical/worker.c |   2 +-
 src/include/executor/executor.h  |   1 +
 src/include/nodes/execnodes.h|   9 +-
 7 files changed, 129 insertions(+), 122 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..4e63926cb7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,11 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
+	bool		use_multi_insert;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2820,6 +2809,52 @@ CopyFrom(CopyState cstate)
 		ti_options |= TABLE_INSERT_FROZEN;
 	}
 
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion, and insert them in bulk, for example, with one
+	 * table_multi_insert() call than call table_tuple_insert() separately
+	 * for every tuple. However, there are a number of reasons why we might
+	 * not be able to do this.  We check some conditions below while some
+	 * other target relation properties are left for InitResultRelInfo() to
+	 * check, because they must also be checked for partitions which are
+	 * initialized later.
+	 */
+	if (cstate->volatile_defexprs || list_length(cstate->attnumlist) == 0)
+	{
+		/*
+		 * Can't support bufferization of copy into foreign tables without any
+		 * defined columns or if there are any volatile default expressions in the
+		 * table. Similarly to the trigger case above, such expressions may query
+		 * the table we're inserting into.
+		 *
+		 * Note: It does not matter if any partitions have any volatile
+		 * default expressions as we use the defaults from the target of the
+		 * COPY command.
+		 */
+		use_multi_insert = false;
+	}
+	else if (contain_volatile_functions(cstate->whereClause))
+	{
+		/*
+		 * Can't support multi-inserts if there are any volatile function
+		 * expressions in WHERE clause.  Similarly to the trigger case above,
+		 * such expressions may query the table we're inserting into.
+		 */
+		use_multi_insert = false;
+	}
+	else
+	{
+		/*
+		 * Looks okay to try multi-insert, but that may change once we
+		 * check few more properties in InitResultRelInfo().
+		 *
+		 * For partitioned tables, whether or not to use multi-insert depends
+		 * on the individual partition's properties, which are also checked in
+		 * InitResultRelInfo().
+		 */
+		use_multi_insert = true;
+	}
+
 	/*
 	 * We need a ResultRelInfo so we can use the regular executor's
 	 * index-entry-making machinery.  (There used to be a huge amount of code
@@ -2830,6 +2865,7 @@ CopyFrom(CopyState cstate)
 	  cstate->rel,
 	  1,		/* must match rel's position in range_table */
 	  NULL,
+	  use_multi_insert,
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
@@ -2854,10 +2890,14 @@ CopyFrom(CopyState cstate)
 	

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Andrey V. Lepikhov

On 9/8/20 8:34 PM, Alexey Kondratov wrote:

On 2020-09-08 17:00, Amit Langote wrote:

Alexey Kondratov wrote:

On 2020-09-08 10:34, Amit Langote wrote:
Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.


Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?


No, you're right. If someone want to share a state and use ResultRelInfo 
(RRI) for that purpose, then it's fine, but CopyFrom() may simply 
override RRI->ri_usesMultiInsert if needed and pass this RRI further.


This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.


Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass an 
additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and feels 
like abstractions leaking.
I didn't see what the problem was, so I prepared a patch version 
according to Alexey's suggestion (see Alternate.patch).
This does not seem very convenient and will lead to errors in the 
future. So, I agree with Amit.
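A toy model of the single-flag approach (hypothetical Python, not the patch code) shows why it is less error-prone with partitions: a partition inherits the parent's decision through the one flag, so no separate local flag can disagree with it.

```python
def init_result_rel_info(rel, parent=None, want_multi_insert=True):
    """Compute ri_usesMultiInsert once, from the caller's wish, the
    relation's own properties, and (for partitions) the parent's flag."""
    ok = want_multi_insert and not rel.get("has_before_insert_trigger", False)
    if parent is not None:
        # a partition can never use multi-insert if its parent does not
        ok = ok and parent["ri_usesMultiInsert"]
    rel["ri_usesMultiInsert"] = ok
    return rel


# CopyFrom() decided against multi-insert for the target table...
parent = init_result_rel_info({}, want_multi_insert=False)
# ...so a later-initialized partition sees that through the one flag
# and cannot wrongly call BeginForeignCopy().
part = init_result_rel_info({}, parent=parent)
print(part["ri_usesMultiInsert"])  # -> False
```

With two flags (one in the ResultRelInfo, one local to CopyFrom()), the partition-initialization code in execPartition.c would see only the first and could make the wrong call.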


--
regards,
Andrey Lepikhov
Postgres Professional
From 73705843d300ad1016384e6cb8893c80246372a6 Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
---
 src/backend/commands/copy.c  | 189 +--
 src/backend/commands/tablecmds.c |   1 +
 src/backend/executor/execMain.c  |  40 +
 src/backend/executor/execPartition.c |   7 +
 src/backend/replication/logical/worker.c |   1 +
 src/include/nodes/execnodes.h|   9 +-
 6 files changed, 127 insertions(+), 120 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..94f6e71a94 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,10 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2833,6 +2821,57 @@ CopyFrom(CopyState cstate)
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
+	/*
+	 * It's 

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-08 Thread Andrey V. Lepikhov

On 9/8/20 12:34 PM, Amit Langote wrote:

Hi Andrey,

On Mon, Sep 7, 2020 at 7:31 PM Andrey V. Lepikhov wrote:

On 9/7/20 12:26 PM, Michael Paquier wrote:

While on it, the CF bot is telling that the documentation of the patch
fails to compile.  This needs to be fixed.
--
Michael


v7 (in the attachment) fixes this problem.
I also accepted Amit's suggestion to rename all FDW API routines such as
ForeignCopyIn to *ForeignCopy.


Any thoughts on the taking out the refactoring changes out of the main
patch as I suggested?

Sorry, I thought you had asked me to ignore your previous letter. I'll 
look into this patch set shortly.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Ideas about a better API for postgres_fdw remote estimates

2020-09-07 Thread Andrey V. Lepikhov

On 9/4/20 6:23 PM, Ashutosh Bapat wrote:



On Thu, 3 Sep 2020 at 10:44, Andrey V. Lepikhov wrote:


On 8/31/20 6:19 PM, Ashutosh Bapat wrote:
 > On Mon, Aug 31, 2020 at 3:36 PM Andrey V. Lepikhov wrote:
 >>
 >> Thanks for this helpful feedback.
 > I think the patch has some other problems like it works only for
 > regular tables on foreign server but a foreign table can be pointing
 > to any relation like a materialized view, partitioned table or a
 > foreign table on the foreign server all of which have statistics
 > associated with them. I didn't look closely but it does not consider
 > that the foreign table may not have all the columns from the relation
 > on the foreign server or may have different names. But I think those
 > problems are kind of secondary. We have to agree on the design first.
 >
In accordance with the discussion, I made some changes in the patch:
1. The statistics extraction routine was moved into the core.


Bulk of the patch implements the statistics conversion to and fro json 
format. I am still not sure whether we need all of that code here.

Yes, I'm sure we'll replace it with something.

Right now, I want to discuss the format of the statistics dump. Remember 
that a statistics dump is needed not only for the FDW, but also for 
pg_dump. The dump will contain something like this:

'SELECT store_relation_statistics(rel, serialized_stat)'

My reasons for using JSON:
* it has conversion infrastructure like json_build_object();
* it is a flexible, readable format that can be useful in text dumps of 
relations.
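As an illustration of what such a versioned JSON dump could look like (a Python sketch; the field names are my assumptions, not the patch's actual format):

```python
import json

STAT_FORMAT_VERSION = 1

def serialize_relation_stats(relname, columns):
    """Build a portable, versioned JSON document for per-column statistics.
    `columns` maps an attribute name to its statistic values."""
    doc = {
        "version": STAT_FORMAT_VERSION,  # lets the loader reject or adapt
        "relation": relname,
        "columns": [
            {
                "attname": att,
                "stanullfrac": s["nullfrac"],
                "stadistinct": s["ndistinct"],
                # operators by namespace/name/operand types rather than
                # OIDs, so the dump stays portable across clusters
                "staop": s.get("ops", []),
            }
            for att, s in columns.items()
        ],
    }
    return json.dumps(doc)


stats = serialize_relation_stats(
    "public.ltable",
    {"a": {"nullfrac": 0.0, "ndistinct": -1.0,
           "ops": [{"nsp": "pg_catalog", "name": "<",
                    "left": "int4", "right": "int4"}]}},
)
```

The point of the top-level "version" field is exactly the conflict-resolution idea above: a newer server can adapt or discard a dump it no longer understands.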


Can we re-use pg_stats view? That is converting some of the OIDs to names. I 
agree with anyarray but if that's a problem here it's also a problem for 
pg_stats view, isn't it?
Right now, I don't know if it is possible to unambiguously convert the 
pg_stats information to a pg_statistic tuple.


If we can reduce the stats handling code to a 
minimum or use it for some other purpose as well e.g. pg_stats 
enhancement, the code changes required will be far less compared to the 
value that this patch provides.

+1

--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-07 Thread Andrey V. Lepikhov

On 9/7/20 12:26 PM, Michael Paquier wrote:

On Mon, Aug 24, 2020 at 06:19:28PM +0900, Amit Langote wrote:

On Mon, Aug 24, 2020 at 4:18 PM Amit Langote wrote:

I would


Oops, thought I'd continue writing, but hit send before actually doing
that.  Please ignore.

I have some comments on v6, which I will share later this week.


While on it, the CF bot is telling that the documentation of the patch
fails to compile.  This needs to be fixed.
--
Michael


v7 (in the attachment) fixes this problem.
I also accepted Amit's suggestion to rename all FDW API routines such as 
ForeignCopyIn to *ForeignCopy.


--
regards,
Andrey Lepikhov
Postgres Professional
From db4ba1bac6a8d642dffd1b907dcc1dd082203fab Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 9 Jul 2020 11:16:56 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert
is possible and the foreign table has a non-zero number of columns.

The FDW API was extended with the following routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free
the CopyState of the bulk COPY. The ExecForeignCopy routine sends
the 'COPY ... FROM STDIN' command to the foreign server, iteratively
sends tuples via the CopyTo() machinery, and sends EOF to this connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine was separated into deparseRelColumnList().
It is reused in deparseCopyFromSql().

Added TAP tests on specific corner cases of the COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure was extended
with a data_dest_cb callback. It is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure was extended with a cstate field.
It is needed to avoid repeated initialization of CopyState. Also for this
reason the CopyTo() routine was split into the set of routines CopyToStart()/
CopyTo()/CopyToFinish().

Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 ++-
 .../postgres_fdw/expected/postgres_fdw.out|  46 +-
 contrib/postgres_fdw/postgres_fdw.c   | 143 +++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  45 ++
 doc/src/sgml/fdwhandler.sgml  |  75 
 src/backend/commands/copy.c   | 398 +-
 src/backend/commands/tablecmds.c  |   1 +
 src/backend/executor/execMain.c   |  53 +++
 src/backend/executor/execPartition.c  |  28 +-
 src/backend/replication/logical/worker.c  |   2 +-
 src/include/commands/copy.h   |  11 +
 src/include/executor/executor.h   |   1 +
 src/include/foreign/fdwapi.h  |  15 +
 src/include/nodes/execnodes.h |   9 +-
 15 files changed, 670 insertions(+), 218 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list 

Re: Ideas about a better API for postgres_fdw remote estimates

2020-09-02 Thread Andrey V. Lepikhov

On 8/31/20 6:19 PM, Ashutosh Bapat wrote:

On Mon, Aug 31, 2020 at 3:36 PM Andrey V. Lepikhov wrote:


Thanks for this helpful feedback.

I think the patch has some other problems like it works only for
regular tables on foreign server but a foreign table can be pointing
to any relation like a materialized view, partitioned table or a
foreign table on the foreign server all of which have statistics
associated with them. I didn't look closely but it does not consider
that the foreign table may not have all the columns from the relation
on the foreign server or may have different names. But I think those
problems are kind of secondary. We have to agree on the design first.


In accordance with the discussion, I made some changes in the patch:
1. The statistics extraction routine was moved into the core.
2. The serialized statistics contain a 'version' field to indicate the 
format of the statistics received.
3. ANALYZE and VACUUM ANALYZE use this approach only in the case of 
implicit analysis of the relation.


I am currently keeping the limitation of using this approach for regular 
relations only, because I haven't studied the specifics of other types 
of relations.

But I don't know of any reason to keep this limit in the future.

The patch in attachment is very raw. I publish it for further substantive 
discussion.


--
regards,
Andrey Lepikhov
Postgres Professional
From 9cfd9b8a43691f1dacd0967dcc32fdf8414ddb56 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 4 Aug 2020 09:29:37 +0500
Subject: [PATCH] Pull statistic for a foreign table from remote server.

Add the extract_relation_statistics() routine that converts statistics
on a relation into JSON format.
All OIDs - starelid, staop[], stacoll[] - are converted to a portable
representation. An operator is uniquely defined by a set of features:
namespace, operator name, left operand namespace and name, and right
operand namespace and name.
A collation is uniquely defined by namespace, collation name and encoding.
The new FDW API routine GetForeignRelStat() implements access to this
machinery and returns a JSON string to the caller.
This function is called by the ANALYZE command (without an explicit
relation name) as an attempt to reduce the cost of updating statistics.
If the attempt fails, ANALYZE goes the expensive way. Add this feature to
VACUUM ANALYZE and autovacuum.
In accordance with discussion [1], move the extract_relation_statistics()
routine into the core.

ToDo: tests on custom operators and collations.

1. https://www.postgresql.org/message-id/flat/1155731.1593832096%40sss.pgh.pa.us
---
 contrib/postgres_fdw/Makefile |   2 +-
 contrib/postgres_fdw/deparse.c|   8 +
 .../postgres_fdw/expected/foreign_stat.out| 112 +++
 contrib/postgres_fdw/postgres_fdw.c   |  49 ++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/foreign_stat.sql |  46 +
 src/backend/commands/analyze.c| 794 ++
 src/backend/commands/vacuum.c |  13 +-
 src/backend/utils/adt/json.c  |   6 +
 src/backend/utils/cache/lsyscache.c   | 167 
 src/include/catalog/pg_proc.dat   |   3 +
 src/include/catalog/pg_statistic.h|   1 +
 src/include/foreign/fdwapi.h  |   2 +
 src/include/utils/json.h  |   1 +
 src/include/utils/lsyscache.h |   8 +
 15 files changed, 1211 insertions(+), 2 deletions(-)
 create mode 100644 contrib/postgres_fdw/expected/foreign_stat.out
 create mode 100644 contrib/postgres_fdw/sql/foreign_stat.sql

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index ee8a80a392..a5a838b8fc 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -16,7 +16,7 @@ SHLIB_LINK_INTERNAL = $(libpq)
 EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
-REGRESS = postgres_fdw
+REGRESS = postgres_fdw foreign_stat
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..e63cb4982f 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -2053,6 +2053,14 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
 	appendStringInfo(buf, "::pg_catalog.regclass) / %d", BLCKSZ);
 }
 
+void
+deparseGetStatSql(StringInfo buf, Relation rel)
+{
+	appendStringInfo(buf, "SELECT * FROM extract_relation_statistics('");
+	deparseRelation(buf, rel);
+	appendStringInfoString(buf, "');");
+}
+
 /*
  * Construct SELECT statement to acquire sample rows of given relation.
  *
diff --git a/contrib/postgres_fdw/expected/foreign_stat.out b/contrib/postgres_fdw/expected/foreign_stat.out
new file mode 100644
index 00..46fd2f8427
--- /dev/null
+++ b/contrib/postgres_fdw/expected/foreign_stat.out
@@ -0,0 +1,112 @@
+CREATE TABLE ltable (a int, b real);
+CREATE FOREIGN TABLE ftable (a int) server loopback options (table_name 'ltable');
+VACUU

Re: Ideas about a better API for postgres_fdw remote estimates

2020-08-31 Thread Andrey V. Lepikhov

On 8/29/20 9:50 PM, Tom Lane wrote:

Years ago (when I was still at Salesforce, IIRC, so ~5 years) we had
some discussions about making it possible for pg_dump and/or pg_upgrade
to propagate stats data forward to the new database.  There is at least
one POC patch in the archives for doing that by dumping the stats data
wrapped in a function call, where the target database's version of the
function would be responsible for adapting the data if necessary, or
maybe just discarding it if it couldn't adapt.  We seem to have lost
interest but it still seems like something worth pursuing.  I'd guess
that if such infrastructure existed it could be helpful for this.


Thanks for this helpful feedback.

I found several threads related to the problem [1-3].
I agree that this task requires implementing an API for 
serialization/deserialization of statistics:

pg_load_relation_statistics(json_string text);
pg_get_relation_statistics(relname text);
We can use a version number to resolve conflicts between different 
statistics implementations.
The "load" function will validate the values[] anyarray while deserializing 
the input JSON string into the data types of the relation's columns.
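To make the idea concrete, here is a hedged sketch (a hypothetical format, not a real catalog layout) of how such versioned serialization/deserialization could look:

```python
# Hypothetical sketch of the proposed API: statistics are exported as a
# versioned JSON document, and the "load" side validates the version and
# discards data it cannot adapt.  The format is purely illustrative.
import json

STATS_FORMAT_VERSION = 1

def get_relation_statistics(relname, stats):
    """Serialize per-column stats, tagging them with a format version."""
    return json.dumps({"version": STATS_FORMAT_VERSION,
                       "relation": relname,
                       "columns": stats})

def load_relation_statistics(payload):
    """Deserialize, validating the version; discard what can't be adapted."""
    doc = json.loads(payload)
    if doc.get("version") != STATS_FORMAT_VERSION:
        return None                      # incompatible producer: discard
    return doc["columns"]

payload = get_relation_statistics(
    "public.t", {"a": {"null_frac": 0.0, "n_distinct": -1.0}})
cols = load_relation_statistics(payload)
```

A real implementation would also have to validate each value against the column's data type, as noted above.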


Perhaps I have not yet seen all the problems this task involves?

1. https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook
2. 
https://www.postgresql.org/message-id/flat/CAAZKuFaWdLkK8eozSAooZBets9y_mfo2HS6urPAKXEPbd-JLCA%40mail.gmail.com
3. 
https://www.postgresql.org/message-id/flat/GNELIHDDFBOCMGBFGEFOOEOPCBAA.chriskl%40familyhealth.com.au


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

2020-08-21 Thread Andrey V. Lepikhov

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai  wrote:



The attached v2 fixed the problem, and regression test finished correctly.


This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is the basis of my experimental feature: we do not 
break partitionwise join between a relation with a foreign partition and 
a local relation if we know that the remote server has a foreign table 
linked to the local relation (by analogy with shippable extensions).


Maybe mark as 'Needs review'?

--
regards,
Andrey Lepikhov
Postgres Professional
>From 8dda8c4ba29ed4b2a54f66746ebedd9ab0bfded9 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Fri, 21 Aug 2020 10:38:59 +0500
Subject: [PATCH] Add one more planner strategy to JOIN with a partitioned
 relation.

Try to join inner relation with each partition of outer relation
and append results.
This strategy has potential benefits because it allows partitionwise
join with an unpartitioned relation or with a relation that is
partitioned by another schema.
---
 src/backend/optimizer/path/allpaths.c|   9 +-
 src/backend/optimizer/path/joinpath.c|   9 ++
 src/backend/optimizer/path/joinrels.c| 132 +
 src/backend/optimizer/plan/planner.c |   6 +-
 src/backend/optimizer/util/appendinfo.c  |  18 ++-
 src/backend/optimizer/util/relnode.c |  14 +-
 src/include/optimizer/paths.h|  10 +-
 src/test/regress/expected/partition_join.out | 145 +++
 src/test/regress/sql/partition_join.sql  |  63 
 9 files changed, 385 insertions(+), 21 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..4f110c5a2f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1278,7 +1278,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	}
 
 	/* Add paths to the append relation. */
-	add_paths_to_append_rel(root, rel, live_childrels);
+	add_paths_to_append_rel(root, rel, live_childrels, NIL);
 }
 
 
@@ -1295,7 +1295,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
  */
 void
 add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
-		List *live_childrels)
+		List *live_childrels,
+		List *original_partitioned_rels)
 {
 	List	   *subpaths = NIL;
 	bool		subpaths_valid = true;
@@ -1307,7 +1308,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	List	   *all_child_pathkeys = NIL;
 	List	   *all_child_outers = NIL;
 	ListCell   *l;
-	List	   *partitioned_rels = NIL;
+	List	   *partitioned_rels = original_partitioned_rels;
 	double		partial_rows = -1;
 
 	/* If appropriate, consider parallel append */
@@ -3950,7 +3951,7 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	}
 
 	/* Build additional paths for this rel from child-join paths. */
-	add_paths_to_append_rel(root, rel, live_children);
+	add_paths_to_append_rel(root, rel, live_children, NIL);
 	list_free(live_children);
 }
 
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6ba2e..36464e31aa 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -324,6 +324,15 @@ add_paths_to_joinrel(PlannerInfo *root,
 	if (set_join_pathlist_hook)
 		set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
 			   jointype, );
+
+	/*
+	 * 7. If outer relation is delivered from partition-tables, consider
+	 * distributing inner relation to every partition-leaf prior to
+	 * append these leafs.
+	 */
+	try_asymmetric_partitionwise_join(root, joinrel,
+	  outerrel, innerrel,
+	  jointype, );
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2d343cd293..4a7d0d0604 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
 
 #include "miscadmin.h"
 #include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -1551,6 +1552,137 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 	}
 }
 
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+		 RelOptInfo *joinrel,
+		 AppendPath *append_path,
+		 RelOptInfo *inner_rel,
+		 JoinType jointype,
+		 JoinPathExtraData *extra)
+{
+	List		*result = NIL;
+	ListCell	*lc;
+
+	foreach (lc, append_path->subpaths)
+	{
+		Path			*child_path = lfirst(lc);
+		RelOptInfo		*child_rel = child_path->parent;
+		Relids			child_join_relids;
+		RelOptInfo		*child_join_rel;
+		SpecialJoinInfo	*child_sjinfo;
+		List			*child_restrictlist;
+		AppendRelInfo	

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-29 Thread Andrey V. Lepikhov

On 7/29/20 1:03 PM, Amit Langote wrote:

Hi Andrey,

Thanks for updating the patch.  I will try to take a look later.

On Wed, Jul 22, 2020 at 6:09 PM Andrey V. Lepikhov
 wrote:

On 7/16/20 2:14 PM, Amit Langote wrote:

* Why the "In" in these API names?

+   /* COPY a bulk of tuples into a foreign relation */
+   BeginForeignCopyIn_function BeginForeignCopyIn;
+   EndForeignCopyIn_function EndForeignCopyIn;
+   ExecForeignCopyIn_function ExecForeignCopyIn;


I used an analogy from copy.c.


Hmm, if we were going to also need *ForeignCopyOut APIs, maybe it
makes sense to have "In" here, but maybe we don't, so how about
leaving out the "In" for clarity?

Ok, sounds good.



* I see that the remote copy is performed from scratch on every call
of postgresExecForeignCopyIn(), but wouldn't it be more efficient to
send the `COPY remote_table FROM STDIN` in
postgresBeginForeignCopyIn() and end it in postgresEndForeignCopyIn()
when there are no errors during the copy?


It is not possible. The FDW shares one connection among all foreign
relations from a server. If two or more partitions are placed on one
foreign server, you will have problems with concurrent COPY commands.


Ah, you're right.  I didn't consider multiple foreign partitions
pointing to the same server.  Indeed, we would need separate
connections to a given server to COPY to multiple remote relations on
that server in parallel.


Maybe we can create a new connection for each partition?


Yeah, perhaps, although it sounds like something that might be more
generally useful and so we should work on that separately if at all.

I will try to prepare a separate patch.



I tried implementing these two changes -- pgfdw_copy_data_dest_cb()
and sending `COPY remote_table FROM STDIN` only once instead of on
every flush -- and I see significant speedup.  Please check the
attached patch that applies on top of yours.


I integrated the first change and rejected the second for the reason given above.


Thanks.

Will send more comments after reading the v5 patch.


Ok. I'll be waiting for the end of your review.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Global snapshots

2020-07-27 Thread Andrey V. Lepikhov

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey san, Movead san,


From: tsunakawa.ta...@fujitsu.com 

While Clock-SI seems to be considered the best promising for global
serializability here,

* Why does Clock-SI gets so much attention?  How did Clock-SI become the
only choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
organization use Clock-SI?


Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.  Microsoft 
holds this until 2031.  I couldn't find it with the keyword "Clock-SI".


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa




Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-23 Thread Andrey V. Lepikhov

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch, with changes prompted by Amit's comments.

--
regards,
Andrey Lepikhov
Postgres Professional
>From 24465d61d6f0ec6a45578d252bda1690ac045543 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 9 Jul 2020 11:16:56 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert
is possible and the foreign table has a non-zero number of columns.

FDWAPI was extended by next routines:
* BeginForeignCopyIn
* EndForeignCopyIn
* ExecForeignCopyIn

BeginForeignCopyIn and EndForeignCopyIn initialize and free
the CopyState of the bulk COPY. The ExecForeignCopyIn routine sends
the 'COPY ... FROM STDIN' command to the foreign server, sends tuples
iteratively via the CopyTo() machinery, and finally sends EOF over the
connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine has been split out into deparseRelColumnList().
It is reused in deparseCopyFromSql().

Added TAP tests for specific corner cases of the COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure was extended with a
data_dest_cb callback. It is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with a cstate field,
needed to avoid repeated initialization of the CopyState. Also for this
reason, the CopyTo() routine was split into CopyToStart()/
CopyTo()/CopyToFinish().

Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 -
 .../postgres_fdw/expected/postgres_fdw.out|  33 ++-
 contrib/postgres_fdw/postgres_fdw.c   | 146 +++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  28 ++
 doc/src/sgml/fdwhandler.sgml  |  74 ++
 src/backend/commands/copy.c   | 247 +++---
 src/backend/executor/execMain.c   |   1 +
 src/backend/executor/execPartition.c  |  34 ++-
 src/include/commands/copy.h   |  11 +
 src/include/foreign/fdwapi.h  |  15 ++
 src/include/nodes/execnodes.h |   8 +
 12 files changed, 547 insertions(+), 111 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parentheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2069,10 +2109,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 	List	   *options;
 	ListCell   *lc;
 	bool		first = true;
+	List	   *retrieved_attrs = NIL;
 
-	*retrieved_attrs = NIL;
-
-	appendStringInfoString(buf, "SELECT ");
 	for (i = 0; i < tupdesc->natts; i++)
 	{
 		/* Ignore dropped columns. */
@@ -2081,6 +2119,9 @@ deparseAnalyzeSql(StringInfo 

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-22 Thread Andrey V. Lepikhov

On 7/16/20 2:14 PM, Amit Langote wrote:

Hi Andrey,

Thanks for this work.  I have been reading through your patch and
here's a what I understand it does and how:

The patch aims to fix the restriction that COPYing into a foreign
table can't use multi-insert buffer mechanism effectively.  That's
because copy.c currently uses the ExecForeignInsert() FDW API which
can be passed only 1 row at a time.  postgres_fdw's implementation
issues an `INSERT INTO remote_table VALUES (...)` statement to the
remote side for each row, which is pretty inefficient for bulk loads.
The patch introduces a new FDW API ExecForeignCopyIn() that can
receive multiple rows and copy.c now calls it every time it flushes
the multi-insert buffer so that all the flushed rows can be sent to
the remote side in one go.  postgres_fdw's now issues a `COPY
remote_table FROM STDIN` to the remote server and
postgresExecForeignCopyIn() funnels the tuples flushed by the local
copy to the server side waiting for tuples on the COPY protocol.


Fine


Here are some comments on the patch.

* Why the "In" in these API names?

+   /* COPY a bulk of tuples into a foreign relation */
+   BeginForeignCopyIn_function BeginForeignCopyIn;
+   EndForeignCopyIn_function EndForeignCopyIn;
+   ExecForeignCopyIn_function ExecForeignCopyIn;


I used an analogy from copy.c.


* fdwhandler.sgml should be updated with the description of these new APIs.




* As far as I can tell, the following copy.h additions are for an FDW
to use copy.c to obtain an external representation (char string) to
send to the remote side of the individual rows that are passed to
ExecForeignCopyIn():

+typedef void (*copy_data_dest_cb) (void *outbuf, int len);
+extern CopyState BeginForeignCopyTo(Relation rel);
+extern char *NextForeignCopyRow(CopyState cstate, TupleTableSlot *slot);
+extern void EndForeignCopyTo(CopyState cstate);

So, an FDW's ExecForeignCopyIn() calls copy.c: NextForeignCopyRow()
which in turn calls copy.c: CopyOneRowTo() which fills
CopyState.fe_msgbuf.  The data_dest_cb() callback that runs after
fe_msgbuf contains the full row simply copies it into a palloc'd char
buffer whose pointer is returned back to ExecForeignCopyIn().  I
wonder why not let FDWs implement the callback and pass it to copy.c
through BeginForeignCopyTo()?  For example, you could implement a
pgfdw_copy_data_dest_cb() in postgres_fdw.c which gets a direct
pointer of fe_msgbuf to send it to the remote server.

That is a good point! Thank you.


Do you think all FDWs would want to use copy,c like above?  If not,
maybe the above APIs are really postgres_fdw-specific?  Anyway, adding
comments above the definitions of these functions would be helpful.

Agreed


* I see that the remote copy is performed from scratch on every call
of postgresExecForeignCopyIn(), but wouldn't it be more efficient to
send the `COPY remote_table FROM STDIN` in
postgresBeginForeignCopyIn() and end it in postgresEndForeignCopyIn()
when there are no errors during the copy?
It is not possible. The FDW shares one connection among all foreign 
relations from a server. If two or more partitions are placed on one 
foreign server, you will have problems with concurrent COPY commands. 
Maybe we can create a new connection for each partition?
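The constraint described above can be sketched with a toy model (class and table names are illustrative, not postgres_fdw internals):

```python
# Toy model: postgres_fdw keeps one connection per foreign server, and a
# connection cannot run two COPY ... FROM STDIN streams at once.
class ServerConnection:
    def __init__(self):
        self.copy_in_progress = None

    def begin_copy(self, table):
        if self.copy_in_progress is not None:
            raise RuntimeError(f"connection busy with COPY into "
                               f"{self.copy_in_progress}")
        self.copy_in_progress = table

    def end_copy(self):
        self.copy_in_progress = None

shared = ServerConnection()
shared.begin_copy("part_1")
try:
    shared.begin_copy("part_2")      # second partition on the same server
    conflict = False
except RuntimeError:
    conflict = True                  # concurrent COPY on one connection fails

# One possible way out, as suggested above: a connection per partition.
per_partition = {p: ServerConnection() for p in ("part_1", "part_2")}
for p, conn in per_partition.items():
    conn.begin_copy(p)               # no conflict now
```

This is why starting the remote COPY once in BeginForeignCopyIn() is unsafe when several partitions share one server connection.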


I tried implementing these two changes -- pgfdw_copy_data_dest_cb()
and sending `COPY remote_table FROM STDIN` only once instead of on
every flush -- and I see significant speedup.  Please check the
attached patch that applies on top of yours.

I integrated the first change and rejected the second for the reason given above.
  One problem I spotted

when trying my patch but didn't spend much time debugging is that
local COPY cannot be interrupted by Ctrl+C anymore, but that should be
fixable by adjusting PG_TRY() blocks.

Thanks


* ResultRelInfo.UseBulkModifying should be ri_usesBulkModify for consistency.

+1

I will post a new version of the patch a little bit later.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-16 Thread Andrey V. Lepikhov

On 7/16/20 9:35 PM, Etsuro Fujita wrote:

On Thu, Jul 16, 2020 at 8:56 PM Andrey Lepikhov
 wrote:

On 7/16/20 9:55 AM, Etsuro Fujita wrote:



On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
 wrote:

Some real-life test queries show, that all single-node queries aren't
pushed-down to the required node. For example:

SELECT
   *
FROM
   documents
   INNER JOIN users ON documents.user_id = users.id
WHERE
   documents.company_id = 5
   AND users.company_id = 5;



PWJ cannot be applied
to the join due to the limitation of the PWJ matching logic.  See the
discussion started in [1].  I think the patch in [2] would address
this issue as well, though the patch is under review.



I think discussion [1] has little relevance to the current task. Here we
do not join on the partition attribute, so PWJ can't be used at all.


The main point of the discussion is to determine whether PWJ can be
used for a join between partitioned tables, based on
EquivalenceClasses, not just join clauses created by
build_joinrel_restrictlist().  For the above join, for example, the
patch in [2] would derive a join clause "documents.company_id =
users.company_id" from an EquivalenceClass that recorded the knowledge
"documents.company_id = 5" and "users.company_id = 5", and then the
planner would consider from it that PWJ can be used for the join.

OK, this patch works, and you solved part of the problem with this 
interesting approach.

But you can see that this modified query:

SELECT * FROM documents, users WHERE documents.company_id = 5 AND 
users.company_id = 7;


could also be pushed down to node2 and joined there, but it is not.
My point is that we can try to solve the whole problem.

--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC and rebased patch for CSN based snapshots

2020-07-15 Thread Andrey V. Lepikhov

On 7/13/20 11:46 AM, movead...@highgo.ca wrote:

I continued reviewing your patch. See some code improvements in the attachment.

Questions:
* Why is csnSnapshotActive the only member of the CSNshapshotShared struct?
* The WriteAssignCSNXlogRec() routine: I didn't understand why you add 20 
nanoseconds to the current CSN and write that into the WAL. To simplify our 
communication, I rewrote this routine according to my understanding (see the 
patch in the attachment).


In general, maybe we could add your WAL-writing CSN machinery + TAP tests 
to the patch from the thread [1] and work on it together?


[1] 
https://www.postgresql.org/message-id/flat/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru


--
regards,
Andrey Lepikhov
Postgres Professional
>From 9a1595507c83b5fde61a6a3cc30f6df9df410e76 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 15 Jul 2020 11:55:00 +0500
Subject: [PATCH] 1

---
 src/backend/access/transam/csn_log.c   | 35 --
 src/include/access/csn_log.h   |  8 +++---
 src/test/regress/expected/sysviews.out |  3 ++-
 3 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/csn_log.c b/src/backend/access/transam/csn_log.c
index 319e89c805..53d3877851 100644
--- a/src/backend/access/transam/csn_log.c
+++ b/src/backend/access/transam/csn_log.c
@@ -150,8 +150,8 @@ CSNLogSetCSN(TransactionId xid, int nsubxids,
  */
 static void
 CSNLogSetPageStatus(TransactionId xid, int nsubxids,
-		   TransactionId *subxids,
-		   XidCSN csn, int pageno)
+	TransactionId *subxids,
+	XidCSN csn, int pageno)
 {
 	int			slotno;
 	int			i;
@@ -187,8 +187,8 @@ CSNLogSetCSNInSlot(TransactionId xid, XidCSN csn, int slotno)
 
 	Assert(LWLockHeldByMe(CSNLogControlLock));
 
-	ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
-
+	ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] +
+	entryno * sizeof(XidCSN));
 	*ptr = csn;
 }
 
@@ -205,17 +205,16 @@ CSNLogGetCSNByXid(TransactionId xid)
 	int			pageno = TransactionIdToPage(xid);
 	int			entryno = TransactionIdToPgIndex(xid);
 	int			slotno;
-	XidCSN *ptr;
-	XidCSN	xid_csn;
+	XidCSN		csn;
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 	slotno = SimpleLruReadPage_ReadOnly(CsnlogCtl, pageno, xid);
-	ptr = (XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(XLogRecPtr));
-	xid_csn = *ptr;
+	csn = *(XidCSN *) (CsnlogCtl->shared->page_buffer[slotno] +
+	entryno * sizeof(XidCSN));
 
 	LWLockRelease(CSNLogControlLock);
 
-	return xid_csn;
+	return csn;
 }
 
 /*
@@ -501,15 +500,15 @@ WriteAssignCSNXlogRec(XidCSN xidcsn)
 {
 	XidCSN log_csn = 0;
 
-	if(xidcsn > get_last_log_wal_csn())
-	{
-		log_csn = CSNAddByNanosec(xidcsn, 20);
-		set_last_log_wal_csn(log_csn);
-	}
-	else
-	{
+	if(xidcsn <= get_last_log_wal_csn())
+		/*
+		 * WAL-write related code. If concurrent backend already wrote into WAL
+		 * its CSN with bigger value it isn't needed to write this value.
+		 */
 		return;
-	}
+
+	log_csn = CSNAddByNanosec(xidcsn, 0);
+	set_last_log_wal_csn(log_csn);
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) (_csn), sizeof(XidCSN));
@@ -571,7 +570,6 @@ csnlog_redo(XLogReaderState *record)
 		LWLockAcquire(CSNLogControlLock, LW_EXCLUSIVE);
 		set_last_max_csn(csn);
 		LWLockRelease(CSNLogControlLock);
-
 	}
 	else if (info == XLOG_CSN_SETXIDCSN)
 	{
@@ -589,7 +587,6 @@ csnlog_redo(XLogReaderState *record)
 		SimpleLruWritePage(CsnlogCtl, slotno);
 		LWLockRelease(CSNLogControlLock);
 		Assert(!CsnlogCtl->shared->page_dirty[slotno]);
-
 	}
 	else if (info == XLOG_CSN_TRUNCATE)
 	{
diff --git a/src/include/access/csn_log.h b/src/include/access/csn_log.h
index 5838028a30..c23a71446a 100644
--- a/src/include/access/csn_log.h
+++ b/src/include/access/csn_log.h
@@ -15,10 +15,10 @@
 #include "utils/snapshot.h"
 
 /* XLOG stuff */
-#define XLOG_CSN_ASSIGNMENT 0x00
-#define XLOG_CSN_SETXIDCSN   0x10
-#define XLOG_CSN_ZEROPAGE   0x20
-#define XLOG_CSN_TRUNCATE   0x30
+#define XLOG_CSN_ASSIGNMENT	0x00
+#define XLOG_CSN_SETXIDCSN	0x10
+#define XLOG_CSN_ZEROPAGE	0x20
+#define XLOG_CSN_TRUNCATE	0x30
 
 typedef struct xl_xidcsn_set
 {
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..cc169a1999 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -73,6 +73,7 @@ select name, setting from pg_settings where name like 'enable%';
   name  | setting 
 +-
  enable_bitmapscan  | on
+ enable_csn_snapshot| off
  enable_gathermerge | on
  enable_hashagg | on
  enable_hashjoin| on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan | on
  enable_sort| on
  enable_tidscan | on

Re: POC and rebased patch for CSN based snapshots

2020-07-12 Thread Andrey V. Lepikhov

On 7/4/20 7:56 PM, movead...@highgo.ca wrote:



As far as I know about Clock-SI, the left part of the blue line will be
set up as the snapshot if the master requires a snapshot at time t1. But
in fact data A should be in the snapshot but is not, and data B should be
out of the snapshot but is not.


Can this scenario appear in your original patch? Or is something in my
understanding of Clock-SI wrong?





Sorry for the late answer.

I am not sure I fully understood your question, but still: what real
problems do you see here? Transaction t1 does not get the state of shard2
until the time at the node with shard2 reaches the start time of t1.
If the transaction that inserted B wants to know its position in time
relative to t1, it will generate a CSN, attach to node1, and see that t1
has not started yet.


Maybe you mean the case where someone with a faster data channel can use
the knowledge from node1 to change the state at node2?

If so, I think it is not a problem; otherwise, please explain your idea.
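The timing rule described above — a shard must not serve a snapshot until its local clock reaches the snapshot's CSN — can be sketched with logical clocks (all clock values are made up for illustration):

```python
# Minimal Clock-SI-style sketch.  A remote shard whose clock lags the
# snapshot CSN must wait before answering; this is what prevents t1 from
# observing B "too early".  All numbers are hypothetical.
def snapshot_wait(shard_clock, snapshot_csn):
    """Ticks the shard must wait before it may evaluate visibility."""
    return max(0, snapshot_csn - shard_clock)

def visible(commit_csn, snapshot_csn):
    """A tuple is visible iff it committed strictly before the snapshot CSN."""
    return commit_csn < snapshot_csn

snapshot_csn = 100                 # assigned by node1 when t1 starts
wait_node2 = snapshot_wait(shard_clock=95, snapshot_csn=snapshot_csn)
wait_node1 = snapshot_wait(shard_clock=100, snapshot_csn=snapshot_csn)

b_commit_csn = 103                 # B commits after t1's snapshot point
assert not visible(b_commit_csn, snapshot_csn)
```

After the wait, any commit the shard performed in the meantime carries a CSN above the snapshot's, so it is correctly excluded.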

--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC: postgres_fdw insert batching

2020-07-09 Thread Andrey V. Lepikhov

On 6/28/20 8:10 PM, Tomas Vondra wrote:

Now, the primary reason why the performance degrades like this is that
while FDW has batching for SELECT queries (i.e. we read larger chunks of
data from the cursors), we don't have that for INSERTs (or other DML).
Every time you insert a row, it has to go all the way down into the
partition synchronously.
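The per-row overhead described in the quoted paragraph can be sketched by simply counting network flushes (the row count and batch size are hypothetical):

```python
# Illustrative sketch: a per-row INSERT costs one round trip per row,
# while flushing a multi-insert buffer amortizes many rows per flush.
import math

rows = 100_000
batch_size = 1000          # hypothetical multi-insert buffer size

round_trips_insert = rows                           # one INSERT per row
round_trips_batched = math.ceil(rows / batch_size)  # one flush per buffer

print(round_trips_insert, round_trips_batched)      # 100000 100
```

The thousandfold difference in round trips is the gap the batching work aims to close.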


You added new fields to the PgFdwModifyState struct. Why didn't you 
reuse the ResultRelInfo::ri_CopyMultiInsertBuffer field and the 
CopyMultiInsertBuffer machinery as storage for incoming tuples?


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Asymmetric partition-wise JOIN

2020-07-06 Thread Andrey V. Lepikhov

On 12/27/19 12:34 PM, Kohei KaiGai wrote:

The attached v2 fixed the problem, and regression test finished correctly.
Using your patch, I saw an incorrect value of predicted rows at the top node 
of the plan: "Append  (cost=270.02..35165.37 rows=40004 width=16)"
See the full explain of the query plan in the attachment 
explain_with_asymmetric.sql


If I disable enable_partitionwise_join, then:
"Hash Join  (cost=270.02..38855.25 rows=10001 width=16)"
Full explain: explain_no_asymmetric.sql

I thought this was a case of incorrect usage of cached values of 
norm_selec, but it is a corner-case problem of the eqjoinsel() routine:


selectivity = 1/size_of_larger_relation; (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607)

i.e., the number of tuples depends only on the size of the smaller relation.
It is not a bug in your patch, but I think you should know about it because 
it may affect the planner's decisions.
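Plugging hypothetical row counts into the two formulas above shows the effect:

```python
# Sketch of the eqjoinsel() corner case described above, using hypothetical
# row counts (not taken from the actual test run).
outer_tuples = 1_000_000   # large partitioned table, e.g. 'parts'
inner_tuples = 10_001      # small table, e.g. 't0'

# selfuncs.c: without MCV/ndistinct info, selectivity falls back to
# 1 / size_of_larger_relation
selectivity = 1.0 / max(outer_tuples, inner_tuples)

# costsize.c: estimated join cardinality
tuples = selectivity * outer_tuples * inner_tuples

# The estimate collapses to roughly the size of the smaller relation,
# regardless of how large the outer side actually is.
```

Since the outer-side size cancels out, every estimate degenerates to the smaller relation's cardinality, which can mislead the planner's join-order choice.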


===
P.S. Test case:
CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a);
CREATE TABLE parts_0 PARTITION OF parts FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE parts_1 PARTITION OF parts FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);

--
regards,
Andrey Lepikhov
Postgres Professional


explain_with_asymmetric.sql
Description: application/sql


explain_no_asymmetric.sql
Description: application/sql


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-02 Thread Andrey V. Lepikhov

On 6/22/20 5:11 PM, Ashutosh Bapat wrote:


It looks like we call BeginForeignInsert and EndForeignInsert even 
though actual copy is performed using BeginForeignCopy, ExecForeignCopy 
and EndForeignCopy. BeginForeignInsert constructs the INSERT query which 
looks unnecessary. Also some of the other PgFdwModifyState members are 
initialized unnecessarily. It also gives an impression that we are using 
INSERT underneath the copy. Instead a better way would be to 
call BeginForeignCopy instead of BeginForeignInsert and EndForeignCopy 
instead of EndForeignInsert, if we are going to use COPY protocol to 
copy data to the foreign server. Corresponding postgres_fdw 
implementations need to change in order to do that.


I did not answer for a long time because I was waiting for the results of 
the discussion of Tomas's approach to bulk INSERT/UPDATE/DELETE, which seems 
more general.
I can move the query construction into the first execution of the INSERT or 
COPY operation. But the other changes seem more invasive, because 
BeginForeignInsert/EndForeignInsert are used in the execPartition.c 
module. We would need to pass the copy/insert state of the operation into 
ExecFindPartition() and ExecCleanupTupleRouting().


--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC and rebased patch for CSN based snapshots

2020-07-02 Thread Andrey V. Lepikhov

On 7/2/20 7:31 PM, Movead Li wrote:

Thanks for the remarks,

 >Some remarks on your patch:
 >1. The variable last_max_csn can be an atomic variable.
Yes will consider.

 >2. GenerateCSN() routine: in the case than csn < csnState->last_max_csn
 >This is the case when someone changed the value of the system clock. I
 >think it is needed to write a WARNING to the log file. (May be we can do
 >synchronization with a time server.
Yes, good point. I will work out a way to report the warning; there should
be a reporting gap rather than a report every time a CSN is generated.

Do we really need the correct time? What is the drawback if a node

generates CSNs by monotonic increase alone?
Changes to the clock value can lead to bad effects, such as an old snapshot. 
Adjusting the time can be a kind of defense.
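A minimal sketch of these two ideas together — keep the CSN monotonic across a clock regression and rate-limit the warning — might look like this (the names and the one-second report gap are assumptions, not taken from the patch):

```python
# Sketch of a GenerateCSN() that stays monotonic when the system clock
# jumps backwards and logs a rate-limited warning, as discussed above.
import logging
import time

class CsnState:
    def __init__(self):
        self.last_max_csn = 0
        self.last_warn = 0.0

    def generate_csn(self, now_ns=None):
        csn = time.time_ns() if now_ns is None else now_ns
        if csn <= self.last_max_csn:
            # Clock moved backwards (or stalled): warn at most once per second.
            if time.monotonic() - self.last_warn > 1.0:
                logging.warning("system clock moved backwards; "
                                "forcing CSN monotonicity")
                self.last_warn = time.monotonic()
            csn = self.last_max_csn + 1
        self.last_max_csn = csn
        return csn

state = CsnState()
a = state.generate_csn(now_ns=1_000)
b = state.generate_csn(now_ns=900)   # simulated clock regression
```

In a real backend last_max_csn would live in shared memory behind an atomic or a lock, per the remarks above.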


 >3. That about global snapshot xmin? In the pgpro version of the patch we
 >had GlobalSnapshotMapXmin() routine to maintain circular buffer of
 >oldestXmins for several seconds in past. This buffer allows to shift
 >oldestXmin in the past when backend is importing global transaction.
 >Otherwise old versions of tuples that were needed for this transaction
 >can be recycled by other processes (vacuum, HOT, etc).
 >How do you implement protection from local pruning? I saw
 >SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
I have researched your patch which is so great, in the patch only data
out of 'global_snapshot_defer_time' can be vacuum, and it keep dead
tuple even if no snapshot import at all,right?

I am thanking about a way if we can start remain dead tuple just before
we import a csn snapshot.

Base on Clock-SI paper, we should get local CSN then send to shard nodes,
because we do not known if the shard nodes' csn bigger or smaller then
master node, so we should keep some dead tuple all the time to support
snapshot import anytime.

So we could make a small change to the Clock-SI model: instead of using the
local CSN when the transaction starts, we touch every shard node to request
its CSN (at which point the shard nodes start keeping dead tuples), and the
master node chooses the biggest CSN and sends it to the shard nodes.

With this approach we do not need to keep dead tuples all the time or
manage a ring buffer; we can hand the problem over to the 'snapshot too
old' feature. The trade-off is that almost all shard nodes have to wait.
I will send a more detailed explanation in a few days.
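The modified Clock-SI handshake described above can be sketched as a small
Python model (the names Shard and begin_global_snapshot are illustrative,
not Postgres APIs; this only models the choose-the-maximum-CSN idea):

```python
# Hypothetical model: the coordinator polls every shard for its current CSN
# (at which point that shard starts retaining dead tuples), then the maximum
# CSN becomes the snapshot CSN broadcast back to the shards.

class Shard:
    def __init__(self, local_clock):
        self.local_clock = local_clock   # simulated local CSN source
        self.retain_from = None          # CSN since which dead tuples are kept

    def report_csn(self):
        # The shard starts keeping dead tuples only from this moment on.
        self.retain_from = self.local_clock
        return self.local_clock

def begin_global_snapshot(shards):
    # Collect every shard's CSN, then choose the largest as the snapshot CSN.
    csns = [s.report_csn() for s in shards]
    snapshot_csn = max(csns)
    # Shards whose clock is behind must wait until they pass snapshot_csn
    # before serving reads for this snapshot (the Clock-SI wait rule).
    waiters = [s for s in shards if s.local_clock < snapshot_csn]
    return snapshot_csn, waiters

shards = [Shard(100), Shard(103), Shard(98)]
csn, waiters = begin_global_snapshot(shards)
print(csn, len(waiters))   # 103 2
```

The last line shows the trade-off mentioned above: two of the three shards
have to wait for the chosen CSN.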
I think that in a distributed system with many servers this can become a
bottleneck.
The main idea of the "deferred time" is to reduce interference between DML
queries under an intensive OLTP workload. This time can be reduced if the
bloating of the database outweighs the frequency of transaction aborts.



 >4. The current version of the patch is not applied clearly with current
 >master.
Maybe it is because of the PG13 release, which caused some conflicts. I
will rebase it.

Ok


---
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca





--
regards,
Andrey Lepikhov
Postgres Professional




Re: POC and rebased patch for CSN based snapshots

2020-06-29 Thread Andrey V. Lepikhov

On 6/12/20 2:41 PM, movead...@highgo.ca wrote:

Hello hackers,

Currently, I do some changes based on the last version:
1. Catch up to the current  commit (c2bd1fec32ab54).
2. Add regression and document.
3. Add support to switch from xid-base snapshot to csn-base snapshot,
and the same with standby side.


Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
2. GenerateCSN() routine: in the case when csn < csnState->last_max_csn.
This is the case when someone has changed the value of the system clock. I
think a WARNING needs to be written to the log file. (Maybe we can do
synchronization with a time server.)
3. What about the global snapshot xmin? In the pgpro version of the patch
we had a GlobalSnapshotMapXmin() routine to maintain a circular buffer of
oldestXmins for several seconds into the past. This buffer allows shifting
oldestXmin into the past when a backend imports a global transaction.
Otherwise, old tuple versions needed by this transaction can be recycled by
other processes (vacuum, HOT, etc.).
How do you implement protection from local pruning? I saw 
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
4. The current version of the patch is not applied clearly with current 
master.
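
The clamp-and-warn behaviour suggested in point 2 can be modelled in a few
lines of Python (this is a sketch of the suggested semantics, not the patch
code; CsnState and generate_csn are illustrative names):

```python
# If the clock-derived CSN falls behind last_max_csn (clock moved backwards),
# keep CSNs strictly monotonic and emit a warning once per desync episode
# instead of on every call, to avoid log spam.

class CsnState:
    def __init__(self):
        self.last_max_csn = 0
        self.warned = False

def generate_csn(state, clock_value, log):
    if clock_value <= state.last_max_csn:
        # System clock moved backwards (or stalled): fall back to a
        # monotonic counter and report it once.
        if not state.warned:
            log.append("WARNING: system clock moved backwards")
            state.warned = True
        csn = state.last_max_csn + 1
    else:
        csn = clock_value
        state.warned = False   # clock caught up; desync episode over
    state.last_max_csn = csn
    return csn

state, log = CsnState(), []
csns = [generate_csn(state, t, log) for t in [10, 12, 11, 11, 20]]
print(csns, log)  # [10, 12, 13, 14, 20] ['WARNING: system clock moved backwards']
```

Note that the two out-of-order clock readings produce a single warning, and
the sequence of returned CSNs stays strictly increasing.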


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Global snapshots

2020-06-19 Thread Andrey V. Lepikhov

On 6/19/20 11:48 AM, Amit Kapila wrote:

On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
 wrote:

On 09.06.2020 11:41, Fujii Masao wrote:

The patches seem not to be registered in CommitFest yet.
Are you planning to do that?

Not now. It is a sharding-related feature, and I'm not sure that this
approach is fully consistent with the current sharding direction.

Can you please explain in detail, why you think so?  There is no
commit message explaining what each patch does so it is difficult to
understand why you said so?
So far I have used this patch set to provide correct visibility when a
table with foreign partitions is accessed from many nodes in parallel. So I
looked at this patch set as a sharding-related feature, but [1] shows
another useful application.

The CSN-based approach has weak points such as:
1. Dependency on clock synchronization.
2. The need to guarantee monotonic increase of the CSN across instance
restarts, crashes, etc.
3. The need to delay advancing OldestXmin, because it can be needed for a
transaction snapshot on another node.
So I do not have full conviction that it will be better than a single
distributed transaction manager.

  Also, can you let us know if this

supports 2PC in some way and if so how is it different from what the
other thread on the same topic [1] is trying to achieve?
Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
2PC machinery. For now I would not judge which approach is better.

 Also, I

would like to know if the patch related to CSN based snapshot [2] is a
precursor for this, if not, then is it any way related to this patch
because I see the latest reply on that thread [2] which says it is an
infrastructure of sharding feature but I don't understand completely
whether these patches are related?

I need some time to study this patch. At first sight it is different.


Basically, there seem to be three threads, first, this one and then
[1] and [2] which seems to be doing the work for sharding feature but
there is no clear explanation anywhere if these are anyway related or
whether combining all these three we are aiming for a solution for
atomic commit and atomic visibility.

It can be useful to study all approaches.


I am not sure if you know answers to all these questions so I added
the people who seem to be working on the other two patches.  I am also
afraid that if there is any duplicate or conflicting work going on in
these threads so we should try to find that as well.

Ok



[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca



[1] 
https://www.postgresql.org/message-id/flat/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com




Re: Asynchronous Append on postgres_fdw nodes.

2020-06-17 Thread Andrey V. Lepikhov

On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:

They return 25056 rows, which is far more than 9741 rows. So remote
join won.

Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.


Thanks for your attention.
I see one slight flaw in this approach to asynchronous Append: AsyncAppend
works only for ForeignScan subplans. If we have a PartialAggregate, a Join,
or another more complicated subplan, we can't use the asynchronous
machinery.
This may lead to a situation where a small difference in a filter constant
causes a big difference in execution time.
I imagine an Append node that can switch the current subplan from time to
time, with all ForeignScan nodes of the overall plan added to one queue.
The scan buffer can be larger than a cursor fetch size, and each
IterateForeignScan() call can trigger an asynchronous scan of another
ForeignScan node if the buffer is not full.
But these are only thoughts, not a proposal. I have no questions about your
patch right now.
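
The subplan-switching idea above can be illustrated with a tiny Python
model (async_append and fetch_size are illustrative names; this is not how
the executor is structured, it only demonstrates the round-robin scheduling
over a shared queue of scans):

```python
# A single Append round-robins over several "foreign scan" cursors,
# switching subplans after each fetch-size slice instead of draining one
# ForeignScan completely before starting the next.
from collections import deque

def async_append(scans, fetch_size=2):
    """scans: list of iterators; interleave their rows fetch_size at a time."""
    queue = deque(iter(s) for s in scans)
    while queue:
        scan = queue.popleft()
        exhausted = False
        for _ in range(fetch_size):        # fill this scan's buffer slice
            try:
                yield next(scan)
            except StopIteration:
                exhausted = True
                break
        if not exhausted:
            queue.append(scan)             # switch to the next subplan

rows = list(async_append([iter("abcd"), iter("xy")], fetch_size=2))
print(rows)  # ['a', 'b', 'x', 'y', 'c', 'd']
```

All rows are produced, but no single scan monopolizes the Append node,
which is the property the queue-based sketch above is after.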


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-06-17 Thread Andrey V. Lepikhov

On 6/15/20 10:26 AM, Ashutosh Bapat wrote:

Thanks Andrey for the patch. I am glad that the patch has taken care
of some corner cases already but there exist still more.

COPY command constructed doesn't take care of dropped columns. There
is code in deparseAnalyzeSql which constructs list of columns for a
given foreign relation. 0002 patch attached here, moves that code to a
separate function and reuses it for COPY. If you find that code change
useful please include it in the main patch.


Thanks, I included it.


2. In the same case, if the foreign table declared locally didn't have
any non-dropped columns but the relation that it referred to on the
foreign server had some non-dropped columns, COPY command fails. I
added a test case for this in 0002 but haven't fixed it.


I fixed it.
This is a very special corner case. The problem was that COPY FROM does not
support semantics like "INSERT INTO .. DEFAULT VALUES". To simplify the
solution, I switched off bulk copying for this case.
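
The fallback can be sketched as a small Python decision function (the
function name, table name, and tuple layout are illustrative, not the
patch's code; it only models when bulk COPY is usable):

```python
# Bulk COPY is used only when the remote relation has at least one
# non-dropped column; a relation with no usable columns falls back to
# per-row "INSERT ... DEFAULT VALUES" semantics, which COPY FROM cannot
# express.

def choose_insert_path(columns):
    """columns: list of (name, is_dropped) for the foreign relation."""
    live = [name for name, dropped in columns if not dropped]
    if live:
        return "COPY", f"COPY tbl ({', '.join(live)}) FROM STDIN"
    return "INSERT", "INSERT INTO tbl DEFAULT VALUES"

path, sql = choose_insert_path([("a", True), ("b", False)])
print(path, "->", sql)            # COPY -> COPY tbl (b) FROM STDIN
print(choose_insert_path([])[0])  # INSERT
```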


> I think this work is useful. Please add it to the next commitfest so
> that it's tracked.
Ok.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
>From abe4db0a5391735f7663daac81df579644a70fc3 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 17 Jun 2020 11:07:54 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-inserts are
possible and the foreign table has a non-zero number of columns.
---
 contrib/postgres_fdw/deparse.c|  60 -
 .../postgres_fdw/expected/postgres_fdw.out|  33 ++-
 contrib/postgres_fdw/postgres_fdw.c   |  98 
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  28 +++
 src/backend/commands/copy.c   | 223 --
 src/include/commands/copy.h   |   5 +
 src/include/foreign/fdwapi.h  |   9 +
 8 files changed, 374 insertions(+), 83 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parentheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2069,10 +2109,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 	List	   *options;
 	ListCell   *lc;
 	bool		first = true;
+	List	   *retrieved_attrs = NIL;
 
-	*retrieved_attrs = NIL;
-
-	appendStringInfoString(buf, "SELECT ");
 	for (i = 0; i < tupdesc->natts; i++)
 	{
 		/* Ignore dropped columns. */
@@ -2081,6 +2119,9 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 
 		if (!first)
 			appendStringInfoString(buf, ", ");
+		else if (enclose_in_parens)
+			appendStringInfoChar(buf, '(');
+
 		first = false;
 
 		/* Use attribute name or column_name option. */
@@ -2100,18 +2141,13 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
 
 		appendStringInfoString(buf, quote_identifier(colname));
 
-		*retrieved_attrs = lappend_int(*retrieved_attrs, i + 1);
+		retrieved_attrs = 

Re: Asynchronous Append on postgres_fdw nodes.

2020-06-15 Thread Andrey V. Lepikhov

On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:

Thanks for testing, but..

At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" 
 wrote in

The patch has a problem with partitionwise aggregates.

Asynchronous append does not allow the planner to use partial
aggregates. An example is in the attachment. I can't understand why:
the costs of the partitionwise join are lower.
The initial script and the EXPLAIN outputs of the query with and without
the patch are in the attachment.


I had more or less the same plan with the second one without the patch
(that is, vanilla master/HEAD, but used merge joins instead).

I'm not sure what prevented join pushdown, but the difference between
the two is whether the each partitionwise join is pushed down to
remote or not, That is hardly seems related to the async execution
patch.

Could you tell me how did you get the first plan?


1. Use a clean current vanilla master.

2. Start two instances with the script 'frgn2n.sh' from attachment.
There are I set GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true

3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;

That's all.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com


frgn2n.sh
Description: application/shellscript


Re: Asynchronous Append on postgres_fdw nodes.

2020-06-14 Thread Andrey V. Lepikhov

The patch has a problem with partitionwise aggregates.

Asynchronous append does not allow the planner to use partial aggregates.
An example is in the attachment. I can't understand why: the costs of the
partitionwise join are lower.
The initial script and the EXPLAIN outputs of the query with and without
the patch are in the attachment.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com



frgn2n.sh
Description: application/shellscript
Execution without asynchronous append
=

explain analyze SELECT sum(parts.b) FROM parts, second WHERE parts.a = second.a 
AND second.b < 100;
   QUERY PLAN   
 
-
 Finalize Aggregate  (cost=2144.36..2144.37 rows=1 width=8) (actual 
time=25.821..25.821 rows=1 loops=1)
   ->  Append  (cost=463.39..2144.35 rows=4 width=8) (actual time=9.495..25.816 
rows=4 loops=1)
 ->  Partial Aggregate  (cost=463.39..463.40 rows=1 width=8) (actual 
time=9.495..9.495 rows=1 loops=1)
   ->  Hash Join  (cost=5.58..463.33 rows=27 width=4) (actual 
time=0.109..9.486 rows=27 loops=1)
 Hash Cond: (parts.a = second.a)
 ->  Seq Scan on part_0 parts  (cost=0.00..363.26 
rows=25126 width=8) (actual time=0.018..5.901 rows=25126 loops=1)
 ->  Hash  (cost=5.24..5.24 rows=27 width=4) (actual 
time=0.084..0.084 rows=27 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 9kB
   ->  Seq Scan on second_0 second  (cost=0.00..5.24 
rows=27 width=4) (actual time=0.014..0.071 rows=27 loops=1)
 Filter: (b < 100)
 Rows Removed by Filter: 232
 ->  Partial Aggregate  (cost=560.68..560.69 rows=1 width=8) (actual 
time=6.017..6.017 rows=1 loops=1)
   ->  Foreign Scan  (cost=105.29..560.61 rows=29 width=4) (actual 
time=6.008..6.011 rows=30 loops=1)
 Relations: (part_1 parts_1) INNER JOIN (second_1)
 ->  Partial Aggregate  (cost=560.88..560.89 rows=1 width=8) (actual 
time=5.920..5.920 rows=1 loops=1)
   ->  Foreign Scan  (cost=105.75..560.82 rows=24 width=4) (actual 
time=5.908..5.912 rows=25 loops=1)
 Relations: (part_2 parts_2) INNER JOIN (second_2)
 ->  Partial Aggregate  (cost=559.33..559.34 rows=1 width=8) (actual 
time=4.380..4.381 rows=1 loops=1)
   ->  Foreign Scan  (cost=105.09..559.29 rows=16 width=4) (actual 
time=4.371..4.373 rows=17 loops=1)
 Relations: (part_3 parts_3) INNER JOIN (second_3)
 Planning Time: 6.734 ms
 Execution Time: 26.079 ms

Execution with asynchronous append
==

explain analyze SELECT sum(parts.b) FROM parts, second WHERE parts.a = second.a 
AND second.b < 100;
  QUERY 
PLAN   
---
 Finalize Aggregate  (cost=5758.83..5758.84 rows=1 width=8) (actual 
time=184.849..184.849 rows=1 loops=1)
   ->  Append  (cost=727.82..5758.82 rows=4 width=8) (actual 
time=11.735..184.843 rows=4 loops=1)
 ->  Partial Aggregate  (cost=727.82..727.83 rows=1 width=8) (actual 
time=11.735..11.735 rows=1 loops=1)
   ->  Hash Join  (cost=677.34..725.94 rows=753 width=4) (actual 
time=11.693..11.729 rows=27 loops=1)
 Hash Cond: (second.a = parts.a)
 ->  Seq Scan on second_0 second  (cost=0.00..38.25 
rows=753 width=4) (actual time=0.024..0.052 rows=27 loops=1)
   Filter: (b < 100)
   Rows Removed by Filter: 232
 ->  Hash  (cost=363.26..363.26 rows=25126 width=8) (actual 
time=11.644..11.644 rows=25126 loops=1)
   Buckets: 32768  Batches: 1  Memory Usage: 1238kB
   ->  Seq Scan on part_0 parts  (cost=0.00..363.26 
rows=25126 width=8) (actual time=0.013..5.595 rows=25126 loops=1)
 ->  Partial Aggregate  (cost=1676.97..1676.98 rows=1 width=8) (actual 
time=58.958..58.958 rows=1 loops=1)
   ->  Hash Join  (cost=1377.15..1440.85 rows=94449 width=4) 
(actual time=58.922..58.948 rows=30 loops=1)
 Hash Cond: (second_1.a = parts_1.a)
 ->  Foreign Scan on second_1  (cost=100.00..153.31 
rows=753 width=4) (actual time=0.366..0.374 rows=30 loops=1)
 ->  Hash  (cost=963.58..963.58 rows=25086 width=8) (actual 
time=58.534..58.534 rows=24978 loops=1)
 

Re: Asynchronous Append on postgres_fdw nodes.

2020-06-11 Thread Andrey V. Lepikhov

On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:

Hello, Andrey.

At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov  
wrote in

On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
2. Total cost of an Append node is a sum of the subplans. Maybe in the
case of asynchronous Append we need to use some reduction factor?


Yes.  For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as far as any sync subpaths exist.  If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I mean that you could possibly change the computation of the total cost of
the asynchronous Append node. It may affect the planner's choice between a
ForeignScan (followed by executing the JOIN locally) and partitionwise join
strategies.


Have you also considered the possibility of a dynamic choice between
synchronous and asynchronous Append (during optimization)? This may be
useful for a query with a LIMIT clause.


--
Andrey Lepikhov
Postgres Professional




Re: Global snapshots

2020-06-09 Thread Andrey V. Lepikhov


On 09.06.2020 11:41, Fujii Masao wrote:



On 2020/05/12 19:24, Andrey Lepikhov wrote:

Rebased onto current master (fb544735f1).


Thanks for the patches!

These patches are no longer applied cleanly and caused the compilation 
failure.

So could you rebase and update them?

Rebased onto 57cb806308 (see attachment).


The patches seem not to be registered in CommitFest yet.
Are you planning to do that?
Not now. It is a sharding-related feature, and I'm not sure that this
approach is fully consistent with the current sharding direction.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com

>From cd6a8585f9814b7e465abb2649ac84e80e7c726b Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Tue, 9 Jun 2020 14:55:38 +0500
Subject: [PATCH 1/3] GlobalCSNLog-SLRU

---
 src/backend/access/transam/Makefile |   1 +
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/tcop/postgres.c |   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 15 files changed, 510 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..60ff8b141e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	global_csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/global_csn_log.c b/src/backend/access/transam/global_csn_log.c
new file mode 100644
index 00..6f7fded350
--- /dev/null
+++ b/src/backend/access/transam/global_csn_log.c
@@ -0,0 +1,439 @@
+/*-
+ *
+ * global_csn_log.c
+ *		Track global commit sequence numbers of finished transactions
+ *
+ * Implementation of cross-node transaction isolation relies on commit sequence
+ * number (CSN) based visibility rules.  This module provides SLRU to store
+ * CSN for each transaction.  This mapping needs to be kept only for xids
+ * greater than oldestXid, but that can require arbitrarily large amounts of
+ * memory in case of long-lived transactions.  Because of the same lifetime
+ * and persistence requirements, this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_csn_log.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool track_global_snapshots;
+
+/*
+ * Defines for GlobalCSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0x,
+ * GlobalCSNLog page numbering also wraps around at
+ * 0x/GLOBAL_CSN_LOG_XACTS_PER_PAGE, and GlobalCSNLog segment numbering at
+ * 0x/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateGlobalCSNLog (see GlobalCSNLogPagePrecedes).
+ */
+
+/* We store the commit GlobalCSN for each xid */
+#define GCSNLOG_XACTS_PER_PAGE (BLCKSZ / sizeof(GlobalCSN))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData GlobalCSNLogCtlData;
#define GlobalCsnlogCtl (&GlobalCSNLogCtlData)
+
+static int	ZeroGlobalCSNLogPage(int pageno);
+static bool GlobalCSNLogPagePrecedes(int page1, int page2);
+static void GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+	  TransactionId *subxids,
+	  GlobalCSN csn, int pageno);
+static void GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn,
+	  int slotno);
+
+/*
+ * GlobalCSNLogSetCSN
+ *
+ * Record GlobalCSN of transaction and its subtransaction 

Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record

2018-07-12 Thread Andrey V. Lepikhov




On 10.07.2018 22:26, Fujii Masao wrote:

On Tue, Jul 10, 2018 at 6:41 PM, Andrey V. Lepikhov
 wrote:



On 10.07.2018 06:45, Andres Freund wrote:


Hi,

On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:


This functionality is needed in practice when we have to determine a
recovery time of specific backup.



What do you mean by "recovery time of specific backup"?



Recovery time is the point in time at which the backup of the PostgreSQL
database instance was taken.
When performing database recovery, we want to know what point in time the
restored database will correspond to.
This functionality improves the usability of the pg_basebackup and
pg_probackup utilities.


Why don't you use a backup history file for that purpose?


The timestamp in a backup history file does not correspond to any WAL
record and cannot be tied exactly to the time of the backup.
In my opinion, keeping the timestamp in XLOG_BACKUP_END is a more reliable,
safe, and easy way to recover a database to a specific time.




Regards,



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record

2018-07-10 Thread Andrey V. Lepikhov




On 10.07.2018 06:45, Andres Freund wrote:

Hi,

On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:

This functionality is needed in practice when we have to determine a
recovery time of specific backup.


What do you mean by "recovery time of specific backup"?



Recovery time is the point in time at which the backup of the PostgreSQL
database instance was taken.
When performing database recovery, we want to know what point in time the
restored database will correspond to.
This functionality improves the usability of the pg_basebackup and
pg_probackup utilities.





This code is developed to stay compatible with WAL segments that do not
have a timestamp in the XLOG_BACKUP_END record.


I don't understand what "compatibility with WAL segments" could mean?
And how are WAL segments related to "XLOG_BACKUP_END record", except as
to how every WAL record is related? Are you thinking about the switch
records?



In this case, 'compatibility' means that the patched Postgres code
(pg_basebackup, pg_probackup, pg_waldump, etc.) will correctly read WAL
segments that do not contain a timestamp field in the XLOG_BACKUP_END
record.



Greetings,

Andres Freund



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



[PATCH] Timestamp for a XLOG_BACKUP_END WAL-record

2018-07-09 Thread Andrey V. Lepikhov

Hi,
I prepared a patch that adds a timestamp to the XLOG_BACKUP_END WAL
record. This functionality is needed in practice when we have to determine
the recovery time of a specific backup.
The code is developed to stay compatible with WAL segments that do not
have a timestamp in the XLOG_BACKUP_END record.
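
The length-based compatibility rule (visible in the diff as the
`XLogRecGetDataLen(record) >= sizeof(xl_backup_end)` check) can be modelled
in Python; the 8-byte/8-byte layout below mirrors xl_backup_end for
illustration only and is not the exact on-disk encoding:

```python
# A reader decides whether an XLOG_BACKUP_END payload carries a timestamp
# purely by its length, so old records (startpoint only) still parse.
import struct

def parse_backup_end(payload: bytes):
    if len(payload) >= 16:                             # new format
        startpoint, timestamp = struct.unpack("<QQ", payload[:16])
        return startpoint, timestamp
    (startpoint,) = struct.unpack("<Q", payload[:8])   # old format
    return startpoint, None

new = struct.pack("<QQ", 0xDEADBEEF, 1531130000)
old = struct.pack("<Q", 0xDEADBEEF)
print(parse_backup_end(new))  # (3735928559, 1531130000)
print(parse_backup_end(old))  # (3735928559, None)
```

This is why no WAL version bump is needed: old and new records are
distinguished by record length alone.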


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
>From 8852a64156e7726bae11c1904b142c9b157cf654 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Mon, 9 Jul 2018 10:57:10 +0500
Subject: [PATCH] BACKUP_END timestamp addition

---
 src/backend/access/rmgrdesc/xlogdesc.c | 14 +-
 src/backend/access/transam/xlog.c  |  7 +--
 src/include/access/xlog_internal.h |  7 +++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 00741c7..5a0d61a 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -87,7 +87,19 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		XLogRecPtr	startpoint;
 
 		memcpy(&startpoint, rec, sizeof(XLogRecPtr));
-		appendStringInfo(buf, "%X/%X",
+		/* Check for the format of WAL-record with timestamp */
+		if (XLogRecGetDataLen(record) >= sizeof(xl_backup_end))
+		{
+			TimestampTz	timestamp;
+
+			memcpy(&timestamp, &((xl_backup_end *)rec)->timestamp, sizeof(TimestampTz));
+			appendStringInfo(buf, "%X/%X; timestamp: %s",
+		 (uint32) (startpoint >> 32), (uint32) startpoint,
+		 timestamptz_to_str(timestamp));
+		}
+		else
+			/* The WAL record does not have a timestamp */
+			appendStringInfo(buf, "%X/%X; ",
 		 (uint32) (startpoint >> 32), (uint32) startpoint);
 	}
 	else if (info == XLOG_PARAMETER_CHANGE)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 44017d3..12a5eb6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9950,7 +9950,7 @@ xlog_redo(XLogReaderState *record)
 	{
 		XLogRecPtr	startpoint;
 
-		memcpy(&startpoint, XLogRecGetData(record), sizeof(startpoint));
+		memcpy(&startpoint, &((xl_backup_end *)XLogRecGetData(record))->startpoint, sizeof(startpoint));
 
 		if (ControlFile->backupStartPoint == startpoint)
 		{
@@ -11069,11 +11069,14 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	}
 	else
 	{
+		xl_backup_end	xlrec;
 		/*
 		 * Write the backup-end xlog record
 		 */
 		XLogBeginInsert();
-		XLogRegisterData((char *) (&startpoint), sizeof(startpoint));
+		xlrec.startpoint = startpoint;
+		xlrec.timestamp = GetCurrentTimestamp();
+		XLogRegisterData((char *) (&xlrec), sizeof(xl_backup_end));
 		stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);
 		stoptli = ThisTimeLineID;
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 7c76683..8c5d851 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -233,6 +233,13 @@ typedef struct xl_parameter_change
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
+/* BACKUP_END WAL record main data structure */
+typedef struct xl_backup_end
+{
+	XLogRecPtr	startpoint;
+	TimestampTz	timestamp;
+} xl_backup_end;
+
 /* logs restore point */
 typedef struct xl_restore_point
 {
-- 
2.7.4



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-07-03 Thread Andrey V. Lepikhov



On 03.07.2018 00:40, Peter Geoghegan wrote:

On Mon, Jul 2, 2018 at 9:28 AM, Peter Geoghegan  wrote:

Execution time of last "VACUUM test;" command on my notebook was:

with bulk deletion: 1.6 s;
with Quick Vacuum Strategy: 5.2 s;
with Quick Vacuum Strategy & TID sorting: 0.6 s.


I'm glad that you looked into this. You could make this faster still,
by actually passing the lowest heap TID in the list of TIDs to kill to
_bt_search() and _bt_binsrch(). You might have to work through several
extra B-Tree leaf pages per bttargetdelete() call without this (you'll
move right multiple times within bttargetdelete()).


I should add: I think that this doesn't matter so much in your
original test case with v1 of my patch, because you're naturally
accessing the index tuples in almost the most efficient way already,
since you VACUUM works its way from the start of the table until the
end of the table. You'll definitely need to pass a heap TID to
routines like _bt_search() once you start using my v2, though, since
that puts the heap TIDs in DESC sort order. Otherwise, it'll be almost
as slow as the plain "Quick Vacuum Strategy" case was.

In general, the big idea with my patch is that heap TID is just
another attribute. I am not "cheating" in any way; if it's not
possible to descend the tree and arrive at the right leaf page without
looking through several leaf pages, then my patch is broken.

You might also use _bt_moveright() with my patch. That way, you can
quickly detect that you need to move right immediately, without going
through all the items on the page. This should only be an issue in the
event of a concurrent page split, though. In my patch, I use
_bt_moveright() in a special way for unique indexes: I need to start
at the first leaf page a duplicate could be on for duplicate checking,
but once that's over I want to "jump immediately" to the leaf page the
index tuple actually needs to be inserted on. That's when
_bt_moveright() is called. (Actually, that looks like it breaks unique
index enforcement in the case of my patch, which I need to fix, but
you might still do this.)



Done.
The attachment contains an update for use with v2 of the 'Ensure nbtree
leaf tuple keys are always unique' patch.


Apply order:
1. 0001-Retail-IndexTuple-Deletion-Access-Method.patch - from previous email
2. 0002-Quick-vacuum-strategy.patch - from previous email
3. v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch - from [1]
4. 0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf.patch

[1] 
https://www.postgresql.org/message-id/CAH2-Wzm6D%3DKnV%2BP8bZE-ZtP4e%2BW64HtVTdOenqd1d7HjJL3xZQ%40mail.gmail.com
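
The benefit Peter describes above — treating heap TID as a trailing key
attribute so the descent lands directly on the target instead of scanning
right through duplicates — can be illustrated with a small sorted-list
model (the leaf list and function names are illustrative, not nbtree code):

```python
# With heap TID as part of the key, a deletion binary-searches straight to
# (key, tid); a key-only descent lands on the first duplicate and must then
# scan linearly through every duplicate (possibly many leaf pages).
from bisect import bisect_left

leaf = [(5, t) for t in range(1000)] + [(6, 0)]   # many duplicates of key 5

def find_key_only(items, key):
    # Lands on the first duplicate; a real scan would then walk right.
    return bisect_left(items, (key,))

def find_with_tid(items, key, tid):
    # Lands directly on the target entry.
    return bisect_left(items, (key, tid))

pos = find_with_tid(leaf, 5, 997)
print(pos, leaf[pos])                    # 997 (5, 997)
print(pos - find_key_only(leaf, 5))      # 997 entries skipped vs. key-only
```

In the real index the skipped entries can span several leaf pages, which is
exactly the extra work bttargetdelete() avoids by passing the TID down.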


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
>From 1c8569abe9479e547911ec3079633f79056eff96 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 3 Jul 2018 16:54:46 +0500
Subject: [PATCH 4/4] Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf

---
 src/backend/access/nbtree/nbtree.c | 75 +-
 src/backend/commands/vacuumlazy.c  |  8 ++--
 src/include/access/genam.h |  2 +-
 3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c54aeac..7c617e9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -887,15 +887,19 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	return stats;
 }
 
-static int
-tid_list_search(ItemPointer tid, ItemPointer tid_list, int ntid)
-{
-	for (int i = 0; i < ntid; i++)
-		if (ItemPointerEquals(tid, &(tid_list[i])))
-			return i;
-	return -1;
-}
-
+/*
+ * Deletion of index entries pointing to heap tuples.
+ *
+ * Constraints:
+ * 1. The TID list info->dead_tuples is arranged in ASC order.
+ * 2. Logical duplicates of index tuples are stored in DESC order.
+ *
+ * The function generates an insertion scan key and descends the btree to the
+ * first index tuple that satisfies the scan key and the last TID in the
+ * info->dead_tuples list, then deletes every scanned entry matching the TID list.
+ *
+ * Result: a palloc'd struct containing statistical info.
+ */
 IndexTargetDeleteResult*
 bttargetdelete(IndexTargetDeleteInfo *info,
 			   IndexTargetDeleteResult *stats,
@@ -914,20 +918,21 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 	intndeletable = 0;
 	OffsetNumber	deletable[MaxOffsetNumber];
 	IndexTuple		itup;
+	intpos = info->last_dead_tuple;
 
 	if (stats == NULL)
 		stats = (IndexTargetDeleteResult *) palloc0(sizeof(IndexTargetDeleteResult));
 
+	/* Assemble scankey */
 	itup = index_form_tuple(RelationGetDescr(irel), values, isnull);
 	skey = _bt_mkscankey(irel, itup);
 
 	/* Descend the tree and position ourselves on the target leaf page. */
-	stack = _bt_search(irel, keysCount, skey, false, , BT_READ, NULL);
-	_bt_freestack(stack);
+	stack = _bt_search(irel, keysC

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-07-02 Thread Andrey V. Lepikhov



On 29.06.2018 14:07, Юрий Соколов wrote:
Thu, 28 Jun 2018, 8:37, Andrey V. Lepikhov <a.lepik...@postgrespro.ru>:




On 28.06.2018 05:00, Peter Geoghegan wrote:
 > On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov
 > <a.lepik...@postgrespro.ru> wrote:
 >> I still believe that the patch for physical TID ordering in btree:
 >> 1) has its own value, not only for target deletion,
 >> 2) will require only a few local changes in my code,
 >> and this patches can be developed independently.
 >
 > I want to be clear on something now: I just don't think that this
 > patch has any chance of getting committed without something like my
 > own patch to go with it. The worst case for your patch without that
 > component is completely terrible. It's not really important for
you to
 > actually formally make it part of your patch, so I'm not going to
 > insist on that or anything, but the reality is that my patch does not
 > have independent value -- and neither does yours.
 >
As I wrote before in the last email, I will integrate TID sorting to my
patches right now. Please, give me access to the last version of your
code, if it possible.
You can track the progress at https://github.com/danolivo/postgres git
repository


Peter is absolutely right, imho: tie-breaking by TID within the index
ordering is essential for reliable performance of this patch.



In the new version, patch [1] is used in cooperation with the 'retail 
indextuple deletion' and 'quick vacuum strategy' patches (see 
'0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf-.patch').


To demonstrate the potential benefits, I did a test:

CREATE TABLE test (id serial primary key, value integer, factor integer);
INSERT INTO test (value, factor) SELECT random()*1e5, random()*1e3 FROM 
generate_series(1, 1e7);

CREATE INDEX ON test(value);
VACUUM;
DELETE FROM test WHERE (factor = 1);
VACUUM test;

Execution time of last "VACUUM test;" command on my notebook was:

with bulk deletion: 1.6 s;
with Quick Vacuum Strategy: 5.2 s;
with Quick Vacuum Strategy & TID sorting: 0.6 s.

[1] 
https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com



With regards,
Sokolov Yura.


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company

>From 7f2384691de592bf4d11cc8b4d75eca5500cd500 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Mon, 2 Jul 2018 16:13:08 +0500
Subject: [PATCH 1/4] Retail-IndexTuple-Deletion-Access-Method

---
 contrib/bloom/blutils.c  |   1 +
 src/backend/access/brin/brin.c   |   1 +
 src/backend/access/gin/ginutil.c |   1 +
 src/backend/access/gist/gist.c   |   1 +
 src/backend/access/hash/hash.c   |   1 +
 src/backend/access/index/indexam.c   |  15 
 src/backend/access/nbtree/nbtree.c   | 138 +++
 src/backend/access/spgist/spgutils.c |   1 +
 src/include/access/amapi.h   |   6 ++
 src/include/access/genam.h   |  18 +
 src/include/access/nbtree.h  |   4 +
 11 files changed, 187 insertions(+)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-28 Thread Andrey V. Lepikhov




On 29.06.2018 10:00, Kuntal Ghosh wrote:

On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov
 wrote:

I prepare third version of the patches. Summary:
1. Mask DEAD tuples at a page during consistency checking (See comments for
the mask_dead_tuples() function).


- ItemIdSetDead(lp);
+ if (target_index_deletion_factor > 0)
+ ItemIdMarkDead(lp);
+ else
+ ItemIdSetDead(lp);
IIUC, you want to hold the storage of DEAD tuples to form the ScanKey
which is required for the index scan in the second phase of quick
vacuum strategy. To achieve that, you've only marked the tuple as DEAD
during pruning. Hence, PageRepairFragmentation cannot claim the space
for DEAD tuples since they still have storage associated with them.
But, you've WAL-logged it as DEAD tuples having no storage. So, when
the WAL record is replayed in standby(or crash recovery), the tuples
will be marked as DEAD having no storage and their space may be
reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte
comparison with wal_consistency tool, it may fail even for normal
tuple as well. Please let me know if you feel the same way.


Thanks for your analysis.
In this development version of the patch I expect the same prune() 
strategy on master and standby (i.e. target_index_deletion_factor is 
equal for both). In this case the storage of a DEAD tuple is held in the 
same way during replay or recovery.
At some future step of development I plan to use a more flexible prune() 
strategy. That will require appending an 'isDeadStorageHold' field to 
the WAL record.


--
Regards,
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-27 Thread Andrey V. Lepikhov




On 28.06.2018 05:00, Peter Geoghegan wrote:

On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov
 wrote:

I still believe that the patch for physical TID ordering in btree:
1) has its own value, not only for target deletion,
2) will require only a few local changes in my code,
and this patches can be developed independently.


I want to be clear on something now: I just don't think that this
patch has any chance of getting committed without something like my
own patch to go with it. The worst case for your patch without that
component is completely terrible. It's not really important for you to
actually formally make it part of your patch, so I'm not going to
insist on that or anything, but the reality is that my patch does not
have independent value -- and neither does yours.

As I wrote in the last email, I will integrate TID sorting into my 
patches right now. Please give me access to the latest version of your 
code, if possible.
You can track the progress at https://github.com/danolivo/postgres git 
repository



I'm sorry if that sounds harsh, but this is a difficult, complicated
project. It's better to be clear about this stuff earlier on.


Ok. It is clear now.



I prepare third version of the patches. Summary:
1. Mask DEAD tuples at a page during consistency checking (See comments for
the mask_dead_tuples() function).
2. Still not using physical TID ordering.
3. Index cleanup() after each quick_vacuum_index() call was excluded.


How does this patch affect opportunistic pruning in particular? Not
being able to immediately reclaim tuple space in the event of a dead
hot chain that is marked LP_DEAD could hurt quite a lot, including
with very common workloads, such as pgbench (pgbench accounts tuples
are quite a lot wider than a raw item pointer, and opportunistic
pruning is much more important than vacuuming). Is that going to be
acceptable, do you think? Have you measured the effects? Can we do
something about it, like make pruning behave differently when it's
opportunistic?


This is the most "tasty" part of the work. I plan some experimental 
research on it at the end of patch development (including TID sorting), 
plus parameterized opportunistic pruning for the flexibility of 
switching between strategies on the fly.
My current opinion on this question: we can develop a flexible strategy 
based on parameters such as free space on a block, the frequency of 
UPDATE/DELETE queries, and the percentage of DEAD tuples in a 
block/relation.
A background cleaner, launched by heap_page_prune(), gives us the 
opportunity to use a different approach for each block or relation.
The DB admin should be able to configure this technique anywhere from 
fully non-storage DEAD tuples plus vacuum to all-storage DEAD tuples 
plus target deletion.




Are you aware of the difference between _bt_delitems_delete() and
_bt_delitems_vacuum(), and the considerations for hot standby? I think
that that's another TODO list item for this patch.



Ok

--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-26 Thread Andrey V. Lepikhov




On 26.06.2018 15:31, Masahiko Sawada wrote:

On Fri, Jun 22, 2018 at 8:24 PM, Andrey V. Lepikhov
 wrote:

Hi,
According to your feedback, i develop second version of the patch.
In this version:
1. High-level functions index_beginscan(), index_rescan() not used. Tree
descent made by _bt_search(). _bt_binsrch() used for positioning on the
page.
2. TID list introduced in amtargetdelete() interface. Now only one tree
descent needed for deletion all tid's from the list with equal scan key
value - logical duplicates deletion problem.
3. Only one WAL record for index tuple deletion per leaf page per
amtargetdelete() call.
4. VACUUM can sort TID list preliminary for more quick search of duplicates.

Background worker will come later.




Thank you for updating patches! Here is some comments for the latest patch.

+static void
+quick_vacuum_index(Relation irel, Relation hrel,
+  IndexBulkDeleteResult **overall_stats,
+  LVRelStats *vacrelstats)
+{
(snip)
+   /*
+* Collect statistical info
+*/
+   lazy_cleanup_index(irel, *overall_stats, vacrelstats);
+}

I think that we should not call lazy_cleanup_index at the end of
quick_vacuum_index because we call it multiple times during a lazy
vacuum and index statistics can be changed during vacuum. We already
call lazy_cleanup_index at the end of lazy_scan_heap.


Ok


bttargetdelete doesn't delete btree pages even if pages become empty.
I think we should do that. Otherwise empty page never be recycled. But
please note that if we delete btree pages during bttargetdelete,
recyclable pages might not be recycled. That is, if we choose the
target deletion method every time then the deleted-but-not-recycled
pages could never be touched, unless reaching
vacuum_cleanup_index_scale_factor. So I think we need to either run
bulk-deletion method or do cleanup index before btpo.xact wraparound.

+   ivinfo.indexRelation = irel;
+   ivinfo.heapRelation = hrel;
+   qsort((void *)vacrelstats->dead_tuples,
vacrelstats->num_dead_tuples, sizeof(ItemPointerData),
tid_comparator);

Ok. I think the caller of bttargetdelete() must decide when to perform 
index cleanup.



I think the sorting vacrelstats->dead_tuples is not necessary because
garbage TIDs  are stored in a sorted order.

Sorting was introduced with a background worker and more flexible 
cleaning strategies in mind, not only the full tuple-by-tuple relation 
and block scan.
The caller of bttargetdelete() can set info->isSorted to skip the 
sorting operation.



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-25 Thread Andrey V. Lepikhov




On 23.06.2018 01:14, Peter Geoghegan wrote:

On Fri, Jun 22, 2018 at 12:43 PM, Peter Geoghegan  wrote:

On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov
 wrote:

According to your feedback, i develop second version of the patch.
In this version:
1. High-level functions index_beginscan(), index_rescan() not used. Tree
descent made by _bt_search(). _bt_binsrch() used for positioning on the
page.
2. TID list introduced in amtargetdelete() interface. Now only one tree
descent needed for deletion all tid's from the list with equal scan key
value - logical duplicates deletion problem.
3. Only one WAL record for index tuple deletion per leaf page per
amtargetdelete() call.


Cool.

What is this "race" code about?


I introduced this check with other vacuum workers in mind, which can 
clean a relation concurrently. Maybe it is redundant.



I noticed another bug in your patch, when running a
"wal_consistency_checking=all" smoke test. I do this simple, generic
test for anything that touches WAL-logging, actually -- it's a good
practice to adopt.

I enable "wal_consistency_checking=all" on the installation, create a
streaming replica with pg_basebackup (which also has
"wal_consistency_checking=all"), and then run "make installcheck"
against the primary. Here is what I see on the standby when I do this
with v2 of your patch applied:

9524/2018-06-22 13:03:12 PDT LOG:  entering standby mode
9524/2018-06-22 13:03:12 PDT LOG:  consistent recovery state reached
at 0/3D0
9524/2018-06-22 13:03:12 PDT LOG:  invalid record length at 0/3D0:
wanted 24, got 0
9523/2018-06-22 13:03:12 PDT LOG:  database system is ready to accept
read only connections
9528/2018-06-22 13:03:12 PDT LOG:  started streaming WAL from primary
at 0/300 on timeline 1
9524/2018-06-22 13:03:12 PDT LOG:  redo starts at 0/3D0
9524/2018-06-22 13:03:32 PDT FATAL:  inconsistent page found, rel
1663/16384/1259, forknum 0, blkno 0
9524/2018-06-22 13:03:32 PDT CONTEXT:  WAL redo at 0/3360B00 for
Heap2/CLEAN: remxid 599
9523/2018-06-22 13:03:32 PDT LOG:  startup process (PID 9524) exited
with exit code 1
9523/2018-06-22 13:03:32 PDT LOG:  terminating any other active server processes
9523/2018-06-22 13:03:32 PDT LOG:  database system is shut down

I haven't investigated this at all, but I assume that the problem is a
simple oversight. The new ItemIdSetDeadRedirect() concept that you've
introduced probably necessitates changes in both the WAL logging
routines and the redo/recovery routines. You need to go make those
changes. (By the way, I don't think you should be using the constant
"3" with the ItemIdIsDeadRedirection() macro definition.)

Let me know if you get stuck on this, or need more direction.

I investigated the bug from the simple smoke test. You're right: 
manipulating a line pointer in heap_page_prune() without reflecting it 
in the WAL record is not a good idea.
But this consistency problem arises even on a clean PostgreSQL 
installation (without my patch) with the ItemIdSetDead() -> 
ItemIdMarkDead() replacement.
A byte-by-byte comparison of the master and replayed pages shows only a 
2-byte difference in the tuple storage part of the page.

I'm not stuck on it yet, but good ideas are welcome.

--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-22 Thread Andrey V. Lepikhov

Hi,
According to your feedback, I developed a second version of the patch.
In this version:
1. The high-level functions index_beginscan() and index_rescan() are not 
used. Tree descent is made by _bt_search(); _bt_binsrch() is used for 
positioning on the page.
2. A TID list was introduced in the amtargetdelete() interface. Now only 
one tree descent is needed to delete all TIDs from the list with an 
equal scan key value - the logical-duplicates deletion problem.
3. Only one WAL record is written for index tuple deletion per leaf page 
per amtargetdelete() call.

4. VACUUM can sort the TID list beforehand for a quicker duplicate search.

Background worker will come later.

On 19.06.2018 22:38, Peter Geoghegan wrote:

On Tue, Jun 19, 2018 at 2:33 AM, Masahiko Sawada  wrote:

I think that we do the partial lazy vacuum using visibility map even
now. That does heap pruning, index tuple killing but doesn't advance
relfrozenxid.


Right, that's what I was thinking. Opportunistic HOT pruning isn't
like vacuuming because it doesn't touch indexes. This patch adds an
alternative strategy for conventional lazy vacuum that is also able to
run a page at a time if needed. Perhaps page-at-a-time operation could
later be used for doing something that is opportunistic in the same
way that pruning is opportunistic, but it's too early to worry about
that.


Since this patch adds an ability to delete small amount
of index tuples quickly, what I'd like to do with this patch is to
invoke autovacuum more frequently, and do the target index deletion or
the index bulk-deletion depending on amount of garbage and index size
etc. That is, it might be better if lazy vacuum scans heap in ordinary
way and then plans and decides a method of index deletion based on
costs similar to what query planning does.


That seems to be what Andrey wants to do, though right now the
prototype patch actually just always uses its alternative strategy
while doing any kind of lazy vacuuming (some simple costing code is
commented out right now). It shouldn't be too hard to add some costing
to it. Once we do that, and once we polish the patch some more, we can
do performance testing. Maybe that alone will be enough to make the
patch worth committing; "opportunistic microvacuuming" can come later,
if at all.



--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -80,6 +80,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df3..5fb32d6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -76,6 +76,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc9..9ebeb78 100644
--- 

Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-19 Thread Andrey V. Lepikhov




On 19.06.2018 04:05, Peter Geoghegan wrote:

On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan  wrote:

On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov
 wrote:

Patch '0001-retail-indextuple-deletion' introduce new function
amtargetdelete() in access method interface. Patch
'0002-quick-vacuum-strategy' implements this function for an alternative
strategy of lazy index vacuum, called 'Quick Vacuum'.


My compiler shows the following warnings:


Some real feedback:

What we probably want to end up with here is new lazyvacuum.c code
that does processing for one heap page (and associated indexes) that
is really just a "partial" lazy vacuum. Though it won't do things like
advance relfrozenxid, it will do pruning for the heap page, index
tuple killing, and finally heap tuple killing. It will do all of these
things reliably, just like traditional lazy vacuum. This will be what
your background worker eventually uses.


That is the final goal of the patch.


I doubt that you can use routines like index_beginscan() within
bttargetdelete() at all. I think that you need something closer to
_bt_doinsert() or _bt_pagedel(), that manages its own scan (your code
should probably live in nbtpage.c). It does not make sense to teach
external, generic routines like index_beginscan() about heap TID being
an implicit final index attribute, which is one reason for this (I'm
assuming that this patch relies on my own patch).  Another reason is
that you need to hold an exclusive buffer lock at the point that you
identify the tuple to be killed, until the point that you actually
kill it. You don't do that now.

IOW, the approach you've taken in bttargetdelete() isn't quite correct
because you imagine that it's okay to occasionally "lose" the index
tuple that you originally found, and just move on. That needs to be
100% reliable, or else we'll end up with index tuples that point to
the wrong heap tuples in rare cases with concurrent insertions. As I
said, we want a "partial" lazy vacuum here, which must mean that it's
reliable. Note that _bt_pagedel() actually calls _bt_search() when it
deletes a page. Your patch will not be the first patch that makes
nbtree vacuuming do an index scan. You should be managing your own
insertion scan key, much like _bt_pagedel() does. If you use my patch,
_bt_search() can be taught to look for a specific heap TID.

I agree with these notes. Corrections will be made in the next version 
of the patch.



Finally, doing things this way would let you delete multiple
duplicates in one shot, as I described in an earlier e-mail. Only a
single descent of the tree is needed to delete quite a few index
tuples, provided that they all happen to be logical duplicates. Again,
your background worker will take advantage of this.

It is a very interesting idea. Accordingly, I plan to change the 
bttargetdelete() interface as follows:

IndexTargetDeleteStats*
amtargetdelete(IndexTargetDeleteInfo *info,
   IndexTargetDeleteStats *stats,
   Datum *values, bool *isnull);
where the IndexTargetDeleteInfo structure contains a TID list of dead 
heap tuples. All index entries corresponding to this list (or only some 
of them) may be deleted by one call of the amtargetdelete() function 
with a single descent of the tree.



This code does not follow the Postgres style:


-   else
+   }
+   else {
+   if (rootoffnum != latestdead)
+   heap_prune_record_unused(prstate, latestdead);
 heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
+   }
 }


Please be more careful about that. I find it very distracting.


Done

--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company



[WIP] [B-Tree] Retail IndexTuple deletion

2018-06-17 Thread Andrey V. Lepikhov

Hi,
I have written code for quick index tuple deletion from a relation by 
heap tuple TID. The code relates to the "Retail IndexTuple deletion" 
enhancement of the btree index on the PostgreSQL wiki [1].

Briefly, it includes three steps:
1. Key generation for index tuple searching.
2. Index relation search for tuple with the heap tuple TID.
3. Deletion of the tuple from the index relation.

Currently, index relation cleaning is performed by vacuum, which scans 
the whole index relation for dead entries sequentially, tuple by tuple. 
In cases where the number of dead entries is not large, this simplistic 
and safe method can be significantly surpassed by retail deletion, which 
uses index scans to find the dead entries. It can also be used by 
distributed systems to reduce the cost of a global index vacuum.


Patch '0001-retail-indextuple-deletion' introduces a new function, 
amtargetdelete(), in the access method interface. Patch 
'0002-quick-vacuum-strategy' implements this function for an alternative 
strategy of lazy index vacuum, called 'Quick Vacuum'.


The code requires holding the DEAD tuple storage until the scan key is 
created. In this version I add a 'target_index_deletion_factor' option. 
If it is greater than 0, heap_page_prune() uses the ItemIdMarkDead() 
function instead of ItemIdSetDead() to set the DEAD flag while holding 
the tuple storage.
The next step is developing a background worker that will collect (tid, 
scankey) pairs of DEAD tuples from the heap_page_prune() function.


Here is the test description and some execution time measurements 
showing the benefit of these patches:

Test:
-
create table test_range(id serial primary key, value integer);
insert into test_range (value) select random()*1e7/10^N from 
generate_series(1, 1e7);

DELETE FROM test_range WHERE value=1;
VACUUM test_range;

Results:


| n | t1, s  | t2, s  | speedup |
|---|--------|--------|---------|
| 0 | 0.3    | 0.4476 | 1748.4  |
| 1 | 0.6    | 0.5367 | 855.99  |
| 2 | 0.0004 | 0.9804 | 233.99  |
| 3 | 0.0048 | 1.6493 | 34.576  |
| 4 | 0.5600 | 2.4854 | 4.4382  |
| 5 | 3.3300 | 3.9555 | 1.2012  |
| 6 | 17.700 | 5.6000 | 0.3164  |
|---|--------|--------|---------|
In the table, t1 is the measured execution time of the 
lazy_vacuum_index() function with the Quick-Vacuum strategy; t2 is the 
measured execution time of lazy_vacuum_index() with the Lazy-Vacuum 
strategy.


Note that a guaranteed, bounded time for the index scans (used for quick 
deletion) will be achieved by storing equal-key index tuples in physical 
TID order [2] with patch [3].


[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Retail_IndexTuple_deletion
[2] 
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[3] 
https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -80,6 +80,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/hash/hash.c