date:20201207

On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila  wrote:
>
> On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
>  wrote:
> >
>
> >>
> >> I am not sure why but it seems acceptable to original authors that the
> >> data of transactions are visibly partially during the initial
> >> synchronization phase for a subscription.
> >
> >
> > I don't think there's much alternative there.
> >
>
> I am not sure about this. I think it is primarily to allow some more
> parallelism among apply and sync workers. One primitive way to achieve
> parallelism and don't have this problem is to allow apply worker to
> wait till all the tablesync workers are in DONE state.
>

As the slot of apply worker is created before all the tablesync
workers it should never miss any LSN which tablesync workers would
have processed. Also, the table sync workers should not process any
xact if the apply worker has not processed anything. I think tablesync
currently always processes one transaction (because we call
process_sync_tables at commit of a txn) even if that is not required
to be in sync with the apply worker. This should solve both the
problems (a) visibility of partial transactions (b) allow prepared
transactions because tablesync worker no longer needs to combine
multiple transactions data.

I think the other advantages of this would be that it would reduce the
load (both CPU and I/O) on the publisher-side by allowing to decode
the data only once instead of for each table sync worker once and
separately for the apply worker. I think it will use fewer resources
to finish the work.

Is there any flaw in this idea which I am missing?

-- 
With Regards,
Amit Kapila.

Re: Logical archiving

On Mon, Dec 7, 2020 at 8:35 AM Craig Ringer
 wrote:
>
> Reply follows inline. I addressed your last point first, so it's out of order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin  wrote
>
> We'd also need to separate the existing apply worker into a "receiver" and 
> "apply/writer" part, so the wire-protocol handling isn't tightly coupled with 
> the actual change apply code, in order to make it possible to actually 
> consume those archives and apply them to the database. In pglogical3 we did 
> that by splitting them into two processes, connected by a shm_mq. Originally 
> the process split was optional and you could run a combined receiver/writer 
> process without the shm_mq if you wanted, but we quickly found it difficult 
> to reliably handle locking issues etc that way so the writers all moved 
> out-of-process.
>
> That was done mainly to make it possible to support parallelism in logical 
> decoding apply. But we also have the intention of supporting an alternative 
> reader process that can ingest "logical archives" and send them to the writer 
> to apply them, as if they'd been received from the on-wire stream. That's not 
> implemented at this time though. It'd be useful for a number of things:
>
> * PITR-style logical replay and recovery
> * Ability to pre-decode a txn once on the upstream then send the buffered 
> protocol-stream to multiple subscribers, saving on logical decoding and 
> reorder buffering overheads and write-multiplication costs
>

I think doing parallel apply and ability to decode a txn once are
really good improvements independent of all the work you listed.
Thanks for sharing your knowledge.

-- 
With Regards,
Amit Kapila.

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie  wrote:
>
> Hi
>
> +   /*
> +* Flag to let the planner know that the SELECT query is for CTAS. 
> This is
> +* used to calculate the tuple transfer cost from workers to gather 
> node(in
> +* case parallelism kicks in for the SELECT part of the CTAS), to 
> zero as
> +* each worker will insert its share of tuples in parallel.
> +*/
> +   if (IsParallelInsertInCTASAllowed(into, NULL))
> +   query->isForCTAS = true;
>
>
> +   /*
> +* We do not compute the parallel_tuple_cost for CTAS because the 
> number of
> +* tuples that are transferred from workers to the gather node is 
> zero as
> +* each worker, in parallel, inserts the tuples that are resulted 
> from its
> +* chunk of plan execution. This change may make the parallel plan 
> cheap
> +* among all other plans, and influence the planner to consider this
> +* parallel plan.
> +*/
> +   if (!(root->parse->isForCTAS &&
> +   root->query_level == 1))
> +   run_cost += parallel_tuple_cost * path->path.rows;
>
> I noticed that the parallel_tuple_cost will still be ignored,
> When Gather is not the top node.
>
> Example:
> Create table test(i int);
> insert into test values(generate_series(1,1000,1));
> explain create table ntest3 as select * from test where i < 200 limit 
> 1;
>
>   QUERY PLAN
> ---
>  Limit  (cost=1000.00..97331.33 rows=1000 width=4)
>->  Gather  (cost=1000.00..97331.33 rows=1000 width=4)
>  Workers Planned: 2
>  ->  Parallel Seq Scan on test  (cost=0.00..96331.33 rows=417 width=4)
>Filter: (i < 200)
>
>
> The isForCTAS will be true because [create table as], the
> query_level is always 1 because there is no subquery.
> So even if gather is not the top node, parallel cost will still be ignored.
>
> Is that works as expected ?
>

I don't think that is expected and is not the case without this patch.
The cost shouldn't be changed for existing cases where the write is
not pushed to workers.

-- 
With Regards,
Amit Kapila.

RE: Parallel copy

2020-12-07 Thread Hou, Zhijie

> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It use appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX will double the str memory when there is no enough
> spaces.
> >
> > How about call enlargeStringInfo in advance, if we already know the whole
> line size?
> > It can avoid some memory waste and may impove a little performance.
> >
> 
> Here we will not know the size beforehand, in some cases we will start
> processing the data when current block is populated and keep processing
> block by block, we will come to know of the size at the end. We cannot use
> enlargeStringInfo because of this.
> 
> Attached v11 patch has the fix for this, it also includes the changes to
> rebase on top of head.

Thanks for the explanation.

I think there is still chances we can know the size.

+* line_size will be set. Read the line_size again to be sure 
if it is
+* completed or partial block.
+*/
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize != -1)
+   {

If I am not wrong, this seems the branch that procsssing the populated block.
I think we can check the copiedSize here, if copiedSize == 0, that means
Datasizes is the size of the whole line and in this case we can do the enlarge.


Best regards,
houzj

Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Dean Rasheed

On Thu, 3 Dec 2020 at 15:23, Tomas Vondra  wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

  CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

  CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

  CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

  CREATE STATISTICS s ON col FROM tbl;
  ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

  CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53

Re: Yet another fast GiST build

2020-12-07 Thread Andrey Borodin



> 28 сент. 2020 г., в 13:12, Heikki Linnakangas  написал(а):
> 
> I wrote a couple of 'pageinspect' function to inspect GiST pages for this. 
> See attached.
> <0001-Add-functions-to-pageinspect-to-inspect-GiST-indexes.patch>

Here's version with tests and docs. I still have no idea how to print some 
useful information about tuples keys.

Thanks!

Best regards, Andrey Borodin.


v2-0001-Add-functions-to-pageinspect-to-inspect-GiST-inde.patch
Description: Binary data

get_constraint_index() and conindid

2020-12-07 Thread Peter Eisentraut

get_constraint_index() does its work by going through pg_depend.  It was 
added before pg_constraint.conindid was added, and some callers are 
still not changed.  Are there reasons for that?  Probably not.  The 
attached patch changes get_constraint_index() to an lsyscache-style 
lookup instead.


The nearby get_index_constraint() should probably also be changed to 
scan pg_constraint instead of pg_depend, but that doesn't have a 
syscache to use, so it would be a different approach, so I figured I'd 
ask about get_constraint_index() first.



From 29371949b2e61f894f651eef990c3c6eabfa8145 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 7 Dec 2020 10:38:36 +0100
Subject: [PATCH] Change get_constraint_index() to use pg_constraint.conindid

It was still using a scan of pg_depend instead of using the conindid
column that has been added since.
---
 src/backend/catalog/pg_depend.c  | 69 
 src/backend/commands/tablecmds.c |  1 -
 src/backend/optimizer/util/plancat.c |  1 -
 src/backend/utils/adt/ruleutils.c|  3 +-
 src/backend/utils/cache/lsyscache.c  | 27 +++
 src/include/catalog/dependency.h |  2 -
 src/include/utils/lsyscache.h|  1 +
 7 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 25290c821f..429791694f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -968,75 +968,6 @@ getIdentitySequence(Oid relid, AttrNumber attnum, bool 
missing_ok)
return linitial_oid(seqlist);
 }
 
-/*
- * get_constraint_index
- * Given the OID of a unique, primary-key, or exclusion constraint,
- * return the OID of the underlying index.
- *
- * Return InvalidOid if the index couldn't be found; this suggests the
- * given OID is bogus, but we leave it to caller to decide what to do.
- */
-Oid
-get_constraint_index(Oid constraintId)
-{
-   Oid indexId = InvalidOid;
-   RelationdepRel;
-   ScanKeyData key[3];
-   SysScanDesc scan;
-   HeapTuple   tup;
-
-   /* Search the dependency table for the dependent index */
-   depRel = table_open(DependRelationId, AccessShareLock);
-
-   ScanKeyInit(&key[0],
-   Anum_pg_depend_refclassid,
-   BTEqualStrategyNumber, F_OIDEQ,
-   ObjectIdGetDatum(ConstraintRelationId));
-   ScanKeyInit(&key[1],
-   Anum_pg_depend_refobjid,
-   BTEqualStrategyNumber, F_OIDEQ,
-   ObjectIdGetDatum(constraintId));
-   ScanKeyInit(&key[2],
-   Anum_pg_depend_refobjsubid,
-   BTEqualStrategyNumber, F_INT4EQ,
-   Int32GetDatum(0));
-
-   scan = systable_beginscan(depRel, DependReferenceIndexId, true,
- NULL, 3, key);
-
-   while (HeapTupleIsValid(tup = systable_getnext(scan)))
-   {
-   Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
-
-   /*
-* We assume any internal dependency of an index on the 
constraint
-* must be what we are looking for.
-*/
-   if (deprec->classid == RelationRelationId &&
-   deprec->objsubid == 0 &&
-   deprec->deptype == DEPENDENCY_INTERNAL)
-   {
-   charrelkind = 
get_rel_relkind(deprec->objid);
-
-   /*
-* This is pure paranoia; there shouldn't be any other 
relkinds
-* dependent on a constraint.
-*/
-   if (relkind != RELKIND_INDEX &&
-   relkind != RELKIND_PARTITIONED_INDEX)
-   continue;
-
-   indexId = deprec->objid;
-   break;
-   }
-   }
-
-   systable_endscan(scan);
-   table_close(depRel, AccessShareLock);
-
-   return indexId;
-}
-
 /*
  * get_index_constraint
  * Given the OID of an index, return the OID of the owning unique,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 46f1637e77..1fa9f19f08 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -26,7 +26,6 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
-#include "catalog/dependency.h"
 #include "catalog/heap.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c
index 3e94256d34..daf1759623 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -26,7 +26,6 @

Re: RFC: Deadlock detector hooks for victim selection and edge injection

On Mon, Dec 7, 2020 at 9:25 AM Craig Ringer
 wrote:
>
> Hi folks
>
> Now that we're well on track for streaming logical decoding, it's becoming 
> more practical to look at parallel logical apply.
>
> The support for this in pglogical3 benefits from a deadlock detector hook. It 
> was added in the optional patched postgres pglogical can use to enable 
> various extra features that weren't possible without core changes, but isn't 
> present in community postgres yet.
>
> I'd like to add it.
>
> The main benefit is that it lets the logical replication support tell the 
> deadlock detector that it should prefer to kill the victim whose txn has a 
> higher upstream commit lsn. That helps encourage parallel logical apply to 
> make progress in the face of deadlocks between concurrent txns.
>
> Any in-principle objections?
>

I think it will depend on your exact proposal of the hook but one
thing we might want to consider is whether it is acceptable to invoke
third-party code after holding LWLocks. We acquire LWLocks in
CheckDeadLock and then run the deadlock detector code.

Also, it might be better if you can expand the use case a bit more. It
is not very clear from what you have written.

-- 
With Regards,
Amit Kapila.

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 2:55 PM Amit Kapila  wrote:
>
> On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie  wrote:
> >
> > Hi
> >
> > +   /*
> > +* Flag to let the planner know that the SELECT query is for CTAS. 
> > This is
> > +* used to calculate the tuple transfer cost from workers to gather 
> > node(in
> > +* case parallelism kicks in for the SELECT part of the CTAS), to 
> > zero as
> > +* each worker will insert its share of tuples in parallel.
> > +*/
> > +   if (IsParallelInsertInCTASAllowed(into, NULL))
> > +   query->isForCTAS = true;
> >
> >
> > +   /*
> > +* We do not compute the parallel_tuple_cost for CTAS because the 
> > number of
> > +* tuples that are transferred from workers to the gather node is 
> > zero as
> > +* each worker, in parallel, inserts the tuples that are resulted 
> > from its
> > +* chunk of plan execution. This change may make the parallel plan 
> > cheap
> > +* among all other plans, and influence the planner to consider this
> > +* parallel plan.
> > +*/
> > +   if (!(root->parse->isForCTAS &&
> > +   root->query_level == 1))
> > +   run_cost += parallel_tuple_cost * path->path.rows;
> >
> > I noticed that the parallel_tuple_cost will still be ignored,
> > When Gather is not the top node.
> >
> > Example:
> > Create table test(i int);
> > insert into test values(generate_series(1,1000,1));
> > explain create table ntest3 as select * from test where i < 200 
> > limit 1;
> >
> >   QUERY PLAN
> > ---
> >  Limit  (cost=1000.00..97331.33 rows=1000 width=4)
> >->  Gather  (cost=1000.00..97331.33 rows=1000 width=4)
> >  Workers Planned: 2
> >  ->  Parallel Seq Scan on test  (cost=0.00..96331.33 rows=417 
> > width=4)
> >Filter: (i < 200)
> >
> >
> > The isForCTAS will be true because [create table as], the
> > query_level is always 1 because there is no subquery.
> > So even if gather is not the top node, parallel cost will still be ignored.
> >
> > Is that works as expected ?
> >
>
> I don't think that is expected and is not the case without this patch.
> The cost shouldn't be changed for existing cases where the write is
> not pushed to workers.
>

Thanks for pointing that out. Yes it should not change for the cases
where parallel inserts will not be picked later.

Any better suggestions on how to make the planner consider that the
CTAS might choose parallel inserts later at the same time avoiding the
above issue in case it doesn't?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Add primary keys to system catalogs

2020-12-07 Thread Peter Eisentraut


On 2020-10-03 08:40, Peter Eisentraut wrote:

Since we have ADD PRIMARY KEY USING INDEX, we can declare a primary key
for an existing index.  So this doesn't have to affect the low-level
early bootstrapping.  The attached patch adds those commands manually.
Another option might be to have the catalog generation Perl tooling
create those declarations automatically from some marker in the catalog
files.  That would also allow declaring unique constraints for the
unique indexes that don't end up being the primary key.


I have reworked my patch like this.  Now, the primary key is identified 
by using DECLARE_UNIQUE_INDEX_PKEY() instead of DECLARE_UNIQUE_INDEX() 
in the catalog pg_*.h files.  This automatically generates the required 
ALTER TABLE commands in a new file system_constraints.sql.  initdb is 
ordered so that that file is run before dependency processing, so the 
system constraints also end up pinned (as was requested during previous 
discussion).


Note, this needs the get_constraint_index() patch I just posted 
separately.  The old implementation of get_constraint_index() does not 
handle pinned constraints.


There is something strange going on in the misc_sanity test, as seen by 
the test output change


-NOTICE:  pg_constraint contains unpinned initdb-created object(s)

This previously detected some constraints from the information schema, 
so it was correct to point those out.  But since it looks for the 
min(oid) of a catalog, this will now find one of the new pinned 
constraints, so it won't detect this case anymore.  I think that test is 
not written correctly.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From 90bee9c99aa7347861a5a7651922cc7116d9f54b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 7 Dec 2020 11:11:24 +0100
Subject: [PATCH v2] Add primary keys and unique constraints to system catalogs

For those system catalogs that have a unique indexes, make a primary
key and unique constraint, using ALTER TABLE ... PRIMARY KEY/UNIQUE
USING INDEX.

This can be helpful for GUI tools that look for a primary key, and it
might in the future allow declaring foreign keys, for making schema
diagrams.

The constraint creation statements are automatically created by
genbki.pl from DECLARE_UNIQUE_INDEX directives.  To specify which one
of the available unique indexes is the primary key, use the new
directive DECLARE_UNIQUE_INDEX_PKEY instead.
---
 src/backend/catalog/.gitignore|  1 +
 src/backend/catalog/Catalog.pm| 11 ++--
 src/backend/catalog/Makefile  |  3 +-
 src/backend/catalog/genbki.pl | 21 +++
 src/bin/initdb/initdb.c   | 60 +--
 src/include/catalog/genbki.h  |  3 +-
 src/include/catalog/pg_aggregate.h|  2 +-
 src/include/catalog/pg_am.h   |  2 +-
 src/include/catalog/pg_amop.h |  2 +-
 src/include/catalog/pg_amproc.h   |  2 +-
 src/include/catalog/pg_attrdef.h  |  2 +-
 src/include/catalog/pg_attribute.h|  2 +-
 src/include/catalog/pg_auth_members.h |  2 +-
 src/include/catalog/pg_authid.h   |  2 +-
 src/include/catalog/pg_cast.h |  2 +-
 src/include/catalog/pg_class.h|  2 +-
 src/include/catalog/pg_collation.h|  2 +-
 src/include/catalog/pg_constraint.h   |  2 +-
 src/include/catalog/pg_conversion.h   |  2 +-
 src/include/catalog/pg_database.h |  2 +-
 src/include/catalog/pg_db_role_setting.h  |  2 +-
 src/include/catalog/pg_default_acl.h  |  2 +-
 src/include/catalog/pg_description.h  |  2 +-
 src/include/catalog/pg_enum.h |  2 +-
 src/include/catalog/pg_event_trigger.h|  2 +-
 src/include/catalog/pg_extension.h|  2 +-
 src/include/catalog/pg_foreign_data_wrapper.h |  2 +-
 src/include/catalog/pg_foreign_server.h   |  2 +-
 src/include/catalog/pg_foreign_table.h|  2 +-
 src/include/catalog/pg_index.h|  2 +-
 src/include/catalog/pg_inherits.h |  2 +-
 src/include/catalog/pg_init_privs.h   |  2 +-
 src/include/catalog/pg_language.h |  2 +-
 src/include/catalog/pg_largeobject.h  |  2 +-
 src/include/catalog/pg_largeobject_metadata.h |  2 +-
 src/include/catalog/pg_namespace.h|  2 +-
 src/include/catalog/pg_opclass.h  |  2 +-
 src/include/catalog/pg_operator.h |  2 +-
 src/include/catalog/pg_opfamily.h |  2 +-
 src/include/catalog/pg_partitioned_table.h|  2 +-
 src/include/catalog/pg_policy.h   |  2 +-
 src/include/catalog/pg_proc.h |  2 +-
 src/include/catalog/pg_publication.h  |  2 +-
 src/include/catalog/pg_publication_rel.h  |  2 +-
 src/include/catalog/pg_range.h|  2 +-
 src/include/catalog/pg_replication_origin.h   |  2 +-
 sr

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 3:44 PM Bharath Rupireddy
 wrote:
>
> On Mon, Dec 7, 2020 at 2:55 PM Amit Kapila  wrote:
> >
> > On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie  
> > wrote:
> > >
> > > +   if (!(root->parse->isForCTAS &&
> > > +   root->query_level == 1))
> > > +   run_cost += parallel_tuple_cost * path->path.rows;
> > >
> > > I noticed that the parallel_tuple_cost will still be ignored,
> > > When Gather is not the top node.
> > >
> > > Example:
> > > Create table test(i int);
> > > insert into test values(generate_series(1,1000,1));
> > > explain create table ntest3 as select * from test where i < 200 
> > > limit 1;
> > >
> > >   QUERY PLAN
> > > ---
> > >  Limit  (cost=1000.00..97331.33 rows=1000 width=4)
> > >->  Gather  (cost=1000.00..97331.33 rows=1000 width=4)
> > >  Workers Planned: 2
> > >  ->  Parallel Seq Scan on test  (cost=0.00..96331.33 rows=417 
> > > width=4)
> > >Filter: (i < 200)
> > >
> > >
> > > The isForCTAS will be true because [create table as], the
> > > query_level is always 1 because there is no subquery.
> > > So even if gather is not the top node, parallel cost will still be 
> > > ignored.
> > >
> > > Is that works as expected ?
> > >
> >
> > I don't think that is expected and is not the case without this patch.
> > The cost shouldn't be changed for existing cases where the write is
> > not pushed to workers.
> >
>
> Thanks for pointing that out. Yes it should not change for the cases
> where parallel inserts will not be picked later.
>
> Any better suggestions on how to make the planner consider that the
> CTAS might choose parallel inserts later at the same time avoiding the
> above issue in case it doesn't?
>

What is the need of checking query_level when 'isForCTAS' is set only
when Gather is a top-node?

-- 
With Regards,
Amit Kapila.

Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

On Mon, Dec 7, 2020 at 11:36 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Bharath Rupireddy 
> > IMHO, we should also change the parent table. Say, I have 2 local
> > partitions for a logged table, then I alter that table to
> > unlogged(with your patch, parent table doesn't become unlogged whereas
> > the partitions will), and I detach all the partitions for some reason.
> > Now, the user would think that he set the parent table to unlogged but
> > it didn't happen. So, I think we should also change the parent table's
> > logged/unlogged property though it may not have data associated with
> > it when it has all the partitions. Thoughts?
>
> I'd like to think that the logged/unlogged property is basically specific to 
> each storage unit, which is a partition here, and ALTER TABLE on a 
> partitioned table conveniently changes the properties of underlying storage 
> units.  (In that regard, it's unfortunate that it fails with an ERROR to try 
> to change storage parameters like fillfactor with ALTER TABLE.)
>

Do you mean to say that if we detach all the partitions(assuming they
are all unlogged) then the parent table(assuming logged) gets changed
to unlogged? Does it happen on master? Am I missing something here?

>
> > I think we can add foreign partition case into postgres_fdw.sql so
> > that the tests will run as part of make check-world.
>
> I was hesitant to add this test in postgres_fdw, but it seems to be a good 
> place considering that postgres_fdw, and other contrib modules as well, are 
> part of Postgres core.
>

+1 to add tests in postgres_fdw.

>
> > How about documenting all these points in alter_table.sgml,
> > create_table.sgml and create_foreign_table.sgml under the partition
> > section?
>
> I'd like to add a statement in the description of ALTER TABLE SET 
> LOGGED/UNLOGGED as other ALTER actions do.  I don't want to add a new 
> partition section for all CREATE/ALTER actions in this patch.
>

+1.

>
> If there's no objection, I think I'll submit the (hopefully final) revised 
> patch after a few days.
>

Please do so. Thanks.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila  wrote:
>
> What is the need of checking query_level when 'isForCTAS' is set only
> when Gather is a top-node?
>

isForCTAS is getting set before pg_plan_query() which is being used in
cost_gather(). We will not have a Gather node by then and hence will
not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while
setting isForCTAS to true. Intention to check query_level == 1 in
cost_gather is to consider for only top level query not for other sub
queries.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] Optimize dropping of relation buffers using dlist

On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com
 wrote:
>
> On Friday, December 4, 2020 8:27 PM, Amit Kapila  
> wrote:
> > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com"
> > >  wrote in
> > > > > From: Kyotaro Horiguchi  Hello, Kirk.
> > > > > Thank you for the new version.
> > > >
> > > > Hi, Horiguchi-san. Thank you for your very helpful feedback.
> > > > I'm updating the patches addressing those.
> > > >
> > > > > +   if (!smgrexists(rels[i], j))
> > > > > +   continue;
> > > > > +
> > > > > +   /* Get the number of blocks for a relation's fork 
> > > > > */
> > > > > +   blocks[i][numForks] = smgrnblocks(rels[i], j,
> > > > > NULL);
> > > > >
> > > > > If we see a fork which its size is not cached we must give up this
> > > > > optimization for all target relations.
> > > >
> > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and
> > > > use InRecovery when deciding for optimization because of the following
> > reasons:
> > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to
> > > > relation page contents. So in DropRelFileNodeBuffers(),
> > > > XLogReadBufferExtended() is called during VACUUM replay because
> > VACUUM changes the page content.
> > > > OTOH, TRUNCATE doesn't change the relation content, it just
> > > > truncates relation pages without changing the page contents. So
> > > > XLogReadBufferExtended() is not called, and the "cached" flag will
> > > > always return false. I tested with "cached" flags before, and this
> > >
> > > A bit different from the point, but if some tuples have been inserted
> > > to the truncated table, XLogReadBufferExtended() is called for the
> > > table and the length is cached.
> > >
> > > > always return false, at least in DropRelFileNodesAllBuffers. Due to
> > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers().
> > > > However, I think we can still rely on smgrnblocks to get the file
> > > > size as long as we're InRecovery. That cached nblocks is still 
> > > > guaranteed
> > to be the maximum in the shared buffer.
> > > > Thoughts?
> > >
> > > That means that we always think as if smgrnblocks returns "cached" (or
> > > "safe") value during recovery, which is out of our current consensus.
> > > If we go on that side, we don't need to consult the "cached" returned
> > > from smgrnblocks at all and it's enough to see only InRecovery.
> > >
> > > I got confused..
> > >
> > > We are relying on the "fact" that the first lseek() call of a
> > > (startup) process tells the truth.  We added an assertion so that we
> > > make sure that the cached value won't be cleared during recovery.  A
> > > possible remaining danger would be closing of an smgr object of a live
> > > relation just after a file extension failure. I think we are thinking
> > > that that doesn't happen during recovery.  Although it seems to me
> > > true, I'm not confident.
> > >
> >
> > Yeah, I also think it might not be worth depending upon whether smgr close
> > has been done before or not. I feel the current idea of using 'cached'
> > parameter is relatively solid and we should rely on that.
> > Also, which means that in DropRelFileNodesAllBuffers() we should rely on
> > the same, I think doing things differently in this regard will lead to 
> > confusion. I
> > agree in some cases we might not get benefits but it is more important to be
> > correct and keep the code consistent to avoid introducing bugs now or in the
> > future.
> >
> Hi,
> I have reported before that it is not always the case that the "cached" flag 
> of
> srnblocks() return true. So when I checked the truncate test case used in my
> patch, it does not enter the optimization path despite doing INSERT before
> truncation of table.
> The reason for that is because in TRUNCATE, a new RelFileNode is assigned
> to the relation when creating a new file. In recovery, 
> XLogReadBufferExtended()
> always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for 
> the
> first time. And for recovery processing, different RelFileNodes are used for 
> the
> INSERTs to the table and TRUNCATE to the same table.
>

Hmm, how is it possible if Insert is done before Truncate? The insert
should happen in old RelFileNode only. I have verified by adding a
break-in (while (1), so that it stops there) heap_xlog_insert and
DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
How have you verified what you are saying?

-- 
With Regards,
Amit Kapila.

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy
 wrote:
>
> On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila  wrote:
> >
> > What is the need of checking query_level when 'isForCTAS' is set only
> > when Gather is a top-node?
> >
>
> isForCTAS is getting set before pg_plan_query() which is being used in
> cost_gather(). We will not have a Gather node by then and hence will
> not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while
> setting isForCTAS to true.
>

IsParallelInsertInCTASAllowed() seems to be returning false if
queryDesc is NULL, so won't isForCTAS be always set to false? I think
I am missing something here.

-- 
With Regards,
Amit Kapila.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-12-07 Thread Andy Fan

On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen 
wrote:

> Hi,
>
> On 12/5/20 10:38 PM, Andy Fan wrote:
> > Currently the UniqueKey is defined as a List of Expr, rather than
> > EquivalenceClasses.
> > A complete discussion until now can be found at [1] (The messages I
> replied
> > to also
> > care a lot and the information is completed). This patch has stopped at
> > this place for
> > a while,  I'm planning to try EquivalenceClasses,  but any suggestion
> would
> > be welcome.
> >
> >
>
> Unfortunately I think we need a RfC style patch of both versions in
> their minimum implementation.
>
> Hopefully this will make it easier for one or more committers to decide
> on the right direction since they can do a side-by-side comparison of
> the two solutions.
>
>
I do get the exact same idea.  Actually I have made EquivalenceClasses
works with baserel last weekend and then I realized it is hard to compare
the 2 situations without looking into the real/Poc code, even for very
experienced people.  I will submit a new patch after I get the partitioned
relation, subquery works.  Hope I can make it in one week.


> Just my $0.02.
>
> Thanks for working on this Andy !
>
> Best regards,
>   Jesper
>
>

-- 
Best Regards
Andy Fan

initscan for MVCC snapshot

2020-12-07 Thread Andy Fan

Hi:
 I see initscan calls RelationGetwNumberOfBlocks every time and rescan calls
 initscan as well. In my system, RelationGetNumberOfBlocks is expensive
(the reason
 doesn't deserve a talk.. ),  so in a nest loop + Bitmap heap scan case,
the
impact will be huge. The comments of initscan are below.

/*
* Determine the number of blocks we have to scan.
*
* It is sufficient to do this once at scan start, since any tuples added
* while the scan is in progress will be invisible to my snapshot anyway.
* (That is not true when using a non-MVCC snapshot.  However, we couldn't
* guarantee to return tuples added after scan start anyway, since they
* might go into pages we already scanned.  To guarantee consistent
* results for a non-MVCC snapshot, the caller must hold some higher-level
* lock that ensures the interesting tuple(s) won't change.)
*/

I still do not fully understand the comments. Looks we only need to call
multi times for non-MVCC snapshot, IIUC, does the following change
reasonable?

===

diff --git a/src/backend/access/heap/heapam.c
b/src/backend/access/heap/heapam.c
index 1b2f70499e..8238eabd8b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
keep_startblock)
ParallelBlockTableScanDesc bpscan = NULL;
boolallow_strat;
boolallow_sync;
+   boolis_mvcc = scan->rs_base.rs_snapshot &&
IsMVCCSnapshot(scan->rs_base.rs_snapshot);

/*
 * Determine the number of blocks we have to scan.
@@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool
keep_startblock)
scan->rs_nblocks = bpscan->phs_nblocks;
}
else
-   scan->rs_nblocks =
RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
+   if (scan->rs_nblocks == -1 || !is_mvcc)
+   scan->rs_nblocks =
RelationGetNumberOfBlocks(scan->rs_base.rs_rd);

/*
 * If the table is large relative to NBuffers, use a bulk-read
access
@@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
else
scan->rs_base.rs_key = NULL;

+   scan->rs_nblocks = -1;
initscan(scan, key, false);

-- 
Best Regards
Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-12-07 Thread Ashutosh Bapat

On Sun, Dec 6, 2020 at 9:09 AM Andy Fan  wrote:
>>
>> I have not been following this thread closely enough to understand
>> why we need a new "UniqueKeys" data structure at all.
>
>
> Currently the UniqueKey is defined as a List of Expr, rather than 
> EquivalenceClasses.
> A complete discussion until now can be found at [1] (The messages I replied 
> to also
> care a lot and the information is completed). This patch has stopped at this 
> place for
> a while,  I'm planning to try EquivalenceClasses,  but any suggestion would 
> be welcome.
>
>>
>> But if the
>> motivation is only to remove this overspecification, I humbly suggest
>> that it ain't worth the trouble.

AFAIK, the simple answer is we need some way to tell that certain
expressions together form a unique key for a given relation. E.g.
group by clause forms a unique key for the output of GROUP BY.
Pathkeys have a stronger requirement that the relation is ordered on
that expression, which may not be the case with uniqueness e.g. output
of GROUP BY produced by hash grouping. To me it's Pathkeys - ordering,
so we could use Pathkeys with reduced strength. But that might affect
a lot of places which depend upon stronger pathkeys.

-- 
Best Wishes,
Ashutosh Bapat

Re: Refactor MD5 implementations and switch to EVP for OpenSSL

2020-12-07 Thread Daniel Gustafsson

> On 4 Dec 2020, at 08:05, Michael Paquier  wrote:
> 
> On Tue, Nov 10, 2020 at 01:28:09PM +0900, Michael Paquier wrote:
>> The CF bot has been complaining on Windows and this issue is fixed in
>> the attached.  A refresh of src/tools/msvc for pgcrypto was just
>> missing.
> 
> Now that HEAD has the necessary infrastructure to be able to plug in
> easily new cryptohash routines, here is a rebased patch for MD5.

This is a pretty straightforward patch which given the cryptohash framework
added previously does what it says on the tin.  +1 on rolling MD5 into what we
already have for SHA2 now. Applies clean and all tests pass.

One (two-part) comment on the patch though:

The re-arrangement of the code does lead to some attribution confusion however
IMO.  pgcrypto/md5.c and src/common/md5.c are combined with the actual MD5
implementation from pgcrypto/md5.c and the PG glue code from src/common/md5.c.
That is in itself a good choice, but the file headers are intact and now claims
two implementations of which only one remains.

Further, bytesToHex was imported in commit 957613be18e6b7 with attribution to
"Sverre H.  Huseby " without a PGDG defined copyright,
which was later added in 2ca65f716aee9ec441dda91d91b88dd7a00bffa1.  This patch
moves bytesToHex from md5.c (where his attribution is) to md5_common.c with no
maintained attribution.  Off the cuff it seems we should either attribute in a
comment, or leave the function and export it, but the usual "I'm not a lawyer"
disclaimer applies.  Do you have any thoughts?

> The amount of code shaved is still nice:
> 13 files changed, 641 insertions(+), 775 deletions(-) 

Always nice with a net minus patch with sustained functionality.

cheers ./daniel

Re: Parallel Inserts in CREATE TABLE AS

On Mon, Dec 7, 2020 at 5:25 PM Amit Kapila  wrote:
>
> On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy
>  wrote:
> >
> > On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila  wrote:
> > >
> > > What is the need of checking query_level when 'isForCTAS' is set only
> > > when Gather is a top-node?
> > >
> >
> > isForCTAS is getting set before pg_plan_query() which is being used in
> > cost_gather(). We will not have a Gather node by then and hence will
> > not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while
> > setting isForCTAS to true.
> >
>
> IsParallelInsertInCTASAllowed() seems to be returning false if
> queryDesc is NULL, so won't isForCTAS be always set to false? I think
> I am missing something here.
>

My bad. I utterly missed this, sorry for the confusion.

My intention to have IsParallelInsertInCTASAllowed() is for two
purposes. 1. when called before planning without queryDesc, it should
return true if IS_CTAS(into) is true and is not a temporary table. 2.
when called after planning with a non-null queryDesc, along with 1)
checks, it should also perform the Gather State checks and return
accordingly.

I have corrected it in v9 patch. Please have a look.

>
> > > The isForCTAS will be true because [create table as], the
> > > query_level is always 1 because there is no subquery.
> > > So even if gather is not the top node, parallel cost will still be 
> > > ignored.
> > >
> > > Is that works as expected ?
> > >
> >
> > I don't think that is expected and is not the case without this patch.
> > The cost shouldn't be changed for existing cases where the write is
> > not pushed to workers.
> >
>
> Thanks for pointing that out. Yes it should not change for the cases
> where parallel inserts will not be picked later.
>
> Any better suggestions on how to make the planner consider that the
> CTAS might choose parallel inserts later at the same time avoiding the
> above issue in case it doesn't?
>

I'm not quite sure how to address this. Can we not allow the planner
to consider that the select is for CTAS and check only after the
planning is done for the Gather node and other checks? This is simple
to do, but we might miss some parallel plans for the SELECTs because
the planner would have already considered the tuple transfer cost from
workers to Gather wrongly because of which that parallel plan would
have become costlier compared to non parallel plans. IMO, we can do
this since it also keeps the existing behaviour of the planner i.e.
when the planner is planning for SELECTs it doesn't know that it is
doing it for CTAS. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v9-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch
Description: Binary data

Hello all!I suggest a refactoring of analyze AM API as it is too much heap specific at the moment. The problem was inspired by Greenplum’s analyze improvement for append-optimized row and column AM with variable size compressed blocks.Currently we do analyze in two steps.1. Sample fix size blocks with algorithm S from Knuth (BlockSampler function)2. Collect tuples into reservoir with algorithm Z from Vitter.So this doesn’t work for AMs using variable sized physical blocks for example. They need weight random sampling (WRS) algorithms like A-Chao or logical blocks to follow S-Knuth (and have a problem with RelationGetNumberOfBlocks() estimating a physical number of blocks). Another problem with columns - they are not passed to analyze begin scan and can’t benefit from column storage at ANALYZE TABLE (COL).The suggestion is to replace table_scan_analyze_next_block() and table_scan_analyze_next_tuple() with a single function: table_acquire_sample_rows(). The AM implementation of table_acquire_sample_rows() can use the BlockSampler functions if it wants to, but if the AM is not block-oriented, it could do something else. This suggestion also passes VacAttrStats to table_acquire_sample_rows() for column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be block-oriented.

am-analyze-1.patch
Description: Binary data

Best regards,Denis Smirnov | Developers...@arenadata.io Arenadata | Godovikova 9-17, Moscow 129085 Russia

Re: Huge memory consumption on partitioned table with FKs

2020-12-07 Thread Amit Langote

On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
 wrote:
> > Also, the comment that was in RI_ConstraintInfo now appears in
> > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > undocumented.  What is the relationship between those two structs?  I
> > see that they have pointers to each other, but I think the relationship
> > should be documented more clearly.
>
> I'm not sure the footprint of this patch worth doing but here is a bit
> more polished version.

I noticed that the foreign_key test fails and it may have to do with
the fact that a partition's param info remains attached to the
parent's RI_ConstraintInfo even after it's detached from the parent
table using DETACH PARTITION.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Tomas Vondra




On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra  
> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
> 
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
> 
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
> 
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> 
> fails, but
> 
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> 
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> 
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> 
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
> 
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
> 
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
> 
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> ON (expression)
> FROM table_name
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> [ ( statistics_kind [, ... ] ) ]
> ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> FROM table_name
> 
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
> 
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
> 
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
> 
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
> 

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Huge memory consumption on partitioned table with FKs

On 2020-Dec-07, Amit Langote wrote:

> On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
>  wrote:
> > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > undocumented.  What is the relationship between those two structs?  I
> > > see that they have pointers to each other, but I think the relationship
> > > should be documented more clearly.
> >
> > I'm not sure the footprint of this patch worth doing but here is a bit
> > more polished version.
> 
> I noticed that the foreign_key test fails and it may have to do with
> the fact that a partition's param info remains attached to the
> parent's RI_ConstraintInfo even after it's detached from the parent
> table using DETACH PARTITION.

I think this bit about splitting the struct is a distraction.  Let's get
a patch that solves the bug first, and then we can discuss what further
refinements we want to do.  I think we should get your patch in
CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com
committed (which I have not read yet.)  Do you agree with this plan?

Re: Parallel Inserts in CREATE TABLE AS

2020-12-07 Thread Dilip Kumar

On Mon, Dec 7, 2020 at 7:04 PM Bharath Rupireddy
 wrote:
>
> On Mon, Dec 7, 2020 at 5:25 PM Amit Kapila  wrote:
> >
> > On Mon, Dec 7, 2020 at 4:20 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Mon, Dec 7, 2020 at 4:04 PM Amit Kapila  
> > > wrote:
> > > >
> > > > What is the need of checking query_level when 'isForCTAS' is set only
> > > > when Gather is a top-node?
> > > >
> > >
> > > isForCTAS is getting set before pg_plan_query() which is being used in
> > > cost_gather(). We will not have a Gather node by then and hence will
> > > not pass queryDesc to IsParallelInsertInCTASAllowed(into, NULL) while
> > > setting isForCTAS to true.
> > >
> >
> > IsParallelInsertInCTASAllowed() seems to be returning false if
> > queryDesc is NULL, so won't isForCTAS be always set to false? I think
> > I am missing something here.
> >
>
> My bad. I utterly missed this, sorry for the confusion.
>
> My intention to have IsParallelInsertInCTASAllowed() is for two
> purposes. 1. when called before planning without queryDesc, it should
> return true if IS_CTAS(into) is true and is not a temporary table. 2.
> when called after planning with a non-null queryDesc, along with 1)
> checks, it should also perform the Gather State checks and return
> accordingly.
>
> I have corrected it in v9 patch. Please have a look.
>
> >
> > > > The isForCTAS will be true because [create table as], the
> > > > query_level is always 1 because there is no subquery.
> > > > So even if gather is not the top node, parallel cost will still be 
> > > > ignored.
> > > >
> > > > Is that works as expected ?
> > > >
> > >
> > > I don't think that is expected and is not the case without this patch.
> > > The cost shouldn't be changed for existing cases where the write is
> > > not pushed to workers.
> > >
> >
> > Thanks for pointing that out. Yes it should not change for the cases
> > where parallel inserts will not be picked later.
> >
> > Any better suggestions on how to make the planner consider that the
> > CTAS might choose parallel inserts later at the same time avoiding the
> > above issue in case it doesn't?
> >
>
> I'm not quite sure how to address this. Can we not allow the planner
> to consider that the select is for CTAS and check only after the
> planning is done for the Gather node and other checks? This is simple
> to do, but we might miss some parallel plans for the SELECTs because
> the planner would have already considered the tuple transfer cost from
> workers to Gather wrongly because of which that parallel plan would
> have become costlier compared to non parallel plans. IMO, we can do
> this since it also keeps the existing behaviour of the planner i.e.
> when the planner is planning for SELECTs it doesn't know that it is
> doing it for CTAS. Thoughts?
>

I have done some initial review and I have a few comments.

@@ -328,6 +316,15 @@ ExecCreateTableAs(ParseState *pstate,
CreateTableAsStmt *stmt,
query = linitial_node(Query, rewritten);
Assert(query->commandType == CMD_SELECT);

+   /*
+* Flag to let the planner know that the SELECT query is for CTAS. This
+* is used to calculate the tuple transfer cost from workers to gather
+* node(in case parallelism kicks in for the SELECT part of the CTAS),
+* to zero as each worker will insert its share of tuples in parallel.
+*/
+   if (IsParallelInsertInCTASAllowed(into, NULL))
+   query->isForCTAS = true;
+
/* plan the query */
plan = pg_plan_query(query, pstate->p_sourcetext,
 CURSOR_OPT_PARALLEL_OK, params);
@@ -350,6 +347,15 @@ ExecCreateTableAs(ParseState *pstate,
CreateTableAsStmt *stmt,
/* call ExecutorStart to prepare the plan for execution */
ExecutorStart(queryDesc, GetIntoRelEFlags(into));
+   /*
+* If SELECT part of the CTAS is parallelizable, then make each
+* parallel worker insert the tuples that are resulted in its execution
+* into the target table. We need plan state to be initialized by the
+* executor to decide whether to allow parallel inserts or not.
+   */
+   if (IsParallelInsertInCTASAllowed(into, queryDesc))
+   SetCTASParallelInsertState(queryDesc);


Once we have called IsParallelInsertInCTASAllowed and set the
query->isForCTAS flag then why we are calling this again?

——
---

+*/
+   if (!(root->parse->isForCTAS &&
+   root->query_level == 1))
+   run_cost += parallel_tuple_cost * path->path.rows;

>From this check, it appeared that the lower level gather will also get
influenced by this, consider this

-> NLJ
  -> Gather
 -> Parallel Seq Scan
  -> Index Scan

This condition is only checking that it should be a top-level query
and it should be under CTAS then this will impact all the gather nodes
as shown in the above example.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
existing partitions, but cause future partitions to acquire the new
setting.

This sounds very much related to previous discussion on REPLICA IDENTITY
not propagating to partitions; see
https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
particularly Robert Haas' comments at
http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=bsp_m+702eof42fqrtqtyio1nu...@mail.gmail.com
and downstream discussion from there.

Re: PoC/WIP: Extended statistics on expressions

2020-12-07 Thread Dean Rasheed

On Mon, 7 Dec 2020 at 14:15, Tomas Vondra  wrote:
>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > ON (expression)
> > FROM table_name
> >
> >   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > [ ( statistics_kind [, ... ] ) ]
> > ON { column_name | (expression) } , { column_name | (expression) } [, 
> > ...]
> > FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean

Re: Additional improvements to extended statistics

2020-12-07 Thread Dean Rasheed

On Wed, 2 Dec 2020 at 15:51, Dean Rasheed  wrote:
>
> The sort of queries I had in mind were things like this:
>
>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
>
> However, the new code doesn't apply the extended stats directly using
> clauselist_selectivity_or() for this kind of query because there are
> no RestrictInfos for the nested AND clauses, so
> find_single_rel_for_clauses() (and similarly
> statext_is_compatible_clause()) regards those clauses as not
> compatible with extended stats. So what ends up happening is that
> extended stats are used only when we descend down to the two AND
> clauses, and their results are combined using the original "s1 + s2 -
> s1 * s2" formula. That actually works OK in this case, because there
> is no overlap between the two AND clauses, but it wouldn't work so
> well if there was.
>
> I'm pretty sure that can be fixed by teaching
> find_single_rel_for_clauses() and statext_is_compatible_clause() to
> handle BoolExpr clauses, looking for RestrictInfos underneath them,
> but I think that should be left for a follow-in patch.

Attached is a patch doing that, which improves a couple of the
estimates for queries with AND clauses underneath OR clauses, as
expected.

This also revealed a minor bug in the way that the estimates for
multiple statistics objects were combined while processing an OR
clause -- the estimates for the overlaps between clauses only apply
for the current statistics object, so we really have to combine the
estimates for each set of clauses for each statistics object as if
they were independent of one another.

0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
improves the estimates for sub-AND clauses underneath OR clauses.

These are both quite small patches, that hopefully won't interfere
with any of the other extended stats patches.

Regards,
Dean

0001-Improve-estimation-of-OR-clauses-using-multiple-exte.patch
Description: Binary data

0002-Improve-estimation-of-ANDs-under-ORs-using-extended-.patch
Description: Binary data

Re: Huge memory consumption on partitioned table with FKs

2020-12-07 Thread Amit Langote

Hi Alvaro,

On Mon, Dec 7, 2020 at 23:48 Alvaro Herrera  wrote:

> On 2020-Dec-07, Amit Langote wrote:
>
> > On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
> >  wrote:
> > > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > > undocumented.  What is the relationship between those two structs?  I
> > > > see that they have pointers to each other, but I think the
> relationship
> > > > should be documented more clearly.
> > >
> > > I'm not sure the footprint of this patch worth doing but here is a bit
> > > more polished version.
> >
> > I noticed that the foreign_key test fails and it may have to do with
> > the fact that a partition's param info remains attached to the
> > parent's RI_ConstraintInfo even after it's detached from the parent
> > table using DETACH PARTITION.
>
> I think this bit about splitting the struct is a distraction.  Let's get
> a patch that solves the bug first, and then we can discuss what further
> refinements we want to do.  I think we should get your patch in
> CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com
> committed (which I have not read yet.)  Do you agree with this plan?


Yeah, I agree.

- Amit

> --
Amit Langote
EDB: http://www.enterprisedb.com

Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-07 Thread Alexander Korotkov

Hi!

On Mon, Dec 7, 2020 at 9:10 AM Craig Ringer
 wrote:
> On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov  wrote:
>> PostgreSQL is a complex multi-process system, and we are periodically faced 
>> with complicated concurrency issues. While the postgres community does a 
>> great job on investigating and fixing the problems, our ability to reproduce 
>> concurrency issues in the source code test suites is limited.
>>
>> I think we currently have two general ways to reproduce the concurrency 
>> issues.
>> 1. A text scenario for manual reproduction of the issue, which could involve 
>> psql sessions, gdb sessions etc. Couple of examples are [1] and [2]. This 
>> method provides reliable reproduction of concurrency issues. But it's  hard 
>> to automate, because it requires external instrumentation (debugger) and 
>> it's not stable in terms of postgres code changes (that is particular line 
>> numbers for breakpoints could be changed). I think this is why we currently 
>> don't have such scenarios among postgres test suites.
>> 2. Another way is to reproduce the concurrency issue without directly 
>> touching the database internals using pgbench or other way to simulate the 
>> workload (see [3] for example). This way is easier to automate, because it 
>> doesn't need external instrumentation and it's not so sensitive to source 
>> code changes. But at the same time this way is not reliable and is 
>> resource-consuming.
>
> Agreed.
>
> For a useful but limited set of cases there's (3) the isolation tester and 
> pg_isolation_regress. But IIRC the patches to teach it to support multiple 
> upstream nodes never got in, so it's essentially useless for any replication 
> related testing.
>
> There's also (4), write a TAP test that uses concurrent psql sessions via 
> IPC::Run. Then play games with heavyweight or advisory lock waits to order 
> events, use instance starts/stops, change ports or connstrings to simulate 
> network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that 
> inject faults or provide custom blocking wait functions for the event you 
> want, etc. I've done that more than I'd care to, and I don't want to do it 
> any more than I have to in future.

Sure, there are isolation tester and TAP tests.  I just meant the
scenarios, where we can't reliably reproduce using either isolation
tests or tap tests.  Sorry for confusion.

> In some cases I've gone further and written tests that use systemtap in 
> "guru" mode (read/write, with embedded C enabled) to twiddle the memory of 
> the target process(es) when a probe is hit, e.g. to modify a function 
> argument or return value or inject a fault. Not exactly portable or 
> convenient, though very powerful.

Exactly, systemtap is good, but we need something more portable and
convenient for builtin test suites.

>> In the view of above, I'd like to propose a POC patch, which implements new 
>> builtin infrastructure for reproduction of concurrency issues in automated 
>> test suites.  The general idea is so-called "stop events", which are special 
>> places in the code, where the execution could be stopped on some condition.  
>> Stop event also exposes a set of parameters, encapsulated into jsonb value.  
>> The condition over stop event parameters is defined using jsonpath language.
>
>
> The patched PostgreSQL used by 2ndQuadrant internally has a feature called 
> PROBE_POINT()s that is somewhat akin to this. Since it's not a customer 
> facing feature I'm sure I can discuss it here, though I'll need to seek 
> permission before I can show code.
>
> TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various 
> other behaviour at points in the code marked by name, using GUCs, hooks 
> loaded from test extensions, or even systemtap scripts to control what fires 
> and when. Performance impact is essentially zero when no probes are currently 
> enabled at runtime, so they're fine for cassert builds.
>
> Details:
>
> A PROBE_POINT() is a macro that works as a marker, a bit like a 
> TRACE_POSTGRESQL_ dtrace macro. But instead of the super lightweight 
> tracepoint that SDT marker points emit, a PROBE_POINT tests an 
> unlikely(probe_points_enabled) flag, and if true, it prepares arguments for 
> the probe handler: A probe name, probe action, sleep duration, and a hit 
> counter.
>
> The default probe action and sleep duration come from GUCs. So your control 
> of the probe is limited to the granularity you can easily manage GUCs at. 
> That's often sufficient
>
> But if you want finer control for something, there are two ways to achieve it.
>
> After picking the default arguments to the handler, the probe point checks 
> for a hook. If defined, it calls it with the probe point name and pointers to 
> the action and sleep duration values, so the hook function can modify them 
> per probe-point hit. That way you can use in src/test/modules extensions or 
> your own test extensions first, with the probe point name as a

Change default of checkpoint_completion_target

2020-12-07 Thread Stephen Frost

Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:
> > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> >> You keep making this statement, and I don't necessarily disagree, but if
> >> that is the case, please explain why don't we have
> >> checkpoint_completion_target set to 0.9 by default?  Should we change
> >> that?
> > 
> > Yes, I do think we should change that..
> 
> Agreed.  FWIW, no idea for others, but it is one of those parameters I
> keep telling to update after a default installation.

Concretely, attached is a patch which changes the default and updates
the documentation accordingly.

Passes regression tests and doc build.  Will register in the January
commitfest as Needs Review.

Thanks,

Stephen
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d6901c..47db860dd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3260,7 +3260,13 @@ include_dir 'conf.d'
   

 Specifies the target of checkpoint completion, as a fraction of
-total time between checkpoints. The default is 0.5.
+total time between checkpoints. The default is 0.9, which spreads the
+checkpoint across the entire checkpoint timeout period of time,
+providing a consistent amount of I/O during the entire checkpoint.
+Reducing this parameter is not recommended as that causes the I/O from
+the checkpoint to have to complete faster, resulting in a higher I/O
+rate, while then having a period of less I/O between the completion of
+the checkpoint and the start of the next scheduled checkpoint.
 This parameter can only be set in the postgresql.conf
 file or on the server command line.

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..8ac3f971b7 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -524,22 +524,28 @@
writing dirty buffers during a checkpoint is spread over a period of time.
That period is controlled by
, which is
-   given as a fraction of the checkpoint interval.
+   given as a fraction of the checkpoint interval (configured by using
+   checkpoint_timeout).
The I/O rate is adjusted so that the checkpoint finishes when the
given fraction of
checkpoint_timeout seconds have elapsed, or before
max_wal_size is exceeded, whichever is sooner.
-   With the default value of 0.5,
+   With the default value of 0.9,
PostgreSQL can be expected to complete each checkpoint
-   in about half the time before the next checkpoint starts.  On a system
-   that's very close to maximum I/O throughput during normal operation,
-   you might want to increase checkpoint_completion_target
-   to reduce the I/O load from checkpoints.  The disadvantage of this is that
-   prolonging checkpoints affects recovery time, because more WAL segments
-   will need to be kept around for possible use in recovery.  Although
-   checkpoint_completion_target can be set as high as 1.0,
-   it is best to keep it less than that (perhaps 0.9 at most) since
-   checkpoints include some other activities besides writing dirty buffers.
+   a bit before the next scheduled checkpoint.  This spreads out the I/O as much as
+   possible to have the I/O load be consistent during the checkpoint.  The
+   disadvantage of this is that prolonging checkpoints affects recovery time,
+   because more WAL segments will need to be kept around for possible use in recovery.
+   A user concerned about the amount of time required to recover might wish to reduce
+   checkpoint_timeout, causing checkpoints to happen more
+   frequently while still spreading out the I/O from each checkpoint.  Alternatively,
+   checkpoint_completion_target could be reduced, but this would
+   result in times of more intense I/O (during the checkpoint) and times of less I/O
+   (after the checkpoint completed but before the next scheduled checkpoint) and
+   therefore is not recommended.
+   Although checkpoint_completion_target could be set as high as
+   1.0, it is best to keep it less than that (such as at the default of 0.9, at most)
+   since checkpoints include some other activities besides writing dirty buffers.
A setting of 1.0 is quite likely to result in checkpoints not being
completed on time, which would result in performance loss due to
unexpected variation in the number of WAL segments needed.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 635d91d50a..c0b2f99c11 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3643,7 +3643,7 @@ static struct config_real ConfigureNamesReal[] =
 			NULL
 		},
 		&CheckPointCompletionTarget,
-		0.5, 0.0, 1.0,
+		0.9, 0.0, 1.0,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9c9091e601..91d759dc61 100644
-

pg_rewind race condition just after promotion

2020-12-07 Thread Heikki Linnakangas

There's a race condition between the checkpoint at promotion and 
pg_rewind. When a server is promoted, the startup process writes an 
end-of-recovery checkpoint that includes the new TLI, and the server is 
immediate opened for business. The startup process requests the 
checkpointer process to perform a checkpoint, but it can take a few 
seconds or more to complete. If you run pg_rewind, using the just 
promoted server as the source, pg_rewind will think that the server is 
still on the old timeline, because it only looks at TLI in the control 
file's copy of the checkpoint record. That's not updated until the 
checkpoint is finished.


This isn't a new issue. Stephen Frost first reported it back 2015 [1]. 
Back then, it was deemed just a small annoyance, and we just worked 
around it in the tests by issuing a checkpoint command after promotion, 
to wait for the checkpoint to finish. I just ran into it again today, 
with the new pg_rewind test, and silenced it in the similar way.


I think we should fix this properly. I'm not sure if it can lead to a 
broken cluster, but at least it can cause pg_rewind to fail 
unnecessarily and in a user-unfriendly way. But this is actually pretty 
simple to fix. pg_rewind looks at the control file to find out the 
timeline the server is on. When promotion happens, the startup process 
updates minRecoveryPoint and minRecoveryPointTLI fields in the control 
file. We just need to read it from there. Patch attached.


I think we should also backpatch this. Back in 2015, we decided that we 
can live with this, but it's always been a bit bogus, and seems simple 
enough to fix.


Thoughts?

[1] 
https://www.postgresql.org/message-id/20150428180253.GU30322%40tamriel.snowman.net


- Heikki
>From 274c59ed0e6573dd387a03e36eda15cb41658447 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 7 Dec 2020 19:59:50 +0200
Subject: [PATCH 1/1] pg_rewind: Fix determining TLI when server was just
 promoted.

If the source server was just promoted, and it hasn't written the
checkpoint record yet, pg_rewind considered the server to be still on
the old timeline. Because of that, it would claim incorrectly that no
rewind is required. Fix that by looking at the minRecoveryPointTLI in
the control file in addition to the ThisTimeLineID on the checkpoint.

Discussion: TODO
---
 src/bin/pg_rewind/pg_rewind.c | 55 +++
 src/bin/pg_rewind/t/007_standby_source.pl |  1 -
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  9 ---
 src/bin/pg_rewind/t/RewindTest.pm |  8 ---
 4 files changed, 33 insertions(+), 40 deletions(-)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index d89c08f81d..8eb9b378dc 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -44,6 +44,7 @@ static void digestControlFile(ControlFileData *ControlFile,
 			  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
+static TimeLineHistoryEntry *getTimelineHistory(TimeLineID tli, bool is_source, int *nentries);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
 static void ensureCleanShutdown(const char *argv0);
 static void disconnect_atexit(void);
@@ -67,9 +68,11 @@ bool		dry_run = false;
 bool		do_sync = true;
 bool		restore_wal = false;
 
-/* Target history */
+/* Source target timeline histories */
 TimeLineHistoryEntry *targetHistory;
 int			targetNentries;
+static TimeLineHistoryEntry *sourceHistory;
+static int	sourceNentries;
 
 /* Progress counters */
 uint64		fetch_size;
@@ -129,6 +132,8 @@ main(int argc, char **argv)
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
 	XLogRecPtr	chkptredo;
+	TimeLineID	source_tli;
+	TimeLineID	target_tli;
 	XLogRecPtr	target_wal_endrec;
 	size_t		size;
 	char	   *buffer;
@@ -325,14 +330,28 @@ main(int argc, char **argv)
 
 	sanityChecks();
 
+	/*
+	 * Usually, the TLI can be found in the latest checkpoint record. But if
+	 * the source server is just being promoted (or it's a standby that's
+	 * following a primary that's just being promoted), and the checkpoint
+	 * requested by the promotion hasn't completed yet, the latest timeline
+	 * is in minRecoveryPoint. So we check which is later, the TLI of the
+	 * minRecoveryPoint or the latest checkpoint.
+	 */
+	source_tli = Max(ControlFile_source.minRecoveryPointTLI,
+	 ControlFile_source.checkPointCopy.ThisTimeLineID);
+
+	/* Similarly for the target. */
+	target_tli = Max(ControlFile_target.minRecoveryPointTLI,
+	 ControlFile_target.checkPointCopy.ThisTimeLineID);
+
 	/*
 	 * Find the common ancestor timeline between the clusters.
 	 *
 	 * If both clusters are already on the same timeline, there's nothing to
 	 * do.
 	 */
-	if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
-		ControlFile_source.checkPointCopy.ThisTimeLineID)
+	if (target_tli == source_tli)
 	{
 		pg_log_info("source and target cluster are on the same timeline"

Re: Yet another fast GiST build

2020-12-07 Thread Peter Geoghegan

On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:
> Here's version with tests and docs. I still have no idea how to print some 
> useful information about tuples keys.

I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

-- 
Peter Geoghegan

Re: Change default of checkpoint_completion_target

2020-12-07 Thread Peter Eisentraut


On 2020-12-07 18:53, Stephen Frost wrote:

* Michael Paquier (mich...@paquier.xyz) wrote:

On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:

You keep making this statement, and I don't necessarily disagree, but if
that is the case, please explain why don't we have
checkpoint_completion_target set to 0.9 by default?  Should we change
that?


Yes, I do think we should change that..


Agreed.  FWIW, no idea for others, but it is one of those parameters I
keep telling to update after a default installation.


Concretely, attached is a patch which changes the default and updates
the documentation accordingly.


I agree with considering this change, but I wonder why the value 0.9. 
Why not, say, 0.95, 0.99, or 1.0?

Re: [HACKERS] [PATCH] Generic type subscripting

I decided that the way to get this moved forward is to ignore the jsonb
parts for the moment and focus on getting the core feature into
committable shape.  It's possible that the lack of a concrete use-case
other than arrays will cause us to miss a detail or two, but if so we
can fix it later, I think.  (We should make sure to get the jsonb parts
in for v14, though, before we ship this.)

Accordingly, I went through all of the core and array code and dealt
with a lot of details that hadn't gotten seen to, including pg_dump
support and dependency considerations.  I ended up rewriting the
parser code pretty heavily, because I didn't like the original either
from an invasiveness or usability standpoint.  I also did the thing
I suggested earlier of using separate handler functions for varlena
and fixed-length arrays, though I made no effort to separate the code
paths.  I think the attached v37 is committable or nearly so, though
there remain a few loose ends:

1. I'm still wondering if TypeParamBool is the right thing to pass to
LLVMFunctionType() to describe a function-returning-bool.  It does
seem to work on x64_64 and aarch64, for what that's worth.

2. I haven't pulled the trigger on merging the three now-identical
execution step types.  That could be done separately, of course.

3. There are some semantic details that probably need debating.
As I wrote in subscripting.h:

 * There are some general restrictions on what subscripting can do.  The
 * planner expects subscripting fetches to be strict (i.e., return NULL for
 * any null input), immutable (same inputs always give same results), and
 * leakproof (data-value-dependent errors must not be thrown; in other
 * words, you must silently return NULL for any bad subscript value).
 * Subscripting assignment need not be, and usually isn't, strict; it need
 * not be leakproof either; but it must be immutable.

I doubt that there's anything too wrong with assuming immutability
of SubscriptingRef.  And perhaps strictness is OK too, though I worry
about somebody deciding that it'd be cute to define a NULL subscript
value as doing something special.  But the leakproofness assumption
seems like a really dangerous one.  All it takes is somebody deciding
that they should throw an error for a bad subscript, and presto, we
have a security hole.

What I'm slightly inclined to do here, but did not do in the attached
patch, is to have check_functions_in_node look at the leakproofness
marking of the subscripting support function; that's a bit of a hack,
but since the support function can't be called from SQL there is no
other use for its proleakproof flag.  Alternatively we could extend
the SubscriptingRoutine return struct to include some flags saying
whether fetch and/or assignment is leakproof.

BTW, right now check_functions_in_node() effectively treats
SubscriptingRef as unconditionally leakproof.  That's okay for the
array "fetch" code, but the array "store" code does throw errors
for bad subscripts, meaning it's not leakproof.  The only reason
this isn't a live security bug is that "store" SubscriptingRefs can
only occur in the targetlist of an INSERT or UPDATE, which is not
a place that could access any variables we are trying to protect
with the leakproofness mechanism.  (If it could, you could just
store the variable's value directly into another table.)  Still,
that's a shaky chain of reasoning.  The patch's change to
check_functions_in_node() would work in the back branches, so
I'm inclined to back-patch it that way even if we end up doing
something smarter in HEAD.

Comments?

regards, tom lane

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 2d44df19fe..ca2f9f3215 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -426,23 +426,28 @@ foreign_expr_walker(Node *node,
 	return false;
 
 /*
- * Recurse to remaining subexpressions.  Since the container
- * subscripts must yield (noncollatable) integers, they won't
- * affect the inner_cxt state.
+ * Recurse into the remaining subexpressions.  The container
+ * subscripts will not affect collation of the SubscriptingRef
+ * result, so do those first and reset inner_cxt afterwards.
  */
 if (!foreign_expr_walker((Node *) sr->refupperindexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
+inner_cxt.collation = InvalidOid;
+inner_cxt.state = FDW_COLLATE_NONE;
 if (!foreign_expr_walker((Node *) sr->reflowerindexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
+inner_cxt.collation = InvalidOid;
+inner_cxt.state = FDW_COLLATE_NONE;
 if (!foreign_expr_walker((Node *) sr->refexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
 
 /*
- * Container subscripting should yield same collation as
- * input, but for safety use same logic as for function nodes.
+ * Container subscripting typically yields same collation as
+ * refexp

Re: Change default of checkpoint_completion_target

2020-12-07 Thread Stephen Frost

Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 2020-12-07 18:53, Stephen Frost wrote:
> >* Michael Paquier (mich...@paquier.xyz) wrote:
> >>On Sun, Dec 06, 2020 at 10:03:08AM -0500, Stephen Frost wrote:
> >>>* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> You keep making this statement, and I don't necessarily disagree, but if
> that is the case, please explain why don't we have
> checkpoint_completion_target set to 0.9 by default?  Should we change
> that?
> >>>
> >>>Yes, I do think we should change that..
> >>
> >>Agreed.  FWIW, no idea for others, but it is one of those parameters I
> >>keep telling to update after a default installation.
> >
> >Concretely, attached is a patch which changes the default and updates
> >the documentation accordingly.
> 
> I agree with considering this change, but I wonder why the value 0.9. Why
> not, say, 0.95, 0.99, or 1.0?

The documentation (which my patch updates to match the new default)
covers this pretty well here:

https://www.postgresql.org/docs/current/wal-configuration.html

"Although checkpoint_completion_target can be set as high as 1.0, it is
best to keep it less than that (perhaps 0.9 at most) since checkpoints
include some other activities besides writing dirty buffers. A setting
of 1.0 is quite likely to result in checkpoints not being completed on
time, which would result in performance loss due to unexpected variation
in the number of WAL segments needed."

Thanks,

Stephen

signature.asc
Description: PGP signature

small cleanup in unicode_norm.c

2020-12-07 Thread John Naylor

We've had get_canonical_class() for a while as a backend-only function.
There is some ad-hoc code elsewhere that implements the same logic in a
couple places, so it makes sense for all sites to use this function
instead, as in the attached.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


use-canonical-class-func.patch
Description: Binary data

Re: [HACKERS] [PATCH] Generic type subscripting

Hi,

On 2020-12-07 14:08:35 -0500, Tom Lane wrote:
> 1. I'm still wondering if TypeParamBool is the right thing to pass to
> LLVMFunctionType() to describe a function-returning-bool.  It does
> seem to work on x64_64 and aarch64, for what that's worth.

> - v_ret = build_EvalXFunc(b, mod, 
> "ExecEvalSubscriptingRef",
> - 
> v_state, op);
> + param_types[0] = l_ptr(StructExprState);
> + param_types[1] = l_ptr(TypeSizeT);
> + param_types[2] = 
> l_ptr(StructExprContext);
> +
> + v_functype = 
> LLVMFunctionType(TypeParamBool,
> + 
>   param_types,
> + 
>   lengthof(param_types),
> + 
>   false);
> + v_func = 
> l_ptr_const(op->d.sbsref_subscript.subscriptfunc,
> + 
>  l_ptr(v_functype));
> +
> + v_params[0] = v_state;
> + v_params[1] = l_ptr_const(op, 
> l_ptr(TypeSizeT));
> + v_params[2] = v_econtext;
> + v_ret = LLVMBuildCall(b,
> + 
>   v_func,
> + 
>   v_params, lengthof(v_params), "");
>   v_ret = LLVMBuildZExt(b, v_ret, 
> TypeStorageBool, "");

The TypeParamBool stuff here is ok. Basically LLVM uses a '1bit' integer
to represent booleans in the IR. But when it comes to storing such a
value in memory, it uses 1 byte, for obvious reasons. Hence the two
types.

We infer it like this:

> /*
>  * Clang represents stdbool.h style booleans that are returned by functions
>  * differently (as i1) than stored ones (as i8). Therefore we do not just need
>  * TypeBool (above), but also a way to determine the width of a returned
>  * integer. This allows us to keep compatible with non-stdbool using
>  * architectures.
>  */
> extern bool FunctionReturningBool(void);
> bool
> FunctionReturningBool(void)
> {
>   return false;
> }

so you should be good.


I think it'd be a better to rely on the backend's definition of
ExecEvalBoolSubroutine etc. For the functions implementing expression
steps I've found that far easier to work with over time (because you can
get LLVM to issue type mismatch errors when the signature changes,
instead of seeing compile failures).

I've attached a prototype conversion for two other such places. Which
immediately pointed to a bug. And one harmless issue (using a pointer to
size_t instead of ExprEvalOp* to represent the 'op' parameter), which
you promptly copied...

If I pushed a slightly cleaned up version of that, it should be fairly
easy to adapt your code to it, I think?


WRT the prototype, I think it may be worth removing most of the types
from llvmjit.h. Worth keeping the most common ones, but most aren't used
all the time so terseness doesn't matter that much, and
the llvm_pg_var_type() would suffice.

Greetings,

Andres Freund
>From 679a47a7be220576845c50092a1cccd8303d4535 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 7 Dec 2020 13:16:55 -0800
Subject: [PATCH] jit: wip: reference function types instead of re-creating
 them.

Also includes a bugfix that needs to be split out and backpatched.
---
 src/include/jit/llvmjit.h|  2 +
 src/backend/jit/llvm/llvmjit.c   | 95 +---
 src/backend/jit/llvm/llvmjit_expr.c  | 35 ++
 src/backend/jit/llvm/llvmjit_types.c |  4 ++
 4 files changed, 71 insertions(+), 65 deletions(-)

diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index 325409acd5c..1c89075eaff 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -92,6 +92,8 @@ extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
 extern char *llvm_expand_funcname(LLVMJitContext *context, const char *basename);
 extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
 extern void llvm_split_symbol_name(const char *name, char **modname, char **funcname);
+extern LLVMTypeRef llvm_pg_var_type(const char *varname);
+extern LLVMTypeRef llvm_pg_var_func_type(const char *varname);
 extern LLVMValueRef llvm_pg_func(LLVMModuleRef mod, const char *funcname);
 extern void llvm_copy_attributes(LLVMValueRef from, LLVMValueRef to);
 extern LLVMValueRef llvm_function_reference(LLVMJitContext *context,
diff --git

Re: [HACKERS] [PATCH] Generic type subscripting

Andres Freund  writes:
> The TypeParamBool stuff here is ok. Basically LLVM uses a '1bit' integer
> to represent booleans in the IR. But when it comes to storing such a
> value in memory, it uses 1 byte, for obvious reasons. Hence the two
> types.

Cool, thanks for taking a look.

> I think it'd be a better to rely on the backend's definition of
> ExecEvalBoolSubroutine etc. For the functions implementing expression
> steps I've found that far easier to work with over time (because you can
> get LLVM to issue type mismatch errors when the signature changes,
> instead of seeing compile failures).

I'm a little unclear on what you mean here?  There wasn't such a
thing as ExecEvalBoolSubroutine until I added it in this patch.

> I've attached a prototype conversion for two other such places. Which
> immediately pointed to a bug. And one harmless issue (using a pointer to
> size_t instead of ExprEvalOp* to represent the 'op' parameter), which
> you promptly copied...
> If I pushed a slightly cleaned up version of that, it should be fairly
> easy to adapt your code to it, I think?

Sure.  I just copied the existing code for EEOP_PARAM_CALLBACK;
if that changes, I'll just copy the new code.

What did you think of the idea of merging EEOP_SBSREF_OLD / ASSIGN / FETCH
into a single step type distinguished only by the callback function?

regards, tom lane

Re: Add table access method as an option to pgbench

2020-12-07 Thread David Zhang


Again, thanks a lot for the feedback.

On 2020-12-02 12:03 p.m., Fabien COELHO wrote:


Hello David,

Some feedback about v4.

It looks that the option is *silently* ignored when creating a 
partitionned table, which currently results in an error, and not 
passed to partitions, which would accept them. This is pretty weird.
The input check is added with an error message when both partitions 
and table access method are used.


Hmmm. If you take the resetting the default, I do not think that you 
should have to test anything? AFAICT the access method is valid on 
partitions, although not on the partitioned table declaration. So I'd 
say that you could remove the check.
Yes, it makes sense to remove the *blocking* check, and actually the 
table access method interface does work with partitioned table. I tested 
this/v5 by duplicating the heap access method with a different name. For 
this reason, I removed the condition check as well when applying the 
table access method.


They should also trigger failures, eg 
"--table-access-method=no-such-table-am", to be added to the @errors 
list.
No sure how to address this properly, since the table access method 
potentially can be *any* name.


I think it is pretty unlikely that someone would chose the name 
"no-such-table-am" when developing a new storage engine for Postgres 
inside Postgres, so that it would make this internal test fail.


There are numerous such names used in tests, eg "no-such-database", 
"no-such-command".


So I'd suggest to add such a test to check for the expected failure.
The test case "no-such-table-am" has been added to the errors list, and 
the regression test is ok.



I do not understand why you duplicated all possible option entry.

Test with just table access method looks redundant if the feature is 
make to work orthonogonally to partitions, which should be the case.
Only one positive test case added using *heap* as the table access 
method to verify the initialization.


Yep, only "heap" if currently available for tables.


Thanks,
--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From 1d6f434d56c36f95da82d1bae4f99bb917351c08 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Mon, 7 Dec 2020 13:42:00 -0800
Subject: [PATCH] add table access method as an option to pgbench

---
 doc/src/sgml/ref/pgbench.sgml| 10 
 src/bin/pgbench/pgbench.c| 25 +++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..992fbbdb4b 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -359,6 +359,16 @@ pgbench  options 
 d
   
  
 
+ 
+  
--table-access-method=TABLEAM
+  
+   
+Create tables using the specified table access method, rather than the 
default.
+tablespace.
+   
+  
+ 
+ 
  
   
--tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..eb91df6d6b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -184,10 +184,11 @@ doublethrottle_delay = 0;
 int64  latency_limit = 0;
 
 /*
- * tablespace selection
+ * tablespace and table access method selection
  */
 char  *tablespace = NULL;
 char  *index_tablespace = NULL;
+char  *table_access_method = NULL;
 
 /*
  * Number of "pgbench_accounts" partitions.  0 is the default and means no
@@ -641,6 +642,9 @@ usage(void)
   "  --partition-method=(range|hash)\n"
   "   partition pgbench_accounts with 
this method (default: range)\n"
   "  --partitions=NUM partition pgbench_accounts into 
NUM parts (default: 0)\n"
+  "  --table-access-method=TABLEAM\n"
+  "   create tables with using 
specified table access method,\n"
+  "   rather than the default.\n"
   "  --tablespace=TABLESPACE  create tables in the specified 
tablespace\n"
   "  --unlogged-tablescreate tables as unlogged 
tables\n"
   "\nOptions to select what to run:\n"
@@ -3747,6 +3751,20 @@ initCreateTables(PGconn *con)
 
fprintf(stderr, "creating tables...\n");
 
+   if (table_access_method != NULL)
+   {
+   char   *escaped_table_access_method;
+
+   initPQExpBuffer(&query);
+   escaped_table_access_method = PQescapeIdentifier(con,
+   table_access_method, 
strlen(table_access_method));
+   appendPQExpBuffer(&query, "set default_table_access_method to 
%s;\n",
+   escaped_table_access_method);
+   PQfreemem(escaped_table_access_method);
+

Re: Commitfest statistics

Anastasia Lubennikova  writes:
> I want to share some stats and thoughts about CF.

First, thanks again for managing this CF!

> The first is a graph with the numbers of committed, moved, returned, and 
> rejected CF patches over time - [cf_items_status.png]. Credits to Dmitry 
> Dolgov for sharing his scripts to gather this stat.

Yeah, that is a very interesting graph.  It shows that our actual
throughput of resolving patches has held more or less steady, which
isn't very surprising because the available person-power has not
changed much in the last couple of years.  But it looks like the
number of cans getting kicked down the road is progressively
increasing.  That's not something we can sustain indefinitely.

> Besides, I noticed that we have a lot of long-living discussions. And I 
> was curious what is the chance to get something committed after several 
> CFs. The graph is in [num_commitfests.png]. So, most entries make it to 
> release after just one or two commitfests.

It's hard to see anything in this graph about what happens after the
first couple of CFs.  Maybe if you re-did it with a log Y axis, the
tail would be more readable?

> I think that the issue here is that the commitfest application now 
> serves two different purposes:

> Firstly, we use it to track patches that we want to see in the nearest 
> releases and concentrate our efforts on. And current CFM guideline [1] 
> reflects this idea. It suggests, that after the commitfest closure date 
> we relentlessly throw to RWF patches that got at least some feedback. To 
> be honest, I was reluctant to return semi-ready patches, because it 
> means that they will get lost somewhere in mailing lists. And it seems 
> like other CFMs did the same.
> On the other hand, we use Commitfest to track important entries that we 
> want to remember at least once in a while. You can find many examples in 
> the 'Bug Fixes' group of patches. They are too serious to move them to 
> TODO list, yet too complex and/or rare to move on. And such entries 
> simply move from one CF to another.

Yeah, the aggressive policy suggested in "Sudden Death Overtime" is
certainly not what's been followed lately.  I agree that that's
probably too draconic.  On the other hand, if a patch sits in the
queue for several CFs without getting committed, that suggests that
maybe we ought to reject it on the grounds of "apparently nobody but
the author cares about this".  That argument is easier to make for
features than bug fixes of course, so maybe the policy needs to
distinguish what kind of change is being considered.

regards, tom lane

Re: [HACKERS] [PATCH] Generic type subscripting

Hi,

On 2020-12-07 16:32:32 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I think it'd be a better to rely on the backend's definition of
> > ExecEvalBoolSubroutine etc. For the functions implementing expression
> > steps I've found that far easier to work with over time (because you can
> > get LLVM to issue type mismatch errors when the signature changes,
> > instead of seeing compile failures).
> 
> I'm a little unclear on what you mean here?  There wasn't such a
> thing as ExecEvalBoolSubroutine until I added it in this patch.

Basically that I suggest doing what I did in the prototype patch I
attached, mirroring what it did with TypeExecEvalSubroutine for the new
ExecEvalBoolSubroutine case.

> What did you think of the idea of merging EEOP_SBSREF_OLD / ASSIGN / FETCH
> into a single step type distinguished only by the callback function?

I don't have a strong opinion on this. I guess find it a bit easier to
understand the generated "program" if the opcodes are distinct (I've a
pending patch printing the opcode sequence). Especially as the payload
is just function pointers.

So I think I'd just merge the *implementation* of the steps, but leave
the different opcodes around?

Greetings,

Andres Freund

Re: [HACKERS] [PATCH] Generic type subscripting

Andres Freund  writes:
> On 2020-12-07 16:32:32 -0500, Tom Lane wrote:
>> What did you think of the idea of merging EEOP_SBSREF_OLD / ASSIGN / FETCH
>> into a single step type distinguished only by the callback function?

> I don't have a strong opinion on this. I guess find it a bit easier to
> understand the generated "program" if the opcodes are distinct (I've a
> pending patch printing the opcode sequence). Especially as the payload
> is just function pointers.
> So I think I'd just merge the *implementation* of the steps, but leave
> the different opcodes around?

Fair enough.  It wasn't entirely clear to me whether it'd be kosher to
write
EEO_CASE(EEOP_SBSREF_OLD)
EEO_CASE(EEOP_SBSREF_ASSIGN)
EEO_CASE(EEOP_SBSREF_FETCH)
{
// do something
EEO_NEXT();
}

I can see that that should work for the two existing implementations
of EEO_CASE, but I wasn't sure if you wanted to wire in an assumption
that it'll always work.

regards, tom lane

Re: [HACKERS] [PATCH] Generic type subscripting

Hi,

On 2020-12-07 17:25:41 -0500, Tom Lane wrote:
> Fair enough.  It wasn't entirely clear to me whether it'd be kosher to
> write
>   EEO_CASE(EEOP_SBSREF_OLD)
>   EEO_CASE(EEOP_SBSREF_ASSIGN)
>   EEO_CASE(EEOP_SBSREF_FETCH)
>   {
>   // do something
>   EEO_NEXT();
>   }
> 
> I can see that that should work for the two existing implementations
> of EEO_CASE, but I wasn't sure if you wanted to wire in an assumption
> that it'll always work.

I don't think it's likely to be a problem, and if it ends up being one,
we can still deduplicate the ops at that point...

Greetings,

Andres Freund

Re: Minor documentation error regarding streaming replication protocol

2020-12-07 Thread Jeff Davis

On Thu, 2020-12-03 at 13:14 -0500, Bruce Momjian wrote:
> How do we mandate that?  Just mention it in the docs and C comments?

Can't we just throw an error in pg_create_restore_point() if any high
bits are set? And perhaps error or Assert at a later point in case we
change the code in such a way as to let non-ASCII data in there.

This is probably the best approach, and it builds on your previous
patch (which already got consensus) rather than replacing it.

Regards,
Jeff Davis

Re: Commitfest statistics

On 2020-Dec-07, Tom Lane wrote:

> Anastasia Lubennikova  writes:

> > Firstly, we use it to track patches that we want to see in the nearest 
> > releases and concentrate our efforts on. And current CFM guideline [1] 
> > reflects this idea. It suggests, that after the commitfest closure date 
> > we relentlessly throw to RWF patches that got at least some feedback. To 
> > be honest, I was reluctant to return semi-ready patches, because it 
> > means that they will get lost somewhere in mailing lists. And it seems 
> > like other CFMs did the same.
> 
> Yeah, the aggressive policy suggested in "Sudden Death Overtime" is
> certainly not what's been followed lately.  I agree that that's
> probably too draconic.  On the other hand, if a patch sits in the
> queue for several CFs without getting committed, that suggests that
> maybe we ought to reject it on the grounds of "apparently nobody but
> the author cares about this".  That argument is easier to make for
> features than bug fixes of course, so maybe the policy needs to
> distinguish what kind of change is being considered.

Note that this checklist was written in 2013 and has never been updated
since then.  I think there is nothing in that policy that we do use.
I'm thinking that rather than try to fine-tune that document, we ought
to rewrite one from scratch.

For one thing, "a beer or three" only at end of CF is surely not
sufficient.

Re: Add header support to text format and matching feature

2020-12-07 Thread Rémi Lapeyre

Hi, here’s a rebased version of the patch.

Best regards,
Rémi



v7-0001-Add-header-support-to-COPY-TO-text-format.patch
Description: Binary data


v7-0002-Add-header-matching-mode-to-COPY-FROM.patch
Description: Binary data



> On 21 Oct 2020, at 19:49, Daniel Verite  wrote:
> 
>   Rémi Lapeyre wrote:
> 
>> It looks like this is not in the current commitfest and that Cabot does not
>> find it. I’m not yet accustomed to the PostgreSQL workflow, should I just
>> create a new entry in the current commitfest?
> 
> Yes. Because in the last CommitFest it was marked
> as "Returned with feedback"
> https://commitfest.postgresql.org/29/2504/
> 
> 
> Best regards,
> -- 
> Daniel Vérité
> PostgreSQL-powered mailer: https://www.manitou-mail.org
> Twitter: @DanielVerite
> 
> 
> 
>

Re: Consider parallel for lateral subqueries with limit

2020-12-07 Thread James Coleman

Brian's email didn't keep the relevant headers, and so didn't show up
as a reply, so I've pasted it here and am replying in this original
thread. You can find the original at [1].

On Sun, Dec 6, 2020 at 7:34 PM Brian Davis  wrote:
>
> > Note that near the end of grouping planner we have a similar check:
> >
> > if (final_rel->consider_parallel && root->query_level > 1 &&
> > !limit_needed(parse))
> >
> > guarding copying the partial paths from the current rel to the final
> > rel. I haven't managed to come up with a test case that exposes that
>
> Played around with this a bit, here's a non-correlated subquery that gets us 
> to that if statement
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (bar int);
>
> INSERT INTO foo (bar)
> SELECT
>   g
> FROM
>   generate_series(1, 1) AS g;
>
>
> SELECT
>   (
> SELECT
>   bar
> FROM
>   foo
> LIMIT 1
>   ) AS y
> FROM
>   foo;

Thanks. That triggers the parallel if case for limit -- any additional
GUCs need to be modified to get that result? I assume regardless a
parallel plan isn't chosen (even if you remove the limit needed
check)?

> I also was thinking about the LATERAL part.
>
> I couldn't think of any reason why the uncorrelated subquery's results would 
> need to be shared and therefore the same, when we'll be "looping" over each 
> row of the source table, running the subquery anew for each, conceptually.
>
> But then I tried this...
>
> test=# CREATE TABLE foo (bar int);
> CREATE TABLE
> test=#
> test=# INSERT INTO foo (bar)
> test-# SELECT
> test-#   g
> test-# FROM
> test-#   generate_series(1, 10) AS g;
> INSERT 0 10
> test=#
> test=#
> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(# SELECT
> test(#   bar
> test(# FROM
> test(#   foo AS foo2
> test(# ORDER BY
> test(#   random()
> test(# LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -+-
>1 |   7
>2 |   7
>3 |   7
>4 |   7
>5 |   7
>6 |   7
>7 |   7
>8 |   7
>9 |   7
>   10 |   7
> (10 rows)
>
>
> As you can see, random() is only called once.  If postgres were supposed to 
> be running the subquery for each source row, conceptually, it would be a 
> mistake to cache the results of a volatile function like random().

This was genuinely surprising to me. I think one could argue that this
is just an optimization (after all -- if there is no correlation, then
running it once is conceptually/safely the same as running it multiple
times), but that doesn't seem to hold water with the volatile function
in play.

Of course, given the volatile function we'd never parallelize this
anyway. But we still have to consider the case where the result is
otherwise ordered differently between workers (just by virtue of disk
order, for example).

I've tried the above query using tenk1 from the regress tests to get a
bit more data, and, along with modifying several GUCs, can force
parallel plans. However in no case can I get it to execute that
uncorrelated lateral in multiple workers. That makes me suspicious
that there's another check in play here ensuring the lateral subquery
is executed for each group, and that in the uncorrelated case really
that rule still holds -- it's just a single group.

> The docs say: "When a FROM item contains LATERAL cross-references, evaluation 
> proceeds as follows: for each row of the FROM item providing the 
> cross-referenced column(s), or set of rows of multiple FROM items providing 
> the columns, the LATERAL item is evaluated using that row or row set's values 
> of the columns. The resulting row(s) are joined as usual with the rows they 
> were computed from. This is repeated for each row or set of rows from the 
> column source table(s)."
>
> They don't say what happens with LATERAL when there aren't cross-references 
> though.  As we expect, adding one does show random() being called once for 
> each source row.

If my theory above is correct then it's implicit that the row set is
the whole previous FROM group.

> test=# SELECT
> test-#   foo.bar,
> test-#   lat.bar
> test-# FROM
> test-#   foo JOIN LATERAL (
> test(# SELECT
> test(#   bar
> test(# FROM
> test(#   foo AS foo2
> test(# WHERE
> test(#   foo2.bar < foo.bar + 10
> test(# ORDER BY
> test(#   random()
> test(# LIMIT 1
> test(#   ) AS lat ON true;
>  bar | bar
> -+-
>1 |   5
>2 |   8
>3 |   3
>4 |   4
>5 |   5
>6 |   5
>7 |   1
>8 |   3
>9 |   7
>   10 |   3
> (10 rows)
>
> It seems like to keep the same behavior that exists today, results of LATERAL 
> subqueries would need to be the same if they aren't correlated, and so you 
> couldn't run them in parallel with a limit if the order wasn't guaranteed.  
> But I'll be the first to admit that it's easy enough for me to miss a key 
> piece of logic on something like this, so I could be way off base too.

If it weren't for the

Re: range_agg

2020-12-07 Thread Alexander Korotkov

On Mon, Nov 30, 2020 at 11:39 PM Alexander Korotkov
 wrote:
> On Mon, Nov 30, 2020 at 10:35 PM Alexander Korotkov
>  wrote:
> > On Sun, Nov 29, 2020 at 11:53 PM Paul A Jungwirth
> >  wrote:
> > >
> > > On Sun, Nov 29, 2020 at 11:43 AM Alexander Korotkov
> > >  wrote:
> > > > Thank you.  Could you please, update doc/src/sgml/catalogs.sgml,
> > > > because pg_type and pg_range catalogs are updated.
> > >
> > > Attached! :-)
> >
> > You're quick, thank you.  Please, also take a look at cfbot failure
> > https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/746623942
> > I've tried to reproduce it, but didn't manage yet.
>
> Got it.  type_sanity test fails on any platform, you just need to
> repeat "make check" till it fails.
>
> The failed query checked consistency of range types, but it didn't
> take into account ranges of domains and ranges of records, which are
> exercised by multirangetypes test running in parallel.  We could teach
> this query about such kinds of ranges, but I think that would be
> overkill, because we're not going to introduce such builtin ranges
> yet.  So, I'm going to just move multirangetypes test into another
> group of parallel tests.

I also found a problem in multirange types naming logic.  Consider the
following example.

create type a_multirange AS (x float, y float);
create type a as range(subtype=text, collation="C");
create table tbl (x __a_multirange);
drop type a_multirange;

If you dump this database, the dump couldn't be restored.  The
multirange type is named __a_multirange, because the type named
a_multirange already exists.  However, it might appear that
a_multirange type is already deleted.  When the dump is restored, a
multirange type is named a_multirange, and the corresponding table
fails to be created.  The same thing doesn't happen with arrays,
because arrays are not referenced in dumps by their internal names.

I think we probably should add an option to specify multirange type
names while creating a range type.  Then dump can contain exact type
names used in the database, and restore wouldn't have a names
collision.

Another thing that worries me is the multirange serialization format.

typedef struct
{
int32   vl_len_;/* varlena header */
charflags;  /* range flags */
char_padding;   /* Bounds must be aligned */
/* Following the header are zero to two bound values. */
} ShortRangeType;

Comment says this structure doesn't contain a varlena header, while
structure obviously has it.

In general, I wonder if we can make the binary format of multiranges
more efficient.  It seems that every function involving multiranges
from multirange_deserialize().  I think we can make functions like
multirange_contains_elem() much more efficient.  Multirange is
basically an array of ranges.  So we can pack it as follows.
1. Typeid and rangecount
2. Tightly packed array of flags (1-byte for each range)
3. Array of indexes of boundaries (4-byte for each range).  Or even
better we can combine offsets and lengths to be compression-friendly
like jsonb JEntry's do.
4. Boundary values
Using this format, we can implement multirange_contains_elem(),
multirange_contains_range() without deserialization and using binary
search.  That would be much more efficient.  What do you think?

--
Regards,
Alexander Korotkov

Re: range_agg

On 2020-Dec-08, Alexander Korotkov wrote:

> I also found a problem in multirange types naming logic.  Consider the
> following example.
> 
> create type a_multirange AS (x float, y float);
> create type a as range(subtype=text, collation="C");
> create table tbl (x __a_multirange);
> drop type a_multirange;
> 
> If you dump this database, the dump couldn't be restored.  The
> multirange type is named __a_multirange, because the type named
> a_multirange already exists.  However, it might appear that
> a_multirange type is already deleted.  When the dump is restored, a
> multirange type is named a_multirange, and the corresponding table
> fails to be created.  The same thing doesn't happen with arrays,
> because arrays are not referenced in dumps by their internal names.
> 
> I think we probably should add an option to specify multirange type
> names while creating a range type.  Then dump can contain exact type
> names used in the database, and restore wouldn't have a names
> collision.

Hmm, good point.  I agree that a dump must preserve the name, since once
created it is user-visible.  I had not noticed this problem, but it's
obvious in retrospect.

> In general, I wonder if we can make the binary format of multiranges
> more efficient.  It seems that every function involving multiranges
> from multirange_deserialize().  I think we can make functions like
> multirange_contains_elem() much more efficient.  Multirange is
> basically an array of ranges.  So we can pack it as follows.
> 1. Typeid and rangecount
> 2. Tightly packed array of flags (1-byte for each range)
> 3. Array of indexes of boundaries (4-byte for each range).  Or even
> better we can combine offsets and lengths to be compression-friendly
> like jsonb JEntry's do.
> 4. Boundary values
> Using this format, we can implement multirange_contains_elem(),
> multirange_contains_range() without deserialization and using binary
> search.  That would be much more efficient.  What do you think?

I also agree.  I spent some time staring at the I/O code a couple of
months back but was unable to focus on it for long enough.  I don't know
JEntry's format, but I do remember that the storage format for JSONB was
widely discussed back then; it seems wise to apply similar logic or at
least similar reasoning.

Re: On login trigger: take three

2020-12-07 Thread Greg Nancarrow

On Fri, Dec 4, 2020 at 9:05 PM Konstantin Knizhnik
 wrote:
>
> As far as I understand Pavel concern was about the case when superuser
> defines wrong login trigger which prevents login to the system
> all user including himself. Right now solution of this problem is to
> include "options='-c disable_session_start_trigger=true'" in connection
> string.
> I do not know if it can be done with pgAdmin.
> >

As an event trigger is tied to a particular database, and a GUC is
global to the cluster, as long as there is one database in the cluster
for which an event trigger for the "client_connection" event is NOT
defined (say the default "postgres" maintenance database), then the
superuser can always connect to that database, issue "ALTER SYSTEM SET
disable_client_connection_trigger TO true" and reload the
configuration. I tested this with pgAdmin4 and it worked fine for me,
to allow login to a database for which login was previously prevented
due to a badly-defined logon trigger.

Pavel, is this an acceptable solution or do you still see problems
with this approach?

Regards,
Greg Nancarrow
Fujitsu Australia

Re: [Patch] Optimize dropping of relation buffers using dlist

At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila  wrote 
in 
> On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com
>  wrote:
> >
> > On Friday, December 4, 2020 8:27 PM, Amit Kapila  
> > wrote:
> > Hi,
> > I have reported before that it is not always the case that the "cached" 
> > flag of
> > srnblocks() return true. So when I checked the truncate test case used in my
> > patch, it does not enter the optimization path despite doing INSERT before
> > truncation of table.
> > The reason for that is because in TRUNCATE, a new RelFileNode is assigned
> > to the relation when creating a new file. In recovery, 
> > XLogReadBufferExtended()
> > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode 
> > for the
> > first time. And for recovery processing, different RelFileNodes are used 
> > for the
> > INSERTs to the table and TRUNCATE to the same table.
> >
> 
> Hmm, how is it possible if Insert is done before Truncate? The insert
> should happen in old RelFileNode only. I have verified by adding a
> break-in (while (1), so that it stops there) heap_xlog_insert and
> DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> How have you verified what you are saying?

You might be thinking of in-transaction sequence of
Inert-truncate. What *I* mention before is truncation of a relation
that smgrnblocks() has already been called for.  The most common way
to make it happen was INSERTs *before* the truncating transaction
starts. It may be a SELECT on a hot-standby.  Sorry for the confusing
expression.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Patch] Optimize dropping of relation buffers using dlist

At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila  
> wrote in 
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?
> 
> You might be thinking of in-transaction sequence of
> Inert-truncate. What *I* mention before is truncation of a relation
> that smgrnblocks() has already been called for.  The most common way
> to make it happen was INSERTs *before* the truncating transaction
> starts. It may be a SELECT on a hot-standby.  Sorry for the confusing
> expression.

And ,to make sure, it is a bit off from the point of the discussion as
I noted.  I just meant that the proposition that "smgrnblokcs() always
returns false for "cached" when it is called in
DropRelFileNodesAllBuffers()" doesn't always holds.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

From: Bharath Rupireddy 
> Do you mean to say that if we detach all the partitions(assuming they
> are all unlogged) then the parent table(assuming logged) gets changed
> to unlogged? Does it happen on master? Am I missing something here?

No, the parent remains logged in that case both on master and with patched.  I 
understand this behavior is based on the idea that (1) each storage unit 
(=partition) is independent, and (2) a partitioned table has no storage so the 
logged/unlogged setting has no meaning.  (I don't know there was actually such 
an idea and the feature was implemented on that idea, though.)


Regards
Takayuki Tsunakawa

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-07 Thread David Rowley

On Mon, 7 Dec 2020 at 14:25, Zhihong Yu  wrote:
>
> > +   /* Make a guess at a good size when we're not given a valid size. */
> > +   if (size == 0)
> > +   size = 1024;
> >
> > Should the default size be logged ?
>
> > I'm not too sure if I know what you mean here. Should it be a power of
> > 2? It is.  Or do you mean I shouldn't use a magic number?
>
> Using 1024 seems to be fine. I meant logging the default value of 1024 so 
> that user / dev can have better idea the actual value used (without looking 
> at the code).

Oh, right. In EXPLAIN ANALYZE. Good point.  I wonder if that's going
to be interesting enough to show.

> >> Or do you think 98% is not a good number?
>
> Since you have played with Result Cache, I would trust your judgment in 
> choosing the percentage.
> It is fine not to expose this constant until the need arises.

I did some experimentation with different values on a workload that
never gets a cache hit. and just always evicts the oldest entry.
There's only very slight changes in performance between 90%, 98% and
100% with 1MB work_mem.

times in milliseconds measured over 60 seconds on each run.

90% 98% 100%
run1 2318 2339 2344
run2 2339 2333 2309
run3 2357 2339 2346
avg (ms) 2338 2337 2333

Perhaps this is an argument for just removing the logic that has the
soft upper limit and just have it do cache evictions after each
allocation after the cache first fills.

Setup: same tables as [1]
alter table hundredk alter column hundredk set (n_distinct = 10);
analyze hundredk;
alter system set work_mem = '1MB';
select pg_reload_conf();

Query
select count(*) from hundredk hk inner join lookup l on hk.hundredk = l.a;

David

[1] 
https://www.postgresql.org/message-id/caaphdvrpcqyqdwergywx8j+2dlungxu+fosbq1uscxrunyx...@mail.gmail.com

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 8, 2020 at 6:23 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila  
> > wrote in
> > > Hmm, how is it possible if Insert is done before Truncate? The insert
> > > should happen in old RelFileNode only. I have verified by adding a
> > > break-in (while (1), so that it stops there) heap_xlog_insert and
> > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > > How have you verified what you are saying?
> >
> > You might be thinking of in-transaction sequence of
> > Inert-truncate. What *I* mention before is truncation of a relation
> > that smgrnblocks() has already been called for.  The most common way
> > to make it happen was INSERTs *before* the truncating transaction
> > starts.

What I have tried is Insert and Truncate in separate transactions like below:
postgres=# insert into mytbl values(1);
INSERT 0 1
postgres=# truncate mytbl;
TRUNCATE TABLE

After above, manually killed the server, and then during recovery, we
have called heap_xlog_insert() and DropRelFileNodesAllBuffers() and at
both places, RelFileNode is the same and I don't see any reason for it
to be different.

> > It may be a SELECT on a hot-standby.  Sorry for the confusing
> > expression.
>
> And ,to make sure, it is a bit off from the point of the discussion as
> I noted.  I just meant that the proposition that "smgrnblokcs() always
> returns false for "cached" when it is called in
> DropRelFileNodesAllBuffers()" doesn't always holds.
>

Right, I feel in some cases the 'cached' won't be true like if we
would have done Checkpoint after Insert in the above case (say when
the only WAL to replay during recovery is of Truncate) but I think
that should be fine. What do you think?


-- 
With Regards,
Amit Kapila.

Re: [Patch] Optimize dropping of relation buffers using dlist

I'm out of it more than usual..

At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila  
> wrote in 
> > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, December 4, 2020 8:27 PM, Amit Kapila 
> > >  wrote:
> > > Hi,
> > > I have reported before that it is not always the case that the "cached" 
> > > flag of
> > > srnblocks() return true. So when I checked the truncate test case used in 
> > > my
> > > patch, it does not enter the optimization path despite doing INSERT before
> > > truncation of table.
> > > The reason for that is because in TRUNCATE, a new RelFileNode is assigned
> > > to the relation when creating a new file. In recovery, 
> > > XLogReadBufferExtended()
> > > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode 
> > > for the
> > > first time. And for recovery processing, different RelFileNodes are used 
> > > for the
> > > INSERTs to the table and TRUNCATE to the same table.
> > >
> > 
> > Hmm, how is it possible if Insert is done before Truncate? The insert
> > should happen in old RelFileNode only. I have verified by adding a
> > break-in (while (1), so that it stops there) heap_xlog_insert and
> > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > How have you verified what you are saying?

It's irrelvant that the insert happens on the old relfilenode. We drop
buffers for the old relfilenode on truncation anyway.

What I did is:

a: Create a physical replication pair.
b: On the master, create a table. (without explicitly starting a tx)
c: On the master, insert a tuple into the table.
d: On the master truncate the table.

On the standby, smgrnblocks is called for the old relfilenode of the
table at c, then the same function is called for the same relfilenode
at d and the function takes the cached path.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi
 wrote:
>
> I'm out of it more than usual..
>
> At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila  
> > wrote in
> > > On Mon, Dec 7, 2020 at 12:32 PM k.jami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Friday, December 4, 2020 8:27 PM, Amit Kapila 
> > > >  wrote:
> > > > Hi,
> > > > I have reported before that it is not always the case that the "cached" 
> > > > flag of
> > > > srnblocks() return true. So when I checked the truncate test case used 
> > > > in my
> > > > patch, it does not enter the optimization path despite doing INSERT 
> > > > before
> > > > truncation of table.
> > > > The reason for that is because in TRUNCATE, a new RelFileNode is 
> > > > assigned
> > > > to the relation when creating a new file. In recovery, 
> > > > XLogReadBufferExtended()
> > > > always opens the RelFileNode and calls smgrnblocks() for that 
> > > > RelFileNode for the
> > > > first time. And for recovery processing, different RelFileNodes are 
> > > > used for the
> > > > INSERTs to the table and TRUNCATE to the same table.
> > > >
> > >
> > > Hmm, how is it possible if Insert is done before Truncate? The insert
> > > should happen in old RelFileNode only. I have verified by adding a
> > > break-in (while (1), so that it stops there) heap_xlog_insert and
> > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> > > How have you verified what you are saying?
>
> It's irrelvant that the insert happens on the old relfilenode.
>

I think it is relevant because it will allow the 'blocks' value to be cached.

> We drop
> buffers for the old relfilenode on truncation anyway.
>
> What I did is:
>
> a: Create a physical replication pair.
> b: On the master, create a table. (without explicitly starting a tx)
> c: On the master, insert a tuple into the table.
> d: On the master truncate the table.
>
> On the standby, smgrnblocks is called for the old relfilenode of the
> table at c, then the same function is called for the same relfilenode
> at d and the function takes the cached path.
>

This is on the lines I have tried for recovery. So, it seems we are in
agreement that we can use the 'cached' flag in
DropRelFileNodesAllBuffers and it will take the optimized path in many
such cases, right?

-- 
With Regards,
Amit Kapila.

Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-12-07 Thread Zhihong Yu

>>  just removing the logic that has the
soft upper limit and just have it do cache evictions after each
allocation after the cache first fills

Yeah - having one fewer limit would simplify the code.

Cheers

On Mon, Dec 7, 2020 at 5:27 PM David Rowley  wrote:

> On Mon, 7 Dec 2020 at 14:25, Zhihong Yu  wrote:
> >
> > > +   /* Make a guess at a good size when we're not given a valid size.
> */
> > > +   if (size == 0)
> > > +   size = 1024;
> > >
> > > Should the default size be logged ?
> >
> > > I'm not too sure if I know what you mean here. Should it be a power of
> > > 2? It is.  Or do you mean I shouldn't use a magic number?
> >
> > Using 1024 seems to be fine. I meant logging the default value of 1024
> so that user / dev can have better idea the actual value used (without
> looking at the code).
>
> Oh, right. In EXPLAIN ANALYZE. Good point.  I wonder if that's going
> to be interesting enough to show.
>
> > >> Or do you think 98% is not a good number?
> >
> > Since you have played with Result Cache, I would trust your judgment in
> choosing the percentage.
> > It is fine not to expose this constant until the need arises.
>
> I did some experimentation with different values on a workload that
> never gets a cache hit. and just always evicts the oldest entry.
> There's only very slight changes in performance between 90%, 98% and
> 100% with 1MB work_mem.
>
> times in milliseconds measured over 60 seconds on each run.
>
> 90% 98% 100%
> run1 2318 2339 2344
> run2 2339 2333 2309
> run3 2357 2339 2346
> avg (ms) 2338 2337 2333
>
> Perhaps this is an argument for just removing the logic that has the
> soft upper limit and just have it do cache evictions after each
> allocation after the cache first fills.
>
> Setup: same tables as [1]
> alter table hundredk alter column hundredk set (n_distinct = 10);
> analyze hundredk;
> alter system set work_mem = '1MB';
> select pg_reload_conf();
>
> Query
> select count(*) from hundredk hk inner join lookup l on hk.hundredk = l.a;
>
> David
>
> [1]
> https://www.postgresql.org/message-id/caaphdvrpcqyqdwergywx8j+2dlungxu+fosbq1uscxrunyx...@mail.gmail.com
>

Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer

Hi all

A new kernel API called io_uring has recently come to my attention. I
assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads,
writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
with or without O_DIRECT. Basically async I/O with usable buffered I/O and
fsync support. It has ordering support which is really important for us.

This should be on our radar. The main barriers to benefiting from linux-aio
based async I/O in postgres in the past has been its reliance on direct
I/O, the various kernel-version quirks, platform portability, and its
maybe-async-except-when-it's-randomly-not nature.

The kernel version and portability remain an issue with io_uring so it's
not like this is something we can pivot over to completely. But we should
probably take a closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for
blocking I/O. If we can improve that then we could potentially realize some
major increases in I/O utilization especially for bigger, less concurrent
workloads. The most obvious candidates to benefit would be redo, logical
apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's executor
pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
in nature, with a push/pull executor pipeline. It seems to have been
recognised for some time that this is increasingly hurting our performance
and scalability as platforms become more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
have to be able to dispatch I/O and do something else while we wait for the
results. So we need the ability to pipeline the executor and pipeline redo.

I thought I'd start the discussion on this and see where we can go with it.
What incremental steps can be done to move us toward parallelisable I/O
without having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't
depend on the guts of the executor. It is much less sensitive to ordering
between operations in shmem and on disk since it runs in the startup
process. And it hurts REALLY BADLY from its single-threaded blocking
approach to I/O - as shown by an extension written by 2ndQuadrant that can
double redo performance by doing read-ahead on btree pages that will soon
be needed.

Thoughts anybody?

Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer

References to get things started:

* https://lwn.net/Articles/810414/
* https://unixism.net/loti/what_is_io_uring.html
*
https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
*
https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

You'll probably notice how this parallels my sporadic activities around
pipelining in other areas, and the PoC libpq pipelining patch I sent in a
few years ago.

Re: Huge memory consumption on partitioned table with FKs

At Tue, 8 Dec 2020 01:16:00 +0900, Amit Langote  wrote 
in 
> Hi Alvaro,
> 
> On Mon, Dec 7, 2020 at 23:48 Alvaro Herrera  wrote:
> 
> > On 2020-Dec-07, Amit Langote wrote:
> >
> > > On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
> > >  wrote:
> > > > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > > > undocumented.  What is the relationship between those two structs?  I
> > > > > see that they have pointers to each other, but I think the
> > relationship
> > > > > should be documented more clearly.
> > > >
> > > > I'm not sure the footprint of this patch worth doing but here is a bit
> > > > more polished version.
> > >
> > > I noticed that the foreign_key test fails and it may have to do with
> > > the fact that a partition's param info remains attached to the
> > > parent's RI_ConstraintInfo even after it's detached from the parent
> > > table using DETACH PARTITION.
> >
> > I think this bit about splitting the struct is a distraction.  Let's get
> > a patch that solves the bug first, and then we can discuss what further
> > refinements we want to do.  I think we should get your patch in
> > CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com
> > committed (which I have not read yet.)  Do you agree with this plan?
> 
> 
> Yeah, I agree.

Or 
https://www.postgresql.org/message-id/ca+hiwqgrr2yoo6vobm6m_oac9w-wmxe1gouq-uydpin6zjt...@mail.gmail.com
 ?

+1 from me to either one.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: Stronger safeguard for archive recovery not to miss data

2020-12-07 Thread osumi.takami...@fujitsu.com

Hi


On Thursday, November 26, 2020 4:29 PM
Kyotaro Horiguchi  wrote:
> At Thu, 26 Nov 2020 07:18:39 +, "osumi.takami...@fujitsu.com"
>  wrote in
> > The attached patch is intended to prevent a scenario that archive
> > recovery hits WALs which come from wal_level=minimal and the server
> > continues to work, which was discussed in the thread of [1].
> > The motivation is to protect that user ends up with both getting
> > replica that could miss data and getting the server to miss data in targeted
> recovery mode.
> >
> > About how to modify this, we reached the consensus in the thread.
> > It is by changing the ereport's level from WARNING to FATAL in
> CheckRequiredParameterValues().
> >
> > In order to test this fix, what I did is
> > 1 - get a base backup during wal_level is replica
> > 2 - stop the server and change the wal_level from replica to minimal
> > 3 - restart the server(to generate XLOG_PARAMETER_CHANGE)
> > 4 - stop the server and make the wal_level back to replica
> > 5 - start the server again
> > 6 - execute archive recoveries in both cases
> > (1) by editing the postgresql.conf and
> > touching recovery.signal in the base backup from 1th step
> > (2) by making a replica with standby.signal
> > * During wal_level is replica, I enabled archive_mode in this test.
> >
> > First of all, I confirmed the server started up without this patch.
> > After applying this safeguard patch, I checked that the server cannot
> > start up any more in the scenario case.
> > I checked the log and got the result below with this patch.
> >
> > 2020-11-26 06:49:46.003 UTC [19715] FATAL:  WAL was generated with
> > wal_level=minimal, data may be missing
> > 2020-11-26 06:49:46.003 UTC [19715] HINT:  This happens if you
> temporarily set wal_level=minimal without taking a new base backup.
> >
> > Lastly, this should be backpatched.
> > Any comments ?
> 
> Perhaps we need the TAP test that conducts the above steps.
I added the TAP tests to reproduce and share the result,
using the case of 6-(1) described above.
Here, I created a new file for it because the purposes of other files in
src/recovery didn't match the purpose of my TAP tests perfectly.
If you are dubious about this idea, please have a look at the comments
in each file.

When the attached patch is applied,
my TAP tests are executed like other ones like below.

t/018_wal_optimize.pl  ok
t/019_replslot_limit.pl .. ok
t/020_archive_status.pl .. ok
t/021_row_visibility.pl .. ok
t/022_archive_recovery.pl  ok
All tests successful.

Also, I confirmed that there's no regression by make check-world.
Any comments ?

Best,
Takamichi Osumi



stronger_safeguard_for_archive_recovery_v02.patch
Description: stronger_safeguard_for_archive_recovery_v02.patch

RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

From: Alvaro Herrera 
> Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
> existing partitions, but cause future partitions to acquire the new
> setting.

Yes, it works correctly in the sense that ALTER TABLE ONLY on a partitioned 
table does nothing because it has no storage and therefore logged/unlogged has 
no meaning.


> This sounds very much related to previous discussion on REPLICA IDENTITY
> not propagating to partitions; see
> https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
> particularly Robert Haas' comments at
> http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=BsP_m+702eOf42fqRtQTYio
> 1nu...@mail.gmail.com
> and downstream discussion from there.

There was a hot discussion.  I've read through it.  I hope I'm not doing 
strange in light of that discussioin.


Regards
Takayuki Tsunakawa

Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Thomas Munro

On Tue, Dec 8, 2020 at 3:56 PM Craig Ringer
 wrote:
> I thought I'd start the discussion on this and see where we can go with it. 
> What incremental steps can be done to move us toward parallelisable I/O 
> without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't depend 
> on the guts of the executor. It is much less sensitive to ordering between 
> operations in shmem and on disk since it runs in the startup process. And it 
> hurts REALLY BADLY from its single-threaded blocking approach to I/O - as 
> shown by an extension written by 2ndQuadrant that can double redo performance 
> by doing read-ahead on btree pages that will soon be needed.

About the redo suggestion: https://commitfest.postgresql.org/31/2410/
does exactly that!  It currently uses POSIX_FADV_WILLNEED because
that's what PrefetchSharedBuffer() does, but when combined with a
"real AIO" patch set (see earlier threads and conference talks on this
by Andres) and a few small tweaks to control batching of I/O
submissions, it does exactly what you're describing.  I tried to keep
the WAL prefetcher project entirely disentangled from the core AIO
work, though, hence the "poor man's AIO" for now.

Re: vac_update_datfrozenxid will raise "wrong tuple length" if pg_database tuple contains toast attribute.

2020-12-07 Thread Michael Paquier

On Mon, Nov 30, 2020 at 04:37:14PM -0800, Ashwin Agrawal wrote:
> I am thinking if there is some way to assert this aspect, but seems no way.
> So, yes, having at least a comment is good for now.

Yeah, I have been thinking about this one without coming to a clear
idea, but that would be nice.

> Yes the attached patch looks good to me for PostgreSQL. Thanks Michael.

Okay, committed to HEAD then.
--
Michael


signature.asc
Description: PGP signature

Re: PG vs LLVM 12 on seawasp, next round

Hi,

On 2020-12-01 12:08:10 -0800, Andres Freund wrote:
> On 2020-12-01 21:04:44 +0100, Fabien COELHO wrote:
> > Andres investigated a few days ago, managed to reproduce the issue locally,
> > and has one line patch. I'm unsure if it should be prevently back-patched,
> > though.
> 
> I see no reason not to backpatch - it's more correct for past versions
> of LLVM as well.

I pushed that now.

- Andres

Re: upcoming API changes for LLVM 12

Hi,

On 2020-11-09 20:13:43 -0800, Andres Freund wrote:
> I pushed the change to master. If that doesn't show any problems, I'll
> backpatch in a week or so. Seawasp runs only on master, so it should
> satisfy the buildfarm at least.

It was a bit longer than a week, but I finally have done so... Let's see
what the world^Wbuildfarm says.

- Andres

Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Andreas Karlsson


On 12/8/20 3:55 AM, Craig Ringer wrote:
A new kernel API called io_uring has recently come to my attention. I 
assume some of you (Andres?) have been following it for a while.


Andres did a talk on this at FOSDEM PGDay earlier this year. You can see 
his slides below, but since they are from January things might have 
changed since then.


https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas

Re: Huge memory consumption on partitioned table with FKs

2020-12-07 Thread Amit Langote

On Tue, Dec 8, 2020 at 12:04 PM Kyotaro Horiguchi
 wrote:
> At Tue, 8 Dec 2020 01:16:00 +0900, Amit Langote  
> wrote in
> > Hi Alvaro,
> >
> > On Mon, Dec 7, 2020 at 23:48 Alvaro Herrera  wrote:
> >
> > > On 2020-Dec-07, Amit Langote wrote:
> > >
> > > > On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
> > > >  wrote:
> > > > > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > > > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > > > > undocumented.  What is the relationship between those two structs?  
> > > > > > I
> > > > > > see that they have pointers to each other, but I think the
> > > relationship
> > > > > > should be documented more clearly.
> > > > >
> > > > > I'm not sure the footprint of this patch worth doing but here is a bit
> > > > > more polished version.
> > > >
> > > > I noticed that the foreign_key test fails and it may have to do with
> > > > the fact that a partition's param info remains attached to the
> > > > parent's RI_ConstraintInfo even after it's detached from the parent
> > > > table using DETACH PARTITION.
> > >
> > > I think this bit about splitting the struct is a distraction.  Let's get
> > > a patch that solves the bug first, and then we can discuss what further
> > > refinements we want to do.  I think we should get your patch in
> > > CA+HiwqEOrfN9b=f3sdmyspgc4go-l_vmfhxllxvmmdp34e6...@mail.gmail.com
> > > committed (which I have not read yet.)  Do you agree with this plan?
> >
> >
> > Yeah, I agree.
>
> Or 
> https://www.postgresql.org/message-id/ca+hiwqgrr2yoo6vobm6m_oac9w-wmxe1gouq-uydpin6zjt...@mail.gmail.com
>  ?
>
> +1 from me to either one.

Oh, I hadn't actually checked the actual message that Alvaro
mentioned, but yeah I too am fine with either that one or the latest
one.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Re: Blocking I/O, async I/O and io_uring

Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about. Things aren't *quite* right yet architecturally,
but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker
based fallback, posix AIO support for freebsd & OSX). Once that's done,
and some of the architectural thins are resolved, I plan to write a long
email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

There's currently no windows AIO support, but it shouldn't be too hard
to add. My preliminary look indicates that we'd likely have to use
overlapped IO with WaitForMultipleObjects(), not IOCP, since we need to
be able to handle latches etc, which seems harder with IOCP. But perhaps
we can do something using the signal handling emulation posting events
onto IOCP instead.

> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> fsync support. It has ordering support which is really important for us.

My results indicate that we really want to have have, optional & not
enabled by default of course, O_DIRECT support. We just can't benefit
fully of modern SSDs otherwise. Buffered is also important, of course.

> But I have no idea how to even begin to fit this into PostgreSQL's executor
> pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
> in nature, with a push/pull executor pipeline. It seems to have been
> recognised for some time that this is increasingly hurting our performance
> and scalability as platforms become more and more parallel.

> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
> have to be able to dispatch I/O and do something else while we wait for the
> results. So we need the ability to pipeline the executor and pipeline redo.

> I thought I'd start the discussion on this and see where we can go with it.
> What incremental steps can be done to move us toward parallelisable I/O
> without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a
more flexible executor model (imagine switching between different parts
of the query whenever blocked on IO - can't do that due to the stack
right now).

The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There's plenty other places that should use that helper, but we
already can get considerably higher throughput for seqscans, vacuum on
both very fast local storage, and high-latency cloud storage.

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which currently is used to by checkpointer and
bgwriter (but should be used in more places). Especially with direct IO
checkpointing can be a lot faster *and* less impactful on the "regular"
load.

I've got asynchronous writing of WAL mostly working, but need to
redesign the locking a bit further. Right now it's a win in some cases,
but not others. The latter to a significant degree due to unnecessary
blocking

> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.

Greetings,

Andres Freund

RE: Blocking I/O, async I/O and io_uring

From: Andres Freund 
> Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

I'm looking forward to this from the async+direct I/O, since the throughput of 
some write-heavy workload decreased by half or more during checkpointing (due 
to fsync?) Would you mind sharing any preliminary results on this if you have 
something?


Regards
Takayuki Tsunakawa

Re: On login trigger: take three

2020-12-07 Thread Pavel Stehule

út 8. 12. 2020 v 1:17 odesílatel Greg Nancarrow 
napsal:

> On Fri, Dec 4, 2020 at 9:05 PM Konstantin Knizhnik
>  wrote:
> >
> > As far as I understand Pavel concern was about the case when superuser
> > defines wrong login trigger which prevents login to the system
> > all user including himself. Right now solution of this problem is to
> > include "options='-c disable_session_start_trigger=true'" in connection
> > string.
> > I do not know if it can be done with pgAdmin.
> > >
>
> As an event trigger is tied to a particular database, and a GUC is
> global to the cluster, as long as there is one database in the cluster
> for which an event trigger for the "client_connection" event is NOT
> defined (say the default "postgres" maintenance database), then the
> superuser can always connect to that database, issue "ALTER SYSTEM SET
> disable_client_connection_trigger TO true" and reload the
> configuration. I tested this with pgAdmin4 and it worked fine for me,
> to allow login to a database for which login was previously prevented
> due to a badly-defined logon trigger.
>

yes, it can work .. Maybe for this operation only database owner rights
should be necessary. The super user is maybe too strong.

There are two maybe generic questions?

1. Maybe we can introduce more generic GUC for all event triggers like
disable_event_triggers? This GUC can be checked only by the database owner
or super user. It can be an alternative ALTER TABLE DISABLE TRIGGER ALL. It
can be protection against necessity to restart to single mode to repair the
event trigger. I think so more generic solution is better than special
disable_client_connection_trigger GUC.

2. I have no objection against client_connection. It is probably better for
the mentioned purpose - possibility to block connection to database. Can be
interesting, and I am not sure how much work it is to introduce the second
event - session_start. This event should be started after connecting - so
the exception there doesn't block connect, and should be started also after
the new statement "DISCARD SESSION", that will be started automatically
after DISCARD ALL.  This feature should not be implemented in first step,
but it can be a plan for support pooled connections

Regards

Pavel

> Pavel, is this an acceptable solution or do you still see problems
> with this approach?
>
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>

Re: pg_rewind race condition just after promotion

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas  wrote 
in 
> There's a race condition between the checkpoint at promotion and
> pg_rewind. When a server is promoted, the startup process writes an
> end-of-recovery checkpoint that includes the new TLI, and the server
> is immediate opened for business. The startup process requests the
> checkpointer process to perform a checkpoint, but it can take a few
> seconds or more to complete. If you run pg_rewind, using the just
> promoted server as the source, pg_rewind will think that the server is
> still on the old timeline, because it only looks at TLI in the control
> file's copy of the checkpoint record. That's not updated until the
> checkpoint is finished.
> 
> This isn't a new issue. Stephen Frost first reported it back 2015
> [1]. Back then, it was deemed just a small annoyance, and we just
> worked around it in the tests by issuing a checkpoint command after
> promotion, to wait for the checkpoint to finish. I just ran into it
> again today, with the new pg_rewind test, and silenced it in the
> similar way.

I (or we) faced that and avoided that by checking for history file on
the primary.

> I think we should fix this properly. I'm not sure if it can lead to a
> broken cluster, but at least it can cause pg_rewind to fail
> unnecessarily and in a user-unfriendly way. But this is actually
> pretty simple to fix. pg_rewind looks at the control file to find out
> the timeline the server is on. When promotion happens, the startup
> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
> control file. We just need to read it from there. Patch attached.

Looks fine to me.  A bit concerned about making sourceHistory
needlessly file-local but on the other hand unifying sourceHistory and
targetHistory looks better.

For the test part, that change doesn't necessariry catch the failure
of the current version, but I *believe* the prevous code is the result
of an actual failure in the past so the test probablistically (or
dependently on platforms?) hits the failure if it happned.

> I think we should also backpatch this. Back in 2015, we decided that
> we can live with this, but it's always been a bit bogus, and seems
> simple enough to fix.

I don't think this changes any successful behavior and it just saves
the failure case so +1 for back-patching.

> Thoughts?
> 
> [1]
> https://www.postgresql.org/message-id/20150428180253.GU30322%40tamriel.snowman.net

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Blocking I/O, async I/O and io_uring

2020-12-07 Thread Craig Ringer

On Tue, 8 Dec 2020 at 12:02, Andres Freund  wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about. Things aren't *quite* right yet architecturally,
> but I think we're getting there.
>

That's wonderful. Thankyou.

I'm badly behind on the conference circuit due to geographic isolation and
small children. I'll hunt up your talks.

The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).
>

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and
tracing of latencies etc? If not that's something I'd be keen to help with.
I've mostly been using systemtap so far but I'm trying to pivot over to
bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better
expose our wait points and latencies, improve visibility of blocking, and
help make activity traceable through all the stages of processing. I'll Cc
you when I do.

> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> manner,
> > with or without O_DIRECT. Basically async I/O with usable buffered I/O
> and
> > fsync support. It has ordering support which is really important for us.
>
> My results indicate that we really want to have have, optional & not
> enabled by default of course, O_DIRECT support. We just can't benefit
> fully of modern SSDs otherwise. Buffered is also important, of course.
>

Even more so for NVDRAM, Optane and all that, where zero-copy and low
context switches becomes important too.

We're a long way from that being a priority but it's still not to be
dismissed.

I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).
>

Yep, that's what I'm talking about being an issue.

Blocked on an index read? Move on to the next tuple and come back when the
index read is done.

I really like what I see of the io_uring architecture so far. It's ideal
for callback-based event-driven flow control. But that doesn't fit postgres
well for the executor. It's better for redo etc.

> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty other places that should use that helper, but we
> already can get considerably higher throughput for seqscans, vacuum on
> both very fast local storage, and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used to by checkpointer and
> bgwriter (but should be used in more places). Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.
>

Sure sounds like a useful interim step. That's great.

I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking
>

That's where io_uring's I/O ordering operations looked interesting. But I
haven't looked closely enough to see if they're going to help us with I/O
ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O
dependencies and even let it apply WAL then heap changes out-of-order so
long as they didn't violate any ordering constraints we specify between
particular WAL records or between WAL writes and their corresponding heap
blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between
WAL records and heap writes instead of fsync()ing, but as you note, the
increased blocking and reduction in the kernel's ability to do I/O
reordering is generally worse than the costs of the fsync()s we do now.

> I'm thinking that redo is probably a good first candidate. It doesn't
> > depend on the guts of the executor. It is much less sensitive to
> > ordering between operations in shmem and on disk since it runs in the
> > startup process. And it hurts REALLY BADLY from

About to add WAL write/fsync statistics to pg_stat_wal view

2020-12-07 Thread Masahiro Ikeda


Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following 
information on pg_stat_wal view.

I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query 
performance than database I/O activity.


```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time 
FROM pg_stat_wal;

-[ RECORD 1 ]--+
wal_write  | 650  # Total number of times WAL data was written to 
the disk


wal_write_time | 43   # Total amount of time that has been spent in the 
portion of WAL data was written to disk
  # if track-wal-io-timing is enabled, otherwise 
zero


wal_sync   | 78   # Total number of times WAL data was synced to the 
disk


wal_sync_time  | 104  # Total amount of time that has been spent in the 
portion of WAL data was synced to disk
  # if track-wal-io-timing is enabled, otherwise 
zero

```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATIONdiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d690..ba923a2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7388,6 +7388,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  track_wal_io_timing (boolean)
+  
+   track_wal_io_timing configuration parameter
+  
+  
+  
+   
+Enables timing of WAL I/O calls. This parameter is off by default,
+because it will repeatedly query the operating system for
+the current time, which may cause significant overhead on some
+platforms.  You can use the  tool to
+measure the overhead of timing on your system.
+I/O timing information is
+displayed in 
+pg_stat_wal.  Only superusers can
+change this setting.
+   
+  
+ 
+
  
   track_functions (enum)
   
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a53..ce4f652d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3479,7 +3479,51 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
wal_buffers_full bigint
   
   
-   Number of times WAL data was written to the disk because WAL buffers got full
+   Total number of times WAL data was written to the disk because WAL buffers got full
+  
+ 
+
+ 
+  
+   wal_write bigint
+  
+  
+   Total number of times WAL data was written to the disk
+  
+ 
+
+ 
+  
+   wal_write_time double precision
+  
+  
+   Total amount of time that has been spent in the portion of
+   WAL data was written to disk, in milliseconds
+   (if  is enabled, otherwise zero).
+   To avoid standby server's performance degradation, they don't collect
+   this statistics
+  
+ 
+
+ 
+  
+   wal_sync bigint
+  
+  
+   Total number of times WAL data was synced to the disk
+  
+ 
+
+ 
+  
+   wal_sync_time double precision
+  
+  
+   Total amount of time that has been spent in the portion of
+   WAL data was synced to disk, in milliseconds
+   (if  is enabled, otherwise zero).
+   To avoid standby server's performance degradation, they don't collect
+   this statistics
   
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f..c9f33d23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -109,6 +109,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2528,6 +2529,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2536,9 +2539,24 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 errno = 0;
+
+/* Measure i/o timing to write WAL data */
+if (track_wal_io_timing)
+	INSTR_TIME_SET_CURRENT(start);
+
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
 pgstat_report_wait_end();
+
+if (track_wal_io_timing)
+{
+	INSTR_TIME_SET_CURRENT(duration);
+	INSTR_TIME_SUBTRACT(duration, start);
+	WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(

Re: [Patch] Optimize dropping of relation buffers using dlist

At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi
>  wrote:
> > We drop
> > buffers for the old relfilenode on truncation anyway.
> >
> > What I did is:
> >
> > a: Create a physical replication pair.
> > b: On the master, create a table. (without explicitly starting a tx)
> > c: On the master, insert a tuple into the table.
> > d: On the master truncate the table.
> >
> > On the standby, smgrnblocks is called for the old relfilenode of the
> > table at c, then the same function is called for the same relfilenode
> > at d and the function takes the cached path.
> >
> 
> This is on the lines I have tried for recovery. So, it seems we are in
> agreement that we can use the 'cached' flag in
> DropRelFileNodesAllBuffers and it will take the optimized path in many
> such cases, right?

Mmm. There seems to be a misunderstanding.. What I opposed to is
referring only to InRecovery and ignoring the value of "cached".

The remaining issue is we don't get to the optimized path when a
standby makes the first call to smgrnblocks() when truncating a
relation. Still we can get to the optimized path as far as any
update(+insert) or select is performed earlier on the relation so I
think it doesn't matter so match.

But I'm not sure what others think.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Single transaction in the tablesync worker?

On Mon, Dec 7, 2020 at 2:21 PM Amit Kapila  wrote:
>
> On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila  wrote:
> >
> > On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
> >  wrote:
> > >
> >
> > >>
> > >> I am not sure why but it seems acceptable to original authors that the
> > >> data of transactions are visibly partially during the initial
> > >> synchronization phase for a subscription.
> > >
> > >
> > > I don't think there's much alternative there.
> > >
> >
> > I am not sure about this. I think it is primarily to allow some more
> > parallelism among apply and sync workers. One primitive way to achieve
> > parallelism and don't have this problem is to allow apply worker to
> > wait till all the tablesync workers are in DONE state.
> >
>
> As the slot of apply worker is created before all the tablesync
> workers it should never miss any LSN which tablesync workers would
> have processed. Also, the table sync workers should not process any
> xact if the apply worker has not processed anything. I think tablesync
> currently always processes one transaction (because we call
> process_sync_tables at commit of a txn) even if that is not required
> to be in sync with the apply worker.
>

One more thing to consider here is that currently in tablesync worker,
we create a slot with CRS_USE_SNAPSHOT option which creates a
transaction snapshot on the publisher, and then we use the same
snapshot for a copy from the publisher. After this, when we try to
receive the data from the publisher using the same slot, it will be in
sync with the COPY. I think to keep the same consistency between COPY
and the data we receive from the publisher in this approach, we need
to export the snapshot while creating a slot in the apply worker by
using CRS_EXPORT_SNAPSHOT and then use the same snapshot by all the
tablesync workers doing the copy. In tablesync workers, we can use the
SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION
LEVEL REPEATABLE READ" to achieve it. That way the COPY will use the
same snapshot as is used for receiving the changes in apply worker and
the data will be in sync.

-- 
With Regards,
Amit Kapila.

Re: [Proposal] Global temporary tables

2020-12-07 Thread Pavel Stehule

Hi

čt 26. 11. 2020 v 12:46 odesílatel 曾文旌  napsal:

> This is the latest patch for feature GTT(v38).
> Everybody, if you have any questions, please let me know.
>

please, co you send a rebased patch. It is broken again

Regards

Pavel


>
> Wenjing
>
>
>

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila  
> wrote in
> > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi
> >  wrote:
> > > We drop
> > > buffers for the old relfilenode on truncation anyway.
> > >
> > > What I did is:
> > >
> > > a: Create a physical replication pair.
> > > b: On the master, create a table. (without explicitly starting a tx)
> > > c: On the master, insert a tuple into the table.
> > > d: On the master truncate the table.
> > >
> > > On the standby, smgrnblocks is called for the old relfilenode of the
> > > table at c, then the same function is called for the same relfilenode
> > > at d and the function takes the cached path.
> > >
> >
> > This is on the lines I have tried for recovery. So, it seems we are in
> > agreement that we can use the 'cached' flag in
> > DropRelFileNodesAllBuffers and it will take the optimized path in many
> > such cases, right?
>
>
> Mmm. There seems to be a misunderstanding.. What I opposed to is
> referring only to InRecovery and ignoring the value of "cached".
>

Okay, I think it was Kirk-San who proposed to use InRecovery and
ignoring the value of "cached" based on the theory that even if Insert
(or other DMLs) are done before Truncate, it won't use an optimized
path and I don't agree with the same. So, I did a small test to check
the same and found that it should use the optimized path and the same
is true for the experiment done by you. I am not sure why Kirk-San is
seeing something different?

> The remaining issue is we don't get to the optimized path when a
> standby makes the first call to smgrnblocks() when truncating a
> relation. Still we can get to the optimized path as far as any
> update(+insert) or select is performed earlier on the relation so I
> think it doesn't matter so match.
>

+1.

With Regards,
Amit Kapila.

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-07 Thread k.jami...@fujitsu.com

On Tuesday, December 8, 2020 2:35 PM, Amit Kapila wrote:

> On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila
> >  wrote in
> > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi
> > >  wrote:
> > > > We drop
> > > > buffers for the old relfilenode on truncation anyway.
> > > >
> > > > What I did is:
> > > >
> > > > a: Create a physical replication pair.
> > > > b: On the master, create a table. (without explicitly starting a
> > > > tx)
> > > > c: On the master, insert a tuple into the table.
> > > > d: On the master truncate the table.
> > > >
> > > > On the standby, smgrnblocks is called for the old relfilenode of
> > > > the table at c, then the same function is called for the same
> > > > relfilenode at d and the function takes the cached path.
> > > >
> > >
> > > This is on the lines I have tried for recovery. So, it seems we are
> > > in agreement that we can use the 'cached' flag in
> > > DropRelFileNodesAllBuffers and it will take the optimized path in
> > > many such cases, right?
> >
> >
> > Mmm. There seems to be a misunderstanding.. What I opposed to is
> > referring only to InRecovery and ignoring the value of "cached".
> >
> 
> Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring
> the value of "cached" based on the theory that even if Insert (or other DMLs)
> are done before Truncate, it won't use an optimized path and I don't agree
> with the same. So, I did a small test to check the same and found that it
> should use the optimized path and the same is true for the experiment done
> by you. I am not sure why Kirk-San is seeing something different?
> 
> > The remaining issue is we don't get to the optimized path when a
> > standby makes the first call to smgrnblocks() when truncating a
> > relation. Still we can get to the optimized path as far as any
> > update(+insert) or select is performed earlier on the relation so I
> > think it doesn't matter so match.
> >
> 
> +1.

My question/proposal before was to either use InRecovery,
or completely drop the smgrnblocks' "cached" flag.
But that is coming from the results of my investigation below when
I used "cached" in DropRelFileNodesAllBuffers().
The optimization path was skipped because one of the
Rels' "cached" value was "false".

Test Case. (shared_buffer = 1GB)
0. Set physical replication to both master and standby.
1. Create 1 table.
2. Insert Data (1MB) to TABLE.
16385 is the relnode for insert (both Master and Standby).

3. Pause WAL on Standby.
4. TRUNCATE table on Primary.
  nrels = 3. relNodes 16389, 16388, 16385.

5. Stop Primary.

6. Promote standby and resume WAL recovery.  nrels = 3 
  1st rel's check for optimization: "cached" is TRUE. relNode = 16389.
  2nd rel's check for optimization. "cached" was returned FALSE by
smgrnblocks). relNode = 16388.
  Since one of the rels' cached is "FALSE", the optimization check for
  3rd relation and the whole optimization itself is skipped.
  Go to full-scan path in DropRelFileNodesAllBuffers().
  Then smgrclose for relNodes 16389, 16388, 16385.

Because one of the rel's cached value was false, it forced the
full-scan path for TRUNCATE.
Is there a possible workaround for this? 

Regards,
Kirk Jamison

Re: Single transaction in the tablesync worker?

2020-12-07 Thread Peter Smith

On Mon, Dec 7, 2020 at 7:49 PM Amit Kapila  wrote:
> As the slot of apply worker is created before all the tablesync
> workers it should never miss any LSN which tablesync workers would
> have processed. Also, the table sync workers should not process any
> xact if the apply worker has not processed anything. I think tablesync
> currently always processes one transaction (because we call
> process_sync_tables at commit of a txn) even if that is not required
> to be in sync with the apply worker. This should solve both the
> problems (a) visibility of partial transactions (b) allow prepared
> transactions because tablesync worker no longer needs to combine
> multiple transactions data.
>
> I think the other advantages of this would be that it would reduce the
> load (both CPU and I/O) on the publisher-side by allowing to decode
> the data only once instead of for each table sync worker once and
> separately for the apply worker. I think it will use fewer resources
> to finish the work.

Yes, I observed this same behavior.

IIUC the only way for the tablesync worker to go from CATCHUP mode to
SYNCDONE is via the call to process_sync_tables.

But a side-effect of this is, when messages arrive during this CATCHUP
phase one tx will be getting handled by the tablesync worker before
the process_sync_tables() is ever encountered.

I have created and attached a simple patch which allows the tablesync
to detect if there is anything to do *before* it enters the apply main
loop. Calling process_sync_tables() before the apply main loop  offers
a quick way out so the message handling will not be split
unnecessarily between the workers.

~

The result of the patch is demonstrated by the following test/logs
which are also attached.
Note: I added more logging (not in this patch) to make it easier to
see what is going on.

LOGS1. Current code.
Test: 10 x INSERTS done at CATCHUP time.
Result: tablesync worker does 1 x INSERT, then apply worker skips 1
and does remaining 9 x INSERTs.

LOGS2. Patched code.
Test: Same 10 x INSERTS done at CATCHUP time.
Result: tablesync can exit early. apply worker handles all 10 x INSERTs

LOGS3. Patched code.
Test: 2PC PREPARE then COMMIT PREPARED [1] done at CATCHUP time
psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1,
'foo');PREPARE TRANSACTION 'test_prepared_tab';"
psql -d test_pub -c "COMMIT PREPARED 'test_prepared_tab';"
Result: The PREPARE and COMMIT PREPARED are both handle by apply
worker. This avoids complications which the split otherwise causes.
[1] 2PC prepare test requires v29 patch from
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

---

Kind Regards,
Peter Smith.
Fujitsu Australia
## Not patched

2020-12-07 17:03:45.237 AEDT [26325] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-07 17:03:46.237 AEDT [26325] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-07 17:03:46.238 AEDT [26325] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-07 17:03:47.240 AEDT [26325] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-07 17:03:47.240 AEDT [26325] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-07 17:03:47.818 AEDT [27247] LOG:  logical decoding found consistent 
point at 0/1624870
2020-12-07 17:03:47.818 AEDT [27247] DETAIL:  There are no running transactions.
2020-12-07 17:03:47.818 AEDT [27247] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub_16433_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-12-07 17:03:47.820 AEDT [26325] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-07 17:03:47.821 AEDT [26325] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-07 17:03:47.832 AEDT [26333] LOG:  !!>> tablesync worker: wait for 
CATCHUP state notification
2020-12-07 17:03:47.832 AEDT [26325] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-07 17:03:47.832 AEDT [26325] LOG:  !!>> apply worker: called 
process_syncing_tables

2020-12-07 16:59:21.501 AEDT [19106] LOG:  !!>> apply worker: called 
process_syncing_tables

## Attached and paused in debugger where the "tablesync" is told to CATCHUP.

### At this point we might execute any number of SQL commands

e.g 10 inserts

psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(2, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(3, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(4, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(5, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(6, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(7, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(8, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(9, 'foo');"
psql -d test_pub -c "INSERT INTO test_tab VALUES(10, 'foo');"

## Now continue in the "tablesync" worker

2020-12-07 17:06:25.315 AEDT [26333] LOG:  !!>> tablesync worker: 
LogicalRepApplyLoop
2020-12-07 17:06:52.606 AEDT [26333

Re: Blocking I/O, async I/O and io_uring

Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help with.
> I've mostly been using systemtap so far but I'm trying to pivot over to
> bpf.

Not much - there's still so many low hanging fruits and architectural
things to finish that it didn't yet seem pressing.

> I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some cases,
> > but not others. The latter to a significant degree due to unnecessary
> > blocking

> That's where io_uring's I/O ordering operations looked interesting. But I
> haven't looked closely enough to see if they're going to help us with I/O
> ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context switch overhead to some degree, but otherwise it's not different
than userspace submitting another request upon receving of a completion.

> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding heap
> blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inferrence wouldn't be cheap on
the PG side either.

I don't think it'd help that much for WAL apply anyway. You need
read-ahead of the WAL to avoid unnecessary waits for a lot of records
anyway. And the writes during WAL are mostly pretty asynchronous (mainly
writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.

> How feasible do you think it'd be to take it a step further and structure
> redo as a pipelined queue, where redo calls enqueue I/O operations and
> completion handlers then return immediately? Everything still goes to disk
> in the order it's enqueued, and the callbacks will be invoked in order, so
> they can update appropriate shmem state etc. Since there's no concurrency
> during redo, it should be *much* simpler than normal user backend
> operations where we have all the tight coordination of buffer management,
> WAL write ordering, PGXACT and PGPROC, the clog, etc.

I think it'd be a fairly massive increase in complexity. And I don't see
a really large payoff: Once you have real readahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund

Re: PG vs LLVM 12 on seawasp, next round

Hi,

On 2020-12-07 19:38:19 -0800, Andres Freund wrote:
> On 2020-12-01 12:08:10 -0800, Andres Freund wrote:
> > On 2020-12-01 21:04:44 +0100, Fabien COELHO wrote:
> > > Andres investigated a few days ago, managed to reproduce the issue 
> > > locally,
> > > and has one line patch. I'm unsure if it should be prevently back-patched,
> > > though.
> > 
> > I see no reason not to backpatch - it's more correct for past versions
> > of LLVM as well.
> 
> I pushed that now.

I hadn't checked that before, but for the last few days there's been a
different failure than the one I saw earlier:

+ERROR:  could not load library 
"/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
 libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or 
directory

whereas what I fixed is about:

+ERROR:  could not load library 
"/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
 
/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so:
 undefined symbol: LLVMInitializeX86Target

Changed somewhere between
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2020-11-20%2009%3A17%3A10
and
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2020-11-21%2023%3A17%3A11

The "no such file" error seems more like a machine local issue to me.

Greetings,

Andres Freund

RE: [Patch] Optimize dropping of relation buffers using dlist

From: Jamison, Kirk/ジャミソン カーク 
> Because one of the rel's cached value was false, it forced the
> full-scan path for TRUNCATE.
> Is there a possible workaround for this?

Hmm, the other two relfilenodes are for the TOAST table and index of the target 
table.  I think the INSERT didn't access those TOAST relfilenodes because the 
inserted data was stored in the main storage.  But TRUNCATE always truncates 
all the three relfilenodes.  So, the standby had not opened the relfilenode for 
the TOAST stuff or cached its size when replaying the TRUNCATE.

I'm afraid this is more common than we can ignore and accept the slow 
traditional path, but I don't think of a good idea to use the cached flag.


Regards
Takayuki Tsunakawa

Re: Blocking I/O, async I/O and io_uring