Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb
On Wed, Jun 28, 2023 at 09:20:03PM -0700, Gurjeet Singh wrote:
> The comment on top of connect_utils.c:connectDatabase() seems pertinent:
>
>> (Callers should not pass
>> * allow_password_reuse=true unless reconnecting to the same database+user
>> * as before, else we might create password exposure hazards.)
>
> The callers of {cluster|reindex}_one_database() (which in turn call
> connectDatabase()) clearly pass different database names in successive
> calls to these functions.  So the patch seems to be in conflict with
> the recommendation in the comment.
>
> I'm not sure if the concern raised in that comment is a legitimate
> one, though.  I mean, if the password is reused to connect to a
> different database in the same cluster/instance, which I think is
> always the case with these utilities, the password will be exposed in
> the server logs (if at all).  And since the admins of the instance
> already have full control over the passwords of the user, I don't think
> this patch will give them any more information than what they can get
> anyways.
>
> It is a valid concern, though, if the utility connects to a different
> instance in the same run/invocation, and hence exposes the password
> from the first instance to the admins of the second cluster.

The same commit that added this comment (ff402ae) also set the
allow_password_reuse parameter to true in vacuumdb's connectDatabase()
calls.  I found a message from the corresponding thread that provides
some additional detail [0].  I wonder if this comment should instead
recommend against using the allow_password_reuse flag unless reconnecting
to the same host/port/user target.  Connecting to different databases
with the same host/port/user information seems okay.  Maybe I am missing
something...

> Nitpicking: The patch seems to have Windows line endings, which
> explains why my `patch` complained so loudly.
>
> $ patch -p1 < v1-0001-harmonize-patch
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/ref/reindexdb.sgml
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file doc/src/sgml/ref/vacuumdb.sgml
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/bin/scripts/clusterdb.c
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/bin/scripts/reindexdb.c
>
> $ file v1-0001-harmonize-password-reuse-in-vacuumdb-clusterdb-an.patch
> v1-0001-harmonize-patch: unified diff output text, ASCII text,
> with CRLF line terminators

Huh.  I didn't write it on a Windows machine.  I'll look into it.

[0] https://postgr.es/m/15139.1447357263%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
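As a quick illustration of the reuse rule proposed above — reuse is safe
when the host/port/user triple is unchanged, even if the database differs —
here is a sketch in Python (the helper name and tuple shape are hypothetical;
this is not libpq's actual logic):

```python
def may_reuse_password(prev, new):
    """Sketch of the proposed rule: reusing a saved password is safe when
    the host/port/user triple is unchanged, regardless of dbname.
    'prev' and 'new' are (host, port, user, dbname) tuples."""
    return prev[:3] == new[:3]  # ignore the dbname component

# Same instance, different database: reuse is fine.
assert may_reuse_password(("db1.example", 5432, "postgres", "postgres"),
                          ("db1.example", 5432, "postgres", "template1"))
# Different host: do not expose the first instance's password.
assert not may_reuse_password(("db1.example", 5432, "postgres", "postgres"),
                              ("db2.example", 5432, "postgres", "postgres"))
```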
Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
At Wed, 28 Jun 2023 16:24:02 -0700, Nathan Bossart wrote in
> While working on some other patches, I found myself wanting to use the
> following command to vacuum the catalogs in all databases in a cluster:
>
> 	vacuumdb --all --schema pg_catalog
>
> However, this presently fails with the following error:
>
> 	cannot vacuum specific schema(s) in all databases
>
> AFAICT there is no technical reason to block this, and the resulting
> behavior feels intuitive to me, so I wrote 0001 to allow it.  0002 allows
> specifying tables to process in all databases in clusterdb, and 0003
> allows specifying tables, indexes, schemas, or the system catalogs to
> process in all databases in reindexdb.

It seems useful.

> I debated also allowing users to specify different types of objects in the
> same command (e.g., "vacuumdb --schema myschema --table mytable"), but it
> looked like this would require a more substantial rewrite, and I didn't
> feel that the behavior was intuitive.  For the example I just gave, does
> the user expect us to process both the "myschema" schema and the "mytable"
> table, or does the user want us to process the "mytable" table in the
> "myschema" schema?  In vacuumdb, this is already blocked, but reindexdb

I think specifying the two at once is inconsistent if we maintain the
current behavior of those options.  It seems to me that that change
clearly modifies the functionality of the options: as a result, those
options look like restriction filters.  For example, "vacuumdb -s s1_*
-t t1" would vacuum all tables named "t1" in all schemas matching
"s1_*".

> accepts combinations of tables, schemas, and indexes (yet disallows
> specifying --system along with other types of objects).  Since this is
> inconsistent with vacuumdb and IMO ambiguous, I've restricted such
> combinations in 0003.
>
> Thoughts?

While I think this is useful, primarily for system catalogs, I'm not
entirely convinced of its practicality for user objects.  It's difficult
for me to imagine that a situation where all databases share the same
schema would be common.  Assuming this is used for user objects, it may
be necessary to safely skip databases that lack the specified schema or
table, provided the object is present in at least one other database,
and that exclusion should print a warning.  It could also be necessary
to safely move on to the next object when a reindex or cluster operation
fails on a single object due to a missing prerequisite.  But I don't
think we want to add such complexity to these "script" tools.

So... an alternative path might be to introduce a new option like
--syscatalog to specify system catalogs as the only option that can be
combined with --all.  In doing so, we can leave the --table and --schema
options untouched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: add \dpS to psql [16beta1]
On Thu, Jun 29, 2023 at 02:11:43AM +, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> I found while testing PostgreSQL 16 Beta 1 that the output of the \?
> metacommand did not include \dS, \dpS.
> The attached patch changes the output of the \? meta command to:

Thanks for the report!  I've committed your patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
RE: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Thursday, June 29, 2023 12:06 PM vignesh C wrote:
>
> On Wed, 28 Jun 2023 at 19:26, Ashutosh Bapat wrote:
> >
> > Hi Vignesh,
> > Thanks for working on this.
> >
> > On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote:
> > >
> > > Here is a patch having the fix for the same. I have not added any
> > > tests as the existing tests cover this scenario. The same issue is
> > > present in back branches too.
> >
> > Interesting, we have a test for this scenario and it accepts erroneous
> > output :).
> >
> > > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch
> > > can be applied on master, PG15 and PG14,
> > > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch
> > > patch can be applied on PG13, PG12 and PG11.
> > > Thoughts?
> >
> > I noticed this when looking at Tomas's patches for logical decoding of
> > sequences. The code block you have added is repeated in
> > pg_decode_change() and pg_decode_truncate(). It might be better to
> > push the conditions in pg_output_begin() itself so that any future
> > callsite of pg_output_begin() automatically takes care of these
> > conditions.
>
> Thanks for the comments, here is an updated patch handling the above issue.

Thanks for the patches.  I tried to understand the following check:

	/*
	 * If asked to skip empty transactions, we'll emit BEGIN at the point
	 * where the first operation is received for this transaction.
	 */
-	if (data->skip_empty_xacts)
+	if (!(last_write ^ data->skip_empty_xacts) || txndata->xact_wrote_changes)
		return;

I might miss something, but would you mind elaborating on why we use
"last_write" in this check?

Best Regards,
Hou zj
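For reference, the guard in question can be tabulated.  A quick sketch in
Python (the booleans mirror the C variables; `last_write=True` models the
call from pg_decode_begin_txn(), `last_write=False` the call from a change
callback — this is only an illustration, not the plugin code):

```python
def emits_begin(last_write, skip_empty_xacts, xact_wrote_changes):
    """True when the patched pg_output_begin() does NOT return early,
    i.e. when it actually writes BEGIN."""
    return not (not (last_write ^ skip_empty_xacts) or xact_wrote_changes)

# First call for the transaction (nothing written yet):
assert emits_begin(True, False, False)       # BEGIN callback, not skipping
assert not emits_begin(True, True, False)    # BEGIN callback, skipping: defer
assert emits_begin(False, True, False)       # first change, skipping: emit now
assert not emits_begin(False, False, False)  # change callback: already emitted
# Once xact_wrote_changes is set, BEGIN is never emitted again:
assert not emits_begin(False, True, True)
```

So, as far as the table shows, BEGIN is emitted exactly once per transaction
either at the BEGIN callback or at the first change, depending on
skip-empty-xacts.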
Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb
On Tue, Jun 27, 2023 at 9:57 PM Nathan Bossart wrote:
>
> While looking at something unrelated, I noticed that the vacuumdb docs
> mention the following:
>
> 	vacuumdb might need to connect several times to the PostgreSQL server,
> 	asking for a password each time.
>
> IIUC this has been fixed since 83dec5a from 2015 (which was superseded by
> ff402ae), so I think this note (originally added in e0a77f5 from 2002) can
> now be removed.
>
> I also found that neither clusterdb nor reindexdb uses the
> allow_password_reuse parameter in connectDatabase(), and the reindexdb
> documentation contains the same note about repeatedly asking for a
> password (originally added in 85e9a5a from 2005).  IMO we should allow
> password reuse for all three programs, and we should remove the
> aforementioned notes in the docs, too.  This is what the attached patch
> does.
>
> Thoughts?

The comment on top of connect_utils.c:connectDatabase() seems pertinent:

> (Callers should not pass
>  * allow_password_reuse=true unless reconnecting to the same database+user
>  * as before, else we might create password exposure hazards.)

The callers of {cluster|reindex}_one_database() (which in turn call
connectDatabase()) clearly pass different database names in successive
calls to these functions.  So the patch seems to be in conflict with the
recommendation in the comment.

I'm not sure if the concern raised in that comment is a legitimate one,
though.  I mean, if the password is reused to connect to a different
database in the same cluster/instance, which I think is always the case
with these utilities, the password will be exposed in the server logs (if
at all).  And since the admins of the instance already have full control
over the passwords of the user, I don't think this patch will give them
any more information than what they can get anyways.

It is a valid concern, though, if the utility connects to a different
instance in the same run/invocation, and hence exposes the password from
the first instance to the admins of the second cluster.

Nitpicking: The patch seems to have Windows line endings, which explains
why my `patch` complained so loudly.

$ patch -p1 < v1-0001-harmonize-patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/reindexdb.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/vacuumdb.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/scripts/clusterdb.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/scripts/reindexdb.c

$ file v1-0001-harmonize-password-reuse-in-vacuumdb-clusterdb-an.patch
v1-0001-harmonize-patch: unified diff output text, ASCII text,
with CRLF line terminators

Best regards,
Gurjeet
http://Gurje.et
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Wed, 28 Jun 2023 at 19:26, Ashutosh Bapat wrote:
>
> Hi Vignesh,
> Thanks for working on this.
>
> On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote:
> >
> > Here is a patch having the fix for the same. I have not added any
> > tests as the existing tests cover this scenario. The same issue is
> > present in back branches too.
>
> Interesting, we have a test for this scenario and it accepts erroneous
> output :).
>
> > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch
> > can be applied on master, PG15 and PG14,
> > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch
> > patch can be applied on PG13, PG12 and PG11.
> > Thoughts?
>
> I noticed this when looking at Tomas's patches for logical decoding of
> sequences. The code block you have added is repeated in
> pg_decode_change() and pg_decode_truncate(). It might be better to
> push the conditions in pg_output_begin() itself so that any future
> callsite of pg_output_begin() automatically takes care of these
> conditions.

Thanks for the comments, here is an updated patch handling the above issue.

Regards,
Vignesh

From 130715a6dd808ed88af8c894d7e27b8637bc2c3a Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Wed, 28 Jun 2023 14:01:22 +0530
Subject: [PATCH v2] Call pg_output_begin in pg_decode_message if it is the
 first change in the transaction.

Call pg_output_begin in pg_decode_message if it is the first change in
the transaction.
---
 contrib/test_decoding/expected/messages.out | 10 +--
 contrib/test_decoding/sql/messages.sql      |  2 +-
 contrib/test_decoding/test_decoding.c       | 31 +
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/contrib/test_decoding/expected/messages.out b/contrib/test_decoding/expected/messages.out
index c75d40190b..0fd70036bd 100644
--- a/contrib/test_decoding/expected/messages.out
+++ b/contrib/test_decoding/expected/messages.out
@@ -58,17 +58,23 @@ SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic');
  ignorethis
 (1 row)
 
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0');
    data
 ------------------------------------------------------------
+ BEGIN
  message: transactional: 1 prefix: test, sz: 4 content:msg1
+ COMMIT
  message: transactional: 0 prefix: test, sz: 4 content:msg2
  message: transactional: 0 prefix: test, sz: 4 content:msg4
  message: transactional: 0 prefix: test, sz: 4 content:msg6
+ BEGIN
  message: transactional: 1 prefix: test, sz: 4 content:msg5
  message: transactional: 1 prefix: test, sz: 4 content:msg7
+ COMMIT
+ BEGIN
  message: transactional: 1 prefix: test, sz: 11 content:czechtastic
-(7 rows)
+ COMMIT
+(13 rows)
 
 -- test db filtering
 \set prevdb :DBNAME
diff --git a/contrib/test_decoding/sql/messages.sql b/contrib/test_decoding/sql/messages.sql
index cf3f7738e5..3d8500f99c 100644
--- a/contrib/test_decoding/sql/messages.sql
+++ b/contrib/test_decoding/sql/messages.sql
@@ -19,7 +19,7 @@ COMMIT;
 SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic');
 
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0');
 
 -- test db filtering
 \set prevdb :DBNAME
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 93c948856e..86c29de40b 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -211,15 +211,21 @@ pg_decode_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	TestDecodingData *data = ctx->output_plugin_private;
 
 	data->xact_wrote_changes = false;
-	if (data->skip_empty_xacts)
-		return;
 
 	pg_output_begin(ctx, data, txn, true);
 }
 
 static void
-pg_output_begin(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
+pg_output_begin(LogicalDecodingContext *ctx, TestDecodingData *data,
+				ReorderBufferTXN *txn, bool last_write)
 {
+	/*
+	 * If asked to skip empty transactions, we'll emit BEGIN at the point
+	 * where the first operation is received for this transaction.
+	 */
+	if (!(last_write ^ data->skip_empty_xacts) || data->xact_wrote_changes)
+		return;
+
 	OutputPluginPrepareWrite(ctx, last_write);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
@@ -403,10 +409,7 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	data = ctx->output_plugin_private;
 
 	/* output BEGIN if we haven't yet */
-	if (data->skip_empty_xacts && !data->xact_wrote_changes)
-	{
-
Re: Changing types of block and chunk sizes in memory contexts
On Thu, 29 Jun 2023 at 09:26, Tomas Vondra wrote:
> 	AllocSetContext    224B -> 208B (4 cachelines)
> 	GenerationContext  152B -> 136B (3 cachelines)
> 	SlabContext        200B -> 200B (no change, adds 4B hole)
>
> Nothing else changes, AFAICS.

I don't think a lack of a reduction in the number of cache lines is the
important part.  Allowing more space on the keeper block, which is at the
end of the context struct, seems more useful.

I understand that the proposal is just to shave off 12 bytes, and that's
not exactly huge when it's just once per context, but we do create quite
a large number of contexts with ALLOCSET_SMALL_SIZES, which have a 1KB
initial block size.  12 bytes in 1024 is not terrible.

It's not exactly an invasive change.  It does not add any complexity to
the code and, as far as I can see, about zero risk of it slowing anything
down.

David
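The cacheline counts quoted from pahole above follow directly from the
struct sizes, assuming the common 64-byte x86-64 cacheline and a
cacheline-aligned start.  A quick sketch:

```python
import math

CACHELINE = 64  # assumed x86-64 cacheline size

def cachelines(struct_size):
    """Cachelines a struct spans, assuming it starts cacheline-aligned."""
    return math.ceil(struct_size / CACHELINE)

# Sizes quoted from pahole, before -> after the patch
for name, before, after in [("AllocSetContext", 224, 208),
                            ("GenerationContext", 152, 136),
                            ("SlabContext", 200, 200)]:
    print(name, cachelines(before), "->", cachelines(after))
# AllocSetContext stays at 4 cachelines, GenerationContext at 3 — the
# shrinkage doesn't cross a cacheline boundary, which is David's point.
```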
Re: Another incorrect comment for pg_stat_statements
On Thu, Jun 29, 2023 at 8:19 AM Michael Paquier wrote: > Nothing much to add, so applied with the initial comment fix. Thanks for pushing it! Thanks Richard
Trivial revise for the check of parameterized partial paths
While working on the invalid parameterized join path issue [1], I noticed
that we can simplify the code for checking parameterized partial paths in
try_partial_hashjoin/mergejoin_path, with the help of the macro
PATH_REQ_OUTER.

-	if (inner_path->param_info != NULL)
-	{
-		Relids		inner_paramrels = inner_path->param_info->ppi_req_outer;
-
-		if (!bms_is_empty(inner_paramrels))
-			return;
-	}
+	if (!bms_is_empty(PATH_REQ_OUTER(inner_path)))
+		return;

Also there is a comment there that is not correct.

	 * If the inner path is parameterized, the parameterization must be fully
	 * satisfied by the proposed outer path.

This is true for nestloop but not for hashjoin/mergejoin.

Besides, I wonder if it'd be better to verify that the outer input path
for a partial join path does not have any parameterization dependency.

Attached is a patch for all these changes.

[1] https://www.postgresql.org/message-id/flat/CAJKUy5g2uZRrUDZJ8p-%3DgiwcSHVUn0c9nmdxPSY0jF0Ov8VoEA%40mail.gmail.com

Thanks
Richard

v1-0001-Trivial-revise-for-the-check-of-parameterized-partial-paths.patch
Description: Binary data
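The equivalence behind the simplification is that PATH_REQ_OUTER yields the
empty set for an unparameterized path, so both the old two-step test and the
new one-liner reject exactly the same paths.  A toy model in Python (dict
keys mirror the C field names; this is an illustration, not the planner
code):

```python
def path_req_outer(path):
    """Analogue of the PATH_REQ_OUTER macro: the required outer relids,
    or the empty set when the path has no parameterization."""
    pi = path["param_info"]
    return pi["ppi_req_outer"] if pi else set()

def old_guard(inner_path):
    # Original code: explicit NULL check, then bms_is_empty().
    if inner_path["param_info"] is not None:
        if path_req_outer(inner_path):
            return True  # reject this partial-join candidate
    return False

def new_guard(inner_path):
    # Proposed code: !bms_is_empty(PATH_REQ_OUTER(inner_path))
    return bool(path_req_outer(inner_path))

for p in ({"param_info": None},
          {"param_info": {"ppi_req_outer": set()}},
          {"param_info": {"ppi_req_outer": {3}}}):
    assert old_guard(p) == new_guard(p)
```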
Re: [PGdocs] fix description for handling pf non-ASCII characters
On Fri, Jun 23, 2023 at 10:25 PM Hayato Kuroda (Fujitsu) wrote:
>
> Dear hackers,
>
> While discussing based on the article [1] with Japanese developers,
> I found inconsistencies between the code and the documents.
>
> 45b1a67a [2] changed the behavior when non-ASCII characters were set in
> application_name, cluster_name and postgres_fdw.application_name, but it
> seems this was not documented.
> Previously non-ASCII chars were replaced with question marks '?', but now
> they are replaced with a hex escape instead.
>
> What do you think?  Is my understanding correct?
>
> Acknowledgement:
> Sawada-san and Shinoda-san led the developer's discussion.
> Fujii-san confirmed my points.  Thank you for all of their work!
>
> [1]: https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
> [2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45b1a67a0fcb3f1588df596431871de4c93cb76f;hp=da5d4ea5aaac4fc02f2e2aec272efe438dd4e171
>
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

In your patch:

> printable ASCII characters will be replaced with a hex escape.

This wording is not quite right.  I think the result will be: ASCII
characters are kept as-is, and non-ASCII characters are replaced with
"a hex escape".

	set application_name to 'abc漢字Abc';
	SET
	test16=# show application_name;
	        application_name
	--------------------------------
	 abc\xe6\xbc\xa2\xe5\xad\x97Abc
	(1 row)

I see multiple escapes, so I am not sure about "a hex escape".

To properly render it back to 'abc漢字Abc', here is how I do it:

	select 'abc' || convert_from(decode('e6bca2e5ad97', 'hex'), 'UTF8') || 'Abc';

I guess it's still painful if your application_name has non-ASCII chars.
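The per-byte replacement observed above — each non-ASCII byte of the UTF-8
encoding becomes its own \xNN escape — can be reproduced outside the server.
A sketch in Python of the behavior as described in this thread (illustrative
only, not the actual C escaping code):

```python
def escape_name(name: str) -> str:
    """Replace each non-printable or non-ASCII byte of the UTF-8 encoding
    with a \\xNN escape, keeping printable ASCII as-is."""
    out = []
    for b in name.encode("utf-8"):
        if 32 <= b < 127:            # printable ASCII passes through
            out.append(chr(b))
        else:                        # each high/control byte escaped
            out.append("\\x%02x" % b)
    return "".join(out)

print(escape_name("abc漢字Abc"))
# -> abc\xe6\xbc\xa2\xe5\xad\x97Abc
```

This matches the `show application_name` output above: 漢 is the three UTF-8
bytes e6 bc a2 and 字 is e5 ad 97, hence the six escapes.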
Re: Assert !bms_overlap(joinrel->relids, required_outer)
On Wed, Jun 28, 2023 at 10:09 PM Tom Lane wrote:
> However, given that what we need is to exclude parameterization
> that depends on the currently-formed OJ, it seems to me we can do
> it more simply and without any new JoinPathExtraData field,
> as attached.  What do you think?

I think it makes sense.  At first I wondered if we should also exclude
parameterization that depends on OJs that have already been formed as
part of this joinrel.  But it seems not possible that the input paths
have a parameterization dependency on these OJs.  So it should be
sufficient to only consider the currently-formed OJ.

> > * I think we need to check the incompatible relids also in
> > try_hashjoin_path and try_mergejoin_path besides try_nestloop_path.
>
> I think this isn't necessary, at least in my formulation.
> Those cases will go through calc_non_nestloop_required_outer
> which has
>
> 	/* neither path can require rels from the other */
> 	Assert(!bms_overlap(outer_paramrels, inner_path->parent->relids));
> 	Assert(!bms_overlap(inner_paramrels, outer_path->parent->relids));
>
> In order to have a dependency on an OJ, a path would have to have
> a dependency on at least one of the OJ's base relations too, so
> I think these assertions show that the case won't arise.  (Of
> course, if someone can trip one of these assertions, I'm wrong.)

Hmm, while this holds in most cases, it does not if the joins have been
commuted according to identity 3.  If we change the t3/t4 join's qual to
't3.a = t4.a' to make hashjoin possible, we'd see the same Assert failure
through try_hashjoin_path.  I think it's also possible for merge join.

explain (costs off)
select 1 from t t1
    join lateral
      (select t1.a from (select 1) foo offset 0) s1 on true
    join
      (select 1 from t t2
         inner join t t3
           left join t t4
             left join t t5 on t4.a = 1
           on t3.a = t4.a
         on false
       where t3.a = coalesce(t5.a, 1)) as s2
    on true;
server closed the connection unexpectedly

Thanks
Richard
RE: add \dpS to psq [16beta1]
Hi,

Thank you for developing a good feature.  I found while testing
PostgreSQL 16 Beta 1 that the output of the \? metacommand did not
include \dS, \dpS.  The attached patch changes the output of the \?
meta command to:

Current output
psql=# \?
  \z     [PATTERN]      same as \dp
  \dp    [PATTERN]      list table, view, and sequence access privileges

Patched output
psql=# \?
  \dp[S] [PATTERN]      list table, view, and sequence access privileges
  \z[S]  [PATTERN]      same as \dp

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Nathan Bossart
Sent: Tuesday, January 10, 2023 2:46 AM
To: Dean Rasheed
Cc: Maxim Orlov; Andrew Dunstan; Michael Paquier; Tom Lane; Isaac Morland; Justin Pryzby; Pavel Luzanov; pgsql-hack...@postgresql.org
Subject: Re: add \dpS to psql

On Sat, Jan 07, 2023 at 11:18:59AM +, Dean Rasheed wrote:
> It might be true that temp tables aren't usually interesting from a
> permissions point of view, but it's not hard to imagine situations
> where interesting things do happen.  It's also probably the case that
> most users won't have many temp tables, so I don't think including
> them by default will be particularly intrusive.
>
> Also, from a user perspective, I think it would be something of a POLA
> violation for \dp[S] and \dt[S] to include different sets of tables,
> though I appreciate that we do that now.  There's nothing in the docs
> to indicate that that's the case.

Agreed.

> Anyway, I've pushed the v2 patch as-is.  If anyone feels strongly
> enough that we should change its behaviour for temp tables, then we
> can still discuss that.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

psql_dpS_metacommand_v1.diff
Description: psql_dpS_metacommand_v1.diff
Re: POC, WIP: OR-clause support for indexes
Em qua., 28 de jun. de 2023 às 18:45, Tomas Vondra
<tomas.von...@enterprisedb.com> escreveu:

> On 6/27/23 20:55, Ranier Vilela wrote:
> > Hi,
> >
> >> I finished writing the code patch for transformation "Or" expressions to
> >> "Any" expressions.  I didn't see any problems in regression tests, even
> >> when I changed the constant at which the minimum or expression is
> >> replaced by any at 0.  I ran my patch on sqlancer and so far the code has
> >> never fallen.
> > Thanks for working on this.
> >
> > I took the liberty of making some modifications to the patch.
> > I didn't compile or test it.
> > Please feel free to use them.
>
> I don't want to be rude, but this doesn't seem very helpful.

Sorry, it was not my intention to cause interruptions.

> - You made some changes, but you don't even attempt to explain what you
>   changed or why you changed it.

1. Reduce scope
2. Eliminate unnecessary variables
3. Eliminate unnecessary expressions

> - You haven't even tried to compile the code, nor tested it.  If it
>   happens to compile, how could others even know it actually behaves the
>   way you wanted?

Attached v2 with make check passing all tests.
Ubuntu 64 bits, gcc 64 bits.

> - You responded in a way that breaks the original thread, so it's not
>   clear which message you're responding to.

It was a pretty busy day.  Sorry for the noise; I hope I was of some help.

regards,
Ranier Vilela

P.S.
0001-Replace-clause-X-N1-OR-X-N2-.-with-X-ANY-N1-N2-on.patch fails with 4 tests.
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 346fd272b6..74c256258d 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -95,6 +95,291 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 										 A_Expr *distincta, Node *arg);
 
+typedef struct OrClauseGroupEntry
+{
+	Node	   *node;
+	List	   *consts;
+	Oid			scalar_type;
+	Oid			opno;
+	Expr	   *expr;
+} OrClauseGroupEntry;
+
+static int	const_transform_or_limit = 0;
+
+static Node *
+transformBoolExprOr(ParseState *pstate, Expr *expr_orig)
+{
+	List	   *or_list = NIL;
+	List	   *groups_list = NIL;
+	ListCell   *lc_eargs;
+	Node	   *result;
+	BoolExpr   *expr;
+	const char *opname;
+	bool		change_apply = false;
+	bool		or_statement;
+
+	Assert(IsA(expr, BoolExpr));
+
+	expr = (BoolExpr *) expr_orig;
+	if (list_length(expr->args) < const_transform_or_limit)
+		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+
+	/* If this is not expression "Or", then will do it the old way. */
+	switch (expr->boolop)
+	{
+		case AND_EXPR:
+			opname = "AND";
+			break;
+		case OR_EXPR:
+			return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
+			break;
+		case NOT_EXPR:
+			opname = "NOT";
+			break;
+		default:
+			elog(ERROR, "unrecognized boolop: %d", (int) expr->boolop);
+			opname = NULL;		/* keep compiler quiet */
+			break;
+	}
+
+	/*
+	 * NOTE:
+	 * It is an OR-clause. So, rinfo->orclause is a BoolExpr node, contains
+	 * a list of sub-restrictinfo args, and rinfo->clause - which is the
+	 * same expression, made from bare clauses. To not break selectivity
+	 * caches and other optimizations, use both:
+	 * - use rinfos from orclause if no transformation needed
+	 * - use bare quals from rinfo->clause in the case of transformation,
+	 * to create new RestrictInfo: in this case we have no options to avoid
+	 * selectivity estimation procedure.
+	 */
+	or_statement = false;
+	expr = (BoolExpr *) copyObject(expr_orig);
+	foreach(lc_eargs, expr->args)
+	{
+		A_Expr	   *arg = (A_Expr *) lfirst(lc_eargs);
+		Node	   *bare_orarg;
+		Node	   *const_expr;
+		Node	   *non_const_expr;
+		ListCell   *lc_groups;
+		OrClauseGroupEntry *gentry;
+		bool		allow_transformation;
+
+		/*
+		 * The first step: checking that the expression consists only of
+		 * equality.  We can only do this here, while arg is still row data
+		 * type,
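The core idea of the patch under discussion — rewriting "x = 1 OR x = 2"
into "x = ANY('{1,2}')" — amounts to grouping the OR arms by their left-hand
expression and operator and collecting the constants.  A sketch of that
grouping step in Python (illustrative only; the real patch operates on
parser Node trees, not tuples):

```python
from collections import defaultdict

def group_or_arms(arms):
    """Group arms of an OR clause of the form (column, op, constant) by
    (column, op), as a precursor to emitting "col op ANY(consts)"."""
    groups = defaultdict(list)
    for column, op, const in arms:
        groups[(column, op)].append(const)
    return dict(groups)

arms = [("x", "=", 1), ("x", "=", 2), ("y", "=", 3)]
print(group_or_arms(arms))
# -> {('x', '='): [1, 2], ('y', '='): [3]}
```

Only groups with more than one constant are worth rewriting; a singleton
group like ('y', '=') would be left as a plain comparison.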
Re: pg_rewind WAL segments deletion pitfall
At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia wrote in
> On 2022-09-29 17:18, Polina Bungina wrote:
> > I agree with your suggestions, so here is the updated version of
> > patch. Hope I haven't missed anything.
> > Regards,
> > Polina Bungina
>
> Thanks for working on this!
> It seems like we are also facing the same issue.

Thanks for looking at this.

> I tested the v3 patch under our condition; the old primary succeeded
> in becoming the new standby.
>
> BTW, when I used pg_rewind-removes-wal-segments-reproduce.sh attached
> in [1], the old primary also failed to become a standby:
>
> FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 00020007 has already been removed
>
> However, I think this is not a problem: just adding restore_command
> like below fixed the situation.
>
> echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
> oldprim/postgresql.conf

I thought along the same lines at first, but that's not the point here.
The problem we want to address is that pg_rewind ultimately removes
certain crucial WAL files required for the new primary to start, despite
them being present previously.  In other words, that restore_command
works, but it only undoes what pg_rewind wrongly did, resulting in
unnecessary consumption of I/O and/or network bandwidth that essentially
serves no purpose.

pg_rewind already has a feature that determines how each file should be
handled, but it is currently making wrong decisions for WAL files.  The
goal here is to rectify this behavior and ensure that pg_rewind makes
the right decisions.

> Attached modified reproduction script for reference.
>
> [1] https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Another incorrect comment for pg_stat_statements
On Thu, 29 Jun 2023 at 08:19, Michael Paquier wrote:
> On Wed, Jun 28, 2023 at 09:26:02PM +0800, Japin Li wrote:
>> +1. LGTM.
>
> Nothing much to add, so applied with the initial comment fix.

Thanks!

-- 
Regards,
Japin Li.
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
On Wed, Jun 28, 2023 at 10:45:02AM +0200, Peter Eisentraut wrote:
> Right, the usual style is just to check whether an environment variable is
> set to something, not what it is.
>
> Also note that in general not all environment variables are processed by
> Perl, so I would avoid encoding Perl semantics about what is "true" or
> whatever into it.

Agreed.  I am not sure that this is worth changing to have boolean-like
checks.  Hence, I would also like to keep the patch so that it checks if
the environment variable is defined to enforce the behavior, without
checking for a specific value.

-- 
Michael
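The distinction being settled here — "defined at all" versus "set to a
truthy value" — can be made concrete with a small sketch.  The real TAP
tests are Perl; this Python version is only an illustration of the agreed
semantics:

```python
import os

def nocleanup_requested(env=os.environ):
    """PG_TEST_NOCLEAN semantics as discussed in the thread: honor the
    variable whenever it is defined, regardless of its value (even "0"
    or ""), rather than interpreting the value as a boolean."""
    return "PG_TEST_NOCLEAN" in env

assert nocleanup_requested({"PG_TEST_NOCLEAN": "1"})  # defined: keep tempdirs
assert nocleanup_requested({"PG_TEST_NOCLEAN": "0"})  # still defined: keep
assert not nocleanup_requested({})                    # undefined: clean up
```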
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 09:26:02PM +0800, Japin Li wrote:
> +1. LGTM.

Nothing much to add, so applied with the initial comment fix.

-- 
Michael
Re: Changing types of block and chunk sizes in memory contexts
Hi,

On 2023-06-28 17:56:55 -0400, Tom Lane wrote:
> Tomas Vondra writes:
> > ... 4B is tiny compared to what we waste due to the doubling.
>
> Yeah.  I've occasionally wondered if we should rethink aset.c's
> "only power-of-2 chunk sizes" rule.  Haven't had the bandwidth
> to pursue the idea though.

Me too.  It'd not be trivial to do without also incurring performance
overhead.

A somewhat easier thing we could try is to carve the "rounding up" space
into smaller chunks, similar to what we do for full blocks.  It wouldn't
make sense to do that for the smaller size classes, but above 64-256
bytes or such, I think the wins might be big enough to outweigh the
costs?  Of course that doesn't guarantee that the memory in those
smaller size classes is going to be used...

Greetings,

Andres Freund
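To illustrate the "waste due to the doubling" being discussed: rounding
every request up to the next power of 2 can waste almost half of each
allocation in the worst case.  A simplified sketch (the real aset.c also
has a minimum chunk size and per-chunk header overhead):

```python
def poweroftwo_roundup(n):
    """Round a request up to the next power of 2, as aset.c does for
    chunk sizes (simplified model)."""
    p = 1
    while p < n:
        p <<= 1
    return p

# Worst-case waste approaches 50%: a 513-byte request occupies 1024 bytes,
# wasting 511 bytes -- the "rounding up" space Andres suggests carving up.
n = 513
alloc = poweroftwo_roundup(n)
print(alloc, alloc - n)
# -> 1024 511
```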
Add more sanity checks around callers of changeDependencyFor()
Hi all,

While working on a different patch, I have noted three code paths that
call changeDependencyFor() but don't check that they do not return
errors.  In all three cases (support function, extension/schema and
object/schema), it seems to me that only one dependency update is
expected.

I am adding that to the next CF.  Thoughts or comments about the
attached?

-- 
Michael

From 2b66afedbfd27a80beae3a48e0eb362e21ef4547 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 29 Jun 2023 08:35:23 +0900
Subject: [PATCH] Add checks for values returned by changeDependencyFor()

---
 src/backend/commands/alter.c        |  6 --
 src/backend/commands/extension.c    |  6 --
 src/backend/commands/functioncmds.c | 10 +++---
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index e95dc31bde..455d8dd86a 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -848,8 +848,10 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
 	pfree(replaces);
 
 	/* update dependencies to point to the new schema */
-	changeDependencyFor(classId, objid,
-						NamespaceRelationId, oldNspOid, nspOid);
+	if (changeDependencyFor(classId, objid,
+							NamespaceRelationId, oldNspOid, nspOid) != 1)
+		elog(ERROR, "failed to change schema dependency for object %u",
+			 objid);
 
 	InvokeObjectPostAlterHook(classId, objid, 0);
 
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 0eabe18335..e95642e14f 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -2948,8 +2948,10 @@ AlterExtensionNamespace(const char *extensionName, const char *newschema, Oid *o
 	table_close(extRel, RowExclusiveLock);
 
 	/* update dependencies to point to the new schema */
-	changeDependencyFor(ExtensionRelationId, extensionOid,
-						NamespaceRelationId, oldNspOid, nspOid);
+	if (changeDependencyFor(ExtensionRelationId, extensionOid,
+							NamespaceRelationId, oldNspOid, nspOid) != 1)
+		elog(ERROR, "failed to change schema dependency for extension %s",
+			 NameStr(extForm->extname));
 
 	InvokeObjectPostAlterHook(ExtensionRelationId, extensionOid, 0);
 
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 49c7864c7c..5ec2c002d2 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -1450,9 +1450,13 @@ AlterFunction(ParseState *pstate, AlterFunctionStmt *stmt)
 
 	/* Add or replace dependency on support function */
 	if (OidIsValid(procForm->prosupport))
-		changeDependencyFor(ProcedureRelationId, funcOid,
-							ProcedureRelationId, procForm->prosupport,
-							newsupport);
+	{
+		if (changeDependencyFor(ProcedureRelationId, funcOid,
+								ProcedureRelationId, procForm->prosupport,
+								newsupport) != 1)
+			elog(ERROR, "failed to change support dependency for function %s",
+				 get_func_name(funcOid));
+	}
 	else
 	{
 		ObjectAddress referenced;
-- 
2.40.1
Re: Changing types of block and chunk sizes in memory contexts
Hi,

On 2023-06-28 23:26:00 +0200, Tomas Vondra wrote:
> Yeah. FWIW I was interested what the patch does in practice, so I
> checked what pahole says about impact on struct sizes:
>
> AllocSetContext 224B -> 208B (4 cachelines)
> GenerationContext 152B -> 136B (3 cachelines)
> SlabContext 200B -> 200B (no change, adds 4B hole)
>
> Nothing else changes, AFAICS. I find it hard to believe this could have
> any sort of positive benefit - I doubt we ever have enough contexts for
> this to matter.

I don't think it's that hard to believe. We create a lot of memory contexts that we never or just barely use. Just reducing the number of cachelines touched for that can't hurt.

This doesn't quite get us to reducing the size to a lower number of cachelines, but it's a good step. There are a few other fields that we can get rid of:

- Afaics AllocSet->keeper is unnecessary these days, as it is always allocated together with the context itself. Saves 8 bytes.

- The set of memory context types isn't runtime extensible. We could replace MemoryContextData->methods with a small integer index into mcxt_methods. I think that might actually end up being as cheap as, or even cheaper than, the current approach. Saves 8 bytes. That's sufficient for going to 3 cachelines.

- We could store the power of 2 for initBlockSize, nextBlockSize, maxBlockSize, instead of the "raw" value. That'd make them one byte each. Which also would get rid of the concerns around needing a "mini_size_t" type.

- freeListIndex could be a single byte as well (saving 7 bytes, as right now we lose 4 trailing bytes due to padding).

That would save another 12 bytes, if I calculate correctly. 25% shrinkage together ain't bad.

Greetings,

Andres Freund
vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases
While working on some other patches, I found myself wanting to use the following command to vacuum the catalogs in all databases in a cluster:

vacuumdb --all --schema pg_catalog

However, this presently fails with the following error:

cannot vacuum specific schema(s) in all databases

AFAICT there is no technical reason to block this, and the resulting behavior feels intuitive to me, so I wrote 0001 to allow it. 0002 allows specifying tables to process in all databases in clusterdb, and 0003 allows specifying tables, indexes, schemas, or the system catalogs to process in all databases in reindexdb.

I debated also allowing users to specify different types of objects in the same command (e.g., "vacuumdb --schema myschema --table mytable"), but it looked like this would require a more substantial rewrite, and I didn't feel that the behavior was intuitive. For the example I just gave, does the user expect us to process both the "myschema" schema and the "mytable" table, or does the user want us to process the "mytable" table in the "myschema" schema? In vacuumdb, this is already blocked, but reindexdb accepts combinations of tables, schemas, and indexes (yet disallows specifying --system along with other types of objects). Since this is inconsistent with vacuumdb and IMO ambiguous, I've restricted such combinations in 0003.

Thoughts?
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 272375cee9214da54f423241b5bee7b8a1f8faa3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 28 Jun 2023 15:12:18 -0700 Subject: [PATCH v1 1/3] vacuumdb: allow specifying tables or schemas to process in all databases --- src/bin/scripts/t/100_vacuumdb.pl | 26 +- src/bin/scripts/vacuumdb.c| 19 +-- 2 files changed, 18 insertions(+), 27 deletions(-) diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl index eccfcc54a1..43fba676f1 100644 --- a/src/bin/scripts/t/100_vacuumdb.pl +++ b/src/bin/scripts/t/100_vacuumdb.pl @@ -161,7 +161,7 @@ $node->issues_sql_like( 'vacuumdb --schema'); $node->issues_sql_like( [ 'vacuumdb', '--exclude-schema', '"Foo"', 'postgres' ], - qr/(?:(?!VACUUM "Foo".bar).)*/, + qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) "Foo".bar).)*/, 'vacuumdb --exclude-schema'); $node->command_fails_like( [ 'vacuumdb', '-N', 'pg_catalog', '-t', 'pg_class', 'postgres', ], @@ -175,18 +175,18 @@ $node->command_fails_like( [ 'vacuumdb', '-n', 'pg_catalog', '-N', '"Foo"', 'postgres' ], qr/cannot vacuum all tables in schema\(s\) and exclude schema\(s\) at the same time/, 'cannot use options -n and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-N', '"Foo"' ], - qr/cannot exclude specific schema\(s\) in all databases/, - 'cannot use options -a and -N at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-n', '"Foo"' ], - qr/cannot vacuum specific schema\(s\) in all databases/, - 'cannot use options -a and -n at the same time'); -$node->command_fails_like( - [ 'vacuumdb', '-a', '-t', '"Foo".bar' ], - qr/cannot vacuum specific table\(s\) in all databases/, - 'cannot use options -a and -t at the same time'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-N', 'pg_catalog' ], + qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class).)*/, + 'vacuumdb -a -N'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-n', 'pg_catalog' ], + 
qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -n'); +$node->issues_sql_like( + [ 'vacuumdb', '-a', '-t', 'pg_class' ], + qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/, + 'vacuumdb -a -t'); $node->command_fails_like( [ 'vacuumdb', '-a', '-d', 'postgres' ], qr/cannot vacuum all databases and a specific one at the same time/, diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c index 4b17a07089..d7f4871198 100644 --- a/src/bin/scripts/vacuumdb.c +++ b/src/bin/scripts/vacuumdb.c @@ -72,6 +72,7 @@ static void vacuum_one_database(ConnParams *cparams, static void vacuum_all_databases(ConnParams *cparams, vacuumingOptions *vacopts, bool analyze_in_stages, + SimpleStringList *objects, int concurrentCons, const char *progname, bool echo, bool quiet); @@ -376,6 +377,7 @@ main(int argc, char *argv[]) vacuum_all_databases(, , analyze_in_stages, + , concurrentCons, progname, echo, quiet); } @@ -427,18 +429,6 @@ check_objfilter(void) (objfilter & OBJFILTER_DATABASE)) pg_fatal("cannot vacuum all databases and a specific one at the same time"); - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter & OBJFILTER_TABLE)) - pg_fatal("cannot vacuum specific table(s) in all databases"); - - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter & OBJFILTER_SCHEMA)) - pg_fatal("cannot vacuum specific schema(s) in all databases"); - - if ((objfilter & OBJFILTER_ALL_DBS) && - (objfilter &
Re: Add GUC to tune glibc's malloc implementation.
Hi, On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote: > I see it as a way to have *some* sort of control over the malloc > implementation we use, instead of tuning our allocations pattern on top of it > while treating it entirely as a black box. As for the tuning, I proposed > earlier to replace this parameter expressed in terms of size as a "profile" > (greedy / conservative) to make it easier to pick a sensible value. I don't think that makes it very usable - we'll still have idle connections use up a lot more memory than now in some cases, and not in others, even though it doesn't help. And it will be very heavily dependent on the OS and glibc version. > Le mardi 27 juin 2023, 20:17:46 CEST Andres Freund a écrit : > > > Except if you hinted we should write our own directly instead ? > > > > We e.g. could keep a larger number of memory blocks reserved > > > > ourselves. Possibly by delaying the release of additionally held blocks > > > > until we have been idle for a few seconds or such. > > > > > > I think keeping work_mem around after it has been used a couple times make > > > sense. This is the memory a user is willing to dedicate to operations, > > > after all. > > > > The biggest overhead of returning pages to the kernel is that that triggers > > zeroing the data during the next allocation. Particularly on multi-node > > servers that's surprisingly slow. It's most commonly not the brk() or > > mmap() themselves that are the performance issue. > > > > Indeed, with your benchmark, I see that most of the time, on my dual Xeon > > Gold 5215 workstation, is spent zeroing newly allocated pages during page > > faults. That microarchitecture is worse at this than some others, but it's > > never free (or cache friendly). > > I'm not sure I see the practical difference between those, but that's > interesting. Were you able to reproduce my results ? I see a bit smaller win than what you observed, but it is substantial. 
The runtime difference between the "default" and "cached" malloc is almost entirely in these bits:

cached:
-  8.93% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
   - __memmove_evex_unaligned_erms
      + 6.77% minimal_tuple_from_heap_tuple
      + 2.04% _int_realloc
      + 0.04% AllocSetRealloc
        0.02% 0x56281094806f
        0.02% 0x56281094e0bf

vs uncached:
- 14.52% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
     8.61% asm_exc_page_fault
   - 5.91% __memmove_evex_unaligned_erms
      + 5.78% minimal_tuple_from_heap_tuple
        0.04% 0x560130a2900f
        0.02% 0x560130a20faf
      + 0.02% AllocSetRealloc
      + 0.02% _int_realloc
+  3.81% postgres [kernel.vmlinux] [k] native_irq_return_iret
+  1.88% postgres [kernel.vmlinux] [k] __handle_mm_fault
+  1.76% postgres [kernel.vmlinux] [k] clear_page_erms
+  1.67% postgres [kernel.vmlinux] [k] get_mem_cgroup_from_mm
+  1.42% postgres [kernel.vmlinux] [k] cgroup_rstat_updated
+  1.00% postgres [kernel.vmlinux] [k] get_page_from_freelist
+  0.93% postgres [kernel.vmlinux] [k] mtree_range_walk

None of the latter are visible in a profile in the cached case. I.e. the overhead is encountering page faults and individually allocating the necessary memory in the kernel. This isn't surprising, I just wanted to make sure I entirely understand.

Part of the reason this code is a bit worse is that it's using generation.c, which doesn't cache any part of the context. Not that aset.c's level of caching would help a lot, given that it caches the context itself, not later blocks.

> > FWIW, in my experience trimming the brk()ed region doesn't work reliably
> > enough in real world postgres workloads to be worth relying on (from a
> > memory usage POV). Sooner or later you're going to have longer lived
> > allocations placed that will prevent it from happening.
> I'm not sure I follow: given our workload is clearly split at queries and
> transactions boundaries, releasing memory at that time, I've assumed (and
> noticed in practice, albeit not on a production system) that most memory at
> the top of the heap would be trimmable as we don't keep much in between
> queries / transactions.

That's true for very simple workloads, but once you're beyond that you just need some longer-lived allocation to happen. E.g. some relcache / catcache miss during the query execution, and there's no extant memory in CacheMemoryContext, so a new block is allocated.

Greetings,

Andres Freund
Re: Row pattern recognition
On 6/28/23 14:17, Tatsuo Ishii wrote:
> Small question. This query (with all the defaults made explicit):
>
> SELECT s.company, s.tdate, s.price,
>        FIRST_VALUE(s.tdate) OVER w,
>        LAST_VALUE(s.tdate) OVER w,
>        lowest OVER w
> FROM stock AS s
> WINDOW w AS (
>   PARTITION BY s.company
>   ORDER BY s.tdate
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   EXCLUDE NO OTHERS
>   MEASURES LAST(DOWN) AS lowest
>   AFTER MATCH SKIP PAST LAST ROW
>   INITIAL
>   PATTERN (START DOWN+ UP+)
>   DEFINE
>     START AS TRUE,
>     UP AS price > PREV(price),
>     DOWN AS price < PREV(price)
> );
>
> LAST(DOWN) AS lowest should be "LAST(DOWN.price) AS lowest"?

Yes, it should be. And the tdate='07-08-2023' row in the first resultset should have '07-08-2023' and '07-10-2023' as its 4th and 5th columns. Since my brain is doing the processing instead of postgres, I made some human errors. :-)
--
Vik Fearing
Re: several attstattarget-related improvements
Tomas Vondra writes: > On 6/28/23 16:52, Peter Eisentraut wrote: >> - 0001: Change type of pg_statistic_ext.stxstattarget, to match >> attstattarget. Maybe this should go into PG16, for consistency? > Not sure about 0001 vs PG16, it'd require catversion bump. Yeah, past beta1 I think we should be conservative about bumping catversion. Suggest you hold this for now, and if we find some more-compelling reason for a catversion bump in v16, we can sneak it in at that time. Otherwise, I won't cry if it waits for v17. regards, tom lane
Re: Changing types of block and chunk sizes in memory contexts
Tomas Vondra writes: > ... 4B is tiny compared to what we waste due to the doubling. Yeah. I've occasionally wondered if we should rethink aset.c's "only power-of-2 chunk sizes" rule. Haven't had the bandwidth to pursue the idea though. regards, tom lane
Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
On 6/26/23 20:15, Alena Rybakina wrote: > Hi, all! > > On 24.06.2023 14:23, Tomas Vondra wrote: >> On 6/24/23 02:08, Tom Lane wrote: >>> Tomas Vondra writes: The problem is that the selectivity for "IS NULL" is estimated using the table-level statistics. But the LEFT JOIN entirely breaks the idea that the null_frac has anything to do with NULLs in the join result. >>> Right. >>> I wonder how to improve this, say by adjusting the IS NULL selectivity when we know to operate on the outer side of the join. We're able to do this for antijoins, so maybe we could do that here, somehow? >>> This mess is part of the long-term plan around the work I've been doing >>> on outer-join-aware Vars. We now have infrastructure that can let >>> the estimator routines see "oh, this Var isn't directly from a scan >>> of its table, it's been passed through a potentially-nulling outer >>> join --- and I can see which one". I don't have more than vague ideas >>> about what happens next, but that is clearly an essential step on the >>> road to doing better. >>> >> I was wondering if that work on outer-join-aware Vars could help with >> this, but I wasn't following it very closely. I agree the ability to >> check if the Var could be NULL due to an outer join seems useful, as it >> says whether applying raw attribute statistics makes sense or not. >> >> I was thinking about what to do for the case when that's not possible, >> i.e. when the Var refers to nullable side of the join. Knowing that this >> is happening is clearly not enough - we need to know how many new NULLs >> are "injected" into the join result, and "communicate" that to the >> estimation routines. 
>> >> Attached is a very ugly experimental patch doing that, and with it the >> estimate changes to this: >> >> QUERY PLAN >> -- >>Hash Left Join (cost=3.25..18179.88 rows=00 width=16) >>(actual time=0.528..596.151 rows=00 loops=1) >> Hash Cond: (large.id = small.id) >> Filter: ((small.id IS NULL) OR >> (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))) >> Rows Removed by Filter: 100 >> -> Seq Scan on large (cost=0.00..14425.00 rows=100 width=8) >> (actual time=0.069..176.138 rows=100 loops=1) >> -> Hash (cost=2.00..2.00 rows=100 width=8) >>(actual time=0.371..0.373 rows=100 loops=1) >>Buckets: 1024 Batches: 1 Memory Usage: 12kB >>-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) >> (actual time=0.032..0.146 rows=100 loops=1) >>Planning Time: 3.845 ms >>Execution Time: 712.405 ms >> (10 rows) >> >> Seems nice, but. The patch is pretty ugly, I don't claim it works for >> other queries or that this is exactly what we should do. It calculates >> "unmatched frequency" next to eqjoinsel_inner, stashes that info into >> sjinfo and the estimator (nulltestsel) then uses that to adjust the >> nullfrac it gets from the statistics. >> >> The good thing is this helps even for IS NULL checks on non-join-key >> columns (where we don't switch to an antijoin), but there's a couple >> things that I dislike ... >> >> 1) It's not restricted to outer joins or anything like that (this is >> mostly just my laziness / interest in one particular query, but also >> something the outer-join-aware patch might help with). >> >> 2) We probably don't want to pass this kind of information through >> sjinfo. That was the simplest thing for an experimental patch, but I >> suspect it's not the only piece of information we may need to pass to >> the lower levels of estimation code. >> >> 3) I kinda doubt we actually want to move this responsibility (to >> consider fraction of unmatched rows) to the low-level estimation >> routines (e.g. nulltestsel and various others). 
AFAICS this just >> "introduces NULLs" into the relation, so maybe we could "adjust" the >> attribute statistics (in examine_variable?) by inflating null_frac and >> modifying the other frequencies in MCV/histogram. >> >> 4) But I'm not sure we actually want to do that in these low-level >> selectivity functions. The outer join essentially produces output with >> two subsets - one with matches on the outer side, one without them. But >> the side without matches has NULLs in all columns. In a way, we know >> exactly how are these columns correlated - if we do the usual estimation >> (even with the null_frac adjusted), we just throw this information away. >> And when there's a lot of rows without a match, that seems bad. >> >> So maybe we should split the join estimate into two parts, one for each >> subset of the join result. One for the rows with a match (and then we >> can just do what we do now, with the attribute stats we already have). >> And one for the "unmatched part" where we know the values on the
Re: POC, WIP: OR-clause support for indexes
On 6/27/23 20:55, Ranier Vilela wrote:
> Hi,
>
>> I finished writing the code patch for transformation "Or" expressions to
>> "Any" expressions. I didn't see any problems in regression tests, even
>> when I changed the constant at which the minimum or expression is
>> replaced by any at 0. I ran my patch on sqlancer and so far the code has
>> never fallen.
> Thanks for working on this.
>
> I took the liberty of making some modifications to the patch.
> I didn't compile or test it.
> Please feel free to use them.

I don't want to be rude, but this doesn't seem very helpful.

- You made some changes, but you don't even attempt to explain what you changed or why you changed it.

- You haven't even tried to compile the code, nor tested it. If it happens to compile, how could others even know it actually behaves the way you wanted?

- You responded in a way that breaks the original thread, so it's not clear which message you're responding to.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: several attstattarget-related improvements
On 6/28/23 16:52, Peter Eisentraut wrote: > Here are a few patches related to attstattarget: > > - 0001: Change type of pg_statistic_ext.stxstattarget, to match > attstattarget. Maybe this should go into PG16, for consistency? > > - 0002: Add macro for maximum statistics target, instead of hardcoding > it everywhere. > > - 0003: Take pg_attribute out of VacAttrStats. This simplifies some > code, especially for extended statistics, which had to have weird > workarounds. +1 to all three patches Not sure about 0001 vs PG16, it'd require catversion bump. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Changing types of block and chunk sizes in memory contexts
On 6/28/23 12:59, Tom Lane wrote: > David Rowley writes: >> Perhaps it's ok to leave the context creation functions with Size >> typed parameters and then just Assert the passed-in sizes are not >> larger than 1GB within the context creation function. > > Yes, I'm strongly opposed to not using Size/size_t in the mmgr APIs. > If we go that road, we're going to have a problem when someone > inevitably wants to pass a larger-than-GB value for some context > type. +1 > What happens in semi-private structs is a different matter, although > I'm a little dubious that shaving a couple of bytes from context > headers is a useful activity. The self-documentation argument > still has some force there, so I agree with Peter that some positive > benefit has to be shown. > Yeah. FWIW I was interested what the patch does in practice, so I checked what pahole says about impact on struct sizes: AllocSetContext 224B -> 208B (4 cachelines) GenerationContext 152B -> 136B (3 cachelines) SlabContext 200B -> 200B (no change, adds 4B hole) Nothing else changes, AFAICS. I find it hard to believe this could have any sort of positive benefit - I doubt we ever have enough contexts for this to matter. When I first saw the patch I was thinking it's probably changing how we store the per-chunk requested_size. Maybe that'd make a difference, although 4B is tiny compared to what we waste due to the doubling. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Document efficient self-joins / UPDATE LIMIT techniques.
This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctids in self joins
* A practical usage of SKIP LOCKED

The reasoning for this is a bit long, but if you're interested, keep reading.

In the past, there has been a desire to see a LIMIT clause of some sort on UPDATE and DELETE statements. The reason for this usually stems from having a large archive or backfill operation that if done in one single transaction would overwhelm normal operations, either by the transaction failing outright, locking too many rows, flooding the WAL causing replica lag, or starving other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it isn't in the SQL Standard, and UPDATE/DELETE operations are unordered operations, so updating 1000 rows randomly isn't a great idea. The people wanting the LIMIT clause were undeterred by this, because they know that they intend to keep issuing updates until they run out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
    SELECT t.id
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed );

This wouldn't interfere with any other updates, so I felt good about it running when the system was not-too-busy. I'd then write a script to run that in a loop, with sleeps to allow the replicas a chance to catch their breath. Then, when the rowcount finally dipped below 1000, I'd issue the final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that there are at most 1000 rows lingering out there, so waiting on locks for those was no big deal. But a query like this involves one scan along one index (or worse, a seq scan) followed by another scan, either index or seq.
Either way, we're taking up a lot of cache with rows we don't even care about. Then in v12, the query planner got hip to bitmap tidscans, allowing for this optimization: WITH doomed AS ( SELECT t.ctid AS tid FROM my_table AS t WHERE t.expiration_date < :'some_archive_date' FOR UPDATE SKIP LOCKED LIMIT 1000 ) DELETE FROM my_table USING doomed WHERE my_table.ctid = doomed.tid; And this works pretty well, especially if you set up a partial index to meet the quals in the CTE. But we don't document this anywhere, and until UPDATE and DELETE get a LIMIT clause, we probably should document this workaround. From 209fd8abe50603e85ca0cc07aecd72b87889e757 Mon Sep 17 00:00:00 2001 From: coreyhuinker Date: Tue, 13 Jun 2023 13:00:40 -0400 Subject: [PATCH v1] Add examples that highlight the usage of the system column ctid in self-joins. Currently we do not show any examples of using ctid anywhere, nor do we address the often-requested but problematic use case of having a LIMIT clause on UPDATE and DELETE statements. These examples are a subtle way of addressing both those concerns. --- doc/src/sgml/ref/delete.sgml | 24 doc/src/sgml/ref/select.sgml | 21 + doc/src/sgml/ref/update.sgml | 23 +++ 3 files changed, 68 insertions(+) diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml index 1b81b4e7d7..cca9138843 100644 --- a/doc/src/sgml/ref/delete.sgml +++ b/doc/src/sgml/ref/delete.sgml @@ -234,6 +234,30 @@ DELETE FROM films In some cases the join style is easier to write or faster to execute than the sub-select style. + + In situations where a single operation would consume too many resources, it + may be desirable to break up a large DELETE into multiple + separate commands. 
The SQL + standard does not define a LIMIT clause for + DELETE operations, but it is possible to get the equivalent + functionality through the USING clause to a + Common Table Expression which identifies + a subset of rows to be deleted, locks those rows, and returns their system + column ctid values: + +WITH delete_batch AS ( + SELECT l.ctid + FROM user_logs AS l + WHERE l.status = 'archived' + ORDER BY l.creation_date + LIMIT 1 + FOR UPDATE ) +DELETE FROM user_logs AS ul +USING delete_batch AS del +WHERE ul.ctid = del.ctid + + This allows for flexible search criteria within the CTE and an efficient self-join. + diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 0ee0cc7e64..9d7c3d5c41 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -1676,6 +1676,27 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5; condition is not textually within the sub-query. + +In cases where a DML operation involving many rows +must be performed, and that table experiences numerous other simultaneous +DML operations, a FOR UPDATE clause +used in conjunction with SKIP
Re: Detecting use-after-free bugs using gcc's malloc() attribute
Hi, On 2023-06-28 10:40:22 +0200, Peter Eisentraut wrote: > On 26.06.23 21:54, Andres Freund wrote: > > For something like pg_list.h the malloc(free) attribute is a bit awkward to > > use, because one a) needs to list ~30 functions that can free a list and b) > > the referenced functions need to be declared. > > Hmm. Saying list_concat() "deallocates" a list is mighty confusing because > 1) it doesn't, and 2) it might actually allocate a new list. list_concat() basically behaves like realloc(), except that the "pointer is still valid" case is much more common. And the way that's modelled in the annotations is to say a function frees and allocates. Note that the free attribute references the first element for list_concat(), not the second. > So while you get the useful behavior of "you probably didn't mean to use > this variable again after passing it into list_concat()", if some other tool > actually took these allocate/deallocate decorations at face value and did a > memory leak analysis with them, they'd get completely bogus results. How would the annotations possibly lead to a bogus result? I see neither how it could lead to false negatives nor false positives? The gcc attributes are explicitly intended to track not just plain memory allocations, the example in the docs https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes is to add them for fopen() etc. So I don't think it's likely that external tools will interpret this is a much more stringent way. > > I also added such attributes to bitmapset.h and palloc() et al, but that > > didn't find existing bugs. It does find use-after-free instances if I add > > some, similarly it does find cases of mismatching palloc with free etc. > > This seems more straightforward. Even if it didn't find any bugs, I'd > imagine it would be useful during development. Agreed. Given our testing regimen (valgrind etc), I'd expect to find many such bugs before long in the tree anyway. 
But it's much nicer to catch them before they get that far. And to find paths that aren't covered by tests.

> > Do others think this would be useful enough to be worth polishing? And do you
> > agree the warnings above are bugs?
>
> I actually just played with this the other day, because I can never remember
> termPQExpBuffer() vs. destroyPQExpBuffer().

That's a pretty nasty one :(

> I couldn't quite make it work for that, but I found the feature overall
> useful, so I'd welcome support for it.

Yea, I don't think the attributes can comfortably handle initPQExpBuffer() style allocation. It's somewhat possible by moving the allocation to an inline function, and then making the thing that's allocated ->data. But it ends up pretty messy, particularly because we need ABI stability for pqexpbuffer.h.

But createPQExpBuffer() can be dealt with reasonably. Doing so points out:

[51/354 42 14%] Compiling C object src/bin/initdb/initdb.p/initdb.c.o
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c: In function ‘replace_guc_value’:
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c:566:9: warning: ‘free’ called on pointer returned from a mismatched allocation function [-Wmismatched-dealloc]
  566 |         free(newline);  /* but don't free newline->data */
      |         ^
../../../../home/andres/src/postgresql/src/bin/initdb/initdb.c:470:31: note: returned from ‘createPQExpBuffer’
  470 |         PQExpBuffer newline = createPQExpBuffer();
      |                               ^~~

which is intentional, but ... not pretty, and could very well be a bug in other cases. If we want to do stuff like that, we'd probably be better off having a dedicated version of destroyPQExpBuffer(). Although here it looks like the code should just use an on-stack PQExpBuffer.

Greetings,

Andres Freund
Re: Incremental View Maintenance, take 2
On Wed, Jun 28, 2023 at 4:06 PM Yugo NAGATA wrote:
>
> On Wed, 28 Jun 2023 00:01:02 +0800
> jian he wrote:
>
> > On Thu, Jun 1, 2023 at 2:47 AM Yugo NAGATA wrote:
> > >
> > > On Thu, 1 Jun 2023 23:59:09 +0900
> > > Yugo NAGATA wrote:
> > >
> > > > Hello hackers,
> > > >
> > > > Here's a rebased version of the patch-set adding Incremental View
> > > > Maintenance support for PostgreSQL. That was discussed in [1].
> > > >
> > > > [1] https://www.postgresql.org/message-id/flat/20181227215726.4d166b4874f8983a641123f5%40sraoss.co.jp
> > >
> > > ---
> > > * Overview
> > >
> > > Incremental View Maintenance (IVM) is a way to make materialized views
> > > up-to-date by computing only incremental changes and applying them on
> > > views. IVM is more efficient than REFRESH MATERIALIZED VIEW when
> > > only small parts of the view are changed.
> > >
> > > ** Feature
> > >
> > > The attached patchset provides a feature that allows materialized views
> > > to be updated automatically and incrementally just after an underlying
> > > table is modified.
> > >
> > > You can create an incrementally maintainable materialized view (IMMV)
> > > by using CREATE INCREMENTAL MATERIALIZED VIEW command.
> > >
> > > The followings are supported in view definition queries:
> > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins)
> > > - some built-in aggregate functions (count, sum, avg, min, max)
> > > - GROUP BY clause
> > > - DISTINCT clause
> > >
> > > Views can contain multiple tuples with the same content (duplicate tuples).
> > > > > > ** Restriction > > > > > > The following are not supported in a view definition: > > > - Outer joins > > > - Aggregates other than the above, window functions, HAVING > > > - Sub-queries, CTEs > > > - Set operations (UNION, INTERSECT, EXCEPT) > > > - DISTINCT ON, ORDER BY, LIMIT, OFFSET > > > > > > Also, a view definition query cannot contain other views, materialized > > > views, > > > foreign tables, partitioned tables, partitions, VALUES, non-immutable > > > functions, > > > system columns, or expressions that contain aggregates. > > > > > > --- > > > * Design > > > > > > An IMMV is maintained using statement-level AFTER triggers. > > > When an IMMV is created, triggers are automatically created on all base > > > tables contained in the view definition query. > > > > > > When a table is modified, changes that occurred in the table are extracted > > > as transition tables in the AFTER triggers. Then, changes that will occur > > > in > > > the view are calculated by a rewritten view query in which the modified > > > table > > > is replaced with the transition table. > > > > > > For example, if the view is defined as "SELECT * FROM R, S", and tuples > > > inserted > > > into R are stored in a transition table dR, the tuples that will be > > > inserted into > > > the view are calculated as the result of "SELECT * FROM dR, S". > > > > > > ** Multiple Tables Modification > > > > > > Multiple tables can be modified in a statement when using triggers, > > > foreign key > > > constraints, or modifying CTEs. When multiple tables are modified, we need > > > the state of tables before the modification.
> > > > > > For example, when some tuples, dR and dS, are inserted into R and S > > > respectively, > > > the tuples that will be inserted into the view are calculated by the > > > following > > > two queries: > > > > > > "SELECT * FROM dR, S_pre" > > > "SELECT * FROM R, dS" > > > > > > where S_pre is the table before the modification, and R is the current state > > > of > > > the table, that is, after the modification. This pre-update state of the table > > > is calculated by filtering inserted tuples and appending deleted tuples. > > > The subquery that represents the pre-update state is generated in > > > get_prestate_rte(). > > > Specifically, the inserted tuples are filtered by calling > > > IVM_visible_in_prestate() > > > in the WHERE clause. This function checks the visibility of tuples by using > > > the snapshot taken before the table modification. The deleted tuples are > > > contained > > > in the old transition table, and this table is appended using UNION ALL. > > > > > > Transition tables for each modification are collected in each AFTER > > > trigger > > > function call. Then, the view maintenance is performed in the last call of > > > the trigger. > > > > > > In the original PostgreSQL, tuplestores of transition tables are freed at > > > the > > > end of each nested query. However, their lifespan needs to be prolonged to > > > the end of the outermost query in order to maintain the view in the last > > > AFTER > > > trigger. For this purpose, SetTransitionTablePreserved is added in > > > trigger.c. > > > > > > ** Duplicate Tuples > > > > > > When calculating changes that will occur in the view (= delta
Re: Bytea PL/Perl transform
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?= writes: > Andrew Dunstan writes: >> On 2023-06-22 Th 16:56, Greg Sabino Mullane wrote: >>> * Do all of these transforms need to be their own contrib modules? So >>> much duplicated code across contrib/*_plperl already (and *plpython >>> too for that matter) ... >> Yeah, that's a bit of a mess. Not sure what we can do about it now. > Would it be possible to move the functions and other objects to a new > combined extension, and make the existing ones depend on that? Perhaps another way could be to accept that the packaging is what it is, but look for ways to share the repetitive source code. The .so's wouldn't get any smaller, but they're not that big anyway. regards, tom lane
Re: Assistance Needed: Issue with pg_upgrade and --link option
On 28.06.23 12:46, Laurenz Albe wrote: On Wed, 2023-06-28 at 15:40 +0530, Pradeep Kumar wrote: I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Please provide some numbers, ideally du -sk du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg That looks fine. The files exist only once, and the 41MB that only exist in the new data directory are catalog data and other stuff that is different on the new cluster. Interesting, so it actually does count files with multiple hardlinks only once.
tablecmds.c/MergeAttributes() cleanup
The MergeAttributes() and related code in and around tablecmds.c is huge and ancient, with many things bolted on over time, and difficult to deal with. I took some time to make careful incremental updates and refactorings to make the code easier to follow, more compact, and more modern in appearance. I also found several pieces of obsolete code along the way. This resulted in the attached long patch series. Each patch tries to make a single change and can be considered incrementally. At the end, the code is shorter, better factored, and I hope easier to understand. There shouldn't be any change in behavior.From 60a671aeb03293bdec65fd86f2a393c3aced6eb9 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 17:10:25 +0200 Subject: [PATCH 01/17] Remove obsolete comment about OID support --- src/backend/catalog/heap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 2a0d82aedd..9196dcd39f 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -843,9 +843,9 @@ AddNewAttributeTuples(Oid new_rel_oid, } /* -* Next we add the system attributes. Skip OID if rel has no OIDs. Skip -* all for a view or type relation. We don't bother with making datatype -* dependencies here, since presumably all these types are pinned. +* Next we add the system attributes. Skip all for a view or type +* relation. We don't bother with making datatype dependencies here, +* since presumably all these types are pinned. 
*/ if (relkind != RELKIND_VIEW && relkind != RELKIND_COMPOSITE_TYPE) { base-commit: b381d9637030c163c3b1f8a9d3de51dfc1b4ee58 -- 2.41.0 From e12a69675919fb026c008e37649518f3d52e2a90 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 22 Jun 2023 12:05:06 +0200 Subject: [PATCH 02/17] Remove ancient special case code for adding oid columns The special handling of negative attribute numbers in ATExecAddColumn() was introduced to support SET WITH OIDS (commit 6d1e361852). But that feature doesn't exist anymore, so we can revert to the previous, simpler version. In passing, also remove an obsolete comment about OID support. --- src/backend/commands/tablecmds.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index d985278ac6..a5493705aa 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -2449,8 +2449,7 @@ MergeAttributes(List *schema, List *supers, char relpersistence, /* * Scan the parents left-to-right, and merge their attributes to form a -* list of inherited attributes (inhSchema). Also check to see if we need -* to inherit an OID column. +* list of inherited attributes (inhSchema). */ child_attno = 0; foreach(entry, supers) @@ -6944,7 +6943,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, attribute.attrelid = myrelid; namestrcpy(&(attribute.attname), colDef->colname); attribute.atttypid = typeOid; - attribute.attstattarget = (newattnum > 0) ? -1 : 0; + attribute.attstattarget = -1; attribute.attlen = tform->typlen; attribute.attnum = newattnum; if (list_length(colDef->typeName->arrayBounds) > PG_INT16_MAX) @@ -7067,7 +7066,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, * is certainly not going to touch them. System attributes don't have * interesting defaults, either. 
*/ - if (RELKIND_HAS_STORAGE(relkind) && attribute.attnum > 0) + if (RELKIND_HAS_STORAGE(relkind)) { /* * For an identity column, we can't use build_column_default(), -- 2.41.0 From 58af867b3e3990e787a33c5d5023753e623dffe0 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 14:43:55 +0200 Subject: [PATCH 03/17] Remove ancient special case code for dropping oid columns The special handling of negative attribute numbers in RemoveAttributeById() was introduced to support SET WITHOUT OIDS (commit 24614a9880). But that feature doesn't exist anymore, so we can revert to the previous, simpler version. --- src/backend/catalog/heap.c | 99 +- 1 file changed, 43 insertions(+), 56 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 9196dcd39f..4c30c7d461 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1666,68 +1666,56 @@ RemoveAttributeById(Oid relid, AttrNumber attnum) attnum, relid); attStruct = (Form_pg_attribute) GETSTRUCT(tuple); - if (attnum < 0) - { - /* System attribute (probably OID) ... just delete the row */ - -
Re: eqjoinsel_semi still sucks ...
Alena Rybakina writes: > On 23.06.2023 14:28, Andrey Lepikhov wrote: >> On 2/5/2012 20:34, Tom Lane wrote: >>> On reflection I think that the idea of clamping ndistinct beforehand is >>> just wrong, and what we ought to do instead is apply a multiplier to the >>> selectivity estimate afterwards. >> I got stuck in some cases where (due to a tree of filters) the planner >> underestimates the JOIN just because the ndistinct conveys a huge >> number to the selectivity estimation formula. However, the estimation >> of both input relations is made correctly and is limited. >> I've tried to understand the logic through commits 0d3b231eebf, >> 97930cf578e and 7f3eba30c9d. But it is still not clear. >> So, why is the idea of clamping ndistinct terrible in general? Could >> you explain your reasons a bit more? > Hi, I also have an interest in understanding the same issue, and I dug > into the above commits and the topic. I hope this letter will help to > sort out this issue. Note that nothing's actually been done about the idea I expressed at the start of this thread.
The behavior is better now: if you try the example in v12 or later you get regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0); QUERY PLAN - Hash Anti Join (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1) Hash Cond: (a.thousand = b.unique2) -> Seq Scan on tenk1 a (cost=0.00..445.00 rows=1 width=244) (actual time=0.003..0.502 rows=1 loops=1) -> Hash (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1) Buckets: 8192 Batches: 1 Memory Usage: 240kB -> Seq Scan on tenk1 b (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1) Filter: (two = 0) Rows Removed by Filter: 5000 Planning Time: 0.760 ms Execution Time: 4.087 ms (10 rows) I believe this improvement just stems from commit a314c3407 ("Clamp semijoin selectivity to be not more than inner-join selectivity"), and so this example looks good only because the actual semijoin output size happens to be the same as the inner-join output size. With a slight adjustment, that's no longer true and the estimate still sucks: regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0; QUERY PLAN --- Hash Join (cost=532.50..1127.50 rows=1 width=488) (actual time=2.000..5.742 rows=1 loops=1) ... regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0); QUERY PLAN - Hash Semi Join (cost=532.50..1127.50 rows=1 width=244) (actual time=1.529..3.474 rows=5000 loops=1) ... regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0); QUERY PLAN - Hash Anti Join (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1) ... So we still have an issue to fix. I've not had time to pursue the idea I expressed at the start of the thread. 
Also, it's a bit scary to change this logic with only a few examples to look at. If you could reduce the problem cases you're looking at to something sharable, that could help move things along. regards, tom lane
several attstattarget-related improvements
Here are a few patches related to attstattarget: - 0001: Change type of pg_statistic_ext.stxstattarget, to match attstattarget. Maybe this should go into PG16, for consistency? - 0002: Add macro for maximum statistics target, instead of hardcoding it everywhere. - 0003: Take pg_attribute out of VacAttrStats. This simplifies some code, especially for extended statistics, which had to have weird workarounds.From f01d4c9aba44cc5b9028200b21a6d1df9770c5a2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 16:38:00 +0200 Subject: [PATCH 1/3] Change type of pg_statistic_ext.stxstattarget Change from int32 to int16, to match attstattarget (changed in 90189eefc1). --- doc/src/sgml/catalogs.sgml | 2 +- src/backend/commands/statscmds.c | 4 ++-- src/include/catalog/pg_statistic_ext.h | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index ed32ca0349..852cb30ae1 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7622,7 +7622,7 @@ pg_statistic_ext Columns - stxstattarget int4 + stxstattarget int2 stxstattarget controls the level of detail diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index 26ebd0819d..c2dab20bdc 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -498,7 +498,7 @@ CreateStatistics(CreateStatsStmt *stmt) values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid); values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum(); values[Anum_pg_statistic_ext_stxnamespace - 1] = ObjectIdGetDatum(namespaceId); - values[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(-1); + values[Anum_pg_statistic_ext_stxstattarget - 1] = Int16GetDatum(-1); values[Anum_pg_statistic_ext_stxowner - 1] = ObjectIdGetDatum(stxowner); values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys); values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind); @@ -676,7 +676,7 
@@ AlterStatistics(AlterStatsStmt *stmt) /* replace the stxstattarget column */ repl_repl[Anum_pg_statistic_ext_stxstattarget - 1] = true; - repl_val[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(newtarget); + repl_val[Anum_pg_statistic_ext_stxstattarget - 1] = Int16GetDatum(newtarget); newtup = heap_modify_tuple(oldtup, RelationGetDescr(rel), repl_val, repl_null, repl_repl); diff --git a/src/include/catalog/pg_statistic_ext.h b/src/include/catalog/pg_statistic_ext.h index 53eec9025a..1e64aa8f16 100644 --- a/src/include/catalog/pg_statistic_ext.h +++ b/src/include/catalog/pg_statistic_ext.h @@ -43,7 +43,7 @@ CATALOG(pg_statistic_ext,3381,StatisticExtRelationId) * object's namespace */ Oid stxowner BKI_LOOKUP(pg_authid); /* statistics object's owner */ - int32 stxstattarget BKI_DEFAULT(-1); /* statistics target */ + int16 stxstattarget BKI_DEFAULT(-1); /* statistics target */ /* * variable-length fields start here, but we allow direct access to -- 2.41.0 From a473c258577f9f9b28435feef27e8cbc25721d5c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 27 Jun 2023 14:09:39 +0200 Subject: [PATCH 2/3] Add macro for maximum statistics target The number of places where 1 was hardcoded had grown a bit beyond the comfort level. Introduce a macro MAX_STATISTICS_TARGET instead. 
--- src/backend/commands/statscmds.c| 4 ++-- src/backend/commands/tablecmds.c| 5 +++-- src/backend/statistics/extended_stats.c | 2 +- src/backend/utils/misc/guc_tables.c | 2 +- src/include/catalog/pg_attribute.h | 2 +- src/include/commands/vacuum.h | 7 +++ src/include/statistics/statistics.h | 4 ++-- 7 files changed, 17 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index c2dab20bdc..36bc8c33ba 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -619,9 +619,9 @@ AlterStatistics(AlterStatsStmt *stmt) errmsg("statistics target %d is too low", newtarget))); } - else if (newtarget > 1) + else if (newtarget > MAX_STATISTICS_TARGET) { - newtarget = 1; + newtarget = MAX_STATISTICS_TARGET; ereport(WARNING, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("lowering statistics target to %d", diff --git a/src/backend/commands/tablecmds.c
Re: Assert !bms_overlap(joinrel->relids, required_outer)
Richard Guo writes: > On Wed, Jun 28, 2023 at 6:28 AM Tom Lane wrote: >> For a real fix, I'm inclined to extend the loop that calculates >> param_source_rels (in add_paths_to_joinrel) so that it also tracks >> a set of incompatible relids that *must not* be present in the >> parameterization of a proposed path. This would basically include >> OJ relids of OJs that partially overlap the target joinrel; maybe >> we should also include the min RHS of such OJs. Then we could >> check that in try_nestloop_path. I've not tried to code this yet. > I went ahead and drafted a patch based on this idea. Hmm. This patch is the opposite of what I'd been imagining, because I was thinking we needed to add OJs to param_incompatible_relids if they were *not* already in the join, rather than if they were. However, I tried it like that and while it did stop the assertion failure, it also broke a bunch of other test cases that no longer found the parameterized-nestloop plans they were supposed to find. So clearly I just didn't have my head screwed on in the correct direction yesterday. However, given that what we need is to exclude parameterization that depends on the currently-formed OJ, it seems to me we can do it more simply and without any new JoinPathExtraData field, as attached. What do you think? > * I think we need to check the incompatible relids also in > try_hashjoin_path and try_mergejoin_path besides try_nestloop_path. I think this isn't necessary, at least in my formulation. Those cases will go through calc_non_nestloop_required_outer which has /* neither path can require rels from the other */ Assert(!bms_overlap(outer_paramrels, inner_path->parent->relids)); Assert(!bms_overlap(inner_paramrels, outer_path->parent->relids)); In order to have a dependency on an OJ, a path would have to have a dependency on at least one of the OJ's base relations too, so I think these assertions show that the case won't arise. 
(Of course, if someone can trip one of these assertions, I'm wrong.) regards, tom lane diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index c2f211a60d..4b6ed6e312 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -698,6 +698,17 @@ try_nestloop_path(PlannerInfo *root, Relids inner_paramrels = PATH_REQ_OUTER(inner_path); Relids outer_paramrels = PATH_REQ_OUTER(outer_path); + /* + * If we are forming an outer join at this join, it's nonsensical to use + * an input path that uses the outer join as part of its parameterization. + * (This can happen despite our join order restrictions, since those apply + * to what is in an input relation not what its parameters are.) + */ + if (extra->sjinfo && extra->sjinfo->ojrelid != 0 && + (bms_is_member(extra->sjinfo->ojrelid, outer_paramrels) || + bms_is_member(extra->sjinfo->ojrelid, inner_paramrels))) + return; + /* * Paths are parameterized by top-level parents, so run parameterization * tests on the parent relids. 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index 6917faec14..12b828fae3 100644 --- a/src/test/regress/expected/join.out +++ b/src/test/regress/expected/join.out @@ -5063,6 +5063,37 @@ select 1 from -- (0 rows) +explain (costs off) +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on t4.unique1 = 1 +on t4.unique1 = 1 on false + where t3.unique1 = coalesce(t5.unique1,1)) as s2 + on true; +QUERY PLAN +-- + Result + One-Time Filter: false +(2 rows) + +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on t4.unique1 = 1 +on t4.unique1 = 1 on false + where t3.unique1 = coalesce(t5.unique1,1)) as s2 + on true; + ?column? +-- +(0 rows) + -- -- check a case in which a PlaceHolderVar forces join order -- diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql index 55080bec9a..38899ed3b9 100644 --- a/src/test/regress/sql/join.sql +++ b/src/test/regress/sql/join.sql @@ -1751,6 +1751,29 @@ select 1 from on false, lateral (select i4.f1, ss1.n from int8_tbl as i8 limit 1) as ss3; +explain (costs off) +select 1 from tenk1 t1 + join lateral + (select t1.unique1 from (select 1) foo offset 0) s1 on true + join +(select 1 from tenk1 t2 +inner join tenk1 t3 + left join tenk1 t4 left join tenk1 t5 on
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
Hi Vignesh, Thanks for working on this. On Wed, Jun 28, 2023 at 4:52 PM vignesh C wrote: > > Here is a patch having the fix for the same. I have not added any > tests as the existing tests cover this scenario. The same issue is > present in back branches too. Interesting, we have a test for this scenario and it accepts erroneous output :). > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch > can be applied on master, PG15 and PG14, > v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch > patch can be applied on PG13, PG12 and PG11. > Thoughts? I noticed this when looking at Tomas's patches for logical decoding of sequences. The code block you have added is repeated in pg_decode_change() and pg_decode_truncate(). It might be better to push the conditions into pg_output_begin() itself so that any future call site of pg_output_begin() automatically takes care of these conditions. Otherwise the patches look good to me. -- Best Wishes, Ashutosh Bapat
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Hi! On 15.06.2023 11:30, Andrey Lepikhov wrote: Hi, all. Some of my clients use JOINs with three or four clauses. Quite frequently, I see complaints about an unreasonable switch of the JOIN algorithm to Merge Join instead of Hash Join. Quick research has shown one weak place - the estimation of the average bucket size in final_cost_hashjoin (see q2.sql in the attachment) uses a very conservative strategy. Unlike the estimation of groups, here we use the smallest ndistinct value across all clauses instead of multiplying them (or trying to make a multivariate analysis). It works fine for the case of one clause. But if we have many clauses, and if each has a high ndistinct value, we will overestimate the average size of a bucket and, as a result, prefer to use Merge Join. As the example in the attachment shows, this leads to a worse plan than possible, sometimes drastically worse. I assume this is done out of fear of functional dependencies between hash clause components. But in my opinion, here we should go the same way as the estimation of groups. The attached patch shows a sketch of the solution. This problem is very important. Honestly, I'm still learning your code and looking for cases where your patch changes things for the worse or for the better. But I have already found something that seemed interesting to me. I have found several other interesting cases where your patch can solve some problem and choose a more correct plan, with a focus on memory consumption. To make it easier to evaluate, I added a hook to your patch that makes it easy to switch between your way and the original way of estimating the size of buckets (diff_estimate.diff). Here are other cases where your fix improves the query plan.
First of all, I changed the way the tables are created, to look at the behavior of the query plan in terms of planning and execution time: DROP TABLE IF EXISTS a,b CASCADE; CREATE TABLE a AS SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z FROM generate_series(1,1e5) AS gs; CREATE TABLE b AS SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload FROM generate_series(1,1e5) AS gs; ANALYZE a,b; SET enable_cost_size = 'on'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; SET enable_cost_size = 'off'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; QUERY PLAN --- Hash Join (actual time=200.872..200.879 rows=0 loops=1) Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z)) -> Seq Scan on b (actual time=0.029..15.946 rows=10 loops=1) -> Hash (actual time=97.645..97.649 rows=10 loops=1) Buckets: 131072 Batches: 1 Memory Usage: 5612kB -> Seq Scan on a (actual time=0.024..17.153 rows=10 loops=1) Planning Time: 2.910 ms Execution Time: 201.949 ms (8 rows) SET QUERY PLAN --- Merge Join (actual time=687.415..687.416 rows=0 loops=1) Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z)) -> Sort (actual time=462.022..536.716 rows=10 loops=1) Sort Key: b.y, b.x, b.z Sort Method: external merge Disk: 3328kB -> Seq Scan on b (actual time=0.017..12.326 rows=10 loops=1) -> Sort (actual time=111.295..113.196 rows=16001 loops=1) Sort Key: a.y, a.x, a.z Sort Method: external sort Disk: 2840kB -> Seq Scan on a (actual time=0.020..10.129 rows=10 loops=1) Planning Time: 0.752 ms Execution Time: 688.829 ms (12 rows) Secondly, I found another case that is not about the planner preferring merge join over hash join, but where we can see that the plan has become better due to lower memory consumption and shorter planning time.
Here, with the same query, the planning time was reduced by a factor of 5, and the number of buckets by a factor of 128, so memory consumption also decreased: DROP TABLE IF EXISTS a,b CASCADE; CREATE TABLE a AS SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z FROM generate_series(1,600) AS gs; CREATE TABLE b AS SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload FROM generate_series(1,1e5) AS gs; ANALYZE a,b; SET enable_cost_size = 'on'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; SET enable_cost_size = 'off'; EXPLAIN ANALYZE SELECT * FROM a,b WHERE a.x=b.x AND a.y=b.y AND a.z=b.z; QUERY PLAN Hash Join (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651 rows=0 loops=1) Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric)) -> Seq Scan on b
Re: pg_rewind WAL segments deletion pitfall
On 2022-09-29 17:18, Polina Bungina wrote: I agree with your suggestions, so here is the updated version of patch. Hope I haven't missed anything. Regards, Polina Bungina Thanks for working on this! It seems like we are also facing the same issue. I tested the v3 patch under our condition, old primary has succeeded to become new standby. BTW when I used pg_rewind-removes-wal-segments-reproduce.sh attached in [1], old primary also failed to become standby: FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00020007 has already been removed However, I think this is not a problem: just adding restore_command like below fixed the situation. echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf Attached modified reproduction script for reference. [1]https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATIONmkdir newarch oldarch initdb -k -D oldprim echo "archive_mode = 'on'">> oldprim/postgresql.conf echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start psql -p 5432 -c 'create table t(a int)' pg_basebackup -D newprim -p 5432 echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf touch newprim/standby.signal pg_ctl -D newprim -o '-p 5433' -l newprim.log start # the last common checkpoint psql -p 5432 -c 'checkpoint' # old primary cannot archive any more echo "archive_command = 'false'">> oldprim/postgresql.conf pg_ctl -D oldprim reload # advance WAL on the old primary; four WAL segments will never make it to the archive for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done # record approx. 
diverging WAL segment start_wal=`psql -p 5432 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"` pg_ctl -D newprim promote # old rprimary loses diverging WAL segment for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done psql -p 5432 -c 'checkpoint;' psql -p 5433 -c 'checkpoint;' pg_ctl -D oldprim stop # rewind the old primary, using its own archive # pg_rewind -D oldprim --source-server='port=5433' # should fail echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf pg_rewind -D oldprim --source-server='port=5433' -c # advance WAL on the old primary; new primary loses the launching WAL seg for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done psql -p 5433 -c 'checkpoint' echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf touch oldprim/standby.signal echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf postgres -D oldprim # fails with "WAL file has been removed" ## The alternative of copying-in ## echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf # ## copy-in WAL files from new primary's archive to old primary #(cd newarch; #for f in `ls`; do # if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi #done) # #postgres -D oldprim # also fails with "requested WAL segment XXX has already been removed"
Re: Another incorrect comment for pg_stat_statements
On Wed, 28 Jun 2023 at 16:27, Richard Guo wrote: > On Wed, Jun 28, 2023 at 3:36 PM Michael Paquier wrote: > >> On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: >> > +1. To nitpick, how about we remove the blank line just before removing >> > the key for top level entry? >> > >> > - /* Also remove entries for top level statements */ >> > + /* Also remove entries if exist for top level statements */ >> >key.toplevel = true; >> > - >> > - /* Remove the key if exists */ >> >entry = (pgssEntry *) hash_search(pgss_hash, , HASH_REMOVE, NULL); >> >> Why not if it improves the overall situation. Could you send a patch >> with everything you have in mind? > > > Here is the patch. I don't have too much in mind, so the patch just > removes the blank line and revises the comment a bit. > +1. LGTM. -- Regards, Japin Li.
Re: Request for new function in view update
Hi Yugo Thank you for taking a look at the paper. The key difference from PostgreSQL is that it only allows updates on single table views. My paper discusses two kinds of joins of two tables. It discusses how to update them and how to determine when they occur. The paper also discusses unioning two tables. I've done work on table intersection and table difference too, but didn't include it in the paper. Terry Brennan On Wed, Jun 28, 2023 at 2:49 AM Yugo NAGATA wrote: > On Thu, 1 Jun 2023 12:18:47 -0500 > Terry Brennan wrote: > > > Hello all, > > > > I am a researcher in databases who would like to suggest a new > function. I > > am writing to you because you have an active developer community. Your > > website said that suggestions for new functions should go to this mailing > > list. If there is another mailing list you prefer, please let me know. > > > > My research is in updating views -- the problem of translating an update > in > > a view to an update to a set of underlying base tables. This problem has > > been partially solved for many years, including in PostgreSQL, but a > > complete solution hasn't been found. > > > > Views are useful for data independence; if users only access data through > > views, then underlying databases can change without user programs. Data > > independence requires an automatic solution to the view update problem. > > > > In my research, I went back to the initial papers about the problem. The > > most promising approach was the "constant complement" approach. It > starts > > from the idea that a view shows only part of the information in a > database, > > and that view updates should never change the part of the database that > > isn't exposed in the view. (The "complement" is the unexposed part, and > > "constant" means that a view update shouldn't change the complement.) > The > > "constant complement" constraint is intuitive, that a view update > shouldn't > > have side effects on information not available through the view. 
> > > > A seminal paper showed that defining a complement is enough, because each > > complement of a view creates a unique view update. Unfortunately, there > > are limitations. Views have multiple complements, and no unique minimal > > complement exists. Because of this limitation and other practical > > difficulties, the constant complement approach was abandoned. > > > > I used a theorem in this initial paper that other researchers didn't use, > > that shows the inverse. An update method defines a unique complement. I > > used the two theorems as a saw's upstroke and downstroke to devise view > > update methods for several relational operators. Unlike other > approaches, > > these methods have a solid mathematical foundation. > > > > Some relational operators are easy (selection), others are hard > > (projection); some have several valid update methods that can be used > > interchangeably (union) and some can have several valid update methods > that > > reflect different semantics (joins). For joins, I found clues in the > > database that can determine which update method to use. I address the > > other relational operators, but not in the attached paper > > . > > I also discuss the problem of when views can't have updates, and possible > > reasons why. > > > > I have attached my arXiv paper. I would appreciate anyone's interest in > > this topic. > > I'm interested in the view update problem because we have some works on > this topic [1][2]. > > I read your paper. Although I don't understand the theoretical part enough, > I found your proposal methods to update views as for several relational > operators. > > The method for updating selection views seems same as the way of > automatically > updatable views in the current PostgreSQL, that is, deleting/updating rows > in > a view results in deletes/updates for corresponding rows in the base table. 
> Inserting rows that are not compliant with the view is checked and prevented if the view is defined WITH CHECK OPTION. > > However, the proposed method for projection is that a column not contained in the view is updated to NULL when a row is deleted. I think it is not desirable to use NULL for such a special purpose, and above all, this differs from the current PostgreSQL behavior, so I wonder whether such a change would be accepted. I think the same applies to the method for JOIN, which also uses NULL for this special purpose. > > I think it would be nice to extend the view updatability of PostgreSQL, because the SQL standard allows more than the current limited support. In that case, I wonder whether we should follow SQL:1999 or later; maybe this would be somewhat compatible with the spec in Oracle. > > [1] https://dl.acm.org/doi/10.1145/3164541.3164584 > [2] https://www.pgcon.org/2017/schedule/events/1074.en.html > > Regards, > Yugo Nagata > > > Yours > > Terry Brennan > > > -- > Yugo NAGATA >
Re: Row pattern recognition
Small question. > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, > FIRST_VALUE(s.tdate) OVER w, > LAST_VALUE(s.tdate) OVER w, > lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > Shouldn't LAST(DOWN) AS lowest be "LAST(DOWN.price) AS lowest"? Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes
On Mon, 26 Jun 2023 at 15:51, Amit Kapila wrote: > > On Mon, Jun 26, 2023 at 3:07 PM Ashutosh Bapat > wrote: > > > > Hi All, > > Every pg_decode routine except pg_decode_message that decodes a > > transactional change, has following block > > /* output BEGIN if we haven't yet */ > > if (data->skip_empty_xacts && !txndata->xact_wrote_changes) > > { > > pg_output_begin(ctx, data, txn, false); > > } > > txndata->xact_wrote_changes = true; > > > > But pg_decode_message() doesn't call pg_output_begin(). If a WAL > > message is the first change in the transaction, it won't have a BEGIN > > before it. That looks like a bug. Why is pg_decode_message() > > exception? > > > > I can't see a reason why we shouldn't have a similar check for > transactional messages. So, agreed this is a bug. Here is a patch having the fix for the same. I have not added any tests as the existing tests cover this scenario. The same issue is present in back branches too. v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_master.patch can be applied on master, PG15 and PG14, v1-0001-Call-pg_output_begin-in-pg_decode_message-if-it-i_PG13.patch patch can be applied on PG13, PG12 and PG11. Thoughts? Regards, Vignesh From 2df5e87ec7c82daa5f17da50f770f923eb7765b4 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 28 Jun 2023 14:01:22 +0530 Subject: [PATCH] Call pg_output_begin in pg_decode_message if it is the first change in the transaction. Call pg_output_begin in pg_decode_message if it is the first change in the transaction. 
--- contrib/test_decoding/expected/messages.out | 10 -- contrib/test_decoding/sql/messages.sql | 2 +- contrib/test_decoding/test_decoding.c | 12 3 files changed, 21 insertions(+), 3 deletions(-) diff --git a/contrib/test_decoding/expected/messages.out b/contrib/test_decoding/expected/messages.out index c75d40190b..0fd70036bd 100644 --- a/contrib/test_decoding/expected/messages.out +++ b/contrib/test_decoding/expected/messages.out @@ -58,17 +58,23 @@ SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic'); ignorethis (1 row) -SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1'); +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0'); data + BEGIN message: transactional: 1 prefix: test, sz: 4 content:msg1 + COMMIT message: transactional: 0 prefix: test, sz: 4 content:msg2 message: transactional: 0 prefix: test, sz: 4 content:msg4 message: transactional: 0 prefix: test, sz: 4 content:msg6 + BEGIN message: transactional: 1 prefix: test, sz: 4 content:msg5 message: transactional: 1 prefix: test, sz: 4 content:msg7 + COMMIT + BEGIN message: transactional: 1 prefix: test, sz: 11 content:czechtastic -(7 rows) + COMMIT +(13 rows) -- test db filtering \set prevdb :DBNAME diff --git a/contrib/test_decoding/sql/messages.sql b/contrib/test_decoding/sql/messages.sql index cf3f7738e5..3d8500f99c 100644 --- a/contrib/test_decoding/sql/messages.sql +++ b/contrib/test_decoding/sql/messages.sql @@ -19,7 +19,7 @@ COMMIT; SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic'); -SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1'); +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1', 'include-xids', '0'); -- test db filtering \set prevdb :DBNAME diff 
--git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c index 12d1d0505d..2bbac5f6a8 100644 --- a/contrib/test_decoding/test_decoding.c +++ b/contrib/test_decoding/test_decoding.c @@ -743,6 +743,18 @@ pg_decode_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr lsn, bool transactional, const char *prefix, Size sz, const char *message) { + TestDecodingData *data = ctx->output_plugin_private; + TestDecodingTxnData *txndata; + + txndata = transactional ? txn->output_plugin_private : NULL; + + /* output BEGIN if we haven't yet */ + if (transactional && data->skip_empty_xacts && !txndata->xact_wrote_changes) + pg_output_begin(ctx, data, txn, false); + + if (transactional) + txndata->xact_wrote_changes = true; + OutputPluginPrepareWrite(ctx, true); appendStringInfo(ctx->out, "message: transactional: %d prefix: %s, sz: %zu content:", transactional, prefix, sz); -- 2.34.1 From 273776c53da22ecd69f8b7d4e7712ef66ab0ea5c Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Wed, 28 Jun 2023 14:01:22 +0530 Subject: [PATCH] Call pg_output_begin in pg_decode_message if it is the first change in the transaction. Call pg_output_begin in pg_decode_message if it is the first change in
Re: Changing types of block and chunk sizes in memory contexts
David Rowley writes: > Perhaps it's ok to leave the context creation functions with Size > typed parameters and then just Assert the passed-in sizes are not > larger than 1GB within the context creation function. Yes, I'm strongly opposed to not using Size/size_t in the mmgr APIs. If we go that road, we're going to have a problem when someone inevitably wants to pass a larger-than-GB value for some context type. What happens in semi-private structs is a different matter, although I'm a little dubious that shaving a couple of bytes from context headers is a useful activity. The self-documentation argument still has some force there, so I agree with Peter that some positive benefit has to be shown. regards, tom lane
Re: Making empty Bitmapsets always be NULL
Thank you for running the profiles. On Tue, 27 Jun 2023 at 21:11, Yuya Watari wrote: > On Sat, Jun 24, 2023 at 1:15 PM David Rowley wrote: > > I think it's also important to check we don't slow anything down for > > more normal-sized sets. The vast majority of sets will contain just a > > single word, so we should probably focus on making sure we're not > > slowing anything down for those. > > I agree with you and thank you for sharing the results. I ran > installcheck with your patch. The result is as follows. The speedup > was 0.33%. At least in my environment, I did not observe any > regression with this test. So, the patch looks very good. > > Master: 2.559648 seconds > Patched: 2.551116 seconds (0.33% faster) I wondered if the common case could be made slightly faster by checking the 0th word before checking the word count before going onto check the remaining words. For bms_equal(), that's something like: if (a->words[0] != b->words[0] || a->nwords != b->nwords) return false; /* check all the remaining words match */ for (int i = 1; i < a->nwords; i++) ... I wrote the patch and tried it out, but it seems slightly slower than the v4 patch. Linux with AMD 3990x, again using the patch from [1] with make installcheck master: 1.41720145 seconds v4: 1.392969606 seconds (1.74% faster than master) v4 with 0th word check: 1.404199748 seconds (0.93% faster than master) I've attached a delta patch of what I used to test. Since it's not any faster, I don't think it's worth doing. It'll also produce slightly more compiled code. David [1] https://postgr.es/m/caaphdvo68m_0juthnehfnsdsjeb2upphk6bwxstj93u_qei...@mail.gmail.com diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 9cda3b1cc1..2ee7385f02 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -9,12 +9,7 @@ * the minimum possible number of words, i.e, there are never any trailing * zero words. Enforcing this requires that an empty set is represented as * NULL. 
Because an empty Bitmapset is represented as NULL, a non-NULL - * Bitmapset always has at least 1 Bitmapword. We can exploit this fact to - * speedup various loops over the Bitmapset's words array by using "do while" - * loops instead of "for" loops. This means the code does not waste time - * checking the loop condition before the first iteration. For Bitmapsets - * containing only a single word (likely the majority of them) this reduces - * the loop condition tests by half. + * Bitmapset always has at least 1 Bitmapword. * * * Copyright (c) 2003-2023, PostgreSQL Global Development Group @@ -96,8 +91,6 @@ bms_copy(const Bitmapset *a) bool bms_equal(const Bitmapset *a, const Bitmapset *b) { - int i; - /* Handle cases where either input is NULL */ if (a == NULL) { @@ -108,17 +101,20 @@ bms_equal(const Bitmapset *a, const Bitmapset *b) else if (b == NULL) return false; - /* can't be equal if the word counts don't match */ - if (a->nwords != b->nwords) + /* +* Most Bitmapsets will likely just have 1 word, so check for equality +* there before checking the number of words match. The sets cannot be +* equal when they don't have the same number of words. +*/ + if (a->words[0] != b->words[0] || a->nwords != b->nwords) return false; - /* check each word matches */ - i = 0; - do + /* check all the remaining words match */ + for (int i = 1; i < a->nwords; i++) { if (a->words[i] != b->words[i]) return false; - } while (++i < a->nwords); + } return true; } @@ -353,25 +349,27 @@ bms_difference(const Bitmapset *a, const Bitmapset *b) bool bms_is_subset(const Bitmapset *a, const Bitmapset *b) { - int i; - /* Handle cases where either input is NULL */ if (a == NULL) return true;/* empty set is a subset of anything */ if (b == NULL) return false; - /* 'a' can't be a subset of 'b' if it contains more words */ - if (a->nwords > b->nwords) + /* +* Check the 0th word first as many sets will only have 1 word. 
+* Otherwise, 'a' can't be a subset of 'b' if it contains more words, so +* there's no need to check remaining words if 'a' contains more words +* than 'b'. +*/ + if ((a->words[0] & ~b->words[0]) != 0 || a->nwords > b->nwords) return false; - /* Check all 'a' members are set in 'b' */ - i = 0; - do + /* Check all remaining words to see 'b' has members not set in 'a'. */ + for (int i = 1; i < a->nwords; i++) { if ((a->words[i] & ~b->words[i]) != 0) return false; - } while (++i < a->nwords); + }
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Fri, Apr 14, 2023 at 4:00 PM Hayato Kuroda (Fujitsu) wrote: > > > Sorry for the delay, I didn't have time to come back to it until this > > afternoon. > > No issues, everyone is busy:-). > > > I don't think that your analysis is correct. Slots are guaranteed to be > > stopped after all the normal backends have been stopped, exactly to avoid > > such extraneous records. > > > > What is happening here is that the slot's confirmed_flush_lsn is properly > > updated in memory and ends up being the same as the current LSN before the > > shutdown. But as it's a logical slot and those records aren't decoded, the > > slot isn't marked as dirty and therefore isn't saved to disk. You don't see > > that behavior when doing a manual checkpoint before (per your script > > comment), > > as in that case the checkpoint also tries to save the slot to disk but then > > finds a slot that was marked as dirty and therefore saves it. > > Here, why is the behavior different for manual and non-manual checkpoints? > > In your script's scenario, when you restart the server the previous slot > > data > > is restored and the confirmed_flush_lsn goes backward, which explains those > > extraneous records. > > So you meant to say that the key point was that some records which are not > sent to the subscriber do not mark slots as dirty, hence the updated confirmed_flush > was not written into the slot file. Is that right? LogicalConfirmReceivedLocation() is > called by the walsender when the process gets a reply from the worker process, so your analysis > seems correct. > Can you please explain what led to updating the confirmed_flush in memory but not on disk? BTW, have we ensured that the additional records being discarded were already sent to the subscriber? If so, why has the confirmed_flush LSN not progressed past those records? -- With Regards, Amit Kapila.
Re: Assistance Needed: Issue with pg_upgrade and --link option
On Wed, 2023-06-28 at 15:40 +0530, Pradeep Kumar wrote: > > > I was under the impression that the --link option would create hard links > > > between the > > > old and new cluster's data files, but it appears that the entire old > > > cluster data was > > > copied to the new cluster, resulting in a significant increase in the new > > > cluster's size. > > > > Please provide some numbers, ideally > > > > du -sk > > du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master > ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg > 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master > 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg That looks fine. The files exist only once, and the 41MB that only exist in the new data directory are catalog data and other stuff that is different on the new cluster. Yours, Laurenz Albe
Re: Assistance Needed: Issue with pg_upgrade and --link option
These are my numbers. df ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/nvme0n1p4_crypt 375161856 102253040 270335920 28% /home /dev/mapper/nvme0n1p4_crypt 375161856 102253040 270335920 28% /home On Wed, Jun 28, 2023 at 3:14 PM Peter Eisentraut wrote: > On 28.06.23 08:24, Laurenz Albe wrote: > > On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > >> I was under the impression that the --link option would create hard > links between the > >> old and new cluster's data files, but it appears that the entire old > cluster data was > >> copied to the new cluster, resulting in a significant increase in the > new cluster's size. > > > > Please provide some numbers, ideally > > > >du -sk > > I don't think you can observe the effects of the --link option this way. > It would just give you the full size count for both directories, even > though they point to the same underlying inodes. > > To see the effect, you could perhaps use `df` to see how much overall > disk space the upgrade step eats up. > >
Re: Assistance Needed: Issue with pg_upgrade and --link option
Sure, du -sk ~/pradeep_test/pg_upgrade_testing/postgres_11.4/master ~/pradeep_test/pg_upgrade_testing/postgres_14/new_pg 11224524 /home/test/pradeep_test/pg_upgrade_testing/postgres_11.4/master 41952 /home/test/pradeep_test/pg_upgrade_testing/postgres_14/new_pg On Wed, Jun 28, 2023 at 11:54 AM Laurenz Albe wrote: > On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > > I was under the impression that the --link option would create hard > links between the > > old and new cluster's data files, but it appears that the entire old > cluster data was > > copied to the new cluster, resulting in a significant increase in the > new cluster's size. > > Please provide some numbers, ideally > > du -sk > > Yours, > Laurenz Albe >
Show WAL write and fsync stats in pg_stat_io
Hi, This is a WIP patch to add WAL write and fsync stats to the pg_stat_io view. There is a track_io_timing variable to control pg_stat_io timings and a track_wal_io_timing variable to control WAL timings. I couldn't decide which setting should control WAL timings in pg_stat_io. For now, both track_io_timing and track_wal_io_timing need to be enabled to track WAL timings in pg_stat_io. Also, if you compare WAL stats in pg_stat_wal and pg_stat_io, you can come across differences. These differences are caused by the background writer's WAL stats not being flushed. Because of that, the background writer's WAL stats are not seen in pg_stat_wal but in pg_stat_io. I already sent a patch [1] to fix that. [1] https://www.postgresql.org/message-id/CAN55FZ2FPYngovZstr%3D3w1KSEHe6toiZwrurbhspfkXe5UDocg%40mail.gmail.com Any kind of feedback would be appreciated. Regards, Nazir Bilal Yavuz Microsoft v1-0001-WIP-Show-WAL-write-and-fsync-stats-on-pg_stat_io.patch Description: Binary data
Re: Assistance Needed: Issue with pg_upgrade and --link option
On 28.06.23 08:24, Laurenz Albe wrote: On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Please provide some numbers, ideally du -sk I don't think you can observe the effects of the --link option this way. It would just give you the full size count for both directories, even though they point to the same underlying inodes. To see the effect, you could perhaps use `df` to see how much overall disk space the upgrade step eats up.
Re: Changing types of block and chunk sizes in memory contexts
On Wed, 28 Jun 2023 at 20:13, Peter Eisentraut wrote: > size_t (= Size) is the correct type in C to store the size of an object > in memory. This is partially a self-documentation issue: If I see > size_t in a function signature, I know what is intended; if I see > uint32, I have to wonder what the intent was. Perhaps it's ok to leave the context creation functions with Size-typed parameters and then just Assert the passed-in sizes are not larger than 1GB within the context creation function. That way we could keep this change self-contained in the .c file for the given memory context. That way, readability would not suffer. If we ever wanted to lift the 1GB limit on block sizes then we'd not need to switch the function signature again. There's documentation where the struct's field is declared, so having a smaller type in the struct itself does not seem like a reduction of documentation quality. > You could make an argument that using shorter types would save space for > some internal structs, but then you'd have to show some more information > where and why that would be beneficial. I think there's not much need to go proving this speeds something up. There's just simply no point in the struct fields being changed in Melih's patch to be bigger than 32 bits as we never need to store more than 1GB in them. Reducing these down means we may have to touch fewer cache lines and we'll also have more space on the keeper blocks to store allocations. Memory allocation performance is fairly fundamental to Postgres's performance. In my view, we shouldn't have fields that are twice as large as they need to be in code as hot as this. > Absent any strong performance argument, I don't see the benefit of this > change. People might well want to experiment with MEMORYCHUNK_... settings larger than 1GB. Anyone doing so will be editing C code anyway. They can adjust these fields then. David
Re: Targetlist lost when CTE join
HI, On Jun 28, 2023, 17:26 +0800, Julien Rouhaud , wrote: > On Wed, Jun 28, 2023 at 05:17:14PM +0800, Julien Rouhaud wrote: > > > > > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > > > result target list seems to lost one’s column c1. > > > But it looks good when select cte1.* and t1.* explicitly . > > > > > > Is it a bug? > > > > This is working as intended. When using a USING clause you "merge" both > > columns so the final target list only contain one version of the merged > > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that > > what the SQL standard says, but I don't have a copy to confirm. > > I forgot to mention that this is actually documented: > > https://www.postgresql.org/docs/current/queries-table-expressions.html > > Furthermore, the output of JOIN USING suppresses redundant columns: there is > no > need to print both of the matched columns, since they must have equal values. > While JOIN ON produces all columns from T1 followed by all columns from T2, > JOIN USING produces one output column for each of the listed column pairs (in > the listed order), followed by any remaining columns from T1, followed by any > remaining columns from T2. Thanks for your help. Regards, Zhang Mingli
Re: Targetlist lost when CTE join
On Wed, Jun 28, 2023 at 05:17:14PM +0800, Julien Rouhaud wrote: > > > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > > result target list seems to lost one’s column c1. > > But it looks good when select cte1.* and t1.* explicitly . > > > > Is it a bug? > > This is working as intended. When using a USING clause you "merge" both > columns so the final target list only contain one version of the merged > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that > what the SQL standard says, but I don't have a copy to confirm. I forgot to mention that this is actually documented: https://www.postgresql.org/docs/current/queries-table-expressions.html Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
Re: Targetlist lost when CTE join
Hi Regards, Zhang Mingli On Jun 28, 2023, 17:17 +0800, Julien Rouhaud , wrote: > This is working as intended. When using a USING clause you "merge" both > columns so the final target list only contains one version of the merged > columns, which doesn't happen if you use e.g. ON instead. I'm assuming that's > what the SQL standard says, but I don't have a copy to confirm. Thanks. You're right. Here's a test: gpadmin=# with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 on t1.c1 = cte1.c1; c1 | c2 | c1 | c2 +++ (0 rows)
Re: Targetlist lost when CTE join
Hi, On Wed, Jun 28, 2023 at 04:52:34PM +0800, Zhang Mingli wrote: > > Mini repo > > create table t1(c1 int, c2 int); > CREATE TABLE > create table t2(c1 int, c2 int); > CREATE TABLE > explain with cte1 as (insert into t2 values (1, 2) returning *) select * from > cte1 join t1 using(c1); > QUERY PLAN > > Hash Join (cost=0.04..41.23 rows=11 width=12) > Hash Cond: (t1.c1 = cte1.c1) > CTE cte1 > -> Insert on t2 (cost=0.00..0.01 rows=1 width=8) > -> Result (cost=0.00..0.01 rows=1 width=8) > -> Seq Scan on t1 (cost=0.00..32.60 rows=2260 width=8) > -> Hash (cost=0.02..0.02 rows=1 width=8) > -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) > (8 rows) > > with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 > join t1 using(c1); > c1 | c2 | c2 > ++ > (0 rows) > > truncate t2; > TRUNCATE TABLE > with cte1 as (insert into t2 values (1, 2) returning *) select cte1.*, t1.* > from cte1 join t1 using(c1); > c1 | c2 | c1 | c2 > +++ > (0 rows) > > Table t1 and t2 both has 2 columns: c1, c2, when CTE join select *, the > result target list seems to lost one’s column c1. > But it looks good when select cte1.* and t1.* explicitly . > > Is it a bug? This is working as intended. When using a USING clause you "merge" both columns so the final target list only contains one version of the merged columns, which doesn't happen if you use e.g. ON instead. I'm assuming that's what the SQL standard says, but I don't have a copy to confirm.
Re: Targetlist lost when CTE join
Hi, With EXPLAIN VERBOSE, it seems the Hash Join node drops that column. gpadmin=# explain(verbose) with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); QUERY PLAN --- Hash Join (cost=0.04..41.23 rows=11 width=12) Output: cte1.c1, cte1.c2, t1.c2 Hash Cond: (t1.c1 = cte1.c1) CTE cte1 -> Insert on public.t2 (cost=0.00..0.01 rows=1 width=8) Output: t2.c1, t2.c2 -> Result (cost=0.00..0.01 rows=1 width=8) Output: 1, 2 -> Seq Scan on public.t1 (cost=0.00..32.60 rows=2260 width=8) Output: t1.c1, t1.c2 -> Hash (cost=0.02..0.02 rows=1 width=8) Output: cte1.c1, cte1.c2 -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) Output: cte1.c1, cte1.c2 (14 rows) Regards, Zhang Mingli
Targetlist lost when CTE join
Hi, Minimal repro: create table t1(c1 int, c2 int); CREATE TABLE create table t2(c1 int, c2 int); CREATE TABLE explain with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); QUERY PLAN Hash Join (cost=0.04..41.23 rows=11 width=12) Hash Cond: (t1.c1 = cte1.c1) CTE cte1 -> Insert on t2 (cost=0.00..0.01 rows=1 width=8) -> Result (cost=0.00..0.01 rows=1 width=8) -> Seq Scan on t1 (cost=0.00..32.60 rows=2260 width=8) -> Hash (cost=0.02..0.02 rows=1 width=8) -> CTE Scan on cte1 (cost=0.00..0.02 rows=1 width=8) (8 rows) with cte1 as (insert into t2 values (1, 2) returning *) select * from cte1 join t1 using(c1); c1 | c2 | c2 ++ (0 rows) truncate t2; TRUNCATE TABLE with cte1 as (insert into t2 values (1, 2) returning *) select cte1.*, t1.* from cte1 join t1 using(c1); c1 | c2 | c1 | c2 +++ (0 rows) Tables t1 and t2 both have 2 columns: c1 and c2. When the CTE join selects *, the result target list seems to have lost one c1 column. But it looks fine when selecting cte1.* and t1.* explicitly. Is it a bug? Regards, Zhang Mingli
Re: Do we want a hashset type?
On Wed, Jun 28, 2023, at 08:26, jian he wrote: > Hi there. > I changed the function hashset_contains to strict. Changing hashset_contains to STRICT would cause it to return NULL if any of the operands are NULL, which I don't believe is correct, since: SELECT NULL = ANY('{}'::int4[]); ?column? -- f (1 row) Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE, to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023. Did you try running `make installcheck` after your change? You would then have seen one of the tests failing: test array-and-multiset-semantics ... FAILED 21 ms Check the content of `regression.diffs` to see why: % cat regression.diffs diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38 +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27 @@ -158,7 +158,7 @@ | | {NULL} | {NULL} | | |1 | {1} | {1} | | |4 | {4} | {4} | | - {} | | {NULL} | {NULL} | f| f + {} | | {NULL} | {NULL} | | f {} |1 | {1} | {1} | f| f {} |4 | {4} | {4} | f| f {NULL} | | {NULL} | {NULL,NULL} | | @@ -284,7 +284,8 @@ "= ANY(...)"; arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...) --+--+-+--+--+ -(0 rows) + {} | | {NULL} | {NULL} | | f +(1 row) > also change the way to return an empty array. Nice. I agree the `Datum d` variable was unnecessary. I also removed the unused includes. > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that > int4hashset can speed distinct aggregate and distinct counts? 
> like the following: > > explain(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 > > explain(analyze, costs off, timing off, buffers) > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 The 100k tables seems to be too small to give any meaningful results, when trying to measure individual queries: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_100k; Execution Time: 26.790 ms Execution Time: 30.616 ms Execution Time: 33.253 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_100k; Execution Time: 32.797 ms Execution Time: 27.605 ms Execution Time: 26.228 ms If we instead try the 10M tables, it looks like array_agg(DISTINCT ...) is actually faster for the `i` column where all input integers are unique: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_10M; Execution Time: 799.017 ms Execution Time: 796.008 ms Execution Time: 799.121 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_10M; Execution Time: 1204.873 ms Execution Time: 1221.822 ms Execution Time: 1216.340 ms For random integers, hashset is a win though: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1874.722 ms Execution Time: 1878.760 ms Execution Time: 1861.640 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(rnd) FROM benchmark_input_10M; Execution Time: 1253.709 ms Execution Time: 1222.651 ms Execution Time: 1237.849 ms > explain(costs off,timing off, analyze,buffers) > select count(distinct rnd) from benchmark_input_100k \watch c=3 > > explain(costs off,timing off, analyze,buffers) > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM > benchmark_input_100k) sub(x) \watch c=3 I tried these with 10M: EXPLAIN(costs off,timing off, 
analyze,buffers) SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1733.320 ms Execution Time: 1725.214 ms Execution Time: 1716.636 ms EXPLAIN(costs off,timing off, analyze,buffers) SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x); Execution Time: 1249.612 ms Execution Time: 1240.558 ms Execution Time: 1252.103 ms Not sure what I think of the current benchmark suite. I think it would be better to only include some realistic real-life examples, such as the graph query which was the reason I personally started working on this. Otherwise there is a risk we optimise for some hypothetical scenario that is not relevant in practice. It would be good to have more examples of typical workloads for which the hashset type would be
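The NULL semantics discussed earlier in this thread — that `hashset_contains` should mimic `x = ANY(array)` rather than being STRICT — can be sketched outside the extension. The `miniset` type, its field names, and the `contains` helper below are hypothetical stand-ins for `int4hashset_t`, not the actual extension code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Three-valued logic result, mirroring SQL's TRUE / FALSE / NULL (unknown) */
typedef enum { TB_FALSE, TB_TRUE, TB_NULL } tribool;

/* Hypothetical mini-set standing in for int4hashset_t: a handful of
 * elements plus a null_element flag, as in the patch description. */
typedef struct
{
    int  elems[8];
    int  nelems;
    bool null_element;
} miniset;

/*
 * contains() following "x = ANY(array)" semantics:
 *  - a truly empty set yields FALSE even for a NULL probe, since there
 *    is nothing to compare against (this is why STRICT is wrong here);
 *  - a NULL probe against a non-empty set yields NULL (unknown);
 *  - a probe not found, in a set holding a NULL element, also yields
 *    NULL (it might equal the unknown element).
 */
static tribool
contains(const miniset *s, const int *probe)
{
    if (s->nelems == 0 && !s->null_element)
        return TB_FALSE;
    if (probe == NULL)
        return TB_NULL;
    for (int i = 0; i < s->nelems; i++)
        if (s->elems[i] == *probe)
            return TB_TRUE;
    return s->null_element ? TB_NULL : TB_FALSE;
}
```

This reproduces the expected-output row from the failing regression test: the empty set probed with NULL gives FALSE, while `{NULL}` probed with anything gives NULL.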
Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs
On 27.06.23 17:54, Dagfinn Ilmari Mannsåker wrote: However, the docs (https://www.postgresql.org/docs/16/regress-tap.html#REGRESS-TAP-VARS) say "If the environment variable PG_TEST_NOCLEAN is set", not "is set to a true value", and the existing test in PostgreSQL::Test::Cluster's END block is: # skip clean if we are requested to retain the basedir next if defined $ENV{'PG_TEST_NOCLEAN'}; So the original `not defined` test is consistent with that. Right, the usual style is just to check whether an environment variable is set to something, not what it is. Also note that in general not all environment variables are processed by Perl, so I would avoid encoding Perl semantics about what is "true" or whatever into it.
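For comparison, the C equivalent of the `defined $ENV{'PG_TEST_NOCLEAN'}` style — where any value at all, even "" or "0", counts as set — might look like this sketch (the `keep_tempdirs` helper name is made up, not PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * "Is set" semantics: the variable's presence enables the behavior,
 * regardless of its value.  This matches the documented wording and
 * the existing `defined $ENV{'PG_TEST_NOCLEAN'}` test in
 * PostgreSQL::Test::Cluster.
 */
static bool
keep_tempdirs(void)
{
    return getenv("PG_TEST_NOCLEAN") != NULL;
}
```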
Re: Detecting use-after-free bugs using gcc's malloc() attribute
On 26.06.23 21:54, Andres Freund wrote: For something like pg_list.h the malloc(free) attribute is a bit awkward to use, because one a) needs to list ~30 functions that can free a list and b) the referenced functions need to be declared. Hmm. Saying list_concat() "deallocates" a list is mighty confusing because 1) it doesn't, and 2) it might actually allocate a new list. So while you get the useful behavior of "you probably didn't mean to use this variable again after passing it into list_concat()", if some other tool actually took these allocate/deallocate decorations at face value and did a memory leak analysis with them, they'd get completely bogus results. I also added such attributes to bitmapset.h and palloc() et al, but that didn't find existing bugs. It does find use-after-free instances if I add some, similarly it does find cases of mismatching palloc with free etc. This seems more straightforward. Even if it didn't find any bugs, I'd imagine it would be useful during development. Do others think this would be useful enough to be worth polishing? And do you agree the warnings above are bugs? I actually just played with this the other day, because I can never remember termPQExpBuffer() vs. destroyPQExpBuffer(). I couldn't quite make it work for that, but I found the feature overall useful, so I'd welcome support for it.
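A minimal sketch of the allocator/deallocator pairing under discussion, using a hypothetical `IntList` type rather than pg_list.h's real declarations. With GCC 11+ and -Wall, freeing the result with plain free() draws -Wmismatched-dealloc, and reusing it after intlist_free() can draw -Wuse-after-free:

```c
#include <assert.h>
#include <stdlib.h>

/* The malloc(deallocator) attribute form is GCC 11+; no-op elsewhere. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 11
#define FREED_BY(fn) __attribute__((malloc, malloc(fn)))
#else
#define FREED_BY(fn)
#endif

typedef struct IntList { int len; } IntList;

/* The deallocator must be declared before it can be referenced. */
void intlist_free(IntList *list);
IntList *intlist_make(int len) FREED_BY(intlist_free);

void
intlist_free(IntList *list)
{
    free(list);
}

IntList *
intlist_make(int len)
{
    IntList *l = malloc(sizeof(IntList));

    l->len = len;
    return l;
}
```

The pg_list.h awkwardness Andres mentions follows directly from this shape: every function that can free or repurpose a List would need to be declared before the allocator declarations, and listed as a deallocator.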
Re: [PATCH] Support % wildcard in extension upgrade filenames
> On 31 May 2023, at 21:07, Sandro Santilli wrote: > On Thu, Apr 27, 2023 at 12:49:57PM +0200, Sandro Santilli wrote: >> I'm happy to bring back the control-file switch if there's an >> agreement about that. > > I'm attaching an up-to-date version of the patch with the control-file > switch back in, so there's an explicit choice by extension developers. This version fails the extension test since the comment from the .sql file is missing from test_extensions.out. Can you fix that in a rebase such that the CFBot can have a green build of this patch? -- Daniel Gustafsson
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 3:36 PM Michael Paquier wrote: > On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: > > +1. To nitpick, how about we remove the blank line just before removing > > the key for top level entry? > > > > - /* Also remove entries for top level statements */ > > + /* Also remove entries if exist for top level statements */ > >key.toplevel = true; > > - > > - /* Remove the key if exists */ > >entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); > > Why not if it improves the overall situation. Could you send a patch > with everything you have in mind? Here is the patch. I don't have too much in mind, so the patch just removes the blank line and revises the comment a bit. Thanks Richard v1-0001-trivial-revise-for-entry_reset.patch Description: Binary data
Re: Changing types of block and chunk sizes in memory contexts
In memory contexts, block and chunk sizes are likely to be limited by some upper bounds. Some examples of those bounds are MEMORYCHUNK_MAX_BLOCKOFFSET and MEMORYCHUNK_MAX_VALUE. Both values are only 1 less than 1GB. This means memory contexts have blocks/chunks with sizes less than 1GB, and such sizes can be stored in 32 bits. Currently the "Size" type, which is 64-bit, is used, but 32-bit integers should be enough to store any value less than 1GB. size_t (= Size) is the correct type in C to store the size of an object in memory. This is partially a self-documentation issue: If I see size_t in a function signature, I know what is intended; if I see uint32, I have to wonder what the intent was. You could make an argument that using shorter types would save space for some internal structs, but then you'd have to show more concretely where and why that would be beneficial. (But again, self-documentation: If one were to do that, I would argue for introducing a custom type like pg_short_size_t.) Absent any strong performance argument, I don't see the benefit of this change. People might well want to experiment with MEMORYCHUNK_... settings larger than 1GB.
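Peter's point about demonstrating the space savings can be made concrete with a hypothetical block header (field names invented for illustration; the real MemoryContext structs differ):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHUNK (1UL << 30)   /* ~1GB bound, cf. MEMORYCHUNK_MAX_VALUE */

/* Hypothetical block header as it might look today, using size_t */
typedef struct
{
    void  *next;
    size_t blksize;
    size_t freeoffset;
} BlockHdrWide;

/* The same header with the size fields narrowed to 32 bits */
typedef struct
{
    void    *next;
    uint32_t blksize;
    uint32_t freeoffset;
} BlockHdrNarrow;
```

On an LP64 platform the narrow variant is 16 bytes against 24 — real, but only worth pursuing if such headers are numerous enough for the saving to show up somewhere measurable.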
Re: Incremental View Maintenance, take 2
On Wed, 28 Jun 2023 00:01:02 +0800 jian he wrote: > On Thu, Jun 1, 2023 at 2:47 AM Yugo NAGATA wrote: > > > > On Thu, 1 Jun 2023 23:59:09 +0900 > > Yugo NAGATA wrote: > > > > > Hello hackers, > > > > > > Here's a rebased version of the patch-set adding Incremental View > > > Maintenance support for PostgreSQL. That was discussed in [1]. > > > > > [1] > > > https://www.postgresql.org/message-id/flat/20181227215726.4d166b4874f8983a641123f5%40sraoss.co.jp > > > > --- > > * Overview > > > > Incremental View Maintenance (IVM) is a way to make materialized views > > up-to-date by computing only incremental changes and applying them on > > views. IVM is more efficient than REFRESH MATERIALIZED VIEW when > > only small parts of the view are changed. > > > > ** Feature > > > > The attached patchset provides a feature that allows materialized views > > to be updated automatically and incrementally just after an underlying > > table is modified. > > > > You can create an incrementally maintainable materialized view (IMMV) > > by using the CREATE INCREMENTAL MATERIALIZED VIEW command. > > > > The following are supported in view definition queries: > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins) > > - some built-in aggregate functions (count, sum, avg, min, max) > > - GROUP BY clause > > - DISTINCT clause > > > > Views can contain multiple tuples with the same content (duplicate tuples). > > > > ** Restrictions > > > > The following are not supported in a view definition: > > - Outer joins > > - Aggregates other than the above, window functions, HAVING > > - Sub-queries, CTEs > > - Set operations (UNION, INTERSECT, EXCEPT) > > - DISTINCT ON, ORDER BY, LIMIT, OFFSET > > > > Also, a view definition query cannot contain other views, materialized > > views, > > foreign tables, partitioned tables, partitions, VALUES, non-immutable > > functions, > > system columns, or expressions that contain aggregates.
> > > > --- > > * Design > > > > An IMMV is maintained using statement-level AFTER triggers. > > When an IMMV is created, triggers are automatically created on all base > > tables contained in the view definition query. > > > > When a table is modified, changes that occurred in the table are extracted > > as transition tables in the AFTER triggers. Then, changes that will occur in > > the view are calculated by a rewritten view definition query in which the modified > > table > > is replaced with the transition table. > > > > For example, if the view is defined as "SELECT * FROM R, S", and tuples > > inserted > > into R are stored in a transition table dR, the tuples that will be inserted > > into > > the view are calculated as the result of "SELECT * FROM dR, S". > > > > ** Multiple Tables Modification > > > > Multiple tables can be modified in a statement when using triggers, foreign > > key > > constraints, or modifying CTEs. When multiple tables are modified, we need > > the state of the tables before the modification. > > > > For example, when some tuples, dR and dS, are inserted into R and S > > respectively, > > the tuples that will be inserted into the view are calculated by the > > following > > two queries: > > > > "SELECT * FROM dR, S_pre" > > "SELECT * FROM R, dS" > > > > where S_pre is the table before the modification, and R is the current state of > > the table, that is, after the modification. This pre-update state of a table > > is calculated by filtering inserted tuples and appending deleted tuples. > > The subquery that represents the pre-update state is generated in > > get_prestate_rte(). > > Specifically, the inserted tuples are filtered by calling > > IVM_visible_in_prestate() > > in the WHERE clause. This function checks the visibility of tuples by using > > the snapshot taken before the table modification. The deleted tuples are > > contained > > in the old transition table, and this table is appended using UNION ALL.
> > > > Transition tables for each modification are collected in each AFTER trigger > > function call. Then, the view maintenance is performed in the last call of > > the trigger. > > > > In the original PostgreSQL, tuplestores of transition tables are freed at > > the > > end of each nested query. However, their lifespan needs to be prolonged to > > the end of the outermost query in order to maintain the view in the last > > AFTER > > trigger. For this purpose, SetTransitionTablePreserved is added in > > trigger.c. > > > > ** Duplicate Tuples > > > > When calculating the changes that will occur in the view (= delta tables), > > the multiplicity of tuples is calculated by using count(*). > > > > When deleting tuples from the view, tuples to be deleted are identified by > > joining the delta table with the view, and as many tuples as the > > specified multiplicity are deleted, numbering them using the row_number() function. > > This is
Re: Request for new function in view update
On Thu, 1 Jun 2023 12:18:47 -0500 Terry Brennan wrote: > Hello all, > > I am a researcher in databases who would like to suggest a new function. I > am writing to you because you have an active developer community. Your > website said that suggestions for new functions should go to this mailing > list. If there is another mailing list you prefer, please let me know. > > My research is in updating views -- the problem of translating an update in > a view to an update to a set of underlying base tables. This problem has > been partially solved for many years, including in PostgreSQL, but a > complete solution hasn't been found. > > Views are useful for data independence; if users only access data through > views, then underlying databases can change without user programs. Data > independence requires an automatic solution to the view update problem. > > In my research, I went back to the initial papers about the problem. The > most promising approach was the "constant complement" approach. It starts > from the idea that a view shows only part of the information in a database, > and that view updates should never change the part of the database that > isn't exposed in the view. (The "complement" is the unexposed part, and > "constant" means that a view update shouldn't change the complement.) The > "constant complement" constraint is intuitive, that a view update shouldn't > have side effects on information not available through the view. > > A seminal paper showed that defining a complement is enough, because each > complement of a view creates a unique view update. Unfortunately, there > are limitations. Views have multiple complements, and no unique minimal > complement exists. Because of this limitation and other practical > difficulties, the constant complement approach was abandoned. > > I used a theorem in this initial paper that other researchers didn't use, > that shows the inverse. An update method defines a unique complement. 
I > used the two theorems as a saw's upstroke and downstroke to devise view > update methods for several relational operators. Unlike other approaches, > these methods have a solid mathematical foundation. > > Some relational operators are easy (selection), others are hard > (projection); some have several valid update methods that can be used > interchangeably (union) and some can have several valid update methods that > reflect different semantics (joins). For joins, I found clues in the > database that can determine which update method to use. I address the > other relational operators, but not in the attached paper > . > I also discuss the problem of when views can't have updates, and possible > reasons why. > > I have attached my arXiv paper. I would appreciate anyone's interest in > this topic. I'm interested in the view update problem because we have done some work on this topic [1][2]. I read your paper. Although I don't understand the theoretical part fully, I see that you propose view update methods for several relational operators. The method for updating selection views seems to be the same as the way automatically updatable views work in current PostgreSQL, that is, deleting/updating rows in a view results in deletes/updates of the corresponding rows in the base table. Inserting rows that do not conform to the view is checked and prevented if the view is defined with WITH CHECK OPTION. However, the proposed method for projection is that a column not contained in the view is updated to NULL when a row is deleted. I think it is not desirable to use NULL for such a special purpose, and above all, this differs from the current PostgreSQL behavior, so I wonder whether such a change would be accepted. The same seems true of the method for JOIN, which also uses NULL in this special way. I think it would be nice to extend the view updatability of PostgreSQL, because the SQL standard allows more than the current limited support.
In this case, I wonder whether we should follow SQL:1999 or later; maybe this would be somewhat compatible with the spec in Oracle. [1] https://dl.acm.org/doi/10.1145/3164541.3164584 [2] https://www.pgcon.org/2017/schedule/events/1074.en.html Regards, Yugo Nagata > Yours > Terry Brennan -- Yugo NAGATA
Commitfest manager for July
A quick scan of the archives doesn't turn up anyone who has volunteered in advance to run the upcoming commitfest. Is anyone keen to try their hand at this very important community work? The July CF is a good one for anyone doing this for the first time IMHO, as it's usually less stressful than the ones later in the cycle. -- Daniel Gustafsson
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 03:09:55PM +0800, Richard Guo wrote: > +1. To nitpick, how about we remove the blank line just before removing > the key for top level entry? > > - /* Also remove entries for top level statements */ > + /* Also remove entries if exist for top level statements */ >key.toplevel = true; > - > - /* Remove the key if exists */ >entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); Why not if it improves the overall situation. Could you send a patch with everything you have in mind? -- Michael signature.asc Description: PGP signature
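The entry_reset() logic being discussed — removing both the nested-level and top-level entries for the same query — can be sketched with a toy table keyed the same way as pgssHashKey (names and structure here are illustrative, not the pg_stat_statements code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pgssHashKey: the same query can have one
 * entry at the top level and one nested (toplevel = false). */
typedef struct
{
    uint64_t queryid;
    bool     toplevel;
    bool     used;
} Entry;

#define NSLOTS 8
static Entry table[NSLOTS];

static int
find(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < NSLOTS; i++)
        if (table[i].used &&
            table[i].queryid == queryid &&
            table[i].toplevel == toplevel)
            return i;
    return -1;
}

static void
put(uint64_t queryid, bool toplevel)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!table[i].used)
        {
            table[i] = (Entry) { queryid, toplevel, true };
            return;
        }
}

/* Mirrors the discussed code path: remove the nested-level entry
 * first, then also remove the top-level entry for the same query. */
static void
remove_both_levels(uint64_t queryid)
{
    int i = find(queryid, false);

    if (i >= 0)
        table[i].used = false;
    i = find(queryid, true);
    if (i >= 0)
        table[i].used = false;
}
```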
Re: Add TLI number to name of files generated by pg_waldump --save-fullpage
On Wed, Jun 28, 2023 at 09:20:27AM +0900, Kyotaro Horiguchi wrote: > At Tue, 27 Jun 2023 18:58:39 -0500, David Christensen > wrote in >>> Adjusted as per the v2 attached. >> >> +1 > > +1 Okay, cool. Both of you seem happy with it, so I have applied it. Thanks for the quick checks. -- Michael signature.asc Description: PGP signature
Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)
Hi, On Tue, Feb 21, 2023 at 4:12 PM Amit Langote wrote: > On Tue, Feb 21, 2023 at 12:40 AM Tom Lane wrote: > > Alvaro Herrera writes: > > > On 2023-Feb-20, Amit Langote wrote: > > >> One more thing we could try is come up with a postgres_fdw test case, > > >> because it uses the RelOptInfo.userid value for remote-costs-based > > >> path size estimation. But adding a test case to contrib module's > > >> suite test a core planner change might seem strange, ;-). > > > > > Maybe. Perhaps adding it in a separate file there is okay? > > > > There is plenty of stuff in contrib module tests that is really > > there to test core-code behavior. (You could indeed argue that > > *all* of contrib is there for that purpose.) If it's not > > convenient to test something without an extension, just do it > > and don't sweat about that. > > OK. Attached adds a test case to postgres_fdw's suite. You can see > that it fails without a316a3bc. Noticed that there's an RfC entry for this in the next CF. Here's an updated version of the patch where I updated the comments a bit and the commit message. I'm thinking of pushing this on Friday barring objections. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v2-0001-Add-a-test-case-for-a316a3bc.patch Description: Binary data
removing limitations from bitmap index scan
This email has no patch yet -- it's more of a placeholder to gather some issues into one place. During previous work on replacing vacuum's bsearch'd array for TID storage with a radix tree, Andres suggested [1] that the hash table in tidbitmap.c should also be replaced. This will hopefully illuminate how to get there. * Current limitations Page table entries are fixed-size. At PGCon, Mark Dilger and Andres discussed the problem that bitmap scans assume a hard-coded limit on the offset of a TID, one that's particular to heap AM. That's not a requirement of hash tables in general, but that's the current state of the implementation. Using a radix tree would smooth the way to allowing variable-length page table entries. I have briefly talked about it with colleagues offlist as well. The radix tree implementation that Masahiko and I have worked on does still currently assume fixed-sized values (not absolutely, but fixed at compile time for a given use), but last month I did some refactoring that would make variable-sized values fairly straightforward, at least no big challenge. There would of course also be some extra complexity in doing TBM union/intersection operations etc. Recent work I did also went in the direction of storing small-enough values in the last-level pointer, saving memory (as well as time spent accessing it). That seems important, since trees do have some space overhead compared to arrays. * Iteration/ordering There are also now some unpleasant consequences that stem from hashed blocknumbers: - To get them ready for the executor the entries need to be sorted by blocknumber, and "random" is a strenuous sorting case, because of cache misses and branch mispredicts. - Pages get lossified (when necessary) more-or-less at random Radix trees maintain logical ordering, allowing for ordered iteration, so that solves the sorting problem, and should help give a performance boost.
One hurdle is that shared iteration must work so that each worker can have a disjoint subset of the input. The radix tree does have shared memory support, but not yet shared iteration since there hasn't been a concrete use case. Also, DSA has a noticeable performance cost. A good interim development step is to use a local-mem radix tree for the index scan, and then move everything out to the current array for the executor, in shmem if the heap scan will be parallel. (I have coded some steps in that direction, not ready to share.) That keeps that part of the interface the same, simplifying testing. It's possible this much would work even for varlen bitmaps: the iteration array could use a "tall skinny" page table entry format, like { blockno; ; wordnum; bitmapword; } ...which would save space in many cases. Long term, we will want to move to shared memory for the radix tree, at least as a prerequisite for parallel bitmap index scan. The concurrency scheme is likely too coarse to make that worthwhile now, but that will hopefully change at some point. * Possible simplification Some of the above adds complexity, but I see a possible simplification: Many places in tidbitmap.c need to know if we have a single entry, to keep from creating the hash table. That was added before simplehash.h existed. I suspect most of the overhead now in creating the hash table is in zeroing the backing array (correct me if I'm wrong). The radix tree wouldn't do that, but it would create about half a dozen memory contexts, and inserting a single entry would allocate one or two context blocks. Neither of these are free either. If the single-entry case is still worth optimizing, it could be pushed down inside the radix tree as a template option that lazily creates memory contexts etc. * Multiple use Vacuum concerns started this in the first place, so it'll have to be kept in mind as we proceed.
At the very least, vacuum will need a boolean to disallow lossifying pages, but the rest should work about the same. There are some other things left out, like memory management and lossy entries to work out, but this is enough to give a sense of what's involved. [1] https://www.postgresql.org/message-id/20230216164408.bcatntzzxj3jqn3q%40awork3.anarazel.de -- John Naylor EDB: http://www.enterprisedb.com
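The "tall skinny" iteration-array format mentioned above can be illustrated with hypothetical struct layouts (the offsets-per-page count and field widths are invented for the sketch; the real TBM page table entry differs):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;

/* Illustrative current-style entry: one bitmap covering a whole heap
 * page, sized for a hypothetical 256 offsets per page. */
#define WORDS_PER_PAGE (256 / 64)

typedef struct
{
    uint32_t   blockno;
    bitmapword words[WORDS_PER_PAGE];
} WideEntry;

/* The "tall skinny" format: one bitmapword at a time, tagged with
 * which word of the page's bitmap it represents. */
typedef struct
{
    uint32_t   blockno;
    uint16_t   wordnum;
    bitmapword word;
} SkinnyEntry;
```

When most pages set bits in only one or two words, emitting a few skinny entries per page costs less than one wide entry, which is the space saving the email alludes to.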
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 3:04 PM Michael Paquier wrote: > On Wed, Jun 28, 2023 at 12:15:47PM +0800, Japin Li wrote: > > - /* Remove the key if it exists, starting with the > top-level entry */ > > + /* Remove the key if it exists, starting with the > non-top-level entry */ > > key.toplevel = false; > > entry = (pgssEntry *) hash_search(pgss_hash, &key, > HASH_REMOVE, NULL); > > if (entry) /* found */ > Nice catch. That's indeed wrong. Will fix. +1. To nitpick, how about we remove the blank line just before removing the key for top level entry? - /* Also remove entries for top level statements */ + /* Also remove entries if exist for top level statements */ key.toplevel = true; - - /* Remove the key if exists */ entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, NULL); Thanks Richard
Re: Another incorrect comment for pg_stat_statements
On Wed, Jun 28, 2023 at 12:15:47PM +0800, Japin Li wrote: > - /* Remove the key if it exists, starting with the top-level > entry */ > + /* Remove the key if it exists, starting with the non-top-level > entry */ > key.toplevel = false; > entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_REMOVE, > NULL); > if (entry) /* found */ Nice catch. That's indeed wrong. Will fix. -- Michael signature.asc Description: PGP signature
Re: Assert !bms_overlap(joinrel->relids, required_outer)
On Wed, Jun 28, 2023 at 6:28 AM Tom Lane wrote: > For a real fix, I'm inclined to extend the loop that calculates > param_source_rels (in add_paths_to_joinrel) so that it also tracks > a set of incompatible relids that *must not* be present in the > parameterization of a proposed path. This would basically include > OJ relids of OJs that partially overlap the target joinrel; maybe > we should also include the min RHS of such OJs. Then we could > check that in try_nestloop_path. I've not tried to code this yet. I went ahead and drafted a patch based on this idea. A few differences: * You mentioned that the incompatible relids might need to also include the min_righthand of the OJs that are part of the target joinrel. It seems to me that we may need to also include the min_lefthand of such OJs, because the parameterization of any proposed join path for the target joinrel should not overlap anything in an OJ if the OJ is part of this joinrel. * I think we need to check the incompatible relids also in try_hashjoin_path and try_mergejoin_path besides try_nestloop_path. Thanks Richard v1-0001-Draft-Avoid-invalid-parameterized-path-for-joinrel.patch Description: Binary data
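The incompatible-relids guard described above ultimately reduces to a Bitmapset overlap test. A toy one-word sketch (the real Bitmapset is variable-length, and the names below are illustrative, not the planner's API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy one-word Bitmapset: relids 0..63 */
typedef uint64_t Relids;

static Relids
add_member(Relids s, int relid)
{
    return s | ((Relids) 1 << relid);
}

static bool
overlaps(Relids a, Relids b)
{
    return (a & b) != 0;
}

/* The guard sketched for try_nestloop_path() and friends: reject a
 * proposed path whose parameterization (required_outer) overlaps the
 * incompatible set — OJ relids, and possibly their min LHS/RHS —
 * computed alongside param_source_rels in add_paths_to_joinrel(). */
static bool
param_is_allowed(Relids required_outer, Relids incompatible_relids)
{
    return !overlaps(required_outer, incompatible_relids);
}
```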
Re: Synchronizing slots from primary to standby
Hi, On 6/26/23 12:34 PM, Amit Kapila wrote: On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand wrote: On 6/20/23 12:22 PM, Amit Kapila wrote: On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand wrote: In such a case (slot valid on the primary but invalidated on the standby) then I think we could drop and recreate the invalidated slot on the standby. Will it be safe? Because after recreating the slot, it will reserve the new WAL location and build the snapshot based on that which might miss some important information in the snapshot. For example, to update the slot's position with new information from the primary, the patch uses pg_logical_replication_slot_advance() which means it will process all records and update the snapshot via DecodeCommit->SnapBuildCommitTxn(). Your concern is that the slot could have been consumed on the standby? I mean, if we suppose the "synchronized" slot can't be consumed on the standby then drop/recreate such an invalidated slot would be ok? That also may not be sufficient because as soon as the slot is invalidated/dropped, the required WAL could be removed on standby. Yeah, I think once the slot is dropped we just have to wait for the slot to be re-created on the standby according to the new synchronize_slot_names GUC. Assuming the initial slot "creation" on the standby (coming from the synchronize_slot_names usage) is working "correctly" then it should also work "correctly" once the slot is dropped. If we agree that a synchronized slot can not/should not be consumed (will implement this behavior) then I think the proposed scenario above should make sense, do you agree? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Do we want a hashset type?
On Tue, Jun 27, 2023 at 4:27 PM Joel Jacobson wrote: > > On Tue, Jun 27, 2023, at 04:35, jian he wrote: > > in SQLMultiSets.pdf(previously thread) I found a related explanation > > on page 45, 46. > > /home/jian/hashset/0001-make-int4hashset_contains-strict-and-header-file-change.patch > > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END) > > > > CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END > > > > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT > > T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST > > (OP2) AS T2 (V) ) END) > > Thanks! This was exactly what I was looking for, I knew I've seen it but > failed to find it. > > Attached is a new incremental patch as well as a full patch, since this is a > substantial change: > > Align null semantics with SQL:2023 array and multiset standards > > * Introduced a new boolean field, null_element, in the int4hashset_t type. > > * Rename hashset_count() to hashset_cardinality(). > > * Rename hashset_merge() to hashset_union(). > > * Rename hashset_equals() to hashset_eq(). > > * Rename hashset_neq() to hashset_ne(). > > * Add hashset_to_sorted_array(). > > * Handle null semantics to work as in arrays and multisets. > > * Update int4hashset_add() to allow creating a new set if none exists. > > * Use more portable int32 typedef instead of int32_t. > > This also adds a thorough test suite in array-and-multiset-semantics.sql, > which aims to test all relevant combinations of operations and values. 
> > Makefile | 2 +- > README.md | 6 ++-- > hashset--0.0.1.sql | 37 +++- > hashset-api.c | 208 > ++-- > hashset.c | 12 ++- > hashset.h | 11 +++--- > test/expected/array-and-multiset-semantics.out | 365 > ++ > test/expected/basic.out| 12 +++ > test/expected/reported_bugs.out| 6 ++-- > test/expected/strict.out | 114 > > test/expected/table.out| 8 ++--- > test/sql/array-and-multiset-semantics.sql | 232 > + > test/sql/basic.sql | 4 +-- > test/sql/benchmark.sql | 14 > test/sql/reported_bugs.sql | 6 ++-- > test/sql/strict.sql| 32 - > test/sql/table.sql | 2 +- > 17 files changed, 823 insertions(+), 248 deletions(-) > > /Joel Hi there. I changed the function hashset_contains to strict. also change the way to return an empty array. in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that int4hashset can speed distinct aggregate and distinct counts? like the following: explain(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 explain(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 explain(costs off,timing off, analyze,buffers) select count(distinct rnd) from benchmark_input_100k \watch c=3 explain(costs off,timing off, analyze,buffers) SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_100k) sub(x) \watch c=3 From 9030adbf9e46f66812fb11849c367bbcf5b3a427 Mon Sep 17 00:00:00 2001 From: pgaddict Date: Wed, 28 Jun 2023 14:09:58 +0800 Subject: [PATCH] make int4hashset_contains strict and header file changes. 
--- hashset--0.0.1.sql | 2 +- hashset-api.c | 20 2 files changed, 5 insertions(+), 17 deletions(-) diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql index d0478ce9..d448ee69 100644 --- a/hashset--0.0.1.sql +++ b/hashset--0.0.1.sql @@ -55,7 +55,7 @@ LANGUAGE C IMMUTABLE; CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int) RETURNS boolean AS 'hashset', 'int4hashset_contains' -LANGUAGE C IMMUTABLE; +LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset) RETURNS int4hashset diff --git a/hashset-api.c b/hashset-api.c index a4beef4e..ff948b55
Re: sslinfo extension - add notbefore and notafter timestamps
> On 23 Jun 2023, at 22:10, Cary Huang wrote: >> Off the cuff that doesn't seem like a bad idea, but I wonder if we should add >> them to pg_stat_ssl (or both) instead if we deem them valuable? > > I think the same information should be available to pg_stat_ssl as well. > pg_stat_ssl can show the client certificate information for all connecting > clients, and having it show the not_before and not_after timestamps of every > client is important in my opinion. The attached patch > "v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch" adds this > support This needs to adjust the tests in src/test/ssl, which now fail due to SELECT * returning a row which doesn't match what the test was coded for. >> Re the patch, it would be nice to move the logic in ssl_client_get_notafter >> and >> the _notbefore counterpart to a static function since they are copies of >> each other. > > Yes, agreed. I have made the adjustment, attached as > "v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch" The new patchset isn't updating contrib/sslinfo/meson with the 1.3 update so it fails to build with Meson. -- Daniel Gustafsson
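The suggested refactoring — one static helper shared by the notbefore/notafter entry points — can be sketched like this (the `ClientCert` struct and string return type are stand-ins; the real sslinfo code works with OpenSSL X509 objects and returns timestamps):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical decoded certificate holding both validity bounds */
typedef struct
{
    const char *not_before;
    const char *not_after;
} ClientCert;

/* One static helper doing the shared work... */
static const char *
cert_validity(const ClientCert *cert, bool notafter)
{
    return notafter ? cert->not_after : cert->not_before;
}

/* ...wrapped by the two exported entry points, so the conversion
 * logic lives in exactly one place */
const char *
ssl_client_get_notbefore(const ClientCert *cert)
{
    return cert_validity(cert, false);
}

const char *
ssl_client_get_notafter(const ClientCert *cert)
{
    return cert_validity(cert, true);
}
```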
Re: Assistance Needed: Issue with pg_upgrade and --link option
On Wed, 2023-06-28 at 11:49 +0530, Pradeep Kumar wrote: > I was under the impression that the --link option would create hard links > between the > old and new cluster's data files, but it appears that the entire old cluster > data was > copied to the new cluster, resulting in a significant increase in the new > cluster's size. Please provide some numbers, ideally du -sk Yours, Laurenz Albe
Assistance Needed: Issue with pg_upgrade and --link option
Dear Postgres Hackers, I hope this email finds you well. I am currently facing an issue while performing an upgrade using the pg_upgrade utility with the --link option. I was under the impression that the --link option would create hard links between the old and new cluster's data files, but it appears that the entire old cluster data was copied to the new cluster, resulting in a significant increase in the new cluster's size. Here are the details of my scenario: - PostgreSQL version: [Old Version: Postgres 11.4 | New Version: Postgres 14.0] - Command used for pg_upgrade: [~/pg_upgrade_testing/postgres_14/bin/pg_upgrade -b ~/pg_upgrade_testing/postgres_11.4/bin -B ~/pg_upgrade_testing/postgres_14/bin -d ~/pg_upgrade_testing/postgres_11.4/replica_db2 -D ~/pg_upgrade_testing/postgres_14/new_pg -r -k - Paths to the old and new data directories: [~/pg_upgrade_testing/postgres_11.4/replica_db2] [~/pg_upgrade_testing/postgres_14/new_pg] - OS information: [Ubuntu 22.04.2 linux] However, after executing the pg_upgrade command with the --link option, I observed that the size of the new cluster is much larger than expected. I expected the --link option to create hard links instead of duplicating the data files. I am seeking assistance to understand the following: 1. Is my understanding of the --link option correct? 2. Is there any additional configuration or step required to properly utilize the --link option? 3. Are there any limitations or considerations specific to my PostgreSQL version or file system that I should be aware of? Any guidance, clarification, or troubleshooting steps you can provide would be greatly appreciated. I want to ensure that I am utilizing the --link option correctly and optimize the upgrade process. Best regards, Pradeep Kumar
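For what it's worth, hard links can be verified directly: two names for the same inode mean the data exists once on disk. A small POSIX sketch (the file names are arbitrary):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file, hard-link it under a second name (what pg_upgrade
 * --link does per relation file), and verify both names refer to the
 * same inode: the data exists once on disk, so a correctly linked
 * upgrade should not double the space usage.
 */
static int
check_hardlink(const char *a, const char *b)
{
    struct stat sa, sb;
    int fd = open(a, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, "data", 4) != 4)
        return -1;
    close(fd);

    unlink(b);                  /* ok if b didn't exist yet */
    if (link(a, b) != 0)
        return -1;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;

    /* same inode, link count 2 => one copy of the data */
    return (sa.st_ino == sb.st_ino && sa.st_nlink == 2) ? 0 : 1;
}
```

The same check applies to the upgraded clusters: compare a relation file in both data directories with `ls -li` or `stat`; matching inode numbers mean --link worked. Note also that running `du` on each cluster directory separately counts a hard-linked file once in each directory, which can make the new cluster look like a full copy; passing both directories in a single invocation (`du -sk old_dir new_dir`) counts each inode only once.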