Re: wrong relkind error messages

2021-07-20 Thread Peter Eisentraut

On 21.07.21 04:21, Michael Paquier wrote:

On Tue, Jul 20, 2021 at 05:08:53PM +0200, Peter Eisentraut wrote:

While reviewing the logical decoding of sequences patch, I found a few more
places that could be updated in the new style introduced by this thread.
See attached patch.


Those changes look fine.  I am spotting one instance in
init_sequence() that looks worth aligning with the others?


I think if you write "ALTER SEQUENCE foo", then "foo is not a sequence" 
would be an appropriate error message, so this doesn't need changing.



Did you consider changing RangeVarCallbackForAlterRelation() or
ExecGrant_Relation() when it came to this thread?  Just noticing that,
while going through the code.


These might be worth another look, but I'd need to investigate more in 
what situations they happen.





Re: A micro-optimisation for ProcSendSignal()

2021-07-20 Thread Thomas Munro
Hi Soumyadeep and Ashwin,

Thanks for looking!

On Sun, Jul 18, 2021 at 6:58 AM Soumyadeep Chakraborty wrote:
> You might have missed a spot to initialize SERIALIZABLE_XACT->pgprocno in
> InitPredicateLocks(), so:
>
> + PredXact->OldCommittedSxact->pgprocno = INVALID_PGPROCNO;

The magic OldCommittedSxact shouldn't be the target of a "signal", but
this is definitely tidier.  Thanks.

> Slightly tangential: we should add a comment to PGPROC.pgprocno, for more
> immediate understandability:
>
> + int pgprocno; /* index of this PGPROC in ProcGlobal->allProcs */

I wonder why we need this member anyway, when you can compute it from
the address... #define GetPGProcNumber(p) ((p) - ProcGlobal->allProcs)
or something like that?  Kinda wonder why we don't use
GetPGProcByNumber() in more places instead of open-coding access to
ProcGlobal->allProcs, too...
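
Just to illustrate the arithmetic, here is a toy standalone sketch (PGPROC
reduced to a stub here; not the real struct):

#include <stdio.h>

typedef struct PGPROC
{
    int     pid;                /* stub; the real struct has many more fields */
} PGPROC;

int
main(void)
{
    PGPROC  allProcs[8];
    PGPROC *proc = &allProcs[5];

    /* The index is recoverable from the address alone. */
    printf("pgprocno = %d\n", (int) (proc - allProcs));    /* prints 5 */
    return 0;
}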

> Also, why don't we take the opportunity to get rid of SERIALIZABLEXACT->pid?
> We took a stab. Attached is v2 of your patch with these changes.

SERIALIZABLEXACT objects can live longer than the backends that
created them.  They hang around to sabotage other transactions' plans,
depending on what else they overlapped with before they committed.
With that change, the pg_locks view might show the pid of some
unrelated session that moves into the same PGPROC.

It's only an "informational" pid, and pids are imperfect information
anyway because (1) they are themselves recycled, and (2) they won't be
interesting in a hypothetical multi-threaded future.  One solution
would be to hide the pids from the view after the backend disconnects
(somehow -- add a generation number?), but they're also still kinda
useful, despite the weaknesses.  I wonder what the ideal way would be
to refer to sessions, anyway, including those that are no longer
active; perhaps we could invent a new "session ID" concept.
From 4d5787efc2d13126476a371219cb16f7ec69e41b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 11 Mar 2021 23:09:11 +1300
Subject: [PATCH v3] Optimize ProcSendSignal().

Instead of referring to target backends by pid, use pgprocno.  This
means that we don't have to scan the ProcArray, and we can drop some
special case code for dealing with the startup process.

Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
Reviewed-by: Soumyadeep Chakraborty 
Reviewed-by: Ashwin Agrawal 
---
 src/backend/access/transam/xlog.c |  1 -
 src/backend/storage/buffer/buf_init.c |  3 +-
 src/backend/storage/buffer/bufmgr.c   | 10 ++---
 src/backend/storage/lmgr/predicate.c  |  6 ++-
 src/backend/storage/lmgr/proc.c   | 49 ++-
 src/backend/utils/adt/lockfuncs.c |  1 +
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/predicate_internals.h |  1 +
 src/include/storage/proc.h|  6 +--
 9 files changed, 19 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..c2950a7c06 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7309,7 +7309,6 @@ StartupXLOG(void)
 		 */
 		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
-			PublishStartupProcessInformation();
 			EnableSyncRequestForwarding();
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
 			bgwriterLaunched = true;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index a299be1043..b9a83c0e9b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -16,6 +16,7 @@
 
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -118,7 +119,7 @@ InitBufferPool(void)
 			CLEAR_BUFFERTAG(buf->tag);
 
 			pg_atomic_init_u32(&buf->state, 0);
-			buf->wait_backend_pid = 0;
+			buf->wait_backend_pgprocno = INVALID_PGPROCNO;
 
 			buf->buf_id = i;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 86ef607ff3..26122418fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1890,11 +1890,11 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 BUF_STATE_GET_REFCOUNT(buf_state) == 1)
 			{
 /* we just released the last pin other than the waiter's */
-int			wait_backend_pid = buf->wait_backend_pid;
+int			wait_backend_pgprocno = buf->wait_backend_pgprocno;
 
 buf_state &= ~BM_PIN_COUNT_WAITER;
 UnlockBufHdr(buf, buf_state);
-ProcSendSignal(wait_backend_pid);
+ProcSendSignal(wait_backend_pgprocno);
 			}
 			else
 UnlockBufHdr(buf, buf_state);
@@ -3995,7 +3995,7 @@ UnlockBuffers(void)
 		 * got a cancel/die interrupt before getting the signal.
 		 */
 		if ((buf_state & BM_PIN_COUNT_WAITER) != 0 &&
-			buf->wait_backend_pid == MyProcPid)
+			buf->wait_backend_pgprocno == MyProc->pgprocno)
 			buf_s

Re: Doc necessity for superuser privileges to execute pg_import_system_collations()

2021-07-20 Thread Fujii Masao




On 2021/07/19 13:30, Fujii Masao wrote:



On 2021/07/19 11:45, torikoshia wrote:

Hi.

It seems that only superusers can execute pg_import_system_collations(), but 
this is not mentioned in the manual.

Since the manual mentions this for other functions that require superuser
privileges, I think it would be better to do the same for consistency.

Thoughts?


LGTM.

IMO it's better to back-patch this to v10
where pg_import_system_collations() was added.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Addition of authenticated ID to pg_stat_activity

2021-07-20 Thread Michael Paquier
On Mon, Jul 19, 2021 at 04:56:24PM +0300, Aleksander Alekseev wrote:
> Any reason why we shouldn't simply exclude internal processes from the
> view? Do they have a connection that the VIEW could show?

Yeah, they don't really have any useful information.  Still, I kept
that mainly as a matter of consistency with pg_stat_activity, and this
can be useful to find out about internal processes without relying on
an extra check based on pg_stat_activity.backend_type.  Perhaps we
could just add a check in system_views.sql based on the NULL-ness of
client_port.

> Secondly, maybe instead of magic constants like -1, we could use an
> additional text column, e.g. connection_type: "unix"?

I am not really inclined to break that further; as the comment in
pg_stat_get_activity() explains, there are cases where this looks useful:
/*
 * Unix sockets always reports NULL for host and -1 for
 * port, so it's possible to tell the difference to
 * connections we have no permissions to view, or with
 * errors.
 */

> Thirdly, not
> sure if client_hostname is really needed, isn't address:port pair
> enough to identify the client?

It seems to me that this is still useful to know about
Port->remote_hostname.

> Lastly, introducing a new GUC for
> truncating values in a single view, which can only be set at server
> start, doesn't strike me as a great idea. What is the worst-case
> scenario if instead we will always allocate
> `strlen(MyProcPort->authn_id)` and the user will truncate the result
> manually if needed?

The authenticated ID could be an SSL DN longer than the default of
128kB that this patch is proposing.  I think that it is a good idea to
provide some way for the user to control that without a
recompilation.
--
Michael




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-20 Thread David Rowley
On Mon, 19 Jul 2021 at 18:32, Ronan Dunklau  wrote:
> regression=# explain select sum(unique1 order by ten, two), sum(unique1 order
> by four), sum(unique1 order by two, four) from tenk2 group by ten;
>QUERY PLAN
> 
>  GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
>Group Key: ten
>->  Sort  (cost=1109.39..1134.39 rows=1 width=16)
>  Sort Key: ten, ten, two
>  ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=1 width=16)
>
>
> We would ideally like to sort on ten, two, four to satisfy the first and last
> aggref at once. Stripping the common prefix (ten) would eliminate this 
> problem.

Thanks for finding this.  I've made a few changes to make this case
work as you describe. Please see attached v6 patches.

Because I had to add additional code to take the GROUP BY pathkeys
into account when choosing the best ORDER BY agg pathkeys, the
function that does that became a little bigger.  To try to reduce the
complexity of it, I got rid of all the processing from the initial
loop and instead it now uses the logic from the 2nd loop to find the
best pathkeys.  The reason I'd not done that in the first place was
because I'd thought I could get away without building an additional
Bitmapset for simple cases, but that's probably fairly cheap compared
to building Pathkeys.   With the additional complexity for the GROUP
BY pathkeys, the extra code seemed not worth it.

The 0001 patch is the ORDER BY aggregate code.  0002 is to fix up some
broken regression tests in postgres_fdw that 0001 caused.  It appears
that 0001 uncovered a bug in the postgres_fdw code.  I've reported
that in [1]. If that turns out to be a bug then it'll need to be fixed
before this can progress.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvr4OeC2DBVY--zVP83-K=byrtd7f8szdhn4g+pj2f2...@mail.gmail.com


v6-0001-Add-planner-support-for-ORDER-BY-aggregates.patch
Description: Binary data


v6-0002-Make-the-postgres_fdw-test-pass.patch
Description: Binary data


Re: Have I found an interval arithmetic bug?

2021-07-20 Thread Bruce Momjian
On Tue, Jul 20, 2021 at 05:13:37PM -0700, Zhihong Yu wrote:
> On Tue, Jul 20, 2021 at 3:53 PM Bruce Momjian  wrote:
> With your patch, the following example (courtesy Bryn) still shows a fraction:
> 
> # select (interval '1 month')*1.123;
>        ?column?
> ---
>  1 mon 3 days 16:33:36
> (1 row) 
> 
> Do you think the output can be improved (by getting rid of the fraction)?

Well, I focused on how fractional units were processed inside of
interval values.  I never considered how multiplication should be
handled.  I have not really thought about how to handle that, but this
example now gives me concern:

SELECT INTERVAL '1.06 months 1 hour';
   interval
---
 1 mon 2 days 01:00:00

Notice that it rounds the '1.06 months' to '1 mon 2 days', rather than
spilling to hours/minutes/seconds, even though hours is already
specified.  I don't see a better way to handle this than what the current
code already does, but it is a bit odd.
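
For reference, the month-to-day rounding above boils down to a multiplication
and an rint() (a standalone sketch, assuming the usual 30-day-month convention
for interval arithmetic):

#include <math.h>
#include <stdio.h>

#define DAYS_PER_MONTH 30   /* convention used for fractional-month spill */

int
main(void)
{
    double  months = 1.06;  /* the fractional months in '1.06 months 1 hour' */
    int     whole = (int) months;
    int     days = (int) rint((months - whole) * DAYS_PER_MONTH);

    /* 0.06 * 30 = 1.8, which rint() rounds to 2, hence "1 mon 2 days" */
    printf("%d mon %d days\n", whole, days);
    return 0;
}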

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: add 'noError' to euc_tw_and_big5.c

2021-07-20 Thread Michael Paquier
On Wed, Jul 21, 2021 at 02:15:14AM +, wangyu...@fujitsu.com wrote:
> The 'noError' argument was added in commit ea1b99a661,
> but it seems to have been neglected at line 289 of euc_tw_and_big5.c.
> Please see the attachment.

That sounds right to me.  Double-checking the area, I am not seeing
another portion of the code to fix.
--
Michael




ORDER BY pushdowns seem broken in postgres_fdw

2021-07-20 Thread David Rowley
I'm working on a patch [1] to get the planner to consider adding
PathKeys to satisfy ORDER BY / DISTINCT aggregates.  I think this has
led me to discover some problems with postgres_fdw's handling of
pushing down ORDER BY clauses into the foreign server.

The following test exists in the postgres_fdw module:

create operator class my_op_class for type int using btree family
my_op_family as
 operator 1 public.<^,
 operator 3 public.=^,
 operator 5 public.>^,
 function 1 my_op_cmp(int, int);
-- This will not be pushed as user defined sort operator is not part of the
-- extension yet.
explain (verbose, costs off)
select array_agg(c1 order by c1 using operator(public.<^)) from ft2
where c2 = 6 and c1 < 100 group by c2;
 QUERY PLAN

 GroupAggregate
   Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
   Group Key: ft2.c2
   ->  Foreign Scan on public.ft2
 Output: c1, c2
 Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" <
100)) AND ((c2 = 6))
(6 rows)

Here the test claims that it wants to ensure that the order by using
operator(public.<^) is not pushed down into the foreign scan.
However, unless I'm mistaken, it seems there's a completely wrong
assumption there that the planner would even attempt that.  In current
master we don't add PathKeys for ORDER BY aggregates, so why would that
sort get pushed down in the first place?

If I adjust that query to something that would have the planner set
pathkeys for, it does push the ORDER BY to the foreign server without
any consideration that the sort operator is not shippable to the
foreign server.

postgres=# explain verbose select * from ft2 order by c1 using
operator(public.<^);
  QUERY PLAN
---
 Foreign Scan on public.ft2  (cost=100.28..169.27 rows=1000 width=88)
   Output: c1, c2, c3, c4, c5, c6, c7, c8
   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T
1" ORDER BY "C 1" ASC NULLS LAST
(3 rows)

Am I missing something here, or is postgres_fdw.c's
get_useful_pathkeys_for_relation() just broken?

David

[1] 
https://www.postgresql.org/message-id/flat/1882015.KPgzjnsp5C%40aivenronan#159e89188e172ca38cb28ef7c5be9b2c




Re: 回复: Why is XLOG_FPI_FOR_HINT always need backups?

2021-07-20 Thread Fujii Masao




On 2021/07/19 10:16, Kyotaro Horiguchi wrote:

At Sat, 17 Jul 2021 00:14:34 +0900, Fujii Masao wrote in

Thanks for updating the patch! It basically looks good to me.

 * Full-page image (FPI) records contain nothing else but a 
backup
 * block (or multiple backup blocks). Every block reference must
 * include a full-page image - otherwise there would be no 
point in
 * this record.

The above comment also needs to be updated?


In short, no.  In contrast to the third paragraph, the first paragraph
should be read as describing XLOG_FPI.  However, that is actually
not super obvious, so it's better to make it clearer. In addition to
that, it seems to me (yes, to *me*) somewhat confused about the distinction
between "block reference", "backup block" and "full-page image". So I'd like
to adjust the paragraph as follows.


Understood. Thanks for updating the patch!

I slightly modified the comments and pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: wrong relkind error messages

2021-07-20 Thread Michael Paquier
On Tue, Jul 20, 2021 at 05:08:53PM +0200, Peter Eisentraut wrote:
> While reviewing the logical decoding of sequences patch, I found a few more
> places that could be updated in the new style introduced by this thread.
> See attached patch.

Those changes look fine.  I am spotting one instance in
init_sequence() that looks worth aligning with the others?

Did you consider changing RangeVarCallbackForAlterRelation() or
ExecGrant_Relation() when it came to this thread?  Just noticing that,
while going through the code.
--
Michael




add 'noError' to euc_tw_and_big5.c

2021-07-20 Thread wangyu...@fujitsu.com
Hi, Heikki

The 'noError' argument was added in commit ea1b99a661,
but it seems to have been neglected at line 289 of euc_tw_and_big5.c.
Please see the attachment.

Regards,
Yukun Wang


add-noError.patch
Description: add-noError.patch


Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error

2021-07-20 Thread Yugo NAGATA
On Mon, 19 Jul 2021 10:51:36 +0900
Yugo NAGATA  wrote:

> Hello Fabien,
> 
> On Sat, 17 Jul 2021 07:03:01 +0200 (CEST)
> Fabien COELHO  wrote:
> 
> > 
> > Hello Yugo-san,
> > 
> > > [...] One way to avoid these errors is to send Parse messages before
> > > pipeline mode starts. I attached a patch that prepares the commands at the
> > > start of a script instead of at the first execution of each command.
> > 
> > > What do you think?
> > 
> > ISTM that moving prepare out of command execution is a good idea, so I'm 
> > in favor of this approach: the code is simpler and cleaner.
> > 
> > ISTM that a minor impact is that the preparation is not counted in the 
> > command performance statistics. I do not think that it is a problem, even 
> > if it would change detailed results under -C -r -M prepared.
> 
> I agree with you. Currently, whether prepares are sent in pipeline mode or
> not depends on whether the first SQL command is placed between \startpipeline
> and \endpipeline, regardless of whether other commands are executed in a
> pipeline or not. ISTM this behavior would not be intuitive for users.
> Therefore, I think preparing all statements outside of pipeline mode is not a
> problem for now.
> 
> If some users would like to send prepares in a pipeline, I think it would be
> better to provide a simpler and more direct way. For example, we could prepare
> statements in a pipeline if the user specifies an option, or if the script has
> at least one \startpipeline. Maybe a --pipeline option would be useful for
> users who want to use pipeline mode for all queries in scripts, including
> built-in ones. However, these features seem to be outside the scope of the
> patch proposed in this thread.
> 
> > Patch applies & compiles cleanly, global & local make check ok. However 
> > the issue is not tested. I think that the patch should add a tap test case 
> > for the problem being addressed.
> 
> Ok. I'll add a tap test to confirm the error I found is avoidable.
> 
> > I'd suggest to move the statement preparation call in the 
> > CSTATE_CHOOSE_SCRIPT case.
> 
> I thought so at first, but I noticed we cannot do it at least if we are
> using -C because the connection may not be established in the
> CSTATE_CHOOSE_SCRIPT state. 
> 
> > In comments: not yet -> needed.
> 
> Thanks. I'll fix it.

I attached the updated patch v2, which includes a comment fix and a TAP test.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 364b5a2e47..4e6db32fc9 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2858,6 +2858,30 @@ chooseScript(TState *thread)
 	return i - 1;
 }
 
+/* Prepare SQL commands in the chosen script */
+static void
+prepareCommands(CState *st)
+{
+	int			j;
+	Command   **commands = sql_script[st->use_file].commands;
+
+	for (j = 0; commands[j] != NULL; j++)
+	{
+		PGresult   *res;
+		char		name[MAX_PREPARE_NAME];
+
+		if (commands[j]->type != SQL_COMMAND)
+			continue;
+		preparedStatementName(name, st->use_file, j);
+		res = PQprepare(st->con, name,
+		commands[j]->argv[0], commands[j]->argc - 1, NULL);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+			pg_log_error("%s", PQerrorMessage(st->con));
+		PQclear(res);
+	}
+	st->prepared[st->use_file] = true;
+}
+
 /* Send a SQL command, using the chosen querymode */
 static bool
 sendCommand(CState *st, Command *command)
@@ -2891,42 +2915,6 @@ sendCommand(CState *st, Command *command)
 		char		name[MAX_PREPARE_NAME];
 		const char *params[MAX_ARGS];
 
-		if (!st->prepared[st->use_file])
-		{
-			int			j;
-			Command   **commands = sql_script[st->use_file].commands;
-
-			for (j = 0; commands[j] != NULL; j++)
-			{
-PGresult   *res;
-char		name[MAX_PREPARE_NAME];
-
-if (commands[j]->type != SQL_COMMAND)
-	continue;
-preparedStatementName(name, st->use_file, j);
-if (PQpipelineStatus(st->con) == PQ_PIPELINE_OFF)
-{
-	res = PQprepare(st->con, name,
-	commands[j]->argv[0], commands[j]->argc - 1, NULL);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_log_error("%s", PQerrorMessage(st->con));
-	PQclear(res);
-}
-else
-{
-	/*
-	 * In pipeline mode, we use asynchronous functions. If a
-	 * server-side error occurs, it will be processed later
-	 * among the other results.
-	 */
-	if (!PQsendPrepare(st->con, name,
-	   commands[j]->argv[0], commands[j]->argc - 1, NULL))
-		pg_log_error("%s", PQerrorMessage(st->con));
-}
-			}
-			st->prepared[st->use_file] = true;
-		}
-
 		getQueryParams(st, command, params);
 		preparedStatementName(name, st->use_file, st->command);
 
@@ -3194,6 +3182,11 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	memset(st->prepared, 0, sizeof(st->prepared));
 }
 
+
+/* Prepare SQL commands if needed */
+if (querymode == QUERY_PREPARED && !st->prepared[st->use_file])
+	prepareCommands(st);
+
 /* record transaction start ti

Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread James Coleman
On Tue, Jul 20, 2021 at 4:35 AM David Rowley  wrote:
>
> On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> > To be clear up front: I'm in favor of the patch, and I don't want to
> > put unnecessary stumbling blocks up for it getting committed. So if we
> > decide to proceed as is, that's fine with me.
>
> Thanks for making that clear.
>
> > But I'm not sure that the "cost model, unfortunately, does not account
> > for that" is entirely accurate. The end of cost_tuplesort contains a
> > cost "per extracted tuple". It does, however, note that it doesn't
> > charge cpu_tuple_cost, which maybe is what you'd want to fully
> > incorporate this into the model. But given this run_cost isn't about
> > accounting for comparison cost (that's been done earlier) which is the
> > part that'd be the same between tuple and datum sort, it seems to me
> > that we could lower the cpu_operator_cost here by something like 10%
> > if it's byref and 30% if it's byval?
>
> I failed to notice that last part that adds the additional cpu_operator_cost.
>
> The default cpu_operator_cost is 0.0025, so with the 10k tuple
> benchmark, that adds an additional charge of 25 to the total cost.
>
> If we take test 1 from my results on v5 as an example:
>
> > Test1 446.1 657.3 147.32%
>
> Looking at explain for that query:
>
> regression=# explain select two from tenk1 order by two offset 100;
>   QUERY PLAN
> --
>  Limit  (cost=1133.95..1133.95 rows=1 width=4)
>->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
>  Sort Key: two
>  ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
> (4 rows)
>
> If we want the costs to reflect reality again here then we'd have to
> reduce 1133.95 by something like 147.32% (the performance difference).
> That would bring the cost down to 769.72, which is far more of an
> adjustment than the 25 that the cpu_operator_cost * tuples surcharge
> gives us to play with.
>
> If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
> cost would become 1126.45.  That's not great considering the actual
> performance indicates that 769.72 would be a better number.
>
> If we look at John's result for test 1: He saw 588 tps on master and
> 998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
> to get close to reality on that machine.
>
> My thoughts are that the small surcharge added at the end of
> cost_tuplesort() is just not enough for us to play with.  I think getting
> us closer to fixing this correctly would require a redesign of the
> tuplesort costing entirely. I think that would be about an order of
> magnitude more effort than this patch was, so I really feel like I
> don't want to do this.
>
> I kinda feel that since the comparison_cost is always just 2.0 *
> cpu_operator_cost regardless of the number of columns in the sort,
> then if we add too many new smarts to try and properly adjust for this
> new optimization, unless we do a completely new cost model for this,
> then we might as well be putting lipstick on a pig.
>
> It sounds like James mostly just mentioned the sorting just to ensure
> it was properly considered and does not really feel strongly that it
> needs to be adjusted.  Does anyone else feel that we should be
> adjusting it?

Thanks for doing the math measuring how much we could impact things.

I'm +lots on getting this committed as is.

Thanks all for your work on the improvement!

James




Re: Bitmap reuse

2021-07-20 Thread Tom Lane
David Rowley  writes:
> On Wed, 21 Jul 2021 at 11:25, Tom Lane  wrote:
>> Uh  it's not the "exact same bitmap each time", because the selected
>> rows vary depending on the value of g.i.

> I imagined Jeff was meaning the bitmap from the scan of foo_x_idx, not
> the combined ANDed bitmap from both indexes.

To re-use that, you'd need a way to prevent the upper node from
destructively modifying the tidbitmap.

regards, tom lane




Re: Question about non-blocking mode in libpq

2021-07-20 Thread Yugo NAGATA
Hello Alvaro,

On Tue, 20 Jul 2021 12:05:11 -0400
Alvaro Herrera  wrote:

> On 2021-Jul-19, Yugo NAGATA wrote:
> 
> > On Tue, 13 Jul 2021 11:59:49 +0900
> > Yugo NAGATA  wrote:
> 
> > > However, looking into the code, PQsendQuery seems not to return an error
> > > in non-blocking mode even if unable to send all data. In such cases,
> > > pqSendSome will return 1 but it doesn't cause an error. Moreover,
> > > we would not need to call PQsendQuery again. Instead, we need to call
> > > PQflush until it returns 0, as documented with regard to PQflush.
> > > 
> > > Do we need to fix the description of PQsetnonblocking?
> 
> Yeah, I think you're right -- these functions don't error out, the
> commands are just stored locally in the output buffer.

Thank you for your explanation!
I attached a patch to fix the description.

> >  "In the nonblocking state, calls to PQsendQuery, PQputline, PQputnbytes,
> >  PQputCopyData, and PQendcopy will not block" 
> > 
> > This seems to me to be a list of functions that could block in blocking
> > mode, but I wonder whether PQflush could also block, because it calls
> > pqSendSome. Right?
> 
> I don't see that.  If pqSendSome can't write anything, it'll just return 1.

Well, isn't that only the case in non-blocking mode? If I understood correctly,
pqSendSome could block in blocking mode, so PQflush could block, too.  I thought
we should add PQflush to the list in the description to emphasize that it would
not block in non-blocking mode. However, now I don't think so, because PQflush
seems useful only in non-blocking mode.

> > Also, in the last paragraph of the section, I can find the following:
> > 
> >  "After sending any command or data on a nonblocking connection, call 
> > PQflush. ..."
> > 
> > However, ISTM we don't need to call PQflush in non-blocking mode and we can
> > call PQgetResult immediately because PQgetResult internally calls pqFlush
> > until it returns 0 (or -1).
> 
> Well, maybe you don't *need* to PQflush(); but if you don't call it,
> then the commands will sit in the output buffer indefinitely, which
> means the server won't execute them.  So even if it works to just call
> PQgetResult and have it block, surely you would like to only call
> PQgetResult when the query has already been executed and the result
> already been received and processed; that is, so that you can call
> PQgetResult and obtain the result immediately, and avoid (say) blocking
> a GUI interface while PQgetResult flushes the commands out, the server
> executes the query and sends the results back.

I understood that, although PQgetResult() also flushes the buffer, we still
should call PQflush() beforehand because we would not like to get blocked after
calling PQgetResult(). Thanks.
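
For reference, the calling pattern we are describing looks roughly like the
sketch below (minimal, with most error handling elided):

#include <sys/select.h>
#include <libpq-fe.h>

/* Send a query on a nonblocking connection, flush, then fetch the result. */
static PGresult *
send_and_get_result(PGconn *conn, const char *query)
{
    if (PQsetnonblocking(conn, 1) != 0)
        return NULL;
    if (!PQsendQuery(conn, query))
        return NULL;

    /* Keep calling PQflush until the output buffer is empty. */
    for (;;)
    {
        int     rc = PQflush(conn);

        if (rc < 0)
            return NULL;        /* failure */
        if (rc == 0)
            break;              /* everything has been sent */

        /* Could not send it all yet: wait on the socket, drain any input. */
        {
            int     sock = PQsocket(conn);
            fd_set  rfds;
            fd_set  wfds;

            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);
            FD_ZERO(&wfds);
            FD_SET(sock, &wfds);
            if (select(sock + 1, &rfds, &wfds, NULL, NULL) < 0)
                return NULL;
            if (FD_ISSET(sock, &rfds) && !PQconsumeInput(conn))
                return NULL;
        }
    }

    /* The query is already on the wire, so this won't stall on sending. */
    return PQgetResult(conn);
}

PQgetResult() can of course still wait for the server to compute the result;
the point is only that the commands have already been flushed by then.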

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 56689ba873..03ac480d0c 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -4924,8 +4924,8 @@ int PQsetnonblocking(PGconn *conn, int arg);
In the nonblocking state, calls to
, ,
, ,
-   and  will not block but instead return
-   an error if they need to be called again.
+   and  will not block but the commands are
+   stored locally in the output buffer until it is flushed.
   
 
   


Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 23:28, Ranier Vilela  wrote:
> I took a look at cost_tuplesort and I think that some small adjustments could 
> be made as part of the improvement process.
> It is attached.
> 1. long is a very problematic type; better int64?
> 2. 1024 can be int, not long?
> 3. 2 changed all to 2.0 (double)?
> 4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?
>
> Finally, to at least document (add comments) those conclusions,
> would be nice, wouldn't it?

I don't think there's anything useful here. If you think otherwise,
please take it to another thread.  Also, I'd recommend at least
compiling any patches you send to -hackers in the future. Going by the
CF bot, this one does not compile.

You might also want to read up on type promotion rules in C.  Your
sort_mem calculation change does not do what you think it does. Class
it as homework to figure out what's wrong with it.  No need to report
your findings here. Just thought it would be useful for you to learn
those things.

David




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Fri, 16 Jul 2021 at 15:44, David Rowley  wrote:
>
> On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau  wrote:
> > Please find attached a v9 just moving the flag setting to ExecInitSort, and 
> > my
> > apologies if I misunderstood your point.
>
> I took this and adjusted a few things and ended up with the attached patch.

Attaching the same v10 patch again so the CF bot picks up the correct
patch.

David


v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patch
Description: Binary data


Re: Micro-optimizations to avoid some strlen calls.

2021-07-20 Thread David G. Johnston
On Tue, Jul 20, 2021 at 5:28 PM Michael Paquier  wrote:

> On Mon, Jul 19, 2021 at 07:48:55PM -0300, Ranier Vilela wrote:
> > There are some places, where strlen can have an overhead.
> > This patch tries to fix this.
> >
> > Pass check-world at linux ubuntu (20.04) 64 bits.
>
> Why does it matter?  No code paths you are changing here are
> performance-critical, meaning that such calls won't really show up
> high in profiles.
>
> I don't think there is anything to change here.
>

Agreed.  To borrow from a nearby email of a similar nature (PGconn
information retrieval IIRC) - it is not generally a benefit to replace
repeated function-call access to data within a block with a saved local
variable. The function call tends to be more readable than having yet one
more unimportant name to keep in short-term memory.  Since much of the
code already conforms to that, the status quo is the preferred state unless
there is a demonstrable performance gain to be had.  The readability, and
lack of churn, is otherwise more important.

David J.


Re: CLUSTER on partitioned index

2021-07-20 Thread Michael Paquier
On Tue, Jul 20, 2021 at 08:27:02PM -0400, Alvaro Herrera wrote:
> I have to wonder if there really *is* a use case for CLUSTER in the
> first place on regular tables, let alone on partitioned tables, which
> are likely to be large and thus take a lot of time.  What justifies
> spending so much time on this implementation?  My impression is that
> CLUSTER is pretty much a fringe command nowadays, because of the access
> exclusive lock required.
> 
> Does anybody actually use it?

Yeah, I am not getting really excited about doing anything here
either.  I thought for some time about the interactions with
indisclustered and partitioned tables, but anything I could come up
with felt clunky.
--
Michael




Re: Bitmap reuse

2021-07-20 Thread David Rowley
On Wed, 21 Jul 2021 at 11:25, Tom Lane  wrote:
>
> Jeff Janes  writes:
> > For some queries PostgreSQL can spend most of its time creating the exact
> > same bitmap over and over.  For example, in the below case: (also attached
> > as a file because line-wrapping is going to make a mess of it)
>
> Uh  it's not the "exact same bitmap each time", because the selected
> rows vary depending on the value of g.i.

I imagined Jeff was meaning the bitmap from the scan of foo_x_idx, not
the combined ANDed bitmap from both indexes.

I didn't look in detail, but I'd think it would just be a matter of
caching the bitmap: then, in ExecReScanBitmapIndexScan(), if the
PlanState's chgParam indicates a parameter has changed, throw away
the cache.  Then, if the cache is still valid in
MultiExecBitmapIndexScan(), return it.  I'm not too sure about the memory
context handling for the cache. As I said, I didn't look in detail.
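
Roughly, the rescan side might look like the sketch below. The cached_tbm
field is hypothetical (it does not exist today), and actually reusing the
bitmap would also have to cope with the parent node modifying it, as Tom
points out elsewhere in this thread:

/*
 * Sketch only: drop a hypothetical cached bitmap when a parameter the
 * scan depends on has changed.
 */
void
ExecReScanBitmapIndexScan(BitmapIndexScanState *node)
{
    if (node->ss.ps.chgParam != NULL && node->cached_tbm != NULL)
    {
        tbm_free(node->cached_tbm);
        node->cached_tbm = NULL;
    }

    /* ... the existing rescan logic (rebuilding scan keys etc.) follows ... */
}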

David




Re: Micro-optimizations to avoid some strlen calls.

2021-07-20 Thread Michael Paquier
On Mon, Jul 19, 2021 at 07:48:55PM -0300, Ranier Vilela wrote:
> There are some places, where strlen can have an overhead.
> This patch tries to fix this.
> 
> Pass check-world at linux ubuntu (20.04) 64 bits.

Why does it matter?  No code paths you are changing here are
performance-critical, meaning that such calls won't really show up
high in profiles.

I don't think there is anything to change here.
--
Michael




Re: CLUSTER on partitioned index

2021-07-20 Thread Alvaro Herrera
I have to wonder if there really *is* a use case for CLUSTER in the
first place on regular tables, let alone on partitioned tables, which
are likely to be large and thus take a lot of time.  What justifies
spending so much time on this implementation?  My impression is that
CLUSTER is pretty much a fringe command nowadays, because of the access
exclusive lock required.

Does anybody actually use it?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)




Re: Column Filtering in Logical Replication

2021-07-20 Thread Alvaro Herrera
Hello,

I think this looks good regarding the PublicationRelationInfo API that was
discussed.

Looking at OpenTableList(), I think you forgot to update the comment --
it says "open relations specified by a RangeVar list", but the list is
now of PublicationTable.  Also I think it would be good to say that the
returned tables are PublicationRelationInfo, maybe such as "In the
returned list of PublicationRelationInfo, the tables are locked ..."

In AlterPublicationTables() I was confused by some code that seemed
commented a bit too verbosely (for a moment I thought the whole list was
being copied into a different format).  May I suggest something more
compact like

/* Not yet in list; open it and add it to the list */
if (!found)
{
    Relation    oldrel;
    PublicationRelationInfo *pubrel;

    oldrel = table_open(oldrelid, ShareUpdateExclusiveLock);

    /* Wrap it in PublicationRelationInfo */
    pubrel = palloc(sizeof(PublicationRelationInfo));
    pubrel->relation = oldrel;
    pubrel->relid = oldrelid;
    pubrel->columns = NIL;  /* not needed */

    delrels = lappend(delrels, pubrel);
}

Thanks!

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: Have I found an interval arithmetic bug?

2021-07-20 Thread Zhihong Yu
On Tue, Jul 20, 2021 at 3:53 PM Bruce Momjian  wrote:

> On Tue, Jul 20, 2021 at 02:33:07PM -0700, Zhihong Yu wrote:
> > On Mon, Jul 19, 2021 at 9:14 PM Bruce Momjian  wrote:
> > > Obviously this should return '1 mon 26 days', but with my most recent
> > > patch, it returned '1 mon 25 days'.  Turns out I had not properly used
> > > rint() in AdjustFractDays, and in fact the function is no longer needed
> > > because it is just a multiplication and an rint().
> >
> > Patch looks good.
> > Maybe add the statement above as a test case:
> >
> > SELECT INTERVAL '1.8594 months'
>
> Good idea --- updated patch attached.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>
Hi,
With your patch, the following example (courtesy Bryn) still shows a fraction:

# select (interval '1 month')*1.123;
   ?column?
---
 1 mon 3 days 16:33:36
(1 row)

Do you think the output can be improved (by getting rid of the fraction)?

Thanks


Re: badly calculated width of emoji in psql

2021-07-20 Thread Jacob Champion
On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:
> On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:
> > > In your opinion, would the current one-line patch proposal make things
> > > strictly better than they are today, or would it have mixed results?
> > > I'm wondering how to help this patch move forward for the current
> > > commitfest, or if we should maybe return with feedback for now.
> > 
> > Based on the following list, it seems to me that [u+1f300,u+0x1faff]
> > won't capture everything, like the country flags:
> > http://www.unicode.org/emoji/charts/full-emoji-list.html

On my machine, the regional indicator codes take up one column each
(even though they display as a wide uppercase letter), so making them
wide would break alignment. This seems to match up with Annex #11 [1]:

ED4. East Asian Wide (W): All other characters that are always
 wide. [...] This category includes characters that have explicit
 halfwidth counterparts, along with characters that have the [UTS51]
 property Emoji_Presentation, with the exception of characters that
 have the [UCD] property Regional_Indicator

So for whatever reason, those indicator codes aren't considered East
Asian Wide by Unicode (and therefore glibc), even though they are
Emoji_Presentation. And glibc appears to be using East Asian Wide as
the flag for a 2-column character.

glibc 2.31 is based on Unicode 12.1, I think. So if Postgres is built
against a Unicode database that's different from the system's,
obviously you'll see odd results no matter what we do here.

And _all_ of that completely ignores the actual country-flag-combining
behavior, which my terminal doesn't do and I assume would be part of a
separate conversation entirely, along with things like ZWJ sequences.

> That could be adapted; the question is if the approach as such is
> desirable or not.  This is necessarily a moving target, at the rate
> that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.

> My personal feeling is that something simple and perhaps imperfect
> as my one-liner that may miss some corner cases would be ok, but
> anything that saps more performance or is complicated would not
> be worth the effort.

Another data point: on my machine (Ubuntu 20.04, glibc 2.31) that
additional range not only misses a large number of emoji (e.g. in the
2xxx codepoint range), it incorrectly treats some narrow codepoints as
wide (e.g. many in the 1F32x range have Emoji_Presentation set to
false).

I note that the doc comment for ucs_wcwidth()...

>  *  - Spacing characters in the East Asian Wide (W) or East Asian
>  *FullWidth (F) category as defined in Unicode Technical
>  *Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.
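
To make that concrete: the lookup itself is just a binary search over sorted
codepoint ranges, so keeping up with Unicode is mostly a matter of
regenerating the ranges from a newer database. A standalone sketch with a
deliberately tiny, incomplete table:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    unsigned int first;
    unsigned int last;
} mbinterval;

/* A tiny stand-in for the generated "wide characters" table. */
static const mbinterval wide[] =
{
    {0x1100, 0x115F},       /* Hangul Jamo leading consonants */
    {0x1F300, 0x1F64F},     /* a slice of the emoji blocks, for illustration */
};

/* Return true if codepoint ucs falls inside one of the sorted intervals. */
static bool
mbinterval_contains(unsigned int ucs, const mbinterval *table, int max)
{
    int     min = 0;

    if (ucs < table[0].first || ucs > table[max].last)
        return false;

    while (max >= min)
    {
        int     mid = (min + max) / 2;

        if (ucs > table[mid].last)
            min = mid + 1;
        else if (ucs < table[mid].first)
            max = mid - 1;
        else
            return true;
    }
    return false;
}

/* Width is 2 for "wide" codepoints, 1 otherwise (combining chars ignored). */
static int
sketch_wcwidth(unsigned int ucs)
{
    int     last = (int) (sizeof(wide) / sizeof(wide[0])) - 1;

    return mbinterval_contains(ucs, wide, last) ? 2 : 1;
}

int
main(void)
{
    printf("U+1F300 -> %d columns\n", sketch_wcwidth(0x1F300));  /* 2 */
    printf("U+0041  -> %d columns\n", sketch_wcwidth(0x41));     /* 1 */
    return 0;
}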

--Jacob

[1] http://www.unicode.org/reports/tr11/


Bitmap reuse

2021-07-20 Thread Jeff Janes
For some queries PostgreSQL can spend most of its time creating the exact
same bitmap over and over.  For example, in the below case: (also attached
as a file because line-wrapping is going to make a mess of it)

drop table if exists foo;
create table foo (x daterange, i int, t text);
insert into foo select daterange(x::date,x::date+3), random()*3000 from
(select now()-interval '3 years'*random() as x from
generate_series(1,1e6))foo;
vacuum analyze foo;
create index ON  foo using gist  ( x);
create index ON  foo   ( i);
explain (analyze, buffers) select * from generate_series(1,20) g(i), foo
where x && '[2019-08-09,2019-08-11)' and g.i=foo.i;

  QUERY PLAN

---
 Nested Loop  (cost=170.21..3563.24 rows=33 width=54) (actual
time=1.295..24.890 rows=28 loops=1)
   Buffers: shared hit=543 read=8
   I/O Timings: read=0.040
   ->  Function Scan on generate_series g  (cost=0.00..0.20 rows=20
width=4) (actual time=0.007..0.014 rows=20 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=170.21..178.13 rows=2 width=50)
(actual time=1.238..1.240 rows=1 loops=20)
 Recheck Cond: ((i = g.i) AND (x &&
'[2019-08-09,2019-08-11)'::daterange))
 Heap Blocks: exact=28
 Buffers: shared hit=543 read=8
 I/O Timings: read=0.040
 ->  BitmapAnd  (cost=170.21..170.21 rows=2 width=0) (actual
time=1.234..1.234 rows=0 loops=20)
   Buffers: shared hit=515 read=8
   I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_i_idx  (cost=0.00..6.92
rows=333 width=0) (actual time=0.031..0.031 rows=327 loops=20)
 Index Cond: (i = g.i)
 Buffers: shared hit=55 read=8
 I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..161.78
rows=5000 width=0) (actual time=1.183..1.183 rows=3670 loops=20)
 Index Cond: (x && '[2019-08-09,2019-08-11)'::daterange)
 Buffers: shared hit=460

Note that the fast bitmap index scan is parameterized to the other side of
the nested loop, so it has to be recomputed, while the slow one is
parameterized to a constant, so it could in principle just be reused.

What kind of infrastructure would be needed to detect this case and reuse
that bitmap?

Cheers,

Jeff
drop table if exists foo;
create table foo (x daterange, i int, t text);
insert into foo select daterange(x::date,x::date+3), random()*3000 from (select 
now()-interval '3 years'*random() as x from generate_series(1,1e6))foo;
vacuum analyze foo;
create index ON  foo using gist  ( x);
create index ON  foo   ( i);
explain (analyze, buffers) select * from generate_series(1,20) g(i), foo where 
x && '[2019-08-09,2019-08-11)' and g.i=foo.i;
\q
  QUERY PLAN
   
---
 Nested Loop  (cost=170.21..3563.24 rows=33 width=54) (actual 
time=1.295..24.890 rows=28 loops=1)
   Buffers: shared hit=543 read=8
   I/O Timings: read=0.040
   ->  Function Scan on generate_series g  (cost=0.00..0.20 rows=20 width=4) 
(actual time=0.007..0.014 rows=20 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=170.21..178.13 rows=2 width=50) (actual 
time=1.238..1.240 rows=1 loops=20)
 Recheck Cond: ((i = g.i) AND (x && 
'[2019-08-09,2019-08-11)'::daterange))
 Heap Blocks: exact=28
 Buffers: shared hit=543 read=8
 I/O Timings: read=0.040
 ->  BitmapAnd  (cost=170.21..170.21 rows=2 width=0) (actual 
time=1.234..1.234 rows=0 loops=20)
   Buffers: shared hit=515 read=8
   I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_i_idx  (cost=0.00..6.92 rows=333 
width=0) (actual time=0.031..0.031 rows=327 loops=20)
 Index Cond: (i = g.i)
 Buffers: shared hit=55 read=8
 I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..161.78 rows=5000 
width=0) (actual time=1.183..1.183 rows=3670 loops=20)
 Index Cond: (x && '[2019-08-09,2019-08-11)'::daterange)
 Buffers: shared hit=460
 Planning:
   Buffers: shared hit=47 read=4
   I/O Timings: read=0.034
 Planning Time: 0.380 ms
 Execution Time: 24.937 ms
(24 rows)

Time: 26.094 ms


Re: Bitmap reuse

2021-07-20 Thread Tom Lane
Jeff Janes  writes:
> For some queries PostgreSQL can spend most of its time creating the exact
> same bitmap over and over.  For example, in the below case: (also attached
> as a file because line-wrapping is going to make a mess of it)

Uh  it's not the "exact same bitmap each time", because the selected
rows vary depending on the value of g.i.

If the output of the subplan was indeed constant, I'd expect the planner
to stick a Materialize node atop it.  That would almost certainly win
more than re-using the index output to scan the heap additional times.

regards, tom lane




Re: POC: GROUP BY optimization

2021-07-20 Thread Tomas Vondra
On 7/20/21 3:12 PM, Tomas Vondra wrote:
> ...
>
> The other comments from the review still apply - I'm particularly
> concerned about the (1) point, i.e. plan changes in postgres_fdw. Those
> seem to be rather strange (LIMIT not being pushed down in queries
> without any grouping). I'd bet this is due to changes in sort costing
> and does not seem very desirable.
> 

I did look at this a bit closer, and yes - this very much seems like a
costing issue in the patch. The first query in postgres_fdw that changes
switches from a query with LIMIT pushed to remote

  QUERY PLAN

 Foreign Scan  (cost=196.29..196.52 rows=10 width=14)
   Output: t1.c1, t2.c1, t1.c3
   Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
   Remote SQL: SELECT r1."C 1", r2."C 1", r1.c3 FROM ("S 1"."T 1" r1
   INNER JOIN "S 1"."T 1" r2 ON ... LIMIT ... OFFSET ...
(4 rows)

to the LIMIT executed locally

  QUERY PLAN
-
 Limit  (cost=241.72..241.94 rows=10 width=14)
   Output: t1.c1, t2.c1, t1.c3
   ->  Foreign Scan  (cost=239.47..261.97 rows=1000 width=14)
 Output: t1.c1, t2.c1, t1.c3
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r2."C 1", r1.c3 FROM ("S 1"."T 1"
 r1 INNER JOIN ...
(6 rows)

The FDW code runs explain on both queries - with and without LIMIT
pushed to remote side, to get estimates, and without the patch it gets this:

  QUERY PLAN
--
 Sort  (cost=106.97..109.47 rows=1000 width=14)
   Sort Key: r1.c3, r1."C 1"
   ->  Hash Join  (cost=33.50..57.14 rows=1000 width=14)
 Hash Cond: (r1."C 1" = r2."C 1")
 ->  Seq Scan on "T 1" r1  (cost=0.00..21.00 rows=1000 width=10)
 ->  Hash  (cost=21.00..21.00 rows=1000 width=4)
   ->  Seq Scan on "T 1" r2  (cost=0.00..21.00 rows=1000
width=4)
(7 rows)

 QUERY PLAN

 Limit  (cost=96.29..96.32 rows=10 width=14)
   ->  Sort  (cost=96.04..98.54 rows=1000 width=14)
 Sort Key: r1.c3, r1."C 1"
 ->  Hash Join  (cost=33.50..57.14 rows=1000 width=14)
   Hash Cond: (r1."C 1" = r2."C 1")
   ->  Seq Scan on "T 1" r1  (cost=0.00..21.00 rows=1000
width=10)
   ->  Hash  (cost=21.00..21.00 rows=1000 width=4)
 ->  Seq Scan on "T 1" r2  (cost=0.00..21.00
rows=1000 width=4)
(8 rows)


while with the patch it gets this:

  QUERY PLAN
--
 Sort  (cost=139.47..141.97 rows=1000 width=14)
   Sort Key: r1.c3, r1."C 1"
   ->  Hash Join  (cost=33.50..57.14 rows=1000 width=14)
 Hash Cond: (r1."C 1" = r2."C 1")
 ->  Seq Scan on "T 1" r1  (cost=0.00..21.00 rows=1000 width=10)
 ->  Hash  (cost=21.00..21.00 rows=1000 width=4)
   ->  Seq Scan on "T 1" r2  (cost=0.00..21.00 rows=1000
width=4)
(7 rows)

 QUERY PLAN

 Limit  (cost=145.75..145.77 rows=10 width=14)
   ->  Sort  (cost=145.50..148.00 rows=1000 width=14)
 Sort Key: r1.c3, r1."C 1"
 ->  Hash Join  (cost=33.50..57.14 rows=1000 width=14)
   Hash Cond: (r1."C 1" = r2."C 1")
   ->  Seq Scan on "T 1" r1  (cost=0.00..21.00 rows=1000
width=10)
   ->  Hash  (cost=21.00..21.00 rows=1000 width=4)
 ->  Seq Scan on "T 1" r2  (cost=0.00..21.00
rows=1000 width=4)
(8 rows)


Notice that the costs get "inverted"

            master           patched
            ---------------------------------
 no limit   106.97..109.47   139.47..141.97
 limit       96.29.. 96.32   145.75..145.77

so the limit looks a bit more expensive - just enough so that it seems
cheaper to transfer more rows and execute the limit locally.

IMO this looks rather suspicious (or wrong), and compute_cpu_sort_cost
may need some fixes. I wonder why it differs from the old code so much ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Have I found an interval arithmetic bug?

2021-07-20 Thread Bruce Momjian
On Tue, Jul 20, 2021 at 02:33:07PM -0700, Zhihong Yu wrote:
> On Mon, Jul 19, 2021 at 9:14 PM Bruce Momjian  wrote:
> > Obviously this should return '1 mon 26 days', but with my most recent
> > patch, it returned '1 mon 25 days'.  Turns out I had not properly used
> > rint() in AdjustFractDays, and in fact the function is no longer needed
> > because it is just a multiplication and an rint().
>
> Patch looks good.
> Maybe add the statement above as a test case :
> 
> SELECT INTERVAL '1.8594 months' 

Good idea --- updated patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



interval.diff.gz
Description: application/gzip


Re: Signed vs Unsigned (take 2) (src/backend/storage/ipc/procarray.c)

2021-07-20 Thread Ranier Vilela
On Fri, Jul 16, 2021 at 09:41, Ranier Vilela wrote:

> On Fri, Jul 16, 2021 at 09:05, Aleksander Alekseev <aleksan...@timescale.com> wrote:
>
>> Hi Ranier,
>>
>> > Here are the two patches.
>> > As suggested, reclassified as refactoring only.
>>
>> Please don't change the status of the patch in the CF application before
>> it has been reviewed. It will only slow things down.
>>
> Hi Aleksander,
> Sorry, lack of practice.
>
>
>> Your patch seems to have some problems on FreeBSD. Please see
>> http://commitfest.cputube.org/
>
> I saw.
> Very strange; on all other architectures it went well.
> I will have to install FreeBSD to be able to debug.
>
There is a typo in
0001-Promove-unshadowing-of-two-variables-PGPROC-type.patch

- ProcArrayEndTransactionInternal(proc, proc->procArrayGroupMemberXid);
+ ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);

Attached is new version v1, with the fix.
It now passes check-world on FreeBSD 13 with clang 11.

regards,
Ranier Vilela


v1-0001-Promove-unshadowing-of-two-variables-PGPROC-type.patch
Description: Binary data


Re: OpenSSL 3.0.0 compatibility

2021-07-20 Thread Daniel Gustafsson
> On 20 Jul 2021, at 09:54, Michael Paquier  wrote:
> 
> On Tue, Jul 20, 2021 at 01:23:42AM +0200, Daniel Gustafsson wrote:
>> Another aspect of OpenSSL 3 compatibility is that of legacy cipher support, 
>> and
>> as we concluded upthread it's best to leave that to the user to define in
>> openssl.cnf.  The attached 0002 adds alternative output files for 3.0.0
>> installations without the legacy provider loaded, as well as adds a note in 
>> the
>> pgcrypto docs to enable it in case DES is needed.  It does annoy me a bit 
>> that
>> we don't load the openssl.cnf file for 1.0.1 if we start mentioning it in the
>> docs for other versions, but it's probably not worth the effort to fix it 
>> given
>> the lack of complaints so far (it needs a call to OPENSSL_config(NULL),
>> guarded by HAVE_ macros for 1.0.1).
> 
> Sounds sensible as a whole.

Thanks for reviewing!

> Another thing I can notice is that
> OpenSSL 3.0.0beta1 has taken care of the issue causing diffs in the
> tests of src/test/ssl/.  So once pgcrypto is addressed, it looks like
> there is nothing left for this thread.

That's a good point, I forgot to bring that up.

--
Daniel Gustafsson   https://vmware.com/





Re: logical decoding and replication of sequences

2021-07-20 Thread Tomas Vondra
Here's a rebased version of the patch, no other changes.

I think the crucial aspect of this that needs discussion/feedback the
most is the transactional vs. non-transactional behavior. All the other
questions are less important / cosmetic.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From f61f2cabf6221451af8f427a5d4b4f7f4583e618 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 20 Jul 2021 23:20:10 +0200
Subject: [PATCH] Logical decoding / replication of sequences

---
 contrib/test_decoding/Makefile|   3 +-
 contrib/test_decoding/expected/sequence.out   | 327 +++
 contrib/test_decoding/sql/sequence.sql| 119 ++
 contrib/test_decoding/test_decoding.c |  40 ++
 doc/src/sgml/catalogs.sgml|  71 
 doc/src/sgml/ref/alter_publication.sgml   |  28 +-
 doc/src/sgml/ref/alter_subscription.sgml  |   6 +-
 src/backend/catalog/pg_publication.c  | 160 +++-
 src/backend/catalog/system_views.sql  |  10 +
 src/backend/commands/publicationcmds.c| 232 ++-
 src/backend/commands/sequence.c   | 132 +-
 src/backend/commands/subscriptioncmds.c   | 272 +
 src/backend/executor/execReplication.c|   4 +-
 src/backend/nodes/copyfuncs.c |   4 +-
 src/backend/nodes/equalfuncs.c|   4 +-
 src/backend/parser/gram.y |  44 +-
 src/backend/replication/logical/decode.c  | 131 +-
 src/backend/replication/logical/logical.c |  42 ++
 src/backend/replication/logical/proto.c   |  52 +++
 .../replication/logical/reorderbuffer.c   | 384 ++
 src/backend/replication/logical/tablesync.c   | 118 +-
 src/backend/replication/logical/worker.c  |  56 +++
 src/backend/replication/pgoutput/pgoutput.c   |  80 +++-
 src/backend/utils/cache/relcache.c|   4 +-
 src/bin/psql/tab-complete.c   |   2 +-
 src/include/catalog/pg_proc.dat   |   5 +
 src/include/catalog/pg_publication.h  |  14 +
 src/include/commands/sequence.h   |   2 +
 src/include/nodes/parsenodes.h|   4 +-
 src/include/replication/logicalproto.h|  20 +
 src/include/replication/output_plugin.h   |  14 +
 src/include/replication/pgoutput.h|   1 +
 src/include/replication/reorderbuffer.h   |  34 +-
 src/test/regress/expected/publication.out |   8 +-
 src/test/regress/expected/rules.out   |   8 +
 35 files changed, 2389 insertions(+), 46 deletions(-)
 create mode 100644 contrib/test_decoding/expected/sequence.out
 create mode 100644 contrib/test_decoding/sql/sequence.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..56ddc3abae 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,8 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats twophase twophase_stream
+	spill slot truncate stream stats twophase twophase_stream \
+	sequence
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
diff --git a/contrib/test_decoding/expected/sequence.out b/contrib/test_decoding/expected/sequence.out
new file mode 100644
index 00..07f3e3e92c
--- /dev/null
+++ b/contrib/test_decoding/expected/sequence.out
@@ -0,0 +1,327 @@
+-- predictability
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+--
+ init
+(1 row)
+
+CREATE SEQUENCE test_sequence;
+-- test the sequence changes by several nextval() calls
+SELECT nextval('test_sequence');
+ nextval 
+-
+   1
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+   2
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+   3
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+   4
+(1 row)
+
+-- test the sequence changes by several ALTER commands
+ALTER SEQUENCE test_sequence INCREMENT BY 10;
+SELECT nextval('test_sequence');
+ nextval 
+-
+  14
+(1 row)
+
+ALTER SEQUENCE test_sequence START WITH 3000;
+ALTER SEQUENCE test_sequence MAXVALUE 1;
+ALTER SEQUENCE test_sequence RESTART WITH 4000;
+SELECT nextval('test_sequence');
+ nextval 
+-
+4000
+(1 row)
+
+-- test the sequence changes by several setval() calls
+SELECT setval('test_sequence', 3500);
+ setval 
+
+   3500
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+3510
+(1 row)
+
+SELECT setval('test_sequence', 3500, true);
+ setval 
+
+   3500
+(1 row)
+
+SELECT nextval('test_sequence');
+

Re: Have I found an interval arithmetic bug?

2021-07-20 Thread Zhihong Yu
On Mon, Jul 19, 2021 at 9:14 PM Bruce Momjian  wrote:

> On Wed, Jul 14, 2021 at 09:03:21AM -0700, Zhihong Yu wrote:
> > On Thu, Jul 8, 2021 at 10:22 AM Zhihong Yu wrote:
> > On Wed, Jun 30, 2021 at 9:35 AM Bruce Momjian wrote:
> >
> > On Tue, Jun 29, 2021 at 06:49:45PM +0200, Daniel Gustafsson wrote:
> > > > On 29 Jun 2021, at 18:50, Zhihong Yu wrote:
> > >
> > > > Now that PG 15 is open for commit, do you think the patch can land?
> > >
> > > Adding it to the commitfest patch tracker is a good way to ensure
> > > it's not forgotten about:
> > >
> > >   https://commitfest.postgresql.org/33/
> >
> > OK, I have been keeping it in my git tree since I wrote it and will
> > apply it in the next few days.
> > Thanks, Bruce.
> >
> > Hopefully you can get to this soon.
> >
> > Bruce:
> > Please see if the patch can be integrated now.
>
> I found a mistake in my most recent patch.  For example, in master we
> see this output:
>
> SELECT INTERVAL '1.8594 months';
>  interval
> --
>  1 mon 25 days 18:46:04.8
>
> Obviously this should return '1 mon 26 days', but with my most recent
> patch, it returned '1 mon 25 days'.  Turns out I had not properly used
> rint() in AdjustFractDays, and in fact the function is no longer needed
> because it is just a multiplication and an rint().
>
> Updated patch attached.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>

Hi,
Patch looks good.
Maybe add the statement above as a test case:

SELECT INTERVAL '1.8594 months'
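
For illustration, the rounding Bruce describes above boils down to one
multiplication plus rint(); a minimal sketch (not the committed code, and
assuming the usual 30-days-per-month interval convention) would be:

#include <math.h>

/* Sketch only: convert the fractional part of a month to whole days. */
static int
fractional_months_to_days(double frac_months)
{
    /* 0.8594 * 30 = 25.782, which must round to 26 rather than truncate to 25 */
    return (int) rint(frac_months * 30.0);
}

With truncation instead of rint(), '1.8594 months' would come out as
'1 mon 25 days', which is exactly the bug shown above.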

Cheers


Re: speed up verifying UTF-8

2021-07-20 Thread John Naylor
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid**
inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters
always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid
input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> > if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
then just accept it
> > return s + len;
> > }
> > // slow path: the string is not valid, perform a slower analysis
> > return s + ;
> > }
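
Restated as a compilable sketch (the helper names here are hypothetical
stand-ins, not functions from any posted patch), the idea is roughly:

static int
pg_utf8_verifystr_sketch(const unsigned char *s, int len)
{
    /* Fast path: if the whole buffer is valid, accept it in one pass. */
    if (pg_utf8_is_valid_fast(s, len))
        return len;

    /*
     * Slow path: the input contains an invalid sequence somewhere, so walk
     * it again with the existing per-character verifier to find the length
     * of the valid prefix.
     */
    return pg_utf8_verify_slow(s, len);
}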

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
2944 |  1523 |   871 |1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 888 |   607 |   179 | 777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 809 |   472 |   223 | 558 |805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 974 |   691 |   370 | 456 |526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 674 |   346 |78 | 309 |504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 379 |   181 |94 | 219 |376

Note that the branchy code's worst case (mixed8) is now the same speed as the pure
multibyte (chinese) case. With Vladimir's idea *, we call check_ascii only every 8 bytes
of input, not every time we verify one multibyte character. Also, we only
have to check the DFA state every time we loop over 8 bytes, not every time
we step through the DFA. That means we have to walk backwards at the end to
find the last leading byte, but the SSE code already knew how to do that,
so I used that logic here in the caller, which will allow some
simplification of how the SSE code returns.

The state check is likely why the ascii case is slightly slower than v14.
We could go back to checking ascii 16 bytes at a time, since there's little
penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy
code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next
month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid
byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed
how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com


v19-rewrite-pg_verify_str-for-speed.patch
Description: Binary data


Re: logical decoding and replication of sequences

2021-07-20 Thread Tomas Vondra



On 7/20/21 5:30 PM, Peter Eisentraut wrote:
> On 08.06.21 00:28, Tomas Vondra wrote:
>>
>> new "sequence" publication action
>> -
>>
>> The publications now have a new "sequence" publication action, which is
>> enabled by default. This determines whether the publication decodes
>> sequences or what.
>>
>>
>> FOR ALL SEQUENCES
>> -
>>
>> It should be possible to create FOR ALL SEQUENCES publications, just
>> like we have FOR ALL TABLES. But this produces shift/reduce conflicts
>> in the grammar, and I didn't bother dealing with that. So for now it's
>> required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...
> 
> I have been thinking about these DDL-level issues a bit.  The most
> common use case will be to have a bunch of tables with implicit
> sequences, and you just want to replicate them from here to there
> without too much effort.  So ideally an implicit sequence should be
> replicated by default if the table is part of a publication (unless
> sequences are turned off by the publication option).
> 

Agreed, that seems like a reasonable approach.

> We already have support for things like that in
> GetPublicationRelations(), where a partitioned table is expanded to
> include the actual partitions.  I think that logic could be reused.  So
> in general I would have GetPublicationRelations() include sequences and
> don't have GetPublicationSequenceRelations() at all.  Then sequences
> could also be sent by pg_publication_tables(), maybe add a relkind
> column.  And then you also don't need so much duplicate DDL code, if you
> just consider everything as a relation.  For example, there doesn't seem
> to be an actual need to have fetch_sequence_list() and subsequent
> processing on the subscriber side.  It does the same thing as
> fetch_table_list(), so it might as well just all be one thing.
> 

Not sure. I agree with replicating implicit sequences by default,
without having to add them to the publication. But I think we should
allow adding other sequences too, and I think some of this code and
differentiation from tables will be needed.

FWIW I'm not claiming there are no duplicate parts - I've mostly
copy-pasted the table-handling code for sequences, and I'll look into
reusing some of it.

> We do, however, probably need some checking that we don't replicate
> tables to sequences or vice versa.
> 

True. I haven't tried doing such silly things yet.

> We probably also don't need a separate FOR ALL SEQUENCES option.  What
> users really want is a "for everything" option.  We could think about
> renaming or alternative syntax, but in principle I think FOR ALL TABLES
> should include sequences by default.
> 
> Tests under src/test/subscription/ are needed.
> 

Yeah, true. At the moment there are just tests in test_decoding, mostly
because the previous patch versions did not add support for sequences to
built-in replication. Will fix.

> I'm not sure why test_decoding needs a skip-sequences option.  The
> source code says it's for backward compatibility, but I don't see why we
> need that.

Hmmm, I'm a bit baffled by skip-sequences too. I think Cary added it to
limit changes in the test_decoding tests, while the misleading comment about
backwards compatibility comes from me.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Rename of triggers for partitioned tables

2021-07-20 Thread Alvaro Herrera
On 2021-Jul-19, Alvaro Herrera wrote:

> Well, it does rename a trigger named 'name' on the table 'table_name',
> as well as all its descendant triggers.  I guess I am surprised that
> anybody would rename a descendant trigger in the first place.  I'm not
> wedded to the decision of removing the NOTICE, though  ... are there any
> other votes for that, anyone?

I put it back, mostly because I don't really care and it's easily
removed if people don't want it.  (I used a different wording though,
not necessarily final.)

I also realized that if you rename a trigger and the target name is
already occupied, we'd better throw a nicer error message than failure
by violation of a unique constraint; so I moved the check for the name
to within renametrig_internal().  Added a test for this.

Also added a test to ensure that nothing happens with statement
triggers.  This doesn't need any new code, because those triggers don't
have tgparentid set.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
>From 0a2a5e92437475162161837001c7aef16f57908e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 20 Jul 2021 16:00:31 -0400
Subject: [PATCH v9] Make ALTER TRIGGER recurse to partitions

---
 src/backend/commands/trigger.c | 197 +++--
 src/backend/parser/gram.y  |   2 +-
 src/test/regress/expected/triggers.out | 109 ++
 src/test/regress/sql/triggers.sql  |  59 
 4 files changed, 323 insertions(+), 44 deletions(-)

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6d4b7ee92a..0005cf657d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -71,6 +71,12 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 static int	MyTriggerDepth = 0;
 
 /* Local function prototypes */
+static void renametrig_internal(Relation tgrel, Relation targetrel,
+HeapTuple trigtup, const char *newname,
+const char *expected_name);
+static void renametrig_partition(Relation tgrel, Oid partitionId,
+ Oid parentTriggerOid, const char *newname,
+ const char *expected_name);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 			   EPQState *epqstate,
@@ -1442,38 +1448,17 @@ renametrig(RenameStmt *stmt)
 	targetrel = relation_open(relid, NoLock);
 
 	/*
-	 * Scan pg_trigger twice for existing triggers on relation.  We do this in
-	 * order to ensure a trigger does not exist with newname (The unique index
-	 * on tgrelid/tgname would complain anyway) and to ensure a trigger does
-	 * exist with oldname.
-	 *
-	 * NOTE that this is cool only because we have AccessExclusiveLock on the
-	 * relation, so the trigger set won't be changing underneath us.
+	 * On partitioned tables, this operation recurses to partitions, unless
+	 * caller requested not to.  Lock all tables upfront, if needed.
 	 */
+	if (stmt->relation->inh &&
+		targetrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		(void) find_all_inheritors(relid, AccessExclusiveLock, NULL);
+
 	tgrel = table_open(TriggerRelationId, RowExclusiveLock);
 
 	/*
-	 * First pass -- look for name conflict
-	 */
-	ScanKeyInit(&key[0],
-Anum_pg_trigger_tgrelid,
-BTEqualStrategyNumber, F_OIDEQ,
-ObjectIdGetDatum(relid));
-	ScanKeyInit(&key[1],
-Anum_pg_trigger_tgname,
-BTEqualStrategyNumber, F_NAMEEQ,
-PointerGetDatum(stmt->newname));
-	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
-NULL, 2, key);
-	if (HeapTupleIsValid(tuple = systable_getnext(tgscan)))
-		ereport(ERROR,
-(errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("trigger \"%s\" for relation \"%s\" already exists",
-		stmt->newname, RelationGetRelationName(targetrel;
-	systable_endscan(tgscan);
-
-	/*
-	 * Second pass -- look for trigger existing with oldname and update
+	 * Search for the trigger to modify.
 	 */
 	ScanKeyInit(&key[0],
 Anum_pg_trigger_tgrelid,
@@ -1489,27 +1474,27 @@ renametrig(RenameStmt *stmt)
 	{
 		Form_pg_trigger trigform;
 
-		/*
-		 * Update pg_trigger tuple with new tgname.
-		 */
-		tuple = heap_copytuple(tuple);	/* need a modifiable copy */
 		trigform = (Form_pg_trigger) GETSTRUCT(tuple);
 		tgoid = trigform->oid;
 
-		namestrcpy(&trigform->tgname,
-   stmt->newname);
+		/* Rename the trigger on this relation ... */
+		renametrig_internal(tgrel, targetrel, tuple, stmt->newname,
+			stmt->subname);
 
-		CatalogTupleUpdate(tgrel, &tuple->t_self, tuple);
+		/* ... and if it is partitioned, recurse to its partitions */
+		if (stmt->relation->inh &&
+			targetrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			PartitionDesc partdesc = RelationGetPartitionDesc(targetrel, true);
 
-		InvokeObjectPostAlterHook(TriggerRelationId,
-  tgoid, 0);
+			for (int i = 0; i < partdesc->nparts; i++)
+			{
+Oid			partitionId = partdesc->oids[i];
 
-		/*
-		 * Invalidate re

Re: refactoring basebackup.c

2021-07-20 Thread Mark Dilger



> On Jul 20, 2021, at 11:57 AM, Robert Haas  wrote:
> 
> I don't really understand what your problem is with how the patch set
> leaves pg_basebackup.

I don't have a problem with how the patch set leaves pg_basebackup.  

> On the server side, because I dropped the
> bbarchiver stuff, basebackup.c still ends up knowing a bunch of stuff
> about tar. pg_basebackup.c, however, really doesn't know anything much
> about tar any more. It knows that if it's getting a tar file and needs
> to parse a tar file then it had better call the tar parsing code, but
> that seems difficult to avoid.

I was only imagining having a callback for injecting manifests or recovery 
configurations.  It is not necessary that this be done in the current patch 
set, or perhaps ever.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Fix pkg-config file for static linking

2021-07-20 Thread Filip Gospodinov

On 06.07.21 15:13, Peter Eisentraut wrote:

This doesn't work.

This patch adds libpgcommon and libpgport to Requires.private.  But they 
are not pkg-config names but library names, so they should be added to 
Libs.private.


Then, I must admit that I have misunderstood this patch at first
https://github.com/postgres/postgres/commit/beff361bc1edc24ee5f8b2073a1e5e4c92ea66eb 
.


My impression was that PKG_CONFIG_REQUIRES_PRIVATE ends up in 
Libs.private because of this line
https://github.com/postgres/postgres/blob/d9809bf8694c17e05537c5dd96cde3e67c02d52a/src/Makefile.shlib#L403 
.


After taking a second look, I've noticed that 
PKG_CONFIG_REQUIRES_PRIVATE is *filtered out*. But unfortunately, this 
currently doesn't work as intended because PKG_CONFIG_REQUIRES_PRIVATE 
seems to be unset in Makefile.shlib which leaves Requires.private empty 
and doesn't filter out -lcrypto and -lssl for Libs.private.
That must also be the reason why I first believed that 
PKG_CONFIG_REQUIRES_PRIVATE is used to populate Libs.private.


Anyway, this issue is orthogonal to my original patch. I'm proposing to 
hardcode -lpgcommon and -lpgport in Libs.private instead. Patch is attached.
diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 29a7f6d38c..d643473807 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -400,7 +400,7 @@ endif # PORTNAME == cygwin || PORTNAME == win32
 # those that point inside the build or source tree.  Use sort to
 # remove duplicates.  Also record the -l flags necessary for static
 # linking, but not those already covered by Requires.private.
-	echo 'Libs.private: $(sort $(filter-out -L.% -L$(top_srcdir)/%,$(filter -L%,$(LDFLAGS) $(SHLIB_LINK $(filter-out $(PKG_CONFIG_REQUIRES_PRIVATE:lib%=-l%),$(filter -l%,$(SHLIB_LINK)))' >>$@
+	echo 'Libs.private: $(sort $(filter-out -L.% -L$(top_srcdir)/%,$(filter -L%,$(LDFLAGS) $(SHLIB_LINK -lpgcommon -lpgport $(filter-out $(PKG_CONFIG_REQUIRES_PRIVATE:lib%=-l%),$(filter -l%,$(SHLIB_LINK)))' >>$@
 
 
 ##


Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-20 Thread Robert Haas
On Tue, Jul 20, 2021 at 3:13 PM Justin Pryzby  wrote:
> On Tue, Jul 20, 2021 at 02:42:16PM -0400, Robert Haas wrote:
> > The bigger issue IMHO with on-the-fly
> > partition creation is avoiding deadlocks in the presence of current
> > inserters; I submit that without at least some kind of attempt to
> > avoid deadlocks and spurious errors there, it's not really a usable
> > scheme, and that seems hard.
>
> I was thinking that for dynamic creation, there would be a DDL command to
> create the necessary partitions:
>
> -- Creates 2021-01-02, unless the month already exists:
> ALTER TABLE bydate SET GRANULARITY='1day';
> ALTER TABLE bydate CREATE PARTITION FOR VALUE ('2021-01-02');

Well, that dodges the deadlock issue with doing it implicitly, but it
also doesn't seem to offer a lot of value over just creating the
partitions in a fully manual way. I mean you could just say:

CREATE TABLE bydate_2021_02_02 PARTITION OF bydate FOR VALUES FROM
('2021-01-02') TO ('2021-02-03');

It's longer, but it's not really that bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-20 Thread Justin Pryzby
On Tue, Jul 20, 2021 at 02:42:16PM -0400, Robert Haas wrote:
> The bigger issue IMHO with on-the-fly
> partition creation is avoiding deadlocks in the presence of current
> inserters; I submit that without at least some kind of attempt to
> avoid deadlocks and spurious errors there, it's not really a usable
> scheme, and that seems hard.

I was thinking that for dynamic creation, there would be a DDL command to
create the necessary partitions:

-- Creates 2021-01-02, unless the month already exists:
ALTER TABLE bydate SET GRANULARITY='1day';
ALTER TABLE bydate CREATE PARTITION FOR VALUE ('2021-01-02');

I'd want it to support changing the granularity of the range partitions:

-- Creates 2021-01 unless the month already exists.
-- Errors if a day partition already exists which would overlap?
ALTER TABLE bydate SET granularity='1month';
ALTER TABLE bydate CREATE PARTITION FOR VALUE ('2021-01-03');

It could support creating ranges, which might create multiple partitions,
depending on the granularity:

ALTER TABLE bydate CREATE PARTITION FOR VALUES ('2021-01-01') TO ('2021-02-01')

Or the catalog could include not only granularity, but also endpoints:

ALTER TABLE bydate SET ENDPOINTS ('2012-01-01') ('2022-01-01')
ALTER TABLE bydate CREATE PARTITIONS; --create anything needed to fill from a->b
ALTER TABLE bydate PRUNE PARTITIONS; --drop anything outside of [a,b]

I would use this to set "fine" granularity for large tables, and "coarse"
granularity for tables that were previously set to "fine" granularity but whose
partitions are no longer large enough to justify it.  This logic currently
exists in our application - we create partitions dynamically immediately before
inserting.  But it'd be nicer if they were created asynchronously.  It may create
tables which were never inserted into, which is fine - they'd be coarse
granularity tables (one per month).

I think this might elegantly allow both 1) subpartitioning; 2) repartitioning
to a different granularity (for which I currently have my own tool).

-- 
Justin




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-07-20 Thread Robert Haas
On Sun, Jul 4, 2021 at 1:44 AM Dilip Kumar  wrote:
> In general, for the non-partitioned table, where we don't have much
> overhead of checking the parallel safety and invalidation is also not
> a big problem so I am tempted to provide an automatic parallel safety
> check.  This would enable parallelism for more cases wherever it is
> suitable without user intervention.  OTOH, I understand that providing
> automatic checking might be very costly if the number of partitions is
> more.  Can't we provide some mid-way where the parallelism is enabled
> by default for the normal table but for the partitioned table it is
> disabled by default and the user has to set it safe for enabling
> parallelism?  I agree that such behavior might sound a bit hackish.

I think that's basically the proposal that Amit and I have been discussing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: refactoring basebackup.c

2021-07-20 Thread Robert Haas
On Mon, Jul 19, 2021 at 2:51 PM Mark Dilger
 wrote:
> The difficulty in v3-0007 with pg_basebackup only knowing how to parse tar 
> archives seems to be a natural consequence of not sufficiently abstracting 
> out the handling of the tar format.  If the bbsink and bbstreamer 
> abstractions fully encapsulated a set of parsing callbacks, then 
> pg_basebackup wouldn't contain things like:
>
> streamer = bbstreamer_tar_parser_new(streamer);
>
> but instead would use the parser callbacks without knowledge of whether they 
> were parsing tar vs. cpio vs. whatever.  It just seems really odd that 
> pg_basebackup is using the extensible abstraction layer and then defeating 
> the purpose by knowing too much about the format.  It might even be a useful 
> exercise to write cpio support into this patch set rather than waiting until 
> v16, just to make sure the abstraction layer doesn't have tar-specific 
> assumptions left over.

Well, I had a patch in an earlier patch set that tried to get
knowledge of tar out of basebackup.c, but it couldn't use the bbsink
abstraction; it needed a whole separate abstraction layer which I had
called bbarchiver with a different API. So I dropped it, for fear of
being told, not without some justification, that I was just changing
things for the sake of changing them, and also because having exactly
one implementation of some interface is really not great. I do
conceptually like the idea of making the whole thing flexible enough
to generate cpio or zip archives, because like you I think that having
tar-specific stuff all over the place is grotty, but I have a feeling
there's little market demand for having pg_basebackup produce cpio,
pax, zip, iso, etc. archives. On the other hand, server-side
compression and server-side backup seem like functionality with real
utility. Still, if you or others want to vote for resurrecting
bbarchiver on the grounds that general code cleanup is worthwhile for
its own sake, I'm OK with that, too.

I don't really understand what your problem is with how the patch set
leaves pg_basebackup. On the server side, because I dropped the
bbarchiver stuff, basebackup.c still ends up knowing a bunch of stuff
about tar. pg_basebackup.c, however, really doesn't know anything much
about tar any more. It knows that if it's getting a tar file and needs
to parse a tar file then it had better call the tar parsing code, but
that seems difficult to avoid. What we can avoid, and I think the
patch set does, is pg_basebackup.c having any real knowledge of what
the tar parser is doing under the hood.

Thanks also for the detailed comments. I'll try to get the right number of
verbs in each sentence in the next version of the patch. I will also
look into the issues mentioned by Dilip and Tushar.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: something is wonky with pgbench pipelining

2021-07-20 Thread Alvaro Herrera
On 2021-Jul-20, Andres Freund wrote:

> I think what's happening is that the first recvfrom() actually gets all 7
> connection results. The server doesn't have any queries to process at that
> point. But we ask the kernel whether there is new network input over and over
> again, despite having results to process!

Hmm, yeah, that seems a missed opportunity.

> with-isbusy:
> ...
> tps = 3990.424742 (without initial connection time)
> ...
>   1,013.71 msec task-clock#0.202 CPUs utilized
> 80,203  raw_syscalls:sys_enter#   79.119 K/sec
> 19,947  context-switches  #   19.677 K/sec
>  2,943,676,361  cycles:u  #2.904 GHz
>346,607,769  cycles:k  #0.342 GHz
>  8,464,188,379  instructions:u#2.88  insn per cycle
>226,665,530  instructions:k#0.65  insn per cycle

This is quite compelling.

If you don't mind, I can get this pushed in the next couple of days
-- or do you want to do it yourself?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La espina, desde que nace, ya pincha" (Proverbio africano)




Re: [PATCH] Automatic HASH and LIST partition creation

2021-07-20 Thread Robert Haas
On Wed, Jul 14, 2021 at 7:28 AM Pavel Borisov  wrote:
> What do you think will the described approach lead to a useful patch? Should 
> it be done as a whole or it's possible to commit it in smaller steps? (E.g. 
> first part without AUTOMATIC capability, then add AUTOMATIC capability. Or 
> with some other order of features implementation)

I would suggest that you consider on-the-fly partition creation to be
a completely separate feature from initial partition creation as part
of CREATE TABLE. I think you can have either without the other, and I
think the latter is a lot easier than the former. I doubt that
on-the-fly partition creation makes any sense at all for hash
partitions; there seems to be no reason not to pre-create all the
partitions. It's pretty straightforward to see how it should work for
LIST, but RANGE needs an interval or something to be stored in the
system catalogs so you can figure out where to put the boundaries, and
somehow you've got to identify a + operator for the relevant data
type. Tom Lane probably won't be thrilled if you suggest looking it up
based on the operator NAME. The bigger issue IMHO with on-the-fly
partition creation is avoiding deadlocks in the presence of current
inserters; I submit that without at least some kind of attempt to
avoid deadlocks and spurious errors there, it's not really a usable
scheme, and that seems hard.

On the other hand, modulo syntax details, creating partitions at
CREATE TABLE time seems relatively simple and, especially in the case
of hash partitioning, useful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: .ready and .done files considered harmful

2021-07-20 Thread Robert Haas
On Tue, Jul 6, 2021 at 9:34 AM Stephen Frost  wrote:
> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.

I'm a little concerned that this might turn out to be more complicated
than it's worth. It's not a case that should happen often, and if you
handle it then you have to be careful to handle cases like two
timeline switches in very rapid succession, which seems like it could
be tricky.

Maybe it's fine, though. I'm not really sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use extended statistics to estimate (Var op Var) clauses

2021-07-20 Thread Tomas Vondra
Hi,

On 7/5/21 2:46 PM, Dean Rasheed wrote:
> On Sun, 13 Jun 2021 at 21:28, Tomas Vondra
>  wrote:
>>
>> Here is a slightly updated version of the patch
>>
> 
> Hi,
> 
> I have looked at this in some more detail, and it all looks pretty
> good, other than some mostly cosmetic stuff.
> 

Thanks for the review!

> The new code in statext_is_compatible_clause_internal() is a little
> hard to follow because some of the comments aren't right (e.g. when
> checking clause_expr2, it isn't an (Expr op Const) or (Const op Expr)
> as the comment says). Rather than trying to comment on each
> conditional branch, it might be simpler to just have a single
> catch-all comment at the top, and also remove the "return true" in the
> middle, to make it something like:
> 
> /*
>  * Check Vars appearing on either side by recursing, and make a note of
>  * any expressions.
>  */
> if (IsA(clause_expr, Var))
> {
> if (!statext_is_compatible_clause_internal(...))
> return false;
> }
> else
> *exprs = lappend(*exprs, clause_expr);
> 
> if (clause_expr2)
> {
> if (IsA(clause_expr2, Var))
> {
> if (!statext_is_compatible_clause_internal(...))
> return false;
> }
> else
> *exprs = lappend(*exprs, clause_expr2);
> }
> 
> return true;
> 

I ended up doing something slightly different - examine_opclause_args
now "returns" a list of expressions, instead of explicitly setting two
parameters. That means we can do a simple foreach() here, which seems
cleaner. It means we have to extract the expressions from the list in a
couple places, but that seems acceptable. Do you agree?
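
In other words, the callers now end up with a shape roughly like this (a
sketch with made-up names using the usual pg_list.h machinery, not the exact
patch code):

static void
process_clause_exprs(List *clause_exprs)
{
    ListCell   *lc;

    /* walk the expressions collected for this clause */
    foreach(lc, clause_exprs)
    {
        Node       *expr = (Node *) lfirst(lc);

        handle_expression(expr);    /* hypothetical per-expression work */
    }
}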

I also went through the comments and updated those that seemed wrong.

> Is the FIXME comment in examine_opclause_args() necessary? The check
> for a single relation has already been done in
> clause[list]_selectivity_ext(), and I'm not sure what
> examine_opclause_args() would do differently.
> 

Yeah, I came to the same conclusion.

> In mcv_get_match_bitmap(), perhaps do the RESULT_IS_FINAL() checks
> first in each loop.
> 

This is how master already does it, and I wonder if it's done in
this order intentionally. It's not clear to me that doing it the other
way would be faster.

> Also in mcv_get_match_bitmap(), the 2 "First check whether the
> constant is below the lower boundary ..." comments don't make any
> sense to me. Were those perhaps copied and pasted from somewhere else?
> They should perhaps say "Otherwise, compare the MCVItem with the
> constant" and "Otherwise compare the values from the MCVItem using the
> clause operator", or something like that.
> 

Yeah, that's another bit that comes from current master - the patch just
makes a new copy of the comment. I agree it's bogus; it seems like a
leftover from the original code, which did various "smart" things we
removed over time. Will fix.

> But other than such cosmetic things, I think the patch is good, and
> gives some nice estimate improvements.
> 

Thanks, sounds good. I guess the last thing is maybe mentioning this in
the docs, adding an example etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 788a909e09b372797c4b7b443e0e89c5d5181ec0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 20 Jul 2021 20:15:13 +0200
Subject: [PATCH] Handling Expr op Expr clauses in extended stats

---
 src/backend/optimizer/path/clausesel.c|  37 +++-
 src/backend/statistics/extended_stats.c   |  83 ++---
 src/backend/statistics/mcv.c  | 172 +-
 .../statistics/extended_stats_internal.h  |   4 +-
 src/test/regress/expected/stats_ext.out   |  96 ++
 src/test/regress/sql/stats_ext.sql|  26 +++
 6 files changed, 341 insertions(+), 77 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..6a7e9ceea5 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -714,6 +714,7 @@ clause_selectivity_ext(PlannerInfo *root,
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
 	bool		cacheable = false;
+	Node	   *src = clause;
 
 	if (clause == NULL)			/* can this still happen? */
 		return s1;
@@ -871,11 +872,37 @@ clause_selectivity_ext(PlannerInfo *root,
 		}
 		else
 		{
-			/* Estimate selectivity for a restriction clause. */
-			s1 = restriction_selectivity(root, opno,
-		 opclause->args,
-		 opclause->inputcollid,
-		 varRelid);
+			/*
+			 * It might be a single (Expr op Expr) clause, which goes here due
+			 * to the optimization at the beginning of clauselist_selectivity.
+			 * So we try applying extended stats first, and then fall back to
+			 * restriction_selectivity.
+			 */
+			bool	estimated = false;
+
+			if (use_extended_stats)
+			{
+List	   *cl

something is wonky with pgbench pipelining

2021-07-20 Thread Andres Freund
Hi,

I think something is slightly off with pgbench (or libpq) pipelining. Consider
e.g. the following pgbench workload:

\startpipeline
SELECT 1;
SELECT 1;
SELECT 1;
SELECT 1;
SELECT 1;
SELECT 1;
SELECT 1;
\endpipeline

A pgbench run using that results in in endless repetitions of the below:
pgbench -Mprepared -c 1 -T1000 -f ~/tmp/select1_batch.sql

sendto(3, "B\0\0\0\22\0P0_1\0\0\0\0\0\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0"..., 257, 
MSG_NOSIGNAL, NULL, 0) = 257
recvfrom(3, 0x5614032370f0, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "2\0\0\0\4T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\27\0"..., 16384, 
0, NULL, NULL) = 461
recvfrom(3, 0x56140323727c, 15988, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x56140323723b, 16053, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032371fa, 16118, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032371b9, 16183, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x561403237178, 16248, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x561403237137, 16313, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032370f6, 16378, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
sendto(3, "B\0\0\0\22\0P0_1\0\0\0\0\0\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0"..., 257, 
MSG_NOSIGNAL, NULL, 0) = 257
recvfrom(3, 0x5614032370f0, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "2\0\0\0\4T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\27\0"..., 16384, 
0, NULL, NULL) = 461
recvfrom(3, 0x56140323727c, 15988, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x56140323723b, 16053, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032371fa, 16118, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032371b9, 16183, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x561403237178, 16248, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x561403237137, 16313, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
recvfrom(3, 0x5614032370f6, 16378, 0, NULL, NULL) = -1 EAGAIN (Resource 
temporarily unavailable)
sendto(3, "B\0\0\0\22\0P0_1\0\0\0\0\0\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0"..., 257, 
MSG_NOSIGNAL, NULL, 0) = 257

Note how recvfrom() returns EAGAIN 7 times in a row? There are also
7 SQL statements in the workload...

I think what's happening is that the first recvfrom() actually gets all 7
connection results. The server doesn't have any queries to process at that
point. But we ask the kernel whether there is new network input over and over
again, despite having results to process!

With a short pipeline this doesn't matter much. But if it's longer, adding a
syscall for each statement in the pipeline does increase pgbench overhead
measurably. An easy way to avoid that is to put a PQisBusy() && before the
PQconsumeInput().
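
Roughly like this (a sketch of the idea, not the actual pgbench diff):

#include <stdbool.h>
#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch only: skip the recvfrom() syscall when libpq can already hand
 * back a buffered result; the error handling is illustrative.
 */
static bool
consume_input_if_needed(PGconn *conn)
{
    if (PQisBusy(conn) && !PQconsumeInput(conn))
    {
        fprintf(stderr, "client aborted: %s", PQerrorMessage(conn));
        return false;
    }
    return true;                /* safe to check PQisBusy()/PQgetResult() now */
}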

Comparing pgbench of 100 pipelined SELECT 1;'s, under perf stat yields:

perf stat -e 
task-clock,raw_syscalls:sys_enter,context-switches,cycles:u,cycles:k,instructions:u,instructions:k
 \
  schedtool -a 38 -e \
  /home/andres/build/postgres/dev-optimize/vpath/src/bin/pgbench/pgbench -n 
-Mprepared -c 1 -j1 -T5 -f ~/tmp/select1_batch.sql

default:
...
tps = 3617.823383 (without initial connection time)
...
  1,339.25 msec task-clock#0.267 CPUs utilized
 1,880,855  raw_syscalls:sys_enter#1.404 M/sec
18,084  context-switches  #   13.503 K/sec
 3,128,615,558  cycles:u  #2.336 GHz
 1,211,509,367  cycles:k  #0.905 GHz
 8,000,238,738  instructions:u#2.56  insn per cycle
 1,720,276,642  instructions:k#1.42  insn per cycle

   5.007540307 seconds time elapsed

   1.004346000 seconds user
   0.376209000 seconds sys

with-isbusy:
...
tps = 3990.424742 (without initial connection time)
...
  1,013.71 msec task-clock#0.202 CPUs utilized
80,203  raw_syscalls:sys_enter#   79.119 K/sec
19,947  context-switches  #   19.677 K/sec
 2,943,676,361  cycles:u  #2.904 GHz
   346,607,769  cycles:k  #0.342 GHz
 8,464,188,379  instructions:u#2.88  insn per cycle
   226,665,530  instructions:k#0.65  insn per cycle

   5.007539846 seconds time elapsed

   0.90609 seconds user
   0.151015000 seconds sys


1.8 million fewer syscalls, reduced overall "on cpu" time, and particularly
0.

Re: Question about non-blocking mode in libpq

2021-07-20 Thread Alvaro Herrera
On 2021-Jul-19, Yugo NAGATA wrote:

> On Tue, 13 Jul 2021 11:59:49 +0900
> Yugo NAGATA  wrote:

> > However, looking into the code, PQsendQuery seems not to return an error
> > in non-bloking mode even if unable to send all data. In such cases,
> > pqSendSome will return 1 but it doesn't cause an error. Moreover,
> > we would not need to call PQsendQuery again. Indead, we need to call
> > PQflush until it returns 0, as documented with regard to PQflush.
> > 
> > Do we need to fix the description of PQsetnonblocking?

Yeah, I think you're right -- these functions don't error out, the
commands are just stored locally in the output buffer.

>  "In the nonblocking state, calls to PQsendQuery, PQputline, PQputnbytes,
>  PQputCopyData, and PQendcopy will not block" 
> 
> this seems to me that this is a list of functions that could block in blocking
> mode, but I wander PQflush also could block because it calls pqSendSome, 
> right?

I don't see that.  If pqSendSome can't write anything, it'll just return 1.

> Also, in the last paragraph of the section, I can find the following:
> 
>  "After sending any command or data on a nonblocking connection, call 
> PQflush. ..."
> 
> However, ISTM we don't need to call PQflush in non-bloking mode and we can
> call PQgetResult immediately because PQgetResult internally calls pqFlush
> until it returns 0 (or -1).

Well, maybe you don't *need* to PQflush(); but if you don't call it,
then the commands will sit in the output buffer indefinitely, which
means the server won't execute them.  So even if it works to just call
PQgetResult and have it block, surely you would like to only call
PQgetResult when the query has already been executed and the result
already been received and processed; that is, so that you can call
PQgetResult and obtain the result immediately, and avoid (say) blocking
a GUI interface while PQgetResult flushes the commands out, the server
executes the query and sends the results back.
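
A minimal sketch of that pattern (assuming a plain select() loop; error
handling omitted) could look like this:

#include <sys/select.h>
#include <libpq-fe.h>

/*
 * Sketch only: flush the outgoing command in nonblocking mode, then wait
 * for input so that the final PQgetResult() never blocks.
 */
static PGresult *
send_query_nonblocking(PGconn *conn, const char *query)
{
    int         sock = PQsocket(conn);
    fd_set      rfds;
    fd_set      wfds;

    PQsetnonblocking(conn, 1);
    PQsendQuery(conn, query);

    /* keep flushing until the output buffer is empty */
    while (PQflush(conn) == 1)
    {
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        FD_SET(sock, &rfds);
        FD_SET(sock, &wfds);
        select(sock + 1, &rfds, &wfds, NULL, NULL);
        if (FD_ISSET(sock, &rfds))
            PQconsumeInput(conn);
    }

    /* now wait until a complete result has been received */
    while (PQisBusy(conn))
    {
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        select(sock + 1, &rfds, NULL, NULL, NULL);
        PQconsumeInput(conn);
    }

    return PQgetResult(conn);   /* returns immediately */
}

By the time PQgetResult() is reached, the command has been flushed and the
whole result is already in libpq's buffer, so nothing blocks.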

> Therefore, I wonder if the last paragraph of this section is
> now unnecessary, right?

Doesn't seem so to me.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-20 Thread Andres Freund
Hi,

On 2021-07-20 19:37:46 +1200, David Rowley wrote:
> On Tue, 20 Jul 2021 at 19:04, Andres Freund  wrote:
> > > * AllocateSetAlloc.txt
> > > * palloc.txt
> > > * percent.txt
> >
> > Huh, that's interesting. You have some control flow enforcement stuff 
> > turned on (the endbr64). And it looks like it has a non zero cost (or maybe 
> > it's just skid). Did you enable that intentionally? If not, what 
> > compiler/version/distro is it? I think at least on GCC that's 
> > -fcf-protection=...
>
> It's ubuntu 21.04 with gcc 10.3 (specifically gcc version 10.3.0
> (Ubuntu 10.3.0-1ubuntu1)
>
> I've attached the same results from compiling with clang 12
> (12.0.0-3ubuntu1~21.04.1)

It looks like the ubuntu folks have changed the default for CET to on.


andres@ubuntu2020:~$ echo 'int foo(void) { return 17;}' > test.c && gcc -O2  -c 
-o test.o test.c && objdump -S test.o

test.o: file format elf64-x86-64


Disassembly of section .text:

 :
   0:   f3 0f 1e fa endbr64
   4:   b8 11 00 00 00  mov$0x11,%eax
   9:   c3  retq
andres@ubuntu2020:~$ echo 'int foo(void) { return 17;}' > test.c && gcc -O2 
-fcf-protection=none -c -o test.o test.c && objdump -S test.o

test.o: file format elf64-x86-64


Disassembly of section .text:

 :
   0:   b8 11 00 00 00  mov$0x11,%eax
   5:   c3  retq


Independent of this patch, it might be worth running a benchmark with
the default options, and one with -fcf-protection=none. None of my
machines support it...

$ cpuid -1|grep CET
  CET_SS: CET shadow stack = false
  CET_IBT: CET indirect branch tracking= false
 XCR0 supported: CET_U state  = false
 XCR0 supported: CET_S state  = false

Here it adds about 40kB of .text, but I can't measure the CET
overhead...

Greetings,

Andres Freund




Re: Early Sort/Group resjunk column elimination.

2021-07-20 Thread Ronan Dunklau
Le vendredi 16 juillet 2021, 17:37:15 CEST James Coleman a écrit :
> Thanks for hacking on this; as you're not surprised given I made the
> original suggestion, I'm particularly interested in this for
> incremental sort benefits, but I find the other examples you gave
> compelling also.
> 
> Of course I haven't seen code yet, but my first intuition is to try to
> avoid adding extra nodes and teach the (hopefully few) relevant nodes
> to remove the resjunk entries themselves. Presumably in this case that
> would mostly be the sort nodes (including gather merge).
> 
> One thing to pay attention to here is that we can't necessarily remove
> resjunk entries every time in a sort node since, for example, in
> parallel mode the gather merge node above it will need those entries
> to complete the sort.

Yes that is actually a concern, especially as the merge node is already 
handled specially when applying a projection. 

> 
> I'm interested to see what you're working on with a patch.

I am posting this proof-of-concept, for the record, but I don't think the 
numerous problems can be solved easily. I tried to teach Sort to use a limited 
sort of projection, but it brings its own slate of problems...

Quick list of problems with the current implementation, leaving aside the fact 
that it's quite hacky in a few places:

* result nodes are added for numerous types of non-projection-capable paths, 
since the above (final) target includes resjunk columns which should be 
eliminated. 
* handling of appendrel seems difficult, as both ordered and unordered appends 
are generated at the same time against the same target
* I'm having trouble understanding the usefulness of building physical
tlists for SubqueryScans

The second patch is a very hacky way to try to eliminate some generated result 
nodes. The idea is to bypass the whole interpreter when using a "simple" 
projection which is just a reduction of the number of columns, and teach Sort 
and Result to perform it. To do this, I added a parameter to 
is_projection_capable_path to make the test depend on the actual asked target: 
for a sort node, only a "simple" projection. 

The implementation uses a junkfilter which assumes nothing else than Const and 
outer var will be present. 

I don't feel like this is going anywhere, but at least it's here for 
discussion and posterity, if someone is interested.


-- 
Ronan Dunklaucommit 31193a9040801965409ff608e222fa5b899d9721
Author: Ronan Dunklau 
Date:   Tue Jul 20 11:03:27 2021 +0200

Tag and remove resjunk added for SortGroupClause.

We can provide better plans if we remove the resjunk columns from
the final target. For example, index scan may not need to fetch /
compute the key column.

To achieve that without intruding everywhere, we tag the resjunk
TargetEntries with why they were added, and keep them in the PlannerInfo
processed_tlist until it's time to compute the plan's finaltarget.

This allows us to properly differentiate sort_input_target and the
finaltarget, only adding the needed column for nodes which actually need
them to perform the sort (explicit sorts, aggregates, gathermerge...)

One drawback is that this forces us to add many projection plans on top
of the sort node to finally get the final target.

This can be alleviated in a later patch by teaching indivual
non-projection-capable (for example, sort) to perform a limited form of
projection using the JunkFilter infrastructure.

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e81b990092..e89e2ac120 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2323,8 +2323,7 @@ static void
 show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 {
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
-
-	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
+	show_sort_group_keys((PlanState *) outerPlanState(sortstate), "Sort Key",
 		 plan->numCols, 0, plan->sortColIdx,
 		 plan->sortOperators, plan->collations,
 		 plan->nullsFirst,
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 01c110cd2f..87274193be 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -238,7 +238,7 @@ TargetEntry *
 makeTargetEntry(Expr *expr,
 AttrNumber resno,
 char *resname,
-bool resjunk)
+uint16 resjunk)
 {
 	TargetEntry *tle = makeNode(TargetEntry);
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c13da7a879..2722f9e35b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2119,6 +2119,11 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
    best_path->path.parent->relids : NULL);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	/*
+	 * XXX: This should probably be passed as an arg to make_sort

Re: logical decoding and replication of sequences

2021-07-20 Thread Peter Eisentraut

On 08.06.21 00:28, Tomas Vondra wrote:


new "sequence" publication action
-

The publications now have a new "sequence" publication action, which is
enabled by default. This determines whether the publication decodes
sequences or what.


FOR ALL SEQUENCES
-

It should be possible to create FOR ALL SEQUENCES publications, just
like we have FOR ALL TABLES. But this produces shift/reduce conflicts
in the grammar, and I didn't bother dealing with that. So for now it's
required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...


I have been thinking about these DDL-level issues a bit.  The most 
common use case will be to have a bunch of tables with implicit 
sequences, and you just want to replicate them from here to there 
without too much effort.  So ideally an implicit sequence should be 
replicated by default if the table is part of a publication (unless 
sequences are turned off by the publication option).


We already have support for things like that in 
GetPublicationRelations(), where a partitioned table is expanded to 
include the actual partitions.  I think that logic could be reused.  So 
in general I would have GetPublicationRelations() include sequences and 
don't have GetPublicationSequenceRelations() at all.  Then sequences 
could also be sent by pg_publication_tables(), maybe add a relkind 
column.  And then you also don't need so much duplicate DDL code, if you 
just consider everything as a relation.  For example, there doesn't seem 
to be an actual need to have fetch_sequence_list() and subsequent 
processing on the subscriber side.  It does the same thing as 
fetch_table_list(), so it might as well just all be one thing.


We do, however, probably need some checking that we don't replicate 
tables to sequences or vice versa.


We probably also don't need a separate FOR ALL SEQUENCES option.  What 
users really want is a "for everything" option.  We could think about 
renaming or alternative syntax, but in principle I think FOR ALL TABLES 
should include sequences by default.


Tests under src/test/subscription/ are needed.

I'm not sure why test_decoding needs a skip-sequences option.  The 
source code says it's for backward compatibility, but I don't see why we 
need that.





Re: wrong relkind error messages

2021-07-20 Thread Peter Eisentraut


While reviewing the logical decoding of sequences patch, I found a few 
more places that could be updated in the new style introduced by this 
thread.  See attached patch.
From 92a06ebfa6f856246c2642a4e45c5b2af69a911d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Jul 2021 16:28:44 +0200
Subject: [PATCH] More improvements of error messages about mismatching relkind

Follow-up to 2ed532ee8c474e9767e76e1f3251cc3a0224358c, a few error
messages in the logical replication area currently only deal with
tables, but if we're anticipating more relkinds such as sequences
being handled, then these messages also fall into the category
affected by the previous patch, so adjust them too.
---
 src/backend/catalog/pg_publication.c  | 10 +-
 src/backend/executor/execReplication.c| 14 +-
 src/test/regress/expected/publication.out |  8 
 3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c 
b/src/backend/catalog/pg_publication.c
index 36bfff9706..2a2fe03c13 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -54,23 +54,23 @@ check_publication_add_relation(Relation targetrel)
RelationGetForm(targetrel)->relkind != 
RELKIND_PARTITIONED_TABLE)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("\"%s\" is not a table",
+errmsg("cannot add relation \"%s\" to 
publication",

RelationGetRelationName(targetrel)),
-errdetail("Only tables can be added to 
publications.")));
+
errdetail_relkind_not_supported(RelationGetForm(targetrel)->relkind)));
 
/* Can't be system table */
if (IsCatalogRelation(targetrel))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("\"%s\" is a system table",
+errmsg("cannot add relation \"%s\" to 
publication",

RelationGetRelationName(targetrel)),
-errdetail("System tables cannot be added to 
publications.")));
+errdetail("This operation is not supported for 
system tables.")));
 
/* UNLOGGED and TEMP relations cannot be part of publication. */
if (!RelationIsPermanent(targetrel))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("table \"%s\" cannot be replicated",
+errmsg("cannot add relation \"%s\" to 
publication",

RelationGetRelationName(targetrel)),
 errdetail("Temporary and unlogged relations 
cannot be replicated.")));
 }
diff --git a/src/backend/executor/execReplication.c 
b/src/backend/executor/execReplication.c
index 1e285e0349..574d7d27fd 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -608,22 +608,10 @@ void
 CheckSubscriptionRelkind(char relkind, const char *nspname,
 const char *relname)
 {
-   /*
-* Give a more specific error for foreign tables.
-*/
-   if (relkind == RELKIND_FOREIGN_TABLE)
-   ereport(ERROR,
-   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
-errmsg("cannot use relation \"%s.%s\" as 
logical replication target",
-   nspname, relname),
-errdetail("\"%s.%s\" is a foreign table.",
-  nspname, relname)));
-
if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 errmsg("cannot use relation \"%s.%s\" as 
logical replication target",
nspname, relname),
-errdetail("\"%s.%s\" is not a table.",
-  nspname, relname)));
+errdetail_relkind_not_supported(relkind)));
 }
diff --git a/src/test/regress/expected/publication.out 
b/src/test/regress/expected/publication.out
index b5b065a1b6..4a5ef0bc24 100644
--- a/src/test/regress/expected/publication.out
+++ b/src/test/regress/expected/publication.out
@@ -160,8 +160,8 @@ DROP TABLE testpub_parted1;
 DROP PUBLICATION testpub_forparted, testpub_forparted1;
 -- fail - view
 CREATE PUBLICATION testpub_fortbl FOR TABLE testpub_view;
-ERROR:  "testpub_view" is not 

Re: Rename of triggers for partitioned tables

2021-07-20 Thread Alvaro Herrera
On 2021-Jul-20, Arne Roland wrote:

> Thank you! I get a compile time error

> trigger.c:1588:17: note: in expansion of macro ‘PG_USED_FOR_ASSERTS_ONLY’
>   int  found = 0 PG_USED_FOR_ASSERTS_ONLY;
>  ^~~~

Oh, I misplaced the annotation of course.  It has to be

   int  found PG_USED_FOR_ASSERTS_ONLY = 0;

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: slab allocator performance issues

2021-07-20 Thread Ranier Vilela
Em ter., 20 de jul. de 2021 às 11:15, David Rowley 
escreveu:

> On Tue, 20 Jul 2021 at 10:04, Ranier Vilela  wrote:
> > Perhaps you would agree with me that in the vast majority of cases,
> > malloc will not fail.
> > So it makes more sense to test:
> > if (ret != NULL)
> > than
> > if (ret == NULL)
>
> I think it'd be better to use unlikely() for that.
>
Sure, it can be, but in this case, there is no way to reduce the scope.

regards,
Ranier Vilela


Re: slab allocator performance issues

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 10:04, Ranier Vilela  wrote:
> Perhaps you would agree with me that in the vast majority of cases, malloc
> will not fail.
> So it makes more sense to test:
> if (ret != NULL)
> than
> if (ret == NULL)

I think it'd be better to use unlikely() for that.
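
Something like this (just a sketch; unlikely() is the branch-prediction hint
macro from c.h, and the error handling is only for illustration):

#include <stdio.h>
#include <stdlib.h>

static void *
alloc_or_die(size_t size)
{
    void       *ret = malloc(size);

    /* keep the natural "test for failure" shape, but mark the branch cold */
    if (unlikely(ret == NULL))
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return ret;
}

That way the source still reads as a failure check while the compiler lays
out the common path without a taken branch.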

David




Re: POC: GROUP BY optimization

2021-07-20 Thread Tomas Vondra
Hi,

here is an updated version of this patch, with some significant changes.

The main change is that instead of calling get_cheapest_group_keys_order
directly, the planner now calls get_useful_group_keys_orderings and gets
multiple "interesting" pathkey orderings instead of just a single one.
The callers then loop over these orderings and construct paths for all
of them. This idea is similar to get_useful_pathkeys_for_relation()
added by incremental sort.

FWIW this addresses point (9) from my last review - I started with it,
because it was the main thing affecting the overall architecture. The
remaining bits are more "local".

I haven't investigated how expensive those changes are (in terms of
planning overhead), but the number of extra orderings is fairly low, and
I'd expect most of the paths to be eliminated fairly quickly.

I've also added / improved a number of comments etc. but I'm sure more
cleanup is needed.


The other comments from the review still apply - I'm particularly
concerned about the (1) point, i.e. plan changes in postgres_fdw. Those
seem to be rather strange (LIMIT not being pushed down in queries
without any grouping). I'd bet this is due to changes in sort costing
and does not seem very desirable.



regards


[1]
https://www.postgresql.org/message-id/22c44f98-bfa8-8630-62b5-5155e11eb284%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index ed25e7a743..6145b99f38 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1077,13 +1077,15 @@ ANALYZE ft5;
 -- join two tables
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
-   QUERY PLAN   
-
- Foreign Scan
+QUERY PLAN
+--
+ Limit
Output: t1.c1, t2.c1, t1.c3
-   Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
-   Remote SQL: SELECT r1."C 1", r2."C 1", r1.c3 FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST LIMIT 10::bigint OFFSET 100::bigint
-(4 rows)
+   ->  Foreign Scan
+ Output: t1.c1, t2.c1, t1.c3
+ Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
+ Remote SQL: SELECT r1."C 1", r2."C 1", r1.c3 FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST
+(6 rows)
 
 SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
  c1  | c1  
@@ -1103,13 +1105,15 @@ SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
 -- join three tables
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) JOIN ft4 t3 ON (t3.c1 = t1.c1) ORDER BY t1.c3, t1.c1 OFFSET 10 LIMIT 10;
-   QUERY PLAN
--
- Foreign Scan
+ QUERY PLAN 
+
+ Limit
Output: t1.c1, t2.c2, t3.c3, t1.c3
-   Relations: ((public.ft1 t1) INNER JOIN (public.ft2 t2)) INNER JOIN (public.ft4 t3)
-   Remote SQL: SELECT r

Re: Next Steps with Hash Indexes

2021-07-20 Thread Simon Riggs
On Tue, Jul 20, 2021 at 1:26 PM Amit Kapila  wrote:

> One more thing we need to think about here is when to find the right
> bucket page in the chain where we can insert the new tuple. Do we
> first try to complete the uniqueness check (which needs to scan
> through the entire bucket chain) and then again scan the bucket with
> space to insert or do we want to do it along with uniqueness check
> scan and remember it?

The latter approach, but that is just a performance tweak for later.

On a unique hash index, regular splitting means there are almost no
bucket chains more than 2 long (bucket plus overflow), so it seems
like mostly wasted effort at this point.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: Next Steps with Hash Indexes

2021-07-20 Thread Simon Riggs
On Tue, Jul 20, 2021 at 1:00 PM Amit Kapila  wrote:
>
> On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
>  wrote:
> >
> > 2. Unique Hash Indexes have been summarized here:
> > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> > which also seems to have two parts to it.
> >
> > 2.1 Uniqueness Check
> > Amit: "to ensure that there is no duplicate entry we need to traverse
> > the whole bucket chain"
> > Agreed. That seems straightforward and can also be improved later.
> >
> > 2.2 Locking
> > Amit's idea of holding ExclusiveLock on the bucket page works for me,
> > but there was some doubt about splitting.
> >
>
> I think the main thing to think about for uniqueness check during
> split (where we scan both the old and new buckets) was whether we need
> to lock both the old (bucket_being_split) and new
> (bucket_being_populated) buckets or just holding locks on one of them
> (the current bucket in which we are inserting) is sufficient? During a
> scan of the new bucket, we just retain pins on both the buckets (see
> comments in _hash_first()) but if we need to retain locks on both
> buckets then we need to do something different then we do it for
> scans. But, I think it is sufficient to just hold an exclusive lock on
> the primary bucket page in the bucket we are trying to insert and pin
> on the other bucket (old bucket as we do for scans). Because no
> concurrent inserter should try to insert into the old bucket and new
> bucket the same tuple as before starting the split we always update
> the metapage for hashm_lowmask and hashm_highmask which decides the
> routing of the tuples.

During an incomplete split, we need to scan both the old and the new
bucket. So during insert, we need to scan both old and new, while
holding exclusive locks on both bucket pages. I've spent a few days
looking at the split behavior and this seems like a complete approach.
I'm working on a patch now, still at the hacking stage.

(BTW, my opinion of the split mechanism has now changed from bad to
very good. It works really well for unique data, but can be completely
ineffective for badly skewed data).

> Now, I think here the other problem we need to think about is that for
> the hash index after finding the tuple in the index, we need to always
> recheck in the heap as we don't store the actual value in the hash
> index. For that in the scan, we get the tuple(s) from the index
> (release locks) and then match qual after fetching tuple from the
> heap. But we can't do that for uniqueness check because if we release
> the locks on the index bucket page then another inserter could come
> before we match it in heap. I think we need some mechanism that after
> fetching TID from the index, we recheck the actual value in heap
> before releasing the lock on the index bucket page.

I don't think btree does that, so I'm not sure we do need that for
hash. Yes, there is a race condition, but someone will win. Do we care
who? Do we care enough to take the concurrency hit? Duplicate inserts
would be very rare in a declared unique index, so it would be a poor
trade-off.

> The other thing could be that if we have unique support for hash index
> then probably we can allow Insert ... ON Conflict if the user
> specifies unique index column as conflict_target.

Yes, that looks doable.

> I am not sure if multicol index support is mandatory to allow
> uniqueness for hash indexes, sure it would be good but I feel that can
> be done as a separate patch as well.

I have a patch for multicol support, attached.
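
To illustrate what the attached patch is aiming at (hash indexes are
single-column only today, so treat this as the target behaviour rather
than something that works right now):

CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_hash ON t USING hash (a, b);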

-- 
Simon Riggs    http://www.EnterpriseDB.com/


hash_multicol.v2.patch
Description: Binary data


Re: Next Steps with Hash Indexes

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 5:30 PM Amit Kapila  wrote:
>
> On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
>  wrote:
> >
> > 2. Unique Hash Indexes have been summarized here:
> > https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> > which also seems to have two parts to it.
> >
> > 2.1 Uniqueness Check
> > Amit: "to ensure that there is no duplicate entry we need to traverse
> > the whole bucket chain"
> > Agreed. That seems straightforward and can also be improved later.
> >
> > 2.2 Locking
> > Amit's idea of holding ExclusiveLock on the bucket page works for me,
> > but there was some doubt about splitting.
> >
>
> I think the main thing to think about for uniqueness check during
> split (where we scan both the old and new buckets) was whether we need
> to lock both the old (bucket_being_split) and new
> (bucket_being_populated) buckets or just holding locks on one of them
> (the current bucket in which we are inserting) is sufficient? During a
> scan of the new bucket, we just retain pins on both the buckets (see
> comments in _hash_first()) but if we need to retain locks on both
> buckets then we need to do something different then we do it for
> scans. But, I think it is sufficient to just hold an exclusive lock on
> the primary bucket page in the bucket we are trying to insert and pin
> on the other bucket (old bucket as we do for scans). Because no
> concurrent inserter should try to insert into the old bucket and new
> bucket the same tuple as before starting the split we always update
> the metapage for hashm_lowmask and hashm_highmask which decides the
> routing of the tuples.
>
> Now, I think here the other problem we need to think about is that for
> the hash index after finding the tuple in the index, we need to always
> recheck in the heap as we don't store the actual value in the hash
> index. For that in the scan, we get the tuple(s) from the index
> (release locks) and then match qual after fetching tuple from the
> heap. But we can't do that for uniqueness check because if we release
> the locks on the index bucket page then another inserter could come
> before we match it in heap. I think we need some mechanism that after
> fetching TID from the index, we recheck the actual value in heap
> before releasing the lock on the index bucket page.
>

One more thing we need to think about here is when to find the right
bucket page in the chain where we can insert the new tuple. Do we
first try to complete the uniqueness check (which needs to scan
through the entire bucket chain) and then again scan the bucket with
space to insert or do we want to do it along with uniqueness check
scan and remember it?

-- 
With Regards,
Amit Kapila.




Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 5:13 PM Greg Nancarrow  wrote:
>
> On Tue, Jul 20, 2021 at 6:29 PM Amit Kapila  wrote:
> >
> > I think in terms of referring to old and new rows, we already have
> > terminology which we used at various other similar places. See Create
> > Rule docs [1]. For where clause, it says "Within condition and
> > command, the special table names NEW and OLD can be used to refer to
> > values in the referenced table. NEW is valid in ON INSERT and ON
> > UPDATE rules to refer to the new row being inserted or updated. OLD is
> > valid in ON UPDATE and ON DELETE rules to refer to the existing row
> > being updated or deleted.". We need similar things for the WHERE
> > clause in publication if we want special syntax to refer to old and
> > new rows.
> >
>
> I have no doubt we COULD allow references to OLD and NEW in the WHERE
> clause, but do we actually want to?
> This is what I thought could cause confusion, when mixed with the
> model that I previously described.
> It's not entirely clear to me exactly how it works, when the WHERE
> clause is applied to the OLD and NEW rows, when the WHERE condition
> itself can refer to OLD and/or NEW (coupled with the fact that NEW
> doesn't make sense for DELETE and OLD doesn't make sense for INSERT).
>

It is not new, the same is true when they are used in RULES and
probably in other places where we use them.

-- 
With Regards,
Amit Kapila.




Re: Next Steps with Hash Indexes

2021-07-20 Thread Amit Kapila
On Thu, Jul 15, 2021 at 10:11 PM Simon Riggs
 wrote:
>
> 2. Unique Hash Indexes have been summarized here:
> https://www.postgresql.org/message-id/CAA4eK1KATC1TA5bR5eobYQVO3RWsnH6djNpk3P376em4V8MuUA%40mail.gmail.com
> which also seems to have two parts to it.
>
> 2.1 Uniqueness Check
> Amit: "to ensure that there is no duplicate entry we need to traverse
> the whole bucket chain"
> Agreed. That seems straightforward and can also be improved later.
>
> 2.2 Locking
> Amit's idea of holding ExclusiveLock on the bucket page works for me,
> but there was some doubt about splitting.
>

I think the main thing to think about for the uniqueness check during
a split (where we scan both the old and new buckets) was whether we need
to lock both the old (bucket_being_split) and new
(bucket_being_populated) buckets, or whether holding a lock on just one
of them (the current bucket in which we are inserting) is sufficient.
During a scan of the new bucket, we just retain pins on both the buckets
(see comments in _hash_first()), but if we need to retain locks on both
buckets then we need to do something different than we do for scans.
However, I think it is sufficient to hold an exclusive lock on the
primary bucket page of the bucket we are inserting into, plus a pin on
the other (old) bucket, as we do for scans. No concurrent inserter
should try to insert the same tuple into both the old and the new
bucket, because before starting the split we always update the metapage
fields hashm_lowmask and hashm_highmask, which decide the routing of the
tuples.

Now, I think here the other problem we need to think about is that for
the hash index after finding the tuple in the index, we need to always
recheck in the heap as we don't store the actual value in the hash
index. For that in the scan, we get the tuple(s) from the index
(release locks) and then match qual after fetching tuple from the
heap. But we can't do that for uniqueness check because if we release
the locks on the index bucket page then another inserter could come
before we match it in heap. I think we need some mechanism that after
fetching TID from the index, we recheck the actual value in heap
before releasing the lock on the index bucket page.

The other thing could be that if we have unique support for hash index
then probably we can allow Insert ... ON Conflict if the user
specifies unique index column as conflict_target.
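
For illustration, the user-facing side might look roughly like this once
unique hash indexes exist (hypothetical today, since hash does not yet
support uniqueness; the table is made up):

-- a unique hash index, which hash does not currently accept
CREATE TABLE sessions (token text, last_seen timestamptz);
CREATE UNIQUE INDEX sessions_token_hash ON sessions USING hash (token);

-- with that in place, the hash-indexed column could serve as the
-- conflict_target
INSERT INTO sessions VALUES ('abc', now())
  ON CONFLICT (token) DO UPDATE SET last_seen = EXCLUDED.last_seen;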

I am not sure if multicol index support is mandatory to allow
uniqueness for hash indexes, sure it would be good but I feel that can
be done as a separate patch as well.

-- 
With Regards,
Amit Kapila.




Re: improvements in Unicode tables generation code

2021-07-20 Thread Peter Eisentraut

On 23.06.21 10:55, Peter Eisentraut wrote:

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make.  The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.


This could use a comment. At a quick glance, I don't understand what 
all the $(wordlist ...) magic does.


Perhaps we should change the script or Makefile so that it doesn't 
create all the maps in one go?


I agree, either comment it better or just write one file at a time. I'll 
take another look at that.


Here is a patch that does it one file (pair) at a time.  The other rules 
besides UCS_to_most.pl actually had the same problem, since they produce 
two output files, so running in parallel called each script twice.  In 
this patch, all of that is heavily refactored and works correctly now. 
Note that UCS_to_most.pl already accepted a command-line argument to 
specify which encoding to work with.
From 974720b0b6c92f42506ae37d8e88368ba279b973 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Jul 2021 13:47:13 +0200
Subject: [PATCH] Make Unicode makefile parallel-safe

Fix the rules so that each rule is parallel safe, using the same
trickery that we use elsewhere in the tree for rules that produce more
than one output file.  Refactor the whole makefile so that there is
less repetition.
---
 src/backend/utils/mb/Unicode/Makefile | 134 +-
 1 file changed, 45 insertions(+), 89 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/Makefile 
b/src/backend/utils/mb/Unicode/Makefile
index ed6fc07e08..cdcde4fbbd 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -12,101 +12,57 @@ subdir = src/backend/utils/mb/Unicode
 top_builddir = ../../../../..
 include $(top_builddir)/src/Makefile.global
 
-ISO8859MAPS = iso8859_2_to_utf8.map utf8_to_iso8859_2.map \
-   iso8859_3_to_utf8.map utf8_to_iso8859_3.map \
-   iso8859_4_to_utf8.map utf8_to_iso8859_4.map \
-   iso8859_5_to_utf8.map utf8_to_iso8859_5.map \
-   iso8859_6_to_utf8.map utf8_to_iso8859_6.map \
-   iso8859_7_to_utf8.map utf8_to_iso8859_7.map \
-   iso8859_8_to_utf8.map utf8_to_iso8859_8.map \
-   iso8859_9_to_utf8.map utf8_to_iso8859_9.map \
-   iso8859_10_to_utf8.map utf8_to_iso8859_10.map \
-   iso8859_13_to_utf8.map utf8_to_iso8859_13.map \
-   iso8859_14_to_utf8.map utf8_to_iso8859_14.map \
-   iso8859_15_to_utf8.map utf8_to_iso8859_15.map \
-   iso8859_16_to_utf8.map utf8_to_iso8859_16.map
-
-WINMAPS = win866_to_utf8.map utf8_to_win866.map \
-   win874_to_utf8.map utf8_to_win874.map \
-   win1250_to_utf8.map utf8_to_win1250.map \
-   win1251_to_utf8.map utf8_to_win1251.map \
-   win1252_to_utf8.map utf8_to_win1252.map \
-   win1253_to_utf8.map utf8_to_win1253.map \
-   win1254_to_utf8.map utf8_to_win1254.map \
-   win1255_to_utf8.map utf8_to_win1255.map \
-   win1256_to_utf8.map utf8_to_win1256.map \
-   win1257_to_utf8.map utf8_to_win1257.map \
-   win1258_to_utf8.map utf8_to_win1258.map
-
-GENERICMAPS = $(ISO8859MAPS) $(WINMAPS) \
-   gbk_to_utf8.map utf8_to_gbk.map \
-   koi8r_to_utf8.map utf8_to_koi8r.map \
-   koi8u_to_utf8.map utf8_to_koi8u.map
-
-SPECIALMAPS = euc_cn_to_utf8.map utf8_to_euc_cn.map \
-   euc_jp_to_utf8.map utf8_to_euc_jp.map \
-   euc_kr_to_utf8.map utf8_to_euc_kr.map \
-   euc_tw_to_utf8.map utf8_to_euc_tw.map \
-   sjis_to_utf8.map utf8_to_sjis.map \
-   gb18030_to_utf8.map utf8_to_gb18030.map \
-   big5_to_utf8.map utf8_to_big5.map \
-   johab_to_utf8.map utf8_to_johab.map \
-   uhc_to_utf8.map utf8_to_uhc.map \
-   euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map \
-   shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map
-
-MAPS = $(GENERICMAPS) $(SPECIALMAPS)
-
-ISO8859TEXTS = 8859-2.TXT 8859-3.TXT 8859-4.TXT 8859-5.TXT \
-   8859-6.TXT 8859-7.TXT 8859-8.TXT 8859-9.TXT \
-   8859-10.TXT 8859-13.TXT 8859-14.TXT 8859-15.TXT \
-   8859-16.TXT
-
-WINTEXTS = CP866.TXT CP874.TXT CP936.TXT \
-   CP1250.TXT CP1251.TXT \
-   CP1252.TXT CP1253.TXT CP1254.TXT CP1255.TXT \
-   CP1256.TXT CP1257.TXT CP1258.TXT
-
-GENERICTEXTS = $(ISO8859TEXTS) $(WINTEXTS) \
-   KOI8-R.TXT KOI8-U.TXT
 
-all: $(MAPS)
-
-$(GENERICMAPS): UCS_to_most.pl $(GENERICTEXTS)
-   $(PERL) -I $(srcdir) $<
-
-johab_to_utf8.map utf8_to_johab.map: UCS_to_JOHAB.pl JOHAB.TXT
-   $(PERL) -I $(srcdir) $<
-
-uhc_to_utf8.map utf8_to_uhc.map: UCS_to_UHC.pl windows-949-2000.xml
-   $(PERL) -I $(srcdir) $<
-
-euc_jp_to_utf8.map utf8_to_euc_jp.map: UCS_to_EUC_JP.pl CP932.TXT JIS0212.TXT
-   $(PERL) -I $(srcdir) $<
+# Define a rule to create to map files from downloaded text input
+# files using a script.  Arguments:
+#
+# 1: encoding name used in output files (lower case)
+# 2: script name
+# 3: 

Re: row filtering for logical replication

2021-07-20 Thread Greg Nancarrow
On Tue, Jul 20, 2021 at 6:29 PM Amit Kapila  wrote:
>
> I think in terms of referring to old and new rows, we already have
> terminology which we used at various other similar places. See Create
> Rule docs [1]. For where clause, it says "Within condition and
> command, the special table names NEW and OLD can be used to refer to
> values in the referenced table. NEW is valid in ON INSERT and ON
> UPDATE rules to refer to the new row being inserted or updated. OLD is
> valid in ON UPDATE and ON DELETE rules to refer to the existing row
> being updated or deleted.". We need similar things for the WHERE
> clause in publication if we want special syntax to refer to old and
> new rows.
>

I have no doubt we COULD allow references to OLD and NEW in the WHERE
clause, but do we actually want to?
This is what I thought could cause confusion, when mixed with the
model that I previously described.
It's not entirely clear to me exactly how it works, when the WHERE
clause is applied to the OLD and NEW rows, when the WHERE condition
itself can refer to OLD and/or NEW (coupled with the fact that NEW
doesn't make sense for DELETE and OLD doesn't make sense for INSERT).
Combine that with the fact that a publication can have multiple tables
each with their own WHERE clause, and tables can be dropped/(re)added
to the publication with a different WHERE clause, and it starts to get
a little complicated working out exactly what the result should be.
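
To make that concrete, consider a hypothetical filter referencing both
rows (this syntax is purely illustrative, the patch does not accept it):

CREATE PUBLICATION pub FOR TABLE t
    WHERE (OLD.active OR NEW.active);

For an INSERT there is no OLD row and for a DELETE there is no NEW row,
so it is unclear how such a condition should be evaluated for those
operations, which is the kind of ambiguity I mean.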

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread Ranier Vilela
On Tue, Jul 20, 2021 at 05:35, David Rowley 
wrote:

> On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> > To be clear up front: I'm in favor of the patch, and I don't want to
> > put unnecessary stumbling blocks up for it getting committed. So if we
> > decide to proceed as is, that's fine with me.
>
> Thanks for making that clear.
>
> > But I'm not sure that the "cost model, unfortunately, does not account
> > for that" is entirely accurate. The end of cost_tuplesort contains a
> > cost "per extracted tuple". It does, however, note that it doesn't
> > charge cpu_tuple_cost, which maybe is what you'd want to fully
> > incorporate this into the model. But given this run_cost isn't about
> > accounting for comparison cost (that's been done earlier) which is the
> > part that'd be the same between tuple and datum sort, it seems to me
> > that we could lower the cpu_operator_cost here by something like 10%
> > if it's byref and 30% if it's byval?
>
> I failed to notice that last part that adds the additional
> cpu_operator_cost.
>
> The default cpu_operator_cost is 0.0025, so with the 10k tuple
> benchmark, that adds an additional charge of 25 to the total cost.
>
> If we take test 1 from my results on v5 as an example:
>
> > Test1 446.1 657.3 147.32%
>
> Looking at explain for that query:
>
> regression=# explain select two from tenk1 order by two offset 100;
>   QUERY PLAN
> --
>  Limit  (cost=1133.95..1133.95 rows=1 width=4)
>->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
>  Sort Key: two
>  ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
> (4 rows)
>
> If we want the costs to reflect reality again here then we'd have
> reduce 1133.95 by something like 147.32% (the performance difference).
> That would bring the cost down to 769.72, which is way more than we
> have to play with than the 25 that the cpu_operator_cost * tuples
> gives us.
>
> If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
> cost would become 1126.45.  That's not great considering the actual
> performance indicates that 769.72 would be a better number.
>
> If we look at John's result for test 1: He saw 588 tps on master and
> 998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
> to get close to reality on that machine.
>
> My thoughts are that the small surcharge added at the end of
> cost_tuplesort() is just not enough for us to play with.  I think to
> get us closer to fixing this correctly would require a redesign of the
> tuplesort costing entirely. I think that would be about an order of
> magnitude more effort than this patch was, so I really feel like I
> don't want to do this.
>
I understand that redesign would require a lot of work,
but why not do it step by step?


> I kinda feel that since the comparison_cost is always just 2.0 *
> cpu_operator_cost regardless of the number of columns in the sort,
> then if we add too many new smarts to try and properly adjust for this
> new optimization, unless we do a completly new cost modal for this,
> then we might as well be putting lipstick on a pig.
>
I think one first step would be to give this 2.0 a name.
Doesn't such a magic number deserve a proper name?


>
> It sounds like James mostly just mentioned the sorting just to ensure
> it was properly considered and does not really feel strongly that it
> needs to be adjusted.  Does anyone else feel that we should be
> adjusting it?
>
I took a look at cost_tuplesort and I think some small adjustments
could be made as part of the improvement process.
A patch is attached.
1. long is a very problematic type; would int64 be better?
2. 1024 can be an int, not a long?
3. change all bare 2 literals to 2.0 (double)?
4. If a disk-based sort is not needed, could we avoid calling relation_byte_size?

Finally, it would be nice to at least document (add comments about)
those conclusions, wouldn't it?

regards,
Ranier Vilela


costsize.patch
Description: Binary data


Re: [PATCH] Allow multiple recursive self-references

2021-07-20 Thread Denis Hirn
> In the next version of the patch, would you be so kind as to include
> updated documentation of the feature and at least one regression test
> of same?

As requested, this new version of the patch contains regression tests and 
documentation.

Best wishes,
  -- Denis



0004-Allow-multiple-recursive-self-references.patch
Description: Binary data


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-20 Thread David Rowley
On Mon, 19 Jul 2021 at 18:32, Ronan Dunklau  wrote:
> It means the logic for appending the order by pathkeys to the existing group
> by pathkeys would ideally need to remove the redundant group by keys from the
> order by keys, considering this example:
>
> regression=# explain select sum(unique1 order by ten, two), sum(unique1 order
> by four), sum(unique1 order by two, four) from tenk2 group by ten;
>QUERY PLAN
> 
>  GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
>Group Key: ten
>->  Sort  (cost=1109.39..1134.39 rows=1 width=16)
>  Sort Key: ten, ten, two
>  ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=1 width=16)
>
>
> We would ideally like to sort on ten, two, four to satisfy the first and last
> aggref at once. Stripping the common prefix (ten) would eliminate this 
> problem.

hmm, yeah. That's not great.  This comes from the way I'm doing
list_concat on the pathkeys from the GROUP BY with the ones from the
ordered aggregates. If it were possible to use
make_pathkeys_for_sortclauses() to make these in one go, that would
fix the problem since pathkey_is_redundant() would skip the 2nd "ten".
Unfortunately, it's not possible to pass the combined list of
SortGroupClauses to make_pathkeys_for_sortclauses since they're not
from the same targetlist.  Aggrefs have their own targetlist and the
SortGroupClauses for the Aggref reference that tlist.

I think to do this we'd need something like pathkeys_append() in
pathkeys.c which had a loop and appended the pathkey only if
pathkey_is_redundant returns false.

> Also, existing regression tests cover the first problem (order by a grouping
> key) but I feel like they should be extended with a case similar as the above
> to check which pathkeys are used in the "multiple ordered aggregates + group
> by" cases.

It does seem like a bit of a weird case to go to a lot of effort to
make work, but it would be nice if it did work without having to
contort the code too much.

David




Re: row filtering for logical replication

2021-07-20 Thread Dilip Kumar
On Tue, Jul 20, 2021 at 3:43 PM Tomas Vondra
 wrote:
>
> Do we log the TOAST-ed values that were not updated?

No, we don't, I have submitted a patch sometime back to fix that [1]

[1] https://commitfest.postgresql.org/33/3162/

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2021-07-20 Thread Tomas Vondra
On 7/20/21 11:42 AM, Amit Kapila wrote:
> On Tue, Jul 20, 2021 at 2:39 PM Tomas Vondra
>  wrote:
>>
>> On 7/20/21 7:23 AM, Amit Kapila wrote:
>>> On Mon, Jul 19, 2021 at 7:02 PM Tomas Vondra
>>>  wrote:
>>
 So maybe the best thing is to stick to the simple approach already used
 e.g. by pglogical, which simply user the new row when available (insert,
 update) and old one for deletes.

 I think that behaves more or less sensibly and it's easy to explain.

>>>
>>> Okay, if nothing better comes up, then we can fall back to this option.
>>>
 All the other things (e.g. turning UPDATE to INSERT, advanced conflict
 resolution etc.) will require a lot of other stuff,

>>>
>>> I have not evaluated this yet but I think spending some time thinking
>>> about turning Update to Insert/Delete (yesterday's suggestion by
>>> Alvaro) might be worth especially as that seems to be followed by some
>>> other replication solution as well.
>>>
>>
>> I think that requires quite a bit of infrastructure, and I'd bet we'll
>> need to handle other types of conflicts too.
>>
> 
> Hmm, I don't see why we need any additional infrastructure here if we
> do this at the publisher. I think this could be done without many
> changes to the patch as explained in one of my previous emails [1].
> 

Oh, I see. I've been thinking about doing the "usual" conflict
resolution on the subscriber side. I'm not sure about doing this on the
publisher ...

>> I don't have a clear
>> opinion if that's required to get this patch working - I'd try getting
>> the simplest implementation with reasonable behavior, with those more
>> advanced things as future enhancements.
>>
 and I see them as
 improvements of this simple approach.

>>> Maybe a second option is to have replication change any UPDATE into
>>> either an INSERT or a DELETE, if the old or the new row do not pass the
>>> filter, respectively.  That way, the databases would remain consistent.
>
> Yeah, I think this is the best way to keep the data consistent.
>

 It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
 rather significant overhead.

>>>
>>> Why? I think it would just need similar restrictions as we are
>>> planning for Delete operation such that filter columns must be either
>>> present in primary or replica identity columns.
>>>
>>
>> How else would you turn UPDATE to INSERT? For UPDATE we only send the
>> identity columns and modified columns, and the decision happens on the
>> subscriber.
>>
> 
> Hmm, we log the entire new tuple and replica identity columns for the
> old tuple in WAL for Update. And, we are going to use a new tuple for
> Insert, so we have everything we need.
> 

Do we log the TOAST-ed values that were not updated?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 3:19 PM Dilip Kumar  wrote:
>
> On Tue, Jul 20, 2021 at 3:12 PM Amit Kapila  wrote:
> >
> > > > Why? I think it would just need similar restrictions as we are
> > > > planning for Delete operation such that filter columns must be either
> > > > present in primary or replica identity columns.
> > > >
> > >
> > > How else would you turn UPDATE to INSERT? For UPDATE we only send the
> > > identity columns and modified columns, and the decision happens on the
> > > subscriber.
> > >
> >
> > Hmm, we log the entire new tuple and replica identity columns for the
> > old tuple in WAL for Update. And, we are going to use a new tuple for
> > Insert, so we have everything we need.
> >
>
> But for making that decision we need to apply the filter on the old
> rows as well right.  So if we want to apply the filter on the old rows
> then either the filter should only be on the replica identity key or
> we need to use REPLICA IDENTITY FULL.  I think that is what Tomas
> wants to point out.
>

I have already mentioned that for Updates the filter needs criteria
similar to Deletes. This is exactly the requirement for Delete as
well.

-- 
With Regards,
Amit Kapila.




Re: row filtering for logical replication

2021-07-20 Thread Dilip Kumar
On Tue, Jul 20, 2021 at 3:12 PM Amit Kapila  wrote:
>
> > > Why? I think it would just need similar restrictions as we are
> > > planning for Delete operation such that filter columns must be either
> > > present in primary or replica identity columns.
> > >
> >
> > How else would you turn UPDATE to INSERT? For UPDATE we only send the
> > identity columns and modified columns, and the decision happens on the
> > subscriber.
> >
>
> Hmm, we log the entire new tuple and replica identity columns for the
> old tuple in WAL for Update. And, we are going to use a new tuple for
> Insert, so we have everything we need.
>

But for making that decision we need to apply the filter to the old
rows as well, right?  So if we want to apply the filter to the old rows,
then either the filter should only be on the replica identity key or
we need to use REPLICA IDENTITY FULL.  I think that is what Tomas
wants to point out.
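
For reference, the two options would boil down to something like this on
the publisher side (illustrative names, using the WHERE syntax proposed
in this thread):

-- option 1: the filter only uses replica identity columns,
-- e.g. the primary key
CREATE PUBLICATION pub FOR TABLE t WHERE (id > 1000);

-- option 2: log all old-row columns so that arbitrary filter columns
-- are available for the old tuple as well
ALTER TABLE t REPLICA IDENTITY FULL;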

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 2:39 PM Tomas Vondra
 wrote:
>
> On 7/20/21 7:23 AM, Amit Kapila wrote:
> > On Mon, Jul 19, 2021 at 7:02 PM Tomas Vondra
> >  wrote:
>
> >> So maybe the best thing is to stick to the simple approach already used
> >> e.g. by pglogical, which simply user the new row when available (insert,
> >> update) and old one for deletes.
> >>
> >> I think that behaves more or less sensibly and it's easy to explain.
> >>
> >
> > Okay, if nothing better comes up, then we can fall back to this option.
> >
> >> All the other things (e.g. turning UPDATE to INSERT, advanced conflict
> >> resolution etc.) will require a lot of other stuff,
> >>
> >
> > I have not evaluated this yet but I think spending some time thinking
> > about turning Update to Insert/Delete (yesterday's suggestion by
> > Alvaro) might be worth especially as that seems to be followed by some
> > other replication solution as well.
> >
>
> I think that requires quite a bit of infrastructure, and I'd bet we'll
> need to handle other types of conflicts too.
>

Hmm, I don't see why we need any additional infrastructure here if we
do this at the publisher. I think this could be done without many
changes to the patch as explained in one of my previous emails [1].

> I don't have a clear
> opinion if that's required to get this patch working - I'd try getting
> the simplest implementation with reasonable behavior, with those more
> advanced things as future enhancements.
>
> >> and I see them as
> >> improvements of this simple approach.
> >>
> > Maybe a second option is to have replication change any UPDATE into
> > either an INSERT or a DELETE, if the old or the new row do not pass the
> > filter, respectively.  That way, the databases would remain consistent.
> >>>
> >>> Yeah, I think this is the best way to keep the data consistent.
> >>>
> >>
> >> It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
> >> rather significant overhead.
> >>
> >
> > Why? I think it would just need similar restrictions as we are
> > planning for Delete operation such that filter columns must be either
> > present in primary or replica identity columns.
> >
>
> How else would you turn UPDATE to INSERT? For UPDATE we only send the
> identity columns and modified columns, and the decision happens on the
> subscriber.
>

Hmm, we log the entire new tuple and replica identity columns for the
old tuple in WAL for Update. And, we are going to use a new tuple for
Insert, so we have everything we need.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BAXEd5bO-qPp6L9Ptckk09nbWvP8V7q5UW4hg%2BkHjXwQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 6:56 AM Masahiko Sawada  wrote:
>
> On Mon, Jul 19, 2021 at 8:38 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > 3) For 0003 patch, if user set skip_xid to a wrong xid which have not been
> >assigned, and then will the change be skipped when the xid is assigned in
> >the future even if it doesn't cause any conflicts ?
>
> Yes. Currently, setting a correct xid is the user's responsibility. I
> think it would be better to disable it or emit WARNING/ERROR when the
> user mistakenly set the wrong xid if we find out a convenient way to
> detect that.
>

I think in this regard we should clearly document how this can be
misused by users. I see that you have mentioned skip_xid, but maybe
we can add more on how it could lead to skipping a non-conflicting XID
and thus to an inconsistent replica. As discussed earlier, users can
anyway do similar harm by using pg_replication_slot_advance(). If
possible, we might also want to give some examples of where it would
be helpful for users to use this functionality.
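
For instance, the docs could walk through something like the following
(I am guessing at the exact option spelling from the 0003 patch, so
treat it as illustrative only):

-- the failing transaction's XID is reported in the subscriber's log;
-- skipping it is a last resort, since any non-conflicting changes in
-- that transaction are lost as well
ALTER SUBSCRIPTION mysub SET (skip_xid = 741);

-- similar damage is already possible today by advancing the slot past
-- the transaction on the publisher
SELECT pg_replication_slot_advance('mysub', '0/1D000060');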

-- 
With Regards,
Amit Kapila.




Re: Rename of triggers for partitioned tables

2021-07-20 Thread Arne Roland
Thank you! I get a compile time error

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard 
-Wno-format-truncation -O2 -I../../../src/include  -D_GNU_SOURCE   -c -o 
trigger.o trigger.c

In file included from ../../../src/include/postgres.h:46:0,
 from trigger.c:14:
trigger.c: In function ‘renametrig_partition’:
../../../src/include/c.h:118:31: error: expected ‘,’ or ‘;’ before 
‘__attribute__’
 #define pg_attribute_unused() __attribute__((unused))
   ^
../../../src/include/c.h:155:34: note: in expansion of macro 
‘pg_attribute_unused’
 #define PG_USED_FOR_ASSERTS_ONLY pg_attribute_unused()
  ^~~
trigger.c:1588:17: note: in expansion of macro ‘PG_USED_FOR_ASSERTS_ONLY’
  int  found = 0 PG_USED_FOR_ASSERTS_ONLY;
 ^~~~
trigger.c:1588:7: warning: unused variable ‘found’ [-Wunused-variable]
  int  found = 0 PG_USED_FOR_ASSERTS_ONLY;
   ^
make[3]: *** [: trigger.o] Error 1


Any ideas on what I might be doing wrong?


Regards

Arne


Re: Numeric x^y for negative x

2021-07-20 Thread Yugo NAGATA
On Wed, 7 Jul 2021 18:36:56 +0100
Dean Rasheed  wrote:

> On Thu, 1 Jul 2021 at 14:17, Dean Rasheed  wrote:
> >
> > On Tue, 29 Jun 2021 at 12:08, Dean Rasheed  wrote:
> > >
> > > Numeric x^y is supported for x < 0 if y is an integer, but this
> > > currently fails if y is outside the range of an int32
> >
> > I've been doing some more testing of this, and I spotted another
> > problem with numeric_power().
> >
> > [loss of precision and overflow errors]
> >
> > I think we should attempt to avoid all such overflow errors,
> > that are actually underflows, and return zero instead.
> >
> 
> Finally getting back to this ... attached is an updated patch that now
> includes a fix for the loss-of-precision bug and the overflow errors.
> I don't think it's really worth trying to split these up, since
> they're all somewhat interrelated.

The patch can be applied cleanly.
(Though, I needed to remove the "new file mode 100644" lines, or else I get
 the error "error: git apply: bad git-diff - expected /dev/null on line 4".)

Compilation succeeded, and all tests passed.

This patch fixes numeric_power() to handle negative bases correctly and not
to raise an error "cannot take logarithm of a negative number", as well as a
bug where a result whose value is almost zero was incorrectly returned as 1.
The previous behaviors are obviously strange, and these fixes seem reasonable
to me.

Also, improper overflow errors in numeric_power() and numeric_exp() are
corrected to return 0 when the result actually underflows. I think it is
fine for these functions to return zero instead of raising underflow errors,
because power_var_int() already does the same.

The patch includes additional tests covering negative-base cases as well as
underflow and rounding of almost-zero results. They look good.

Let me just make one comment.

(errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
 errmsg("zero raised to a negative power is undefined")));

-   if (sign1 < 0 && !numeric_is_integral(num2))
-   ereport(ERROR,
-   (errcode(ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION),
-errmsg("a negative number raised to a non-integer power yields 
a complex result")));
-
/*
 * Initialize things
 */

I don't think we need to move this check from numeric_power() to power_var().
I noticed the following comment in numeric_power():

/*
 * The SQL spec requires that we emit a particular SQLSTATE error code for
 * certain error conditions.  Specifically, we don't return a
 * divide-by-zero error code for 0 ^ -1.
 */

In the original code, the two checks that can raise an
ERRCODE_INVALID_ARGUMENT_FOR_POWER_FUNCTION error follow this comment.
I think these checks are intentionally placed together under the comment,
so I suggest not moving one of them.


Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: row filtering for logical replication

2021-07-20 Thread Tomas Vondra
On 7/20/21 7:23 AM, Amit Kapila wrote:
> On Mon, Jul 19, 2021 at 7:02 PM Tomas Vondra
>  wrote:
>>
>> On 7/19/21 1:00 PM, Dilip Kumar wrote:
>>> On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
 a. Just log it and move to the next row
 b. send to stats collector some info about this which can be displayed
 in a view and then move ahead
 c. just skip it like any other row that doesn't match the filter clause.

 I am not sure if there is any use of sending a row if one of the
 old/new rows doesn't match the filter. Because if the old row doesn't
 match but the new one matches the criteria, we will anyway just throw
 such a row on the subscriber instead of applying it.
>>>
>>> But at some time that will be true even if we skip the row based on
>>> (a) or (c) right.  Suppose the OLD row was not satisfying the
>>> condition but the NEW row is satisfying the condition, now even if we
>>> skip this operation then in the next operation on the same row even if
>>> both OLD and NEW rows are satisfying the filter the operation will
>>> just be dropped by the subscriber right? because we did not send the
>>> previous row when it first updated to value which were satisfying the
>>> condition.  So basically, any row is inserted which did not satisfy
>>> the condition first then post that no matter how many updates we do to
>>> that row either it will be skipped by the publisher because the OLD
>>> row was not satisfying the condition or it will be skipped by the
>>> subscriber as there was no matching row.
>>>
>>
>> I have a feeling it's getting overly complicated, to the extent that
>> it'll be hard to explain to users and reason about. I don't think
>> there's a "perfect" solution for cases when the filter expression gives
>> different answers for old/new row - it'll always be surprising for some
>> users :-(
>>
> 
> 
> It is possible but OTOH, the three replication solutions (Debezium,
> Oracle, IBM's InfoSphere Data Replication) which have this feature
> seems to filter based on both old and new rows in one or another way.
> Also, I am not sure if the simple approach of just filter based on the
> new row is very clear because it can also confuse users in a way that
> even if all the new rows matches the filters, they don't see anything
> on the subscriber and in fact, that can cause a lot of network
> overhead without any gain.
> 

True. My point is that it's easier to explain than using some
combination of old/new row, and the approach "replicate if the filter
matches both rows" proposed in this thread would be confusing too.

If the subscriber database can be modified, we kinda have this issue
already - the row can be deleted, and all UPDATEs will be lost. Yes, for
read-only replicas that won't happen, but I think we're moving to use
cases more advanced than that.

I think there are only two ways to *guarantee* this does not happen:

* prohibit updates of columns referenced in row filters

* some sort of conflict resolution, turning UPDATE to INSERT etc.

>> So maybe the best thing is to stick to the simple approach already used
>> e.g. by pglogical, which simply user the new row when available (insert,
>> update) and old one for deletes.
>>
>> I think that behaves more or less sensibly and it's easy to explain.
>>
> 
> Okay, if nothing better comes up, then we can fall back to this option.
> 
>> All the other things (e.g. turning UPDATE to INSERT, advanced conflict
>> resolution etc.) will require a lot of other stuff,
>>
> 
> I have not evaluated this yet but I think spending some time thinking
> about turning Update to Insert/Delete (yesterday's suggestion by
> Alvaro) might be worth especially as that seems to be followed by some
> other replication solution as well.
> 

I think that requires quite a bit of infrastructure, and I'd bet we'll
need to handle other types of conflicts too. I don't have a clear
opinion if that's required to get this patch working - I'd try getting
the simplest implementation with reasonable behavior, with those more
advanced things as future enhancements.

>> and I see them as
>> improvements of this simple approach.
>>
> Maybe a second option is to have replication change any UPDATE into
> either an INSERT or a DELETE, if the old or the new row do not pass the
> filter, respectively.  That way, the databases would remain consistent.
>>>
>>> Yeah, I think this is the best way to keep the data consistent.
>>>
>>
>> It'd also require REPLICA IDENTITY FULL, which seems like it'd add a
>> rather significant overhead.
>>
> 
> Why? I think it would just need similar restrictions as we are
> planning for Delete operation such that filter columns must be either
> present in primary or replica identity columns.
> 

How else would you turn UPDATE to INSERT? For UPDATE we only send the
identity columns and modified columns, and the decision happens on the
subscriber. So we need to send everything if there's a risk we'll need
those 

Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Wed, Jul 14, 2021 at 2:08 AM Euler Taveira  wrote:
>
> On Tue, Jul 13, 2021, at 12:25 AM, Peter Smith wrote:
>
> I have reviewed the latest v18 patch. Below are some more review
> comments and patches.
>
> Peter, thanks for quickly check the new patch. I'm attaching a new patch 
> (v19).
>

The latest patch doesn't apply cleanly. Can you please rebase it and
see if you can address some simpler comments till we reach a consensus
on some of the remaining points?

-- 
With Regards,
Amit Kapila.




Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 9:54 AM Amit Kapila  wrote:
>
> On Mon, Jul 19, 2021 at 4:31 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 19, 2021 at 3:12 PM Amit Kapila  wrote:
> >
> > > > Maybe a second option is to have replication change any UPDATE into
> > > > either an INSERT or a DELETE, if the old or the new row do not pass the
> > > > filter, respectively.  That way, the databases would remain consistent.
> >
> > Yeah, I think this is the best way to keep the data consistent.
> >
>
> Today, while studying the behavior of this particular operation in
> other databases, I found that IBM's InfoSphere Data Replication does
> exactly this. See [1]. I think there is a merit if want to follow this
> idea.
>

As per my initial analysis, there shouldn't be much difficulty in
implementing this behavior. We need to change the filter API
(pgoutput_row_filter) such that it tells us whether the filter is
satisfied by the old row, new row or both and then the caller should
be able to make a decision based on that. I think that should be
sufficient to turn update to insert/delete when required. I might be
missing something here but this doesn't appear to require any drastic
changes in the patch.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 01:10, James Coleman  wrote:
> To be clear up front: I'm in favor of the patch, and I don't want to
> put unnecessary stumbling blocks up for it getting committed. So if we
> decide to proceed as is, that's fine with me.

Thanks for making that clear.

> But I'm not sure that the "cost model, unfortunately, does not account
> for that" is entirely accurate. The end of cost_tuplesort contains a
> cost "per extracted tuple". It does, however, note that it doesn't
> charge cpu_tuple_cost, which maybe is what you'd want to fully
> incorporate this into the model. But given this run_cost isn't about
> accounting for comparison cost (that's been done earlier) which is the
> part that'd be the same between tuple and datum sort, it seems to me
> that we could lower the cpu_operator_cost here by something like 10%
> if it's byref and 30% if it's byval?

I failed to notice that last part that adds the additional cpu_operator_cost.

The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.

If we take test 1 from my results on v5 as an example:

> Test1 446.1 657.3 147.32%

Looking at explain for that query:

regression=# explain select two from tenk1 order by two offset 100;
  QUERY PLAN
--
 Limit  (cost=1133.95..1133.95 rows=1 width=4)
   ->  Sort  (cost=1108.97..1133.95 rows=9995 width=4)
 Sort Key: two
 ->  Seq Scan on tenk1  (cost=0.00..444.95 rows=9995 width=4)
(4 rows)

If we want the costs to reflect reality again here then we'd have to
scale 1133.95 down by something like the 147.32% performance difference
(i.e. divide it by 1.4732).  That would bring the cost down to 769.72, a
far bigger adjustment than the 25 that the cpu_operator_cost * tuples
surcharge gives us to play with.

If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45.  That's not great considering the actual
performance indicates that 769.72 would be a better number.

If we look at John's result for test 1: He saw 588 tps on master and
998 on v8.  1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.

My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with.  I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.

I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completely new cost model for this,
then we might as well be putting lipstick on a pig.

It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted.  Does anyone else feel that we should be
adjusting it?

David




Re: row filtering for logical replication

2021-07-20 Thread Amit Kapila
On Tue, Jul 20, 2021 at 11:38 AM Greg Nancarrow  wrote:
>
> On Tue, Jul 20, 2021 at 2:25 PM Amit Kapila  wrote:
> >
> > Today, while studying the behavior of this particular operation in
> > other databases, I found that IBM's InfoSphere Data Replication does
> > exactly this. See [1]. I think there is a merit if want to follow this
> > idea.
> >
>
> So in this model (after initial sync of rows according to the filter),
> for UPDATE, the OLD row is checked against the WHERE clause, to know
> if the row had been previously published. If it hadn't, and the NEW
> row satisfies the WHERE clause, then it needs to be published as an
> INSERT. If it had been previously published, but the NEW row doesn't
> satisfy the WHERE condition, then it needs to be published as a
> DELETE. Otherwise, if both OLD and NEW rows satisfy the WHERE clause,
> it needs to be published as an UPDATE.
>

Yeah, this is what I also understood.
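
As a concrete example, using the row-filter WHERE syntax from this patch
(the table definition is made up):

CREATE PUBLICATION pub FOR TABLE orders WHERE (status = 'shipped');

-- old row fails the filter, new row passes:
-- published as an INSERT
UPDATE orders SET status = 'shipped' WHERE id = 1;

-- old row passes the filter, new row fails:
-- published as a DELETE
UPDATE orders SET status = 'cancelled' WHERE id = 2;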

> At least, that seems to be the model when the WHERE clause refers to
> the NEW (updated) values, as used in most of their samples (i.e. in
> that database "the current log record", indicated by a ":" prefix on
> the column name).
> I think that allowing the OLD values ("old log record") to be
> referenced in the WHERE clause, as that model does, could be
> potentially confusing.
>

I think in terms of referring to old and new rows, we already have
terminology which we used at various other similar places. See Create
Rule docs [1]. For where clause, it says "Within condition and
command, the special table names NEW and OLD can be used to refer to
values in the referenced table. NEW is valid in ON INSERT and ON
UPDATE rules to refer to the new row being inserted or updated. OLD is
valid in ON UPDATE and ON DELETE rules to refer to the existing row
being updated or deleted.". We need similar things for the WHERE
clause in publication if we want special syntax to refer to old and
new rows.

I think if we use some existing way to refer to old/new values then it
shouldn't be confusing to users.
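
For example, the existing RULE syntax from [1] already reads quite
naturally, and the same naming could carry over to a publication filter
(the table names are made up, and the second statement is hypothetical,
only to show what such syntax might look like):

-- existing precedent: OLD/NEW in a rule's WHERE condition
CREATE RULE log_balance AS ON UPDATE TO accounts
    WHERE NEW.balance <> OLD.balance
    DO ALSO INSERT INTO accounts_log
        VALUES (OLD.id, OLD.balance, NEW.balance);

-- hypothetical: the same terminology applied to a publication row filter
CREATE PUBLICATION pub FOR TABLE accounts WHERE (NEW.balance >= 0);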

[1] - https://www.postgresql.org/docs/devel/sql-createrule.html

-- 
With Regards,
Amit Kapila.




RE: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE

2021-07-20 Thread wangsh.f...@fujitsu.com
Hi kuroda-san:

I found another problem with DECLARE STATEMENT. The test source looks like:
> EXEC SQL AT con1 DECLARE stmt STATEMENT;
> EXEC SQL PREPARE stmt AS SELECT version();
> EXEC SQL DECLARE cur CURSOR FOR stmt;
> EXEC SQL WHENEVER SQLERROR STOP;

The output from ecpg:
>test.pgc:14: ERROR: AT option not allowed in WHENEVER statement

After some research, I found that after calling the function check_declared_list
the variable connection is updated, but in some cases (e.g. ECPGCursorStmt)
the reset of connection is missing.

I'm not sure, but how about modifying ecpg.trailer:
> statement: ecpgstart at toplevel_stmt ';' { connection = NULL; }
> | ecpgstart toplevel_stmt ';'

I think there are lots of 'connection = NULL;' statements in the source, and we
should find a good location to reset the connection.


Best regards.
Shenhao Wang


-Original Message-
From: kuroda.hay...@fujitsu.com  
Sent: Monday, July 12, 2021 12:05 PM
To: 'Kyotaro Horiguchi' 
Cc: pgsql-hackers@lists.postgresql.org
Subject: RE: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE

Dear Horiguchi-san,

Thank you for reviewing! I attached a new version.
Sorry for the delayed reply.

> Since we don't allow descriptors with the same name even if they are
> for the different connections, I think we can set the current
> connection if any (which is set either by AT option or statement-bound
> one) to i->connection silently if i->connection is NULL in
> lookup_descriptor(). What do you think about this?

I tried to implement. Is it correct?

> connection is "conn1" at the error time. The parser relies on
> output_statement and friends for connection name reset. So the rules
> that don't call the functions need to reset it by themselves.

Oh, I didn't notice that. Fixed.
I'm wondering why an output function like output_describe_statement() is not
implemented, but anyway I put a connection reset in ecpg.addons.

> Similary, the following sequence doesn't yield an error, which is
> expected.
> 
> > EXEC SQL AT conn1 DECLARE stmt STATEMENT;
> > EXEC SQL AT conn2 EXECUTE stmt INTO ..;
> 
> In this case "conn2" set by the AT option is silently overwritten with
> "conn1" by check_declared_list(). I think we should reject AT option
> (with a different connection) in that case.

Actually this comes from Oracle's specification. The Pro*C precompiler
overwrites the connection in this situation, hence I followed that.
But I agree this might be confusing, so I added a warning report.
What do you think? Is it still strange?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: OpenSSL 3.0.0 compatibility

2021-07-20 Thread Michael Paquier
On Tue, Jul 20, 2021 at 01:23:42AM +0200, Daniel Gustafsson wrote:
> Another aspect of OpenSSL 3 compatibility is that of legacy cipher support, 
> and
> as we concluded upthread it's best to leave that to the user to define in
> openssl.cnf.  The attached 0002 adds alternative output files for 3.0.0
> installations without the legacy provider loaded, as well as adds a note in 
> the
> pgcrypto docs to enable it in case DES is needed.  It does annoy me a bit that
> we don't load the openssl.cnf file for 1.0.1 if we start mentioning it in the
> docs for other versions, but it's probably not worth the effort to fix it 
> given
> the lack of complaints so far (it needs a call to OPENSSL_config(NULL); 
> guarded
> to HAVE_ macros for 1.0.1).

Sounds sensible as a whole.  Another thing I notice is that OpenSSL
3.0.0beta1 has taken care of the issue causing diffs in the tests of
src/test/ssl/.  So once pgcrypto is addressed, it looks like there is
nothing left for this thread.
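
For context, the sort of pgcrypto call that depends on the legacy
provider under OpenSSL 3 is, for example:

CREATE EXTENSION pgcrypto;
-- DES lives in the legacy provider, so this fails unless openssl.cnf
-- loads it
SELECT encrypt('secret'::bytea, 'key'::bytea, 'des');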
--
Michael


signature.asc
Description: PGP signature


Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-20 Thread David Rowley
On Tue, 20 Jul 2021 at 19:04, Andres Freund  wrote:
> > * AllocateSetAlloc.txt
> > * palloc.txt
> > * percent.txt
>
> Huh, that's interesting. You have some control flow enforcement stuff turned 
> on (the endbr64). And it looks like it has a non zero cost (or maybe it's 
> just skid). Did you enable that intentionally? If not, what 
> compiler/version/distro is it? I think at least on GCC that's 
> -fcf-protection=...

It's Ubuntu 21.04 with gcc 10.3 (specifically gcc version 10.3.0
(Ubuntu 10.3.0-1ubuntu1)).

I've attached the same results from compiling with clang 12
(12.0.0-3ubuntu1~21.04.1)

David
 Percent |  Source code & Disassembly of postgres for cycles (707 samples, 
percent: local period)
-
 :
 :
 :
 :Disassembly of section .text:
 :
 :008e7c10 :
 :AllocSetAlloc():
 :
 :/*
 :* If requested size exceeds maximum for chunks, allocate 
an entire block
 :* for this request.
 :*/
 :if (unlikely(size > set->allocChunkLimit))
7.48 :   8e7c10: cmp%rsi,0xc8(%rdi)
3.26 :   8e7c17: jb 8e7c81 
0.00 :   8e7c19: xor%eax,%eax
 :AllocSetFreeIndex():
 :if (size > (1 << ALLOC_MINBITS))
0.44 :   8e7c1b: cmp$0x9,%rsi
0.00 :   8e7c1f: jb 8e7c2d 
 :idx = 31 - __builtin_clz((uint32) size - 1) - 
ALLOC_MINBITS + 1;
0.00 :   8e7c21: add$0x,%esi
0.98 :   8e7c24: bsr%esi,%eax
9.59 :   8e7c27: xor$0xffe0,%eax
1.44 :   8e7c2a: add$0x1e,%eax
 :AllocSetAlloc():
 :* corresponding free list to see if there is a free chunk 
we could reuse.
 :* If one is found, remove it from the free list, make it 
again a member
 :* of the alloc set and return its data address.
 :*/
 :fidx = AllocSetFreeIndex(size);
 :chunk = set->freelist[fidx];
1.67 :   8e7c2d: movslq %eax,%rcx
4.10 :   8e7c30: mov0x58(%rdi,%rcx,8),%rax
 :if (chunk != NULL)
   15.97 :   8e7c35: test   %rax,%rax
0.28 :   8e7c38: je 8e7c45 
 :{
 :Assert(chunk->size >= size);
 :
 :set->freelist[fidx] = (AllocChunk) chunk->aset;
0.00 :   8e7c3a: mov0x8(%rax),%rdx
   13.33 :   8e7c3e: mov%rdx,0x58(%rdi,%rcx,8)
0.28 :   8e7c43: jmp8e7c73 
0.00 :   8e7c45: mov$0x8,%eax
 :}
 :
 :/*
 :* Choose the actual chunk size to allocate.
 :*/
 :chunk_size = (1 << ALLOC_MINBITS) << fidx;
0.71 :   8e7c4a: shl%cl,%eax
0.15 :   8e7c4c: movslq %eax,%rsi
 :
 :/*
 :* If there is enough room in the active allocation block, 
we will put the
 :* chunk into that block.  Else must start a new one.
 :*/
 :if ((block = set->blocks) != NULL)
0.43 :   8e7c4f: mov0x50(%rdi),%rdx
1.13 :   8e7c53: test   %rdx,%rdx
0.14 :   8e7c56: je 8e7c7c 
 :{
 :Sizeavailspace = block->endptr - 
block->freeptr;
0.00 :   8e7c58: mov0x18(%rdx),%rax
6.98 :   8e7c5c: mov0x20(%rdx),%rcx
2.30 :   8e7c60: sub%rax,%rcx
 :
 :if (unlikely(availspace < (chunk_size + 
ALLOC_CHUNKHDRSZ)))
0.00 :   8e7c63: lea0x10(%rsi),%r8
0.14 :   8e7c67: cmp%r8,%rcx
2.02 :   8e7c6a: jb 8e7c86 
 :chunk = (AllocChunk) (block->freeptr);
 :
 :/* Prepare to initialize the chunk header. */
 :VALGRIND_MAKE_MEM_UNDEFINED(chunk, ALLOC_CHUNKHDRSZ);
 :
 :chunk->size = chunk_size;
2.04 :   8e7c6c: mov%rsi,(%rax)
 :
 :block->freeptr += (chunk_size + ALLOC_CHUNKHDRSZ);
   20.16 :   8e7c6f: add%r8,0x18(%rdx)
0.28 :   8e7c73: mov%rdi,0x8(%rax)
4.70 :   8e7c77: add$0x10,%rax
 :Assert(block->freeptr <= block->endptr);
 :
 :return AllocSetAllocReturnChunk(set, size, chunk, 
chunk_size);
 :}
0.00 :   8e7c7b: ret
 :return AllocSetAllocFromNewBlock(set, size, chunk_size);
0.00 :   8e7c7c: jmp8e8470 
 :return AllocSetAllocLarge(set, size, flags);
0.00 :   8e7c81: jmp8e8330 
 :return AllocSetAllocCarveOldAndAlloc(set, size, 
chunk_size,
0.00 :   8e7c86: jmp8e83e0 
 Percent |  Source code &

Re: Avoid stack frame setup in performance critical routines using tail calls

2021-07-20 Thread Andres Freund
Hi,

On Mon, Jul 19, 2021, at 23:53, David Rowley wrote:
> On Tue, 20 Jul 2021 at 18:17, Andres Freund  wrote:
> > Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
> > palloc` from a patched run? And perhaps how high their percentages of the
> > total work are. E.g. using something like
> > perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'
> 
> Sure. See attached.
> 
> David
> 
> Attachments:
> * AllocateSetAlloc.txt
> * palloc.txt
> * percent.txt

Huh, that's interesting. You have some control flow enforcement stuff turned on 
(the endbr64). And it looks like it has a non zero cost (or maybe it's just 
skid). Did you enable that intentionally? If not, what compiler/version/distro 
is it? I think at least on GCC that's -fcf-protection=...

Andres