Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Bharath Rupireddy
On Thu, Jan 26, 2023 at 12:35 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
> > In other words it allows slow down of any backend activity. Any feedback on
> > such a feature is welcome, including better GUC name proposals ;) and
> > conditions in which such feature should be disabled even if it would be
> > enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
> > not in the patch).
>
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:

+1 for the feature, as it keeps replication lag in check and provides a
way to define an RPO (recovery point objective).

> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs).  And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
>
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay.  That should fix the hard deadlock danger and
> remove most of the increase in lock contention.
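
In sketch form, that flow would look roughly like the following (the
XLogDelayPending flag and the throttle_* variables are hypothetical names
used for illustration here, not taken from any posted patch):

/* In XLogInsertRecord(), after accounting for the inserted bytes: */
backendWalInserted += rechdr->xl_tot_len;
if (backendWalInserted >= throttle_threshold_bytes)
{
    XLogDelayPending = true;    /* remember that we owe a throttle delay */
    InterruptPending = true;    /* make CHECK_FOR_INTERRUPTS() fire soon */
}

/* In ProcessInterrupts(), i.e. outside critical sections: */
if (XLogDelayPending)
{
    XLogDelayPending = false;
    XLogFlush(throttle_flush_lsn);                /* flush our own WAL first */
    SyncRepWaitForLSN(throttle_flush_lsn, false); /* then wait for the sync standby */
    backendWalInserted = 0;
}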

We've discussed this feature quite extensively in a recent thread -
https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
Folks agreed with Andres' suggestion there -
https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de.
However, that thread fell off my radar. It's good that someone
else started working on it, and I'm happy to help with this feature.

> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
>
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address.

Right.

> How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?

Seems reasonable.

> A bulk writer won't do a lot of XLogFlush()es, so the time/bytes since the
> last XLogFlush() will increase quickly, triggering a flush at the next
> opportunity. But short OLTP transactions will do XLogFlush()es at least at
> every commit.

Right.

> I also suspect the overhead will be more manageable if you were to force a
> flush only up to a point further back than the last fully filled page. We
> don't want to end up flushing WAL for every page, but if you just have a
> backend-local accounting mechanism, I think that's inevitably what you'd end
> up with when you have a large number of sessions. But if you'd limit the
> flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, and
> only ever to a prior boundary, the amount of unnecessary WAL flushes would be
> proportional to synchronous_commit_flush_wal_after.

When the WAL records are large, the backend inserting them would
throttle frequently, generating more flushes, more waits for the sync
replication ack, and more contention on WALWriteLock. I'm not sure this
is a good idea unless the impact is measured.

A few more thoughts:

1. This feature keeps replication lag in check at the cost of
throttling write traffic. I'd be curious to see the improvement in
replication lag vs. the drop in transaction throughput, say with pgbench
running a custom insert script and one or more async/sync standbys.

2. Heavy WAL throttling can turn into a lot of writes and fsyncs.
Eventually, every backend that generates WAL gets throttled,
irrespective of whether there is any replication lag. How about letting
backends throttle themselves not just based on
wal_distance_from_last_flush but also on the replication lag at the
moment, say, when it crosses wal_throttle_replication_lag_threshold
bytes?
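
For example, a sketch of the combined condition (the lag helper and the
flag below are hypothetical; only the GUC-ish names come from the text
above):

/* Throttle only when this backend wrote enough WAL since its last flush
 * AND the sync standby is behind by more than the threshold. */
if (wal_distance_from_last_flush > wal_throttle_threshold * 1024L &&
    GetReplicationLagBytes() > wal_throttle_replication_lag_threshold * 1024L)
    wal_throttle_pending = true;  /* delay at the next CHECK_FOR_INTERRUPTS() */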

3. Should a backend wait indefinitely for the sync rep ack once it has
crossed wal_throttle_threshold? I don't think so; at some point it must
be given a chance to get its own work done instead of favouring other
backends by throttling itself.

4. I'd prefer adding a TAP test for this feature to check if the WAL
throttle is picked up by a backend.

5. Can we also extend this WAL

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-26 Thread Masahiko Sawada
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
 wrote:
>
> On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jan 23, 2023 at 8:20 PM John Naylor
> >  wrote:
> > >
> > > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > > >  wrote:
> > >
> > > In v21, all of your v20 improvements to the radix tree template and test 
> > > have been squashed into 0003, with one exception: v20-0010 (recursive 
> > > freeing of shared mem), which I've attached separately (for flexibility) 
> > > as v21-0006. I believe one of your earlier patches had a new DSA function 
> > > for freeing memory more quickly -- was there a problem with that 
> > > approach? I don't recall where that discussion went.
> >
> > Hmm, I don't remember I proposed such a patch, either.
>
> I went looking, and it turns out I remembered wrong, sorry.
>
> > One idea to address it would be that we pass a shared memory to
> > RT_CREATE() and we create a DSA area dedicated to the radix tree in
> > place. We should return the created DSA area along with the radix tree
> > so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
> > and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
> > area. A downside of this idea would be that one DSA area only for a
> > radix tree is always required.
> >
> > Another idea would be that we allocate a big enough DSA area and
> > quarry small memory for nodes from there. But it would need to
> > introduce another complexity so I prefer to avoid it.
> >
> > FYI the current design is inspired by dshash.c. In dshash_destory(),
> > we dsa_free() each elements allocated by dshash.c
>
> Okay, thanks for the info.
>
> > > 0007 makes the value type configurable. Some debug functionality still 
> > > assumes integer type, but I think the rest is agnostic.
> >
> > radixtree_search_impl.h still assumes that the value type is an
> > integer type as follows:
> >
> > #ifdef RT_NODE_LEVEL_LEAF
> > RT_VALUE_TYPE   value = 0;
> >
> > Assert(RT_NODE_IS_LEAF(node));
> > #else
> >
> > Also, I think if we make the value type configurable, it's better to
> > pass the pointer of the value to RT_SET() instead of copying the
> > values since the value size could be large.
>
> Thanks, I will remove the assignment and look into pass-by-reference.
>
> > Oops, the fix is missed in the patch for some reason. I'll fix it.
> >
> > > There is also this, in the template, which I'm not sure has been 
> > > addressed:
> > >
> > >  * XXX: Currently we allow only one process to do iteration. Therefore, 
> > > rt_node_iter
> > >  * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> > >  * We need either a safeguard to disallow other processes to begin the 
> > > iteration
> > >  * while one process is doing or to allow multiple processes to do the 
> > > iteration.
> >
> > It's not addressed yet. I think adding a safeguard is better for the
> > first version. A simple solution is to add a flag, say iter_active, to
> > allow only one process to enable the iteration. What do you think?
>
> I don't quite have enough info to offer an opinion, but this sounds like a 
> different form of locking. I'm sure it's come up before, but could you 
> describe why iteration is different from other operations, regarding 
> concurrency?

I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get a consistent result throughout
the whole iteration. Unlike other operations such as RT_SET(), we cannot
expect that a job doing something for each key-value pair in the radix
tree completes in a short time, so we cannot keep holding the radix tree
lock until the end of the iteration. So the idea is that we set
iter_active to true (with the lock held in exclusive mode) and prevent
concurrent updates while the flag is true.
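
In sketch form, the safeguard could look like this (illustrative names
only, loosely following the template's conventions, not actual code):

/* Begin iteration: only one iterator may be active at a time. */
void
rt_begin_iterate(rt_radix_tree *tree)
{
    rt_lock_exclusive(tree);
    if (tree->ctl->iter_active)
        elog(ERROR, "radix tree is already being iterated");
    tree->ctl->iter_active = true;
    rt_unlock(tree);
}

/* Called at the top of rt_set() / rt_delete(). */
static void
rt_check_no_iteration(rt_radix_tree *tree)
{
    if (tree->ctl->iter_active)
        elog(ERROR, "cannot modify radix tree while an iteration is in progress");
}

/* End iteration: clear the flag so that updates may proceed again. */
void
rt_end_iterate(rt_radix_tree *tree)
{
    rt_lock_exclusive(tree);
    tree->ctl->iter_active = false;
    rt_unlock(tree);
}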

>
> > > Would it be worth it (or possible) to calculate constants based on 
> > > compile-time block size? And/or have a fallback for other table AMs? 
> > > Since this file is in access/common, the intention is to allow 
> > > general-purpose, I imagine.
> >
> > I think we can pass the maximum offset numbers to tidstore_create()
> > and calculate these values.
>
> That would work easily for vacuumlazy.c, since it's in the "heap" subdir so 
> we know the max possible offset. I haven't looked at vacuumparallel.c, but I 
> can tell it is not in a heap-specific directory, so I don't know how easy 
> that would be to pass along the right value.

I think the user (e.g., vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.

>
> > > About shared memory: I have some mild reservations about the naming of 
> > > the "control object", which may be in shared memory. Is that an 
> > > established term? (If so, disregard the rest): It seems backwards -- the 
> > > thing in shared memory is the actual tree itself. The thing in 
> > > backend-local memory has the "handle",

Re: Generating code for query jumbling through gen_node_support.pl

2023-01-26 Thread Peter Eisentraut

On 24.01.23 07:57, Michael Paquier wrote:

For the 0004 patch, it should be documented why one would want one behavior
or the other.  That's totally unclear right now.

I am not 100% sure whether we should have this part at the end, but as
an exit path in case somebody complains about the extra load with the
automated jumbling compared to the hash of the query strings, that can
be used as a backup.  Anyway, attached is what would be a
clarification.


OK, the documentation makes sense now.  I wonder what the performance 
impact is.  Probably, nobody cares about microoptimizing CREATE TABLE 
statements.  But BEGIN/COMMIT could matter.  However, whatever you do in 
between the BEGIN and COMMIT will already be jumbled, so you're already 
paying the overhead.  Hopefully, jumbling such simple commands will have 
no noticeable overhead.


In other words, we should test this and hopefully get rid of the 
'string' method.





Re: Generating code for query jumbling through gen_node_support.pl

2023-01-26 Thread Peter Eisentraut

On 25.01.23 01:08, Michael Paquier wrote:

On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:

Makes sense.  That would be my intention if 0004 is the most
acceptable and splitting things makes things a bit easier to review.


There was a silly mistake in 0004 where the jumbling code relied on
compute_query_id rather than utility_query_id, so fixed and rebased as
of v7 attached.


Overall, this looks good to me.

There are a couple of repetitive comments, like "typmod and collation 
information are irrelevant for the query jumbling".  This applies to all 
nodes, so we don't need to repeat it for a number of nodes (and then not 
mention it for other nodes).  Maybe there should be a central place 
somewhere that describes "these kinds of fields should normally be ignored".






Re: pg_upgrade from PG-14.5 to PG-15.1 failing due to non-existing function

2023-01-26 Thread David Geier

Hi,

On 1/25/23 19:38, Dimos Stamatakis wrote:


Hi hackers,

I attempted to perform an upgrade from PG-14.5 to PG-15.1 with 
pg_upgrade and unfortunately it errors out because of a function that 
does not exist anymore in PG-15.1.


The function is ‘pg_catalog.close_lb’ and it exists in 14.5 but not in 
15.1.


In our scenario we changed the permissions of this function in PG14.5 
(via an automated tool) and then pg_upgrade tries to change the 
permissions in PG15.1 as well.



Here [1] is a very similar issue that was reported in 2019.

The patch didn't make it in, but it also doesn't seem to fix the issue 
reported by Dimos anyway. The patch in [1] seems to be concerned with changed 
function signatures rather than with dropped functions. Maybe [1] could 
be revived and extended to also ignore dropped functions?


[1] 
https://www.postgresql.org/message-id/flat/f85991ad-bbd4-ad57-fde4-e12f0661dbf0%40postgrespro.ru


--
David Geier
(ServiceNow)





Re: improving user.c error messages

2023-01-26 Thread Alvaro Herrera


Please use 
errdetail("You must have %s privilege to create roles with %s.",
"SUPERUSER", "SUPERUSER")));

in this kind of message where multiple copies appear that only differ in
the keyword to use, to avoid creating four copies of essentially the
same string.

This applies in several places.


> -  errmsg("must have createdb privilege 
> to change createdb attribute")));
> +  errmsg("permission denied to alter 
> role"),
> +  errhint("You must have CREATEDB 
> privilege to alter roles with CREATEDB.")));

I think this one is a bit ambiguous; does "with" mean that roles that
have that priv cannot be changed, or does it mean that you cannot meddle
with that bit in particular?  I think it'd be better to say
  "You must have %s privilege to change the %s attribute."
or something like that.
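
Putting both suggestions together, the resulting call might look roughly
like this (a sketch of the pattern, not the patch's actual wording):

ereport(ERROR,
        (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
         errmsg("permission denied to alter role"),
         /* keywords passed as %s so the string is translated only once */
         errdetail("You must have %s privilege to change the %s attribute.",
                   "CREATEDB", "CREATEDB")));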

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)




Re: meson oddities

2023-01-26 Thread Peter Eisentraut

On 19.01.23 21:45, Andres Freund wrote:

Hi,

On 2023-01-19 21:37:15 +0100, Peter Eisentraut wrote:

On 11.01.23 12:05, Peter Eisentraut wrote:

I think there is also an adjacent issue:  The subdir options may be
absolute or relative.  So if you specify --prefix=/usr/local and
--sysconfdir=/etc/postgresql, then

      config_paths_data.set_quoted('SYSCONFDIR', dir_prefix / dir_sysconf)

would produce something like /usr/local/etc/postgresql.


I don't think it would. The / operator understands absolute paths and doesn't
add the "first component" if the second component is absolute.


Oh, that is interesting.  In that case, this is not the right patch.  We 
should proceed with my previous patch in [0] then.


[0]: 
https://www.postgresql.org/message-id/a6a6de12-f705-2b33-2fd9-9743277de...@enterprisedb.com






Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-26 Thread Drouvot, Bertrand

Hi,

On 1/24/23 7:27 PM, Alvaro Herrera wrote:

Looking again, I have two thoughts for making things easier:

1. I don't think wait_for_write_catchup is necessary, because
calling wait_for_catchup() and omitting the 'mode' and 'lsn' arguments
would already do the same thing.  So what we should do is patch places
that currently give those two arguments, so that they don't.



Agreed, but there is one place where the node passed as the second argument is not 
"$self":

src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_write_catchup('node_c',
 $node_a);

So it looks like there is still a need for wait_for_write_catchup().


2. Because wait_for_replay_catchup is an instance method, passing the
second node as argument is needlessly noisy, because that's already
known as $self.  So we can just say

   $primary_node->wait_for_replay_catchup($standby_node);



Yeah, but same here, there are places where the node passed as the second argument is not 
"$self":

src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_replay_catchup('node_c',
 $node_a);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
 $node_primary);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
 $node_primary);
src/test/recovery/t/001_stream_rep.pl:  
$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);

So it looks like there is still a need for wait_for_replay_catchup() with 2 
parameters.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-26 Thread Alvaro Herrera
On 2023-Jan-26, Drouvot, Bertrand wrote:

> On 1/24/23 7:27 PM, Alvaro Herrera wrote:

> > 1. I don't think wait_for_write_catchup is necessary, because
> > calling wait_for_catchup() and omitting the 'mode' and 'lsn' arguments
> > would already do the same thing.  So what we should do is patch places
> > that currently give those two arguments, so that they don't.
> 
> Agree but there is one place where the node passed as the second argument is 
> not the "$self":
> 
> src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_write_catchup('node_c',
>  $node_a);
> 
> So it looks like there is still a need for wait_for_write_catchup().

Hmm, I think that one can use the more general wait_for_catchup.


> > 2. Because wait_for_replay_catchup is an instance method, passing the
> > second node as argument is needlessly noisy, because that's already
> > known as $self.  So we can just say
> > 
> >$primary_node->wait_for_replay_catchup($standby_node);
> 
> Yeah, but same here, there is places where the node passed as the second 
> argument is not the "$self":
> 
> src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_replay_catchup('node_c',
>  $node_a);
> src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
>  $node_primary);
> src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
>  $node_primary);
> src/test/recovery/t/001_stream_rep.pl:  
> $node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);
> 
> So it looks like there is still a need for wait_for_replay_catchup() with 2 
> parameters.

Ah, cascading replication.  In that case, let's make the second
parameter optional.  If it's not given, $self is used.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)




Re: [PATCH] Make ON CONFLICT DO NOTHING and ON CONFLICT DO UPDATE consistent

2023-01-26 Thread Aleksander Alekseev
Hi Andres,

> It *certainly* can't be right to just continue with the update in heap_update,

I see no reason why. What makes this case so different from updating a
tuple created by the previous command?

> as you've done. You'd have to skip the update, not execute it. What am I
> missing here?

Simply skipping updates in a statement that literally says DO UPDATE
doesn't seem to be the behavior a user would expect.

> I think this'd completely break triggers, for example, because they won't be
> able to get the prior row version, since it won't actually be a row ever
> visible (due to cmin=cmax).
>
> I suspect it might break unique constraints as well, because we'd end up with
> an invisible row in part of the ctid chain.

That's a reasonable concern; however, I have been unable to break unique
constraints or triggers so far:

```
CREATE TABLE t (a INT UNIQUE, b INT);

CREATE OR REPLACE FUNCTION t_insert() RETURNS TRIGGER AS $$
BEGIN
  RAISE NOTICE 't_insert triggered: new = %, old = %', NEW, OLD;
  RETURN NULL;
END
$$ LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION t_update() RETURNS TRIGGER AS $$
BEGIN
  RAISE NOTICE 't_update triggered: new = %, old = %', NEW, OLD;
  RETURN NULL;
END
$$ LANGUAGE 'plpgsql';

CREATE TRIGGER t_insert_trigger
AFTER INSERT ON t
FOR EACH ROW EXECUTE PROCEDURE t_insert();

CREATE TRIGGER t_insert_update
AFTER UPDATE ON t
FOR EACH ROW EXECUTE PROCEDURE t_update();

INSERT INTO t VALUES (1,1), (1,2) ON CONFLICT (a) DO UPDATE SET b = 0;

NOTICE:  t_insert triggered: new = (1,1), old = 
NOTICE:  t_update triggered: new = (1,0), old = (1,1)

INSERT INTO t VALUES (2,1), (2,2), (3,1) ON CONFLICT (a) DO UPDATE SET b = 0;

NOTICE:  t_insert triggered: new = (2,1), old = 
NOTICE:  t_update triggered: new = (2,0), old = (2,1)
NOTICE:  t_insert triggered: new = (3,1), old = 

=# SELECT * FROM t;
 a | b
---+---
 1 | 0
 2 | 0
 3 | 1
```

PFA patch v2 that also includes the test shown above.

Are there any other scenarios we should check?

-- 
Best regards,
Aleksander Alekseev


v2-0001-Make-ON-CONFLICT-DO-NOTHING-and-ON-CONFLICT-DO-UP.patch
Description: Binary data


Re: to_hex() for negative inputs

2023-01-26 Thread Dean Rasheed
On Wed, 25 Jan 2023 at 21:43, Peter Eisentraut
 wrote:
>
> On 24.01.23 14:10, Dean Rasheed wrote:
> > I also think it might be useful for it to gain a couple of boolean options:
> >
> > 1). An option to output a signed value (defaulting to false, to
> > preserve the current two's complement output).
>
> I find the existing behavior so strange, I would rather give up and
> invent a whole new function family with correct behavior, which could
> then also support octal and binary, and perhaps any base.  This could be
> something generally named, like "convert" or "format".
>
> > 2). An option to output the base prefix "0x" (which comes after the
> > sign, making it inconvenient for the user to add themselves).
>
> This could also be handled with a "format"-like function.
>

Yeah, making a break from the existing to_hex() functions might not be
a bad idea.

My only concern with something general like convert() or format() is
that it'll end up being hard to use, with lots of different formatting
options, like to_char(). In fact Oracle's to_char() has an 'X'
formatting option to output hexadecimal, though it's pretty limited
(e.g., it doesn't support negative inputs). MySQL has conv() that'll
convert between any two bases, but it does bizarre things for negative
inputs or inputs larger than 2^64.

TBH, I'm not that interested in bases other than hex (I mean who
really uses octal anymore?), and I don't particularly care about
formats other than the format we accept as input. For that, just
adding a numeric_to_hex() function would suffice. That works fine for
int4 and int8 inputs too, and doesn't preclude adding a more general
base-conversion / formatting function later, if there's demand.

Regards,
Dean




Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Tue, Aug 23, 2022 at 1:13 PM John Naylor 
wrote:
>
> On Tue, Aug 23, 2022 at 11:24 AM David Rowley 
wrote:
> >
> > On Tue, 23 Aug 2022 at 15:22, John Naylor 
wrote:
> > > Did you happen to see
> > >
> > >
https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
> >
> > I missed that.  It looks like a much more promising idea than what I
> > came up with. I've not looked at your code yet, but I'm interested and
> > will aim to look soon.
>
> Note that I haven't actually implemented this idea yet, just tried to
> model the effects by lobotomizing the current comparators. I think
> it's worth pursuing and will try to come back to it this cycle, but if
> you or anyone else wants to try, that's fine of course.

Coming back to this, I wanted to sketch out this idea in a bit more detail.

Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:
- Qsort the non-null array, including whatever specialization is available.
Existing specialized comparators could ignore nulls (and their ordering)
taking less space in the binary.
- Only if there is more than one sort key, qsort the null array. Ideally at
some point we would have a method of ignoring the first sortkey (this is an
existing opportunity that applies elsewhere as well).
- To handle two arrays, grow_memtuples() would need some adjustment, as
would any callers that read the final result of an in-memory sort -- they
would need to retrieve the tuples starting with the appropriate array
depending on NULLS FIRST/LAST behavior.

I believe external merges wouldn't have to do anything different, since
when writing out the tapes, we read from the arrays in the right order.

(One could extend this idea further and have two pools of tapes, for null
and non-null first sortkey, that are merged separately, in the right order.
That sounds like quite a bit more complexity than it's worth, however.)
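
As a rough sketch of the two-array bookkeeping (SortTuple's isnull1 field
exists today; the second array and the helper names below are hypothetical):

/* Route each incoming tuple by whether its first sort key is NULL. */
if (stup->isnull1)
    state->nullmemtuples[state->nullmemtupcount++] = *stup;
else
    state->memtuples[state->memtupcount++] = *stup;

/* ... later, when performing the in-memory sort ... */
sort_nonnull_array(state);      /* specialized comparators, nulls ignored */
if (state->nKeys > 1)
    sort_null_array(state);     /* ordered only by the remaining sort keys */

/* Readers then consume the two arrays in NULLS FIRST/LAST order. */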

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-01-26 Thread Andrei Zubkov
Hi,

The final version of this patch should fix the meson build and tests.
-- 
Andrei Zubkov

From 94784bccd48a83cba58d6017253d0b8f051e159c Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Thu, 26 Jan 2023 13:18:11 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds stats_since and minmax_stats_since columns to the
pg_stat_statements view and the pg_stat_statements() function. The new min/max
reset mode for the pg_stat_statements_reset() function is controlled by the
parameter minmax_only.
The stats_since column is populated with the current timestamp when a new
statement is added to the pg_stat_statements hashtable. It provides clear
information about the statistics collection time interval for each statement.
Besides, it can be used by sampling solutions to detect situations when a
statement was evicted and stored again between samples.
Such a sampling solution could derive any pg_stat_statements statistic values
for an interval between two samples, with the exception of the min/max
statistics. To address this issue, this patch adds the ability to reset min/max
statistics independently of the statement reset, using the new minmax_only
parameter of the pg_stat_statements_reset(userid oid, dbid oid, queryid bigint,
minmax_only boolean) function. The timestamp of such a reset is stored in the
minmax_stats_since field for each statement.
The pg_stat_statements_reset() function now returns the timestamp of the reset
as its result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |   2 +-
 .../expected/oldextversions.out   |  70 
 .../expected/pg_stat_statements.out   | 361 +-
 contrib/pg_stat_statements/meson.build|   1 +
 .../pg_stat_statements--1.10--1.11.sql|  81 
 .../pg_stat_statements/pg_stat_statements.c   | 139 +--
 .../pg_stat_statements.control|   2 +-
 .../pg_stat_statements/sql/oldextversions.sql |   7 +
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  66 +++-
 10 files changed, 722 insertions(+), 156 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.10--1.11.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index edc40c8bbfb..0afb9060fa1 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.10--1.11.sql \
 	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index efb2049ecff..1e1cc11a8e2 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -250,4 +250,74 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
  t
 (1 row)
 
+-- New functions and views for pg_stat_statements in 1.11
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.11';
+\d pg_stat_statements
+  View "public.pg_stat_statements"
+ Column |   Type   | Collation | Nullable | Default 
++--+---+--+-
+ userid | oid  |   |  | 
+ dbid   | oid  |   |  | 
+ toplevel   | boolean  |   |  | 
+ queryid| bigint   |   |  | 
+ query  | text |   |  | 
+ plans  | bigint   |   |  | 
+ total_plan_time| double precision |   |  | 
+ min_plan_time  | double precision |   |  | 
+ max_plan_time  | double precision |   |  | 
+ mean_plan_time | double precision |   |  | 
+ stddev_plan_time   | double precision |   |  | 
+ calls  | bigint   |   |  | 
+ total_exec_time| double precision |   |  | 
+ min_exec_time  | double precision |   |  | 
+ max_exec_time  | double precision |   |  | 
+ mean_exec_time | double precision |   |  | 
+ stddev_exec_time   | double precision |   |  | 
+ rows  

Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Tomas Vondra



On 1/25/23 20:05, Andres Freund wrote:
> Hi,
> 
> On 2023-01-25 14:32:51 +0100, Jakub Wartak wrote:
>> In other words it allows slow down of any backend activity. Any feedback on
>> such a feature is welcome, including better GUC name proposals ;) and
>> conditions in which such feature should be disabled even if it would be
>> enabled globally (right now only anti- wraparound VACUUM comes to mind, it's
>> not in the patch).
> 
> Such a feature could be useful - but I don't think the current place of
> throttling has any hope of working reliably:
> 
>> @@ -1021,6 +1025,21 @@ XLogInsertRecord(XLogRecData *rdata,
>>  pgWalUsage.wal_bytes += rechdr->xl_tot_len;
>>  pgWalUsage.wal_records++;
>>  pgWalUsage.wal_fpi += num_fpi;
>> +
>> +backendWalInserted += rechdr->xl_tot_len;
>> +
>> +if ((synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY || 
>> synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_WRITE) &&
>> +synchronous_commit_flush_wal_after > 0 &&
>> +backendWalInserted > synchronous_commit_flush_wal_after 
>> * 1024L)
>> +{
>> +elog(DEBUG3, "throttling WAL down on this session 
>> (backendWalInserted=%d)", backendWalInserted);
>> +XLogFlush(EndPos);
>> +/* XXX: refactor SyncRepWaitForLSN() to have different 
>> waitevent than default WAIT_EVENT_SYNC_REP  */
>> +/* maybe new WAIT_EVENT_SYNC_REP_BIG or something like 
>> that */
>> +SyncRepWaitForLSN(EndPos, false);
>> +elog(DEBUG3, "throttling WAL down on this session - 
>> end");
>> +backendWalInserted = 0;
>> +}
>>  }
> 
> You're blocking in the middle of an XLOG insertion. We will commonly hold
> important buffer lwlocks, it'll often be in a critical section (no cancelling
> / terminating the session!). This'd entail loads of undetectable deadlocks
> (i.e. hard hangs).  And even leaving that aside, doing an unnecessary
> XLogFlush() with important locks held will seriously increase contention.
> 

Yeah, I agree the sleep would have to happen elsewhere.

It's not clear to me how it could cause deadlocks, as we're not waiting
for a lock/resource locked by someone else, but it's certainly an issue
for uninterruptible hangs.

> 
> My best idea for how to implement this in a somewhat safe way would be for
> XLogInsertRecord() to set a flag indicating that we should throttle, and set
> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed to
> proceed (i.e. we'll not be in a critical / interrupts off section) can
> actually perform the delay.  That should fix the hard deadlock danger and
> remove most of the increase in lock contention.
> 

The solution I've imagined is something like autovacuum throttling - do
some accounting of how much "WAL bandwidth" each process consumed, and
then do the delay/sleep in a suitable place.

> 
> I don't think doing an XLogFlush() of a record that we just wrote is a good
> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> we're doing. To make matters worse, this will often cause partially filled WAL
> pages to be flushed out - rewriting a WAL page multiple times can
> significantly increase overhead on the storage level. At the very least this'd
> have to flush only up to the last fully filled page.
> 

I don't see why that would significantly increase the number of flushes.
The goal is to do this only every ~1MB of a WAL or so, and chances are
there were many (perhaps hundreds or more) commits in between. At least
in a workload with a fair number of OLTP transactions (which is kinda
the point of all this).

And the "small" OLTP transactions don't really do any extra fsyncs,
because the accounting resets at commit (or places that flush WAL).

Same for the flushes of partially filled pages - if there are enough
small OLTP transactions, we're already having this issue. I don't see
why this would make it measurably worse. But if needed, we can simply
back off to the last page boundary, so that we only flush the complete
page. That would work too - the goal is not to flush everything, but to
reduce how much of the lag is due to the current process (i.e. to wait
for the sync replica to catch up).
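
That back-off amounts to rounding the flush target down to a WAL page
boundary, roughly (a sketch, not the patch's code):

/* Flush only up to the last complete WAL page, so a partially filled
 * page isn't rewritten solely because of throttling. */
XLogRecPtr  flush_upto = EndPos - (EndPos % XLOG_BLCKSZ);

XLogFlush(flush_upto);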

> 
> Just counting the number of bytes inserted by a backend will make the overhead
> even worse, as the flush will be triggered even for OLTP sessions doing tiny
> transactions, even though they don't contribute to the problem you're trying
> to address.  How about counting how many bytes of WAL a backend has inserted
> since the last time that backend did an XLogFlush()?
> 

No, we should reset the counter at commit, so small OLTP transactions
should not really trigger this. That's kinda the point, to only
delay "large" transactions producing a lot of WAL.

> A bulk writer won't do a lot of XLogFlush()es, so the time

Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread Pavel Borisov
Hi, John!
Generally, I like the separation of non-null values before sorting and
would like to join as a reviewer when we get to the patch. I have only a
small question:

> - Only if there is more than one sort key, qsort the null array. Ideally at 
> some point we would have a method of ignoring the first sortkey (this is an 
> existing opportunity that applies elsewhere as well).
Do we need to sort by the second sort key when the first one is NULL,
either because the standard requires it or because some part of the code
relies on it? I suppose NULL values in the first sort key mean the
attribute values are undefined and there is no preferred order between
these tuples, even if their second sort keys differ.

And maybe (unlikely IMO) we need some analog of NULLS DISTINCT/NOT
DISTINCT in this scope?

Kind regards,
Pavel Borisov,
Supabase.




Re: drop postmaster symlink

2023-01-26 Thread Peter Eisentraut

On 26.01.23 01:03, Karl O. Pinc wrote:

Buried in
https://www.postgresql.org/message-id/20230107165942.748ccf4e%40slate.karlpinc.com
is the one change I see that should be made.


In doc/src/sgml/ref/allfiles.sgml at line 222 there is an ENTITY
defined which references the deleted postmaster.sgml file.


This line needs to be removed and the
0002-Don-t-install-postmaster-symlink-anymore.patch
updated.  (Unless there's some magic going on
with the various allfiles.sgml files of which I am
not aware.)

If this is fixed I see no other problems.


Good find.  Committed with this additional change.





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-26 Thread David Geier

Hi,

On 1/23/23 21:30, Andres Freund wrote:

That's been the case since my first post in the thread :). Mainly, it seems
easier to detect underflow cases during subtraction that way. And the factor
of 2 in range doesn't change a whole lot.

I just realized it the other day :).

If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a bit larger, and nobody has looked at it yet. So I'm a bit
hesitant to push it.

I haven't yet pushed the pg_test_timing (nor its small prerequisite)
patch.

I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...

I'll include them in my new patch set and also have a careful look at them.


I reviewed the prerequisite patch which introduces 
INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are my 
comments:


- The prerequisite patch looks good to me.

- By default, the test query in the pg_test_timing doc runs serially. 
What about adding SET max_parallel_workers_per_gather = 0 to make sure 
it really always does (e.g. on a system with different settings for 
parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will 
be much more flaky.


- Why have you added a case distinction for diff == 0? Have you 
encountered this case? If so, how? Maybe add a comment.


- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() 
multiple times. But then again: why do we actually care about the 
per-loop time? Why not instead sum up diff and divide by the number of 
iterations to exclude all the overhead in the first place?


- In the computation of the per-loop time in nanoseconds you can now use 
INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.


The rest looks good to me. The rebased patches are part of the patch set 
I sent out yesterday in reply to another mail in this thread.


--
David Geier
(ServiceNow)





Re: Add LZ4 compression in pg_dump

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 11:24:47AM +, gkokola...@pm.me wrote:
> I gave this a little bit of thought. I think that ReadHead should not
> emit a warning, or at least not this warning as it is slightly misleading.
> It implies that it will automatically turn off data restoration, which is
> false. Further ahead, the code will fail with a conflicting error message
> stating that the compression is not available.
> 
> Instead, it would be cleaner both for the user and the maintainer to
> move the check in RestoreArchive and make it the sole responsible for
> this logic.

-pg_fatal("cannot restore from compressed archive (compression not 
supported in this installation)");
+pg_fatal("cannot restore data from compressed archive (compression not 
supported in this installation)");
Hmm.  I don't mind changing this part as you suggest.

-#ifndef HAVE_LIBZ
-   if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
-   pg_fatal("archive is compressed, but this installation does not 
support compression");
-#endif
However, I think that we'd better keep the warning, as it can offer a
hint when a pg_restore -l that was not built with compression support is
used to look at a dump that has been compressed.
--
Michael


signature.asc
Description: PGP signature


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov 
wrote:

> > - Only if there is more than one sort key, qsort the null array.
Ideally at some point we would have a method of ignoring the first sortkey
(this is an existing opportunity that applies elsewhere as well).

> Should we need to sort by the second sort key provided the first one
> in NULL by standard or by some part of the code relying on this? I

I'm not sure I quite understand the question.

If there is more than one sort key, and the specialized comparison on the
first key gives a definitive zero result, it falls back to comparing all
keys from the full tuple. (The sorttuple struct only contains the first
sortkey, which might actually be an abbreviated key.) A possible
optimization, relevant here and also elsewhere, is to compare only using
keys starting from key2. But note: if the first key is abbreviated, a zero
result is not definitive, and we must check the first key's full value from
the tuple.
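
In code terms, the fallback looks roughly like this (simplified from the
shape of the tuplesort comparators; not meant to be exact):

int         compare;

/* Compare the first sort key (possibly abbreviated) stored in the SortTuple. */
compare = ApplySortComparator(a->datum1, a->isnull1,
                              b->datum1, b->isnull1,
                              state->sortKeys);
if (compare != 0)
    return compare;

/*
 * A zero result is definitive only if the first key wasn't abbreviated;
 * otherwise the authoritative first-key value, and then any later keys,
 * have to be compared from the full tuple.
 */
return state->comparetup(a, b, state);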

> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.

There is in fact a preferred order between these tuples -- the second key
is the tie breaker in this case.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread David Rowley
On Thu, 26 Jan 2023 at 23:29, John Naylor  wrote:
> Coming back to this, I wanted to sketch out this idea in a bit more detail.
>
> Have two memtuple arrays, one for first sortkey null and one for first 
> sortkey non-null:
> - Qsort the non-null array, including whatever specialization is available. 
> Existing specialized comparators could ignore nulls (and their ordering) 
> taking less space in the binary.
> - Only if there is more than one sort key, qsort the null array. Ideally at 
> some point we would have a method of ignoring the first sortkey (this is an 
> existing opportunity that applies elsewhere as well).
> - To handle two arrays, grow_memtuples() would need some adjustment, as would 
> any callers that read the final result of an in-memory sort -- they would 
> need to retrieve the tuples starting with the appropriate array depending on 
> NULLS FIRST/LAST behavior.

Thanks for coming back to this. I've been thinking about this again
recently due to what was discovered in [1].  Basically, the patch on
that thread is trying to eliminate the need for the ORDER BY sort in a
query such as: SELECT a,b,row_number() over (order by a) from ab order
by a,b;  the idea is to just perform the full sort on a,b for the
WindowClause, to save having to do an Incremental Sort on b for
the ORDER BY after evaluating the window funcs.  Surprisingly (for
me), we found a bunch of cases where it performs better to sort on
some of the keys and then do an Incremental Sort on the remainder,
rather than just doing a single sort on everything.

I don't fully understand why this is yet, but one theory I have
is that it might be down to work_mem being larger than the CPU's L3
cache and causing more cache line misses.  With the simpler sort,
there's less swapping of items in the array because the comparison
function sees tuples as equal more often.  I did find that the gap
between the two is not as large with fewer tuples to sort.

I think the slower sorts I found in [2] could also be partially caused
by the current sort specialisation comparators re-comparing the
leading column during a tie-break. I've not gotten around to disabling
the sort specialisations to see if and how much this is a factor for
that test.

The reason I'm bringing this up here is that I wondered whether, in
addition to what you're mentioning above, if we're making changes to
allow multiple in-memory arrays, it would be worth going a little
further and allowing N arrays which we'd try to keep under the L3 cache
size. Maybe some new GUC could be used to know what a good value is.
With fast enough disks, it's often faster to use smaller values of
work_mem which don't exceed L3, to keep these batches smaller.  Keeping
them in memory would remove the need to write out tapes.

I don't know exactly whether such a feature could easily be tagged
on to what you propose above, but I feel like what you're talking
about is at least part of the way towards that. At the least, the
additional work could be kept in mind when this is written so that
it's easier to extend in the future.

David

[1] 
https://postgr.es/m/CAApHDvpAO5H_L84kn9gCJ_hihOavtmDjimKYyftjWtF69BJ=8...@mail.gmail.com
[2] 
https://postgr.es/m/CAApHDvqh%2BqOHk4sbvvy%3DQr2NjPqAAVYf82oXY0g%3DZ2hRpC2Vmg%40mail.gmail.com




wrong Append/MergeAppend elision?

2023-01-26 Thread Amit Langote
Hi,

It seems that the planner currently elides an Append/MergeAppend that
has run-time pruning info (part_prune_index) set, which I think is
a bug.  Here's an example:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
set plan_cache_mode to force_generic_plan ;
prepare q as select * from p where a = $1;
explain execute q (0);
  QUERY PLAN
--
 Seq Scan on p1 p  (cost=0.00..41.88 rows=13 width=4)
   Filter: (a = $1)
(2 rows)

Because the Append is elided in this case, run-time pruning doesn't
kick in to prune p1, even though PartitionPruneInfo to do so has been
generated.

Attached find a patch to fix that.  There are some expected output
diffs in partition_prune suite, though they all look sane to me.

Thoughts?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v1-0001-Don-t-elide-Append-MergeAppend-if-run-time-prunin.patch
Description: Binary data


Re: Record queryid when auto_explain.log_verbose is on

2023-01-26 Thread torikoshia

On 2023-01-26 12:40, Michael Paquier wrote:

On Wed, Jan 25, 2023 at 04:46:36PM +0900, Michael Paquier wrote:

Thanks.  Will check and probably apply on HEAD.


Done, after adding one test case with compute_query_id=regress and
applying some indentation.
--
Michael


Thanks!


On 2023-01-23 09:35, Michael Paquier wrote:

ExplainPrintTriggers() is kind of different because there is
auto_explain_log_triggers.  Still, we could add a flag in 
ExplainState
deciding if the triggers should be printed, so as it would be 
possible

to move ExplainPrintTriggers() and ExplainPrintJITSummary() within
ExplainPrintPlan(), as well?  The same kind of logic could be applied
for the planning time and the buffer usage if these are tracked in
ExplainState rather than being explicit arguments of 
ExplainOnePlan().

Not to mention that this reduces the differences between
ExplainOneUtility() and ExplainOnePlan().


I'll work on this next.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: Inconsistency in vacuum behavior

2023-01-26 Thread Nikita Malakhov
Hi!

Yes, I've checked that. What would be the desirable behavior in the case above?
Anyway, waiting for the table to be unlocked does not seem quite right.

On Sat, Jan 21, 2023 at 4:12 AM Nathan Bossart 
wrote:

> On Mon, Jan 16, 2023 at 11:18:08AM +0300, Alexander Pyhalov wrote:
> > Is it intended? Why don't we perform vacuum_is_permitted_for_relation()
> > check for inheritors in expand_vacuum_rel()?
>
> Since no lock is held on the partition, the calls to functions like
> object_ownercheck() and pg_class_aclcheck() in
> vacuum_is_permitted_for_relation() will produce cache lookup ERRORs if the
> relation is concurrently dropped.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>
>
>

-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: pg_upgrade from PG-14.5 to PG-15.1 failing due to non-existing function

2023-01-26 Thread Dimos Stamatakis
## Dimos Stamatakis (dimos.stamata...@servicenow.com):

> In our scenario we changed the permissions of this function in PG14.5
> (via an automated tool) and then pg_upgrade tries to change the
> permissions in PG15.1 as well.

Given that this function wasn't even documented and did nothing but
throw an error "function close_lb not implemented" - couldn't you
revert that permissions change for the upgrade? (if it comes to the
worst, a superuser could UPDATE pg_catalog.pg_proc and set proacl
to NULL for that function, but that's not how you manage ACLs in
production, it's for emergency fixing only).

Thanks Christoph! Actually, I already tried reverting the permissions but 
pg_upgrade attempts to replicate the revert SQL statement as well 😊
It would be nice to make pg_upgrade ignore some statements while upgrading.
As David mentions, we can alter the patch to ignore dropped functions.

Thanks,
Dimos
(ServiceNow)


Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Jakub Wartak
> On 1/25/23 20:05, Andres Freund wrote:
> > Hi,
> >
> > Such a feature could be useful - but I don't think the current place of
> > throttling has any hope of working reliably:
[..]
> > You're blocking in the middle of an XLOG insertion.
[..]
> Yeah, I agree the sleep would have to happen elsewhere.

Fixed.

> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
> > to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay.  That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> >
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.
>

By the time you replied I've already tried what Andres has recommended.

[..]
>> At the very least this'd
> > have to flush only up to the last fully filled page.
> >
> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why would this make it measurably worse. But if needed, we can simply
> backoff to the last page boundary, so that we only flush the complete
> page. That would work too - the goal is not to flush everything, but to
> reduce how much of the lag is due to the current process (i.e. to wait
> for sync replica to catch up).

I've introduced the rounding to the last written page (hopefully).

> > Just counting the number of bytes inserted by a backend will make the 
> > overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address.  How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> >
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not reach really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

Fixed.

> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, 
> > and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would 
> > be
> > proportional to synchronous_commit_flush_wal_after.
> >
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.
>
> I agree this is crude and we'd probably need some sort of "balancing"
> logic that distributes WAL bandwidth between backends, and throttles
> backends producing a lot of WAL.

OK - that's not included (yet?), as it would make this much more complex.

In summary:  Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits

It's still crude, but first tests indicate that it behaves similarly
(to the initial one with GUC = 256kB).

Also, from Bharath's email I found another patch proposal by
Simon [1]. I would like to avoid opening that Pandora's box again,
but to stress this: this feature is specifically aimed at solving OLTP
latency under *sync* rep (some of the code could be generalized
later on and the feature extended to the async case, but that then
opens the question of managing the allowed WAL throughput
budget/back-throttling - this was also raised in the first thread here [2]
by Konstantin).

IMHO it could implement various strategies under a user-settable GUC
"wal_throttle_larger_transactions=(sync,256kB)" or
"wal_throttle_larger_transactions=off", and someday later someone
could teach the code the async case (let's say under
"wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB,
256kB)"). Thoughts?

[1] - 
https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/71f3e6fb-2fca-a798-856a-f23c8ede2333%40garret.ru


0001-WIP-Syncrep-and-improving-latency-due-to-WAL-throttl.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan  wrote:
> On Wed, Jan 25, 2023 at 7:41 PM Robert Haas  wrote:
> > Both Andres and I have repeatedly expressed concern about how much is
> > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > the basis of very limited evidence that the changes are improvements.
> > The fact that Andres was very quickly able to find cases where the
> > patch produces large regression is just more evidence of that. It's
> > also hard to even understand what has been changed, because the
> > descriptions are so theoretical.
>
> Did you actually read the motivating examples Wiki page?

I don't know. I've read a lot of stuff that you've written on this
topic, which has taken a significant amount of time, and I still don't
understand a lot of what you're changing, and I don't agree with all
of the things that I do understand. I can't state with confidence that
the motivating examples wiki page was or was not among the things that
I read. But, you know, when people start running PostgreSQL 16, and
have some problem, they're not going to read the motivating examples
wiki page. They're going to read the documentation. If they can't find
the answer there, they (or some hacker that they contact) will
probably read the code comments and the relevant commit messages.
Those either clearly explain what was changed in a way that somebody
can understand, or they don't. If they don't, *the commits are not
good enough*, regardless of what other information may exist in any
other place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql: Add role's membership options to the \du+ command

2023-01-26 Thread Pavel Luzanov

On 24.01.2023 20:16, David G. Johnston wrote:
Yeah, I noticed the lack too, then went a bit too far afield with 
trying to compose a graph of the roles.  I'm still working on that but 
at this point it probably won't be something I try to get committed to 
psql.  Something more limited like this does need to be included.


Glad to hear that you're working on it.

I'm not too keen on the idea of converting the existing array into a 
newline separated string.  I would try hard to make the modification 
here purely additional.


I agree with all of your arguments. For a couple of months I tried to find 
an acceptable variant in the background.

But apparently I didn't try hard enough ))

In the end, the variant proposed in the patch seemed worth showing to 
start a discussion. But I'm not sure that it is the best choice.

Another thing I did with the graph was have both "member" and 
"memberof" columns in the output.  In short, every grant row in 
pg_auth_members appears twice, once in each column, so the role being 
granted membership and the role into which membership is granted both 
have visibility when you filter on them.  For the role graph I took 
this idea and extended out to an entire chain of roles (and also broke 
out user and group separately) but I think doing the direct-grant only 
here would still be a big improvement.


It will be interesting to see the result.

--
Pavel Luzanov


Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Wed, Jan 25, 2023 at 10:56 PM Andres Freund  wrote:
> but that's only true because page level freezing neutered
> vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Do you think that page-level freezing
(1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
committed?

I have always been a bit skeptical of vacuum_freeze_min_age as a
mechanism. It's certainly true that it is a waste of energy to freeze
tuples that will soon be removed anyway, but on the other hand,
repeatedly dirtying the same page for various different freezing and
visibility related reasons *really sucks*, and even repeatedly reading
the page because we kept deciding not to do anything yet isn't great.
It seems possible that the page-level freezing mechanism could help
with that quite a bit, and I think that the heuristic that patch
proposes is basically reasonable: if there's at least one tuple on the
page that is old enough to justify freezing, it doesn't seem like a
bad bet to freeze all the others that can be frozen at the same time,
at least if it means that we can mark the page all-visible or
all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
deferring as much work as possible to a time when we *can* mark the
page all-visible or all-frozen.

In short, I think that neutering vacuum_freeze_min_age at least to
some degree might be a good thing, but that's not to say that I'm
altogether confident in that patch, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-26 Thread Melih Mutlu
Hi Shveta,

Thanks for reviewing.

On Wed, 25 Jan 2023 at 16:02, shveta malik  wrote:

> On Mon, Jan 23, 2023 at 6:30 PM Melih Mutlu 
> wrote:
> --I see initial data copied, but new catalog columns srrelslotname
> and srreloriginname are not updated:
> postgres=# select sublastusedid from pg_subscription;
>  sublastusedid
> ---
>  2
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srrelslotname | srreloriginname
> ---------+---------+------------+-----------+---------------+-----------------
>    16409 |   16384 | r          | 0/15219E0 |               |
>    16409 |   16389 | r          | 0/15219E0 |               |
>    16409 |   16396 | r          | 0/15219E0 |               |
>
> When are these supposed to be updated? I thought the slotname created
> will be updated here. Am I missing something here?
>

If a relation is currently being synced by a tablesync worker and uses a
replication slot/origin for that operation, then srrelslotname and
srreloriginname fields will have values.
When a relation is done with its replication slot/origin, their info gets
removed from related catalog row, so that slot/origin can be reused for
another table or dropped if not needed anymore.
In your case, all relations are in READY state so it's expected that
srrelslotname and srreloriginname are empty. READY relations do not need a
replication slot/origin anymore.

Tables are probably synced so quickly that you're missing the moments when
a tablesync worker copies a relation and stores its rep. slot/origin in the
catalog.
If initial sync is long enough, then you should be able to see the columns
get updated. I follow [1] to make it longer and test if the patch really
updates the catalog.



> Also the v8 patch does not apply on HEAD, giving merge conflicts.
>

Rebased and resolved conflicts. Please check the new version

---
[1]
publisher:
SELECT 'CREATE TABLE table_'||i||'(i int);' FROM generate_series(1, 100)
g(i) \gexec
SELECT 'INSERT INTO table_'||i||' SELECT x FROM generate_series(1, 1)
x' FROM generate_series(1, 100) g(i) \gexec
CREATE PUBLICATION mypub FOR ALL TABLES;

subscriber:
SELECT 'CREATE TABLE table_'||i||'(i int);' FROM generate_series(1, 100)
g(i) \gexec
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5432 '
PUBLICATION mypub;
select * from pg_subscription_rel where srrelslotname <> ''; \watch 0.5


Thanks,
-- 
Melih Mutlu
Microsoft


v9-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Andrew Dunstan

On 2023-01-23 Mo 09:49, Jelte Fennema wrote:
> Indeed the flags you added are enough. Attached is a patch
> that adds an updated pre-commit hook with the same behaviour
> as the one before. I definitely think having a pre-commit hook
> in the repo is beneficial, since writing one that works in all
> cases definitely takes some time.


I didn't really like your hook, as it forces a reindent, and many people
won't want that (for reasons given elsewhere in this thread).

Here's an extract from my pre-commit hook that does that if PGAUTOINDENT
is set to "yes", and otherwise just warns you that you need to run pgindent.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
#!/bin/sh

set -u

: ${PGAUTOINDENT:=no} 

branch=$(git rev-parse --abbrev-ref HEAD)
files=$(git diff --cached --name-only --diff-filter=ACMR)

check_indent () {

# only do this on master
test  "$branch" = "master" || return 0

# no need to filter files - pgindent ignores everything that isn't a
# .c or .h file

src/tools/pgindent/pgindent --silent-diff $files && return 0

exec 2>&1

if [ "$PGAUTOINDENT" = yes ] ; then
echo "Running pgindent on changed files"
src/tools/pgindent/pgindent $files
echo "Commit abandoned. Rerun git commit to adopt pgindent 
changes"
else
echo 'You need a pgindent run, e.g:'
echo -n 'src/tools/pgindent/pgindent '
echo '`git diff --name-only --diff-filter=ACMR`'
fi

exit 1
}

# nothing to do if there are no files
test -z "$files" && exit 0

check_indent


Re: Non-superuser subscription owners

2023-01-26 Thread Robert Haas
On Wed, Jan 25, 2023 at 10:45 PM Jeff Davis  wrote:
> I propose that we have two predefined roles: pg_create_subscription,
> and pg_create_connection. If creating a subscription with a connection
> string, you'd need to be a member of both roles. But to create a
> subscription with a server object, you'd just need to be a member of
> pg_create_subscription and have the USAGE privilege on the server
> object.

I have no issue with that as a long-term plan. However, I think that
for right now we should just introduce pg_create_subscription. It
would make sense to add pg_create_connection in the same patch that
adds a CREATE CONNECTION command (or whatever exact syntax we end up
with) -- and that patch can also change CREATE SUBSCRIPTION to require
both privileges where a connection string is specified directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how could it cause deadlocks, as we're not waiting
> for a lock/resource locked by someone else, but it's certainly an issue
> for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
lock-ordering rules.


> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
> > to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay.  That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> > 
> 
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.


> > I don't think doing an XLogFlush() of a record that we just wrote is a good
> > idea - that'll sometimes significantly increase the number of fdatasyncs/sec
> > we're doing. To make matters worse, this will often cause partially filled 
> > WAL
> > pages to be flushed out - rewriting a WAL page multiple times can
> > significantly increase overhead on the storage level. At the very least 
> > this'd
> > have to flush only up to the last fully filled page.
> > 
> 
> I don't see why would that significantly increase the number of flushes.
> The goal is to do this only every ~1MB of a WAL or so, and chances are
> there were many (perhaps hundreds or more) commits in between. At least
> in a workload with a fair number of OLTP transactions (which is kinda
> the point of all this).

Because the accounting is done separately in each process. Even if you just
add a few additional flushes for each connection, in aggregate that'll be a
lot.


> And the "small" OLTP transactions don't really do any extra fsyncs,
> because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the 
> > overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address.  How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> > 
> 
> No, we should reset the counter at commit, so small OLTP transactions
> should not reach really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of patch
resets the accounting at commit?


> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why would this make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
calls to partial boundaries by every autovacuum, by every process doing a bit
more bulky work. Because you're flushing the LSN after a record you just
inserted, you're commonly not going to be "joining" the work of an already
in-process XLogFlush().


> But if needed, we can simply backoff to the last page boundary, so that we
> only flush the complete page. That would work too - the goal is not to flush
> everything, but to reduce how much of the lag is due to the current process
> (i.e. to wait for sync replica to catch up).

Yes, as I proposed...


> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, 
> > and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would 
> > be
> > proportional to synchronous_commit_flush_wal_after.
> > 
> 
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
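
To make the "prior boundary" idea above concrete, a minimal standalone sketch
in plain C follows; XLogRecPtr here is a stand-in type and the step size is
illustrative, so this is not the patch itself:

#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real type */

/*
 * Return the largest multiple of "step" that is strictly less than "lsn"
 * (or 0 if there is none).  "step" must be > 0.  With step set to, say,
 * half of the throttling threshold, a backend only ever waits for WAL it
 * inserted at least one full boundary ago, so the number of extra flushes
 * is bounded by the threshold rather than by the number of records.
 */
static XLogRecPtr
prior_flush_boundary(XLogRecPtr lsn, uint64_t step)
{
    XLogRecPtr  boundary = lsn - (lsn % step);

    if (boundary == lsn)                /* lsn sits exactly on a boundary */
        boundary = (boundary >= step) ? boundary - step : 0;
    return boundary;
}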


Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 14:40:56 +0100, Jakub Wartak wrote:
> In summary:  Attached is a slightly reworked version of this patch.
> 1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
> 2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
> 3. Removed GUC for now (always on->256kB); potentially to be restored

Huh? Why did you remove the GUC? Clearly this isn't something we can default
to on.


> diff --git a/src/backend/access/transam/xact.c 
> b/src/backend/access/transam/xact.c
> index d85e313908..05d56d65f9 100644
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -2395,6 +2395,7 @@ CommitTransaction(void)
>  
>   XactTopFullTransactionId = InvalidFullTransactionId;
>   nParallelCurrentXids = 0;
> + backendWalInserted = 0;
>  
>   /*
>* done with commit processing, set current transaction state back to

I don't like the resets around like this. Why not just move it into
XLogFlush()?



> +/*
> + * Called from ProcessMessageInterrupts() to avoid waiting while being in 
> critical section
> + */ 
> +void HandleXLogDelayPending()
> +{
> + /* flush only up to the last fully filled page */
> + XLogRecPtr  LastFullyWrittenXLogPage = XactLastRecEnd - 
> (XactLastRecEnd % XLOG_BLCKSZ);
> + XLogDelayPending = false;

Hm - wonder if it'd be better to determine the LSN to wait for in
XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
bounded number of) WAL records before processing interrupts. No need to flush
more aggressively than necessary...
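
For reference, a minimal standalone sketch of the split being discussed:
the insert path only does cheap accounting and arms a flag plus a target
LSN, and the flush/wait happens later from ProcessInterrupts(). The types,
globals and threshold below are stand-ins, and the commented-out calls only
mark where the real XLogFlush()/SyncRepWaitForLSN() would go:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in */
#define XLOG_BLCKSZ 8192

static uint64_t   backendWalInserted = 0;   /* bytes since last reset */
static bool       XLogDelayPending = false;
static XLogRecPtr throttleFlushUpTo = 0;

/* cheap part, safe to run inside the insert path: no locks, no I/O */
static void
account_wal_and_maybe_arm(XLogRecPtr end_of_record, uint64_t rec_len,
                          uint64_t throttle_threshold)
{
    backendWalInserted += rec_len;
    if (throttle_threshold > 0 && backendWalInserted >= throttle_threshold)
    {
        /* decide the wait target here, e.g. the last fully filled page */
        throttleFlushUpTo = end_of_record - (end_of_record % XLOG_BLCKSZ);
        XLogDelayPending = true;
        /* InterruptPending = true;  -- in the real backend */
    }
}

/* expensive part, run from ProcessInterrupts() outside critical sections */
static void
handle_xlog_delay_pending(void)
{
    if (!XLogDelayPending)
        return;
    XLogDelayPending = false;

    /* XLogFlush(throttleFlushUpTo); */
    /* SyncRepWaitForLSN(throttleFlushUpTo, false); */

    backendWalInserted = 0;             /* reset accounting after waiting */
}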


> + //HOLD_INTERRUPTS();
> +
> + /* XXX Debug for now */
> + elog(WARNING, "throttling WAL down on this session 
> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)", 
> + backendWalInserted, 
> + LSN_FORMAT_ARGS(XactLastRecEnd),
> + LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
> +
> + /* XXX: refactor SyncRepWaitForLSN() to have different waitevent than 
> default WAIT_EVENT_SYNC_REP  */
> + /* maybe new WAIT_EVENT_SYNC_REP_BIG or something like that */
> + XLogFlush(LastFullyWrittenXLogPage);
> + SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);

SyncRepWaitForLSN() has this comment:
 * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.


> + elog(WARNING, "throttling WAL down on this session - end");
> + backendWalInserted = 0;
> +
> + //RESUME_INTERRUPTS();
> +}

I think we'd want a distinct wait event for this.



> diff --git a/src/backend/utils/init/globals.c 
> b/src/backend/utils/init/globals.c
> index 1b1d814254..8ed66b2eae 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -37,6 +37,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
>  volatile sig_atomic_t ProcSignalBarrierPending = false;
>  volatile sig_atomic_t LogMemoryContextPending = false;
>  volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
> +volatile sig_atomic_t XLogDelayPending = false;
>  volatile uint32 InterruptHoldoffCount = 0;
>  volatile uint32 QueryCancelHoldoffCount = 0;
>  volatile uint32 CritSectionCount = 0;

I don't think the new variable needs to be volatile, or even
sig_atomic_t. We'll just manipulate it from outside signal handlers.


Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 13:33:27 +0530, Bharath Rupireddy wrote:
> 6. Backends can ignore throttling for WAL records marked as unimportant, no?

Why would that be a good idea? Not that it matters today, but those records
still need to be flushed in case of a commit by another transaction.


> 7. I think we need to not let backends throttle too frequently even
> though they have crossed wal_throttle_threshold bytes. The best way is
> to rely on replication lag, after all the goal of this feature is to
> keep replication lag under check - say, throttle only when
> wal_distance > wal_throttle_threshold && replication_lag >
> wal_throttle_replication_lag_threshold.

I think my idea of only forcing to flush/wait an LSN some distance in the past
would automatically achieve that?

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Jelte Fennema
On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan  wrote:
> I didn't really like your hook, as it forces a reindent, and many people
> won't want that (for reasons given elsewhere in this thread).

I'm not sure what you mean by "forces a reindent". Like I explained
you can simply run "git commit" again to ignore the changes and
commit anyway. As long as the files are indented on your filesystem
the hook doesn't care if you actually included the indentation changes
in the changes that you're currently committing.

So to be completely clear you can do the following with my hook:
git commit # runs pgindent and fails
git commit # commits changes anyway
git commit -am 'Run pgindent' # commit indentation changes separately

Or what I usually do:
git commit # runs pgindent and fails
git add --patch # choose relevant changes to add to commit
git commit # commit the changes
git checkout -- . # undo irrelevant changes on filesystem

Honestly PGAUTOINDENT=no seems stricter, since the only
way to bypass the failure is now to manually run pgindent
or git commit with the --no-verify flag.

> files=$(git diff --cached --name-only --diff-filter=ACMR)
> src/tools/pgindent/pgindent $files

That seems like it would fail if there's any files or directories with
spaces in them. Maybe this isn't something we care about though.

> # no need to filter files - pgindent ignores everything that isn't a
> # .c or .h file

If the first argument is a non .c or .h file, then pgindent interprets
it as the typedefs file. So it's definitely important to filter non .c
and .h files out. Because now if you commit a single
non .c or .h file this hook messes up the indentation in all of
your files. You can reproduce by running:
src/tools/pgindent/pgindent README

> # only do this on master
> test  "$branch" = "master" || return 0

I would definitely want a way to disable this check. As a normal
submitter I never work directly on master.




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 7:56 PM Andres Freund  wrote:
> > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
> >
> > The difference between this and VACUUM FREEZE is described here:
> >
> > "Note how we freeze most pages, but still leave a significant number
> > unfrozen each time, despite using an eager approach to freezing
> > (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> > unfrozen). Again, this is because we don't freeze pages unless they're
> > already eligible to be set all-visible.
>
> The only reason there is a substantial difference is because of pgbench's
> uniform access pattern. Most real-world applications don't have that.

It's not pgbench! It's TPC-C. It's actually an adversarial case for
the patch series.

-- 
Peter Geoghegan




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-26 Thread Jelte Fennema
After discussing this patch privately with Andres here's a new version of this
patch. The major differences are:
1. Use the pointer value of the connection as a randomness source
2. Use more precise time as randomness source
3. Move addrinfo changes into a separate commit. This is both to make
the actual change cleaner, and because another patch of mine (non
blocking cancels) benefits from the same change.
4. Use the same type of Fisher-Yates shuffle as is done in two other
places in the PG source code (a minimal sketch follows this list).
5. Move tests depending on hosts file to a separate file. This makes
it clear in the output that tests are skipped, because skip_all shows
a nice message.
6. Only enable hosts file load balancing when loadbalance is included
in PG_TEST_EXTRA, since this test listens on TCP socket and is thus
dangerous on a multi-user system.
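
As a reference for item 4, a minimal standalone sketch of a Fisher-Yates
shuffle over a host list; rand_r() (POSIX) and the caller-supplied seed are
only stand-ins for libpq's per-connection PRNG:

#include <stddef.h>
#include <stdlib.h>

/* Shuffle the first n entries of hosts in place (Fisher-Yates). */
static void
shuffle_hosts(char **hosts, size_t n, unsigned int *seed)
{
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--)
    {
        /* rand_r() % (i + 1) has a slight modulo bias; fine for a sketch */
        size_t  j = (size_t) rand_r(seed) % (i + 1);
        char   *tmp = hosts[i];

        hosts[i] = hosts[j];
        hosts[j] = tmp;
    }
}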

On Wed, 18 Jan 2023 at 11:24, Jelte Fennema  wrote:
>
> As far as I can tell this is ready for committer feedback now btw. I'd
> really like to get this into PG16.
>
> > It hadn't been my intention to block the patch on it, sorry. Just
> > registering a preference.
>
> No problem. I hadn't looked into the shared PRNG solution closely
> enough to determine if I thought it was better or not. Now that I
> implemented an initial version I personally don't think it brings
> enough advantages to warrant the added complexity. I definitely
> don't think it's required for this patch, if needed this change can
> always be done later without negative user impact afaict. And the
> connection local PRNG works well enough to bring advantages.
>
> > my
> > assumption was that you could use the existing pglock_thread() to
> > handle it, since it didn't seem like the additional contention would
> > hurt too much.
>
> That definitely would have been the easier approach and I considered
> it. But the purpose of pglock_thread seemed so different from this lock
> that it felt weird to combine the two. Another reason I refactored the lock
> code is that it would be probably be necessary for a future round-robin
> load balancing, which would require sharing state between different
> connections.
>
> > > And my thought was that the one-time
> > > initialization could be moved to a place that doesn't need to know the
> > > connection options at all, to make it easier to reason about the
> > > architecture. Say, next to the WSAStartup machinery.
>
> That's an interesting thought, but I don't think it would really simplify
> the initialization code. Mostly it would change its location.
>
> > (And after marinating on this over the weekend, it occurred to me that
> > keeping the per-connection option while making the PRNG global
> > introduces an additional hazard, because two concurrent connections
> > can now fight over the seed value.)
>
> I think since setting the initial seed value is really only meant for testing
> it's not a big deal if it doesn't work with concurrent connections.


v8-0003-Support-load-balancing-in-libpq.patch
Description: Binary data


v8-0002-Refactor-libpq-to-store-addrinfo-in-a-libpq-owned.patch
Description: Binary data


v8-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 5:41 AM Robert Haas  wrote:
> On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan  wrote:
> > On Wed, Jan 25, 2023 at 7:41 PM Robert Haas  wrote:
> > > Both Andres and I have repeatedly expressed concern about how much is
> > > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > > the basis of very limited evidence that the changes are improvements.
> > > The fact that Andres was very quickly able to find cases where the
> > > patch produces large regression is just more evidence of that. It's
> > > also hard to even understand what has been changed, because the
> > > descriptions are so theoretical.
> >
> > Did you actually read the motivating examples Wiki page?
>
> I don't know. I've read a lot of stuff that you've written on this
> topic, which has taken a significant amount of time, and I still don't
> understand a lot of what you're changing, and I don't agree with all
> of the things that I do understand.

You complained about the descriptions being theoretical. But there's
nothing theoretical about the fact that we more or less do *all*
freezing in an eventual aggressive VACUUM in many important cases,
including very simple cases like pgbench_history -- the simplest
possible append-only table case. We'll merrily rewrite the entire
table, all at once, for no good reason at all. Consistently, reliably.
It's so incredibly obvious that this makes zero sense! And yet I don't
think you've ever engaged with such basic points as that one.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 09:20:57 -0500, Robert Haas wrote:
> On Wed, Jan 25, 2023 at 10:56 PM Andres Freund  wrote:
> > but that's only true because page level freezing neutered
> > vacuum_freeze_min_age. Compared to <16, it's a *huge* change.
> 
> Do you think that page-level freezing
> (1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
> committed?

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
if (pagefrz.freeze_required || tuples_frozen == 0 ||
(prunestate->all_visible && prunestate->all_frozen &&
 fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

To me the
  "fpi_before != pgWalUsage.wal_fpi"
part doesn't make a whole lot of sense. For one, it won't at all work if
full_page_writes=off. But more importantly, it also means we'll not freeze
when VACUUMing a recently modified page, even if pruning already emitted a WAL
record and we'd not emit an FPI if we froze the page now.


To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.


A minor complaint I have about the code is that the "tuples_frozen == 0" path
imo is confusing. We go into the "freeze" path, which then inside has another
if for the tuples_frozen == 0 part. I get that this deduplicates the
NewRelFrozenXid handling, but it still looks odd.


> I have always been a bit skeptical of vacuum_freeze_min_age as a
> mechanism. It's certainly true that it is a waste of energy to freeze
> tuples that will soon be removed anyway, but on the other hand,
> repeatedly dirtying the same page for various different freezing and
> visibility related reasons *really sucks*, and even repeatedly reading
> the page because we kept deciding not to do anything yet isn't great.
> It seems possible that the page-level freezing mechanism could help
> with that quite a bit, and I think that the heuristic that patch
> proposes is basically reasonable: if there's at least one tuple on the
> page that is old enough to justify freezing, it doesn't seem like a
> bad bet to freeze all the others that can be frozen at the same time,
> at least if it means that we can mark the page all-visible or
> all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
> deferring as much work as possible to a time when we *can* mark the
> page all-visible or all-frozen.

Agreed. Freezing everything if we need to freeze some things seems quite safe
to me.


> In short, I think that neutering vacuum_freeze_min_age at least to
> some degree might be a good thing, but that's not to say that I'm
> altogether confident in that patch, either.

I am not too worried about the neutering in the page level freezing patch.

The combination of the page level work with the eager strategy is where the
sensibly-more-aggressive freeze_min_age got turbocharged to an imo dangerous
degree.

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Andrew Dunstan


On 2023-01-26 Th 11:16, Jelte Fennema wrote:
> On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan  wrote:
>> I didn't really like your hook, as it forces a reindent, and many people
>> won't want that (for reasons given elsewhere in this thread).
> I'm not sure what you mean by "forces a reindent". Like I explained
> you can simply run "git commit" again to ignore the changes and
> commit anyway. As long as the files are indented on your filesystem
> the hook doesn't care if you actually included the indentation changes
> in the changes that you're currently committing.


Your hook does this:


+git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\
+    xargs src/tools/pgindent/pgindent --silent-diff \
+    || {
+    echo ERROR: Aborting commit because pgindent was not run
+    git diff --cached --name-only --diff-filter=ACMR | grep
'\.[ch]$' | xargs src/tools/pgindent/pgindent
+    exit 1
+    }


At this stage the files are now indented, so if it failed and you run
`git commit` again it will commit with the indentation changes.


>
> So to be completely clear you can do the following with my hook:
> git commit # runs pgindent and fails
> git commit # commits changes anyway
> git commit -am 'Run pgindent' # commit indentation changes separately
>
> Or what I usually do:
> git commit # runs pgindent and fails
> git add --patch # choose relevant changes to add to commit
> git commit # commit the changes
> git checkout -- . # undo irrelevant changes on filesystem
>
> Honestly PGAUTOINDENT=no seems stricter, since the only
> way to bypass the failure is now to run manually run pgindent
> or git commit with the --no-verify flag.
>
>> files=$(git diff --cached --name-only --diff-filter=ACMR)
>> src/tools/pgindent/pgindent $files
> That seems like it would fail if there's any files or directories with
> spaces in them. Maybe this isn't something we care about though.


We don't have any, and the filenames git produces are relative to the
git root. I don't think this is an issue.


>
>> # no need to filter files - pgindent ignores everything that isn't a
>> # .c or .h file
> If the first argument is a non .c or .h file, then pgindent interprets
> it as the typedefs file. So it's definitely important to filter non .c
> and .h files out. Because now if you commit a single
> non .c or .h file this hook messes up the indentation in all of
> your files. You can reproduce by running:
> src/tools/pgindent/pgindent README



I have a patch at [1] to remove this misfeature.


>
>> # only do this on master
>> test  "$branch" = "master" || return 0
> I would definitely want a way to disable this check. As a normal
> submitter I never work directly on master.


Sure, that's your choice. My intended audience here is committers, who
of course do work on master.


cheers


andrew


[1]  https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> I think it's probably ok, but perhaps deserves a bit more thought about when
> to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
> now.
>
> With "opportunistic freezing" I mean freezing the page, even though we don't
> *have* to freeze any of the tuples.
>
> The overall condition gating freezing is:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
>  fpi_before != pgWalUsage.wal_fpi))
>
> fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

> To me a condition that checked if the buffer is already dirty and if another
> XLogInsert() would be likely to generate an FPI would make more sense. The
> rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

--
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Jelte Fennema
> At this stage the files are now indented, so if it failed and you run
> `git commit` again it will commit with the indention changes.

No, because at no point a "git add" is happening, so the changes
made by pgindent are not staged. As long as you don't run the
second "git commit" with the -a flag the commit will be exactly
the same as you prepared it before.

> Sure, that's your choice. My intended audience here is committers, who
> of course do work on master.

Yes I understand, I meant it would be nice if the script had a environment
variable like PG_COMMIT_HOOK_ALL_BRANCHES (bad name)
for this purpose.

On Thu, 26 Jan 2023 at 17:54, Andrew Dunstan  wrote:
>
>
> On 2023-01-26 Th 11:16, Jelte Fennema wrote:
> > On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan  wrote:
> >> I didn't really like your hook, as it forces a reindent, and many people
> >> won't want that (for reasons given elsewhere in this thread).
> > I'm not sure what you mean by "forces a reindent". Like I explained
> > you can simply run "git commit" again to ignore the changes and
> > commit anyway. As long as the files are indented on your filesystem
> > the hook doesn't care if you actually included the indentation changes
> > in the changes that you're currently committing.
>
>
> Your hook does this:
>
>
> +git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\
> +xargs src/tools/pgindent/pgindent --silent-diff \
> +|| {
> +echo ERROR: Aborting commit because pgindent was not run
> +git diff --cached --name-only --diff-filter=ACMR | grep
> '\.[ch]$' | xargs src/tools/pgindent/pgindent
> +exit 1
> +}
>
>
> At this stage the files are now indented, so if it failed and you run
> `git commit` again it will commit with the indention changes.
>
>
> >
> > So to be completely clear you can do the following with my hook:
> > git commit # runs pgindent and fails
> > git commit # commits changes anyway
> > git commit -am 'Run pgindent' # commit indentation changes separately
> >
> > Or what I usually do:
> > git commit # runs pgindent and fails
> > git add --patch # choose relevant changes to add to commit
> > git commit # commit the changes
> > git checkout -- . # undo irrelevant changes on filesystem
> >
> > Honestly PGAUTOINDENT=no seems stricter, since the only
> > way to bypass the failure is now to run manually run pgindent
> > or git commit with the --no-verify flag.
> >
> >> files=$(git diff --cached --name-only --diff-filter=ACMR)
> >> src/tools/pgindent/pgindent $files
> > That seems like it would fail if there's any files or directories with
> > spaces in them. Maybe this isn't something we care about though.
>
>
> We don't have any, and the filenames git produces are relative to the
> git root. I don't think this is an issue.
>
>
> >
> >> # no need to filter files - pgindent ignores everything that isn't a
> >> # .c or .h file
> > If the first argument is a non .c or .h file, then pgindent interprets
> > it as the typedefs file. So it's definitely important to filter non .c
> > and .h files out. Because now if you commit a single
> > non .c or .h file this hook messes up the indentation in all of
> > your files. You can reproduce by running:
> > src/tools/pgindent/pgindent README
>
>
>
> I have a patch at [1] to remove this misfeature.
>
>
> >
> >> # only do this on master
> >> test  "$branch" = "master" || return 0
> > I would definitely want a way to disable this check. As a normal
> > submitter I never work directly on master.
>
>
> Sure, that's your choice. My intended audience here is committers, who
> of course do work on master.
>
>
> cheers
>
>
> andrew
>
>
> [1]  https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>




Re: Non-superuser subscription owners

2023-01-26 Thread Jeff Davis
On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote:
> I have no issue with that as a long-term plan. However, I think that
> for right now we should just introduce pg_create_subscription. It
> would make sense to add pg_create_connection in the same patch that
> adds a CREATE CONNECTION command (or whatever exact syntax we end up
> with) -- and that patch can also change CREATE SUBSCRIPTION to
> require
> both privileges where a connection string is specified directly.

I assumed it would be a problem to say that pg_create_subscription was
enough to create a subscription today, and then later require
additional privileges (e.g. pg_create_connection).

If that's not a problem, then this sounds fine with me.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> > I think it's probably ok, but perhaps deserves a bit more thought about when
> > to "opportunistically" freeze. Perhaps to make it *more* aggressive than 
> > it's
> > now.
> >
> > With "opportunistic freezing" I mean freezing the page, even though we don't
> > *have* to freeze any of the tuples.
> >
> > The overall condition gating freezing is:
> > if (pagefrz.freeze_required || tuples_frozen == 0 ||
> > (prunestate->all_visible && prunestate->all_frozen &&
> >  fpi_before != pgWalUsage.wal_fpi))
> >
> > fpi_before is set before the heap_page_prune() call.
> 
> Have you considered page-level checksums, and how the impact on hint
> bits needs to be accounted for here?
> 
> All RDS customers use page-level checksums. And I've noticed that it's
> very common for the number of FPIs to only be very slightly less than
> the number of pages dirtied. Much of which is just hint bits. The
> "fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?


> > To me a condition that checked if the buffer is already dirty and if another
> > XLogInsert() would be likely to generate an FPI would make more sense. The
> > rare race case of a checkpoint starting concurrently doesn't matter IMO.
> 
> That's going to be very significantly more aggressive. For example
> it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
is marginal

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.


A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.
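
Put as code, the a)/b1)/b2) variant above would gate roughly like the sketch
below. The booleans are stand-ins the caller would have to compute (this is
not the real lazy_scan_prune() interface), and keeping the existing
all-visible/all-frozen requirement is an assumption of the sketch:

#include <stdbool.h>

static bool
opportunistic_freeze_ok(bool freeze_required,
                        bool all_visible_and_frozen,
                        bool would_emit_fpi,            /* a)  */
                        bool page_dirty_before_prune,   /* b1) */
                        bool pruning_wrote_wal)         /* b2) */
{
    if (freeze_required)
        return true;                    /* no choice: must freeze */
    if (!all_visible_and_frozen)
        return false;                   /* keep the existing requirement */
    return !would_emit_fpi &&
           (page_dirty_before_prune || pruning_wrote_wal);
}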

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund




Re: fix and document CLUSTER privileges

2023-01-26 Thread Nathan Bossart
On Wed, Jan 25, 2023 at 08:27:57PM -0800, Jeff Davis wrote:
> Committed these extra clarifications. Thank you.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: meson oddities

2023-01-26 Thread Andres Freund
On 2023-01-26 10:20:58 +0100, Peter Eisentraut wrote:
> On 19.01.23 21:45, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-01-19 21:37:15 +0100, Peter Eisentraut wrote:
> > > On 11.01.23 12:05, Peter Eisentraut wrote:
> > > > I think there is also an adjacent issue:  The subdir options may be
> > > > absolute or relative.  So if you specify --prefix=/usr/local and
> > > > --sysconfdir=/etc/postgresql, then
> > > > 
> > > >       config_paths_data.set_quoted('SYSCONFDIR', dir_prefix / 
> > > > dir_sysconf)
> > > > 
> > > > would produce something like /usr/local/etc/postgresql.
> > 
> > I don't think it would. The / operator understands absolute paths and 
> > doesn't
> > add the "first component" if the second component is absolute.
> 
> Oh, that is interesting.  In that case, this is not the right patch.  We
> should proceed with my previous patch in [0] then.

WFM.

I still think it'd be slightly more legible if we tested the prefix for
postgres|pgsql once, rather than do the per-variable .contains() checks on the
"combined" path. But it's a pretty minor difference, and I'd have no problem
with you committing your version.

Basically:
is_pg_prefix = dir_prefix.contains('pgsql') or dir_prefix.contains('postgres')
...
if not (is_pg_prefix or dir_data.contains('pgsql') or dir_data.contains('postgres'))

instead of "your":

if not ((dir_prefix/dir_data).contains('pgsql') or (dir_prefix/dir_data).contains('postgres'))

Greetings,

Andres Freund




Re: Add LZ4 compression in pg_dump

2023-01-26 Thread Justin Pryzby
On Wed, Jan 25, 2023 at 07:57:18PM +, gkokola...@pm.me wrote:
> On Wednesday, January 25th, 2023 at 7:00 PM, Justin Pryzby 
>  wrote:
> > While looking at this, I realized that commit 5e73a6048 introduced a
> > regression:
> > 
> > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
> > 
> > - if (AH->compression != 0)
> > 
> > - pg_log_warning("archive is compressed, but this installation does not 
> > support compression -- no data will be available");
> > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > 
> > + pg_fatal("archive is compressed, but this installation does not support 
> > compression");
> > 
> > Before, it was possible to restore non-data chunks of a dump file, even
> > if the current build didn't support its compression. But that's now
> > impossible - and it makes the code we're discussing in RestoreArchive()
> > unreachable.

On Thu, Jan 26, 2023 at 08:53:28PM +0900, Michael Paquier wrote:
> On Thu, Jan 26, 2023 at 11:24:47AM +, gkokola...@pm.me wrote:
> > I gave this a little bit of thought. I think that ReadHead should not
> > emit a warning, or at least not this warning as it is slightly misleading.
> > It implies that it will automatically turn off data restoration, which is
> > false. Further ahead, the code will fail with a conflicting error message
> > stating that the compression is not available.
> > 
> > Instead, it would be cleaner both for the user and the maintainer to
> > move the check in RestoreArchive and make it the sole responsible for
> > this logic.
> 
> -pg_fatal("cannot restore from compressed archive (compression not 
> supported in this installation)");
> +pg_fatal("cannot restore data from compressed archive (compression not 
> supported in this installation)");
> Hmm.  I don't mind changing this part as you suggest.
> 
> -#ifndef HAVE_LIBZ
> -   if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> -   pg_fatal("archive is compressed, but this installation does 
> not support compression");
> -#endif
> However I think that we'd better keep the warning, as it can offer a
> hint when using pg_restore -l not built with compression support if
> looking at a dump that has been compressed.

Yeah.  But the original log_warning text was better, and should be
restored:

-   if (AH->compression != 0)
-   pg_log_warning("archive is compressed, but this installation 
does not support compression -- no data will be available");

That commit also added this to pg-dump.c:

+   case PG_COMPRESSION_ZSTD:
+   pg_fatal("compression with %s is not yet supported", 
"ZSTD");
+   break;
+   case PG_COMPRESSION_LZ4:
+   pg_fatal("compression with %s is not yet supported", 
"LZ4");
+   break;

In 002, that could be simplified by re-using the supports_compression()
function.  (And maybe the same in WriteDataToArchive()?)

-- 
Justin




Re: drop postmaster symlink

2023-01-26 Thread Karl O. Pinc
On Wed, 25 Jan 2023 18:03:25 -0600
"Karl O. Pinc"  Buried in
> https://www.postgresql.org/message-id/20230107165942.748ccf4e%40slate.karlpinc.com
> is the one change I see that should be made.
> 
> > In doc/src/sgml/ref/allfiles.sgml at line 222 there is an ENTITY
> > defined which references the deleted postmaster.sgml file.  
> 
> This line needs to be removed and the
> 0002-Don-t-install-postmaster-symlink-anymore.patch 
> updated.  (Unless there's some magic going on
> with the various allfiles.sgml files of which I am
> not aware.)
> 
> If this is fixed I see no other problems.

Buried in the same email, and I apologize for not mentioning
this, is one other bit of documentation text that might
or might not need attention. 

> I see a possible problem at line 1,412 of runtime.sgml

This says:

 in the postmaster's startup script just before invoking the postmaster.

Depending on how this is read, it could be interpreted to mean
that a "postmaster" binary is invoked.  It might be more clear
to write: ... just before invoking postgres.

Or it might not be worth bothering about; at this point, probably
not, but I thought you might want the heads-up anyway.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> I assume the case you're thinking of is that pruning did *not* do any changes,
> but in the process of figuring out that nothing needed to be pruned, we did a
> MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

> > That's going to be very significantly more aggressive. For example
> > it'll impact small tables very differently.
>
> Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> relatively small, with one important exception below, if we are 99.99% sure
> that it's not going to require an FPI and isn't going to dirty the page.
>
> The exception is that a newer LSN on the page can cause the ringbuffer
> replacement to trigger more more aggressive WAL flushing. No meaningful
> difference if we modified the page during pruning, or if the page was already
> in s_b (since it likely won't be written out via the ringbuffer in that case),
> but if checksums are off and we just hint-dirtied the page, it could be a
> significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

> Thus a modification of the above logic could be to opportunistically freeze if
> a ) it won't cause an FPI and either
> b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> replacement in that case
> or
> b2) We wrote a WAL record during pruning, as the difference in flush position
> is marginal
>
> An even more aggressive version would be to replace b1) with logic that'd
> allow newly dirtying the page if it wasn't read through the ringbuffer. But
> newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

> A less aggressive version would be to check if any WAL records were emitted
> during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> modified the page again. Similar to what we do now, except not requiring an
> FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

> But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> wal_log_hint bits is on, than without them. Which I think is how using either
> of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

-- 
Peter Geoghegan




Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Tomas Vondra



On 1/26/23 16:40, Andres Freund wrote:
> Hi,
> 
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how could it cause deadlocks, as we're not waiting
>> for a lock/resource locked by someone else, but it's certainly an issue
>> for uninterruptible hangs.
> 
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
> lock-ordering rules.
> 

Not sure what lock ordering issues you have in mind, but I agree it's
not the right place for the sleep, no argument here.

> 
>>> My best idea for how to implement this in a somewhat safe way would be for
>>> XLogInsertRecord() to set a flag indicating that we should throttle, and set
>>> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
>>> to
>>> proceed (i.e. we'll not be in a critical / interrupts off section) can
>>> actually perform the delay.  That should fix the hard deadlock danger and
>>> remove most of the increase in lock contention.
>>>
>>
>> The solution I've imagined is something like autovacuum throttling - do
>> some accounting of how much "WAL bandwidth" each process consumed, and
>> then do the delay/sleep in a suitable place.
> 
> Where would such a point be, except for ProcessInterrupts()? Iteratively
> adding a separate set of "wal_delay()" points all over the executor,
> commands/, ... seems like a bad plan.
> 

I haven't thought about where to do the work, TBH. ProcessInterrupts()
may very well be a good place.

I should have been clearer, but the main benefit of autovacuum-like
throttling is IMHO that it involves all processes and a global limit,
while the current approach is per-backend.

> 
>>> I don't think doing an XLogFlush() of a record that we just wrote is a good
>>> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
>>> we're doing. To make matters worse, this will often cause partially filled 
>>> WAL
>>> pages to be flushed out - rewriting a WAL page multiple times can
>>> significantly increase overhead on the storage level. At the very least 
>>> this'd
>>> have to flush only up to the last fully filled page.
>>>
>>
>> I don't see why would that significantly increase the number of flushes.
>> The goal is to do this only every ~1MB of a WAL or so, and chances are
>> there were many (perhaps hundreds or more) commits in between. At least
>> in a workload with a fair number of OLTP transactions (which is kinda
>> the point of all this).
> 
> Because the accounting is done separately in each process. Even if you just
> add a few additional flushes for each connection, in aggregate that'll be a
> lot.
> 

How come? Imagine the backend does flush only after generating e.g. 1MB
of WAL. Small transactions won't do any additional flushes at all
(because commit resets the accounting). Large transactions will do an
extra flush every 1MB, so 16x per WAL segment. But in between there will
be many commits from the small transactions. If we backoff to the last
complete page, that should eliminate even most of these flushes.

So where would the additional flushes come from?

> 
>> And the "small" OLTP transactions don't really do any extra fsyncs,
>> because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the 
>>> overhead
>>> even worse, as the flush will be triggered even for OLTP sessions doing tiny
>>> transactions, even though they don't contribute to the problem you're trying
>>> to address.  How about counting how many bytes of WAL a backend has inserted
>>> since the last time that backend did an XLogFlush()?
>>>
>>
>> No, we should reset the counter at commit, so small OLTP transactions
>> should not reach really trigger this. That's kinda the point, to only
>> delay "large" transactions producing a lot of WAL.
> 
> I might have missed something, but I don't think the first version of patch
> resets the accounting at commit?
> 

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test
the behavior / demonstrate the effect.

> 
>> Same for the flushes of partially flushed pages - if there's enough of
>> small OLTP transactions, we're already having this issue. I don't see
>> why would this make it measurably worse.
> 
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
> could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
> calls to partial boundaries by every autovacuum, by every process doing a bit
> more bulky work. Because you're flushing the LSN after a record you just
> inserted, you're commonly not going to be "joining" the work of an already
> in-process XLogFlush().
> 

Right. We do ~16 additional flushes per 16MB segment (or something like
that, depending on the GUC value). Considering we do thousand of commits
per segment, each of which does a flush, I don't see how would this make
it measurably worse.

> 
>> But if needed, we can simply backoff to the last page boundary, s

Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
I wrote:
>>> It'd probably be reasonable to file down that sharp edge by instead
>>> specifying that TimestampDifferenceMilliseconds will clamp overflowing
>>> differences to LONG_MAX.  Maybe there should be a clamp on the underflow
>>> side too ... but should it be to LONG_MIN or to zero?

After looking closer, I see that TimestampDifferenceMilliseconds
already explicitly states that its output is intended for WaitLatch
and friends, which makes it perfectly sane for it to clamp the result
to [0, INT_MAX] rather than depending on the caller to not pass
out-of-range values.

I checked existing callers, and found one place in basebackup_copy.c
that had not read the memo about TimestampDifferenceMilliseconds
never returning a negative value, and another in postmaster.c that
had not read the memo about Min() and Max() being macros.  There
are none that are unhappy about clamping to INT_MAX, and at least
one that was already assuming it could just cast the result to int.

Hence, I propose the attached.  I haven't gone as far as to change
the result type from long to int; that seems like a lot of code
churn (if we are to update WaitLatch and all callers to match)
and it won't really buy anything semantically.
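
For anyone skimming, a standalone sketch of the clamping semantics described
above; the type and the infinity constant are stand-ins, and rounding up to
whole milliseconds is an assumption of this sketch (so that callers such as
walreceiver don't wake up early and spin):

#include <limits.h>
#include <stdint.h>

typedef int64_t TimestampTz;            /* microseconds, as in the backend */
#define TS_INFINITY INT64_MAX           /* stand-in for TIMESTAMP_INFINITY */

static long
clamped_diff_ms(TimestampTz start_time, TimestampTz stop_time)
{
    TimestampTz diff;

    if (stop_time <= start_time)
        return 0;                       /* clamp negative results to zero */
    if (stop_time == TS_INFINITY)
        return (long) INT_MAX;          /* "wait forever" maps to INT_MAX */

    diff = stop_time - start_time;
    if (diff >= (TimestampTz) INT_MAX * 1000)
        return (long) INT_MAX;          /* clamp overflow to INT_MAX ms */
    return (long) ((diff + 999) / 1000);/* round up to whole milliseconds */
}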

regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
 		 * the system clock was set backward, so that such occurrences don't
 		 * have the effect of suppressing further progress messages.
 		 */
-		if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+		if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+			now < mysink->last_progress_report_time)
 		{
 			mysink->last_progress_report_time = now;
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)
 
 	if (next_wakeup != 0)
 	{
-		/* Ensure we don't exceed one minute, or go under 0. */
-		return Max(0,
-   Min(60 * 1000,
-	   TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-	   next_wakeup)));
+		int			ms;
+
+		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+   next_wakeup);
+		return Min(60 * 1000, ms);
 	}
 
 	return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 pgsocket	wait_fd = PGINVALID_SOCKET;
 int			rc;
 TimestampTz nextWakeup;
-int			nap;
+long		nap;
 
 /*
  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 	nextWakeup = Min(wakeup[i], nextWakeup);
 
-/*
- * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
- * timeouts longer than INT_MAX milliseconds, so we limit the
- * result accordingly.  Also, we round up to the next
- * millisecond to avoid waking up too early and spinning until
- * one of the wakeup times.
- */
+/* Calculate the nap time, clamping as necessary. */
 now = GetCurrentTimestamp();
-nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+nap = TimestampDifferenceMilliseconds(now, nextWakeup);
 
 /*
  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..b1d1963729 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,12 +1690,12 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * At least one input must be an ordinary finite timestamp, else the "diff"
+ * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
+ * which will result in INT_MAX wait time.
  *
  * We expect start_time <= stop_tim

Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
Thanks for taking a look.

On Thu, Jan 26, 2023 at 10:07:39AM +0100, Alvaro Herrera wrote:
> Please use 
>   errdetail("You must have %s privilege to create roles with %s.",
>   "SUPERUSER", "SUPERUSER")));
> 
> in this kind of message where multiple copies appear that only differ in
> the keyword to use, to avoid creating four copies of essentially the
> same string.
> 
> This applies in several places.

I did this in v2.

>> - errmsg("must have createdb privilege 
>> to change createdb attribute")));
>> + errmsg("permission denied to alter 
>> role"),
>> + errhint("You must have CREATEDB 
>> privilege to alter roles with CREATEDB.")));
> 
> I think this one is a bit ambiguous; does "with" mean that roles that
> have that priv cannot be changed, or does it mean that you cannot meddle
> with that bit in particular?  I think it'd be better to say
>   "You must have %s privilege to change the %s attribute."
> or something like that.

Yeah, it's probably better to say "to alter roles with %s" to refer to
roles that presently have the attribute and "to change the %s attribute"
when referring to privileges for the attribute.  I did this in v2, too.

I've also switched from errhint() to errdetail() as suggested by Tom.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a6db1f28f0e7079193af4fca3fde27cf4780dbb7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 26 Jan 2023 11:05:13 -0800
Subject: [PATCH v2 1/1] Improve user.c error messages.

---
 src/backend/commands/user.c   | 171 --
 src/test/regress/expected/create_role.out |  77 ++
 src/test/regress/expected/dependency.out  |   4 +
 src/test/regress/expected/privileges.out  |  23 ++-
 4 files changed, 195 insertions(+), 80 deletions(-)

diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index 3a92e930c0..e2c80f5060 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
 		if (!has_createrole_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied to create role")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles.",
+			   "CREATEROLE")));
 		if (issuper)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must be superuser to create superusers")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "SUPERUSER", "SUPERUSER")));
 		if (createdb && !have_createdb_privilege())
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have createdb permission to create createdb users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "CREATEDB", "CREATEDB")));
 		if (isreplication && !has_rolreplication(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have replication permission to create replication users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "REPLICATION", "REPLICATION")));
 		if (bypassrls && !has_bypassrls_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have bypassrls to create bypassrls users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "BYPASSRLS", "BYPASSRLS")));
 	}
 
 	/*
@@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	roleid = authform->oid;
 
 	/* To mess with a superuser in any way you gotta be superuser. */
-	if (!superuser() && (authform->rolsuper || dissuper))
+	if (!superuser() && authform->rolsuper)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to alter superuser roles or change superuser attribute")));
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to alter roles with %s.",
+		   "SUPERUSER", "SUPERUSER")));
+	if (!superuser() && dissuper)
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to change the %s attribute.",
+		   "SUPERUSER", "SUPERUSER")));
 
 	/*
 	 * Most changes to a role require that you both have CREATEROLE privileges
@@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	{
 		/* things an unprivileged user certainly can't do */
 		if (dinherit || dcreaterole || dcreatedb || dcanlogin || dc

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Matthias van de Meent
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan  wrote:
>
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > I assume the case you're thinking of is that pruning did *not* do any changes,
> > but in the process of figuring out that nothing needed to be pruned, we did a
> > MarkBufferDirtyHint(), and as part of that emitted an FPI?
>
> Yes.
>
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

-Matthias
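
To make the idea above concrete, here is a purely hypothetical sketch of a PRUNE-style record that can optionally carry freeze plans; the struct name, flag bit, and layout are invented for illustration and are not the actual xl_heap_prune definition:

#include "postgres.h"
#include "access/transam.h"

/*
 * Hypothetical sketch only: a PRUNE-style WAL record that can optionally
 * carry freeze plans, so a page pruned and frozen in the same VACUUM pass
 * needs one record instead of two.  All names here are invented.
 */
#define XLHP_HAS_FREEZE_PLANS	0x01	/* freeze plan array follows */

typedef struct xl_heap_prune_and_freeze
{
	TransactionId snapshotConflictHorizon;	/* conflict horizon for hot standby */
	uint16		nredirected;	/* number of redirected line pointers */
	uint16		ndead;			/* number of now-dead line pointers */
	uint16		nplans;			/* number of freeze plans (0 if none) */
	uint8		flags;			/* XLHP_* flags */

	/*
	 * Offset-number arrays follow; if XLHP_HAS_FREEZE_PLANS is set, an array
	 * of freeze plans plus the offsets each plan applies to follows as well.
	 */
} xl_heap_prune_and_freeze;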




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan  wrote:
> You complained about the descriptions being theoretical. But there's
> nothing theoretical about the fact that we more or less do *all*
> freezing in an eventual aggressive VACUUM in many important cases,
> including very simple cases like pgbench_history -- the simplest
> possible append-only table case. We'll merrily rewrite the entire
> table, all at once, for no good reason at all. Consistently, reliably.
> It's so incredibly obvious that this makes zero sense! And yet I don't
> think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think
that what this patch does is make vacuum more aggressive, and I
expect that would help this problem. I haven't said much about that
because I don't think it's controversial. However, the patch also has
a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain. There's nothing wonderful
about the current behavior, but (to paraphrase what I think Andres has
already said several times) it's better to keep shipping code with the
same bad behavior than to put out a new major release with behaviors
that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that
you're a bad person, or a bad developer, or that you have no good
ideas, or that you have no good ideas about this topic, or that this
topic is not important, or that we don't need to do better than we are
currently doing. I think none of those things. However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.
I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it.  It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively. If we knew that the
table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than
letting any single vacuum do all of it, but we don't have that sort of
information. Also, even if we did have that sort of information, the
idea only works if the pages that we freeze sooner are ones that we're
not about to update or delete again, and we don't have any idea what
is likely there. In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: improving user.c error messages

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 2:14 PM Nathan Bossart  wrote:
> Yeah, it's probably better to say "to alter roles with %s" to refer to
> roles that presently have the attribute and "to change the %s attribute"
> when referring to privileges for the attribute.  I did this in v2, too.
>
> I've also switched from errhint() to errdetail() as suggested by Tom.

This seems fine to me in general but I'm not entirely sure about this part:

@@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
  {
  /* things an unprivileged user certainly can't do */
  if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
- dvalidUntil || disreplication || dbypassRLS)
+ dvalidUntil || disreplication || dbypassRLS ||
+ (dpassword && roleid != currentUserId))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("permission denied")));
-
- /* an unprivileged user can change their own password */
- if (dpassword && roleid != currentUserId)
- ereport(ERROR,
- (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must have CREATEROLE privilege to change another user's password")));
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege and %s on role \"%s\".",
+"CREATEROLE", "ADMIN OPTION", rolename)));
  }
  else if (!superuser())
  {

Basically my question is whether having one error message for all of
those cases is good enough, or whether we should be trying harder. I
don't mind if the conclusion is that it's OK as-is, and I'm not
entirely sure what would be better. But when I was working on this
code, all of those cases OR'd together feeding into a single error
message seemed a little sketchy to me, so I am wondering what others
think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

$ cat /home/postgres/tmp/temp_config
#default_toast_compression = lz4
wal_compression = lz4
$ export TEMP_CONFIG=/home/postgres/tmp/temp_config
$ cd ~/pgsql/src/test/recovery
$ make check PROVE_TESTS=t/011_crash_recovery.pl
...
+++ tap check in src/test/recovery +++
t/011_crash_recovery.pl .. 1/? 
#   Failed test 'new xid after restart is greater'
#   at t/011_crash_recovery.pl line 53.
# '729'
# >
# '729'

#   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
#  got: 'committed'
# expected: 'aborted'
# Looks like you failed 2 tests of 3.

Maybe this is somehow the test script's fault, but I don't see how.

It fails the same way with 'wal_compression = pglz', so I think it's
generic to that whole feature rather than specific to LZ4.

regards, tom lane




Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-26 Thread Drouvot, Bertrand

Hi,

On 1/26/23 10:42 AM, Alvaro Herrera wrote:

On 2023-Jan-26, Drouvot, Bertrand wrote:


On 1/24/23 7:27 PM, Alvaro Herrera wrote:



1. I don't think wait_for_write_catchup is necessary, because
calling wait_for_catchup() and omitting the 'mode' and 'lsn' arguments
would already do the same thing. 


Having a closer look, it does not seem to be the case. The default mode
in wait_for_catchup() is 'replay' and the default mode for the lsn is 'write'.

But in wait_for_write_catchup() we are making use of 'write' for both.




2. Because wait_for_replay_catchup is an instance method, passing the
second node as argument is needlessly noisy, because that's already
known as $self.  So we can just say

$primary_node->wait_for_replay_catchup($standby_node);


Yeah, but same here, there are places where the node passed as the second
argument is not the "$self":

src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_replay_catchup('node_c', $node_a);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);
src/test/recovery/t/001_stream_rep.pl:  $node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);

So it looks like there is still a need for wait_for_replay_catchup() with 2 parameters.


Ah, cascading replication.  In that case, let's make the second
parameter optional.  If it's not given, $self is used.



Good point, done in V3 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 52644c2c0d..7236a3b177 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -74,9 +74,8 @@ $node_a->safe_psql('postgres',
"INSERT INTO tbl1 values ('in A, before promotion')");
 $node_a->safe_psql('postgres', 'CHECKPOINT');
 
-my $lsn = $node_a->lsn('write');
-$node_a->wait_for_catchup('node_b', 'write', $lsn);
-$node_b->wait_for_catchup('node_c', 'write', $lsn);
+$node_a->wait_for_write_catchup('node_b', $node_a);
+$node_b->wait_for_write_catchup('node_c', $node_a);
 
 # Promote C
 #
@@ -160,7 +159,7 @@ in A, after C was promoted
 $node_a->safe_psql('postgres',
"INSERT INTO tbl1 values ('in A, after rewind')");
 
-$node_b->wait_for_catchup('node_c', 'replay', $node_a->lsn('write'));
+$node_b->wait_for_replay_catchup('node_c', $node_a);
 
 check_query(
'SELECT * FROM tbl1',
diff --git a/src/test/modules/brin/t/02_wal_consistency.pl b/src/test/modules/brin/t/02_wal_consistency.pl
index 5983ef208e..8b2b244feb 100644
--- a/src/test/modules/brin/t/02_wal_consistency.pl
+++ b/src/test/modules/brin/t/02_wal_consistency.pl
@@ -70,6 +70,6 @@ my ($ret, $out, $err) = $whiskey->psql(
});
 cmp_ok($out, '>=', 1);
 
-$whiskey->wait_for_catchup($charlie, 'replay', $whiskey->lsn('insert'));
+$whiskey->wait_for_replay_catchup($charlie);
 
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 04921ca3a3..3ba1545688 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2709,6 +2709,45 @@ sub wait_for_catchup
return;
 }
 
+# Now defining helper functions wait_for_replay_catchup() and
+# wait_for_write_catchup().
+# Please note that wait_for_flush_catchup() and wait_for_sent_catchup() are not
+# defined as: 1) they are not used yet and 2) it lets their author (if any)
+# decide the node->lsn() mode to be used.
+
+=pod
+
+=item $node->wait_for_replay_catchup(standby_name, node)
+
+Helper function for wait_for_catchup when waiting for the replay_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_replay_catchup
+{
+   my ($self, $standby_name, $node) = @_;
+   $node = defined($node) ? $node : $self;
+
+   $self->wait_for_catchup($standby_name, 'replay', $node->lsn('flush'));
+}
+
+=pod
+
+=item $node->wait_for_write_catchup(standby_name, node)
+
+Helper function for wait_for_catchup when waiting for the write_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_write_catchup
+{
+   my ($self, $standby_name, $node) = @_;
+
+   $self->wait_for_catchup($standby_name, 'write', $node->lsn('write'));
+}
+
 =pod
 
 =item $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 23a90dd85b..76846905a7 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -47,9 +47,8 @@ $node_primary->safe_psql('postgres',
"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-my $primary_lsn = $node_primary->lsn('write');
-$node_primary->wait_for_catchup($node_standby_1, 'replay', $pri

Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
> After looking closer, I see that TimestampDifferenceMilliseconds
> already explicitly states that its output is intended for WaitLatch
> and friends, which makes it perfectly sane for it to clamp the result
> to [0, INT_MAX] rather than depending on the caller to not pass
> out-of-range values.

+1

>   * This is typically used to calculate a wait timeout for WaitLatch()
>   * or a related function.  The choice of "long" as the result type
> - * is to harmonize with that.  It is caller's responsibility that the
> - * input timestamps not be so far apart as to risk overflow of "long"
> - * (which'd happen at about 25 days on machines with 32-bit "long").
> + * is to harmonize with that; furthermore, we clamp the result to at most
> + * INT_MAX milliseconds, because that's all that WaitLatch() allows.
>   *
> - * Both inputs must be ordinary finite timestamps (in current usage,
> - * they'll be results from GetCurrentTimestamp()).
> + * At least one input must be an ordinary finite timestamp, else the "diff"
> + * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
> + * which will result in INT_MAX wait time.

I wonder if we should explicitly reject negative timestamps to eliminate
any chance of int64 overflow, too.  Alternatively, we could detect that the
operation will overflow and return either 0 or INT_MAX, but I assume
there's minimal use of this function with negative timestamps.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas  wrote:
> I think it's pretty much impossible to freeze more aggressively
> without losing in some scenario or other. If waiting longer to freeze
> would have resulted in the data getting updated again or deleted
> before we froze it, then waiting longer reduces the total amount of
> freezing work that ever has to be done. Freezing more aggressively
> inevitably gives up some amount of that potential benefit in order to
> try to secure some other benefit. It's a trade-off.

There is no question about that.

> I think that the goal of a patch that makes vacuum more (or less)
> aggressive should be to make the cases where we lose as obscure as
> possible, and the cases where we win as broad as possible. I think
> that, in order to be a good patch, it needs to be relatively difficult
> to find cases where we incur a big loss. If it's easy to find a big
> loss, then I think it's better to stick with the current behavior,
> even if it's also easy to find a big gain.

Again, this seems totally uncontroversial. It's just incredibly vague,
and not at all actionable.

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

> However, I'm also not
> prepared to go all the way to the other end of the spectrum and say
> that all of your ideas and everything in this patch are great. I don't
> think either of those things, either.

It doesn't matter. I'm done with it. This is not a negotiation about
what gets in and what doesn't get in.

All that I aim to do now is to draw some kind of line under the basic
page-level freezing work, since of course I'm still responsible for
that. And perhaps to defend my personal reputation.

> I certainly think that freezing more aggressively in some scenarios
> could be a great idea, but it seems like the patch's theory is to be
> very nearly maximally aggressive in every vacuum run if the table size
> is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point --
that's its purpose. That is, we'll avoid accumulating all-visible
pages that eventually need to be frozen.

> I'm not exactly sure what information we should use to decide how
> aggressive to be, but I am pretty sure that the size of the table is
> not it.  It's true that, for a small table, the cost of having to
> eventually vacuum the whole table at once isn't going to be very high,
> whereas for a large table, it will be. That line of reasoning makes a
> size threshold sound reasonable. However, the amount of extra work
> that we can potentially do by vacuuming more aggressively *also*
> increases with the table size, which to me means using that a
> criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work. Unfortunately, I
> don't really know how to do that effectively.

That has been proposed a couple of times in the context of this
thread. It won't work, because the way autovacuum works in general
(and likely always will work) doesn't allow it. With an append-only
table, each VACUUM will naturally have to scan significantly more
pages than the last one, forever (barring antiwraparound vacuums). Why
wouldn't it continue that way? I mean it might not (the table might
stop growing altogether), but then it doesn't matter much what we do.

If you're not behaving very proactively at the level of each VACUUM
operation, then the picture over time is that you're *already* falling
behind. At least with an append-only table. You have to think of the
sequence of operations, not just one.

> In theory we could have some system that tracks how
> recently each page range in a table has been modified, and direct our
> freezing activity toward the ones less-recently modified on the theory
> that they're not so likely to be modified again in the near future,
> but in reality we have no such system. So I don't really feel like I
> know what the right answer is here, yet.

So we need to come up with a way of getting reliable information from
the future, about an application that we have no particular
understanding of. As opposed to just eating the cost to some degree,
and making it configurable.

-- 
Peter Geoghegan




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
> @@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
>   {
>   /* things an unprivileged user certainly can't do */
>   if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
> - dvalidUntil || disreplication || dbypassRLS)
> + dvalidUntil || disreplication || dbypassRLS ||
> + (dpassword && roleid != currentUserId))
>   ereport(ERROR,
>   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> - errmsg("permission denied")));
> -
> - /* an unprivileged user can change their own password */
> - if (dpassword && roleid != currentUserId)
> - ereport(ERROR,
> - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> - errmsg("must have CREATEROLE privilege to change another user's 
> password")));
> + errmsg("permission denied to alter role"),
> + errdetail("You must have %s privilege and %s on role \"%s\".",
> +"CREATEROLE", "ADMIN OPTION", rolename)));
>   }
>   else if (!superuser())
>   {
> 
> Basically my question is whether having one error message for all of
> those cases is good enough, or whether we should be trying harder. I
> don't mind if the conclusion is that it's OK as-is, and I'm not
> entirely sure what would be better. But when I was working on this
> code, all of those cases OR'd together feeding into a single error
> message seemed a little sketchy to me, so I am wondering what others
> think.

I wondered the same thing, but I hesitated because I didn't want to change
too much in a patch for error messaging.  I can give it a try.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
>> - * Both inputs must be ordinary finite timestamps (in current usage,
>> - * they'll be results from GetCurrentTimestamp()).
>> + * At least one input must be an ordinary finite timestamp, else the "diff"
>> + * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
>> + * which will result in INT_MAX wait time.

> I wonder if we should explicitly reject negative timestamps to eliminate
> any chance of int64 overflow, too.

Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
but I take your point that the subtraction would overflow with
TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
make use of pg_sub_s64_overflow()?

BTW, I just noticed that the adjacent function TimestampDifference
has a lot of callers that would be much happier using
TimestampDifferenceMilliseconds.

regards, tom lane
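
As an illustration of that caller simplification, here is a hedged sketch; the function and variable names (elapsed_ms_*, wait_start) are invented and do not correspond to a specific existing call site:

#include "postgres.h"
#include "utils/timestamp.h"

/*
 * Sketch only: the TimestampDifference() + manual arithmetic pattern seen in
 * several callers, next to the equivalent single call to
 * TimestampDifferenceMilliseconds().
 */
static long
elapsed_ms_old_style(TimestampTz wait_start)
{
	long		secs;
	int			microsecs;

	TimestampDifference(wait_start, GetCurrentTimestamp(),
						&secs, &microsecs);
	return secs * 1000 + microsecs / 1000;
}

static long
elapsed_ms_new_style(TimestampTz wait_start)
{
	/* result is already clamped to [0, INT_MAX] for WaitLatch() and friends */
	return TimestampDifferenceMilliseconds(wait_start, GetCurrentTimestamp());
}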




Re: improving user.c error messages

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
>> Basically my question is whether having one error message for all of
>> those cases is good enough, or whether we should be trying harder.

I think the password case needs to be kept separate, because the
conditions for it are different (specifically the exception that
you can alter your own password).  Lumping the rest together
seems OK to me.

regards, tom lane
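
A hedged sketch of keeping the password case separate while lumping the rest together (a fragment in the context of AlterRole(); the errdetail wording here is invented for illustration, not taken from any committed patch):

	/* an unprivileged role may still change its own password */
	if (dpassword && roleid != currentUserId)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied to alter role"),
				 errdetail("To change another role's password, you must have %s privilege and %s on the role.",
						   "CREATEROLE", "ADMIN OPTION")));

	/* remaining attribute changes share one message */
	if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
		dvalidUntil || disreplication || dbypassRLS)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied to alter role"),
				 errdetail("You must have %s privilege and %s on role \"%s\".",
						   "CREATEROLE", "ADMIN OPTION", rolename)));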




Re: Something is wrong with wal_compression

2023-01-26 Thread Justin Pryzby
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> The symptom being exhibited by Michael's new BF animal tanager
> is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression ?

https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com
https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com

https://www.postgresql.org/message-id/ynqwd2gsmrnqw...@paquier.xyz
|My buildfarm machine has been changed to use wal_compression = lz4,
|while on it for HEAD runs.





Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
>> The symptom being exhibited by Michael's new BF animal tanager
>> is perfectly reproducible elsewhere.

> I think these tests have always failed with wal_compression ?

If that's a known problem, and we've done nothing about it,
that is pretty horrid.  That test case is demonstrating fundamental
database corruption after a crash.

regards, tom lane




Re: wrong Append/MergeAppend elision?

2023-01-26 Thread David Rowley
On Fri, 27 Jan 2023 at 01:30, Amit Langote  wrote:
> It seems that the planner currently elides an Append/MergeAppend that
> has run-time pruning info (part_prune_index) set, but which I think is
> a bug.

This is actually how I intended it to work. Whether it was a good idea
or not, I'm currently unsure. I mentioned it in [1].

I think the plan shapes I was talking about were some ordered paths
from partial paths per what is being added right at the end of
add_paths_to_append_rel().  However, now that I look at it again, I'm
not sure why it wouldn't be correct to still have those paths with a
single-child Append. Certainly, the "if (list_length(live_childrels)
== 1)" test made in add_paths_to_append_rel() is no longer aligned to
the equivalent test in set_append_references(), so it's possible even
now that we make a plan that uses the extra sorted partial paths added
in add_paths_to_append_rel() and still have the Append in the final
plan.

There is still the trade-off of having to pull tuples through the
Append node for when run-time pruning is unable to prune the last
partition. So your proposal to leave the Append alone when there's
run-time pruning info is certainly not a no-brainer.

[1] 
https://www.postgresql.org/message-id/cakjs1f_utf1mbp8ueobyaarzio4e4qb4z8fksurpam+3q_h...@mail.gmail.com




Re: Minimal logical decoding on standbys

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:
> - I'm struggling to create a test for btree killtuples as there is a need for
> row removal on the table (that could produce a conflict too):
> Do you have a scenario in mind for this one? (and btw in what kind of WAL
> record should the conflict be detected in such a case? xl_btree_delete?)

Hm, it might indeed be hard in "modern" postgres.  I think you'd need at least
two concurrent sessions, to prevent on-access pruning on the table.


DROP TABLE IF EXISTS indexdel;
CREATE TABLE indexdel(id int8 primary key);
INSERT INTO indexdel SELECT generate_series(1, 1);
VACUUM indexdel; -- ensure hint bits are set etc

DELETE FROM indexdel;

SELECT pg_current_wal_insert_lsn();

SET enable_indexonlyscan = false;
-- This scan finds that the index items are dead - but doesn't yet issue a
-- btree delete WAL record, that only happens when needing space on the page
-- again.
EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC;
SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;

-- The insertions into the range of values prev
INSERT INTO indexdel SELECT generate_series(1, 100);


Does generate the btree deletion record, but it also does emit a PRUNE (from
heapam_index_fetch_tuple() -> heap_page_prune_opt()).

While the session could create a cursor to prevent later HOT cleanup, the
query would also trigger hot pruning (or prevent the rows from being dead, if
you declare the cursor before the DELETE). So you'd need overlapping cursors
in a concurrent session...

Too complicated.

Greetings,

Andres Freund




Re: Something is wrong with wal_compression

2023-01-26 Thread Justin Pryzby
On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> > The symptom being exhibited by Michael's new BF animal tanager
> > is perfectly reproducible elsewhere.
> 
> I think these tests have always failed with wal_compression ?
> 
> https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
> https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com
> https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com

+ https://www.postgresql.org/message-id/c86ce84f-dd38-9951-102f-13a931210f52%40dunslane.net




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> I wonder if we should explicitly reject negative timestamps to eliminate
>> any chance of int64 overflow, too.
> 
> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
> but I take your point that the subtraction would overflow with
> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
> make use of pg_sub_s64_overflow()?

That would be my vote.  I think the 'diff <= 0' check might need to be
replaced with something like 'start_time > stop_time' so that we return 0
for the underflow case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
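
Putting the two suggestions together, a minimal sketch of the clamped computation might look like the following (assuming pg_sub_s64_overflow() from src/include/common/int.h; this illustrates the idea being discussed, not the committed function):

#include "postgres.h"

#include <limits.h>

#include "common/int.h"
#include "utils/timestamp.h"

/*
 * Sketch only: (stop_time - start_time) in milliseconds, clamped to
 * [0, INT_MAX] so the result is always acceptable to WaitLatch() and
 * friends.  pg_sub_s64_overflow() catches stop_time == TIMESTAMP_INFINITY
 * combined with a very negative start_time.
 */
static long
timestamp_diff_ms_sketch(TimestampTz start_time, TimestampTz stop_time)
{
	TimestampTz diff;

	/* Clock went backwards, or no time has elapsed: report zero. */
	if (start_time >= stop_time)
		return 0;

	/* If the subtraction overflows, the result certainly exceeds INT_MAX ms. */
	if (pg_sub_s64_overflow(stop_time, start_time, &diff))
		return (long) INT_MAX;

	/* Round up to the next millisecond, clamping to INT_MAX. */
	if (diff >= (INT_MAX * INT64CONST(1000) - 999))
		return (long) INT_MAX;
	return (long) ((diff + 999) / 1000);
}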




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-01-26 Thread Reid Thompson
On Mon, 2023-01-23 at 12:31 -0800, Andres Freund wrote:
> Hi,
> 
> I think it's basically still waiting on author, until the O(N) cost is gone
> from the overflow limit check.
> 
> Greetings,
> 
> Andres Freund

Yes, just a rebase. There is still work to be done, as discussed earlier in
the thread.

I do want to follow up and note re palloc/pfree vs malloc/free that the
tracking code (0001-Add-tracking-...) is not tracking palloc/pfree but is
explicitly tracking malloc/free. Not every palloc/pfree call executes the
tracking code, only those where the path followed includes malloc() or
free().  Routine palloc() calls fulfilled from the context's
freelist/emptyblocks/freeblock/etc and pfree() calls not invoking free()
avoid the tracking code.

Thanks,
Reid
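
A hypothetical illustration of that point (not code from the patch; the hook name report_backend_malloc() is invented): the accounting sits on the path that actually calls malloc(), e.g. when a memory context must obtain a fresh block, so pallocs served from a context's freelist never reach it.

#include "postgres.h"

static void report_backend_malloc(Size nbytes);		/* hypothetical accounting hook */

static void *
context_alloc_block_sketch(Size blksize)
{
	char	   *block = malloc(blksize);

	if (block != NULL)
		report_backend_malloc(blksize);	/* only malloc()'d bytes are counted */

	return block;
}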





Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent
 wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records? I
> think they're quite closely related (in that they both execute in
> VACUUM and are required for long-term system stability), and are even
> more related now that we have opportunistic page-level freezing. I
> think adding a "freeze this page as well"-flag in PRUNE records would
> go a long way to reducing the WAL overhead of aggressive and more
> opportunistic freezing.

Yeah, we've talked about doing that in the past year. It's quite
possible. It would make quite a lot of sense, because the actual
overhead of the WAL record for freezing tends to come from the generic
WAL record header stuff itself. If there was only one record for both,
then you'd only need to include the relfilenode and block number (and
so on) once.

It would be tricky to handle Multis, so what you'd probably do is just
freeze xmin, and possibly aborted and locker XIDs in xmax. So you
wouldn't completely get rid of the main freeze record, but would be
able to avoid it in many important cases.

-- 
Peter Geoghegan




lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tomas Vondra
Hi,

I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple days, and it seems it got into
an infinite loop in REL_11_STABLE when building hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which does signals etc.

Anyway, in 'ps ax' I saw this:

94545  -  Ss   0:03.39 postgres: buildfarm regression [local] SELECT
94627  -  Is   0:00.03 postgres: parallel worker for PID 94545
94628  -  Is   0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

select final > 1 as multibatch
  from hash_join_batches(
$$
  select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar
b2 using (id)) ss
on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
$$);

This started on 2023-01-20 23:23:18.125, and the next log (after I did
the gdb stuff) is from 2023-01-26 20:05:16.751. Quite a bit of time.

It seems all three processes are doing WaitEventSetWait, either through
a ConditionVariable, or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
more. But I see there's nodeHash and parallelism, and I recall there's a
lot of gotchas due to how the backends cooperate when building the hash
table, etc. Thomas, any idea what might be wrong?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Switching to LWP 100877 of process 94628]
_poll () at _poll.S:4
#0  _poll () at _poll.S:4
No locals.
#1  0x88ef6850 in __thr_poll (fds=0x4, nfds=2, timeout=-1) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:338
curthread = 0x8f09b000
ret = 
#2  0x007d3318 in WaitEventSetWaitBlock (set=0x8f19ceb0, 
cur_timeout=-1, occurred_events=0x80dcba98, nevents=1) at latch.c:1171
returned_events = 0
rc = 
cur_pollfd = 
cur_event = 
errflags = 
#3  WaitEventSetWait (set=0x8f19ceb0, timeout=, 
occurred_events=, occurred_events@entry=0x80dcba98, 
nevents=nevents@entry=1, wait_event_info=, 
wait_event_info@entry=134217745) at latch.c:1000
rc = 
start_time = {tv_sec = 32, tv_nsec = 2401711808}
cur_time = {tv_sec = 2161949248, tv_nsec = 9648676}
cur_timeout = -1
returned_events = 
#4  0x007f2898 in ConditionVariableSleep (cv=cv@entry=0x9c14f350, 
wait_event_info=wait_event_info@entry=134217745) at condition_variable.c:157
event = {pos = -1676348592, events = 0, fd = -2133017936, user_data = 
0x7d3128 }
done = 
#5  0x007cfd60 in BarrierArriveAndWait (barrier=0x9c14f338, 
barrier@entry=0x80dcbbc0, wait_event_info=134217745, wait_event_info@entry=1) 
at barrier.c:191
release = 
start_phase = 5
next_phase = 6
elected = 
#6  0x006b4808 in ExecParallelHashIncreaseNumBuckets 
(hashtable=hashtable@entry=0x8f0e16f0) at nodeHash.c:1572
pstate = 
i = 
chunk = 
chunk_s = 
#7  0x006b2d30 in ExecParallelHashTupleAlloc 
(hashtable=hashtable@entry=0x8f0e16f0, size=40, shared=shared@entry=0x80dcbbc0) 
at nodeHash.c:2795
growth = PHJ_GROWTH_NEED_MORE_BUCKETS
chunk = 
pstate = 0x9c14f2a0
chunk_size = 
chunk_shared = 
result = 
curbatch = 
#8  0x006b27ac in ExecParallelHashTableInsert 
(hashtable=hashtable@entry=0x8f0e16f0, slot=, 
slot@entry=0x8f0e19d8, hashvalue=, hashvalue@entry=3919329229) 
at nodeHash.c:1693
hashTuple = 
tuple = 0x8f0ebd78
batchno = 
shared = 2400066360
bucketno = 
#9  0x006b0ac8 in MultiExecParallelHash (node=0x8f0e0908) at 
nodeHash.c:288
outerNode = 0x8f0e1170
hashtable = 0x8f0e16f0
hashkeys = 0x8f0ebac8
econtext = 0x8f0e1508
pstate = 
build_barrier = 
slot = 0x8f0e19d8
i = 
hashvalue = 
#10 MultiExecHash (node=node@entry=0x8f0e0b70) at nodeHash.c:112
No locals.
#11 0x006a13e4 in MultiExecProcNode (node=node@entry=0x8f0e0908) at 
execProcnode.c:502
result = 
#12 0x006b5b00 in ExecHashJoinImpl (pstate=0x8f0e0b70, parallel=true) 
at nodeHashjoin.c:290
hashNode = 
econtext = 0x8f0e07d8
joinqual = 0x0
otherqual = 0x0
outerNode = 0x8f0e0a58
hashtable = 0x8f0e16f0
node = 0x8f0e0b70
parallel_state = 0x9c14f2a0
hashvalue = 
outerTupleSlot = 
batchno = 
build_barrier = 
#13 ExecParallelHashJoin (pstate=0x8f0e0b70) at nodeHashjoin.c:600
No locals.
#14 0x006a193c in ExecProcNodeInstr (node=0x8f0e0b70) at 
execProcnode.c:462
result = 
#15 0x0069a60c in ExecProcNode (node=0x8f0e0b70) at 
../../../src/include/e

Re: wrong Append/MergeAppend elision?

2023-01-26 Thread Tom Lane
David Rowley  writes:
> On Fri, 27 Jan 2023 at 01:30, Amit Langote  wrote:
>> It seems that the planner currently elides an Append/MergeAppend that
>> has run-time pruning info (part_prune_index) set, but which I think is
>> a bug.

> There is still the trade-off of having to pull tuples through the
> Append node for when run-time pruning is unable to prune the last
> partition. So your proposal to leave the Append alone when there's
> run-time pruning info is certainly not a no-brainer.

Yeah.  Amit's proposal amounts to optimizing for the case that all
partitions get pruned, which does not seem to me to be the way
to bet.  I'm inclined to think it's fine as-is.

regards, tom lane




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
> 
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.


> > Thus a modification of the above logic could be to opportunistically freeze if
> > a) it won't cause an FPI and either
> > b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> > replacement in that case
> > or
> > b2) We wrote a WAL record during pruning, as the difference in flush position
> > is marginal
> >
> > An even more aggressive version would be to replace b1) with logic that'd
> > allow newly dirtying the page if it wasn't read through the ringbuffer. But
> > newly dirtying the page feels like it'd be more dangerous.
> 
> In many cases we'll have to dirty the page anyway, just to set
> PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> triggered by an FPI or triggered by my now-reverted GUC) on being able
> to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hint_bits.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.


> > A less aggressive version would be to check if any WAL records were emitted
> > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > modified the page again. Similar to what we do now, except not requiring an
> > FPI to have been emitted.
> 
> Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.



> > But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> > wal_log_hint bits is on, than without them. Which I think is how using either
> > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> 
> Which part is the odd part? Is it odd that page-level freezing works
> that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.


> In any case this seems like an odd thing for you to say, having
> eviscerated a patch that really just made the same behavior trigger
> independently of FPIs in some tables, controlled via a GUC.

jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj

That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record b) It doesn't trigger
if we would write an FPI.

Greetings,

Andres Freund
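
Expressed as code, the a/b1/b2 check being proposed might look roughly like the sketch below; aside from XLogCheckBufferNeedsBackup() and pgWalUsage, every name is invented for illustration, and this is not the logic of any committed patch:

#include "postgres.h"

#include "access/xloginsert.h"
#include "executor/instrument.h"
#include "storage/buf.h"

/*
 * Hypothetical sketch of the opportunistic-freeze condition: freeze only
 * when it cannot emit a new FPI, and either the page was already dirty
 * before pruning (b1) or pruning already wrote a WAL record (b2), so the
 * marginal WAL-flush cost is small.
 */
static bool
should_freeze_page_sketch(Buffer buf,
						  bool page_was_dirty_before_prune,
						  int64 wal_records_before_prune)
{
	/* (a) never if freezing the page would require a new FPI */
	if (XLogCheckBufferNeedsBackup(buf))
		return false;

	/* (b1) already dirty: we cause no extra ring-buffer write or flush */
	if (page_was_dirty_before_prune)
		return true;

	/* (b2) pruning emitted WAL: the flush-position difference is marginal */
	if (pgWalUsage.wal_records > wal_records_before_prune)
		return true;

	return false;
}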




Re: doc: add missing "id" attributes to extension packaging page

2023-01-26 Thread Brar Piening

On 18.01.2023 at 06:50, Brar Piening wrote:


I'll give it a proper look this weekend.


It turns out I didn't get a round tuit.

... and I'm afraid I probably will not be able to work on this until
mid/end February so we'll have to move this to the next commitfest until
somebody wants to take it over and push it through.






Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tom Lane
Tomas Vondra  writes:
> I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> did not report any results for a couple days, and it seems it got into
> an infinite loop in REL_11_STABLE when building hash table in a parallel
> hashjoin, or something like that.

> It seems to be progressing now, probably because I attached gdb to the
> workers to get backtraces, which does signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

regards, tom lane




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan  wrote:
> Relatively difficult for Andres, or for somebody else? What are the
> real parameters here? Obviously there are no clear answers available.

Andres is certainly smarter than the average guy, but practically any
scenario that someone can create in a few lines of SQL is something to
which code will be exposed on some real-world system. If Andres
came along and said, hey, well I found a way to make this patch suck,
and proceeded to describe a scenario that involved a complex set of
tables and multiple workloads running simultaneously and using a
debugger to trigger some race condition and whatever, I'd be like "OK,
but is that really going to happen?". The actual scenario he came up
with is three lines of SQL, and it's nothing remotely obscure. That
kind of thing is going to happen *all the time*.

> The overwhelming cost is usually FPIs in any case. If you're not
> mostly focussing on that, you're focussing on the wrong thing. At
> least with larger tables. You just have to focus on the picture over
> time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple times). It's not
just about reducing WAL overhead, it's also about reducing redundant
visibility checks - which are where a very significant portion of the CPU time
for VACUUMing goes.

Besides performance considerations, it's also just plain weird that
lazy_scan_prune() can end up with a different visibility than
heap_page_prune() (mostly due to concurrent aborts).


The number of WAL records we often end up emitting for processing a single
page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.


> I think they're quite closely related (in that they both execute in VACUUM
> and are required for long-term system stability), and are even more related
> now that we have opportunistic page-level freezing. I think adding a "freeze
> this page as well"-flag in PRUNE records would go a long way to reducing the
> WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access
pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 12:36 PM Jeff Davis  wrote:
> On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote:
> > I have no issue with that as a long-term plan. However, I think that
> > for right now we should just introduce pg_create_subscription. It
> > would make sense to add pg_create_connection in the same patch that
> > adds a CREATE CONNECTION command (or whatever exact syntax we end up
> > with) -- and that patch can also change CREATE SUBSCRIPTION to
> > require
> > both privileges where a connection string is specified directly.
>
> I assumed it would be a problem to say that pg_create_subscription was
> enough to create a subscription today, and then later require
> additional privileges (e.g. pg_create_connection).
>
> If that's not a problem, then this sounds fine with me.

Wonderful! I'm working on a patch, but due to various distractions,
it's not done yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane  wrote:
> Tomas Vondra  writes:
> > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > did not report any results for a couple days, and it seems it got into
> > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > hashjoin, or something like that.
>
> > It seems to be progressing now, probably because I attached gdb to the
> > workers to get backtraces, which does signals etc.
>
> That reminds me of cases that I saw several times on my now-deceased
> animal florican:
>
> https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
>
> There's clearly something rotten somewhere in there, but whether
> it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in latch code and not anything higher
(I mean, not in condition variables, barriers, or parallel hash join)
because I saw a similar hang in the shm_mq stuff which uses the latch
API directly.  Note that 13 switched to kqueue but still used the
self-pipe, and 14 switched to a signal event, and this hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas  wrote:
> > The overwhelming cost is usually FPIs in any case. If you're not
> > mostly focussing on that, you're focussing on the wrong thing. At
> > least with larger tables. You just have to focus on the picture over
> > time, across multiple VACUUM operations.
>
> I think that's all mostly true, but the cases where being more
> aggressive can cause *extra* FPIs are worthy of just as much attention
> as the cases where we can reduce them.

It's a question of our exposure to real problems, in no small part.
What can we afford to be wrong about? What problem can be fixed by the
user more or less as it emerges, and what problem doesn't have that
quality?

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/
https://brandur.org/fragments/events

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

-- 
Peter Geoghegan




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro  wrote:
> On Fri, Jan 27, 2023 at 9:49 AM Tom Lane  wrote:
> > Tomas Vondra  writes:
> > > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > > did not report any results for a couple days, and it seems it got into
> > > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > > hashjoin, or something like that.
> >
> > > It seems to be progressing now, probably because I attached gdb to the
> > > workers to get backtraces, which does signals etc.
> >
> > That reminds me of cases that I saw several times on my now-deceased
> > animal florican:
> >
> > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
> >
> > There's clearly something rotten somewhere in there, but whether
> > it's our bug or FreeBSD's isn't clear.
>
> And if it's ours, it's possibly in latch code and not anything higher
> (I mean, not in condition variables, barriers, or parallel hash join)
> because I saw a similar hang in the shm_mq stuff which uses the latch
> API directly.  Note that 13 switched to kqueue but still used the
> self-pipe, and 14 switched to a signal event, and this hasn't been
> reported in those releases or later, which makes the poll() code path
> a key suspect.

Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but
13 did it the same way as 11 + 12.  So between 12 and 13 we have just
the poll -> kqueue change.
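
For anyone who hasn't stared at latch.c recently, the general shape of
the sleep/wake protocol in question is roughly the following.  This is a
deliberately simplified, self-contained sketch, not the real code: the
maybe_sleeping check is the 14+ variant, while 11-13 instead signal the
latch owner, whose SIGUSR1 handler writes the self-pipe byte that
poll()/kevent() is watching; the real code uses plain flags plus
pg_memory_barrier() rather than C11 atomics, and all names below are
illustrative.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
	atomic_bool is_set;			/* "the latch has been set" */
	atomic_bool maybe_sleeping;	/* "a waiter may be about to block" */
} FakeLatch;

static void wakeup_sleeper(void)  { /* self-pipe write / kevent / signal */ }
static void block_in_kernel(void) { /* poll() / kevent() until woken */ }

/* waiter side, cf. WaitEventSetWait() */
static void
fake_wait(FakeLatch *l)
{
	atomic_store(&l->maybe_sleeping, true);
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for pg_memory_barrier() */
	if (atomic_load(&l->is_set))
	{
		/* latch already set: don't block, just clear the flag */
		atomic_store(&l->maybe_sleeping, false);
		return;
	}
	block_in_kernel();
	atomic_store(&l->maybe_sleeping, false);
}

/* setter side, cf. SetLatch() */
static void
fake_set(FakeLatch *l)
{
	atomic_store(&l->is_set, true);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&l->maybe_sleeping))
		wakeup_sleeper();
}

The point of the store/barrier/load ordering on both sides is that at
least one side is guaranteed to see the other's flag, so a SetLatch()
that races with a waiter going to sleep cannot be lost.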




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
>> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
>> but I take your point that the subtraction would overflow with
>> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
>> make use of pg_sub_s64_overflow()?

> That would be my vote.  I think the 'diff <= 0' check might need to be
> replaced with something like 'start_time > stop_time' so that we return 0
> for the underflow case.

Right, so more like this.

regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
 		 * the system clock was set backward, so that such occurrences don't
 		 * have the effect of suppressing further progress messages.
 		 */
-		if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+		if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+			now < mysink->last_progress_report_time)
 		{
 			mysink->last_progress_report_time = now;
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)
 
 	if (next_wakeup != 0)
 	{
-		/* Ensure we don't exceed one minute, or go under 0. */
-		return Max(0,
-   Min(60 * 1000,
-	   TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-	   next_wakeup)));
+		int			ms;
+
+		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+   next_wakeup);
+		return Min(60 * 1000, ms);
 	}
 
 	return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 pgsocket	wait_fd = PGINVALID_SOCKET;
 int			rc;
 TimestampTz nextWakeup;
-int			nap;
+long		nap;
 
 /*
  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 	nextWakeup = Min(wakeup[i], nextWakeup);
 
-/*
- * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
- * timeouts longer than INT_MAX milliseconds, so we limit the
- * result accordingly.  Also, we round up to the next
- * millisecond to avoid waking up too early and spinning until
- * one of the wakeup times.
- */
+/* Calculate the nap time, clamping as necessary. */
 now = GetCurrentTimestamp();
-nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+nap = TimestampDifferenceMilliseconds(now, nextWakeup);
 
 /*
  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..47e059a409 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,26 +1690,31 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
- *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
  * We expect start_time <= stop_time.  If not, we return zero,
  * since then we're already past the previously determined stop_time.
  *
+ * Subtracting finite and infinite timestamps works correctly, returning
+ * zero or INT_MAX as appropriate.
+ *
  * Note we round up any fractional millisecond, since waiting for just
  * less than the intended timeout is undesirable.
  */
 long
 TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
 {
-	TimestampTz diff = stop_time - start_time;
+	TimestampTz diff;
 
-	if (diff <= 0)
+	/* Deal with zero or negative elapsed time quickly. */
+	if (start_time >= stop_time)
 		return 0;
+	/* To not fail with timestamp infinities, we must detect overflow. */
+	if (pg_sub_s64_overflow(stop_time, start_time, &diff))
+		return (long) INT_MAX;
+	if (diff >= (INT_MAX * INT64CONST(1000) - 999))
+		return (long) INT_MAX;
 	else
		return (long) ((diff + 999) / 1000);



Re: Cygwin cleanup

2023-01-26 Thread Justin Pryzby
Note that cirrus failed like this:

https://api.cirrus-ci.com/v1/artifact/task/4881596411543552/testrun/build/testrun/subscription/010_truncate/log/010_truncate_publisher.log

2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] ERROR:  could not 
open file "pg_logical/snapshots/0-14F2060.snap": Is a directory
2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', 
publication_names '"pub1"')
2023-01-25 23:17:10.418 GMT [29850][walsender] 
[pg_16413_sync_16394_7192732880582452157][6/0:0] PANIC:  could not open file 
"pg_logical/snapshots/0-14F2060.snap": No such file or directory
2023-01-25 23:17:10.418 GMT [29850][walsender] 
[pg_16413_sync_16394_7192732880582452157][6/0:0] STATEMENT:  START_REPLICATION 
SLOT "pg_16413_sync_16394_7192732880582452157" LOGICAL 0/14F2060 (proto_version 
'4', origin 'any', publication_names '"pub3"')

I don't understand how "Is a directory" happened ..

It looks like maybe the call stack would've been:

SnapBuildSerializationPoint()
xlog_decode() or standby_decode() ?
LogicalDecodingProcessRecord()
XLogSendLogical()
WalSndLoop()
StartLogicalReplication()

-- 
Justin




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> There is very good reason to believe that the large majority of all
> data that people store in a system like Postgres is extremely cold
> data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

> Having a separate aggressive step that rewrites an entire large table,
> apparently at random, is just a huge burden to users. You've said that
> you agree that it sucks, but somehow I still can't shake the feeling
> that you don't fully understand just how much it sucks.

Ha!

Well, that's possible. But maybe you don't understand how much your
patch makes other things suck.

I don't think we can really get anywhere here by postulating that the
problem is the other person's lack of understanding, even if such a
postulate should happen to be correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
> Right, so more like this.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Something is wrong with wal_compression

2023-01-26 Thread Andrey Borodin
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
>
> That test case is demonstrating fundamental
> database corruption after a crash.
>

Not exactly corruption. The XID was not persisted and the buffer data
did not hit disk. The database is in a correct state.

It was discussed long before WAL compression here [0]. The thing is, it
is easier to reproduce with compression, but compression has nothing
to do with it, as far as I understand.

The proposed fix is here [1], but I think it's better to fix the test.
It should not verify the XID, but rather the side effects of "CREATE
TABLE mine(x integer);".


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/565FB155-C6B0-41E2-8C44-7B514DC25132%2540yandex-team.ru
[1] 
https://www.postgresql.org/message-id/flat/20210313012820.GJ29463%40telsasoft.com#0f18d3a4d593ea656fdc761e026fee81




Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Peter Eisentraut

On 25.01.23 22:16, Jeff Davis wrote:

I am highlighting this case because the existence of a single non-
contrived case or regression suggests that we may want to explore
further and tweak heuristics. That's quite natural when the heuristics
are based on a complex dependency like a collation provider. The
sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
lot easier.


Maybe an easier way to enable or disable it in the source code, with a
#define, would serve this purpose.  Making it a GUC right away seems a
bit heavy-handed.  Further exploration and tweaking might well require
further source code changes, so relying on a source-code-level toggle
would seem appropriate.
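
For illustration, such a toggle could look roughly like this.  It is
purely a sketch: DISABLE_ABBREVIATED_KEYS is an invented symbol, named
by analogy with the existing TRUST_STRXFRM define that the patch
replaces.

#include <stdbool.h>

/*
 * Compile-time switch for the abbreviated-keys optimization.
 * DISABLE_ABBREVIATED_KEYS is illustrative, not an existing symbol.
 */
static bool
abbreviation_enabled(void)
{
#ifdef DISABLE_ABBREVIATED_KEYS
	return false;				/* built with the optimization disabled */
#else
	return true;
#endif
}

Experimenting would then mean rebuilding with
-DDISABLE_ABBREVIATED_KEYS rather than flipping a GUC at run time,
which is exactly the trade-off under discussion.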






Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Andrew Dunstan


On 2023-01-26 Th 12:05, Jelte Fennema wrote:
>> At this stage the files are now indented, so if it failed and you run
>> `git commit` again it will commit with the indention changes.
> No, because at no point a "git add" is happening, so the changes
> made by pgindent are not staged. As long as you don't run the
> second "git commit" with the -a flag the commit will be exactly
> the same as you prepared it before.


Hmm, but I usually run with -a; I even have a git alias for it. I guess
what this discussion illustrates is that there are various patterns for
using git, and we shouldn't assume that everyone else is using the same
patterns we are.

I'm still mildly inclined to say this material would be better placed
in the developer wiki. After all, this isn't the only thing a postgres
developer might use a git hook for (mine has more material in it than
what I posted).


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas  wrote:
> On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> > There is very good reason to believe that the large majority of all
> > data that people store in a system like Postgres is extremely cold
> > data:
>
> The systems where I end up troubleshooting problems seem to be, most
> typically, busy OLTP systems. I'm not in a position to say whether
> that's more or less common than systems with extremely cold data, but
> I am in a position to say that my employer will have a lot fewer happy
> customers if we regress that use case. Naturally I'm keen to avoid
> that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

-- 
Peter Geoghegan




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
>>> Basically my question is whether having one error message for all of
>>> those cases is good enough, or whether we should be trying harder.
> 
> I think the password case needs to be kept separate, because the
> conditions for it are different (specifically the exception that
> you can alter your own password).  Lumping the rest together
> seems OK to me.

Hm.  In v2, the error message for both cases is the same:

ERROR:  permission denied to alter role
DETAIL:  You must have CREATEROLE privilege and ADMIN OPTION on role 
"regress_priv_user2".

We could add "to change its attributes" and "to change its password" to
separate the two, but I'm not sure that adds much.  ISTM the current error
message for ALTER ROLE PASSWORD implies that you can change your own
password, and that's lost with my patch.  Perhaps we should add an
errhint() with that information instead.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Andrey Borodin  writes:
> On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
>> That test case is demonstrating fundamental
>> database corruption after a crash.

> Not exactly corruption. XID was not persisted and buffer data did not
> hit a disk. Database is in the correct state.

Really?  I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s) 
[00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) #  got: 'committed'
# expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

regards, tom lane




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
>> Right, so more like this.

> LGTM

Thanks, pushed.

Returning to the prior patch ... I don't much care for this:

+/* Maybe there will be a free slot in a second... */
+retry_time = TimestampTzPlusSeconds(now, 1);
+LogRepWorkerUpdateSyncStartWakeup(retry_time);

We're not moving the goalposts very far on unnecessary wakeups if
we have to do that.  Do we need to get a wakeup on sync slot free?
Although having to send that to every worker seems ugly.  Maybe this
is being done in the wrong place and we need to find a way to get
the launcher to handle it.

As for the business about process_syncing_tables being only called
conditionally, I was already of the opinion that the way it's
getting called is loony.  Why isn't it called from LogicalRepApplyLoop
(and noplace else)?  With more certainty about when it runs, we might
not need so many kluges elsewhere.

regards, tom lane




Re: improving user.c error messages

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote:
>> I think the password case needs to be kept separate, because the
>> conditions for it are different (specifically the exception that
>> you can alter your own password).  Lumping the rest together
>> seems OK to me.

> Hm.  In v2, the error message for both cases is the same:

>   ERROR:  permission denied to alter role
>   DETAIL:  You must have CREATEROLE privilege and ADMIN OPTION on role 
> "regress_priv_user2".

> We could add "to change its attributes" and "to change its password" to
> separate the two, but I'm not sure that adds much.  ISTM the current error
> message for ALTER ROLE PASSWORD implies that you can change your own
> password, and that's lost with my patch.  Perhaps we should add an
> errhint() with that information instead.  WDYT?

Well, it's not a hint.  I think the above is fine for non-password
cases, but for passwords maybe

ERROR:  permission denied to alter role password
DETAIL:  To change another role's password, you must have CREATEROLE 
privilege and ADMIN OPTION on role "%s".
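
In user.c that would presumably come out looking roughly like the
following (a sketch only; "rolename" and the exact placement are
illustrative, not taken from the patch):

	/* assumes backend context: postgres.h and utils/elog.h */
	ereport(ERROR,
			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
			 errmsg("permission denied to alter role password"),
			 errdetail("To change another role's password, you must have CREATEROLE privilege and ADMIN OPTION on role \"%s\".",
					   rolename)));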

regards, tom lane




Re: Something is wrong with wal_compression

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane  wrote:
> Andrey Borodin  writes:
> > On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
> >> That test case is demonstrating fundamental
> >> database corruption after a crash.
>
> > Not exactly corruption. XID was not persisted and buffer data did not
> > hit a disk. Database is in the correct state.
>
> Really?  I don't see how this part is even a little bit okay:
>
> [00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
> [00:40:50.745](0.001s)
> [00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
> #   at t/011_crash_recovery.pl line 57.
> [00:40:50.746](0.001s) #  got: 'committed'
> # expected: 'aborted'
>
> If any tuples made by that transaction had reached disk,
> we'd have a problem.

The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery.  But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no?  FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure[1].

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJ9p2JPPMA4eYAKq%3Dr9d_4_8vziet_tS1LEBbiny5-ypA%40mail.gmail.com




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Jelte Fennema
On Thu, 26 Jan 2023 at 22:46, Andrew Dunstan  wrote:
> Hmm, but I usually run with -a, I even have a git alias for it. I guess
> what this discussion illustrates is that there are various patters for
> using git, and we shouldn't assume that everyone else is using the same
> patterns we are.

I definitely agree that there are lots of ways to use git. And I now
understand why my hook didn't work well for your existing workflow.

I've pretty much unlearned the -a flag. Because the easiest way I've
been able to split up changes into different commits is using "git add
-p", which adds partial pieces of files to the staging area. And that
workflow combines terribly with "git commit -a" because -a adds all
the things that I specifically didn't put in the staging area into the
final commit anyway.

> I'm still  mildly inclined to say this material would be better placed
> in the developer wiki. After all, this isn't the only thing a postgres
> developer might use a git hook for

I think it should definitely be somewhere. I have a preference for the
repo, since I think the docs on codestyle are already in too many
different places. But the wiki is already much better than having no
shared hook at all. I mainly think we should try to make it as easy as
possible for people to commit well indented code.

> (mine has more material in it than in what I posted).

Anything that is useful for the wider community and could be part of
this example/template git hook? (e.g. some perltidy automation)




Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Jan 27, 2023 at 11:14 AM Tom Lane  wrote:
>> If any tuples made by that transaction had reached disk,
>> we'd have a problem.

> The problem is that the WAL wasn't flushed, allowing the same xid to
> be allocated again after crash recovery.  But for any data pages to
> hit the disk, we'd have to flush WAL first, so then it couldn't
> happen, no?

Ah, now I get the point: the "committed xact" seen after restart
isn't the same one as we saw before the crash, but a new one that
was given the same XID because nothing about the old one had made
it to disk yet.

> FWIW I also re-complained about the dangers of anyone
> relying on pg_xact_status() for its stated purpose after seeing
> tanager's failure[1].

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

regards, tom lane




Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Jeff Davis
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> Maybe an easier way to enable or disable it in the source code with a
> #define would serve this.  Making it a GUC right away seems a bit 
> heavy-handed.  Further exploration and tweaking might well require 
> further source code changes, so relying on a source code level toggle
> would seem appropriate.

I am using these GUCs for testing the various collation paths in my
collation refactoring branch.

I find them pretty useful, and when I saw a regression, I thought
others might think it was useful, too. But if not I'll just leave them
in my branch and withdraw from this thread.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund  wrote:
> > Most of the overhead of FREEZE WAL records (with freeze plan
> > deduplication and page-level freezing in) is generic WAL record header
> > overhead. Your recent adversarial test case is going to choke on that,
> > too. At least if you set checkpoint_timeout to 1 minute again.
>
> I don't quite follow. What do you mean with "record header overhead"? Unless
> that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

> The problematic case I am talking about is when we do *not* emit a WAL record
> during pruning (because there's nothing to prune), but want to freeze the
> table. If you don't log an FPI, the remaining big overhead is that increasing
> the LSN on the page will often cause an XLogFlush() when writing out the
> buffer.
>
> I don't see what your reference to checkpoint timeout is about here?
>
> Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> = 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

> > In many cases we'll have to dirty the page anyway, just to set
> > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> > triggered by an FPI or triggered by my now-reverted GUC) on being able
> > to set the whole page all-frozen in the VM.
>
> IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
> bits. But freezing does trigger one even without wal_log_hint_bits.

That is correct.

> You're right, it makes sense to consider whether we'll emit a
> XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

That's not perfect, of course, but it doesn't have to be. Perhaps it
should still be improved, just on general principle.

> > > A less aggressive version would be to check if any WAL records were 
> > > emitted
> > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI 
> > > if we
> > > modified the page again. Similar to what we do now, except not requiring 
> > > an
> > > FPI to have been emitted.
> >
> > Also way more aggressive. Not nearly enough on its own.
>
> In which cases will it be problematically more aggressive?
>
> If we emitted a WAL record during pruning we've already set the LSN of the
> page to a very recent LSN. We know the page is dirty. Thus we'll already
> trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects. You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

> > > But to me it seems a bit odd that VACUUM now is more aggressive if 
> > > checksums /
> > > wal_log_hint bits is on, than without them. Which I think is how using 
> > > either
> > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> >
> > Which part is the odd part? Is it odd that page-level freezing works
> > that way, or is it odd that page-level checksums work that way?
>
> That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be
specifically documented. But other than that, it seems reasonable
enough to me. I mean, should I not do something that's going to be of
significant help to users with checksums, just because it'll be
somewhat confusing to a small minority of users?



Re: Rework of collation code, extensibility

2023-01-26 Thread Jeff Davis

Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of
cleaner code and a performance boost seems like reasonable
justification for this patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked
me to focus this thread on the refactoring, which I've done by reducing
it to 2 patches, and I left multilib ICU up to the other thread. He
also questioned the increased line count, but I think the currently-low
line count is due to bad style. PeterG provided some review comments,
in particular when to do the tiebreaking, which I addressed.

This patch has been around for a while, so I'll take a fresh look and
see if I see risk areas, and re-run a few sanity checks. Of course more
feedback would also be welcome.

PERFORMANCE:

==
Setup:
==

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between
base and master; v9-0001 just adds some GUCs to make testing easier.

glibc 2.35      ICU 70.1
gcc 11.3.0      LLVM 14.0.0

built with meson (uses -O3)

$ perl text_generator.pl 1000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

=
Test queries:
=

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from
the Sort operator (as reported in EXPLAIN ANALYZE). Median of three
runs.


Results:


                 base       refactor    speedup

sort_abbreviated_keys=false:
  C              7377       7273        1.4%
  en_US          35081      35090       0.0%
  en-US-x-icu    20520      19465       5.4%

sort_abbreviated_keys=true:
  C              8105       8008        1.2%
  en_US          35067      34850       0.6%
  en-US-x-icu    22626      21507       5.2%

===
Conclusion:
===

These numbers can move +/-1 percentage point, so I'd interpret anything
less than that as noise. This happens to be the first run where all the
numbers favored the refactoring patch, but it is generally consistent
with what I had seen before.

The important part is that, for ICU, it appears to be a substantial
speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of
these gains due to an extra indirect function call.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS




text_generator.pl
Description: Perl program
From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v9 1/3] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse performance.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, allowed only for superusers (because an incorrect setting
could lead to wrong results), but can be granted to others.
---
 doc/src/sgml/config.sgml   | 40 ++
 src/backend/utils/adt/varlena.c| 10 +++---
 src/backend/utils/misc/guc_tables.c| 24 +
 src/backend/utils/sort/tuplesortvariants.c | 17 ++---
 4 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc009..8f55b89f35 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sort_abbreviated_keys (boolean)
+  
+   sort_abbreviated_keys configuration parameter
+  
+  
+  
+   
+Enables or disables the use of abbreviated sort keys, a sort optimization,
+if applicable. The default is true. Disabling may
+be useful to diagnose

Partition key causes problem for volatile target list query

2023-01-26 Thread Bruce Momjian
I have found an odd behavior --- a query in the target list that assigns
to a partition key column causes queries that would normally be volatile
to always return zero.

In the first query, no partitioning is used:

 d1 | d2
----+----
  1 |  0
  2 |  0
  2 |  1
  1 |  0
  1 |  2
  1 |  2
  1 |  0
  0 |  2
  2 |  0
  2 |  2

In the next query, 'd1' is a partition key and it gets a constant value
of zero for all rows:

 d1 | d2
----+----
-->   0 |  1
-->   0 |  2
  0 |  2
  0 |  1
  0 |  2
  0 |  1
  0 |  2
  0 |  2
  0 |  2
  0 |  2

The self-contained query is attached.  The value is _always_ zero, which
suggests random() is not being called;  calling setseed() does not
change that.  If I change "SELECT x" to "SELECT 2", the "2" is used. 
I see this behavior back to PG 11.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.


partition_test.sql
Description: application/sql

