Re: Minimal logical decoding on standbys

2023-01-26 Thread Drouvot, Bertrand

Hi,

On 1/26/23 9:13 PM, Andres Freund wrote:

Hi,

On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:

- I'm struggling to create a test for btree killtuples as it requires row
removal on the table (which could produce a conflict too):
Do you have a scenario in mind for this one? (and btw, in what kind of WAL record
should the conflict be detected in such a case? xl_btree_delete?)


Hm, it might indeed be hard in "modern" postgres.  I think you'd need at least
two concurrent sessions, to prevent on-access pruning on the table.


DROP TABLE IF EXISTS indexdel;
CREATE TABLE indexdel(id int8 primary key);
INSERT INTO indexdel SELECT generate_series(1, 1);
VACUUM indexdel; -- ensure hint bits are set etc

DELETE FROM indexdel;

SELECT pg_current_wal_insert_lsn();

SET enable_indexonlyscan = false;
-- This scan finds that the index items are dead - but doesn't yet issue a
-- btree delete WAL record, that only happens when needing space on the page
-- again.
EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC;
SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;

-- The insertions into the range of values prev
INSERT INTO indexdel SELECT generate_series(1, 100);


Does generate the btree deletion record, but it also emits a PRUNE (from
heapam_index_fetch_tuple() -> heap_page_prune_opt()).

While the session could create a cursor to prevent later HOT cleanup, the
query would also trigger hot pruning (or prevent the rows from being dead, if
you declare the cursor before the DELETE). So you'd need overlapping cursors
in a concurrent session...



Thanks for the scenario and explanation!

I agree that a second session would be needed (and so I understand why I was
struggling when trying with a single session ;-) )


Too complicated.



Yeah.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: drop postmaster symlink

2023-01-26 Thread Peter Eisentraut

On 26.01.23 19:36, Karl O. Pinc wrote:

I see a possible problem at line 1,412 of runtime.sgml

This says:

  in the postmaster's startup script just before invoking the postmaster.

Depending on how this is read, it could be interpreted to mean
that a "postmaster" binary is invoked.  It might be more clear
to write: ... just before invoking postgres.

Or it might not be worth bothering about; at this point, probably
not, but I thought you might want the heads-up anyway.


Good find.  I have adjusted that, and a few more nearby.





Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Bharath Rupireddy
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund  wrote:
>
> > 7. I think we need to not let backends throttle too frequently even
> > though they have crossed wal_throttle_threshold bytes. The best way is
> > to rely on replication lag, after all the goal of this feature is to
> > keep replication lag under check - say, throttle only when
> > wal_distance > wal_throttle_threshold && replication_lag >
> > wal_throttle_replication_lag_threshold.
>
> I think my idea of only forcing to flush/wait an LSN some distance in the past
> would automatically achieve that?

I'm sorry, I couldn't get your point, can you please explain it a bit more?

Looking at the patch, the feature, in its current shape, focuses on
keeping replication lag in check (by throttling WAL on the primary) only
when synchronous replication is enabled. Why is that? Why can't we design
it for replication in general (async, sync, and logical replication)?

Keeping replication lag under check enables one to provide a better
RPO guarantee as discussed in the other thread
https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
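As a side note, the kind of lag such a threshold would have to compare
against can already be observed on the primary, e.g. with something like
the query below (just a sketch; the wal_throttle_* names above are only
the GUCs proposed here):

-- rough per-standby replication lag in bytes, as seen on the primary
SELECT application_name,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), flush_lsn)  AS flush_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;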

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 9:58 PM Andres Freund  wrote:
> It doesn't seem like a great proxy to me. ISTM that this means that how
> aggressive vacuum is about opportunistically freezing pages depends on config
> variables like checkpoint_timeout & max_wal_size (less common opportunistic
> freezing), full_page_writes & use of unlogged tables (no opportunistic
> freezing), and the largely random scheduling of autovac workers.

The FPI thing was originally supposed to complement the freezing
strategies stuff, and possibly other rules that live in
lazy_scan_prune. Obviously you can freeze a page by following any rule
that you care to invent -- you can decide by calling random(). Two
rules can coexist during the same VACUUM (actually, they do already).

> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>  FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

It seems to me that you're reinventing something akin to eager
freezing strategy here. At least that's how I define it, since now
you're bringing the high level context into it; what happens with the
table, with VACUUM operations, and so on. Obviously this requires
tracking the metadata that you suppose will be available in some way
or other, in particular things like lsn_of_last_vacuum.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to.

> I don't think the speculation is that fundamentally different - a heavily
> updated table with a bit of a historic, non-changing portion, makes
> vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
> record" or "any fpi".

That's true. The point I was making is that both this proposal and
eager freezing are based on some kind of high level picture of the
needs of the table, based on high level metadata. To me that's the
defining characteristic.

> > And even when we lose, you generally still won't have been completely
> > wrong -- even then there generally will indeed be a second FPI later
> > on for the same page, to go with everything else. This makes the
> > wasted freezing even less significant, on a comparative basis!
>
> This is precisely why I think that we can afford to be quite aggressive about
> freezing already dirty pages...

I'm beginning to warm to this idea, now that I understand it a little better.

-- 
Peter Geoghegan




Re: Reducing power consumption on idle servers

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 7:37 PM Bharath Rupireddy
 wrote:
> On Wed, Jan 25, 2023 at 2:10 AM Tom Lane  wrote:
> > It's kind of moot until we've reached the point where we can
> > credibly claim to have explicit wakeups for every event of
> > interest.  I don't think we're very close to that today, and
> > I do think we should try to get closer.  There may come a point
> > of diminishing returns though.
>
> IIUC, we're discussing here whether or not to get rid of hibernate
> loops, IOW, sleep-wakeup-doworkifthereisany-sleep loops and rely on
> other processes' wakeup signals to reduce the overall power
> consumption, am I right?
>
> I'm trying to understand this a bit - can the signals (especially,
> SIGURG that we use to set latches to wake up processes) ever get lost
> on the way before reaching the target process? If yes, how? How
> frequently can it happen? Is there any history of reported issues in
> postgres because a signal got lost?
>
> I'm reading about Pending Signals and queuing of signals with
> sigqueue() (in linux), can any of these guarantee that signals sent
> never get lost?

Signals don't get lost.  At least with the current definition of so
called "reliable" signals, in the POSIX standard.  (There was a
previous system of signals in ancient UNIX that was harder to program
with; we still see echoes of that today, for example in the C standard
there are basic signals, and Windows implements those, but they're
borderline useless; before a signal handler runs, the handler is also
de-registered, so many interesting handlers would have to re-register
it, but signals can't be atomically blocked at the same time and there
is a window of time where the default behaviour applies, which (from
our modern perspective) is clearly insane as there is literally no way
to close that race and avoid some fatal default behaviour).  BTW
sigqueue signals (signals that are queued up without being consolidated,
and carry a user-controlled value with them) are "realtime signals" AKA
"signal queuing", which I guess we'll never be able to use because at
least Apple didn't implement them; the "reliable signals" we use are
the plain kind where signals don't carry any kind of payload and are
collapsed into a single "pending" bit, so that if 2 people do
kill(SIGFOO) around the same time your handler might only run once (if
you registered one and it's not blocked), but it will definitely run >
0 times (or be delivered via various other means such as sigwait(),
etc...).

Then there is the question of races as you go into the wait primitive.
I mean in between our reading latch->is_set and entering the kernel,
which we defend against using various techniques which you can read
about at the top of latch.c; the short version is that if that
happens, those system calls will immediately return.

The uncertainty discussed here comes from the comment beginning "There
is a race condition here, ..." in bgwriter.c, referring to the
protocol implemented by StrategyNotifyBgWriter().  I haven't really
looked into it.  I *guess* it's probably approximately OK if the
bgwriter misses a wakeup from time to time, because there's another
chance to wake it up as soon as someone needs another buffer, and if
you don't need any more buffers soon your system is probably idle; but
really "needing another buffer" isn't a great proxy for "more buffers
are being made dirty" at all!  Someone would need to think about it
and construct a complete argument for why it's OK to go to sleep for
60 seconds, or infinity, with that sloppiness in place.

There's also the walwriter to look into; from memory it was a little
less fuzzy but I haven't looked recently.




RE: [Proposal] Add foreign-server health checks infrastructure

2023-01-26 Thread Hayato Kuroda (Fujitsu)
I found a cfbot failure; PSA the fixed version.
Sorry for the noise.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v29-0003-add-test.patch
Description: v29-0003-add-test.patch


v29-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch
Description:  v29-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch


v29-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch
Description:  v29-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch


v29-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
Description:  v29-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Thu, Jan 26, 2023 at 7:15 PM David Rowley  wrote:
>
> On Thu, 26 Jan 2023 at 23:29, John Naylor 
wrote:
> > Coming back to this, I wanted to sketch out this idea in a bit more
detail.
> >
> > Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:
> > - Qsort the non-null array, including whatever specialization is
available. Existing specialized comparators could ignore nulls (and their
ordering) taking less space in the binary.
> > - Only if there is more than one sort key, qsort the null array.
Ideally at some point we would have a method of ignoring the first sortkey
(this is an existing opportunity that applies elsewhere as well).
> > - To handle two arrays, grow_memtuples() would need some adjustment, as
would any callers that read the final result of an in-memory sort -- they
would need to retrieve the tuples starting with the appropriate array
depending on NULLS FIRST/LAST behavior.
>
> Thanks for coming back to this. I've been thinking about this again
> recently due to what was discovered in [1].

That was indeed part of the motivation for bringing this up.

> Why I'm bringing this up here is that I wondered, in addition to what
> you're mentioning above, if we're making some changes to allow
> multiple in-memory arrays, would it be worth going a little further
> and allowing it so we could have N arrays which we'd try to keep under
> L3 cache size. Maybe some new GUC could be used to know what a good
> value is.  With fast enough disks, it's often faster to use smaller
> values of work_mem which don't exceed L3 to keep these batches
> smaller.  Keeping them in memory would remove the need to write out
> tapes.

That's interesting. I don't know enough to guess how complex it would be to
make "external" merges agnostic about whether the tapes are on disk or in
memory.

If in-memory sorts were designed analogously to external ones,
grow_memtuples would never have to repalloc, it could just allocate a new
tape, which could further shave some cycles.

My hunch in my last email was that having separate groups of tapes for each
null/non-null first sortkey would be complex, because it increases the
number of places that have to know about nulls and their ordering. If we
wanted to go to N arrays instead of 2, and additionally wanted separate
null/non-null treatment, it seems we would still need 2 sets of arrays, one
with N non-null-first-sortkey tapes and one with M null-first-sortkey tapes.

So using 2 arrays seems like the logical first step. I'd be curious to hear
other possible development paths.

> I think the slower sorts I found in [2] could also be partially caused
> by the current sort specialisation comparators re-comparing the
> leading column during a tie-break. I've not gotten around to disabling
> the sort specialisations to see if and how much this is a factor for
> that test.

Right, that's worth addressing independently of the window function
consideration. I'm still swapping this area back in my head, but I believe
one issue is that state->base.onlyKey signals two things: "one sortkey, not
abbreviated". We could add a separate branch for "first key unabbreviated,
nkeys>1" -- I don't think we'd need to specialize, just branch -- and
instead of state->base.comparetup, call a set of analogous functions that
only handle keys 2 and above (comparetup_tail_* ? or possibly just add a
boolean parameter compare_first). That would not pose a huge challenge, I
think, since they're already written like this:

/* Compare the leading sort key */
compare = ApplySortComparator(...);
if (compare != 0)
return compare;

/* Compare additional sort keys */
...

The null/non-null separation would eliminate a bunch of branches in inlined
comparators, so we could afford to add another branch for number of keys.

I haven't thought through either of these ideas in the gory detail, but I
don't yet see any big obstacles.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Masahiko Sawada
On Fri, Jan 27, 2023 at 3:17 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> > If I'm understanding this result correctly, it seems to me that your
> > patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> > BUFFERS READ), but there seems no visible performance gain with only
> > your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> > patch should be included in the WAL DIO patch rather than applying it
> > alone. Am I missing something?
>
> We already support using DIO for WAL - it's just restricted in a way that
> makes it practically not usable. And the reason for that is precisely that
> walsenders need to read the WAL. See get_sync_bit():
>
> /*
>  * Optimize writes by bypassing kernel cache with O_DIRECT when using
>  * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
>  * otherwise the archive command or walsender process will read the WAL
>  * soon after writing it, which is guaranteed to cause a physical read if
>  * we bypassed the kernel cache. We also skip the
>  * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
>  * reason.
>  *
>  * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
>  * written by walreceiver is normally read by the startup process soon
>  * after it's written. Also, walreceiver performs unaligned writes, which
>  * don't work with O_DIRECT, so it is required for correctness too.
>  */
> if (!XLogIsNeeded() && !AmWalReceiverProcess())
>     o_direct_flag = PG_O_DIRECT;
>
>
> Even if that weren't the case, splitting up bigger commits in incrementally
> committable chunks is a good idea.

Agreed. I was wondering because the test results don't demonstrate the
first motivation of this patch, which is to improve performance by
reducing disk I/O and system calls regardless of the DIO patch. But it
makes sense to me that this patch is a part of the DIO patch series.

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Reducing power consumption on idle servers

2023-01-26 Thread Bharath Rupireddy
On Wed, Jan 25, 2023 at 2:10 AM Tom Lane  wrote:
>
> Thomas Munro  writes:
> > Yeah, I definitely want to fix it.  I just worry that 60s is so long
> > that it also needs that analysis work to be done to explain that it's
> > OK that we're a bit sloppy on noticing when to wake up, at which point
> > you might as well go to infinity.
>
> Yeah.  The perfectionist in me wants to say that there should be
> explicit wakeups for every event of interest, in which case there's
> no need for a timeout.  The engineer in me says "but what about bugs?".
> Better a slow reaction than never reacting at all.  OTOH, then you
> have to have a discussion about whether 60s (or any other
> ice-cap-friendly value) is an acceptable response time even in the
> presence of bugs.
>
> It's kind of moot until we've reached the point where we can
> credibly claim to have explicit wakeups for every event of
> interest.  I don't think we're very close to that today, and
> I do think we should try to get closer.  There may come a point
> of diminishing returns though.

IIUC, we're discussing here whether or not to get rid of hibernate
loops, IOW, sleep-wakeup-doworkifthereisany-sleep loops and rely on
other processes' wakeup signals to reduce the overall power
consumption, am I right?

I'm trying to understand this a bit - can the signals (especially,
SIGURG that we use to set latches to wake up processes) ever get lost
on the way before reaching the target process? If yes, how? How
frequently can it happen? Is there any history of reported issues in
postgres because a signal got lost?

I'm reading about Pending Signals and queuing of signals with
sigqueue() (in linux), can any of these guarantee that signals sent
never get lost?

FWIW, a recent commit cd4329d that removed promote_trigger_file also
removed hibernation and made the process wait forever, relying on the
latch being set. If we're worried that signals can get lost on the way,
then this also needs to be fixed. And I also see a lot of
WaitLatch() calls that wait forever, relying on others to wake them up.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> If I'm understanding this result correctly, it seems to me that your
> patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> BUFFERS READ), but there seems no visible performance gain with only
> your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> patch should be included in the WAL DIO patch rather than applying it
> alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

/*
 * Optimize writes by bypassing kernel cache with O_DIRECT when using
 * O_SYNC and O_DSYNC.  But only if archiving and streaming are disabled,
 * otherwise the archive command or walsender process will read the WAL
 * soon after writing it, which is guaranteed to cause a physical read if
 * we bypassed the kernel cache. We also skip the
 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
 * reason.
 *
 * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
 * written by walreceiver is normally read by the startup process soon
 * after it's written. Also, walreceiver performs unaligned writes, which
 * don't work with O_DIRECT, so it is required for correctness too.
 */
if (!XLogIsNeeded() && !AmWalReceiverProcess())
    o_direct_flag = PG_O_DIRECT;


Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 6:37 PM Andres Freund  wrote:
> > I also don't really see how that is responsive to anything else in my
> > email. That's just as true for the current gating condition (the issuance of
> > an FPI during heap_page_prune() / HTSV()).
> >
> > What I was wondering about is whether we should replace the
> >   fpi_before != pgWalUsage.wal_fpi
> > with
> >   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
>
> I understand that. What I'm saying is that that's going to create a
> huge problem of its own, unless you separately account for that
> problem.

> The simplest and obvious example is something like a pgbench_tellers
> table. VACUUM will generally run fast enough relative to the workload
> that it will set some of those pages all-visible. Now it's going to
> freeze them, too. Arguably it shouldn't even be setting the pages
> all-visible, but now you make that existing problem much worse.

So the benefit of the FPI condition is that it indicates that the page hasn't
been updated all that recently, because, after all, a checkpoint has happened
since?  If that's the intention, it needs a huge honking comment - at least I
can't read that out of:

  Also freeze when pruning generated an FPI, if doing so means that we set the
  page all-frozen afterwards (might not happen until final heap pass).


It doesn't seem like a great proxy to me. ISTM that this means that how
aggressive vacuum is about opportunistically freezing pages depends on config
variables like checkpoint_timeout & max_wal_size (less common opportunistic
freezing), full_page_writes & use of unlogged tables (no opportunistic
freezing), and the largely random scheduling of autovac workers.


I can see it making a difference for pgbench_tellers, but it's a pretty small
difference in overall WAL volume. I can think of more adverse workloads though
- but even there the difference seems not huge, and not predictably
reached. Due to the freeze plan stuff you added, the amount of WAL for
freezing a page is pretty darn small compared to the amount of WAL if compared
to the amount of WAL needed to fill a page with non-frozen tuples.

That's not to say we shouldn't reduce the risk - I agree that both the "any
fpi" and the "any record" condition can have adverse effects!


However, an already dirty page getting frozen is also the one case where
freezing won't have a meaningful write amplification effect. So I think it's
worth spending effort figuring out how we can make the downsides of freezing
in that situation unlikely and small.


The cases with downsides are tables that are very heavily updated throughout,
where the page is going to be defrosted again almost immediately. As you say,
the all-visible marking has a similar problem.


Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

  lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
  if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
 FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.



> The important point is that there doesn't seem to be any good way
> around thinking about the table as a whole if you're going to freeze
> speculatively. This is not the same dynamic as we see with the FPI
> thing IMV -- that's not nearly so speculative as what you're talking
> about, since it is speculative in roughly the same sense that eager
> freezing was speculative (hence the suggestion that something like
> vacuum_freeze_strategy_threshold could have a role to play).

I don't think the speculation is that fundamentally different - a heavily
updated table with a bit of a historic, non-changing portion, makes
vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
record" or "any fpi".


> The FPI thing is mostly about the cost now versus the cost later on.
> You're gambling that you won't get another FPI later on if you freeze
> now. But the cost of a second FPI later on is so much higher than the
> added cost of freezing now that that's a very favorable bet, that we
> can afford to "lose" many times while still coming out ahead overall.

Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The
latter part is especially huge IMO. Depending on s_b size it can also avoid
another *read* of the page...


> And even when we lose, you generally still won't have been completely
> wrong -- even then there generally will indeed be a second FPI later
> on for the same page, to go with everything else. This makes the
> wasted freezing even

Re: recovery modules

2023-01-26 Thread Nathan Bossart
On Wed, Jan 25, 2023 at 04:34:21PM +0900, Michael Paquier wrote:
> The loop part is annoying..  I've never been a fan of adding this
> cross-value checks for the archiver modules in the first place, and it
> would make things much simpler in the checkpointer if we need to think
> about that as we want these values to be reloadable.  Perhaps this
> could just be an exception where we just give priority on one over the
> other archive_cleanup_command?  The startup process has a well-defined
> sequence after a failure, while the checkpointer is designed to remain
> robust.

Yeah, there are some problems here.  If we ERROR, we'll just bounce back to
the sigsetjmp() block once a second, and we'll never pick up configuration
reloads, shutdown signals, etc.  If we FATAL, we'll just rapidly restart
over and over.  Given the discussion about misconfigured archiving
parameters [0], I doubt folks will be okay with giving priority to one or
the other.

I'm currently thinking that the checkpointer should set a flag and clear
the recovery callbacks when a misconfiguration is detected.  Anytime the
checkpointer tries to use the archive-cleanup callback, a WARNING would be
emitted.  This is similar to an approach I proposed for archiving
misconfigurations (that we didn't proceed with) [1].  Given the
aforementioned problems, this approach might be more suitable for the
checkpointer than it is for the archiver.

Thoughts?

[0] https://postgr.es/m/9ee5d180-2c32-a1ca-d3d7-63a723f68d9a%40enterprisedb.com
[1] https://postgr.es/m/20220914222736.GA3042279%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Masahiko Sawada
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 26, 2023 at 2:45 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > > Please review the attached v2 patch further.
> > > >
> > > > I'm still unclear on the performance goals of this patch. I see that it
> > > > will reduce syscalls, which sounds good, but to what end?
> > > >
> > > > Does it allow a greater number of walsenders? Lower replication
> > > > latency? Less IO bandwidth? All of the above?
> > >
> > > One benefit would be that it'd make it more realistic to use direct IO 
> > > for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less clearly a 
> > > win.
> >
> > Satya's email just now reminded me of another important reason:
> >
> > Eventually we should add the ability to stream out WAL *before* it has 
> > locally
> > been written out and flushed. Obviously the relevant positions would have to
> > be noted in the relevant message in the streaming protocol, and we couldn't
> > generally allow standbys to apply that data yet.
> >
> > That'd allow us to significantly reduce the overhead of synchronous
> > replication, because instead of commonly needing to send out all the pending
> > WAL at commit, we'd just need to send out the updated flush position. The
> > reason this would lower the overhead is that:
> >
> > a) The reduced amount of data to be transferred reduces latency - it's easy 
> > to
> >accumulate a few TCP packets worth of data even in a single small OLTP
> >transaction
> > b) The remote side can start to write out data earlier
> >
> >
> > Of course this would require additional infrastructure on the receiver
> > side. E.g. some persistent state indicating up to where WAL is allowed to be
> > applied, to avoid the standby getting ahead of the primary, in case the
> > primary crash-restarts (or has more severe issues).
> >
> >
> > With a bit of work we could perform WAL replay on standby without waiting 
> > for
> > the fdatasync of the received WAL - that only needs to happen when a) we 
> > need
> > to confirm a flush position to the primary b) when we need to write back 
> > pages
> > from the buffer pool (and some other things).
>
> Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
> is right, the eventual plan is to do a bunch of other stuff as
> described above and we've discussed this in another thread (see
> below). I would like to once again clarify motivation behind this
> feature:
>
> 1. It enables WAL readers (callers of WALRead() - wal senders,
> pg_walinspect etc.) to use WAL buffers as first level cache which
> might reduce number of IOPS at a peak load especially when the pread()
> results in a disk read (WAL isn't available in OS page cache). I had
> earlier presented the buffer hit ratio/amount of pread() system calls
> reduced with wal senders in the first email of this thread (95% of the
> time wal senders are able to read from WAL buffers without impacting
> anybody). Now, here are the results with the WAL DIO patch [1] - where
> WAL pread() turns into a disk read, see the results [2] and attached
> graph.
>
> 2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
> page cache, using WAL buffers as read cache helps a lot. It is clearly
> evident from my experiment with WAL DIO patch [1], see the results [2]
> and attached graph. As expected, WAL DIO brings down the TPS, whereas
> WAL buffers read i.e. this patch brings it up.
>
> [2] Test case is an insert pgbench workload.
> clients |   HEAD | WAL DIO | WAL DIO & WAL BUFFERS READ | WAL BUFFERS READ
>       1 |   1404 |    1070 |                       1424 |             1375
>       2 |   1487 |     796 |                       1454 |             1517
>       4 |   3064 |    1743 |                       3011 |             3019
>       8 |   6114 |    3556 |                       6026 |             5954
>      16 |  11560 |    7051 |                      12216 |            12132
>      32 |  23181 |   13079 |                      23449 |            23561
>      64 |  43607 |   26983 |                      43997 |            45636
>     128 |  80723 |   45169 |                      81515 |            81911
>     256 | 110925 |   90185 |                     107332 |           114046
>     512 | 119354 |  109817 |                     110287 |           117506
>     768 | 112435 |  105795 |                     106853 |           111605
>    1024 | 107554 |  105541 |                     105942 |           109370
>    2048 |  88552 |   79024 |                      80699 |            90555
>    4096 |  61323 |   54814 |                      58704 |            61743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Assertion failure in SnapBuildInitialSnapshot()

2023-01-26 Thread Amit Kapila
On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada  wrote:
>
> The same assertion failure has been reported on another thread[1].
> Since I could reproduce this issue several times in my environment
> I've investigated the root cause.
>
> I think there is a race condition of updating
> procArray->replication_slot_xmin by CreateInitDecodingContext() and
> LogicalConfirmReceivedLocation().
>
> What I observed in the test was that a walsender process called:
> SnapBuildProcessRunningXacts()
>   LogicalIncreaseXminForSlot()
> LogicalConfirmReceivedLocation()
>   ReplicationSlotsComputeRequiredXmin(false).
>
> In ReplicationSlotsComputeRequiredXmin() it acquired the
> ReplicationSlotControlLock and got 0 as the minimum xmin since there
> was no wal sender having effective_xmin.
>

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Doesn't that walsender's
slot's effective_xmin have a non-zero value? If not, then why?

-- 
With Regards,
Amit Kapila.




Re: Inconsistency in reporting checkpointer stats

2023-01-26 Thread Bharath Rupireddy
On Wed, Dec 21, 2022 at 5:15 PM Nitin Jadhav
 wrote:
>
> I have modified the code accordingly and attached the new version of
> patches. patch 0001 fixes the inconsistency in checkpointer stats and
> patch 0002 separates main buffer and SLRU buffer count from checkpoint
> complete log message.

IMO, there's no need for 2 separate patches for these changes.

+(errmsg("restartpoint complete: wrote %d buffers (%.1f%%), "
+"wrote %d slru buffers (%.1f%%); %d WAL
file(s) added, "
+"%d removed, %d recycled; write=%ld.%03d s, "
+"sync=%ld.%03d s, total=%ld.%03d s; sync files=%d, "
+"longest=%ld.%03d s, average=%ld.%03d s;
distance=%d kB, "
+"estimate=%d kB; lsn=%X/%X, redo lsn=%X/%X",
Hm, the restartpoint/checkpoint complete message is already too long to
read, and adding slru buffers to it makes it even longer. Note that
we don't need to add every checkpoint stat to the log message; it can go
to pg_stat_bgwriter instead. Isn't it enough to show SLRU buffers
information in pg_stat_bgwriter alone?

Can't one look at pg_stat_slru's blks_written
(pgstat_count_slru_page_written()) to really track the SLRUs written?
Or is it that one may want to track SLRUs during a checkpoint
separately? Is there a real use-case/customer reported issue driving
this change?
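(For reference, that existing per-SLRU view can be queried like this -- a
sketch; note it is cumulative rather than per-checkpoint, which is the
distinction in question:)

SELECT name, blks_written, flushes, stats_reset
FROM pg_stat_slru
ORDER BY blks_written DESC;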

After looking at pg_stat_slru.blks_written, I think the best way is to
just leave things as-is and let CheckpointStats count slru buffers too
unless there's really a reported issue that says
pg_stat_slru.blks_written doesn't serve the purpose.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Something is wrong with wal_compression

2023-01-26 Thread Laurenz Albe
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
> On Fri, Jan 27, 2023 at 3:04 PM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier  
> > > wrote:
> > > > My opinion would be to make this function more reliable, FWIW, even if
> > > > that involves a performance impact when called in a close loop by
> > > > forcing more WAL flushes to ensure its report durability and
> > > > consistency.
> > 
> > > Yeah, the other thread has a patch for that.  But it would hurt some
> > > workloads.
> > 
> > I think we need to get the thing correct first and worry about
> > performance later.  What's wrong with simply making pg_xact_status
> > write and flush a record of the XID's existence before returning it?
> > Yeah, it will cost you if you use that function, but not if you don't.
> 
> There is no
> doubt that the current situation is unacceptable, though, so maybe we
> really should just do it and make a faster one later.  Anyone else
> want to vote on this?

I wasn't aware of the existence of pg_xact_status, so I suspect that it
is not a widely known and used feature.  After reading the documentation,
I'd say that anybody who uses it will want it to give a reliable answer.
So I'd agree that it is better to make it more expensive, but live up to
its promise.

Yours,
Laurenz Albe




RE: Exit walsender before confirming remote flush in logical replication

2023-01-26 Thread Hayato Kuroda (Fujitsu)
Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed the relationship with the given commit,
and I think I can keep my patch. What do you think?

# Abstract

* Some modifications are still needed.
* We cannot roll back the shutdown if walsenders are stuck.
* We don't have a good way to detect such a stuck state.

# Discussion

Compared to physical replication, logical replication is more likely to get stuck.
I think the risk should be reduced as much as possible by fixing the code.
Then, if it still leads to a failure, we can document the caveat for users.

While shutting down the server, the checkpointer sends the SIGUSR1 signal to
walsenders. This is done after the other processes have exited, so we cannot
raise an ERROR and roll back the operation if the checkpointer detects a stuck
process at that point.

We don't have any mechanism for the postmaster to check whether this node is a
publisher or not. So if we want to add a mechanism that checks the health of
walsenders before shutting down, we must do that at the top of
process_pm_shutdown_request() even if we are not using logical replication.
I think that would affect the basic structure of postgres significantly, and in
the first place, PostgreSQL does not have a mechanism to check the health of a
process.

Therefore, I want to adopt the approach that walsenders themselves exit immediately
when they get the signal.

## About the patch - were the fixes correct?

In ProcessPendingWrites() in my patch, the walsender calls WalSndDone() when it gets
the SIGUSR1 signal. I think this is how it should be. From the patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
WalSndShutdown();
+
+   /* If we got shut down requested, try to exit the process */
+   if (got_STOPPING)
+   WalSndDone(XLogSendLogical);
}
 
/* reactivate latch so WalSndLoop knows to continue */
```


Per my analysis, in the case of logical replication, walsenders exit via the
following steps. Note that logical walsenders do not receive the SIGUSR2 signal;
they set the flag by themselves instead:

1. postmaster sends a shutdown request to the checkpointer
2. checkpointer sends SIGUSR1 to walsenders and waits
3. when walsenders receive SIGUSR1, they turn got_SIGUSR1 on
4. walsenders consume all WAL @XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical
6. walsenders see that the flag is on, so they call WalSndDone() @WalSndLoop
7. proc_exit(0)
8. checkpointer writes the shutdown record
...

The type (i) stuck case I reported in -hackers [1] means that processes stop at
step 6, and the type (ii) case means that they stop at step 4. In step 4,
got_SIGUSR2 is never set, so we must use the got_STOPPING flag.

[1]: 
https://www.postgresql.org/message-id/tycpr01mb58701a47f35fed0a2b399662f5...@tycpr01mb5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Improve GetConfigOptionValues function

2023-01-26 Thread Bharath Rupireddy
On Tue, Jan 24, 2023 at 8:43 PM Tom Lane  wrote:
>
> Bharath Rupireddy  writes:
> > On Mon, Jan 23, 2023 at 9:51 PM Tom Lane  wrote:
> >> Also, I intentionally dropped the GUC_NO_SHOW_ALL check in
> >> get_explain_guc_options, because it seems redundant given
> >> the preceding GUC_EXPLAIN check.  It's unlikely we'd ever have
> >> a variable that's marked both GUC_EXPLAIN and GUC_NO_SHOW_ALL ...
> >> but if we did, shouldn't the former take precedence here anyway?
>
> > You're right, but there's nothing that prevents users writing GUCs
> > with GUC_EXPLAIN and GUC_NO_SHOW_ALL.
>
> "Users"?  You do realize those flags are only settable by C code,
> right?

I meant extensions here.

> Moreover, you haven't explained why it would be good that
> you can't get at the behavior that a GUC is both shown in EXPLAIN
> and not shown in SHOW ALL.  If you want "not shown by either",
> that's already accessible by setting only the GUC_NO_SHOW_ALL
> flag.  So I'd almost argue this is a bug fix, though I concede
> it's a bit hard to imagine why somebody would want that choice.
> Still, if we have two independent flags they should produce four
> behaviors, not just three.

Got it. I think I should've looked at GUC_NO_SHOW_ALL and GUC_EXPLAIN
as separate flags not depending on each other in any way, meaning,
GUCs marked with GUC_NO_SHOW_ALL mustn't be shown in SHOW ALL and its
friends
pg_settings and pg_show_all_settings(), and GUCs marked with
GUC_EXPLAIN must be shown in explain output. This understanding is
clear and simple IMO.

Having said that, I have no objection to the v4 patch posted upthread.
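As an aside, on v15 and later the flag combinations can be checked from SQL
with pg_settings_get_flags(), e.g. (a sketch; note that anything marked
GUC_NO_SHOW_ALL won't appear in pg_settings in the first place, which is
exactly the combination being discussed):

-- list GUCs that carry the EXPLAIN flag
SELECT name, pg_settings_get_flags(name) AS flags
FROM pg_settings
WHERE 'EXPLAIN' = ANY (pg_settings_get_flags(name));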

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add LZ4 compression in pg_dump

2023-01-26 Thread Justin Pryzby
On Mon, Jan 16, 2023 at 11:54:46AM +0900, Michael Paquier wrote:
> On Sun, Jan 15, 2023 at 07:56:25PM -0600, Justin Pryzby wrote:
> > On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> >> The functions changed by 0001 are cfopen[_write](),
> >> AllocateCompressor() and ReadDataFromArchive().  Why is it a good idea
> >> to change these interfaces which basically exist to handle inputs?
> > 
> > I changed to pass pg_compress_specification as a pointer, since that's
> > the usual convention for structs, as followed by the existing uses of
> > pg_compress_specification.
> 
> Okay, but what do we gain here?  It seems to me that this introduces
> the risk that a careless change in one of the internal routines if
> they change slight;ly compress_spec, hence impacting any of their
> callers?  Or is that fixing an actual bug (except if I am missing your
> point, that does not seem to be the case)?  

To circle back to this: I was not saying there's any bug.  The proposed
change was only to follow the normal, existing conventions for
passing structs.  It could also be a pointer to const.  It's fine with
me if you say that it's intentional how it's written already.

> >> Is there some benefit in changing compression_spec within the
> >> internals of these routines before going back one layer down to their
> >> callers?  Changing the compression_spec on-the-fly in these internal
> >> paths could be risky, actually, no?
> > 
> > I think what you're saying is that if the spec is passed as a pointer,
> > then the called functions shouldn't set spec->algorithm=something.
> 
> Yes.  HEAD makes sure of that, 0001 would not prevent that.  So I am a
> bit confused in seeing how this is a benefit.




RE: [Proposal] Add foreign-server health checks infrastructure

2023-01-26 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

I have updated my patch for error handling and kqueue() support.
Actually I do not have a BSD-like machine, but I developed it using GitHub CI/CD.
I think we should focus on 0001-0003 first, and then work on 0004.

The following are the change notes and my analysis.

0001

* Fixed missed replacements of PQConnCheck() with PQconnCheck().
* Error handling was improved. Now we can detect a failure of poll() and
return -1 in that case.
* I think we don't have to add select(2) support in PQconnCheck(). According
to the man page [1],
  select(2) can only be used to watch whether a descriptor is readable,
writable, or has an exceptional condition.
  It means that select() does not have an event corresponding to POLLRDHUP.

0002, 0003
Not changed

0004

* Added kqueue(2) support for the BSD family.
* I did not add epoll() support, because it can be used only on Linux, and
such systems have POLLRDHUP instead.
  I checked the other code in libpq, and it does not use epoll(). Such an
event is not specified in POSIX [2]
  and can be used on Linux only. I decided to use poll() as much as possible
to keep to that policy.
* While coding, I found that there is no good time to close the kernel event
queue.
  It means that the lifetime of the kqueue becomes the same as that of the
client program, and it occupies a small amount of memory forever.
  I'm not sure whether that is acceptable.


[1]: https://man7.org/linux/man-pages/man2/select.2.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v28-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch
Description:  v28-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch


v28-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
Description:  v28-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch


v28-0003-add-test.patch
Description: v28-0003-add-test.patch


v28-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch
Description:  v28-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch


Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Jan 27, 2023 at 3:04 PM Tom Lane  wrote:
>> I think we need to get the thing correct first and worry about
>> performance later.  What's wrong with simply making pg_xact_status
>> write and flush a record of the XID's existence before returning it?
>> Yeah, it will cost you if you use that function, but not if you don't.

> It would be pg_current_xact_id() that would have to pay the cost of
> the WAL flush, not pg_xact_status() itself,

Right, typo on my part.

regards, tom lane




Re: Monotonic WindowFunc support for ntile(), percent_rank() and cume_dist()

2023-01-26 Thread David Rowley
On Wed, 25 Jan 2023 at 02:39, Melanie Plageman
 wrote:
>
> On Tue, Jan 24, 2023 at 02:00:33PM +1300, David Rowley wrote:
> > If you feel strongly about that, then feel free to show me what you
> > have in mind in more detail so I can think harder about it.
>
> Nah, I don't feel strongly. I think it was because looking at the patch
> in isolation, the repetition stands out but in the context of the rest
> of the code it doesn't.

I played around with this patch again and the performance gains for
the best case are not quite as good as we got for row_number(), rank()
and dense_rank() in the original commit.  The reason for this is that
ntile() and co all require getting a count of the total rows in the
partition so it can calculate the result. ntile() needs to know how
large the tiles are, for example. That obviously requires pulling all
tuples into the tuplestore.

Despite this, the performance with the patch is still much better than
without.  Here's master:

explain (analyze, timing off, costs off) select * from (select
a,ntile(10) over (order by a) nt from a) where nt < 1;

 Subquery Scan on unnamed_subquery (actual rows=0 loops=1)
   Filter: (unnamed_subquery.nt < 1)
   Rows Removed by Filter: 100
   ->  WindowAgg  (actual rows=100 loops=1)
 ->  Index Only Scan using a_a_idx on a  (actual rows=100 loops=1)
   Heap Fetches: 0
 Planning Time: 0.073 ms
 Execution Time: 254.118 ms
(8 rows)

and with the patch:

 WindowAgg  (actual rows=0 loops=1)
   Run Condition: (ntile(10) OVER (?) < 1)
   ->  Index Only Scan using a_a_idx on a (actual rows=100 loops=1)
 Heap Fetches: 0
 Planning Time: 0.072 ms
 Execution Time: 140.023 ms
(6 rows)

Here's with row_number() for reference:

explain (analyze, timing off, costs off) select * from (select
a,row_number() over (order by a) rn from a) where rn < 1;

 WindowAgg (actual rows=0 loops=1)
   Run Condition: (row_number() OVER (?) < 1)
   ->  Index Only Scan using a_a_idx on a (actual rows=1 loops=1)
 Heap Fetches: 0
 Planning Time: 0.089 ms
 Execution Time: 0.054 ms
(6 rows)

you can clearly see the difference with the number of rows pulled out
of the index only scan.

This is just a 1 million row table with a single INT column and an
index on that column.
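(For anyone wanting to reproduce this, a setup along these lines should
match the plans above -- assuming the table really is just called "a" with
an int column "a", as the a_a_idx index name suggests:)

CREATE TABLE a (a int);
INSERT INTO a SELECT generate_series(1, 1000000);
CREATE INDEX ON a (a);
VACUUM (ANALYZE) a;  -- needed so the index-only scans show Heap Fetches: 0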

Anyway, all seems like clear wins and low controversy so I've now pushed it.

David




Re: Something is wrong with wal_compression

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier  wrote:
> >> My opinion would be to make this function more reliable, FWIW, even if
> >> that involves a performance impact when called in a close loop by
> >> forcing more WAL flushes to ensure its report durability and
> >> consistency.
>
> > Yeah, the other thread has a patch for that.  But it would hurt some
> > workloads.
>
> I think we need to get the thing correct first and worry about
> performance later.  What's wrong with simply making pg_xact_status
> write and flush a record of the XID's existence before returning it?
> Yeah, it will cost you if you use that function, but not if you don't.

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah that's what the
patch does (with some optimisations).  I guess one question is whether
there are any other reasonable real world uses of
pg_current_xact_id(), other than the original goal[1].  If not, then
at least you are penalising the right users, even though they probably
only actually call pg_xact_status() in extremely rare circumstances
(if COMMIT hangs up).  But I wouldn't be surprised if people have
found other reasons to be interested in xid observability, related to
distributed transactions and snapshots and suchlike.   There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later.  Anyone else
want to vote on this?

[1] 
https://www.postgresql.org/message-id/flat/CAMsr%2BYHQiWNEi0daCTboS40T%2BV5s_%2Bdst3PYv_8v2wNVH%2BXx4g%40mail.gmail.com
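For reference, the usage pattern from that original thread is roughly this
(a sketch; the xid value is made up), which is why the flush cost lands on
the first call rather than the second:

BEGIN;
SELECT pg_current_xact_id();            -- client remembers the result, e.g. 12345;
                                        -- this is the call that would WAL-log/flush
-- ... do work ...
COMMIT;                                 -- if the connection drops here, the client
                                        -- can later ask from another session:
SELECT pg_xact_status('12345'::xid8);   -- 'in progress', 'committed' or 'aborted'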




Re: Generating code for query jumbling through gen_node_support.pl

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote:
> There are a couple of repetitive comments, like "typmod and collation
> information are irrelevant for the query jumbling".  This applies to all
> nodes, so we don't need to repeat it for a number of nodes (and then not
> mention it for other nodes).  Maybe there should be a central place
> somewhere that describes "these kinds of fields should normally be ignored".

The place that would make the most sense to me to centralize this
knowledge is nodes.h itself, because that's where the node attributes
are defined?
--
Michael




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 6:37 PM Andres Freund  wrote:
> I also don't really see how that is responsive to anything else in my
> email. That's just as true for the current gating condition (the issuance of
> an FPI during heap_page_prune() / HTSV()).
>
> What I was wondering about is whether we should replace the
>   fpi_before != pgWalUsage.wal_fpi
> with
>   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a
huge problem of its own, unless you separately account for that
problem.

The simplest and obvious example is something like a pgbench_tellers
table. VACUUM will generally run fast enough relative to the workload
that it will set some of those pages all-visible. Now it's going to
freeze them, too. Arguably it shouldn't even be setting the pages
all-visible, but now you make that existing problem much worse.

The important point is that there doesn't seem to be any good way
around thinking about the table as a whole if you're going to freeze
speculatively. This is not the same dynamic as we see with the FPI
thing IMV -- that's not nearly so speculative as what you're talking
about, since it is speculative in roughly the same sense that eager
freezing was speculative (hence the suggestion that something like
vacuum_freeze_strategy_threshold could have a role to play).

The FPI thing is mostly about the cost now versus the cost later on.
You're gambling that you won't get another FPI later on if you freeze
now. But the cost of a second FPI later on is so much higher than the
added cost of freezing now that that's a very favorable bet, that we
can afford to "lose" many times while still coming out ahead overall.
And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

It's also likely true that an FPI in lazy_scan_prune is a much
stronger signal, but I think that the important dynamic is that we're
reasoning about "costs now vs costs later on". The asymmetry is really
important.

-- 
Peter Geoghegan




Re: Generating code for query jumbling through gen_node_support.pl

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 09:37:13AM +0100, Peter Eisentraut wrote:
> Ok, the documentation make sense now.  I wonder what the performance impact
> is.  Probably, nobody cares about microoptimizing CREATE TABLE statements.
> But BEGIN/COMMIT could matter.  However, whatever you do in between the
> BEGIN and COMMIT will already be jumbled, so you're already paying the
> overhead.  Hopefully, jumbling such simple commands will have no noticeable
> overhead.
> 
> In other words, we should test this and hopefully get rid of the 'string'
> method.

Yep.  I have mentioned a few numbers upthread, and this deserves
discussion.

FYI, I have done more micro-benchmarking to compare both methods for
utility queries by hijacking JumbleQuery() to run the computation in a
tight loop run N times (could not come up with a better idea to avoid
the repeated palloc/pfree overhead), as the path to stress is
_jumbleNode().  See the attached, that should be able to apply on top
of the latest patch set (named as .txt to not feed it to the CF bot,
and need to recompile to switch the iteration).

Using that, I can compile the following results for various cases (-O2
and compute_query_id=on):
  query  |  mode  | iterations | avg_runtime_ns | avg_jumble_ns 
-++++---
 begin   | string |   5000 |4.53116 |  4.54
 begin   | jumble |   5000 |   30.94578 | 30.94
 commit  | string |   5000 |4.76004 |  4.74
 commit  | jumble |   5000 |31.4791 | 31.48
 create table 1 column   | string |   5000 |7.22836 |  7.08
 create table 1 column   | jumble |   5000 |  152.10852 |151.96
 create table 5 columns  | string |   5000 |   12.43412 | 12.28
 create table 5 columns  | jumble |   5000 |  352.88976 | 349.1
 create table 20 columns | string |500 | 49.591 |  48.2
 create table 20 columns | jumble |500 |  2272.4066 |  2271
 drop table 1 column | string |   5000 |6.70538 |  6.56
 drop table 1 column | jumble |   5000 |  50.38 | 50.24
 drop table 5 columns| string |   5000 |6.88256 |  6.74
 drop table 5 columns| jumble |   5000 |   50.02898 |  49.9
 SET work_mem| string |   5000 |7.28752 |  7.28
 SET work_mem| jumble |   5000 |   91.66588 | 91.64
(16 rows)

avg_runtime_ns is (query runtime / iterations) and avg_jumble_ns is
the same with the difference between the start/end logs in the txt
patch attached.  The overhead to run the query does not matter much if
you compare both.  The time it takes to run a jumble is correlated to
the number of nodes to go through for each query, and there is a
larger gap for more nodes to go through.  Well, a simple "begin" or
"commit" query has its computation time increase from 4ns to 30ns in
average which would be unnoticeable.  The gap is larger for larger
nodes, like SET, still we jump from 7ns to 90ns in this case.  DDLs
take the most hit with this method, where a 20-column CREATE TABLE
jumps from 50ns to 2us (note that the iteration count is 10 times lower
here).

At the end, that would be unnoticeable for the average user, I guess,
but here are the numbers I get on my laptop :)
--
Michael
From ee47c4e1137adc857184d76ab62232e9c3c95451 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 27 Jan 2023 10:58:34 +0900
Subject: [PATCH 1/1] Add benchmark tweaks to stress jumbling code

---
 src/backend/nodes/queryjumblefuncs.c | 36 +---
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/src/backend/nodes/queryjumblefuncs.c 
b/src/backend/nodes/queryjumblefuncs.c
index d55adf5020..9c134cc5ae 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -101,32 +101,48 @@ CleanQuerytext(const char *query, int *location, int *len)
 JumbleState *
 JumbleQuery(Query *query, const char *querytext)
 {
+#define MAX_QUERY_COMPUTE 500
+
JumbleState *jstate = NULL;
 
Assert(IsQueryIdEnabled());
 
+   elog(WARNING, "start JumbleQuery");
+
if (query->utilityStmt &&
utility_query_id == UTILITY_QUERY_ID_STRING)
{
-		query->queryId = compute_utility_query_id(querytext,
-												  query->stmt_location,
-												  query->stmt_len);
+		for (int i = 0; i < MAX_QUERY_COMPUTE ; i++)
+		{
+			query->queryId = compute_utility_query_id(querytext,
+

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 12:45 PM Andres Freund  wrote:
> > > Most of the overhead of FREEZE WAL records (with freeze plan
> > > deduplication and page-level freezing in) is generic WAL record header
> > > overhead. Your recent adversarial test case is going to choke on that,
> > > too. At least if you set checkpoint_timeout to 1 minute again.
> >
> > I don't quite follow. What do you mean with "record header overhead"? Unless
> > that includes FPIs, I don't think that's that commonly true?
>
> Even if there are no directly observable FPIs, there is still extra
> WAL, which can cause FPIs indirectly, just by making checkpoints more
> frequent. I feel ridiculous even having to explain this to you.

What does that have to do with "generic WAL record overhead"?


I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
  fpi_before != pgWalUsage.wal_fpi
with
  records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
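
As a self-contained sketch of the two gates being compared (editorial
illustration only, not PostgreSQL source: the WAL-usage counters are stubbed
here, and WouldIssueFpi() is the hypothetical helper named above):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct WalUsageStub
{
    int64_t     wal_records;
    int64_t     wal_fpi;
} WalUsageStub;

static WalUsageStub pgWalUsage;     /* stand-in for the backend's counters */

typedef void *Page;                 /* stand-in for the backend Page type */

/* hypothetical: would modifying this page again emit a full-page image? */
static bool
WouldIssueFpi(Page page)
{
    (void) page;
    return false;
}

/* current gate: pruning already emitted an FPI */
static bool
gate_current(int64_t fpi_before)
{
    return fpi_before != pgWalUsage.wal_fpi;
}

/*
 * proposed gate: pruning emitted some WAL record, and freezing the page now
 * would not force an extra FPI of its own
 */
static bool
gate_proposed(int64_t records_before, Page page)
{
    return records_before != pgWalUsage.wal_records && !WouldIssueFpi(page);
}

int
main(void)
{
    pgWalUsage.wal_records = 11;
    pgWalUsage.wal_fpi = 3;

    /* pretend pruning emitted one record but no FPI */
    printf("current: %d, proposed: %d\n", gate_current(3), gate_proposed(10, NULL));
    return 0;
}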


> > The problematic case I am talking about is when we do *not* emit a WAL 
> > record
> > during pruning (because there's nothing to prune), but want to freeze the
> > table. If you don't log an FPI, the remaining big overhead is that 
> > increasing
> > the LSN on the page will often cause an XLogFlush() when writing out the
> > buffer.
> >
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to 
> > checkpoint_timeout
> > = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest. Sure, it made it easier to
> reproduce. But that's not all it did!
>
> You had *lots* of specific numbers and technical details in your first
> email, such as "Time for vacuuming goes up to ~5x. WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with WHERE ts < now() - '250s'
instead of WHERE ts < now() - '120s'.

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min, adjusted the query
to some randomly guessed value. Happened to get nearly the same results.

I then experimented more with '1min', because it's less annoying to have to
wait for 120s until deletions start, than to wait for 250s. Because it's
quicker to run I thought I'd share the less resource intensive version. A
mistake as I now realize.


This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof for a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite hardware
specific, because the insert/seconds rate is very machine specific, and it's
completely unnecessarily hardware intensive due to the use of single-row
inserts, instead of batched operations.  It's just a POC.



> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, that
you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed to
show a problem. And it does that.



> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.

I think it might be useful with logged hint bits; consider cases where all the
tuples on the page were already fully hinted. That's not uncommon, I think?


> > > > A less aggressive version would be to check if any WAL records were 
> > > > emitted
> > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI 
> > > 

Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier  wrote:
>> My opinion would be to make this function more reliable, FWIW, even if
>> that involves a performance impact when called in a close loop by
>> forcing more WAL flushes to ensure its report durability and
>> consistency.

> Yeah, the other thread has a patch for that.  But it would hurt some
> workloads.

I think we need to get the thing correct first and worry about
performance later.  What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.

> A better patch would do some kind of amortisation
> (reserving N xids at a time or some such scheme, while being careful
> to make sure the right CLOG pages etc exist if you crash and skip a
> bunch of xids on recovery) but be more complicated.

Maybe that would be appropriate for HEAD, but I'd be wary of adding
anything complicated to the back branches.  This is clearly a very
under-tested area.

regards, tom lane




Re: wrong Append/MergeAppend elision?

2023-01-26 Thread Amit Langote
On Fri, Jan 27, 2023 at 5:43 AM Tom Lane  wrote:
> David Rowley  writes:
> > On Fri, 27 Jan 2023 at 01:30, Amit Langote  wrote:
> >> It seems that the planner currently elides an Append/MergeAppend that
> >> has run-time pruning info (part_prune_index) set, but which I think is
> >> a bug.
>
> > There is still the trade-off of having to pull tuples through the
> > Append node for when run-time pruning is unable to prune the last
> > partition. So your proposal to leave the Append alone when there's
> > run-time pruning info is certainly not a no-brainer.
>
> Yeah.  Amit's proposal amounts to optimizing for the case that all
> partitions get pruned, which does not seem to me to be the way
> to bet.  I'm inclined to think it's fine as-is.

Fair enough.  I thought for a second that maybe it was simply an
oversight but David confirms otherwise.  This was interacting badly
with the other patch I'm working on and I just figured out the problem
was with that other patch.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Add LZ4 compression in pg_dump

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
> Yeah.  But the original log_warning text was better, and should be
> restored:
> 
> -   if (AH->compression != 0)
> -   pg_log_warning("archive is compressed, but this installation 
> does not support compression -- no data will be available");

Yeah, this one's on me.  So I have gone with the simplest solution and
applied a fix to restore the original behavior, with the same warning
showing up.
--
Michael


signature.asc
Description: PGP signature


Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:
> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to purse.


> Unfortunately, I don't really know how to do that effectively. If we knew
> that the table was going to see 10 vacuums before we hit
> autovacuum_freeze_max_age, we could try to have each one do 10% of the
> amount of freezing that was going to need to be done rather than letting any
> single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an
anti-wraparound vacuum has to do if we don't have an age based autovacuum
trigger kicking in before autovacuum_freeze_max_age. After all, there might be
no autovacuum before autovacuum_freeze_max_age is reached.

But there's just no reason to not have a trigger below
autovacuum_freeze_max_age. That's why I think Peter's patch to split age and
anti-"auto-cancel" autovacuums is an strictly necessary change if we want to
make autovacuum fundamentally suck less. There's a few boring details to
figure out how to set/compute those limits, but I don't think there's anything
fundamentally hard.


I think we also need the number of all-frozen pages in pg_class if we want to
make better scheduling decisions. As we already compute the number of
all-visible pages at the end of vacuuming, we can compute the number of
all-frozen pages as well. The space for another integer in pg_class doesn't
bother me one bit.


Let's say we had a autovacuum_vacuum_age trigger of 100m, and
autovacuum_freeze_max_age=500m. We know that we're roughly going to be
vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow
autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's
not much we can do).

With that we could determine the eager percentage along the lines of:
  frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age) / autovacuum_freeze_max_age
  eager_percentage = Min(0, frozen_target * relpages - pg_class.relallfrozen * relpages)
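
A self-contained sketch of that computation is below. It makes two assumptions
the mail leaves implicit: the Min() is intended as a clamp at zero (read
literally it never yields a positive number), and relallfrozen is the proposed
per-relation page count, so it is not multiplied by relpages again. This is an
editorial illustration, not PostgreSQL source.

#include <stdint.h>
#include <stdio.h>

static int64_t
pages_to_freeze_eagerly(int64_t relfrozenxid_age,   /* age(relfrozenxid) */
                        int64_t freeze_max_age,     /* autovacuum_freeze_max_age */
                        int64_t relpages,
                        int64_t relallfrozen)       /* proposed pg_class column */
{
    double      frozen_target;
    int64_t     target_pages;
    int64_t     shortfall;

    if (relfrozenxid_age > freeze_max_age)
        relfrozenxid_age = freeze_max_age;

    /* fraction of the table we would like to be frozen by now */
    frozen_target = (double) relfrozenxid_age / (double) freeze_max_age;
    target_pages = (int64_t) (frozen_target * relpages);

    /* freeze eagerly only to make up the shortfall */
    shortfall = target_pages - relallfrozen;
    return shortfall > 0 ? shortfall : 0;
}

int
main(void)
{
    /* 40% of the way to freeze_max_age, 1M pages, 300k already frozen */
    printf("%lld\n",
           (long long) pages_to_freeze_eagerly(200000000, 500000000,
                                               1000000, 300000));
    /* prints 100000: target is ~400k frozen pages, 300k already are */
    return 0;
}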

One thing I don't know fully how to handle is how to ensure that we try to
freeze a different part of the table each vacuum. I guess we could store a
page number in pgstats?


This would help address the "cliff" issue of reaching
autovacuum_freeze_max_age. What it would *not* address, on its own, is the
number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append
mostly" from vacuum's view (a table can be update heavy, but very localized to
recent rows, and still be append mostly from vacuum's view).  There's obvious
cases, e.g. when there are way more inserts than dead rows.  But other cases
are harder.



> Also, even if we did have that sort of information, the idea only works if
> the pages that we freeze sooner are ones that we're not about to update or
> delete again, and we don't have any idea what is likely there.

Perhaps we could use something like
  (age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)
as a heuristic?

I have a gut feeling that we should somehow collect/use statistics about the
number of frozen pages, marked as such by the last (or recent?) vacuum, that
had to be "defrosted" by backends. But I don't quite know how to yet.  I think
we could collect statistics about that by storing the LSN of the last vacuum
in the shared stats, and incrementing that counter when defrosting.

A lot of things like that would work a whole lot better if we had statistics
that take older data into account, but weigh it less than more recent
data. But that's hard/expensive to collect.
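
One cheap way to let old data count while weighing recent data more heavily is
an exponentially weighted moving average, which needs only one accumulator per
counter. The sketch below is purely illustrative (not something the backend
does today); the "pages defrosted per interval" numbers are made up:

#include <stdio.h>

typedef struct EwmaStat
{
    double      value;          /* current smoothed value */
    int         initialized;    /* has a first sample been seen? */
} EwmaStat;

static void
ewma_add_sample(EwmaStat *stat, double sample, double alpha)
{
    /* alpha in (0, 1]: larger alpha weights recent samples more heavily */
    if (!stat->initialized)
    {
        stat->value = sample;
        stat->initialized = 1;
    }
    else
        stat->value = alpha * sample + (1.0 - alpha) * stat->value;
}

int
main(void)
{
    EwmaStat    defrosted = {0, 0};
    double      samples[] = {10, 12, 8, 100, 95};

    for (int i = 0; i < 5; i++)
    {
        ewma_add_sample(&defrosted, samples[i], 0.3);
        printf("after sample %.0f: ewma = %.2f\n", samples[i], defrosted.value);
    }
    return 0;
}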


Greetings,

Andres Freund




Re: Something is wrong with wal_compression

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier  wrote:
> On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
> > If we agree that xid allocation is not something persistent, let's fix
> > the test? We can replace a check with select * from pg_class or,
> > maybe, add an amcheck run.
> > As far as I recollect, this test was introduced to test this new
> > function in 857ee8e391f.
>
> My opinion would be to make this function more reliable, FWIW, even if
> that involves a performance impact when called in a close loop by
> forcing more WAL flushes to ensure its report durability and
> consistency.  As things stand, this is basically unreliable, and we
> document it as something applications can *use*.  Adding a note in the
> docs to say that this function can be unstable for some edge cases
> does not make much sense to me, either.  Commit 857ee8e itself says
> that we can use it if a database connection is lost, which could
> happen on a crash..

Yeah, the other thread has a patch for that.  But it would hurt some
workloads.  A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc exist if you crash and skip a
bunch of xids on recovery) but be more complicated.  For the record,
back before release 13 added the 64 bit xid allocator, these functions
(or rather their txid_XXX ancestors) were broken in a different way:
they didn't track epochs reliably, the discovery of which led to the
new xid8-based functions, so that might provide a natural
back-patching range, if a back-patchable solution can be agreed on.




Re: Record queryid when auto_explain.log_verbose is on

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 10:00:04PM +0900, torikoshia wrote:
> I'll work on this next.

Cool, thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Something is wrong with wal_compression

2023-01-26 Thread Michael Paquier
On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
> If we agree that xid allocation is not something persistent, let's fix
> the test? We can replace a check with select * from pg_class or,
> maybe, add an amcheck run.
> As far as I recollect, this test was introduced to test this new
> function in 857ee8e391f.

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when called in a close loop by
forcing more WAL flushes to ensure its report durability and
consistency.  As things stand, this is basically unreliable, and we
document it as something applications can *use*.  Adding a note in the
docs to say that this function can be unstable for some edge cases
does not make much sense to me, either.  Commit 857ee8e itself says
that we can use it if a database connection is lost, which could
happen on a crash..
--
Michael


signature.asc
Description: PGP signature


Re: Partition key causes problem for volatile target list query

2023-01-26 Thread Bruce Momjian
On Thu, Jan 26, 2023 at 07:21:16PM -0500, Tom Lane wrote:
> Well, if you looked further than the first few rows, it wouldn't be
> "always zero".  But the select from the partitioned table will read
> the first partition first, and that partition will have the rows
> with d1=0, by definition.
> 
> =# explain select * from case_test2 limit 10;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.19 rows=10 width=8)
>    ->  Append  (cost=0.00..1987.90 rows=102260 width=8)
>          ->  Seq Scan on case_test2_0 case_test2_1  (cost=0.00..478.84 rows=33184 width=8)
>          ->  Seq Scan on case_test2_1 case_test2_2  (cost=0.00..480.86 rows=33286 width=8)
>          ->  Seq Scan on case_test2_2 case_test2_3  (cost=0.00..484.30 rows=33530 width=8)
>          ->  Seq Scan on case_test2_3 case_test2_4  (cost=0.00..32.60 rows=2260 width=8)
> (6 rows)
> 
> The result appears sorted by d1, but that's an implementation artifact.

Wow, thanks.  Not sure how I missed something so obvious.  I just saw it
myself by generating only 10 rows and noticing the numbers were always
increasing.


-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Andrey Borodin  writes:
> On Thu, Jan 26, 2023 at 3:04 PM Tom Lane  wrote:
>> Indeed, it seems like this behavior makes pg_xact_status() basically
>> useless as things stand.

> If we agree that xid allocation is not something persistent, let's fix
> the test?

If we're not going to fix this behavior, we need to fix the docs
to disclaim that pg_xact_status() is of use for what it's said
to be good for.

regards, tom lane




Re: Partition key causes problem for volatile target list query

2023-01-26 Thread Tom Lane
Bruce Momjian  writes:
> I have found an odd behavior --- a query in the target list that assigns
> to a partitioned column causes queries that would normally be volatile
> to return always zero.

Well, if you looked further than the first few rows, it wouldn't be
"always zero".  But the select from the partitioned table will read
the first partition first, and that partition will have the rows
with d1=0, by definition.

=# explain select * from case_test2 limit 10;
                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.19 rows=10 width=8)
   ->  Append  (cost=0.00..1987.90 rows=102260 width=8)
         ->  Seq Scan on case_test2_0 case_test2_1  (cost=0.00..478.84 rows=33184 width=8)
         ->  Seq Scan on case_test2_1 case_test2_2  (cost=0.00..480.86 rows=33286 width=8)
         ->  Seq Scan on case_test2_2 case_test2_3  (cost=0.00..484.30 rows=33530 width=8)
         ->  Seq Scan on case_test2_3 case_test2_4  (cost=0.00..32.60 rows=2260 width=8)
(6 rows)

The result appears sorted by d1, but that's an implementation artifact.

regards, tom lane




Re: Something is wrong with wal_compression

2023-01-26 Thread Andrey Borodin
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane  wrote:
>
> Indeed, it seems like this behavior makes pg_xact_status() basically
> useless as things stand.
>

If we agree that xid allocation is not something persistent, let's fix
the test? We can replace a check with select * from pg_class or,
maybe, add an amcheck run.
As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.


Best regards, Andrey Borodin.




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 05:41:32PM -0500, Tom Lane wrote:
> Well, it's not a hint.  I think the above is fine for non-password
> cases, but for passwords maybe
> 
>   ERROR:  permission denied to alter role password
>   DETAIL:  To change another role's password, you must have CREATEROLE 
> privilege and ADMIN OPTION on role "%s".

Okay.  I used this phrasing in v3.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 02938746f72d0545d6c5cc3d1e9c6dc30e6e4175 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 26 Jan 2023 11:05:13 -0800
Subject: [PATCH v3 1/1] Improve user.c error messages.

---
 src/backend/commands/user.c   | 166 --
 src/test/regress/expected/create_role.out |  77 ++
 src/test/regress/expected/dependency.out  |   4 +
 src/test/regress/expected/privileges.out  |  23 +--
 4 files changed, 196 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index 3a92e930c0..ad22a89298 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
 		if (!has_createrole_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied to create role")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles.",
+			   "CREATEROLE")));
 		if (issuper)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must be superuser to create superusers")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "SUPERUSER", "SUPERUSER")));
 		if (createdb && !have_createdb_privilege())
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have createdb permission to create createdb users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "CREATEDB", "CREATEDB")));
 		if (isreplication && !has_rolreplication(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have replication permission to create replication users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "REPLICATION", "REPLICATION")));
 		if (bypassrls && !has_bypassrls_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have bypassrls to create bypassrls users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "BYPASSRLS", "BYPASSRLS")));
 	}
 
 	/*
@@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	roleid = authform->oid;
 
 	/* To mess with a superuser in any way you gotta be superuser. */
-	if (!superuser() && (authform->rolsuper || dissuper))
+	if (!superuser() && authform->rolsuper)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to alter superuser roles or change superuser attribute")));
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to alter roles with %s.",
+		   "SUPERUSER", "SUPERUSER")));
+	if (!superuser() && dissuper)
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to change the %s attribute.",
+		   "SUPERUSER", "SUPERUSER")));
 
 	/*
 	 * Most changes to a role require that you both have CREATEROLE privileges
@@ -761,13 +779,17 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 			dvalidUntil || disreplication || dbypassRLS)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied")));
+	 errmsg("permission denied to alter role"),
+	 errdetail("You must have %s privilege and %s on role \"%s\".",
+			   "CREATEROLE", "ADMIN OPTION", rolename)));
 
 		/* an unprivileged user can change their own password */
 		if (dpassword && roleid != currentUserId)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have CREATEROLE privilege to change another user's password")));
+	 errmsg("permission denied to alter role"),
+	 errdetail("To change another role's password, you must have %s privilege and %s on the role.",
+			   "CREATEROLE", "ADMIN OPTION")));
 	}
 	else if (!superuser())
 	{
@@ -779,23 +801,30 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 		if (dcreatedb && !have_createdb_privilege())
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have createdb privilege to change createdb attribute")));
+	 errmsg("permission denied to 

Partition key causes problem for volatile target list query

2023-01-26 Thread Bruce Momjian
I have found an odd behavior --- a query in the target list that assigns
to a partitioned column causes queries that would normally be volatile
to return always zero.

In the first query, no partitioning is used:

 d1 | d2
+
  1 |  0
  2 |  0
  2 |  1
  1 |  0
  1 |  2
  1 |  2
  1 |  0
  0 |  2
  2 |  0
  2 |  2

In the next query, 'd1' is a partition key and it gets a constant value
of zero for all rows:

 d1 | d2
+
-->   0 |  1
-->   0 |  2
  0 |  2
  0 |  1
  0 |  2
  0 |  1
  0 |  2
  0 |  2
  0 |  2
  0 |  2

The self-contained query is attached.  The value is _always_ zero, which
suggests random() is not being called;  calling setseed() does not
change that.  If I change "SELECT x" with "SELECT 2", the "2" is used. 
I see this behavior back to PG 11.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.


partition_test.sql
Description: application/sql


Re: Rework of collation code, extensibility

2023-01-26 Thread Jeff Davis

Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of
cleaner code and a performance boost seems like reasonable
justification for this patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked
me to focus this thread on the refactoring, which I've done by reducing
it to 2 patches, and I left multilib ICU up to the other thread. He
also questioned the increased line count, but I think the currently-low
line count is due to bad style. PeterG provided some review comments,
in particular when to do the tiebreaking, which I addressed.

This patch has been around for a while, so I'll take a fresh look and
see if I see risk areas, and re-run a few sanity checks. Of course more
feedback would also be welcome.

PERFORMANCE:

==
Setup:
==

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between the
base and master, v9-0001 just adds some GUCs to make testing easier.

glibc 2.35      ICU 70.1
gcc 11.3.0      LLVM 14.0.0

built with meson (uses -O3)

$ perl text_generator.pl 1000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

=
Test queries:
=

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from
the Sort operator (as reported in EXPLAIN ANALYZE). Median of three
runs.


Results:


                 base     refactor   speedup

sort_abbreviated_keys=false:
  C              7377      7273       1.4%
  en_US         35081     35090       0.0%
  en-US-x-icu   20520     19465       5.4%

sort_abbreviated_keys=true:
  C              8105      8008       1.2%
  en_US         35067     34850       0.6%
  en-US-x-icu   22626     21507       5.2%

===
Conclusion:
===

These numbers can move +/-1 percentage point, so I'd interpret anything
less than that as noise. This happens to be the first run where all the
numbers favored the refactoring patch, but it is generally consistent
with what I had seen before.

The important part is that, for ICU, it appears to be a substantial
speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of
these gains due to an extra indirect function call.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS




text_generator.pl
Description: Perl program
From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v9 1/3] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse performance.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, allowed only for superusers (because an incorrect setting
could lead to wrong results), but can be granted to others.
---
 doc/src/sgml/config.sgml   | 40 ++
 src/backend/utils/adt/varlena.c| 10 +++---
 src/backend/utils/misc/guc_tables.c| 24 +
 src/backend/utils/sort/tuplesortvariants.c | 17 ++---
 4 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc009..8f55b89f35 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sort_abbreviated_keys (boolean)
+  
+   sort_abbreviated_keys configuration parameter
+  
+  
+  
+   
+Enables or disables the use of abbreviated sort keys, a sort optimization,
+if applicable. The default is true. Disabling may
+be useful to diagnose

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund  wrote:
> > Most of the overhead of FREEZE WAL records (with freeze plan
> > deduplication and page-level freezing in) is generic WAL record header
> > overhead. Your recent adversarial test case is going to choke on that,
> > too. At least if you set checkpoint_timeout to 1 minute again.
>
> I don't quite follow. What do you mean with "record header overhead"? Unless
> that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

> The problematic case I am talking about is when we do *not* emit a WAL record
> during pruning (because there's nothing to prune), but want to freeze the
> table. If you don't log an FPI, the remaining big overhead is that increasing
> the LSN on the page will often cause an XLogFlush() when writing out the
> buffer.
>
> I don't see what your reference to checkpoint timeout is about here?
>
> Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> = 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

> > In many cases we'll have to dirty the page anyway, just to set
> > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> > triggered by an FPI or triggered by my now-reverted GUC) on being able
> > to set the whole page all-frozen in the VM.
>
> IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
> bits. But freezing does trigger one even without wal_log_hint_bits.

That is correct.

> You're right, it makes sense to consider whether we'll emit a
> XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

That's not perfect, of course, but it doesn't have to be. Perhaps it
should still be improved, just on general principle.

> > > A less aggressive version would be to check if any WAL records were 
> > > emitted
> > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI 
> > > if we
> > > modified the page again. Similar to what we do now, except not requiring 
> > > an
> > > FPI to have been emitted.
> >
> > Also way more aggressive. Not nearly enough on its own.
>
> In which cases will it be problematically more aggressive?
>
> If we emitted a WAL record during pruning we've already set the LSN of the
> page to a very recent LSN. We know the page is dirty. Thus we'll already
> trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects. You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

> > > But to me it seems a bit odd that VACUUM now is more aggressive if 
> > > checksums /
> > > wal_log_hint bits is on, than without them. Which I think is how using 
> > > either
> > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> >
> > Which part is the odd part? Is it odd that page-level freezing works
> > that way, or is it odd that page-level checksums work that way?
>
> That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be
specifically documented. But other than that, it seems reasonable
enough to me. I mean, should I not do something that's going to be of
significant help to users with checksums, just because it'll be
somewhat confusing to a small min

Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Jeff Davis
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> Maybe an easier way to enable or disable it in the source code with a
> #define would serve this.  Making it a GUC right away seems a bit 
> heavy-handed.  Further exploration and tweaking might well require 
> further source code changes, so relying on a source code level toggle
> would seem appropriate.

I am using these GUCs for testing the various collation paths in my
collation refactoring branch.

I find them pretty useful, and when I saw a regression, I thought
others might think it was useful, too. But if not I'll just leave them
in my branch and withdraw from this thread.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Jan 27, 2023 at 11:14 AM Tom Lane  wrote:
>> If any tuples made by that transaction had reached disk,
>> we'd have a problem.

> The problem is that the WAL wasn't flushed, allowing the same xid to
> be allocated again after crash recovery.  But for any data pages to
> hit the disk, we'd have to flush WAL first, so then it couldn't
> happen, no?

Ah, now I get the point: the "committed xact" seen after restart
isn't the same one as we saw before the crash, but a new one that
was given the same XID because nothing about the old one had made
it to disk yet.

> FWIW I also re-complained about the dangers of anyone
> relying on pg_xact_status() for its stated purpose after seeing
> tanager's failure[1].

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Jelte Fennema
On Thu, 26 Jan 2023 at 22:46, Andrew Dunstan  wrote:
> Hmm, but I usually run with -a, I even have a git alias for it. I guess
> what this discussion illustrates is that there are various patters for
> using git, and we shouldn't assume that everyone else is using the same
> patterns we are.

I definitely agree that there are lots of ways to use git. And I now
understand why my hook didn't work well for your existing workflow.

I've pretty much unlearned the -a flag. Because the easiest way I've
been able to split up changes into different commits is using "git add
-p", which adds partial pieces of files to the staging area. And that
workflow combines terribly with "git commit -a" because -a adds all
the things that I specifically didn't put in the staging area into the
final commit anyway.

> I'm still  mildly inclined to say this material would be better placed
> in the developer wiki. After all, this isn't the only thing a postgres
> developer might use a git hook for

I think it should definitely be somewhere. I have a preference for the
repo, since I think the docs on codestyle are already in too many
different places. But the wiki is already much better than having no
shared hook at all. I mainly think we should try to make it as easy as
possible for people to commit well indented code.

> (mine has more material in it than in what I posted).

Anything that is useful for the wider community and could be part of
this example/template git hook? (e.g. some perltidy automation)




Re: Something is wrong with wal_compression

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane  wrote:
> Andrey Borodin  writes:
> > On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
> >> That test case is demonstrating fundamental
> >> database corruption after a crash.
>
> > Not exactly corruption. XID was not persisted and buffer data did not
> > hit a disk. Database is in the correct state.
>
> Really?  I don't see how this part is even a little bit okay:
>
> [00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
> [00:40:50.745](0.001s)
> [00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
> #   at t/011_crash_recovery.pl line 57.
> [00:40:50.746](0.001s) #  got: 'committed'
> # expected: 'aborted'
>
> If any tuples made by that transaction had reached disk,
> we'd have a problem.

The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery.  But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no?  FWIW I also re-complained about the dangers of anyone
relying on pg_xact_status() for its stated purpose after seeing
tanager's failure[1].

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJ9p2JPPMA4eYAKq%3Dr9d_4_8vziet_tS1LEBbiny5-ypA%40mail.gmail.com




Re: improving user.c error messages

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote:
>> I think the password case needs to be kept separate, because the
>> conditions for it are different (specifically the exception that
>> you can alter your own password).  Lumping the rest together
>> seems OK to me.

> Hm.  In v2, the error message for both cases is the same:

>   ERROR:  permission denied to alter role
>   DETAIL:  You must have CREATEROLE privilege and ADMIN OPTION on role 
> "regress_priv_user2".

> We could add "to change its attributes" and "to change its password" to
> separate the two, but I'm not sure that adds much.  ISTM the current error
> message for ALTER ROLE PASSWORD implies that you can change your own
> password, and that's lost with my patch.  Perhaps we should add an
> errhint() with that information instead.  WDYT?

Well, it's not a hint.  I think the above is fine for non-password
cases, but for passwords maybe

ERROR:  permission denied to alter role password
DETAIL:  To change another role's password, you must have CREATEROLE 
privilege and ADMIN OPTION on role "%s".

regards, tom lane




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
>> Right, so more like this.

> LGTM

Thanks, pushed.

Returning to the prior patch ... I don't much care for this:

+/* Maybe there will be a free slot in a second... */
+retry_time = TimestampTzPlusSeconds(now, 1);
+LogRepWorkerUpdateSyncStartWakeup(retry_time);

We're not moving the goalposts very far on unnecessary wakeups if
we have to do that.  Do we need to get a wakeup on sync slot free?
Although having to send that to every worker seems ugly.  Maybe this
is being done in the wrong place and we need to find a way to get
the launcher to handle it.

As for the business about process_syncing_tables being only called
conditionally, I was already of the opinion that the way it's
getting called is loony.  Why isn't it called from LogicalRepApplyLoop
(and noplace else)?  With more certainty about when it runs, we might
not need so many kluges elsewhere.

regards, tom lane




Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Andrey Borodin  writes:
> On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
>> That test case is demonstrating fundamental
>> database corruption after a crash.

> Not exactly corruption. XID was not persisted and buffer data did not
> hit a disk. Database is in the correct state.

Really?  I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s) 
[00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) #  got: 'committed'
# expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

regards, tom lane




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
>>> Basically my question is whether having one error message for all of
>>> those cases is good enough, or whether we should be trying harder.
> 
> I think the password case needs to be kept separate, because the
> conditions for it are different (specifically the exception that
> you can alter your own password).  Lumping the rest together
> seems OK to me.

Hm.  In v2, the error message for both cases is the same:

ERROR:  permission denied to alter role
DETAIL:  You must have CREATEROLE privilege and ADMIN OPTION on role 
"regress_priv_user2".

We could add "to change its attributes" and "to change its password" to
separate the two, but I'm not sure that adds much.  ISTM the current error
message for ALTER ROLE PASSWORD implies that you can change your own
password, and that's lost with my patch.  Perhaps we should add an
errhint() with that information instead.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas  wrote:
> On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> > There is very good reason to believe that the large majority of all
> > data that people store in a system like Postgres is extremely cold
> > data:
>
> The systems where I end up troubleshooting problems seem to be, most
> typically, busy OLTP systems. I'm not in a position to say whether
> that's more or less common than systems with extremely cold data, but
> I am in a position to say that my employer will have a lot fewer happy
> customers if we regress that use case. Naturally I'm keen to avoid
> that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

-- 
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Andrew Dunstan


On 2023-01-26 Th 12:05, Jelte Fennema wrote:
>> At this stage the files are now indented, so if it failed and you run
>> `git commit` again it will commit with the indention changes.
> No, because at no point a "git add" is happening, so the changes
> made by pgindent are not staged. As long as you don't run the
> second "git commit" with the -a flag the commit will be exactly
> the same as you prepared it before.


Hmm, but I usually run with -a, I even have a git alias for it. I guess
what this discussion illustrates is that there are various patterns for
using git, and we shouldn't assume that everyone else is using the same
patterns we are.

I'm still  mildly inclined to say this material would be better placed
in the developer wiki. After all, this isn't the only thing a postgres
developer might use a git hook for (mine has more material in it than in
what I posted).


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Peter Eisentraut

On 25.01.23 22:16, Jeff Davis wrote:

I am highlighting this case because the existence of a single non-
contrived case or regression suggests that we may want to explore
further and tweak heuristics. That's quite natural when the heuristics
are based on a complex dependency like a collation provider. The
sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
lot easier.


Maybe an easier way to enable or disable it in the source code with a 
#define would serve this.  Making it a GUC right away seems a bit 
heavy-handed.  Further exploration and tweaking might well require 
further source code changes, so relying on a source code level toggle 
would seem appropriate.






Re: Something is wrong with wal_compression

2023-01-26 Thread Andrey Borodin
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane  wrote:
>
> That test case is demonstrating fundamental
> database corruption after a crash.
>

Not exactly corruption. XID was not persisted and buffer data did not
hit a disk. Database is in the correct state.

It was discussed long before WAL compression here [0]. The thing is it
is easier to reproduce with compression, but compression has nothing
to do with it, as far as I understand.

Proposed fix is here[1], but I think it's better to fix the test. It
should not verify the XID, but rather the side effects of "CREATE TABLE mine(x
integer);".


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/565FB155-C6B0-41E2-8C44-7B514DC25132%2540yandex-team.ru
[1] 
https://www.postgresql.org/message-id/flat/20210313012820.GJ29463%40telsasoft.com#0f18d3a4d593ea656fdc761e026fee81




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
> Right, so more like this.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> There is very good reason to believe that the large majority of all
> data that people store in a system like Postgres is extremely cold
> data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

> Having a separate aggressive step that rewrites an entire large table,
> apparently at random, is just a huge burden to users. You've said that
> you agree that it sucks, but somehow I still can't shake the feeling
> that you don't fully understand just how much it sucks.

Ha!

Well, that's possible. But maybe you don't understand how much your
patch makes other things suck.

I don't think we can really get anywhere here by postulating that the
problem is the other person's lack of understanding, even if such a
postulate should happen to be correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Cygwin cleanup

2023-01-26 Thread Justin Pryzby
Note that cirrus failed like this:

https://api.cirrus-ci.com/v1/artifact/task/4881596411543552/testrun/build/testrun/subscription/010_truncate/log/010_truncate_publisher.log

2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] ERROR:  could not 
open file "pg_logical/snapshots/0-14F2060.snap": Is a directory
2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', 
publication_names '"pub1"')
2023-01-25 23:17:10.418 GMT [29850][walsender] 
[pg_16413_sync_16394_7192732880582452157][6/0:0] PANIC:  could not open file 
"pg_logical/snapshots/0-14F2060.snap": No such file or directory
2023-01-25 23:17:10.418 GMT [29850][walsender] 
[pg_16413_sync_16394_7192732880582452157][6/0:0] STATEMENT:  START_REPLICATION 
SLOT "pg_16413_sync_16394_7192732880582452157" LOGICAL 0/14F2060 (proto_version 
'4', origin 'any', publication_names '"pub3"')

I don't understand how "Is a directory" happened ..

It looks like maybe the call stack would've been:

SnapBuildSerializationPoint()
xlog_decode() or standby_decode() ?
LogicalDecodingProcessRecord()
XLogSendLogical()
WalSndLoop()
StartLogicalReplication()

-- 
Justin




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
>> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
>> but I take your point that the subtraction would overflow with
>> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
>> make use of pg_sub_s64_overflow()?

> That would be my vote.  I think the 'diff <= 0' check might need to be
> replaced with something like 'start_time > stop_time' so that we return 0
> for the underflow case.

Right, so more like this.

regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
 		 * the system clock was set backward, so that such occurrences don't
 		 * have the effect of suppressing further progress messages.
 		 */
-		if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+		if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+			now < mysink->last_progress_report_time)
 		{
 			mysink->last_progress_report_time = now;
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)
 
 	if (next_wakeup != 0)
 	{
-		/* Ensure we don't exceed one minute, or go under 0. */
-		return Max(0,
-   Min(60 * 1000,
-	   TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-	   next_wakeup)));
+		int			ms;
+
+		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+   next_wakeup);
+		return Min(60 * 1000, ms);
 	}
 
 	return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 pgsocket	wait_fd = PGINVALID_SOCKET;
 int			rc;
 TimestampTz nextWakeup;
-int			nap;
+long		nap;
 
 /*
  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 	nextWakeup = Min(wakeup[i], nextWakeup);
 
-/*
- * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
- * timeouts longer than INT_MAX milliseconds, so we limit the
- * result accordingly.  Also, we round up to the next
- * millisecond to avoid waking up too early and spinning until
- * one of the wakeup times.
- */
+/* Calculate the nap time, clamping as necessary. */
 now = GetCurrentTimestamp();
-nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+nap = TimestampDifferenceMilliseconds(now, nextWakeup);
 
 /*
  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..47e059a409 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,26 +1690,31 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
- *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
  * We expect start_time <= stop_time.  If not, we return zero,
  * since then we're already past the previously determined stop_time.
  *
+ * Subtracting finite and infinite timestamps works correctly, returning
+ * zero or INT_MAX as appropriate.
+ *
  * Note we round up any fractional millisecond, since waiting for just
  * less than the intended timeout is undesirable.
  */
 long
 TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
 {
-	TimestampTz diff = stop_time - start_time;
+	TimestampTz diff;
 
-	if (diff <= 0)
+	/* Deal with zero or negative elapsed time quickly. */
+	if (start_time >= stop_time)
 		return 0;
+	/* To not fail with timestamp infinities, we must detect overflow. */
+	if (pg_sub_s64_overflow(stop_time, start_time, &diff))
+		return (long) INT_MAX;
+	if (diff >= (INT_MAX * INT64CONST(1000) - 999))
+		return (long) INT_MAX;
 	else
 		return (long) ((diff + 9
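
For readers skimming the timestamp.c hunk above: a minimal standalone sketch of
the clamping behaviour it implements, assuming int64 microsecond timestamps and
using GCC/Clang's __builtin_sub_overflow() as a stand-in for
pg_sub_s64_overflow(); an illustration, not the patch itself.

#include <limits.h>
#include <stdint.h>
#include <stdbool.h>

typedef int64_t TimestampTz;        /* microseconds, as in the backend */

/* Clamp (stop_time - start_time) to [0, INT_MAX] milliseconds, rounding up. */
static long
timestamp_diff_milliseconds(TimestampTz start_time, TimestampTz stop_time)
{
    int64_t     diff;

    /* zero or negative elapsed time */
    if (start_time >= stop_time)
        return 0;
    /* a huge stop_time minus a very negative start_time would overflow */
    if (__builtin_sub_overflow(stop_time, start_time, &diff))
        return (long) INT_MAX;
    if (diff >= (INT_MAX * INT64_C(1000) - 999))
        return (long) INT_MAX;
    return (long) ((diff + 999) / 1000);    /* round up fractional millisecond */
}
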

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro  wrote:
> On Fri, Jan 27, 2023 at 9:49 AM Tom Lane  wrote:
> > Tomas Vondra  writes:
> > > I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
> > > did not report any results for a couple of days, and it seems it got into
> > > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > > hashjoin, or something like that.
> >
> > > It seems to be progressing now, probably because I attached gdb to the
> > > workers to get backtraces, which does signals etc.
> >
> > That reminds me of cases that I saw several times on my now-deceased
> > animal florican:
> >
> > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
> >
> > There's clearly something rotten somewhere in there, but whether
> > it's our bug or FreeBSD's isn't clear.
>
> And if it's ours, it's possibly in latch code and not anything higher
> (I mean, not in condition variables, barriers, or parallel hash join)
> because I saw a similar hang in the shm_mq stuff which uses the latch
> API directly.  Note that 13 switched to kqueue but still used the
> self-pipe, and 14 switched to a signal event, and this hasn't been
> reported in those releases or later, which makes the poll() code path
> a key suspect.

Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but
13 did it the same way as 11 + 12.  So between 12 and 13 we have just
the poll -> kqueue change.




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas  wrote:
> > The overwhelming cost is usually FPIs in any case. If you're not
> > mostly focussing on that, you're focussing on the wrong thing. At
> > least with larger tables. You just have to focus on the picture over
> > time, across multiple VACUUM operations.
>
> I think that's all mostly true, but the cases where being more
> aggressive can cause *extra* FPIs are worthy of just as much attention
> as the cases where we can reduce them.

It's a question of our exposure to real problems, in no small part.
What can we afford to be wrong about? What problem can be fixed by the
user more or less as it emerges, and what problem doesn't have that
quality?

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/
https://brandur.org/fragments/events

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

-- 
Peter Geoghegan




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane  wrote:
> Tomas Vondra  writes:
> > I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
> > did not report any results for a couple of days, and it seems it got into
> > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > hashjoin, or something like that.
>
> > It seems to be progressing now, probably because I attached gdb to the
> > workers to get backtraces, which does signals etc.
>
> That reminds me of cases that I saw several times on my now-deceased
> animal florican:
>
> https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
>
> There's clearly something rotten somewhere in there, but whether
> it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in latch code and not anything higher
(I mean, not in condition variables, barriers, or parallel hash join)
because I saw a similar hang in the shm_mq stuff which uses the latch
API directly.  Note that 13 switched to kqueue but still used the
self-pipe, and 14 switched to a signal event, and this hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.




Re: Non-superuser subscription owners

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 12:36 PM Jeff Davis  wrote:
> On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote:
> > I have no issue with that as a long-term plan. However, I think that
> > for right now we should just introduce pg_create_subscription. It
> > would make sense to add pg_create_connection in the same patch that
> > adds a CREATE CONNECTION command (or whatever exact syntax we end up
> > with) -- and that patch can also change CREATE SUBSCRIPTION to
> > require
> > both privileges where a connection string is specified directly.
>
> I assumed it would be a problem to say that pg_create_subscription was
> enough to create a subscription today, and then later require
> additional privileges (e.g. pg_create_connection).
>
> If that's not a problem, then this sounds fine with me.

Wonderful! I'm working on a patch, but due to various distractions,
it's not done yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple times). It's not
just about reducing WAL overhead, it's also about reducing redundant
visibility checks - which are where a very significant portion of the CPU time
for VACUUMing goes to.

Besides performance considerations, it's also just plain weird that
lazy_scan_prune() can end up with a different visibility than
heap_page_prune() (mostly due to concurrent aborts).


The number of WAL records we often end up emitting for processing a single
page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.


> I think they're quite closely related (in that they both execute in VACUUM
> and are required for long-term system stability), and are even more related
> now that we have opportunistic page-level freezing. I think adding a "freeze
> this page as well"-flag in PRUNE records would go a long way to reducing the
> WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access
pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan  wrote:
> Relatively difficult for Andres, or for somebody else? What are the
> real parameters here? Obviously there are no clear answers available.

Andres is certainly smarter than the average guy, but practically any
scenario that someone can create in a few lines of SQL is something to
which code will be exposed to on some real-world system. If Andres
came along and said, hey, well I found a way to make this patch suck,
and proceeded to describe a scenario that involved a complex set of
tables and multiple workloads running simultaneously and using a
debugger to trigger some race condition and whatever, I'd be like "OK,
but is that really going to happen?". The actual scenario he came up
with is three lines of SQL, and it's nothing remotely obscure. That
kind of thing is going to happen *all the time*.

> The overwhelming cost is usually FPIs in any case. If you're not
> mostly focussing on that, you're focussing on the wrong thing. At
> least with larger tables. You just have to focus on the picture over
> time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tom Lane
Tomas Vondra  writes:
> I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
> did not report any results for a couple of days, and it seems it got into
> an infinite loop in REL_11_STABLE when building hash table in a parallel
> hashjoin, or something like that.

> It seems to be progressing now, probably because I attached gdb to the
> workers to get backtraces, which does signals etc.

That reminds me of cases that I saw several times on my now-deceased
animal florican:

https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us

There's clearly something rotten somewhere in there, but whether
it's our bug or FreeBSD's isn't clear.

regards, tom lane




Re: doc: add missing "id" attributes to extension packaging page

2023-01-26 Thread Brar Piening

On 18.01.2023 at 06:50, Brar Piening wrote:


I'll give it a proper look this weekend.


It turns out I didn't get a round tuit.

... and I'm afraid I probably will not be able to work on this until
mid/end February so we'll have to move this to the next commitfest until
somebody wants to take it over and push it through.






Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record 
> > is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was 
> > already
> > in s_b (since it likely won't be written out via the ringbuffer in that 
> > case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
> 
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.


> > Thus a modification of the above logic could be to opportunistically freeze 
> > if
> > a ) it won't cause an FPI and either
> > b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> > replacement in that case
> > or
> > b2) We wrote a WAL record during pruning, as the difference in flush 
> > position
> > is marginal
> >
> > An even more aggressive version would be to replace b1) with logic that'd
> > allow newly dirtying the page if it wasn't read through the ringbuffer. But
> > newly dirtying the page feels like it'd be more dangerous.
> 
> In many cases we'll have to dirty the page anyway, just to set
> PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> triggered by an FPI or triggered by my now-reverted GUC) on being able
> to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hint_bits.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.


> > A less aggressive version would be to check if any WAL records were emitted
> > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if 
> > we
> > modified the page again. Similar to what we do now, except not requiring an
> > FPI to have been emitted.
> 
> Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.



> > But to me it seems a bit odd that VACUUM now is more aggressive if 
> > checksums /
> > wal_log_hint bits is on, than without them. Which I think is how using 
> > either
> > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> 
> Which part is the odd part? Is it odd that page-level freezing works
> that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.


> In any case this seems like an odd thing for you to say, having
> eviscerated a patch that really just made the same behavior trigger
> independently of FPIs in some tables, controlled via a GUC.

That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record b) It doesn't trigger
if we would write an FPI.

Greetings,

Andres Freund
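
To make the heuristic above concrete (both the quoted a/b1/b2 proposal and its
restatement in this mail), here is a standalone sketch; the function and
parameter names are made up for illustration and are not PostgreSQL code:

#include <stdbool.h>

/*
 * Opportunistically freeze a page only when doing so cannot emit an FPI
 * (rule a), and either the page was already dirty before pruning, so no
 * extra ringbuffer write is caused (rule b1), or pruning already emitted a
 * WAL record, so the difference in flush position is marginal (rule b2).
 */
static bool
should_freeze_opportunistically(bool freezing_would_emit_fpi,
                                bool page_dirty_before_prune,
                                bool prune_emitted_wal_record)
{
    if (freezing_would_emit_fpi)
        return false;
    return page_dirty_before_prune || prune_emitted_wal_record;
}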




Re: wrong Append/MergeAppend elision?

2023-01-26 Thread Tom Lane
David Rowley  writes:
> On Fri, 27 Jan 2023 at 01:30, Amit Langote  wrote:
>> It seems that the planner currently elides an Append/MergeAppend that
>> has run-time pruning info (part_prune_index) set, but which I think is
>> a bug.

> There is still the trade-off of having to pull tuples through the
> Append node for when run-time pruning is unable to prune the last
> partition. So your proposal to leave the Append alone when there's
> run-time pruning info is certainly not a no-brainer.

Yeah.  Amit's proposal amounts to optimizing for the case that all
partitions get pruned, which does not seem to me to be the way
to bet.  I'm inclined to think it's fine as-is.

regards, tom lane




lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tomas Vondra
Hi,

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
did not report any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE when building hash table in a parallel
hashjoin, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which does signals etc.

Anyway, in 'ps ax' I saw this:

94545  -  Ss   0:03.39 postgres: buildfarm regression [local] SELECT
94627  -  Is   0:00.03 postgres: parallel worker for PID 94545
94628  -  Is   0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

select final > 1 as multibatch
  from hash_join_batches(
$$
  select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar
b2 using (id)) ss
on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
$$);

This started on 2023-01-20 23:23:18.125, and the next log (after I did
the gdb stuff), is from 2023-01-26 20:05:16.751. Quite a bit of time.

It seems all three processes are doing WaitEventSetWait, either through
a ConditionVariable, or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
more. But I see there's nodeHash and parallelism, and I recall there's a
lot of gotchas due to how the backends cooperate when building the hash
table, etc. Thomas, any idea what might be wrong?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[Switching to LWP 100877 of process 94628]
_poll () at _poll.S:4
#0  _poll () at _poll.S:4
No locals.
#1  0x88ef6850 in __thr_poll (fds=0x4, nfds=2, timeout=-1) at 
/usr/src/lib/libthr/thread/thr_syscalls.c:338
curthread = 0x8f09b000
ret = 
#2  0x007d3318 in WaitEventSetWaitBlock (set=0x8f19ceb0, 
cur_timeout=-1, occurred_events=0x80dcba98, nevents=1) at latch.c:1171
returned_events = 0
rc = 
cur_pollfd = 
cur_event = 
errflags = 
#3  WaitEventSetWait (set=0x8f19ceb0, timeout=, 
occurred_events=, occurred_events@entry=0x80dcba98, 
nevents=nevents@entry=1, wait_event_info=, 
wait_event_info@entry=134217745) at latch.c:1000
rc = 
start_time = {tv_sec = 32, tv_nsec = 2401711808}
cur_time = {tv_sec = 2161949248, tv_nsec = 9648676}
cur_timeout = -1
returned_events = 
#4  0x007f2898 in ConditionVariableSleep (cv=cv@entry=0x9c14f350, 
wait_event_info=wait_event_info@entry=134217745) at condition_variable.c:157
event = {pos = -1676348592, events = 0, fd = -2133017936, user_data = 
0x7d3128 }
done = 
#5  0x007cfd60 in BarrierArriveAndWait (barrier=0x9c14f338, 
barrier@entry=0x80dcbbc0, wait_event_info=134217745, wait_event_info@entry=1) 
at barrier.c:191
release = 
start_phase = 5
next_phase = 6
elected = 
#6  0x006b4808 in ExecParallelHashIncreaseNumBuckets 
(hashtable=hashtable@entry=0x8f0e16f0) at nodeHash.c:1572
pstate = 
i = 
chunk = 
chunk_s = 
#7  0x006b2d30 in ExecParallelHashTupleAlloc 
(hashtable=hashtable@entry=0x8f0e16f0, size=40, shared=shared@entry=0x80dcbbc0) 
at nodeHash.c:2795
growth = PHJ_GROWTH_NEED_MORE_BUCKETS
chunk = 
pstate = 0x9c14f2a0
chunk_size = 
chunk_shared = 
result = 
curbatch = 
#8  0x006b27ac in ExecParallelHashTableInsert 
(hashtable=hashtable@entry=0x8f0e16f0, slot=, 
slot@entry=0x8f0e19d8, hashvalue=, hashvalue@entry=3919329229) 
at nodeHash.c:1693
hashTuple = 
tuple = 0x8f0ebd78
batchno = 
shared = 2400066360
bucketno = 
#9  0x006b0ac8 in MultiExecParallelHash (node=0x8f0e0908) at 
nodeHash.c:288
outerNode = 0x8f0e1170
hashtable = 0x8f0e16f0
hashkeys = 0x8f0ebac8
econtext = 0x8f0e1508
pstate = 
build_barrier = 
slot = 0x8f0e19d8
i = 
hashvalue = 
#10 MultiExecHash (node=node@entry=0x8f0e0b70) at nodeHash.c:112
No locals.
#11 0x006a13e4 in MultiExecProcNode (node=node@entry=0x8f0e0908) at 
execProcnode.c:502
result = 
#12 0x006b5b00 in ExecHashJoinImpl (pstate=0x8f0e0b70, parallel=true) 
at nodeHashjoin.c:290
hashNode = 
econtext = 0x8f0e07d8
joinqual = 0x0
otherqual = 0x0
outerNode = 0x8f0e0a58
hashtable = 0x8f0e16f0
node = 0x8f0e0b70
parallel_state = 0x9c14f2a0
hashvalue = 
outerTupleSlot = 
batchno = 
build_barrier = 
#13 ExecParallelHashJoin (pstate=0x8f0e0b70) at nodeHashjoin.c:600
No locals.
#14 0x006a193c in ExecProcNodeInstr (node=0x8f0e0b70) at 
execProcnode.c:462
result = 
#15 0x0069a60c in ExecProcNode (node=0x8f0e0b70) at 
../../../src/include/e

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent
 wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records? I
> think they're quite closely related (in that they both execute in
> VACUUM and are required for long-term system stability), and are even
> more related now that we have opportunistic page-level freezing. I
> think adding a "freeze this page as well"-flag in PRUNE records would
> go a long way to reducing the WAL overhead of aggressive and more
> opportunistic freezing.

Yeah, we've talked about doing that in the past year. It's quite
possible. It would make quite a lot of sense, because the actual
overhead of the WAL record for freezing tends to come from the generic
WAL record header stuff itself. If there was only one record for both,
then you'd only need to include the relfilenode and block number (and
so on) once.

It would be tricky to handle Multis, so what you'd probably do is just
freeze xmin, and possibly aborted and locker XIDs in xmax. So you
wouldn't completely get rid of the main freeze record, but would be
able to avoid it in many important cases.

-- 
Peter Geoghegan
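
As a purely hypothetical sketch of the "one record for both" idea (this is not
the real xl_heap_prune or xl_heap_freeze_page layout, and the names are made
up), the fixed part of a combined record could look roughly like this, paying
for the record header and block reference only once:

#include <stdint.h>

typedef uint32_t TransactionId;

#define XLHP_HAS_FREEZE_PLANS   0x01    /* hypothetical flag */

typedef struct xl_heap_prune_freeze_sketch
{
    TransactionId snapshotConflictHorizon;
    uint16_t    nredirected;
    uint16_t    ndead;
    uint16_t    nplans;         /* deduplicated freeze plans; 0 if none */
    uint8_t     flags;          /* e.g. XLHP_HAS_FREEZE_PLANS */
    /* redirect/dead/unused offset arrays and freeze plans would follow */
} xl_heap_prune_freeze_sketch;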




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-01-26 Thread Reid Thompson
On Mon, 2023-01-23 at 12:31 -0800, Andres Freund wrote:
> Hi,
> 
> I think it's basically still waiting on author, until the O(N) cost is gone
> from the overflow limit check.
> 
> Greetings,
> 
> Andres Freund

Yes, just a rebase. There is still work to be done per earlier in the
thread.

I do want to follow up and note re palloc/pfree vs malloc/free that the
tracking code (0001-Add-tracking-...) is not tracking palloc/pfree but is
explicitly tracking malloc/free. Not every palloc/pfree call executes the
tracking code, only those where the path followed includes malloc() or
free().  Routine palloc() calls fulfilled from the context's
freelist/emptyblocks/freeblock/etc and pfree() calls not invoking free()
avoid the tracking code.

Thanks,
Reid
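
A tiny standalone sketch of that point (illustrative only, not the
0001-Add-tracking patch itself): the counter is touched only on paths that
actually reach malloc()/free(), so palloc()/pfree() calls satisfied from a
memory context's freelist never execute it.

#include <stdlib.h>
#include <stdint.h>

static int64_t tracked_bytes = 0;   /* per-backend counter in this sketch */

static void *
tracked_malloc(size_t size)
{
    void       *ptr = malloc(size);

    if (ptr != NULL)
        tracked_bytes += size;      /* only runs when malloc() really ran */
    return ptr;
}

static void
tracked_free(void *ptr, size_t size)
{
    free(ptr);
    tracked_bytes -= size;          /* only runs when free() really ran */
}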





Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
> Nathan Bossart  writes:
>> I wonder if we should explicitly reject negative timestamps to eliminate
>> any chance of int64 overflow, too.
> 
> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
> but I take your point that the subtraction would overflow with
> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
> make use of pg_sub_s64_overflow()?

That would be my vote.  I think the 'diff <= 0' check might need to be
replaced with something like 'start_time > stop_time' so that we return 0
for the underflow case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Something is wrong with wal_compression

2023-01-26 Thread Justin Pryzby
On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> > The symptom being exhibited by Michael's new BF animal tanager
> > is perfectly reproducible elsewhere.
> 
> I think these tests have always failed with wal_compression ?
> 
> https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
> https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com
> https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com

+ 
https://www.postgresql.org/message-id/c86ce84f-dd38-9951-102f-13a931210f52%40dunslane.net




Re: Minimal logical decoding on standbys

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:
> - I'm struggling to create a test for btree killtuples as there is a need for 
> rows removal on the table (that could produce a conflict too):
> Do you've a scenario in mind for this one? (and btw in what kind of WAL 
> record should the conflict be detected in such a case? xl_btree_delete?)

Hm, it might indeed be hard in "modern" postgres.  I think you'd need at least
two concurrent sessions, to prevent on-access pruning on the table.


DROP TABLE IF EXISTS indexdel;
CREATE TABLE indexdel(id int8 primary key);
INSERT INTO indexdel SELECT generate_series(1, 1);
VACUUM indexdel; -- ensure hint bits are set etc

DELETE FROM indexdel;

SELECT pg_current_wal_insert_lsn();

SET enable_indexonlyscan = false;
-- This scan finds that the index items are dead - but doesn't yet issue a
-- btree delete WAL record, that only happens when needing space on the page
-- again.
EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY 
id ASC;
SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;

-- The insertions into the range of values prev
INSERT INTO indexdel SELECT generate_series(1, 100);


Does generate the btree deletion record, but it also does emit a PRUNE (from
heapam_index_fetch_tuple() -> heap_page_prune_opt()).

While the session could create a cursor to prevent later HOT cleanup, the
query would also trigger hot pruning (or prevent the rows from being dead, if
you declare the cursor before the DELETE). So you'd need overlapping cursors
in a concurrent session...

Too complicated.

Greetings,

Andres Freund




Re: wrong Append/MergeAppend elision?

2023-01-26 Thread David Rowley
On Fri, 27 Jan 2023 at 01:30, Amit Langote  wrote:
> It seems that the planner currently elides an Append/MergeAppend that
> has run-time pruning info (part_prune_index) set, but which I think is
> a bug.

This is actually how I intended it to work. Whether it was a good idea
or not, I'm currently unsure. I mentioned it in [1].

I think the plan shapes I was talking about were some ordered paths
from partial paths per what is being added right at the end of
add_paths_to_append_rel().  However, now that I look at it again, I'm
not sure why it wouldn't be correct to still have those paths with a
single-child Append. Certainly, the "if (list_length(live_childrels)
== 1)" test made in add_paths_to_append_rel() is no longer aligned to
the equivalent test in set_append_references(), so it's possible even
now that we make a plan that uses the extra sorted partial paths added
in add_paths_to_append_rel() and still have the Append in the final
plan.

There is still the trade-off of having to pull tuples through the
Append node for when run-time pruning is unable to prune the last
partition. So your proposal to leave the Append alone when there's
run-time pruning info is certainly not a no-brainer.

[1] 
https://www.postgresql.org/message-id/cakjs1f_utf1mbp8ueobyaarzio4e4qb4z8fksurpam+3q_h...@mail.gmail.com




Re: Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
>> The symptom being exhibited by Michael's new BF animal tanager
>> is perfectly reproducible elsewhere.

> I think these tests have always failed with wal_compression ?

If that's a known problem, and we've done nothing about it,
that is pretty horrid.  That test case is demonstrating fundamental
database corruption after a crash.

regards, tom lane




Re: Something is wrong with wal_compression

2023-01-26 Thread Justin Pryzby
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> The symptom being exhibited by Michael's new BF animal tanager
> is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression ?

https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com
https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com

https://www.postgresql.org/message-id/ynqwd2gsmrnqw...@paquier.xyz
|My buildfarm machine has been changed to use wal_compression = lz4,
|while on it for HEAD runs.





Re: improving user.c error messages

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
>> Basically my question is whether having one error message for all of
>> those cases is good enough, or whether we should be trying harder.

I think the password case needs to be kept separate, because the
conditions for it are different (specifically the exception that
you can alter your own password).  Lumping the rest together
seems OK to me.

regards, tom lane
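
One possible shape for keeping the password case separate, sketched as a
fragment in the style of the patch quoted upthread (illustrative only; the
exact wording and condition grouping are assumptions, not the proposed patch):

    /* things an unprivileged user certainly can't do */
    if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
        dvalidUntil || disreplication || dbypassRLS)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to alter role"),
                 errdetail("You must have %s privilege and %s on role \"%s\".",
                           "CREATEROLE", "ADMIN OPTION", rolename)));

    /* an unprivileged user can still change their own password */
    if (dpassword && roleid != currentUserId)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to alter role"),
                 errdetail("To change another role's password, you must have %s privilege and %s on the role.",
                           "CREATEROLE", "ADMIN OPTION")));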




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
Nathan Bossart  writes:
> On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
>> - * Both inputs must be ordinary finite timestamps (in current usage,
>> - * they'll be results from GetCurrentTimestamp()).
>> + * At least one input must be an ordinary finite timestamp, else the "diff"
>> + * calculation might overflow.  We do support stop_time == 
>> TIMESTAMP_INFINITY,
>> + * which will result in INT_MAX wait time.

> I wonder if we should explicitly reject negative timestamps to eliminate
> any chance of int64 overflow, too.

Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
but I take your point that the subtraction would overflow with
TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
make use of pg_sub_s64_overflow()?

BTW, I just noticed that the adjacent function TimestampDifference
has a lot of callers that would be much happier using
TimestampDifferenceMilliseconds.

regards, tom lane
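
For illustration, the sort of caller simplification that last remark is about
(a hypothetical caller computing a WaitLatch() timeout; a fragment against the
backend headers, not standalone code):

#include "postgres.h"
#include "storage/latch.h"
#include "utils/timestamp.h"

/* hypothetical example caller */
static void
sleep_until(Latch *latch, TimestampTz stop_time)
{
    TimestampTz now = GetCurrentTimestamp();
    long        timeout_ms;

#ifdef OLD_STYLE
    long        secs;
    int         microsecs;

    TimestampDifference(now, stop_time, &secs, &microsecs);
    timeout_ms = secs * 1000 + microsecs / 1000;
#else
    /* simpler, and already clamped to the [0, INT_MAX] WaitLatch() accepts */
    timeout_ms = TimestampDifferenceMilliseconds(now, stop_time);
#endif

    (void) WaitLatch(latch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     timeout_ms, 0);
}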




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote:
> @@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
>   {
>   /* things an unprivileged user certainly can't do */
>   if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
> - dvalidUntil || disreplication || dbypassRLS)
> + dvalidUntil || disreplication || dbypassRLS ||
> + (dpassword && roleid != currentUserId))
>   ereport(ERROR,
>   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> - errmsg("permission denied")));
> -
> - /* an unprivileged user can change their own password */
> - if (dpassword && roleid != currentUserId)
> - ereport(ERROR,
> - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> - errmsg("must have CREATEROLE privilege to change another user's 
> password")));
> + errmsg("permission denied to alter role"),
> + errdetail("You must have %s privilege and %s on role \"%s\".",
> +"CREATEROLE", "ADMIN OPTION", rolename)));
>   }
>   else if (!superuser())
>   {
> 
> Basically my question is whether having one error message for all of
> those cases is good enough, or whether we should be trying harder. I
> don't mind if the conclusion is that it's OK as-is, and I'm not
> entirely sure what would be better. But when I was working on this
> code, all of those cases OR'd together feeding into a single error
> message seemed a little sketchy to me, so I am wondering what others
> think.

I wondered the same thing, but I hesitated because I didn't want to change
too much in a patch for error messaging.  I can give it a try.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas  wrote:
> I think it's pretty much impossible to freeze more aggressively
> without losing in some scenario or other. If waiting longer to freeze
> would have resulted in the data getting updated again or deleted
> before we froze it, then waiting longer reduces the total amount of
> freezing work that ever has to be done. Freezing more aggressively
> inevitably gives up some amount of that potential benefit in order to
> try to secure some other benefit. It's a trade-off.

There is no question about that.

> I think that the goal of a patch that makes vacuum more (or less)
> aggressive should be to make the cases where we lose as obscure as
> possible, and the cases where we win as broad as possible. I think
> that, in order to be a good patch, it needs to be relatively difficult
> to find cases where we incur a big loss. If it's easy to find a big
> loss, then I think it's better to stick with the current behavior,
> even if it's also easy to find a big gain.

Again, this seems totally uncontroversial. It's just incredibly vague,
and not at all actionable.

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

> However, I'm also not
> prepared to go all the way to the other end of the spectrum and say
> that all of your ideas and everything in this patch are great. I don't
> think either of those things, either.

It doesn't matter. I'm done with it. This is not a negotiation about
what gets in and what doesn't get in.

All that I aim to do now is to draw some kind of line under the basic
page-level freezing work, since of course I'm still responsible for
that. And perhaps to defend my personal reputation.

> I certainly think that freezing more aggressively in some scenarios
> could be a great idea, but it seems like the patch's theory is to be
> very nearly maximally aggressive in every vacuum run if the table size
> is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point --
that's its purpose. That is, we'll avoid accumulating all-visible
pages that eventually need to be frozen.

> I'm not exactly sure what information we should use to decide how
> aggressive to be, but I am pretty sure that the size of the table is
> not it.  It's true that, for a small table, the cost of having to
> eventually vacuum the whole table at once isn't going to be very high,
> whereas for a large table, it will be. That line of reasoning makes a
> size threshold sound reasonable. However, the amount of extra work
> that we can potentially do by vacuuming more aggressively *also*
> increases with the table size, which to me means using that as a
> criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work. Unfortunately, I
> don't really know how to do that effectively.

That has been proposed a couple of times in the context of this
thread. It won't work, because the way autovacuum works in general
(and likely always will work) doesn't allow it. With an append-only
table, each VACUUM will naturally have to scan significantly more
pages than the last one, forever (barring antiwraparound vacuums). Why
wouldn't it continue that way? I mean it might not (the table might
stop growing altogether), but then it doesn't matter much what we do.

If you're not behaving very proactively at the level of each VACUUM
operation, then the picture over time is that you're *already* falling
behind. At least with an append-only table. You have to think of the
sequence of operations, not just one.

> In theory we could have some system that tracks how
> recently each page range in a table has been modified, and direct our
> freezing activity toward the ones less-recently modified on the theory
> that they're not so likely to be modified again in the near future,
> but in reality we have no such system. So I don't really feel like I
> know what the right answer is here, yet.

So we need to come up with a way of getting reliable information from
the future, about an application that we have no particular
understanding of. As opposed to just eating the cost to some degree,
and making it configurable.

-- 
Peter Geoghegan




Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
> After looking closer, I see that TimestampDifferenceMilliseconds
> already explicitly states that its output is intended for WaitLatch
> and friends, which makes it perfectly sane for it to clamp the result
> to [0, INT_MAX] rather than depending on the caller to not pass
> out-of-range values.

+1

>   * This is typically used to calculate a wait timeout for WaitLatch()
>   * or a related function.  The choice of "long" as the result type
> - * is to harmonize with that.  It is caller's responsibility that the
> - * input timestamps not be so far apart as to risk overflow of "long"
> - * (which'd happen at about 25 days on machines with 32-bit "long").
> + * is to harmonize with that; furthermore, we clamp the result to at most
> + * INT_MAX milliseconds, because that's all that WaitLatch() allows.
>   *
> - * Both inputs must be ordinary finite timestamps (in current usage,
> - * they'll be results from GetCurrentTimestamp()).
> + * At least one input must be an ordinary finite timestamp, else the "diff"
> + * calculation might overflow.  We do support stop_time == 
> TIMESTAMP_INFINITY,
> + * which will result in INT_MAX wait time.

I wonder if we should explicitly reject negative timestamps to eliminate
any chance of int64 overflow, too.  Alternatively, we could detect that the
operation will overflow and return either 0 or INT_MAX, but I assume
there's minimal use of this function with negative timestamps.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Something is wrong with wal_compression

2023-01-26 Thread Tom Lane
The symptom being exhibited by Michael's new BF animal tanager
is perfectly reproducible elsewhere.

$ cat /home/postgres/tmp/temp_config
#default_toast_compression = lz4
wal_compression = lz4
$ export TEMP_CONFIG=/home/postgres/tmp/temp_config
$ cd ~/pgsql/src/test/recovery
$ make check PROVE_TESTS=t/011_crash_recovery.pl
...
+++ tap check in src/test/recovery +++
t/011_crash_recovery.pl .. 1/? 
#   Failed test 'new xid after restart is greater'
#   at t/011_crash_recovery.pl line 53.
# '729'
# >
# '729'

#   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
#  got: 'committed'
# expected: 'aborted'
# Looks like you failed 2 tests of 3.

Maybe this is somehow the test script's fault, but I don't see how.

It fails the same way with 'wal_compression = pglz', so I think it's
generic to that whole feature rather than specific to LZ4.

regards, tom lane




Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-26 Thread Drouvot, Bertrand

Hi,

On 1/26/23 10:42 AM, Alvaro Herrera wrote:

On 2023-Jan-26, Drouvot, Bertrand wrote:


On 1/24/23 7:27 PM, Alvaro Herrera wrote:



1. I don't think wait_for_write_catchup is necessary, because
calling wait_for_catchup() and omitting the 'mode' and 'lsn' arguments
would already do the same thing. 


Having a closer look, it does not seem to be the case. The default mode
in wait_for_catchup() is 'replay' and the default mode for the lsn is 'write'.

But in wait_for_write_catchup() we are making use of 'write' for both.




2. Because wait_for_replay_catchup is an instance method, passing the
second node as argument is needlessly noisy, because that's already
known as $self.  So we can just say

$primary_node->wait_for_replay_catchup($standby_node);


Yeah, but same here, there are places where the node passed as the second argument is not
the "$self":

src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_replay_catchup('node_c',
 $node_a);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
 $node_primary);
src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2,
 $node_primary);
src/test/recovery/t/001_stream_rep.pl:  
$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary);

So it looks like there is still a need for wait_for_replay_catchup() with 2 
parameters.


Ah, cascading replication.  In that case, let's make the second
parameter optional.  If it's not given, $self is used.



Good point, done in V3 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 52644c2c0d..7236a3b177 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -74,9 +74,8 @@ $node_a->safe_psql('postgres',
"INSERT INTO tbl1 values ('in A, before promotion')");
 $node_a->safe_psql('postgres', 'CHECKPOINT');
 
-my $lsn = $node_a->lsn('write');
-$node_a->wait_for_catchup('node_b', 'write', $lsn);
-$node_b->wait_for_catchup('node_c', 'write', $lsn);
+$node_a->wait_for_write_catchup('node_b', $node_a);
+$node_b->wait_for_write_catchup('node_c', $node_a);
 
 # Promote C
 #
@@ -160,7 +159,7 @@ in A, after C was promoted
 $node_a->safe_psql('postgres',
"INSERT INTO tbl1 values ('in A, after rewind')");
 
-$node_b->wait_for_catchup('node_c', 'replay', $node_a->lsn('write'));
+$node_b->wait_for_replay_catchup('node_c', $node_a);
 
 check_query(
'SELECT * FROM tbl1',
diff --git a/src/test/modules/brin/t/02_wal_consistency.pl 
b/src/test/modules/brin/t/02_wal_consistency.pl
index 5983ef208e..8b2b244feb 100644
--- a/src/test/modules/brin/t/02_wal_consistency.pl
+++ b/src/test/modules/brin/t/02_wal_consistency.pl
@@ -70,6 +70,6 @@ my ($ret, $out, $err) = $whiskey->psql(
});
 cmp_ok($out, '>=', 1);
 
-$whiskey->wait_for_catchup($charlie, 'replay', $whiskey->lsn('insert'));
+$whiskey->wait_for_replay_catchup($charlie);
 
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 04921ca3a3..3ba1545688 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2709,6 +2709,45 @@ sub wait_for_catchup
return;
 }
 
+# Now defining helper functions wait_for_replay_catchup() and
+# wait_for_write_catchup().
+# Please note that wait_for_flush_catchup() and wait_for_sent_catchup() are not
+# defined as: 1) they are not used yet and 2) it lets their author (if any)
+# decide the node->lsn() mode to be used.
+
+=pod
+
+=item $node->wait_for_replay_catchup(standby_name, node)
+
+Helper function for wait_for_catchup when waiting for the replay_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_replay_catchup
+{
+   my ($self, $standby_name, $node) = @_;
+   $node = defined($node) ? $node : $self;
+
+   $self->wait_for_catchup($standby_name, 'replay', $node->lsn('flush'));
+}
+
+=pod
+
+=item $node->wait_for_write_catchup(standby_name, node)
+
+Helper function for wait_for_catchup when waiting for the write_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_write_catchup
+{
+   my ($self, $standby_name, $node) = @_;
+
+   $self->wait_for_catchup($standby_name, 'write', $node->lsn('write'));
+}
+
 =pod
 
 =item $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
diff --git a/src/test/recovery/t/001_stream_rep.pl 
b/src/test/recovery/t/001_stream_rep.pl
index 23a90dd85b..76846905a7 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -47,9 +47,8 @@ $node_primary->safe_psql('postgres',
"CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
 
 # Wait for standbys to catch up
-my $primary_lsn = $node_primary->lsn('write');
-$node_primary->wait_for_catchup($node_standby_1, 'replay', $pri

Re: improving user.c error messages

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 2:14 PM Nathan Bossart  wrote:
> Yeah, it's probably better to say "to alter roles with %s" to refer to
> roles that presently have the attribute and "to change the %s attribute"
> when referring to privileges for the attribute.  I did this in v2, too.
>
> I've also switched from errhint() to errdetail() as suggested by Tom.

This seems fine to me in general but I'm not entirely sure about this part:

@@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
  {
  /* things an unprivileged user certainly can't do */
  if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit ||
- dvalidUntil || disreplication || dbypassRLS)
+ dvalidUntil || disreplication || dbypassRLS ||
+ (dpassword && roleid != currentUserId))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("permission denied")));
-
- /* an unprivileged user can change their own password */
- if (dpassword && roleid != currentUserId)
- ereport(ERROR,
- (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must have CREATEROLE privilege to change another user's password")));
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege and %s on role \"%s\".",
+"CREATEROLE", "ADMIN OPTION", rolename)));
  }
  else if (!superuser())
  {

Basically my question is whether having one error message for all of
those cases is good enough, or whether we should be trying harder. I
don't mind if the conclusion is that it's OK as-is, and I'm not
entirely sure what would be better. But when I was working on this
code, all of those cases OR'd together feeding into a single error
message seemed a little sketchy to me, so I am wondering what others
think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan  wrote:
> You complained about the descriptions being theoretical. But there's
> nothing theoretical about the fact that we more or less do *all*
> freezing in an eventual aggressive VACUUM in many important cases,
> including very simple cases like pgbench_history -- the simplest
> possible append-only table case. We'll merrily rewrite the entire
> table, all at once, for no good reason at all. Consistently, reliably.
> It's so incredibly obvious that this makes zero sense! And yet I don't
> think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think
that what this patch does is make vacuum more aggressive, and I
expect that would help this problem. I haven't said much about that
because I don't think it's controversial. However, the patch also has
a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain. There's nothing wonderful
about the current behavior, but (to paraphrase what I think Andres has
already said several times) it's better to keep shipping code with the
same bad behavior than to put out a new major release with behaviors
that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that
you're a bad person, or a bad developer, or that you have no good
ideas, or that you have no good ideas about this topic, or that this
topic is not important, or that we don't need to do better than we are
currently doing. I think none of those things. However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.
I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it.  It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively. If we knew that the
table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than
letting any single vacuum do all of it, but we don't have that sort of
information. Also, even if we did have that sort of information, the
idea only works if the pages that we freeze sooner are ones that we're
not about to update or delete again, and we don't have any idea what
is likely there. In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Matthias van de Meent
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan  wrote:
>
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > I assume the case you're thinking of is that pruning did *not* do any 
> > changes,
> > but in the process of figuring out that nothing needed to be pruned, we did 
> > a
> > MarkBufferDirtyHint(), and as part of that emitted an FPI?
>
> Yes.
>
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record 
> > is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was 
> > already
> > in s_b (since it likely won't be written out via the ringbuffer in that 
> > case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

-Matthias




Re: improving user.c error messages

2023-01-26 Thread Nathan Bossart
Thanks for taking a look.

On Thu, Jan 26, 2023 at 10:07:39AM +0100, Alvaro Herrera wrote:
> Please use 
>   errdetail("You must have %s privilege to create roles with %s.",
>   "SUPERUSER", "SUPERUSER")));
> 
> in this kind of message where multiple copies appear that only differ in
> the keyword to use, to avoid creating four copies of essentially the
> same string.
> 
> This applies in several places.

I did this in v2.

>> - errmsg("must have createdb privilege 
>> to change createdb attribute")));
>> + errmsg("permission denied to alter 
>> role"),
>> + errhint("You must have CREATEDB 
>> privilege to alter roles with CREATEDB.")));
> 
> I think this one is a bit ambiguous; does "with" mean that roles that
> have that priv cannot be changed, or does it mean that you cannot meddle
> with that bit in particular?  I think it'd be better to say
>   "You must have %s privilege to change the %s attribute."
> or something like that.

Yeah, it's probably better to say "to alter roles with %s" to refer to
roles that presently have the attribute and "to change the %s attribute"
when referring to privileges for the attribute.  I did this in v2, too.

I've also switched from errhint() to errdetail() as suggested by Tom.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From a6db1f28f0e7079193af4fca3fde27cf4780dbb7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 26 Jan 2023 11:05:13 -0800
Subject: [PATCH v2 1/1] Improve user.c error messages.

---
 src/backend/commands/user.c   | 171 --
 src/test/regress/expected/create_role.out |  77 ++
 src/test/regress/expected/dependency.out  |   4 +
 src/test/regress/expected/privileges.out  |  23 ++-
 4 files changed, 195 insertions(+), 80 deletions(-)

diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index 3a92e930c0..e2c80f5060 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
 		if (!has_createrole_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied to create role")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles.",
+			   "CREATEROLE")));
 		if (issuper)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must be superuser to create superusers")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "SUPERUSER", "SUPERUSER")));
 		if (createdb && !have_createdb_privilege())
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have createdb permission to create createdb users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "CREATEDB", "CREATEDB")));
 		if (isreplication && !has_rolreplication(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have replication permission to create replication users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "REPLICATION", "REPLICATION")));
 		if (bypassrls && !has_bypassrls_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have bypassrls to create bypassrls users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "BYPASSRLS", "BYPASSRLS")));
 	}
 
 	/*
@@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	roleid = authform->oid;
 
 	/* To mess with a superuser in any way you gotta be superuser. */
-	if (!superuser() && (authform->rolsuper || dissuper))
+	if (!superuser() && authform->rolsuper)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to alter superuser roles or change superuser attribute")));
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to alter roles with %s.",
+		   "SUPERUSER", "SUPERUSER")));
+	if (!superuser() && dissuper)
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied to alter role"),
+ errdetail("You must have %s privilege to change the %s attribute.",
+		   "SUPERUSER", "SUPERUSER")));
 
 	/*
 	 * Most changes to a role require that you both have CREATEROLE privileges
@@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	{
 		/* things an unprivileged user certainly can't do */
 		if (dinherit || dcreaterole || dcreatedb || dcanlogin || dc

Re: suppressing useless wakeups in logical/worker.c

2023-01-26 Thread Tom Lane
I wrote:
>>> It'd probably be reasonable to file down that sharp edge by instead
>>> specifying that TimestampDifferenceMilliseconds will clamp overflowing
>>> differences to LONG_MAX.  Maybe there should be a clamp on the underflow
>>> side too ... but should it be to LONG_MIN or to zero?

After looking closer, I see that TimestampDifferenceMilliseconds
already explicitly states that its output is intended for WaitLatch
and friends, which makes it perfectly sane for it to clamp the result
to [0, INT_MAX] rather than depending on the caller to not pass
out-of-range values.

I checked existing callers, and found one place in basebackup_copy.c
that had not read the memo about TimestampDifferenceMilliseconds
never returning a negative value, and another in postmaster.c that
had not read the memo about Min() and Max() being macros.  There
are none that are unhappy about clamping to INT_MAX, and at least
one that was already assuming it could just cast the result to int.

Hence, I propose the attached.  I haven't gone as far as to change
the result type from long to int; that seems like a lot of code
churn (if we are to update WaitLatch and all callers to match)
and it won't really buy anything semantically.
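
To be concrete, the clamping I have in mind behaves roughly like the
sketch below (an illustration of the [0, INT_MAX] contract only, not the
exact patch text; it leans on pg_sub_s64_overflow() to cope with
TIMESTAMP_INFINITY):

long
TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
{
    TimestampTz diff;

    /* Negative differences (e.g. clock set backward) clamp to zero. */
    if (start_time >= stop_time)
        return 0;

    /* Avoid int64 overflow when stop_time is TIMESTAMP_INFINITY. */
    if (pg_sub_s64_overflow(stop_time, start_time, &diff))
        return (long) INT_MAX;

    /* Round up to the next millisecond, then clamp to INT_MAX. */
    return (long) Min((TimestampTz) INT_MAX, (diff + 999) / 1000);
}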

regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
 		 * the system clock was set backward, so that such occurrences don't
 		 * have the effect of suppressing further progress messages.
 		 */
-		if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+		if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+			now < mysink->last_progress_report_time)
 		{
 			mysink->last_progress_report_time = now;
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)
 
 	if (next_wakeup != 0)
 	{
-		/* Ensure we don't exceed one minute, or go under 0. */
-		return Max(0,
-   Min(60 * 1000,
-	   TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-	   next_wakeup)));
+		int			ms;
+
+		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+   next_wakeup);
+		return Min(60 * 1000, ms);
 	}
 
 	return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 pgsocket	wait_fd = PGINVALID_SOCKET;
 int			rc;
 TimestampTz nextWakeup;
-int			nap;
+long		nap;
 
 /*
  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
 	nextWakeup = Min(wakeup[i], nextWakeup);
 
-/*
- * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
- * timeouts longer than INT_MAX milliseconds, so we limit the
- * result accordingly.  Also, we round up to the next
- * millisecond to avoid waking up too early and spinning until
- * one of the wakeup times.
- */
+/* Calculate the nap time, clamping as necessary. */
 now = GetCurrentTimestamp();
-nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+nap = TimestampDifferenceMilliseconds(now, nextWakeup);
 
 /*
  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..b1d1963729 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,12 +1690,12 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * At least one input must be an ordinary finite timestamp, else the "diff"
+ * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
+ * which will result in INT_MAX wait time.
  *
  * We expect start_time <= stop_tim

Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Tomas Vondra



On 1/26/23 16:40, Andres Freund wrote:
> Hi,
> 
> On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
>> It's not clear to me how could it cause deadlocks, as we're not waiting
>> for a lock/resource locked by someone else, but it's certainly an issue
>> for uninterruptible hangs.
> 
> Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of
> lock-ordering rules.
> 

Not sure what lock ordering issues you have in mind, but I agree it's
not the right place for the sleep, no argument here.

> 
>>> My best idea for how to implement this in a somewhat safe way would be for
>>> XLogInsertRecord() to set a flag indicating that we should throttle, and set
>>> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
>>> to
>>> proceed (i.e. we'll not be in a critical / interrupts off section) can
>>> actually perform the delay.  That should fix the hard deadlock danger and
>>> remove most of the increase in lock contention.
>>>
>>
>> The solution I've imagined is something like autovacuum throttling - do
>> some accounting of how much "WAL bandwidth" each process consumed, and
>> then do the delay/sleep in a suitable place.
> 
> Where would such a point be, except for ProcessInterrupts()? Iteratively
> adding a separate set of "wal_delay()" points all over the executor,
> commands/, ... seems like a bad plan.
> 

I haven't thought about where to do the work, TBH. ProcessInterrupts()
may very well be a good place.

I should have been clearer, but the main benefit of autovacuum-like
throttling is IMHO that it involves all processes and a global limit,
while the current approach is per-backend.

> 
>>> I don't think doing an XLogFlush() of a record that we just wrote is a good
>>> idea - that'll sometimes significantly increase the number of fdatasyncs/sec
>>> we're doing. To make matters worse, this will often cause partially filled 
>>> WAL
>>> pages to be flushed out - rewriting a WAL page multiple times can
>>> significantly increase overhead on the storage level. At the very least 
>>> this'd
>>> have to flush only up to the last fully filled page.
>>>
>>
>> I don't see why would that significantly increase the number of flushes.
>> The goal is to do this only every ~1MB of a WAL or so, and chances are
>> there were many (perhaps hundreds or more) commits in between. At least
>> in a workload with a fair number of OLTP transactions (which is kinda
>> the point of all this).
> 
> Because the accounting is done separately in each process. Even if you just
> add a few additional flushes for each connection, in aggregate that'll be a
> lot.
> 

How come? Imagine the backend does flush only after generating e.g. 1MB
of WAL. Small transactions won't do any additional flushes at all
(because commit resets the accounting). Large transactions will do an
extra flush every 1MB, so 16x per WAL segment. But in between there will
be many commits from the small transactions. If we backoff to the last
complete page, that should eliminate even most of these flushes.

So where would the additional flushes come from?

> 
>> And the "small" OLTP transactions don't really do any extra fsyncs,
>> because the accounting resets at commit (or places that flush WAL).
>> [...]
>>> Just counting the number of bytes inserted by a backend will make the 
>>> overhead
>>> even worse, as the flush will be triggered even for OLTP sessions doing tiny
>>> transactions, even though they don't contribute to the problem you're trying
>>> to address.  How about counting how many bytes of WAL a backend has inserted
>>> since the last time that backend did an XLogFlush()?
>>>
>>
>> No, we should reset the counter at commit, so small OLTP transactions
>> should not reach really trigger this. That's kinda the point, to only
>> delay "large" transactions producing a lot of WAL.
> 
> I might have missed something, but I don't think the first version of patch
> resets the accounting at commit?
> 

Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test
the behavior / demonstrate the effect.
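
(For the record, the reset-at-commit accounting I'm thinking of for the
next version looks roughly like the sketch below. wal_throttle_threshold
and the function names are placeholders, not what the PoC patch
currently does.)

static uint64  backendWalSinceThrottle = 0;

/* accounting hook, e.g. called with the size of each inserted record */
void
WalThrottleAccount(uint32 nbytes)
{
    backendWalSinceThrottle += nbytes;
}

/* called from a safe place, e.g. ProcessInterrupts() */
void
WalThrottleMaybeWait(void)
{
    XLogRecPtr  flushTo;

    if (backendWalSinceThrottle < wal_throttle_threshold)
        return;

    /* back off to the last completed WAL page, avoiding partial-page flushes */
    flushTo = GetXLogInsertRecPtr();
    flushTo -= flushTo % XLOG_BLCKSZ;

    XLogFlush(flushTo);
    SyncRepWaitForLSN(flushTo, false);

    backendWalSinceThrottle = 0;
}

/* reset at commit, so small OLTP transactions are never throttled */
void
WalThrottleAtCommit(void)
{
    backendWalSinceThrottle = 0;
}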

> 
>> Same for the flushes of partially flushed pages - if there's enough of
>> small OLTP transactions, we're already having this issue. I don't see
>> why would this make it measurably worse.
> 
> Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1
> could *not* make it worse. Suddenly you get a bunch of additional XLogFlush()
> calls to partial boundaries by every autovacuum, by every process doing a bit
> more bulky work. Because you're flushing the LSN after a record you just
> inserted, you're commonly not going to be "joining" the work of an already
> in-process XLogFlush().
> 

Right. We do ~16 additional flushes per 16MB segment (or something like
that, depending on the GUC value). Considering we do thousands of commits
per segment, each of which does a flush, I don't see how this would make
it measurably worse.

> 
>> But if needed, we can simply backoff to the last page boundary, s

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> I assume the case you're thinking of is that pruning did *not* do any changes,
> but in the process of figuring out that nothing needed to be pruned, we did a
> MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

> > That's going to be very significantly more aggressive. For example
> > it'll impact small tables very differently.
>
> Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> relatively small, with one important exception below, if we are 99.99% sure
> that it's not going to require an FPI and isn't going to dirty the page.
>
> The exception is that a newer LSN on the page can cause the ringbuffer
> replacement to trigger more aggressive WAL flushing. No meaningful
> difference if we modified the page during pruning, or if the page was already
> in s_b (since it likely won't be written out via the ringbuffer in that case),
> but if checksums are off and we just hint-dirtied the page, it could be a
> significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

> Thus a modification of the above logic could be to opportunistically freeze if
> a ) it won't cause an FPI and either
> b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> replacement in that case
> or
> b2) We wrote a WAL record during pruning, as the difference in flush position
> is marginal
>
> An even more aggressive version would be to replace b1) with logic that'd
> allow newly dirtying the page if it wasn't read through the ringbuffer. But
> newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

> A less aggressive version would be to check if any WAL records were emitted
> during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> modified the page again. Similar to what we do now, except not requiring an
> FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

> But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> wal_log_hint bits is on, than without them. Which I think is how using either
> of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

-- 
Peter Geoghegan




Re: drop postmaster symlink

2023-01-26 Thread Karl O. Pinc
On Wed, 25 Jan 2023 18:03:25 -0600
"Karl O. Pinc"  Buried in
> https://www.postgresql.org/message-id/20230107165942.748ccf4e%40slate.karlpinc.com
> is the one change I see that should be made.
> 
> > In doc/src/sgml/ref/allfiles.sgml at line 222 there is an ENTITY
> > defined which references the deleted postmaster.sgml file.  
> 
> This line needs to be removed and the
> 0002-Don-t-install-postmaster-symlink-anymore.patch 
> updated.  (Unless there's some magic going on
> with the various allfiles.sgml files of which I am
> not aware.)
> 
> If this is fixed I see no other problems.

Buried in the same email, and I apologize for not mentioning
this, is one other bit of documentation text that might
or might not need attention. 

> I see a possible problem at line 1,412 of runtime.sgml

This says:

 in the postmaster's startup script just before invoking the postmaster.

Depending on how this is read, it could be interpreted to mean
that a "postmaster" binary is invoked.  It might be more clear
to write: ... just before invoking postgres.

Or it might not be worth bothering about; at this point, probably
not, but I thought you might want the heads-up anyway.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Add LZ4 compression in pg_dump

2023-01-26 Thread Justin Pryzby
On Wed, Jan 25, 2023 at 07:57:18PM +, gkokola...@pm.me wrote:
> On Wednesday, January 25th, 2023 at 7:00 PM, Justin Pryzby 
>  wrote:
> > While looking at this, I realized that commit 5e73a6048 introduced a
> > regression:
> > 
> > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
> > 
> > - if (AH->compression != 0)
> > 
> > - pg_log_warning("archive is compressed, but this installation does not 
> > support compression -- no data will be available");
> > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > 
> > + pg_fatal("archive is compressed, but this installation does not support 
> > compression");
> > 
> > Before, it was possible to restore non-data chunks of a dump file, even
> > if the current build didn't support its compression. But that's now
> > impossible - and it makes the code we're discussing in RestoreArchive()
> > unreachable.

On Thu, Jan 26, 2023 at 08:53:28PM +0900, Michael Paquier wrote:
> On Thu, Jan 26, 2023 at 11:24:47AM +, gkokola...@pm.me wrote:
> > I gave this a little bit of thought. I think that ReadHead should not
> > emit a warning, or at least not this warning as it is slightly misleading.
> > It implies that it will automatically turn off data restoration, which is
> > false. Further ahead, the code will fail with a conflicting error message
> > stating that the compression is not available.
> > 
> > Instead, it would be cleaner both for the user and the maintainer to
> > move the check in RestoreArchive and make it the sole responsible for
> > this logic.
> 
> -pg_fatal("cannot restore from compressed archive (compression not 
> supported in this installation)");
> +pg_fatal("cannot restore data from compressed archive (compression not 
> supported in this installation)");
> Hmm.  I don't mind changing this part as you suggest.
> 
> -#ifndef HAVE_LIBZ
> -   if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> -   pg_fatal("archive is compressed, but this installation does 
> not support compression");
> -#endif
> However I think that we'd better keep the warning, as it can offer a
> hint when using pg_restore -l not built with compression support if
> looking at a dump that has been compressed.

Yeah.  But the original log_warning text was better, and should be
restored:

-   if (AH->compression != 0)
-   pg_log_warning("archive is compressed, but this installation 
does not support compression -- no data will be available");

That commit also added this to pg-dump.c:

+   case PG_COMPRESSION_ZSTD:
+   pg_fatal("compression with %s is not yet supported", "ZSTD");
+   break;
+   case PG_COMPRESSION_LZ4:
+   pg_fatal("compression with %s is not yet supported", "LZ4");
+   break;

In 002, that could be simplified by re-using the supports_compression()
function.  (And maybe the same in WriteDataToArchive()?)
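
Something like the following sketch, assuming supports_compression()
keeps the signature from 0002 and returns false for algorithms this
build can't handle (get_compress_algorithm_name() being, I think, the
helper in common/compression.c):

    if (!supports_compression(compression_spec))
        pg_fatal("compression with %s is not yet supported",
                 get_compress_algorithm_name(compression_spec.algorithm));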

-- 
Justin




Re: meson oddities

2023-01-26 Thread Andres Freund
On 2023-01-26 10:20:58 +0100, Peter Eisentraut wrote:
> On 19.01.23 21:45, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-01-19 21:37:15 +0100, Peter Eisentraut wrote:
> > > On 11.01.23 12:05, Peter Eisentraut wrote:
> > > > I think there is also an adjacent issue:  The subdir options may be
> > > > absolute or relative.  So if you specify --prefix=/usr/local and
> > > > --sysconfdir=/etc/postgresql, then
> > > > 
> > > >       config_paths_data.set_quoted('SYSCONFDIR', dir_prefix / 
> > > > dir_sysconf)
> > > > 
> > > > would produce something like /usr/local/etc/postgresql.
> > 
> > I don't think it would. The / operator understands absolute paths and 
> > doesn't
> > add the "first component" if the second component is absolute.
> 
> Oh, that is interesting.  In that case, this is not the right patch.  We
> should proceed with my previous patch in [0] then.

WFM.

I still think it'd be slightly more legible if we tested the prefix for
postgres|pgsql once, rather than do the per-variable .contains() checks on the
"combined" path. But it's a pretty minor difference, and I'd have no problem
with you committing your version.

Basically:
is_pg_prefix = dir_prefix.contains('pgsql') or dir_prefix.contains('postgres')
...
if not (is_pg_prefix or dir_data.contains('pgsql') or dir_data.contains('postgres'))

instead of "your":

if not ((dir_prefix/dir_data).contains('pgsql') or (dir_prefix/dir_data).contains('postgres'))

Greetings,

Andres Freund




Re: fix and document CLUSTER privileges

2023-01-26 Thread Nathan Bossart
On Wed, Jan 25, 2023 at 08:27:57PM -0800, Jeff Davis wrote:
> Committed these extra clarifications. Thank you.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> > I think it's probably ok, but perhaps deserves a bit more thought about when
> > to "opportunistically" freeze. Perhaps to make it *more* aggressive than 
> > it's
> > now.
> >
> > With "opportunistic freezing" I mean freezing the page, even though we don't
> > *have* to freeze any of the tuples.
> >
> > The overall condition gating freezing is:
> > if (pagefrz.freeze_required || tuples_frozen == 0 ||
> > (prunestate->all_visible && prunestate->all_frozen &&
> >  fpi_before != pgWalUsage.wal_fpi))
> >
> > fpi_before is set before the heap_page_prune() call.
> 
> Have you considered page-level checksums, and how the impact on hint
> bits needs to be accounted for here?
> 
> All RDS customers use page-level checksums. And I've noticed that it's
> very common for the number of FPIs to only be very slightly less than
> the number of pages dirtied. Much of which is just hint bits. The
> "fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?


> > To me a condition that checked if the buffer is already dirty and if another
> > XLogInsert() would be likely to generate an FPI would make more sense. The
> > rare race case of a checkpoint starting concurrently doesn't matter IMO.
> 
> That's going to be very significantly more aggressive. For example
> it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
is marginal
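
In code, that gate could look roughly like this sketch (the function and
the two booleans are made-up names; XLogCheckBufferNeedsBackup() is the
existing check for whether another record would emit an FPI):

static bool
should_freeze_opportunistically(Buffer buf,
                                bool dirty_before_prune,
                                bool pruning_emitted_wal)
{
    /* a) another XLogInsert() touching the page must not generate an FPI */
    if (XLogCheckBufferNeedsBackup(buf))
        return false;

    /* b1) already dirty before pruning: no ringbuffer replacement cost added */
    if (dirty_before_prune)
        return true;

    /* b2) pruning already wrote WAL: difference in flush position is marginal */
    if (pruning_emitted_wal)
        return true;

    return false;
}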

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.


A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-01-26 Thread Jeff Davis
On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote:
> I have no issue with that as a long-term plan. However, I think that
> for right now we should just introduce pg_create_subscription. It
> would make sense to add pg_create_connection in the same patch that
> adds a CREATE CONNECTION command (or whatever exact syntax we end up
> with) -- and that patch can also change CREATE SUBSCRIPTION to
> require
> both privileges where a connection string is specified directly.

I assumed it would be a problem to say that pg_create_subscription was
enough to create a subscription today, and then later require
additional privileges (e.g. pg_create_connection).

If that's not a problem, then this sounds fine with me.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Jelte Fennema
> At this stage the files are now indented, so if it failed and you run
> `git commit` again it will commit with the indention changes.

No, because at no point is a "git add" happening, so the changes
made by pgindent are not staged. As long as you don't run the
second "git commit" with the -a flag the commit will be exactly
the same as you prepared it before.

> Sure, that's your choice. My intended audience here is committers, who
> of course do work on master.

Yes I understand, I meant it would be nice if the script had an environment
variable like PG_COMMIT_HOOK_ALL_BRANCHES (bad name)
for this purpose.

On Thu, 26 Jan 2023 at 17:54, Andrew Dunstan  wrote:
>
>
> On 2023-01-26 Th 11:16, Jelte Fennema wrote:
> > On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan  wrote:
> >> I didn't really like your hook, as it forces a reindent, and many people
> >> won't want that (for reasons given elsewhere in this thread).
> > I'm not sure what you mean by "forces a reindent". Like I explained
> > you can simply run "git commit" again to ignore the changes and
> > commit anyway. As long as the files are indented on your filesystem
> > the hook doesn't care if you actually included the indentation changes
> > in the changes that you're currently committing.
>
>
> Your hook does this:
>
>
> +git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\
> +xargs src/tools/pgindent/pgindent --silent-diff \
> +|| {
> +echo ERROR: Aborting commit because pgindent was not run
> +git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' | xargs src/tools/pgindent/pgindent
> +exit 1
> +}
>
>
> At this stage the files are now indented, so if it failed and you run
> `git commit` again it will commit with the indention changes.
>
>
> >
> > So to be completely clear you can do the following with my hook:
> > git commit # runs pgindent and fails
> > git commit # commits changes anyway
> > git commit -am 'Run pgindent' # commit indentation changes separately
> >
> > Or what I usually do:
> > git commit # runs pgindent and fails
> > git add --patch # choose relevant changes to add to commit
> > git commit # commit the changes
> > git checkout -- . # undo irrelevant changes on filesystem
> >
> > Honestly PGAUTOINDENT=no seems stricter, since the only
> > way to bypass the failure is now to run manually run pgindent
> > or git commit with the --no-verify flag.
> >
> >> files=$(git diff --cached --name-only --diff-filter=ACMR)
> >> src/tools/pgindent/pgindent $files
> > That seems like it would fail if there's any files or directories with
> > spaces in them. Maybe this isn't something we care about though.
>
>
> We don't have any, and the filenames git produces are relative to the
> git root. I don't think this is an issue.
>
>
> >
> >> # no need to filter files - pgindent ignores everything that isn't a
> >> # .c or .h file
> > If the first argument is a non .c or .h file, then pgindent interprets
> > it as the typedefs file. So it's definitely important to filter non .c
> > and .h files out. Because now if you commit a single
> > non .c or .h file this hook messes up the indentation in all of
> > your files. You can reproduce by running:
> > src/tools/pgindent/pgindent README
>
>
>
> I have a patch at [1] to remove this misfeature.
>
>
> >
> >> # only do this on master
> >> test  "$branch" = "master" || return 0
> > I would definitely want a way to disable this check. As a normal
> > submitter I never work directly on master.
>
>
> Sure, that's your choice. My intended audience here is committers, who
> of course do work on master.
>
>
> cheers
>
>
> andrew
>
>
> [1]  https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> I think it's probably ok, but perhaps deserves a bit more thought about when
> to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
> now.
>
> With "opportunistic freezing" I mean freezing the page, even though we don't
> *have* to freeze any of the tuples.
>
> The overall condition gating freezing is:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
>  fpi_before != pgWalUsage.wal_fpi))
>
> fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

> To me a condition that checked if the buffer is already dirty and if another
> XLogInsert() would be likely to generate an FPI would make more sense. The
> rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

--
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-01-26 Thread Andrew Dunstan


On 2023-01-26 Th 11:16, Jelte Fennema wrote:
> On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan  wrote:
>> I didn't really like your hook, as it forces a reindent, and many people
>> won't want that (for reasons given elsewhere in this thread).
> I'm not sure what you mean by "forces a reindent". Like I explained
> you can simply run "git commit" again to ignore the changes and
> commit anyway. As long as the files are indented on your filesystem
> the hook doesn't care if you actually included the indentation changes
> in the changes that you're currently committing.


Your hook does this:


+git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\
+    xargs src/tools/pgindent/pgindent --silent-diff \
+    || {
+    echo ERROR: Aborting commit because pgindent was not run
+    git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' | xargs src/tools/pgindent/pgindent
+    exit 1
+    }


At this stage the files are now indented, so if it failed and you run
`git commit` again it will commit with the indention changes.


>
> So to be completely clear you can do the following with my hook:
> git commit # runs pgindent and fails
> git commit # commits changes anyway
> git commit -am 'Run pgindent' # commit indentation changes separately
>
> Or what I usually do:
> git commit # runs pgindent and fails
> git add --patch # choose relevant changes to add to commit
> git commit # commit the changes
> git checkout -- . # undo irrelevant changes on filesystem
>
> Honestly PGAUTOINDENT=no seems stricter, since the only
> way to bypass the failure is now to run manually run pgindent
> or git commit with the --no-verify flag.
>
>> files=$(git diff --cached --name-only --diff-filter=ACMR)
>> src/tools/pgindent/pgindent $files
> That seems like it would fail if there's any files or directories with
> spaces in them. Maybe this isn't something we care about though.


We don't have any, and the filenames git produces are relative to the
git root. I don't think this is an issue.


>
>> # no need to filter files - pgindent ignores everything that isn't a
>> # .c or .h file
> If the first argument is a non .c or .h file, then pgindent interprets
> it as the typedefs file. So it's definitely important to filter non .c
> and .h files out. Because now if you commit a single
> non .c or .h file this hook messes up the indentation in all of
> your files. You can reproduce by running:
> src/tools/pgindent/pgindent README



I have a patch at [1] to remove this misfeature.


>
>> # only do this on master
>> test  "$branch" = "master" || return 0
> I would definitely want a way to disable this check. As a normal
> submitter I never work directly on master.


Sure, that's your choice. My intended audience here is committers, who
of course do work on master.


cheers


andrew


[1]  https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




