Re: pg_logical_emit_message() misses a XLogFlush()

2023-08-15 Thread Andres Freund
Hi,

On 2023-08-16 12:37:21 +0900, Michael Paquier wrote:
> I won't fight much if there are objections to backpatching, but that's
> not really wise (no idea how much EDB's closed flavor of BDR relies on
> that).

To be clear: I don't just object to backpatching, I also object to making
existing invocations flush WAL in HEAD. I do not at all object to adding a
parameter that indicates flushing, or a separate function to do so. The latter
might be better, because it allows you to flush a group of messages, rather
than a single one.
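
For illustration, a minimal sketch of what such an opt-in flush could look like
(hypothetical helper name; it only assumes the existing LogLogicalMessage() from
src/backend/replication/logical/message.c and XLogFlush(), and is not the
committed API):

#include "postgres.h"

#include "access/xlog.h"
#include "replication/message.h"

/* hypothetical wrapper, sketching an optional flush for logical messages */
XLogRecPtr
emit_message_opt_flush(const char *prefix, const char *message, size_t size,
                       bool transactional, bool flush)
{
    XLogRecPtr  lsn = LogLogicalMessage(prefix, message, size, transactional);

    /*
     * Transactional messages are made durable by the commit record's flush;
     * only flush here for non-transactional messages, and only on request.
     */
    if (flush && !transactional)
        XLogFlush(lsn);

    return lsn;
}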

Greetings,

Andres Freund




Re: pg_logical_emit_message() misses a XLogFlush()

2023-08-15 Thread Andres Freund
Hi,

On 2023-08-16 03:20:53 +0200, Tomas Vondra wrote:
> On 8/16/23 02:33, Andres Freund wrote:
> > Hi,
> >
> > On 2023-08-16 06:58:56 +0900, Michael Paquier wrote:
> >> On Tue, Aug 15, 2023 at 11:37:32AM +0200, Tomas Vondra wrote:
> >>> Shouldn't the flush be done only for non-transactional messages? The
> >>> transactional case will be flushed by regular commit flush.
> >>
> >> Indeed, that would be better.  I am sending an updated patch.
> >>
> >> I'd like to backpatch that, would there be any objections to that?
> >
> > Yes, I object. This would completely cripple the performance of some uses of
> > logical messages - a slowdown of several orders of magnitude. It's not clear
> > to me that flushing would be the right behaviour even if this weren't already released, but
> > it certainly doesn't seem right to make such a change in a minor release.
> >
>
> So are you objecting to adding the flush in general, or just to the
> backpatching part?

Both, I think. I don't object to adding a way to trigger flushing, but I think
it needs to be optional.


> IMHO we either guarantee durability of non-transactional messages, in which
> case this would be a clear bug - and I'd say a fairly serious one.  I'm
> curious what the workload that'd see an order-of-magnitude slowdown does
> with logical messages I've used it, but even if such a workload exists, would
> it really be enough to fix any other durability bug?

Not sure what you mean with the last sentence?

I've e.g. used non-transactional messages for:

- A non-transactional queuing system. Where sometimes one would dump a portion
  of tables into messages, with something like
  SELECT pg_logical_emit_message(false, 'app:', to_json(r)) FROM r;
  Obviously flushing after every row would be bad.

  This is useful when you need to coordinate with other systems in a
  non-transactional way. E.g. letting other parts of the system know that
  files on disk (or in S3 or ...) were created/deleted, since a database
  rollback wouldn't unlink/revive the files.

- Audit logging, when you want to log in a way that isn't undone by rolling
  back the transaction - just flushing every pg_logical_emit_message() would
  increase the WAL flush rate many times, because instead of once per
  transaction, you'd now flush once per modified row. It'd basically make it
  impractical to use for such things.

- Optimistic locking. Emitting things that need to be locked on logical
  replicas, to be able to commit on the primary. A pre-commit hook would wait
  for the WAL to be replayed sufficiently - but only once per transaction, not
  once per object.


> Or perhaps we don't want to guarantee durability for such messages, in
> which case we don't need to fix it at all (even in master).

Well, I can see adding an option to flush, or perhaps a separate function to
flush, to master.


> The docs are not very clear on what to expect, unfortunately. It says
> that non-transactional messages are "written immediately" which I could
> interpret in either way.

Yea, the docs certainly should be improved, regardless what we end up with.

Greetings,

Andres Freund




Re: pg_logical_emit_message() misses a XLogFlush()

2023-08-15 Thread Andres Freund
Hi,

On 2023-08-16 06:58:56 +0900, Michael Paquier wrote:
> On Tue, Aug 15, 2023 at 11:37:32AM +0200, Tomas Vondra wrote:
> > Shouldn't the flush be done only for non-transactional messages? The
> > transactional case will be flushed by regular commit flush.
> 
> Indeed, that would be better.  I am sending an updated patch.
> 
> I'd like to backpatch that, would there be any objections to that?

Yes, I object. This would completely cripple the performance of some uses of
logical messages - a slowdown of several orders of magnitude. It's not clear
to me that flushing would be the right behaviour even if this weren't already released, but
it certainly doesn't seem right to make such a change in a minor release.

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-12 17:03:37 -0400, Andrew Dunstan wrote:
> On 2023-08-11 Fr 19:02, Tom Lane wrote:
> > Peter Geoghegan  writes:
> > > My workflow up until now has avoided making updates to typedefs.list
> > > in patches. I only update typedefs locally, for long enough to indent
> > > my code. The final patch doesn't retain any typedefs.list changes.
> > Yeah, I've done the same and will have to stop.
> > 
> > > I guess that I can't do that anymore. Hopefully maintaining the
> > > typedefs.list file isn't as inconvenient as it once seemed to me to
> > > be.
> > I don't think it'll be a problem.  If your rule is "add new typedef
> > names added by your patch to typedefs.list, keeping them in
> > alphabetical order" then it doesn't seem very complicated, and
> > hopefully conflicts between concurrently-developed patches won't
> > be common.
>
> My recollection is that missing typedefs cause indentation that kinda sticks
> out like a sore thumb.
> 
> The reason we moved to a buildfarm based typedefs list was that some
> typedefs are platform dependent, so any list really needs to be the union of
> the found typedefs on various platforms, and the buildfarm was a convenient
> vehicle for doing that. But that doesn't mean you shouldn't manually add a
> typedef you have added in your code.

It's a somewhat annoying task though: finding all the typedefs and adding them
to the right place in the file (we have an out-of-order entry right now). I think a
script that *adds* (but doesn't remove) local typedefs would make this less
painful.

Greetings,

Andres Freund




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> Thanks.  I realised that it's easy enough to test that theory about
> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> false randomly.  Then the test occasionally fails as described.  Seems
> like we'll need to fix that test, but it's not evidence of a server
> bug, and my signal handler refactoring patch is in the clear.  Thanks
> for testing it!

WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.
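
For reference, the hack described above can be as small as the following sketch,
assuming common/pg_prng.h and meant purely for local testing:

	/*
	 * Injected at the top of ConditionalLockBufferForCleanup() in bufmgr.c:
	 * randomly pretend the cleanup lock cannot be acquired.
	 */
	if (pg_prng_double(&pg_global_prng_state) < 0.5)
		return false;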

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:
> > I think there could be a quite simple fix: Track by how much we've extended
> > the relation previously in the same bistate. If we already extended by many
> > blocks, it's very likely that we'll do so further.
> >
> > A simple prototype patch attached. The results for me are promising. I copied
> > a smaller file [1], to have more accurate throughput results at shorter runs
> > (15s).
> 
> Thank you for the patch!

Attached is a mildly updated version of that patch. No functional changes,
just polished comments and added a commit message.


> > Any chance you could run your benchmark? I don't see as much of a regression
> > vs 16 as you...
> 
> Sure. The results are promising for me too:
>
>  nclients = 1, execution time = 13.743
>  nclients = 2, execution time = 7.552
>  nclients = 4, execution time = 4.758
>  nclients = 8, execution time = 3.035
>  nclients = 16, execution time = 2.172
>  nclients = 32, execution time = 1.959
> nclients = 64, execution time = 1.819
> nclients = 128, execution time = 1.583
> nclients = 256, execution time = 1.631

Nice. We are consistently better than both 15 and "post integer parsing 16".


I'm really a bit baffled at myself for not using this approach from the get
go.

This also would make it much more beneficial to use a BulkInsertState in
nodeModifyTable.c, even without converting to table_multi_insert().
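
A rough sketch of what that could look like (hypothetical helper, using the
existing GetBulkInsertState()/table_tuple_insert()/FreeBulkInsertState() API
from access/hio.h and access/tableam.h; not the actual executor change):

static void
insert_slots_bulk(Relation rel, TupleTableSlot **slots, int nslots, CommandId cid)
{
	/* keep one BulkInsertState for the whole batch of inserts */
	BulkInsertState bistate = GetBulkInsertState();

	for (int i = 0; i < nslots; i++)
		table_tuple_insert(rel, slots[i], cid, 0, bistate);

	FreeBulkInsertState(bistate);
}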


I was tempted to optimize RelationAddBlocks() a bit, by not calling
RelationExtensionLockWaiterCount() if we are already extending by
MAX_BUFFERS_TO_EXTEND_BY. Before this patch, it was pretty much impossible to
reach that case, because of the MAX_BUFFERED_* limits in copyfrom.c, but now
it's more common. But that probably ought to be done only in HEAD, not 16.
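
Sketch of that optimization (hypothetical, not part of the attached patch;
variable names as in hio.c's RelationAddBlocks()):

	/* Skip the waiter-count lookup if we'll extend by the maximum anyway. */
	if (!RELATION_IS_LOCAL(relation) &&
		extend_by_pages < MAX_BUFFERS_TO_EXTEND_BY)
	{
		uint32		waitcount = RelationExtensionLockWaiterCount(relation);

		extend_by_pages += extend_by_pages * waitcount;
	}

	extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);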

A related oddity: RelationExtensionLockWaiterCount()->LockWaiterCount() uses
an exclusive lock on the lock partition - which seems not at all necessary?


Unless somebody sees a reason not to, I'm planning to push this?

Greetings,

Andres Freund
From 26e7faad65273beefeb7b40a169c1b631e204501 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 12 Aug 2023 12:31:43 -0700
Subject: [PATCH v2] hio: Take number of prior relation extensions into account

The new relation extension logic, introduced in 00d1e02be24, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table using
COPY, the caller of RelationGetBufferForTuple() will only request a small
number of pages. Without concurrency, we just extended using pwritev() in that
case. However, if there is *some* concurrency, we switched between extending
by a small number of pages and a larger number of pages, depending on the
number of waiters for the relation extension logic.  However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().

To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.

Reported-by: Masahiko Sawada 
Author: Andres Freund 
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=kp6mszngk3bq9yr...@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
---
 src/include/access/hio.h | 13 ++---
 src/backend/access/heap/heapam.c |  1 +
 src/backend/access/heap/hio.c| 19 +++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 228433ee4a2..9bc563b7628 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -32,15 +32,22 @@ typedef struct BulkInsertStateData
 	Buffer		current_buf;	/* current insertion target page */
 
 	/*
-	 * State for bulk extensions. Further pages that were unused at the time
-	 * of the extension. They might be in use by the time we use them though,
-	 * so rechecks are needed.
+	 * State for bulk extensions.
+	 *
+	 * last_free..next_free are further pages that were unused at the time of
+	 * the last extension. They might be in use by the time we use them
+	 * though, so rechecks are needed.
 	 *
 	 * XXX: Eventually these should probably live in RelationData instead,
 	 * alongside targetblock.
+	 *
+	 * already_extended_by is the number of pages that this bulk inserted
+	 * extended by. If we already extended by a significant number of pages,
+	 * we can be more aggressive about extending going forward.
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInser

Re: Fix pg_stat_reset_single_table_counters function

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-10 17:48:10 +0900, Masahiko Sawada wrote:
> Good catch! I've confirmed that the issue has been fixed by your patch.

Indeed.


> However, I'm not sure the added regression tests are stable since
> autovacuum workers may scan the pg_database and increment the
> statistics after resetting the stats.

What about updating the table and checking the update count is reset? That'd
not be reset by autovacuum.

Greetings,

Andres Freund




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-11 19:35:22 -0700, Jeff Davis wrote:
> Controlling search_path is critical for the correctness and security of
> functions. Right now, the author of a function without a SET clause has
> little ability to control the function's behavior, because even basic
> operators like "+" involve search_path. This is a big problem for, e.g.
> functions used in expression indexes which are called by any user with
> write privileges on the table.

> Motivation:
>
> I'd like to (eventually) get to safe-by-default behavior. In other
> words, the simplest function declaration should be safe for the most
> common use cases.

I'm not sure that anything based, directly or indirectly, on search_path
really is a realistic way to get there.


> To get there, we need some way to explicitly specify the less common
> cases. Right now there's no way for the function author to indicate
> that a function intends to use the session's search path. We also need
> an easier way to specify that the user wants a safe search_path ("SET
> search_path = pg_catalog, pg_temp" is arcane).

No disagreement with that. Even if I don't yet agree that your proposal is a
convincing path to "easy security for PLs" - just making the search path stuff
less arcane is good.


> And when we know more about the user's actual intent, then it will be
> easier to either form a transition plan to push users into the safer
> behavior, or at least warn strongly when the user is doing something
> dangerous (i.e. using a function that depends on the session's search
> path as part of an expression index).

I think that'd be pretty painful from a UX perspective. Having to write
e.g. operators as operator(schema, op) just sucks as an experience. And with
extensions plenty of operators will live outside of pg_catalog, so there are
plenty of things that will need qualifying.  And because of things like type
coercion search, which prefers "better fitting" coercions over search path
order, you can't just put "less important" things later in the search path.


I wonder if we ought to work more on "fossilizing" the result of search path
resolutions at the time functions are created, rather than requiring the user
to do so explicitly.  Most of the problem here comes down to the fact that if
a user creates a function like 'a + b' we'll not resolve the operator, the
potential type coercions etc, when the function is created - we do so when the
function is executed.

We can't just store the oids at the time, because that'd end up very fragile -
tables/functions/... might be dropped and recreated etc and thus change their
oid. But we could change the core PLs to rewrite all the queries (*) so that
they schema qualify absolutely everything, including operators and implicit
type casts.

That way objects referenced by functions can still be replaced, but search
path can't be used to "inject" objects in different schemas. Obviously it
could lead to errors on some schema changes - e.g. changing a column type
might mean that a relevant cast lives in a different place than with the old
type - but I think that'll be quite rare. Perhaps we could offer a ALTER
FUNCTION ... REFRESH REFERENCES; or such?

One obvious downside of such an approach is that it requires some work with
each PL. I'm not sure that's avoidable - and I suspect that most "security
sensitive" functions are written in just a few languages.


(*) Obviously the one thing this doesn't work for is the use of EXECUTE in plpgsql
and similar constructs elsewhere. I'm not sure there's much that can be done
to make that safe, but it's worth thinking about more.

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 20:11:34 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> >> +1 for including this in CI tests
>
> > I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
> > course would include CI automatically.
>
> Hmm.  I'm allergic to anything that significantly increases the cost
> of check-world, and this seems like it'd do that.

Hm, compared to the cost of check-world it's not that large, but still,
annoying to make it larger.

We can make it a lot cheaper, but perhaps not in a general enough fashion that
it's suitable for a test.

pgindent already can query git (for --commit). We could teach pgindent to
ask git what remote branch is being tracked, and construct a list of the files
that differ between the remote branch and the local branch?

That option could do something like:
git diff --name-only $(git rev-parse --abbrev-ref --symbolic-full-name @{upstream})

That's pretty quick, even for a relatively large delta.


> Maybe we could automate it, but not as part of check-world per se?

We should definitely do that.  Another related thing that'd be useful to
script is updating typedefs.list with the additional typedefs found
locally. Right now the script for that still lives in the

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> Peter Geoghegan  writes:
> > On Fri, Aug 11, 2023 at 2:25 PM Andres Freund  wrote:
> >> We could add a test that fails when there's some mis-indented code. That seems
> >> like it might catch things earlier?
> 
> +1 for including this in CI tests

I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
course would include CI automatically.


> > It definitely would. That would go a long way towards addressing my
> > concerns. But I suspect that that would run into problems that stem
> > from the fact that the buildfarm is testing something that isn't all
> > that simple. Don't typedefs need to be downloaded from some other
> > blessed buildfarm animal?
> 
> No.  I presume koel is using src/tools/pgindent/typedefs.list,
> which has always been the "canonical" list but up to now we've
> been lazy about maintaining it.  Part of the new regime is that
> typedefs.list should now be updated on-the-fly by patches that
> add new typedefs.

Yea. Otherwise nobody else can indent reliably, without repeating the work of
adding typedefs.list entries of all the patches since the last time it was
updated in the repository.

Greetings,

Andres Freund




Re: run pgindent on a regular basis / scripted manner

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 13:59:40 -0700, Peter Geoghegan wrote:
> On Sat, Jun 17, 2023 at 7:08 AM Andrew Dunstan  wrote:
> > I have set up a new buildfarm animal called koel which will run the module.
> 
> I'm starting to have doubts about this policy. There have now been
> quite a few follow-up "fixes" to indentation issues that koel
> complained about. None of these fixups have been included in
> .git-blame-ignore-revs. If things continue like this then "git blame"
> is bound to become much less usable over time.

I'm not sure I buy that that's going to be a huge problem - most of the time
such fixups are pretty small compared to larger reindents.


> I don't think that it makes sense to invent yet another rule for
> .git-blame-ignore-revs, though. Will we need another buildfarm member
> to enforce that rule, too?

We could add a test that fails when there's some mis-indented code. That seems
like it might catch things earlier?

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 13:19:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 12:26 PM Andres Freund  wrote:
> > > For example, dealing with core dumps left behind by the regression
> > > tests can be annoying.
> >
> > Hm. I don't have a significant problem with that. But I can see it being
> > problematic. Unfortunately, short of preventing core dumps from happening,
> > I don't think we really can do much about that - whatever is running the 
> > tests
> > shouldn't have privileges to change system wide settings about where core
> > dumps end up etc.
>
> I was unclear. I wasn't talking about managing core dumps. I was
> talking about using core dumps to get a simple backtrace, that just
> gives me some very basic information. I probably shouldn't have even
> mentioned core dumps, because what I'm really concerned about is the
> workflow around getting very basic information about assertion
> failures. Not around core dumps per se.
>
> The inconsistent approach needed to get a simple, useful backtrace for
> assertion failures (along with other basic information associated with
> the failure) is a real problem. Particularly when running the tests.
> Most individual assertion failures that I see are for code that I'm
> practically editing in real time. So shaving cycles here really
> matters.

Ah, yes. Agreed, obviously.


> For one thing the symbol mangling that we have around the built-in
> backtraces makes them significantly less useful. I really hope that
> your libbacktrace patch gets committed soon, since that looks like it
> would be a nice quality of life improvement, all on its own.

I've been hacking further on it:
https://github.com/anarazel/postgres/tree/backtrace

Haven't yet posted a new version. Doing it properly for frontend tools has some
dependencies on threading stuff. I'm hoping Thomas' patch for that will go in.


Now it also intercepts segfaults and prints
backtraces for them, if that's possible (it requires libbacktrace to be async
signal safe, which it isn't on all platforms).

Where supported, a crash (distinguishing from assertion failures) will now
report something like:

process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
[0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
[0x7fc4b743650f] [unknown]
[0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
[0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183

after I added
*(volatile int*)0xdeadbeef = 1;
to check_root().
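
For readers less familiar with libbacktrace, a bare-bones standalone sketch of
the general mechanism (hypothetical function names, far simpler than the actual
patch linked above; note backtrace_print() is only async-signal-safe on some
platforms, which is exactly the limitation discussed below):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <backtrace.h>

static struct backtrace_state *bt_state;

static void
crash_handler(int signo)
{
	/* walk the stack and print one frame per line to stderr */
	backtrace_print(bt_state, 2, stderr);
	_exit(128 + signo);
}

void
setup_crash_backtraces(const char *argv0)
{
	bt_state = backtrace_create_state(argv0, 0 /* not threaded */, NULL, NULL);
	signal(SIGSEGV, crash_handler);
	signal(SIGBUS, crash_handler);
}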


For signals sent by users, it'd show the pid of the process sending the signal
on most OSs. I really would like some generalized infrastructure for that, so
that we can report for things like query cancellations.


As the patch stands, the only OS that doesn't yet have useful "backtrace on
crash" support with that is windows, as libbacktrace unfortunately isn't
signal safe on windows. But it'd still provide useful backtraces on
Assert()s. I haven't yet figured out whether/when it's required to be signal
safe on windows though - crashes are intercepted by
SetUnhandledExceptionFilter() - I don't understand the precise constraints of
that. Plenty of people seem to generate backtraces on crashes on windows using
that, without concerns for signal safety and the like.


Currently Macos CI doesn't use libbacktrace, but as it turns out
backtrace_symbols() on macos is a heck of a lot better than on glibc
platforms.  CI for windows with visual studio doesn't have libbacktrace
installed yet (and has the aforementioned signal safety issue), I think it's
installed for windows w/ mingw.


> It would also be great if the tests spit out information about
> assertion failures that was reasonably complete (backtrace without any
> mangling, query text included, other basic context), reliably and
> uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> SQL scripts.

Hm. What other basic context are you thinking of? Pid is obvious. I guess
backend type could be useful too, but it can normally be inferred from the stack
trace pretty easily. Application name could be useful to know which test
caused the crash.

I'm a bit wary about trying to print things like query text - what if that
string points to memory not terminated by a \0? I guess we could use an
approach similar to pgstat_get_crashed_backend_activity().

One issue with reporting crashes from signal handlers is that the obvious way
to make that signal safe (lots of small writes) leads to the potential for
interspersed lines.  It's probably worth having a stat

Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-12 07:51:09 +1200, Thomas Munro wrote:
> Oh, I see what's happening.  Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep().  That is confusing the signal forwarding
> logic which expects to find our entry in the wait list in the common
> case.

Hm, I guess I got confused by the cv code once more. I thought that
ConditionVariableCancelSleep() would wake us up anyway, because
once we return from ConditionVariableSleep(), we'd be off the list. But I now
realize (and I think not for the first time), that ConditionVariableSleep()
always puts us *back* on the list.


Leaving aside the issue in this thread, isn't always adding us back into the
list bad from a contention POV alone? It doubles the write traffic on the CV
and is guaranteed to cause contention for ConditionVariableBroadcast().  I
wonder if this explains some performance issues I've seen in the past.

What if we instead reset cv_sleep_target once we've been taken off the list? I
think it'd not be too hard to make it safe to do the proclist_contains()
without the spinlock.  Lwlocks have something similar, there we solve it by
this sequence:

proclist_delete(&wakeup, iter.cur, lwWaitLink);

/*
 * Guarantee that lwWaiting being unset only becomes visible once the
 * unlink from the list has completed. Otherwise the target backend
 * could be woken up for other reason and enqueue for a new lock - if
 * that happens before the list unlink happens, the list would end up
 * being corrupted.
 *
 * The barrier pairs with the LWLockWaitListLock() when enqueuing for
 * another lock.
 */
pg_write_barrier();
waiter->lwWaiting = LW_WS_NOT_WAITING;
PGSemaphoreUnlock(waiter->sem);

I guess this means we'd need something like lwWaiting for CVs as well.
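
A hypothetical sketch of the waker side with such a flag (here called
"cvWaiting", which does not exist today), mirroring the LWLock sequence quoted
above:

static void
cv_wake_waiter(ConditionVariable *cv, PGPROC *waiter, int pgprocno)
{
	SpinLockAcquire(&cv->mutex);
	proclist_delete(&cv->wakeup, pgprocno, cvWaitLink);
	SpinLockRelease(&cv->mutex);

	/*
	 * Make the unlink visible before clearing cvWaiting, so the waiter can
	 * test cvWaiting without taking the spinlock and still not re-enqueue
	 * while the old link is in place.
	 */
	pg_write_barrier();
	waiter->cvWaiting = false;	/* hypothetical flag, analogous to lwWaiting */

	SetLatch(&waiter->procLatch);
}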


> From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Sat, 12 Aug 2023 07:06:08 +1200
> Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().
>
> Commit b91dd9de was concerned with a theoretical problem with our
> non-atomic condition variable operations.  If you stop sleeping, and
> then cancel the sleep in a separate step, you might be signaled in
> between, and that could be lost.  That doesn't matter for callers of
> ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
> might be upset if a signal went missing like this.

FWIW I suspect at least some of the places that'd have a problem without that
forwarding might also be racy with it.


> New idea: ConditionVariableCancelSleep() can just return true if you've
> been signaled.  Hypothetical users of ConditionVariableSignal() would
> then still have a way to deal with rare lost signals if they are
> concerned about that problem.

Sounds like a plan to me.
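
A hypothetical caller-side sketch of how that return value could be used,
assuming ConditionVariableCancelSleep() returns true when we were signalled (as
in the patch above); the flag is a stand-in for whatever state the caller waits
on:

static void
wait_until_flag_set(ConditionVariable *cv, volatile bool *flag)
{
	ConditionVariablePrepareToSleep(cv);
	while (!*flag)
		ConditionVariableSleep(cv, WAIT_EVENT_MQ_INTERNAL);

	if (ConditionVariableCancelSleep())
	{
		/*
		 * We consumed a signal that may have been intended to wake exactly
		 * one waiter; pass it on so it isn't lost.
		 */
		ConditionVariableSignal(cv);
	}
}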

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 11:23 AM Andres Freund  wrote:
> > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > They are essentially a form of assertion that isn't limited to
> > > assert-enabled builds.
> >
> > Yes. A lot of them I hate with the passion of a thousand suns ;). "Oh,
> > our transaction state machinery is confused. Yes, let's just continue going
> > through the same machinery again, that'll resolve it.".
> 
> I am not unsympathetic to Ashutosh's point about conventional ERRORs
> being easier to deal with when debugging your own code, during initial
> development work.

Oh, I am as well - I just don't think it's a good idea to introduce
"log + error" assertions to core postgres, because it seems very likely that
they'll end up getting used a lot.


> But that seems like a problem with the tooling in other areas.

Agreed.


> For example, dealing with core dumps left behind by the regression
> tests can be annoying.

Hm. I don't have a significant problem with that. But I can see it being
problematic. Unfortunately, short of preventing core dumps from happening,
I don't think we really can do much about that - whatever is running the tests
shouldn't have privileges to change system wide settings about where core
dumps end up etc.


> Don't you also hate it when there's a regression.diffs that just shows 20k
> lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> it quick and easy to get relevant information about what actually went
> wrong.

I do really hate that. At the very least we should switch to using
restart-after-crash by default, and not start new tests once the server has
crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
if the reason the test failed is that the backend died.


> But it seems like that sort of thing could be easier to deal with by
> default, without using custom shell scripts or anything -- particularly for
> those of us that haven't been Postgres hackers for eons.

Yes, wholeheartedly agreed.

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 10:57 AM Andres Freund  wrote:
> > I am quite strongly against this. This will lead to assertions being hit in
> > tests without that being noticed, e.g. because they happen in a background
> > process that just restarts.
> 
> Couldn't you say the same thing about defensive "can't happen" ERRORs?
> They are essentially a form of assertion that isn't limited to
> assert-enabled builds.

Yes. A lot of them I hate with the passion of a thousand suns ;). "Oh,
our transaction state machinery is confused. Yes, let's just continue going
through the same machinery again, that'll resolve it.". Even worse are the
ones that are just WARNINGS. Like "Oh, something wrote beyond the length of
allocated memory, that's something to issue a WARNING about and happily
continue".

There've been people arguing in the past that it's good for robustness to do
stuff like that. I think it's exactly the opposite.

Now, don't get me wrong, there are plenty of cases where a "this shouldn't
happen" elog(ERROR) is appropriate...


> I have sometimes thought that it would be handy if there was a variant
> of "can't happen" ERRORs that took on the structure of an assertion.

What are you imagining? Basically something that generates an elog(ERROR) with
the stringified expression as part of the error message?
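
I.e. roughly a hypothetical macro like this (sketch only, not an existing API):

/* raise a regular ERROR carrying the stringified condition and location */
#define AssertElog(condition) \
	do { \
		if (!(condition)) \
			elog(ERROR, "assertion failed: \"%s\", file: \"%s\", line: %d", \
				 #condition, __FILE__, __LINE__); \
	} while (0)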

Greetings,

Andres Freund




Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 12:59:27 -0500, Marcelo Juchem wrote:
> In my case, my product has a very controlled environment.
> We build all our infrastructure from source and we avoid dynamic linking by
> design, except where technically not viable (e.g.: pgsql extensions).
> 
> LLVM is one of the libraries we're specifically required to statically link.
> Unfortunately I can't share the specifics of why that's the case.

If you build llvm with just static lib support it should work as-is?


> On a side note, I'd like to take this opportunity to ask you if there's any
> work being done towards migrating away from deprecated LLVM APIs.
> If there's no work being done on that front, I might take a stab at it if
> there's any interest from the PostgreSQL community in that contribution.

Yes: 
https://postgr.es/m/CA%2BhUKGKNX_%3Df%2B1C4r06WETKTq0G4Z_7q4L4Fxn5WWpMycDj9Fw%40mail.gmail.com


> Migrating to the new APIs will have PostgreSQL require at the minimum
> version 8 of LLVM (released in March 2019).

I think Thomas' patch abstracts it enough so that older versions continue to
work...

Greetings,

Andres Freund




Re: LLVM 16 (opaque pointers)

2023-08-11 Thread Andres Freund
Hi,

On 2023-05-21 15:01:41 +1200, Thomas Munro wrote:
> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

Hm. I think we need a buildfarm animal with an assertion enabled llvm 16 once
we merge this. I think after an upgrade my buildfarm machine has the necessary
resources.


> @@ -150,7 +150,7 @@ llvm_compile_expr(ExprState *state)
>  
>   /* create function */
>  	eval_fn = LLVMAddFunction(mod, funcname,
> -                              llvm_pg_var_func_type("TypeExprStateEvalFunc"));
> +                              llvm_pg_var_func_type("ExecInterpExprStillValid"));

Hm, that's a bit ugly. But ...

> @@ -77,9 +80,44 @@ extern Datum AttributeTemplate(PG_FUNCTION_ARGS);
>  Datum
>  AttributeTemplate(PG_FUNCTION_ARGS)
>  {
> + PGFunction  fp PG_USED_FOR_ASSERTS_ONLY;
> +
> + fp = &AttributeTemplate;
>   PG_RETURN_NULL();
>  }

Other parts of the file do this by putting the functions into
referenced_functions[], i'd copy that here and below.

> +void
> +ExecEvalSubroutineTemplate(ExprState *state,
> +                           struct ExprEvalStep *op,
> +                           ExprContext *econtext)
> +{
> +	ExecEvalSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +	fp = &ExecEvalSubroutineTemplate;
> +}
> +
> +extern bool ExecEvalBoolSubroutineTemplate(ExprState *state,
> +                                           struct ExprEvalStep *op,
> +                                           ExprContext *econtext);
> +bool
> +ExecEvalBoolSubroutineTemplate(ExprState *state,
> +                               struct ExprEvalStep *op,
> +                               ExprContext *econtext)
> +{
> + ExecEvalBoolSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> + fp = &ExecEvalBoolSubroutineTemplate;
> + return false;
> +}
> +


Thanks for working on this!

Greetings,

Andres Freund




Re: LLVM 16 (opaque pointers)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-10 16:56:54 +0200, Ronan Dunklau wrote:
> I tried my hand at backporting it to previous versions, and not knowing
> anything about it made me indeed question my sanity.  It's quite easy for PG
> 15, 14, 13. PG 12 is nothing insurmountable either, but PG 11 is a bit hairier,
> most notably due to the change in fcinfo args representation. But I guess
> that's also a topic for another day :-)

Given that 11 is about to be EOL, I don't think it's worth spending the time
to support a new LLVM version for it.

Greetings,

Andres Freund




Re: AssertLog instead of Assert in some places

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 17:59:37 +0530, Ashutosh Bapat wrote:
> Most of the Asserts are recoverable by rolling back the transaction
> without crashing the backend. So an elog(ERROR, ) is enough. But just
> by themselves elogs are compiled into the non-debug binary and the
> condition check can waste CPU cycles, esp. when conditions are mostly true
> like the ones we use in Assert.
> 
> Attached patch combines Assert and elog(ERROR, ) so that when an
> Assert is triggered in assert-enabled binary, it will throw an error
> while keeping the backend intact. Thus it does not affect gdb session
> or psql session. These elogs do not make their way to non-assert
> binary so do not make it to production like Assert.

I am quite strongly against this. This will lead to assertions being hit in
tests without that being noticed, e.g. because they happen in a background
process that just restarts.

Greetings,

Andres Freund




Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> That's an awful lot of CPU for a cluster doing essentially "nothing"
> (there's no WAL to decode/replicate, etc.). It usually disappears after
> a couple seconds, but sometimes it's a rather persistent state.

Ugh, that's not great.

> The profile from the walsender processes looks like this:
> 
> --99.94%--XLogSendLogical
>   |
>   |--99.23%--XLogReadRecord
>   |  XLogReadAhead
>   |  XLogDecodeNextRecord
>   |  ReadPageInternal
>   |  logical_read_xlog_page
>   |  |
>   |  |--97.80%--WalSndWaitForWal
>   |  |  |
>   |  |  |--68.48%--WalSndWait
> 
> It seems to me the issue is in WalSndWait, which was reworked to use
> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> waking each other in a busy loop, until the timing changes just enough
> to break the cycle.

IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
processes can nearly be considered a bug, at least when combined with
ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
can cause situations like this, where condition variables basically
deteriorate to a busy loop.

I hit this with AIO as well. I've been "solving" it by adding a
ConditionVariableCancelSleepEx(), which has an only_broadcasts argument.

I'm inclined to think that any code that needs the forwarding
behaviour is pretty much buggy.

Greetings,

Andres Freund




Re: [PATCH] Support static linking against LLVM

2023-08-11 Thread Andres Freund
Hi,

On 2023-08-10 14:45:47 -0500, Marcelo Juchem wrote:
> By default, PostgreSQL doesn't explicitly choose whether to link
> statically or dynamically against LLVM when LLVM JIT is enabled (e.g.:
> `./configure --with-llvm`).
> 
> `llvm-config` will choose to dynamically link by default.
> 
> In order to statically link, one must pass `--link-static` to
> `llvm-config` when listing linker flags (`--ldflags`) and libraries
> (`--libs`).
> 
> This patch enables custom flags to be passed to `llvm-config` linker
> related invocations through the environment variable
> `LLVM_CONFIG_LINK_ARGS`.
> 
> To statically link against LLVM it suffices, then, to call `configure`
> with environment variable `LLVM_CONFIG_LINK_ARGS=--link-static`.

I'm not opposed to it, but I'm not sure I see much need for it either. Do you
have a specific use case for this?

Greetings,

Andres Freund




Re: Add PG CI to older PG releases

2023-08-10 Thread Andres Freund
Hi,

On 2023-08-10 18:09:03 -0400, Tom Lane wrote:
> Nazir Bilal Yavuz  writes:
> > PG CI is added starting from PG 15, adding PG CI to PG 14 and below
> > could be beneficial. So, firstly I tried adding it to the
> > REL_14_STABLE branch. If that makes sense, I will try to add PG CI to
> > other old PG releases.
> 
> I'm not actually sure this is worth spending time on.  We don't
> do new development in three-release-back branches, nor do we have
> so many spare CI cycles that we can routinely run CI testing in
> those branches [1].

I find it much less stressful to backpatch if I can test the patches via CI
first.  I don't think I'm alone in that - Alvaro for example has previously
done this in a more limited fashion (no windows):
https://postgr.es/m/20220928192226.4c6zeenujaoqq4bq%40alvherre.pgsql

Greetings,

Andres Freund




Re: Add PG CI to older PG releases

2023-08-10 Thread Andres Freund
Hi,

On 2023-08-10 19:55:15 +0300, Nazir Bilal Yavuz wrote:
> v2-0001-windows-Only-consider-us-to-be-running-as-service.patch is an
> older commit (59751ae47fd43add30350a4258773537e98d4063). A couple of
> tests were failing without this because the log file was empty and the
> tests were comparing strings with the log files (like in the first
> mail).

Hm. I'm a bit worried about backpatching this one, it's not a small
behavioural change. But the prior behaviour is pretty awful and could very
justifiably be viewed as a bug.

At the very least this would need to be combined with

commit 950e64fa46b164df87b5eb7c6e15213ab9880f87
Author: Andres Freund 
Date:   2022-07-18 17:06:34 -0700

Use STDOUT/STDERR_FILENO in most of syslogger.

Greetings,

Andres Freund




Re: Support to define custom wait events for extensions

2023-08-09 Thread Andres Freund
Hi,

On 2023-08-09 20:10:42 +0900, Masahiro Ikeda wrote:
> * Is there any way to not force extensions that don't use shared
>   memory for their own use, like dblink, to acquire AddinShmemInitLock?

I think the caller shouldn't need to do deal with AddinShmemInitLock at all.

Greetings,

Andres Freund




Re: Remove distprep

2023-08-09 Thread Andres Freund
Hi,

Thanks for sending the -packagers email Peter!

On 2023-08-09 16:25:28 +0200, Christoph Berg wrote:
> Re: Tom Lane
> > Meh ... the fact that it works fine for you doesn't mean it will work
> > fine elsewhere.  Since we're trying to get out from under maintaining
> > the autoconf build system, I don't think it makes sense to open
> > ourselves up to having to do more work on it.  A policy of benign
> > neglect seems appropriate to me.
> 
> Understood, I was just pointing out there are more types of generated
> files in there.

The situation for configure is somewhat different, due to being maintained in
the repository, rather than just being included in the tarball...

Greetings,

Andres Freund




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the
patches I've written should drop it to 1/2 already. Thomas added some
throttling to push it down further.


> Purchasing a couple of mac-mini's (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what gets decided).

Cool. Because of the limitation of running two VMs at a time on macos and the
comparatively low cost of mac minis, it seems they beat alternative models by
a fair bit.

Pginfra/sysadmin: ^


Based on being off by an order of magnitude, as you mention earlier, it seems
that:

1) reducing test runtime etc, as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

Would be viable for a month or three? I hope we can get some cloud providers
to chip in for 3), but I'd like to have something in place that doesn't depend
on that.

Given the cost of macos VMs at AWS, the only one of the big cloud providers to
have macos instances, I think we'd burn through credits pointlessly quickly if
we used VMs for that.

(*) I think we should be able to get below that, but ...


> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep.

I tried to start that process within microsoft, fwiw.  Perhaps Joe and
Jonathan know how to start within AWS?  And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already
running there, but should be quite doable at the others as well.

Greetings,

Andres Freund




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.


> > Potential paths forward for cfbot, in addition to the above:
> > 
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> > 
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> > 
> > - Build our own system, using buildbot, jenkins or whatnot.
> > 
> > 
> > Opinions as to what to do?
> 
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...


> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> > 
> > Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
> > could stop covering both meson and autoconf segsize_blocks. It does affect
> > runtime on linux as well.
> 
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to
.cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests,
because we (quite reasonably) don't generate tables large enough. We've had
plenty bugs that we didn't notice due the code not being exercised much. So it
seemed useful to add CI coverage, by making the segments very small.

I chose the tasks by looking at how long they took at the time, I
think. Adding them to to the slower ones.


> > 6) Disable write cache flushes on windows
> > 
> > It's a bit ugly to do this without using the UI... Shaves off about 30s
> > from the tests.
> 
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...

Greetings,

Andres Freund




Re: Support to define custom wait events for extensions

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-09 08:03:29 +0900, Michael Paquier wrote:
> On Tue, Aug 08, 2023 at 03:59:54PM -0700, Andres Freund wrote:
> > On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> >> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> >> sure that the shared table is around, and we are going to have a
> >> reference to it in WaitEventExtensionCounterData, saved from
> >> dshash_get_hash_table_handle().
> > 
> > I'm not even sure it's worth using dshash here. Why don't we just create a
> > decently sized dynahash (say 128 enties) in shared memory? We overallocate
> > shared memory by enough that there's a lot of headroom for further entries, 
> > in
> > the rare cases they're needed.
> 
> The question here would be how many slots the most popular extensions
> actually need, but that could always be sized up based on the
> feedback.

On a default initdb (i.e. 128MB s_b), after explicitly disabling huge pages,
we over-allocate shared memory by 1922304 bytes, according to
pg_shmem_allocations. We allow that memory to be used for things like shared
hashtables that grow beyond their initial size. So even if the hash table's
static size is too small, there's lots of room to grow, even on small systems.

Just because it's somewhat interesting: With huge pages available and not
disabled, we over-allocate by 3364096 bytes.

Greetings,

Andres Freund




Re: Support to define custom wait events for extensions

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> sure that the shared table is around, and we are going to have a
> reference to it in WaitEventExtensionCounterData, saved from
> dshash_get_hash_table_handle().

I'm not even sure it's worth using dshash here. Why don't we just create a
decently sized dynahash (say 128 entries) in shared memory? We overallocate
shared memory by enough that there's a lot of headroom for further entries, in
the rare cases they're needed.
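
A rough sketch of that (hypothetical entry layout and sizes, using the existing
ShmemInitHash() API; assumes the usual backend headers via postgres.h plus
storage/shmem.h and utils/hsearch.h):

typedef struct WaitEventExtensionEntry	/* hypothetical entry layout */
{
	uint16		event_id;				/* hash key */
	char		name[NAMEDATALEN];
} WaitEventExtensionEntry;

static HTAB *WaitEventExtensionHash;

void
WaitEventExtensionShmemInit(void)
{
	HASHCTL		info;

	info.keysize = sizeof(uint16);
	info.entrysize = sizeof(WaitEventExtensionEntry);

	/* fixed-size table; growth beyond this eats into the shmem slop */
	WaitEventExtensionHash = ShmemInitHash("Extension Wait Event hash",
										   128, 128,
										   &info,
										   HASH_ELEM | HASH_BLOBS);
}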

> We are going to need a fixed size for these custom strings, but perhaps a
> hard limit of 256 characters for each entry of the hash table is more than
> enough for most users?

I'd just use NAMEDATALEN.

- Andres




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 16:44:37 -0400, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund  wrote:
> > I think the biggest flaw of the locking scheme is that the LockHash locks
> > protect two, somewhat independent, things:
> > 1) the set of currently lockable objects, i.e. the entries in the hash 
> > table [partition]
> > 2) the state of all the locks [in a partition]
> >
> > It'd not be that hard to avoid the shared hashtable lookup in a number of
> > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> > above.  But we can't, in general, avoid the lock on the partition anyway, as
> > each lock's state is also protected by the partition lock.
> 
> Yes, and that's a huge problem. The main selling point of the whole
> fast-path mechanism is to ease the pressure on the lock manager
> partition locks, and if we did something like what you described in
> the previous email without changing the locking regimen, we'd bring
> all of that contention back. I'm pretty sure that would suck.

Yea - I tried to outline how I think we could implement the fastpath locking
scheme in a less limited way in the earlier email, that I had quoted above
this bit.  Here I was pontificating on what we possibly should do in addition
to that. I think even if we had "unlimited" fastpath locking, there's still
enough pressure on the lock manager locks that it's worth improving the
overall locking scheme.

Greetings,

Andres Freund




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
> 
> > Given the cost of macos, it seems like it'd be by far the most affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> > persistent runners. Cirrus has builtin macos virtualization support - but can
> > only host two VMs on each mac, due to macos licensing restrictions. A single
> > mac mini would suffice to keep up with our unoptimized monthly runtime
> > (although there likely would be some overhead).
> 
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least
~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so
it'd perhaps be 16 cores actually.


> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >                     task_name                     |    sum
> >  ------------------------------------------------+------------
> >  FreeBSD - 13 - Meson                            | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson          | 00:00:00
> >  SanityCheck                                     | 76:48:41
> >  macOS - Ventura - Meson                         | 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja  | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf              | 830:17:26
> >  Linux - Debian Bullseye - Meson                 | 860:37:21
> >  CompilerWarnings                                | 935:30:35
> > (8 rows)
> >
> 
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%.  Not bad.
> 
> 
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu
cycles to run it.

Greetings,

Andres Freund




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> > 
> > task costs in credits
> > linux-sanity: 0.01
> > linux-compiler-warnings: 0.05
> > linux-meson: 0.07
> > freebsd   : 0.08
> > linux-autoconf: 0.09
> > windows   : 0.18
> > macos : 0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> 
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.


> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
> 
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd lose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, that
> >   could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI.  It should
> be possible to estimate the costs for that.  Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the too
expensive side. Here are the stats for last month's cfbot runtimes (provided
by Thomas):

                     task_name                     |    sum
 ------------------------------------------------+------------
 FreeBSD - 13 - Meson                            | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson          | 00:00:00
 SanityCheck                                     | 76:48:41
 macOS - Ventura - Meson                         | 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja  | 1251:08:06
 Linux - Debian Bullseye - Autoconf              | 830:17:26
 Linux - Debian Bullseye - Meson                 | 860:37:21
 CompilerWarnings                                | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task costs in credits
linux-sanity: 55.30
linux-autoconf: 598.04
linux-meson: 619.40
linux-compiler-warnings: 674.28
freebsd   : 732.24
windows   : 1201.09
macos : 3143.52


Now, those times are before optimizing test runtime. And besides optimizing
the tasks, we can also optimize not running tests for docs patches etc. And
optimize cfbot to schedule a bit better.

But still, the costs look not realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4
instances, which are faster than what we use right now, cost $0.168984 / hour
as "normal" instances and $0.026764 as "spot" instances right now [1]. Windows
VMs are considerably more expensive due to licensing - 0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100 USD / month (maybe
10-20% more in reality, due to a) spot instances getting terminated requiring
retries and b) disks).

Windows would be ~255 USD / month (same retries caveats).
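
(For anyone wanting to check that arithmetic: taking the per-task hours from the
table above, rounded to whole hours, and the prices quoted, something like

SELECT round((1018 + 77 + 830 + 861 + 936) * 0.026764) AS linux_freebsd_usd,
       round(1251 * (0.026764 + 0.184)) AS windows_usd;

comes out at roughly 100 and 264 USD respectively - the same ballpark as the
estimates above, before the retry and disk caveats.)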

Given the cost of macos, it seems like it'd be by far the most affordable
to just buy 1-2 mac minis (2x ~660 USD) and stick them on a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started, quite incomplete as of now, wiki page at
> [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund




Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-07 Thread Andres Freund
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how
much CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.


The first thing I think we should do is to lower the cost of CI. One thing I
had not entirely realized previously, is that macos CI is by far the most
expensive CI to provide. That's not just the case with cirrus-ci, but also
with other providers.  See the series of patches described later in the email.


To me, the situation for cfbot is different than the one for individual
users.

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules out approaches like providing
$cloud_provider compute accounts - that's too much setup work.  With the
improvements detailed below, cirrus' free CI would last about ~65 runs /
month.

For cfbot I hope we can find funding to pay for compute to use for CI. The, by
far, most expensive bit is macos. To a significant degree due to macos
licensing terms not allowing more than 2 VMs on a physical host :(.


The reason we chose cirrus-ci were

a) Ability to use full VMs, rather than a pre-selected set of VMs, which
   allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github
   actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.


The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started, quite incomplete as of now, wiki page at
[4].


Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
  displaying the CI status for different platforms

- give up


Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
  credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
  could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.


Opinions as to what to do?



The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of
   the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted
   as [3]

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's
   expensive. And CI also uses asan on linux, so I don't think it's really
   needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og,
   to get decent backtraces etc. But the amount of CPU burned that way is too
   large. One issue with that is that use of ccache becomes much more crucial,
   uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux

   Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
   could stop covering both meson and autoconf segsize_blocks. It does affect
   runtime on linux as well.

6) Disable write cache flushes on windows

   It's a bit ugly to do this without using the UI... Shaves off about 30s
   from the tests.

7) pg_regress only checked once a second whether postgres started up, but it's
   usually much faster. Use pg_ctl's logic.  It might be worth replacing the
   use psql with directly using libpq in pg_regress instead, looks like the
   overhead of repeatedly starting psql is noticeable.


FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd   : 0.08
linux-autoconf: 0.09
windows   : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month


Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] 
https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7...@alap3.anarazel.de
>From 00c48a90f3acc6cba6af873b29429ee6d4ba38a6 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 3 Aug 2023 23:29:13 -0700
Subject: [PATCH v1 1/9] ci: macos: used cached macports install

This substantially speeds up the mac CI time.

Discussion: https://postgr.es/m/20230805202539.r3umyamsnctys...@awork3.anarazel.de
---
 .cirrus.yml  | 63 +++---
 src/tools/ci/ci_macports_packages.sh | 97 
 2 files changed, 122 insertions(+), 38 deletions(-)
 create mode 100755 s

Re: cpluspluscheck vs ICU

2023-08-07 Thread Andres Freund
Hi,

On 2023-03-10 20:10:30 -0800, Andres Freund wrote:
> On 2023-03-10 19:37:27 -0800, Andres Freund wrote:
> > I just hit this once more - and I figured out a fairly easy fix:
> > 
> > We just need a
> >   #ifndef U_DEFAULT_SHOW_DRAFT
> >   #define U_DEFAULT_SHOW_DRAFT 0
> >   #endif
> > before including unicode/ucol.h.
> > 
> > At first I was looking at
> >   #define U_SHOW_CPLUSPLUS_API 0
> > and
> >   #define U_HIDE_INTERNAL_API 1
> > which both work, but they are documented to be internal.
> 
> Err. Unfortunately only the U_SHOW_CPLUSPLUS_API approach actually works. The
> others don't, not quite sure what I was doing earlier.
> 
> So it's either relying on a define marked as internal, or the below:
> 
> > Alternatively we could emit U_DEFAULT_SHOW_DRAFT 0 into pg_config.h to avoid
> > that issue.
> > 
> > 
> > The only other thing I see is to do something like:
> > [ugly]
> > which seems mighty ugly.

The ICU docs talk about it like it's not really internal:
https://github.com/unicode-org/icu/blob/720e5741ccaa112c4faafffdedeb7459b66c5673/docs/processes/release/tasks/healthy-code.md#test-icu4c-headers

So I'm inclined to go with that solution.

Any comments? Arguments against?

Greetings,

Andres Freund




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> Re: Thomas Munro
> > Thanks for testing!  Would you mind trying v8 from that thread?  V7
> > had a silly bug (I accidentally deleted a 'case' label while cleaning
> > some stuff up, resulting in the above error...)
> 
> v8 worked better. It succeeded a few times (at least 12, my screen
> scrollback didn't catch more) before erroring like this:

> [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated 
> connection due to recovery conflict
> [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on 
> standby
> [10:21:58.463](0.000s)
> [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict 
> on standby'
> #   at t/031_recovery_conflict.pl line 332.
> [10:21:58.463](0.000s) #  got: '0'
> # expected: '1'

Hm, that could just be a "harmless" race. Does it still happen if you apply
the attached patch in addition?

Greetings,

Andres Freund
diff --git i/src/backend/utils/activity/pgstat_database.c w/src/backend/utils/activity/pgstat_database.c
index 7149f22f729..bb36d73ec04 100644
--- i/src/backend/utils/activity/pgstat_database.c
+++ w/src/backend/utils/activity/pgstat_database.c
@@ -81,12 +81,22 @@ void
 pgstat_report_recovery_conflict(int reason)
 {
 	PgStat_StatDBEntry *dbentry;
+	PgStat_EntryRef *entry_ref;
+	PgStatShared_Database *sharedent;
 
 	Assert(IsUnderPostmaster);
 	if (!pgstat_track_counts)
 		return;
 
-	dbentry = pgstat_prep_database_pending(MyDatabaseId);
+	/*
+	 * Update the shared stats directly - recovery conflicts should never be
+	 * common enough for that to be a problem.
+	 */
+	entry_ref =
+		pgstat_get_entry_ref_locked(PGSTAT_KIND_DATABASE, MyDatabaseId, InvalidOid, false);
+
+	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
+	dbentry = &sharedent->stats;
 
 	switch (reason)
 	{
@@ -116,6 +126,8 @@ pgstat_report_recovery_conflict(int reason)
 			dbentry->conflict_startup_deadlock++;
 			break;
 	}
+
+	pgstat_unlock_entry(entry_ref);
 }
 
 /*


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 14:36:48 -0700, Andres Freund wrote:
> What if fast path locks entered PROCLOCK into the shared hashtable, just like
> with normal locks, the first time a lock is acquired by a backend. Except that
> we'd set a flag indicating the lock is a fastpath lock. When the lock is
> released, neither the LOCALLOCK nor the PROCLOCK entry would be
> removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
> lock is not held anymore.
> 
> That itself wouldn't buy us much - we'd still need to do a lookup in the
> shared hashtable. But, by the time we decide whether to use fast path locks,
> we've already done a hash lookup in the LOCALLOCK hashtable. Because the
> PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
> the PROCLOCK entry without a shared hash table lookup.
> 
> Acquiring a strong lock on a fastpath lock would basically entail modifying
> all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
> possible anymore.  Acquiring a fast path lock would just require atomically
> modifying the PROCLOCK to indicate that the lock is held.
> 
> On a first blush, this sounds like it could end up being fairly clean and
> generic?

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> Of course, another thing we could do is try to improve the main lock
> manager somehow. I confess that I don't have a great idea for that at
> the moment, but the current locking scheme there is from a very, very
> long time ago and clearly wasn't designed with modern hardware in
> mind.

I think the biggest flaw of the locking scheme is that the LockHash locks
protect two, somewhat independent, things:
1) the set of currently lockable objects, i.e. the entries in the hash table 
[partition]
2) the state of all the locks [in a partition]

It'd not be that hard to avoid the shared hashtable lookup in a number of
cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
above.  But we can't, in general, avoid the lock on the partition anyway, as
each lock's state is also protected by the partition lock.

The amount of work to do a lookup in the shared hashtable and/or create a new
entry therein is quite bounded.  But the work for acquiring a lock is much less
so. We'll e.g. often have to iterate over the set of lock holders etc.

I think we ought to investigate whether pushing down the locking for the "lock
state" into the individual locks is worth it. That way the partitioned lock
would just protect the hashtable.

The biggest issue I see is deadlock checking. Right now acquiring all lock
partitions gives you a consistent view of all the non-fastpath locks - and
fastpath locks can't participate in deadlocks. Any scheme that makes "lock
state" locking in general more granular, will make it next to impossible to
have a similarly consistent view of all locks.  I'm not sure the current
degree of consistency is required however - the lockers participating in a
lock cycle, pretty much by definition, are blocked.


A secondary issue is that making the locks more granular could affect the
happy path measurably - we'd need two atomics for each heavyweight lock
acquisition, not one.  But if we cached the lookup in the shared hashtable,
we'd commonly be able to skip the hashtable lookup...

Greetings,

Andres Freund




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> I would also argue that the results are actually not that great,
> because once you get past 64 partitions you're right back where you
> started, or maybe worse off. To me, there's nothing magical about
> cases between 16 and 64 relations that makes them deserve special
> treatment - plenty of people are going to want to use hundreds of
> partitions, and even if you only use a few dozen, this isn't going to
> help as soon as you join two or three partitioned tables, and I
> suspect it hurts whenever it doesn't help.
> 
> I think we need a design that scales better. I don't really know what
> that would look like, exactly, but your idea of a hybrid approach
> seems like it might be worth further consideration. We don't have to
> store an infinite number of fast-path locks in an array that we search
> linearly, and it might be better that if we moved to some other
> approach we could avoid some of the regression.

My gut feeling is that the state for fast path locks doesn't live in quite the
right place.

What if fast path locks entered PROCLOCK into the shared hashtable, just like
with normal locks, the first time a lock is acquired by a backend. Except that
we'd set a flag indicating the lock is a fastpath lock. When the lock is
released, neither the LOCALLOCK nor the PROCLOCK entry would be
removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
lock is not held anymore.

That itself wouldn't buy us much - we'd still need to do a lookup in the
shared hashtable. But, by the time we decide whether to use fast path locks,
we've already done a hash lookup in the LOCALLOCK hashtable. Because the
PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
the PROCLOCK entry without a shared hash table lookup.

Acquiring a strong lock on a fastpath lock would basically entail modifying
all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
possible anymore.  Acquiring a fast path lock would just require atomically
modifying the PROCLOCK to indicate that the lock is held.

On a first blush, this sounds like it could end up being fairly clean and
generic?

Greetings,

Andres Freund




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 13:59:26 -0700, Matt Smiley wrote:
> I have not yet written a reproducer since we see this daily in production.
> I have a sketch of a few ways that I think will reproduce the behavior
> we're observing, but haven't had time to implement it.
> 
> I'm not sure if we're seeing this behavior in production

It might be worth for you to backpatch

commit 92daeca45df
Author: Andres Freund 
Date:   2022-11-21 20:34:17 -0800

Add wait event for pg_usleep() in perform_spin_delay()

into 12. That should be low risk and have only trivially resolvable
conflicts. Alternatively, you could use bpftrace et al to set a userspace
probe on perform_spin_delay().
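
With that commit in place, time spent in that path shows up as a wait event
(wait_event_type "Timeout", wait_event "SpinDelay"), so your existing
pg_stat_activity based metrics collection should be able to pick it up with
something as simple as:

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event = 'SpinDelay'
GROUP BY 1, 2;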


> , but it's definitely an interesting find.  Currently we are running
> postgres 12.11, with an upcoming upgrade to 15 planned.  Good to know
> there's a potential improvement waiting in 16.  I noticed that in
> LWLockAcquire the call to LWLockDequeueSelf occurs (
> https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218)
> directly between the unsuccessful attempt to immediately acquire the lock
> and reporting the backend's wait event.

That's normal.



> > I'm also wondering if it's possible that the reason for the throughput
> > drops
> > are possibly correlated with heavyweight contention or higher frequency
> > access
> > to the pg_locks view. Deadlock checking and the locks view acquire locks on
> > all lock manager partitions... So if there's a bout of real lock contention
> > (for longer than deadlock_timeout)...
> >
> 
> Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds,
> so frequently hitting that would massively violate SLO and would alert the
> on-call engineers.  The pg_locks view is scraped a couple times per minute
> for metrics collection, but the lock_manager lwlock contention can be
> observed thousands of times every second, typically with very short
> durations.  The following example (captured just now) shows the number of
> times per second over a 10-second window that any 1 of the 16
> "lock_manager" lwlocks was contended:

Some short-lived contention is fine and expected - the question is how long
the waits are...

Unfortunately my experience is that the overhead of bpftrace means that
analyzing things like this with bpftrace is very hard... :(.


> > Given that most of your lock manager traffic comes from query planning -
> > have you evaluated using prepared statements more heavily?
> >
> 
> Yes, there are unrelated obstacles to doing so -- that's a separate can of
> worms, unfortunately.  But in this pathology, even if we used prepared
> statements, the backend would still need to reacquire the same locks during
> each executing transaction.  So in terms of lock acquisition rate, whether
> it's via the planner or executor doing it, the same relations have to be
> locked.

Planning will often lock more database objects than query execution does.  So
avoiding repeated planning (e.g. via prepared statements) can keep you within
the fast-path lock limit for longer.
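
Whether a backend's relation locks are still going through the fast path is
visible in pg_locks, so something like this shows, per backend, how many lock
acquisitions spilled over into the shared lock manager at that moment:

SELECT pid,
       count(*) FILTER (WHERE fastpath) AS fastpath_locks,
       count(*) FILTER (WHERE NOT fastpath) AS shared_table_locks
FROM pg_locks
WHERE locktype = 'relation'
GROUP BY pid
ORDER BY shared_table_locks DESC;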

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
> >
> > On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > > So, it looks like this item can be closed off.  I'll hold off from
> > > doing that for a few days just in case anyone else wants to give
> > > feedback or test themselves.
> >
> > Alright, closed.
>
> IIUC the problem with multiple concurrent COPY is not resolved yet.

Yea - it was just hard to analyze until the other regressions were fixed.


> The result of nclients = 1 became better thanks to recent fixes, but
> there still seems to be the performance regression at nclient = 2~16
> (on RHEL 8 and 9). Andres reported[1] that after changing
> MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> would not be the solution, as he mentioned.

I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.

A simple prototype patch attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).

HEAD before:
clients  tps
1         41
2         76
4        136
8        248
16       360
32       375
64       317


HEAD after:
clients  tps
1         43
2         80
4        155
8        280
16       369
32       405
64       344

Any chance you could re-run your benchmark? I don't see as much of a regression vs 16
as you...

Greetings,

Andres Freund

[1] COPY (SELECT generate_series(1, 10)) TO '/tmp/data.copy';
diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 228433ee4a2..5ae39aec7f8 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -41,6 +41,7 @@ typedef struct BulkInsertStateData
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInsertStateData;
 
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
 	bistate->current_buf = InvalidBuffer;
 	bistate->next_free = InvalidBlockNumber;
 	bistate->last_free = InvalidBlockNumber;
+	bistate->already_extended_by = 0;
 	return bistate;
 }
 
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index c275b08494d..a2be4273df1 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		 */
 		extend_by_pages += extend_by_pages * waitcount;
 
+		/*
+		 * If we previously extended using the same bistate, it's very likely
+		 * we'll extend some more. Try to extend by as many pages.
+		 */
+		if (bistate)
+			extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
 		/*
 		 * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
 		 * them all concurrently.
@@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		/* maintain bistate->current_buf */
 		IncrBufferRefCount(buffer);
 		bistate->current_buf = buffer;
+		bistate->already_extended_by += extend_by_pages;
 	}
 
 	return buffer;


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-06 Thread Andres Freund
Hi,

On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
> 
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).  This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.

Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.

I'm also wondering if it's possible that the reason for the throughput drops
are possibly correlated with heavyweight contention or higher frequency access
to the pg_locks view. Deadlock checking and the locks view acquire locks on
all lock manager partitions... So if there's a bout of real lock contention
(for longer than deadlock_timeout)...
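
Whether that was happening around the throughput drops should be visible from a
periodic snapshot of heavyweight lock waiters, e.g. something along the lines
of:

SELECT pid, wait_event_type, wait_event,
       now() - query_start AS query_running_for,
       cardinality(pg_blocking_pids(pid)) AS num_blockers
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

(query_running_for is the age of the query, not of the lock wait itself, but it
should suffice to spot waits lasting longer than deadlock_timeout.)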


Given that most of your lock manager traffic comes from query planning - have
you evaluated using prepared statements more heavily?

Greetings,

Andres Freund




Re: initdb caching during tests

2023-08-05 Thread Andres Freund
Hi,

On 2023-08-05 16:58:38 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Times for running all tests under meson, on my workstation (20 cores / 40
> > threads):
> 
> > cassert build -O2:
> 
> > Before:
> > real0m44.638s
> > user7m58.780s
> > sys 2m48.773s
> 
> > After:
> > real0m38.938s
> > user2m37.615s
> > sys 2m0.570s
> 
> Impressive results.  Even though your bottom-line time doesn't change that
> much

Unfortunately we have a few tests that take quite a while - for those the
initdb removal doesn't make that much of a difference. Particularly because
this machine has enough CPUs to not be fully busy except for the first few
seconds...

E.g. for a run with the patch applied:

258/265 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup  
   OK   16.58s   187 subtests passed
259/265 postgresql:subscription / subscription/100_bugs 
   OK6.69s   12 subtests passed
260/265 postgresql:regress / regress/regress
   OK   24.95s   215 subtests passed
261/265 postgresql:ssl / ssl/001_ssltests   
   OK7.97s   205 subtests passed
262/265 postgresql:pg_dump / pg_dump/002_pg_dump
   OK   19.65s   11262 subtests passed
263/265 postgresql:recovery / recovery/027_stream_regress   
   OK   29.34s   6 subtests passed
264/265 postgresql:isolation / isolation/isolation  
   OK   33.94s   112 subtests passed
265/265 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade   
   OK   38.22s   18 subtests passed

The pg_upgrade test is faster in isolation (29s), but not that much. The
overall runtime is reduced due to the reduced "competing" cpu usage, but other
than that...


Looking at where the time is spent when running the pg_upgrade test on its own:

grep -E '^\[' testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade 
|sed -E -e 's/.*\(([0-9.]+)s\)(.*)/\1 \2/g'|sort -n -r

cassert:
13.094  ok 5 - regression tests pass
6.147  ok 14 - run of pg_upgrade for new instance
2.340  ok 6 - dump before running pg_upgrade
1.638  ok 17 - dump after running pg_upgrade
1.375  ok 12 - run of pg_upgrade --check for new instance
0.798  ok 1 - check locales in original cluster
0.371  ok 9 - invalid database causes failure status (got 1 vs expected 1)
0.149  ok 7 - run of pg_upgrade --check for new instance with incorrect binary 
path
0.131  ok 16 - check that locales in new cluster match original cluster

optimized:
8.372  ok 5 - regression tests pass
3.641  ok 14 - run of pg_upgrade for new instance
1.371  ok 12 - run of pg_upgrade --check for new instance
1.104  ok 6 - dump before running pg_upgrade
0.636  ok 17 - dump after running pg_upgrade
0.594  ok 1 - check locales in original cluster
0.359  ok 9 - invalid database causes failure status (got 1 vs expected 1)
0.148  ok 7 - run of pg_upgrade --check for new instance with incorrect binary 
path
0.127  ok 16 - check that locales in new cluster match original cluster


The time for "dump before running pg_upgrade" is misleadingly high - there's
no output between starting initdb and the dump, so the timing includes initdb
and a bunch of other work.  But it's still not fast (1.637s) after.

A small factor is that the initdb times are not insignificant, because the
template initdb can't be used due to a bunch of parameters passed to initdb :)


> the big reduction in CPU time should translate to a nice speedup on slower
> buildfarm animals.

Yea. It's a particularly large win when using valgrind. Under valgrind, a very
large portion of the time for many tests is just spent doing initdb... So I am
hoping to see some nice gains for skink.

Greetings,

Andres Freund




ci: Improve macos startup using a cached macports installation

2023-08-05 Thread Andres Freund
Hi,

We have some issues with CI on macos and windows being too expensive (more on
that soon in a separate email). For macos most of the obviously wasted time is
spent installing packages with homebrew. Even with the package downloads being
cached, it takes about 1m20s to install them.  We can't just cache the whole
homebrew installation, because it contains a lot of pre-installed packages.

After a bunch of experimenting, I found a way to do this a lot faster: The
attached patch installs macports and installs our dependencies from
that. Because there aren't pre-existing macports packages, we can just cache
the whole thing.  Doing so naively wouldn't yield that much of a speedup,
because it takes a while to unpack a tarball (or whatnot) with as many files
as our dependencies have - that's a lot of filesystem metadata
operations. Instead the script creates a HFS+ filesystem in a file and caches
that - that's mounted within a few seconds. To further keep the size in check,
that file is compressed with zstd in the cache.

As macports has a package for IPC::Run and IO::Pty, we can use those instead
of the separate cache we had for the perl installation.

After the patch, the cached case takes ~5s to "install" and the cache is half
the size than the one for homebrew.

The comparison isn't entirely fair, because I used the occasion to not install
'make' (since meson is used for building) and llvm (we didn't enable it for
the build anyway). That gain is a bit smaller without that, but still
significant.


An alternative implementation would be to create the "base" .dmg file outside
of CI and download it onto the CI instances. But I didn't want to figure out
the hosting situation for such files, so I thought this was the best
near-medium term path.

Greetings,

Andres Freund
>From e0f64949e7b2b2aca4398acb3bfa8e859857e910 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 3 Aug 2023 23:29:13 -0700
Subject: [PATCH v1] ci: macos: used cached macports install

This substantially speeds up the mac CI time.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .cirrus.yml  | 63 +++---
 src/tools/ci/ci_macports_packages.sh | 97 
 2 files changed, 122 insertions(+), 38 deletions(-)
 create mode 100755 src/tools/ci/ci_macports_packages.sh

diff --git a/.cirrus.yml b/.cirrus.yml
index d260f15c4e2..e9cfc542cfe 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -429,8 +429,7 @@ task:
 
 CIRRUS_WORKING_DIR: ${HOME}/pgsql/
 CCACHE_DIR: ${HOME}/ccache
-HOMEBREW_CACHE: ${HOME}/homebrew-cache
-PERL5LIB: ${HOME}/perl5/lib/perl5
+MACPORTS_CACHE: ${HOME}/macports-cache
 
 CC: ccache cc
 CXX: ccache c++
@@ -454,55 +453,43 @@ task:
 - mkdir ${HOME}/cores
 - sudo sysctl kern.corefile="${HOME}/cores/core.%P"
 
-  perl_cache:
-folder: ~/perl5
-  cpan_install_script:
-- perl -mIPC::Run -e 1 || cpan -T IPC::Run
-- perl -mIO::Pty -e 1 || cpan -T IO::Pty
-  upload_caches: perl
-
-
-  # XXX: Could we instead install homebrew into a cached directory? The
-  # homebrew installation takes a good bit of time every time, even if the
-  # packages do not need to be downloaded.
-  homebrew_cache:
-folder: $HOMEBREW_CACHE
+  # Use macports, even though homebrew is installed. The installation
+  # of the additional packages we need would take quite a while with
+  # homebrew, even if we cache the downloads. We can't cache all of
+  # homebrew, because it's already large. So we use macports. To cache
+  # the installation we create a .dmg file that we mount if it already
+  # exists.
+  # XXX: The reason for the direct p5.34* references is that we'd need
+  # the large macport tree around to figure out that p5-io-tty is
+  # actually p5.34-io-tty. Using the unversioned name works, but
+  # updates macports every time.
+  macports_cache:
+folder: ${MACPORTS_CACHE}
   setup_additional_packages_script: |
-brew install \
+sh src/tools/ci/ci_macports_packages.sh \
   ccache \
-  icu4c \
-  krb5 \
-  llvm \
+  icu \
+  kerberos5 \
   lz4 \
-  make \
   meson \
   openldap \
   openssl \
-  python \
-  tcl-tk \
+  p5.34-io-tty \
+  p5.34-ipc-run \
+  tcl \
   zstd
-
-brew cleanup -s # to reduce cache size
-  upload_caches: homebrew
+# Make macports install visible for subsequent steps
+echo PATH=/opt/local/sbin/:/opt/local/bin/:$PATH >> $CIRRUS_ENV
+  upload_caches: macports
 
   ccache_cache:
 folder: $CCACHE_DIR
   configure_script: |
-brewpath="/opt/homebrew"
-PKG_CONFIG_PATH="${brewpath}/lib/pkgconfig:${PKG_CONFIG_PATH}"
-
-for pkg in icu4c krb5 openldap openssl zstd ; do
-  pkgpath="${brewpath}/opt/${pkg}"
-  PKG_CONFIG_PATH="${pkgpath}/lib/pkgconfig:${PKG_CONFIG_PATH}"

initdb caching during tests

2023-08-05 Thread Andres Freund
Hi,

We have some issues with CI on macos and windows being too expensive (more on
that soon in a separate email), which reminded me of this thread (with
original title: [1])

I've attached a somewhat cleaned up version of the patch to cache initdb
across runs.  The results are still fairly impressive in my opinion.


One thing I do not like, but don't have a good idea for how to improve, is
that there's a bunch of duplicated logic in pg_regress.c and Cluster.pm. I've
tried to move that into initdb.c itself, but that ends up pretty ugly, because
we need to be a lot more careful about checking whether options are compatible
etc. I've also thought about just putting this into a separate perl script,
but right now we still allow basic regression tests without perl being
available.  So I concluded that for now just having the copies is the best
answer.


Times for running all tests under meson, on my workstation (20 cores / 40
threads):

cassert build -O2:

Before:
real0m44.638s
user7m58.780s
sys 2m48.773s

After:
real0m38.938s
user2m37.615s
sys 2m0.570s


cassert build -O0:

Before:
real1m11.290s
user13m9.817s
sys 2m54.946s

After:
real1m2.959s
user3m5.835s
sys 1m59.887s


non-cassert build:

Before:
real0m34.579s
user5m30.418s
sys 2m40.507s

After:
real0m27.710s
user2m20.644s
sys 1m55.770s


On CI this reduces the test times substantially:
Freebsd   8:51 -> 5:35
Debian w/ asan, autoconf  6:43 -> 4:55
Debian w/ alignmentsan, ubsan 4:02 -> 2:33
macos 5:07 -> 4:29
windows  10:21 -> 9:49

This is ignoring a bit of run-to-run variance, but the trend is obvious enough
that it's not worth worrying about that.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz%40alap3.anarazel.de
>From 0fd431c277f01284a91999a04368de6b59b6691e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 2 Feb 2023 21:51:53 -0800
Subject: [PATCH v3 1/2] Use "template" initdb in tests

Discussion: https://postgr.es/m/20220120021859.3zpsfqn4z7ob7...@alap3.anarazel.de
---
 meson.build  | 30 ++
 .cirrus.yml  |  3 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm | 46 ++-
 src/test/regress/pg_regress.c| 74 ++--
 src/Makefile.global.in   | 52 +
 5 files changed, 161 insertions(+), 44 deletions(-)

diff --git a/meson.build b/meson.build
index 04ea3488522..47429a18c3f 100644
--- a/meson.build
+++ b/meson.build
@@ -3056,8 +3056,10 @@ testport = 4
 test_env = environment()
 
 temp_install_bindir = test_install_location / get_option('bindir')
+test_initdb_template = meson.build_root() / 'tmp_install' / 'initdb-template'
 test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
+test_env.set('INITDB_TEMPLATE', test_initdb_template)
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
@@ -3072,6 +3074,34 @@ if library_path_var != ''
 endif
 
 
+# Create (and remove old) initdb template directory. Tests use that, where
+# possible, to make it cheaper to run tests.
+#
+# Use python to remove the old cached initdb, as we cannot rely on a working
+# 'rm' binary on windows.
+test('initdb_cache',
+ python,
+ args: [
+   '-c', '''
+import shutil
+import sys
+import subprocess
+
+shutil.rmtree(sys.argv[1], ignore_errors=True)
+sp = subprocess.run(sys.argv[2:] + [sys.argv[1]])
+sys.exit(sp.returncode)
+''',
+   test_initdb_template,
+   temp_install_bindir / 'initdb',
+   '-A', 'trust', '-N', '--no-instructions'
+ ],
+ priority: setup_tests_priority - 1,
+ timeout: 300,
+ is_parallel: false,
+ env: test_env,
+ suite: ['setup'])
+
+
 
 ###
 # Test Generation
diff --git a/.cirrus.yml b/.cirrus.yml
index d260f15c4e2..8fce17dff08 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -115,8 +115,9 @@ task:
   test_minimal_script: |
 su postgres <<-EOF
   ulimit -c unlimited
+  meson test $MTEST_ARGS --suite setup
   meson test $MTEST_ARGS --num-processes ${TEST_JOBS} \
-tmp_install cube/regress pg_ctl/001_start_stop
+cube/regress pg_ctl/001_start_stop
 EOF
 
   on_failure:
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 5e161dbee60..4d449c35de9 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -

Re: SIGQUIT handling, redux

2023-08-02 Thread Andres Freund
Hi,

On 2023-08-02 12:35:19 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> >> It's simple enough that maybe we could back-patch it, once it's
> >> aged awhile in HEAD.  OTOH, given the lack of field reports of
> >> trouble here, I'm not sure back-patching is worth the risk.
> 
> > FWIW, looking at collected stack traces in azure, there's a slow but steady
> > stream of crashes below StartupPacketTimeoutHandler. ...
> > Unsurprisingly just in versions before 14, where this change went in.
> > I think that might be enough evidence for backpatching the commit? I've not
> > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().
> 
> I'd be willing to take a look at this in a few weeks when $real_life
> is a bit less demanding.

Cool.


> Right before minor releases would likely be a bad idea anyway.

Agreed. I had just waded through the stacks, so I thought it'd be worth
bringing up, didn't intend to imply we should backpatch immediately.


Aside: Analyzing this kind of thing at scale is made considerably more painful
by "expected" ereport(PANIC)s (say out of disk space during WAL writes)
calling abort() and dumping core... While there are other PANICs you really
want cores of.

Greetings,

Andres Freund




Re: SIGQUIT handling, redux

2023-08-02 Thread Andres Freund
Hi,

On 2020-09-11 11:52:55 -0400, Tom Lane wrote:
> It's simple enough that maybe we could back-patch it, once it's
> aged awhile in HEAD.  OTOH, given the lack of field reports of
> trouble here, I'm not sure back-patching is worth the risk.

FWIW, looking at collected stack traces in azure, there's a slow but steady
stream of crashes below StartupPacketTimeoutHandler. Most seem to be things
like
libcrypto->malloc->StartupPacketTimeoutHandler->proc_exit->socket_close->free->crash
there's a few other variants, some where the stack apparently was not
decipherable for the relevant tooling.

Note that this wouldn't even include cases where this caused hangs - which is
quite common IME.


Unsurprisingly just in versions before 14, where this change went in.

I think that might be enough evidence for backpatching the commit? I've not
heard of issues due to the checks in check_on_shmem_exit_lists_are_empty().

Greetings,

Andres Freund




Re: stats test intermittent failure

2023-08-01 Thread Andres Freund
Hi,

On 2023-07-31 21:03:07 +0900, Masahiko Sawada wrote:
> Regarding the patch, I have a comment:
> 
>  -- Test that reuse of strategy buffers and reads of blocks into these reused
> --- buffers while VACUUMing are tracked in pg_stat_io.
> +-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
> +-- demand for shared buffers from concurrent queries, some blocks may be
> +-- evicted from the strategy ring before they can be reused. In such cases
> +-- this, the backend will evict a block from a shared buffer outside of the
> +-- ring and add it to the ring. This is considered an eviction and not a 
> reuse.
> 
> The new comment seems not to be accurate if my understanding is correct. How
> about the following?
> 
> Test that reuse of strategy buffers and reads of blocks into these
> reused buffers while VACUUMing are tracked in pg_stat_io. If there is
> sufficient demand for shared buffers from concurrent queries, some
> buffers may be pinned by other backends before they can be reused. In
> such cases, the backend will evict a buffer from a shared buffer
> outside of the ring and add it to the ring. This is considered an
> eviction and not a reuse.

I integrated the suggested change of the comment and tweaked it a bit
more. And finally pushed the fix.

Sorry that it took so long.

Greetings,

Andres Freund




Re: Support to define custom wait events for extensions

2023-07-31 Thread Andres Freund
Hi,

On 2023-08-01 12:14:49 +0900, Michael Paquier wrote:
> On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote:
> > Thanks for committing the main patch.
> > 
> > In my understanding, the rest works are
> > * to support WaitEventExtensionMultiple()
> > * to replace WAIT_EVENT_EXTENSION to custom wait events
> > 
> > Do someone already works for them? If not, I'll consider
> > how to realize them.
> 
> Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have
> no dependency to shared_preload_libraries.  Perhaps these could just
> use a dynamic handling but that deserves a separate discussion because
> of the fact that they'd need shared memory without being able to
> request it.  autoprewarm.c is much simpler.

This is why the scheme as implemented doesn't really make sense to me. It'd be
much easier to use if we had a shared hashtable with the identifiers than
what's been merged now.

In plenty of cases it's not realistic for an extension library to run in each
backend, while still needing to wait for things.
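
From the monitoring side, what all of this is in aid of is just getting a
meaningful name, rather than the generic one, when such a backend is waiting,
i.e. in the output of something like:

SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Extension';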

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-27 20:53:16 +1200, David Rowley wrote:
> To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
> AMD 3990x machine.  Today's REL_16_STABLE takes 530.344 ms. We're
> talking about another patch to speed up the pg_strtoint functions
> which gets this down to 483.790 ms. Do we need to do this for v16 or
> can we just leave this as it is already?  The slowdown does not seem
> to be much above what we'd ordinarily classify as noise using this
> test on my machine.

I think we need to do something for 16 - it appears on recent-ish AMD the
regression is quite a bit smaller than on intel.  You see something < 1%, I
see more like 4%.  I think there's also other cases where the slowdown is more
substantial.

Besides intel vs amd, it also looks like the gcc version might make a
difference. The code generated by 13 is noticeably slower than 12 for me...

> Benchmark setup:
> 
> COPY (SELECT generate_series(1, 200) a, (random() * 10 -
> 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

There are a lot of larger numbers in the file, which likely reduces the impact
some. And there's the overhead of actually inserting the rows into the table,
making the difference appear smaller than it is.

If I avoid the actual insert into the table and use more columns, I see about a
10% regression here.

COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM 
generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';

psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE 
lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
  pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM 
'/tmp/lotsaints_wide.copy' WHERE false") -t 5

15: 2992.605
HEAD:   3325.201
fastpath1.patch 2932.606
fastpath2.patch 2783.915


Interestingly fastpath1 is slower now, even though it wasn't with earlier
patches (which still is repeatable). I do not have the foggiest as to why.

Greetings,

Andres Freund




Re: pltcl tests fail with FreeBSD 13.2

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-31 19:11:38 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-07-31 18:33:37 -0400, Tom Lane wrote:
> >> (And could it be that we had one in the predecessor 13.1 image?)
> 
> > No, I checked, and it's not in there either... It looks like the difference 
> > is
> > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't.
> 
> Huh.  Maybe worth reporting as a FreeBSD bug?

Yea. Hoping our local freebsd developer has a suggestion as to which component
to report it to, or even fix it :).


> In any case I don't think it's *our* bug.

Agreed. An explicit tzsetup UTC isn't much to carry going forward...

Greetings,

Andres Freund




Re: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-31 21:25:06 +, José Neves wrote:
> Ok, if I understood you correctly, I start to see where my logic is faulty. 
> Just to make sure that I got it right, taking the following example again:
>
> T-1
> INSERT LSN1-1000
> UPDATE LSN2-2000
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
>
> T-2
> INSERT LSN1-500
> UPDATE LSN2-1500
> UPDATE LSN3-2500
> COMMIT  LSN4-5500
>
> Where data will arrive in this order:
>
> INSERT LSN1-500
> INSERT LSN1-1000
> UPDATE LSN2-1500
> UPDATE LSN2-2000
> UPDATE LSN3-2500
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
> COMMIT  LSN4-5500

No, they won't arrive in that order. They will arrive as

BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000
BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Because T1 committed before T2. Changes are only streamed out at commit /
prepare transaction (*). Within a transaction, they however *will* be ordered
by LSN.

(*) Unless you use streaming mode, in which case it'll all be more
complicated, as you'll also receive changes for transactions that might still
abort.
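
This is easy to see for yourself with test_decoding (instead of pgoutput),
assuming wal_level=logical and a throwaway table t:

CREATE TABLE t (a int);
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- session 1: BEGIN; INSERT INTO t VALUES (1);
-- session 2: BEGIN; INSERT INTO t VALUES (2);
-- session 1: INSERT INTO t VALUES (3); COMMIT;
-- session 2: INSERT INTO t VALUES (4); COMMIT;

SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

All of session 1's changes come out before any of session 2's, even though the
individual change LSNs interleave.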


> You are saying that the LSN3-3000 will never be missing, either the entire
> connection will fail at that point, or all should be received in the
> expected order (which is different from the "numeric order" of LSNs).

I'm not quite sure what you mean with the "different from the numeric order"
bit...


> If the connection is down, upon restart, I will receive the entire T-1
> transaction again (well, all example data again).

Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only
asking for newer transactions when reconnecting.


> In addition to that, if I commit LSN4-4000, even tho that LSN has a "bigger
> numeric value" than the ones representing INSERT and UPDATE events on T-2, I
> will be receiving the entire T-2 transaction again, as the LSN4-5500 is
> still uncommitted.

I don't quite know what you mean with "commit LSN4-4000" here.


> This makes sense to me, but just to be extra clear, I will never receive a
> transaction commit before receiving all other events for that transaction.

Correct.

Greetings,

Andres Freund




Re: pltcl tests fail with FreeBSD 13.2

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-31 18:33:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I saw that CI image builds for freebsd were failing, because 13.1, used 
> > until
> > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 
> > 13.2
> > - unfortunately that failed. In pltcl of all places.
> 
> I tried to replicate this in a freshly-installed 13.2 VM, and could
> not, which is unsurprising because the FreeBSD installation process
> does not let you skip selecting a timezone --- which creates
> /etc/localtime AFAICT.  So I'm unconvinced that we ought to worry
> about the case where that's not there.  How is it that the CI image
> lacks that file?

I don't know why it lacks the file - the CI image is based on the google cloud
image freebsd maintains / publishes ([1][2]). Which doesn't have /etc/localtime.

I've now added a "tzsetup UTC" to the image generation, which fixes the test
failure.


> (And could it be that we had one in the predecessor 13.1 image?)

No, I checked, and it's not in there either... It looks like the difference is
that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't.

Upstream 13.1 image:

truss date 2>&1|grep -E 'open|stat'
...
open("/etc/localtime",O_RDONLY,0101401200)   ERR#2 'No such file or 
directory'
open("/usr/share/zoneinfo/UTC",O_RDONLY,00)  = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
open("/usr/share/zoneinfo/posixrules",O_RDONLY,014330460400) = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=356621,size=3535,blksize=32768 }) = 0 (0x0)
fstat(1,{ mode=p- ,inode=696,size=0,blksize=4096 }) = 0 (0x0)

Upstream 13.2 image:
open("/etc/localtime",O_RDONLY,01745)ERR#2 'No such file or 
directory'
fstat(1,{ mode=p- ,inode=658,size=0,blksize=4096 }) = 0 (0x0)


Why not reading the UTC zone leads to timestamps being out of range, I do not
know...

Greetings,

Andres Freund

[1] https://wiki.freebsd.org/GoogleCloudPlatform
[2] https://cloud.google.com/compute/docs/images#freebsd




Re: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-31 14:16:22 +, José Neves wrote:
> Hi Amit, thanks for the reply.
>
> In our worker (custom pg replication client), we care only about INSERT,
> UPDATE, and DELETE operations, which - sure - may be part of the issue.

That seems likely. Postgres streams out changes in commit order, not in order
of the changes having been made (that'd not work due to rollbacks etc). If you
just disregard transactions entirely, you'll get something bogus after
retries.

You don't need to store the details for each commit in the target system, just
up to which LSN you have processed *commit records*. E.g. if you have received
and safely stored up to commit 0/1000, you need to remember that.
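
That acknowledged position is what you report back to the server in the standby
status updates, and it shows up as the slot's confirmed_flush_lsn - so you can
always cross-check what the server thinks you have durably processed:

SELECT slot_name, confirmed_flush_lsn, restart_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';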


Are you using the 'streaming' mode / option to pgoutput?


> 1. We have no way to match LSN operations with the respective commit, as
> they have unordered offsets.

Not sure what you mean with "unordered offsets"?


> Assuming that all of them were received in order, we would commit all data 
> with the commit message LSN4-4000 as other events would match the transaction 
> start and end LSN interval of it.

Logical decoding sends out changes in a deterministic order and you won't see
out of order data when using TCP (the entire connection can obviously fail
though).

Andres




Re: pltcl tests fail with FreeBSD 13.2

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-31 12:15:10 -0700, Andres Freund wrote:
> It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even
> parse what it generates:
> 
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] 
> -format "%Y/%m/%d"]'|tclsh8.6
> 
> Which works on 13.1 (and other operating systems), without a problem.
> 
> I used truss as a very basic way to see differences between 13.1 and 13.2 -
> the big difference was .2 failing just after
> access("/etc/localtime",F_OK)  ERR#2 'No such file or 
> directory'
> open("/etc/localtime",O_RDONLY,077)ERR#2 'No such file or 
> directory'
> 
> whereas 13.1 also saw that, but then continued to
> 
> issetugid()= 0 (0x0)
> open("/usr/share/zoneinfo/UTC",O_RDONLY,00)= 3 (0x3)
> fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
> ...
> 
> which made me test specifying the timezone explicitly:
> echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" 
> -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6
> 
> Which, surprise, works.
> 
> So does specifying the timezone via the TZ='UTC' environment variable.
> 
> 
> I guess there could be a libc behaviour change or such around timezones? I do
> see
> https://www.freebsd.org/releases/13.2R/relnotes/
> "tzcode has been upgraded to version 2022g with improved timezone change 
> detection and reliability fixes."

One additional datapoint: If I configure the timezone with "tzsetup" (which
creates /etc/localtime), the problem vanishes as well.

Greetings,

Andres Freund




pltcl tests fail with FreeBSD 13.2

2023-07-31 Thread Andres Freund
Hi,

I saw that CI image builds for freebsd were failing, because 13.1, used until
now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2
- unfortunately that failed. In pltcl of all places.

https://api.cirrus-ci.com/v1/artifact/task/5275616266682368/testrun/build/testrun/pltcl/regress/regression.diffs

Notably both 13.1 and 13.2 are using tcl 8.6.13.

The code for the relevant function is this:

create function tcl_date_week(int4,int4,int4) returns text as $$
return [clock format [clock scan "$2/$3/$1"] -format "%U"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);


It doesn't surprise me that there are problems with the above clock scan, it uses
the MM/DD/YYYY format without making that explicit. But why that doesn't work
on freebsd 13.2, I can't explain.  It looks like tcl specifies the MM/DD/YYYY
bit for "free format scans":
https://www.tcl.tk/man/tcl8.6/TclCmd/clock.html#M80

It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even
parse what it generates:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] 
-format "%Y/%m/%d"]'|tclsh8.6

Which works on 13.1 (and other operating systems), without a problem.

I used truss as a very basic way to see differences between 13.1 and 13.2 -
the big difference was .2 failing just after
access("/etc/localtime",F_OK)ERR#2 'No such file or 
directory'
open("/etc/localtime",O_RDONLY,077)  ERR#2 'No such file or 
directory'

whereas 13.1 also saw that, but then continued to

issetugid()  = 0 (0x0)
open("/usr/share/zoneinfo/UTC",O_RDONLY,00)  = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
...

which made me test specifying the timezone explicitly:
echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" 
-timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6

Which, surprise, works.

So does specifying the timezone via the TZ='UTC' environment variable.


I guess there could be a libc behaviour change or such around timezones? I do
see
https://www.freebsd.org/releases/13.2R/relnotes/
"tzcode has been upgraded to version 2022g with improved timezone change 
detection and reliability fixes."

Greetings,

Andres Freund




Postmaster doesn't correctly handle crashes in PM_STARTUP state

2023-07-29 Thread Andres Freund
Hi,

While testing something I made the checkpointer process intentionally crash as
soon as it started up.  The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:

2023-07-29 14:32:39.241 PDT [65031] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG:  checkpointer process (PID 65032) was 
terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG:  terminating any other active server 
processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG:  process 65035 taking over ProcSignal 
slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG:  shutting down because 
restart_after_crash is off

Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.

I could not initially reproduce this on linux.

After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...

The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;

Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on linux as well.

I haven't fully dug through the history, but this looks to be a quite old problem.


Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.

Greetings,

Andres Freund




Re: Support worker_spi to execute the function dynamically.

2023-07-28 Thread Andres Freund
Hi,

On 2023-07-28 13:45:29 +0900, Michael Paquier wrote:
> Having each bgworker on its own schema would be enough to prevent
> conflicts, but I'd like to add a second thing: a check on
> pg_stat_activity.wait_event after starting the workers.  I have added
> something like that in the patch I have posted today for the custom
> wait events at [1] and it enforces the startup sequences of the
> workers in a stricter way.

Is that very meaningful? ISTM the interesting thing to check for would be that
the state is idle?
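
I.e. rather than looking at the wait event, the test could just poll for

SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'worker_spi' AND state = 'idle';

to reach the expected number of workers - the workers report STATE_IDLE via
pgstat_report_activity() between iterations, so that also covers them having
finished connecting and initializing.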

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-07-28 Thread Andres Freund
Hi,

On 2023-07-28 16:19:47 -0400, Robert Haas wrote:
> On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan  wrote:
> > > Or are we trying to determine how likely
> > > the freeze record is to emit an FPI and avoid eager freezing when it
> > > isn't worth the cost?
> >
> > No, that's not something that we're doing right now (we just talked
> > about doing something like that). In 16 VACUUM just "makes the best
> > out of a bad situation" when an FPI was already required during
> > pruning. We have already "paid for the privilege" of writing some WAL
> > for the page at that point, so it's reasonable to not squander a
> > window of opportunity to avoid future FPIs in future VACUUM
> > operations, by freezing early.
> 
> This doesn't make sense to me. It's true that if the pruning record
> emitted an FPI, then the freezing record probably won't need to do so
> either, unless by considerable ill-fortune the redo pointer was moved
> in the tiny window between pruning and freezing. But isn't that also
> true that if the pruning record *didn't* emit an FPI? In that case,
> the pruning record wasn't the first WAL-logged modification to the
> page during the then-current checkpoint cycle, and some earlier
> modification already did it. But in that case also the freezing won't
> need to emit a new FPI, except for the same identical case where the
> redo pointer moves at just the wrong time.
> 
> Put differently, I can't see any reason to care whether pruning
> emitted an FPI or not. Either way, it's very unlikely that freezing
> needs to do so.

+many

Using the fact that FPIs were generated during pruning as part of the logic to
decide whether to freeze makes no sense to me. We likely need some heuristic
here, but I suspect it has to look quite different from the FPI condition.

Greetings,

Andres




Re: Support worker_spi to execute the function dynamically.

2023-07-27 Thread Andres Freund
Hi,

The new test fails with my AIO branch occasionally. But I'm fairly certain
that's just due to timing differences.

Excerpt from the log:

2023-07-27 21:43:00.385 UTC [42339] LOG:  worker_spi worker 3 initialized with 
schema3.counted
2023-07-27 21:43:00.399 UTC [42344] 001_worker_spi.pl LOG:  statement: SELECT 
datname, count(datname) FROM pg_stat_activity
WHERE backend_type = 'worker_spi' GROUP BY datname;
2023-07-27 21:43:00.403 UTC [42340] LOG:  worker_spi worker 2 initialized with 
schema2.counted
2023-07-27 21:43:00.407 UTC [42341] LOG:  worker_spi worker 1 initialized with 
schema1.counted
2023-07-27 21:43:00.420 UTC [42346] 001_worker_spi.pl LOG:  statement: SELECT 
worker_spi_launch(1);
2023-07-27 21:43:00.423 UTC [42347] LOG:  worker_spi dynamic worker 1 
initialized with schema1.counted
2023-07-27 21:43:00.432 UTC [42349] 001_worker_spi.pl LOG:  statement: SELECT 
worker_spi_launch(2);
2023-07-27 21:43:00.437 UTC [42350] LOG:  worker_spi dynamic worker 2 
initialized with schema2.counted
2023-07-27 21:43:00.443 UTC [42347] ERROR:  duplicate key value violates unique 
constraint "pg_namespace_nspname_index"
2023-07-27 21:43:00.443 UTC [42347] DETAIL:  Key (nspname)=(schema1) already 
exists.
2023-07-27 21:43:00.443 UTC [42347] CONTEXT:  SQL statement "CREATE SCHEMA 
"schema1" CREATE TABLE "counted" (   type text CHECK (type IN ('total', 
'delta')),   value   integer)CREATE UNIQUE INDEX "counted_unique_total" 
ON "counted" (type) WHERE type = 'total'"


As written, dynamic and static workers race each other. It doesn't make a lot
of sense to me to use the same ids for both?

The attached patch reproduces the problem on master.

Note that without the sleep(3) in the test the workers don't actually finish
starting; the test shuts down the cluster before that happens...

Greetings,

Andres Freund
diff --git i/src/test/modules/worker_spi/t/001_worker_spi.pl w/src/test/modules/worker_spi/t/001_worker_spi.pl
index c2938713134..093a038b005 100644
--- i/src/test/modules/worker_spi/t/001_worker_spi.pl
+++ w/src/test/modules/worker_spi/t/001_worker_spi.pl
@@ -52,6 +52,7 @@ $node->append_conf(
 shared_preload_libraries = 'worker_spi'
 worker_spi.database = 'mydb'
 worker_spi.total_workers = 3
+log_statement=all
 });
 $node->restart;
 
@@ -68,6 +69,9 @@ ok( $node->poll_query_until(
 # check their existence.
 my $worker1_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(1);');
 my $worker2_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(2);');
+
+sleep(3);
+
 ok( $node->poll_query_until(
 		'mydb',
 		qq[SELECT datname, count(datname)  FROM pg_stat_activity
diff --git i/src/test/modules/worker_spi/worker_spi.c w/src/test/modules/worker_spi/worker_spi.c
index 903dcddef97..ebd2120ec11 100644
--- i/src/test/modules/worker_spi/worker_spi.c
+++ w/src/test/modules/worker_spi/worker_spi.c
@@ -120,6 +120,8 @@ initialize_worker_spi(worktable *table)
 			elog(FATAL, "failed to create my schema");
 
 		debug_query_string = NULL;	/* rest is not statement-specific */
+
+		pg_usleep(USECS_PER_SEC*1);
 	}
 
 	SPI_finish();
@@ -127,6 +129,8 @@ initialize_worker_spi(worktable *table)
 	CommitTransactionCommand();
 	debug_query_string = NULL;
 	pgstat_report_activity(STATE_IDLE, NULL);
+
+	elog(LOG, "done");
 }
 
 void


Re: Buildfarm animal grassquit is failing

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 20:10:06 -0400, Tom Lane wrote:
> grassquit has been failing to run the regression tests for the last
> few days, since [1]:
> 
> # +++ regress check in src/test/regress +++
> # using temp instance on port 6880 with PID 766305
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently 
> 2048kB), after ensuring the platform's stack depth limit is adequate.
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently 
> 2048kB), after ensuring the platform's stack depth limit is adequate.
> # command failed: "psql" -X -q -c "CREATE DATABASE \\"regression\\" 
> TEMPLATE=template0 LOCALE='C'" -c "ALTER DATABASE \\"regression\\" SET 
> lc_messages TO 'C';ALTER DATABASE \\"regression\\" SET lc_monetary TO 
> 'C';ALTER DATABASE \\"regression\\" SET lc_numeric TO 'C';ALTER DATABASE 
> \\"regression\\" SET lc_time TO 'C';ALTER DATABASE \\"regression\\" SET 
> bytea_output TO 'hex';ALTER DATABASE \\"regression\\" SET 
> timezone_abbreviations TO 'Default';" "postgres"
> 
> This seems to be happening in all branches, so I doubt it
> has anything to do with recent PG commits.  Instead, I notice
> that grassquit seems to have been updated to a newer Debian
> release: 6.1.0-6-amd64 works, 6.3.0-2-amd64 doesn't.

Indeed, the automated updates had been stuck, and I fixed that a few days
ago. I missed grassquit's failures unfortunately.

I think I know what the issue is - I have seen these failures locally. Newer versions of
gcc and clang (or libasan?) default to enabling "stack use after return"
checks - for some reason that implies using a shadow stack *sometimes*. Which
can confuse our stack depth checking terribly, not so surprisingly.
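
(For context, the depth check is essentially a pointer-distance computation
against the recorded stack base, roughly like the simplified sketch below;
once some frames live on ASan's fake stack, that distance stops meaning much.
This is from memory, not the exact code in postgres.c.)

    /* simplified sketch of stack_is_too_deep() */
    static bool
    stack_is_too_deep(void)
    {
        char        stack_top_loc;
        long        stack_depth;

        /* distance between a local variable here and the recorded stack base */
        stack_depth = (long) (stack_base_ptr - &stack_top_loc);
        if (stack_depth < 0)
            stack_depth = -stack_depth;     /* stack may grow either way */

        return stack_depth > max_stack_depth_bytes;
    }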

I've been meaning to look into what we could do to fix that, but I haven't
found the cycles...

For now I added :detect_stack_use_after_return=0 to ASAN_OPTIONS, which I
think should fix the issue.

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-26 07:40:31 +0900, Michael Paquier wrote:
> On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote:
> > FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
> > because they don't scale all that well.
>
> What were you looking at here?  Just wondering.

Here's what I had written offlist a few days ago:

> The basic idea is to have a ringbuffer of in-progress insertions. The
> acquisition of a position in the ringbuffer is done at the same time as
> advancing the reserved LSN, using a 64bit xadd. The trick that makes that
> possible is to use the high bits of the atomic for the position in the
> ringbuffer. The point of using the high bits is that they wrap around, without
> affecting the rest of the value.
>
> Of course, when using xadd, we can't keep the "prev" pointer in the
> atomic. That's where the ringbuffer comes into play. Whenever one inserter has
> determined the byte pos of its insertion, it updates the "prev byte pos" in
> the *next* ringbuffer entry.
>
> Of course that means that insertion N+1 needs to wait for N to set the prev
> position - but that happens very quickly. In my very hacky prototype the
> relevant path (which for now just spins) is reached very rarely, even when
> massively oversubscribed. While I've not implemented that, N+1 could actually
> do the first "iteration" in CopyXLogRecordToWAL() before it needs the prev
> position, the COMP_CRC32C() could happen "inside" the buffer.
>
>
> There's a fair bit of trickyness in turning that into something working, of
> course :).  Ensuring that the ring buffer of insertions doesn't wrap around is
> non-trivial. Nor is trivial to ensure that the "reduced" space LSN in the
> atomic can't overflow.
>
> I do wish MAX_BACKENDS were smaller...
>
>
> Until last night I thought all my schemes would continue to need something
> like the existing WAL insertion locks, to implement
> WALInsertLockAcquireExclusive().
>
> But I think I came up with an idea to do away with that (not even prototyped
> yet): Use one bit in the atomic that indicates that no new insertions are
> allowed. Whenever the xadd finds that old value actually was locked, it
> "aborts" the insertion, and instead waits for a condition variable (or
> something similar). Of course that's after modifying the atomic - to deal with
> that the "lock holder" reverts all modifications that have been made to the
> atomic when releasing the "lock", they weren't actually successful and all
> those backends will retry.
>
> Except that this doesn't quite suffice - XLogInsertRecord() needs to be able
> to "roll back", when it finds that we now need to log FPIs. I can't quite see
> how to make that work with what I describe above. The only idea I have so far
> is to just waste the space with a NOOP record - it should be pretty rare. At
> least if we updated RedoRecPtr eagerly (or just stopped this stupid business
> of having an outdated backend-local copy).
>
>
>
> My prototype shows this idea to be promising. It's a tad slower at low
> concurrency, but much better at high concurrency. I think most if not all of
> the low-end overhead isn't inherent, but comes from having both old and new
> infrastructure in place (as well as a lot of debugging cruft).
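
To make the xadd-with-high-bits trick a bit more concrete, here's a purely
illustrative fragment (invented names and constants, not the prototype):

    #define RING_BITS       10      /* up to 1024 in-flight insertions */
    #define RING_SHIFT      (64 - RING_BITS)
    #define BYTEPOS_MASK    ((UINT64CONST(1) << RING_SHIFT) - 1)

    static inline void
    reserve_insertion(pg_atomic_uint64 *state, uint64 size,
                      uint64 *start_bytepos, uint32 *ring_slot)
    {
        /* bump the byte position (low bits) and the ring counter (high bits) */
        uint64      delta = size | (UINT64CONST(1) << RING_SHIFT);
        uint64      old = pg_atomic_fetch_add_u64(state, delta);

        /*
         * The ring counter wraps within the high bits without disturbing the
         * byte position; keeping the byte position itself from ever carrying
         * into the high bits is exactly the hard part mentioned above.
         */
        *start_bytepos = old & BYTEPOS_MASK;
        *ring_slot = (uint32) (old >> RING_SHIFT) & ((1u << RING_BITS) - 1);
    }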


Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.

Just a small heads up:

I just rebased my aio tree over the commit and promptly, on the first run, saw
a hang. I did some debugging on that. Unfortunately repeated runs haven't
repeated that hang, despite quite a bit of trying.

The symptom I was seeing is that all running backends were stuck in
LWLockWaitForVar(), even though the value they're waiting for had
changed. Which obviously "shouldn't be possible".

It's of course possible that this is AIO specific, but I didn't see anything
in stacks to suggest that.


I do wonder if this possibly exposed an undocumented prior dependency on the
value update always happening under the list lock.

Greetings,

Andres Freund




Re: Logical walsenders don't process XLOG_CHECKPOINT_SHUTDOWN

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 14:31:00 +0530, Amit Kapila wrote:
> To ensure that all the data has been sent during the upgrade, we can
> ensure that each logical slot's confirmed_flush_lsn (position in the
> WAL till which subscriber has confirmed that it has applied the WAL)
> is the same as current_wal_insert_lsn. Now, because we don't send
> XLOG_CHECKPOINT_SHUTDOWN even on clean shutdown, confirmed_flush_lsn
> will never be the same as current_wal_insert_lsn. The one idea being
> discussed in patch [1] (see 0003) is to ensure that each slot's LSN is
> exactly XLOG_CHECKPOINT_SHUTDOWN ago which probably has some drawbacks
> like what if we tomorrow add some other WAL in the shutdown checkpoint
> path or the size of record changes then we would need to modify the
> corresponding code in upgrade.

Yea, that doesn't seem like a good path. But there is a variant that seems
better: We could just scan the end of the WAL for records that should have
been streamed out?

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.
> 
> > FWIW, 0001 essentially implements what
> > an existing TODO comment introduced by commit 008608b9d5106 says:
> 
> We really need to do something in terms of documentation with
> something like 0002, so I'll try to look at that next.  Regarding
> 0003, I don't know.  I think that we'd better look more into cases
> where it shows actual benefits for specific workloads (like workloads
> with a fixed rate of read and/or write operations?).

FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
because they don't scale all that well. If there's no very clear improvements,
I'm not sure it's worth putting too much effort into polishing them all that
much.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 08:50:19 -0700, Andres Freund wrote:
> One idea I had was to add a fastpath that won't parse all strings, but will
> parse the strings that we would generate, and fall back to the more general
> variant if it fails. See the attached, rough, prototype:
> 
> fix_COPY_DEFAULT.patch + fastpath.patch:
> 746.971
> 
> fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
> 715.570
> 
> Now, the precise contents of this fastpath are not yet clear (wrt imul or
> not), but I think the idea has promise.

Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows
up.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
> reason to delay committing that?  It would be good to eliminate that
> as a variable for the current performance regression.

Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
want to commit it? Or Andrew?


> > prep:
> > COPY (SELECT generate_series(1, 200) a, (random() * 10 - 
> > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
> >
> > benchmark:
> > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY 
> > lotsaints FROM '/tmp/lotsaints.copy';") -t 15
> >
> > I disabled turbo mode, pinned the server to a single core of my Xeon Gold 
> > 5215:
> >
> > HEAD:   812.690
> >
> > your patch: 821.354
> >
> > strtoint from 8692f6644e7:  824.543
> >
> > strtoint from 6b423ec677d^: 806.678
> 
> I'm surprised to see the imul version is faster. It's certainly not
> what we found when working on 6b423ec67.

What CPUs did you test it on? I'd not be surprised if this were heavily
dependent on the microarchitecture.


> I've been fooling around a bit to try to learn what might be going on.
> I wrote 2 patches:
> 
> 1) pg_strtoint_imul.patch:  Effectively reverts 6b423ec67 and puts the
> code how it likely would have looked had I not done that work at all.
> (I've assumed that the hex, octal, binary parsing would have been
> added using the overflow multiplication functions just as the base 10
> parsing).


> 2) pg_strtoint_optimize.patch: I've made another pass over the
> functions with the current overflow checks and optimized the digit
> checking code so that it can be done in a single check like: if (digit
> < 10).  This can be done by assigning the result of *ptr - '0' to an
> unsigned char.  Anything less than '0' will wrap around and anything
> above '9' will remain so.  I've also removed a few inefficiencies with
> the isspace checking. We didn't need to do "while (*ptr &&
> isspace(*ptr)).  It's fine just to do while (isspace(*ptr)) since '\0'
> isn't a space char.  I also got rid of the isxdigit call.  The
> hexlookup array contains -1 for non-hex chars. We may as well check
> the digit value is >= 0.
> 
> Here are the results I get using your test as quoted above:
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch
> 
> latency average = 568.102 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
> 
> latency average = 531.238 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
> 
> latency average = 559.333 ms (surprisingly also faster on my machine)
> 
> The optimized version of the pg_strtoint functions wins over the imul
> patch.  Could you test to see if this is also the case in your Xeon
> machine?

(these are the numbers with turbo disabled, if I enable it they're all in the
6xx ms range, but the variance is higher)


fix_COPY_DEFAULT.patch
774.344

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch
777.673

fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
777.545

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
795.298

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit())
773.477

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch
774.443

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit())
774.513

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_')
763.879


One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:

fix_COPY_DEFAULT.patch + fastpath.patch:
746.971

fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
715.570

Now, the precise contents of this fastpath are not yet clear (wrt imul or
not), but I think the idea has promise.
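
For illustration, the shape of such a fastpath might be roughly the following.
This is a hypothetical sketch, not the attached prototype; function name and
details are invented:

    /*
     * Accept only the plain "[+-]?[0-9]+" form that we would emit ourselves;
     * return false to let the general parser handle whitespace,
     * hex/octal/binary, underscores and error reporting.
     */
    static inline bool
    pg_strtoint32_fast(const char *str, int32 *result)
    {
        const char *ptr = str;
        bool        neg = false;
        int64       acc = 0;

        if (*ptr == '-')
        {
            neg = true;
            ptr++;
        }
        else if (*ptr == '+')
            ptr++;

        if ((unsigned char) (*ptr - '0') >= 10)
            return false;           /* no digit: not our case */

        do
        {
            acc = acc * 10 + (*ptr - '0');
            if (acc > (int64) PG_INT32_MAX + (neg ? 1 : 0))
                return false;       /* overflow: slow path reports the error */
            ptr++;
        } while ((unsigned char) (*ptr - '0') < 10);

        if (*ptr != '\0')
            return false;           /* trailing junk / whitespace: slow path */

        *result = neg ? (int32) -acc : (int32) acc;
        return true;
    }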



> > (when I say strtoint from, I did not replace the goto labels, so that part 
> > is
> > unchanged and unrelated)
> 
> I didn't quite follow this.

I meant that I didn't undo ereport()->ereturn().

Greetings,

Andres Freund
diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c
ind

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-24 Thread Andres Freund
Hi,

Hm, in some cases your patch is better, but in others both the old code
(8692f6644e7) and HEAD beat yours on my machine. TBH, not entirely sure why.

prep:
COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 
3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints 
FROM '/tmp/lotsaints.copy';") -t 15

I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:

HEAD:   812.690

your patch: 821.354

strtoint from 8692f6644e7:  824.543

strtoint from 6b423ec677d^: 806.678

(when I say strtoint from, I did not replace the goto labels, so that part is
unchanged and unrelated)


IOW, for me the code from 15 is the fastest by a good bit... There's an imul,
sure, but the fact that it sets a flag makes it faster than having to add more
tests and branches.


Looking at a profile reminded me of how silly it is that we are wasting a good
chunk of the time in these isdigit() checks, even though we already rely on
the ASCII values (via *ptr++ - '0'). I think that's done in the headers
for some platforms, but not others (glibc). And we've even done so already for
octal and binary!
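
(Open-coding it boils down to something like this, shown only for
illustration:)

/* ASCII-only digit check, avoiding the locale-aware isdigit() call */
#define ascii_isdigit(c) ((unsigned char) ((c) - '0') < 10)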

Open coding isdigit() gives us:


HEAD:   797.434

your patch: 803.570

strtoint from 8692f6644e7:  778.943

strtoint from 6b423ec677d^: 777.741


It's somewhat odd that HEAD and your patch switch positions here...


> - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
> + /* process hex digits */
> + else if (ptr[1] == 'x' || ptr[1] == 'X')
>   {
>
>   firstdigit = ptr += 2;

I find this unnecessarily hard to read. I realize it's been added in
6fcda9aba83, but I don't see a reason to use such a construct here.


I find it somewhat grating how much duplication there now is in this
code due to being repeated for all the bases...


Greetings,

Andres Freund




Re: [BUG] Crash on pgbench initialization.

2023-07-24 Thread Andres Freund
Hi,

On 2023-07-24 09:42:44 -0700, Andres Freund wrote:
> > I don't know this code at all, but I hope that this can be solved with
> > just Anton's proposed patch.
> 
> Yes, it's just that off-by-one.  I need to check if there's a similar bug for
> local / temp table buffers though.

Doesn't appear that way. We *do* fail if there's only 1 remaining buffer, but
we already did before my change (because we also need to pin the FSM). I don't
think that's an issue worth worrying about: if all but one of the local buffers
are pinned, you're going to have problems.

Thanks Anton / Victoria for the report and fix. Pushed.

Greetings,

Andres Freund




Re: [BUG] Crash on pgbench initialization.

2023-07-24 Thread Andres Freund
Hi,

On 2023-07-24 15:54:33 +0200, Alvaro Herrera wrote:
> On 2023-Jul-24, Michael Paquier wrote:
> 
> > On Sun, Jul 23, 2023 at 11:21:47PM +0300, Anton A. Melnikov wrote:
> > > After some research, found this happens when the LimitAdditionalPins() 
> > > returns exactly zero.
> > > In the current master, this will happen e.g. if shared_buffers = 10MB and 
> > > max_worker_processes = 40.
> > > Then the command "pgbench --initialize postgres" will lead to crash.
> > > See the backtrace attached.
> > > 
> > > There is a fix in the patch applied. Please take a look on it.
> > 
> > Nice catch, issue reproduced here so I am adding an open item for now.
> > (I have not looked at the patch, yet.)
> 
> Hmm, I see that all this code was added by 31966b151e6a, which makes
> this Andres' item.  I see Michael marked it as such in the open items
> page, but did not CC Andres, so I'm doing that here now.

Thanks - I had indeed not seen this.  I can't really keep up with the list at
all times...


> I don't know this code at all, but I hope that this can be solved with
> just Anton's proposed patch.

Yes, it's just that off-by-one.  I need to check if there's a similar bug for
local / temp table buffers though.

Greetings,

Andres Freund




Re: Avoid stack frame setup in performance critical routines using tail calls

2023-07-19 Thread Andres Freund
Hi,

David and I were chatting about this patch, in the context of his bump
allocator patch.  Attached is a rebased version that is also split up into two
steps, and a bit more polished.

I wasn't sure what a good test was. I ended up measuring
  COPY pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary');
of a scale 1 database with pgbench:

c=1;pgbench -q -i -s1 && pgbench  -n -c$c -j$c -t100 -f <(echo "COPY 
pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary');")

average latency
HEAD:   33.865 ms
01: 32.820 ms
02: 29.934 ms

The server was pinned to the one core, turbo mode disabled.  That's a pretty
nice win, I'd say.  And I don't think this is actually the most allocator
bound workload, I just tried something fairly random...


Greetings,

Andres Freund
>From 4b97070e32ed323ae0f083bf865717c6d271ce53 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 18 Jul 2023 18:55:58 -0700
Subject: [PATCH v2 1/4] Optimize palloc() etc to allow sibling calls

The reason palloc() etc previously couldn't use sibling calls is that they did
check the returned value (to e.g. raise an error when the allocation
fails). Push the error handling down into the memory context implementation -
they can avoid performing the check at all in the hot code paths.
---
 src/include/nodes/memnodes.h  |   4 +-
 src/include/utils/memutils_internal.h |  29 +++-
 src/backend/utils/mmgr/alignedalloc.c |  11 +-
 src/backend/utils/mmgr/aset.c |  23 ++--
 src/backend/utils/mmgr/generation.c   |  14 +-
 src/backend/utils/mmgr/mcxt.c | 186 ++
 src/backend/utils/mmgr/slab.c |  12 +-
 7 files changed, 102 insertions(+), 177 deletions(-)

diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index ff6453bb7ac..47d2932e6ed 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -57,10 +57,10 @@ typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru,
 
 typedef struct MemoryContextMethods
 {
-	void	   *(*alloc) (MemoryContext context, Size size);
+	void	   *(*alloc) (MemoryContext context, Size size, int flags);
 	/* call this free_p in case someone #define's free() */
 	void		(*free_p) (void *pointer);
-	void	   *(*realloc) (void *pointer, Size size);
+	void	   *(*realloc) (void *pointer, Size size, int flags);
 	void		(*reset) (MemoryContext context);
 	void		(*delete_context) (MemoryContext context);
 	MemoryContext (*get_chunk_context) (void *pointer);
diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h
index 2d107bbf9d4..3a050819263 100644
--- a/src/include/utils/memutils_internal.h
+++ b/src/include/utils/memutils_internal.h
@@ -19,9 +19,9 @@
 #include "utils/memutils.h"
 
 /* These functions implement the MemoryContext API for AllocSet context. */
-extern void *AllocSetAlloc(MemoryContext context, Size size);
+extern void *AllocSetAlloc(MemoryContext context, Size size, int flags);
 extern void AllocSetFree(void *pointer);
-extern void *AllocSetRealloc(void *pointer, Size size);
+extern void *AllocSetRealloc(void *pointer, Size size, int flags);
 extern void AllocSetReset(MemoryContext context);
 extern void AllocSetDelete(MemoryContext context);
 extern MemoryContext AllocSetGetChunkContext(void *pointer);
@@ -36,9 +36,9 @@ extern void AllocSetCheck(MemoryContext context);
 #endif
 
 /* These functions implement the MemoryContext API for Generation context. */
-extern void *GenerationAlloc(MemoryContext context, Size size);
+extern void *GenerationAlloc(MemoryContext context, Size size, int flags);
 extern void GenerationFree(void *pointer);
-extern void *GenerationRealloc(void *pointer, Size size);
+extern void *GenerationRealloc(void *pointer, Size size, int flags);
 extern void GenerationReset(MemoryContext context);
 extern void GenerationDelete(MemoryContext context);
 extern MemoryContext GenerationGetChunkContext(void *pointer);
@@ -54,9 +54,9 @@ extern void GenerationCheck(MemoryContext context);
 
 
 /* These functions implement the MemoryContext API for Slab context. */
-extern void *SlabAlloc(MemoryContext context, Size size);
+extern void *SlabAlloc(MemoryContext context, Size size, int flags);
 extern void SlabFree(void *pointer);
-extern void *SlabRealloc(void *pointer, Size size);
+extern void *SlabRealloc(void *pointer, Size size, int flags);
 extern void SlabReset(MemoryContext context);
 extern void SlabDelete(MemoryContext context);
 extern MemoryContext SlabGetChunkContext(void *pointer);
@@ -75,7 +75,7 @@ extern void SlabCheck(MemoryContext context);
  * part of a fully-fledged MemoryContext type.
  */
 extern void AlignedAllocFree(void *pointer);
-extern void *AlignedAllocRealloc(void *pointer, Size size);
+extern void *AlignedAllocRealloc(void *pointer, Size size, int flags);
 extern MemoryContext AlignedAllocGetChunkContext(v

Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-07-18 Thread Andres Freund
Hi,

I wanted this feature a couple times before...

On 2023-07-03 13:56:29 +0530, Palak Chaturvedi wrote:
> +PG_FUNCTION_INFO_V1(pg_buffercache_invalidate);
> +Datum
> +pg_buffercache_invalidate(PG_FUNCTION_ARGS)


I don't think "invalidating" is the right terminology. Note that we already
have InvalidateBuffer() - but it's something we can't allow users to do, as it
throws away dirty buffer contents (it's used for things like dropping a
table).

How about using "discarding" for this functionality?



Using the buffer ID as the identifier doesn't seem great, because what that
buffer is used for could have changed since the buffer ID was acquired
(via the pg_buffercache view, presumably)?

My suspicion is that the usual use case for this would be to drop all buffers
that can be dropped?


> + if (bufnum < 0 || bufnum > NBuffers)
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("buffernum is not valid")));
> +
> + }
> +
> + result = TryInvalidateBuffer(bufnum);
> + PG_RETURN_BOOL(result);
> +}

I think this should be restricted to superuser by default (by revoking
permissions from PUBLIC). We allow normal users to use pg_prewarm(...), true -
but we perform an ACL check on the relation, so it can only be used for
relations you have access to.  This function could be used to affect
performance of other users quite substantially.




> +/*
> +Try Invalidating a buffer using bufnum.
> +If the buffer is invalid, the function returns false.
> +The function checks for dirty buffer and flushes the dirty buffer before 
> invalidating.
> +If the buffer is still dirty it returns false.
> +*/
> +bool
> +TryInvalidateBuffer(Buffer bufnum)
> +{
> + BufferDesc *bufHdr = GetBufferDescriptor(bufnum - 1);
> + uint32  buf_state;
> +
> + ReservePrivateRefCountEntry();
> +
> + buf_state = LockBufHdr(bufHdr);
> + if ((buf_state & BM_VALID) == BM_VALID)
> + {
> + /*
> +  * The buffer is pinned therefore cannot invalidate.
> +  */
> + if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> + {
> + UnlockBufHdr(bufHdr, buf_state);
> + return false;
> + }
> + if ((buf_state & BM_DIRTY) == BM_DIRTY)
> + {
> + /*
> +  * Try once to flush the dirty buffer.
> +  */
> + PinBuffer_Locked(bufHdr);
> + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), 
> LW_SHARED);
> + FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, 
> IOCONTEXT_NORMAL);
> + LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
> + UnpinBuffer(bufHdr);
> + buf_state = LockBufHdr(bufHdr);
> + if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> + {
> + UnlockBufHdr(bufHdr, buf_state);
> + return false;
> + }
> +
> + /*
> +  * If its dirty again or not valid anymore give up.
> +  */
> +
> + if ((buf_state & (BM_DIRTY | BM_VALID)) != (BM_VALID))
> + {
> + UnlockBufHdr(bufHdr, buf_state);
> + return false;
> + }
> +
> +         }
> +
> + InvalidateBuffer(bufHdr);

I'm wary of using InvalidateBuffer() here; it's typically used for different
purposes, including throwing valid contents away. That seems a bit scary.

I think you should be able to just use InvalidateVictimBuffer()?

Greetings,

Andres Freund




Re: Atomic ops for unlogged LSN

2023-07-17 Thread Andres Freund
Hi,

On 2023-07-17 16:15:52 -0700, Nathan Bossart wrote:
> On Mon, Jul 17, 2023 at 07:08:03PM -0400, Stephen Frost wrote:
> > Awesome.  Was there any other feedback on this change which basically is
> > just getting rid of a spinlock and replacing it with using atomics?
> > It's still in needs-review status but there's been a number of
> > comments/reviews (drive-by, at least) but without any real ask for any
> > changes to be made.
> 
> LGTM

Why don't we just use a barrier around reading the value? It's not like
CreateCheckPoint() is frequent?
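
Roughly something like this - an illustrative sketch only, assuming
unloggedLSN becomes a pg_atomic_uint64 in XLogCtlData (not the proposed
patch, and the exact barrier placement would need more thought):

/* hot path (GetFakeLSNForUnloggedRel): plain atomic increment, no spinlock */
return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);

/* in CreateCheckPoint(), which is rare, pay for an explicit barrier instead */
pg_memory_barrier();
checkPoint.unloggedLSN = pg_atomic_read_u64(&XLogCtl->unloggedLSN);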

Greetings,

Andres Freund




Report distinct wait events when waiting for WAL "operation"

2023-07-17 Thread Andres Freund
Hi,

In a number of workloads one can see two wait events prominently:
LWLock:WALWrite and LWLock:WALInsert. Unfortunately for both that is not very
informative:

LWLock:WALWrite can be reported because there is genuine contention on the
LWLock, or, more commonly, because several backends are waiting for another to
finish IO. In the latter case we are not actually waiting to acquire the lock,
we are waiting for the lock to be released *without* then acquiring it.

LWLock:WALInsert can be reported because there are not enough WALInsert locks
(c.f. NUM_XLOGINSERT_LOCKS) or because we are waiting for another backend to
finish copying a WAL record into wal_buffers.  In the latter case we are
therefore not waiting to acquire an LWLock.


I think both of these cases are relevant to distinguish from an operational
perspective. Secondarily, I've received many questions about making those
locks more scalable / granular, when in most of the cases the issue was not
actual lock contention.

Today it's surprisingly hard to figure out whether the issue is lock
contention or the speed of copying buffers for WAL insert locks / computing
the last part of the CRC checksum.


Therefore I'm proposing that LWLockAcquireOrWait() and LWLockWaitForVar() not
use the "generic" LWLockReportWaitStart(), but use caller provided wait
events.  The attached patch adds two new wait events for the existing callers.

I waffled a bit about which wait event section to add these to. Ended up with
"IPC", but would be happy to change that.

WAIT_EVENT_WAL_WAIT_INSERT  WALWaitInsert   "Waiting for WAL record to be 
copied into buffers."
WAIT_EVENT_WAL_WAIT_WRITE   WALWaitWrite"Waiting for WAL buffers to be 
written or flushed to disk."


Previously it was e.g. not really possible to distinguish that something like
this:

┌┬─┬┬───┐
│  backend_type  │ wait_event_type │ wait_event │ count │
├┼─┼┼───┤
│ client backend │ LWLock  │ WALInsert  │32 │
│ client backend │ (null)  │ (null) │ 9 │
│ walwriter  │ IO  │ WALWrite   │ 1 │
│ client backend │ Client  │ ClientRead │ 1 │
│ client backend │ LWLock  │ WALWrite   │ 1 │
└┴─┴┴───┘

is a workload with a very different bottleneck than this:

┌┬─┬───┬───┐
│  backend_type  │ wait_event_type │  wait_event   │ count │
├┼─┼───┼───┤
│ client backend │ IPC │ WALWaitInsert │22 │
│ client backend │ LWLock  │ WALInsert │13 │
│ client backend │ LWLock  │ WALBufMapping │ 5 │
│ walwriter  │ (null)  │ (null)│ 1 │
│ client backend │ Client  │ ClientRead│ 1 │
│ client backend │ (null)  │ (null)│ 1 │
└┴─┴───┴───┘

even though they are very different.

FWIW, the former is bottlenecked by the number of WAL insertion locks, the
latter by copying WAL into buffers due to needing to flush them.

Greetings,

Andres Freund
>From 9dcf4a45b6ed7d8fca1a1cd6782f11dff3a84406 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 17 Jul 2023 09:20:05 -0700
Subject: [PATCH v1] Caller specified wait events for LWLockWaitForVar(),
 LWLockAcquireOrWait()

Reporting waits within those functions as LWLock wait events is misleading, as
we do not actually wait for the lock themselves.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/lwlock.h  |  6 --
 src/backend/access/transam/xlog.c |  6 --
 src/backend/storage/lmgr/lwlock.c | 19 +--
 .../utils/activity/wait_event_names.txt   |  2 ++
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 34169e5889e..0f345076f41 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,7 +127,8 @@ extern PGDLLIMPORT bool Trace_lwlocks;
 
 extern bool LWLockAcquire(LWLock *lock, LWLockMode mode);
 extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
-extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode,
+uint32 wait_event_info);
 extern void LWLockRelease(LWLock *lock);
 extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val);
 extern void LWLockReleaseAll(void);
@@ -135,7 +136,8 @@ extern bool LWLockHeldByMe(LWLock *lock);
 extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride);
 extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode);
 
-extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);

Re: Improve heapgetpage() performance, overhead from serializable

2023-07-17 Thread Andres Freund
Hi,

On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:
> LGTM and I have a fool question:
>
>   if (likely(all_visible))
>   {
>   if (likely(!check_serializable))
>   scan->rs_ntuples = heapgetpage_collect(scan, snapshot, 
> page, buffer,
>   
>    block, lines, 1, 0);
>   else
>   scan->rs_ntuples = heapgetpage_collect(scan, snapshot, 
> page, buffer,
>   
>    block, lines, 1, 1);
>   }
>   else
>   {
>   if (likely(!check_serializable))
>   scan->rs_ntuples = heapgetpage_collect(scan, snapshot, 
> page, buffer,
>   
>    block, lines, 0, 0);
>   else
>   scan->rs_ntuples = heapgetpage_collect(scan, snapshot, 
> page, buffer,
>   
>    block, lines, 0, 1);
>
>
> Does it make sense to combine if else condition and put it to the incline 
> function’s param?
>
> Like:
> scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
>   
>    block, lines, all_visible, check_serializable);

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.
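
As a toy illustration of the concern (nothing to do with the actual heapam
code; names invented): with literal constant arguments each call site gets its
own specialized, branch-free copy, while passing variables through makes that
less certain.

    /* toy example, assuming the usual postgres headers */
    pg_attribute_always_inline
    static int
    count_valid(int nlines, bool all_visible, bool check_serializable)
    {
        int         n = 0;

        for (int i = 0; i < nlines; i++)
        {
            /* stand-ins for the real visibility / serializability checks */
            bool        valid = all_visible || (i % 2 == 0);

            if (check_serializable)
                valid = valid && (i % 3 != 0);
            if (valid)
                n++;
        }
        return n;
    }

    /* constant arguments: each specialization can drop its dead branches */
    static int count_all_visible(int nlines) { return count_valid(nlines, true, false); }

    /* variable arguments: the compiler may keep one branchy copy instead */
    static int count_generic(int nlines, bool av, bool cs) { return count_valid(nlines, av, cs); }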

Greetings,

Andres Freund




Re: Autogenerate some wait events code and documentation

2023-07-16 Thread Andres Freund
Hi,

On 2023-07-14 13:49:22 +0900, Michael Paquier wrote:
> I have looked again at 0001 and 0002 and applied them to get them out
> of the way.  0003 and 0004 are rebased and attached.  I'll add them to
> the CF for later consideration.  More opinions are welcome.

> From b6390183bdcc054df82279bb0b2991730f85a0a3 Mon Sep 17 00:00:00 2001
> From: Michael Paquier 
> Date: Thu, 13 Jul 2023 10:14:47 +0900
> Subject: [PATCH v3 4/4] Remove column for enum elements in
>  wait_event_names.txt
> 
> This file is now made of two columns, removing the column listing the
> enum elements for each wait event class:
> - Camelcase event name used in pg_stat_activity.  There are now
> unquoted.
> - Description of the documentation.

> The enum elements are generated from what is now the first column.

I think the search issue is valid, so I do think going the other way is
preferable. I.e. use just the enum value in the .txt and generate the camel
case name from that. That allows you to search for the define used in code and
find a hit in the file.

I personally would still leave off the WAIT_EVENT prefix in the .txt, I think
most of us can remember to chop that off.

I don't think we need to be particularly consistent with wait events across
major versions. They're necessarily tied to how the code works, and we've
yanked that around plenty.

Greetings,

Andres Freund




Improve heapgetpage() performance, overhead from serializable

2023-07-15 Thread Andres Freund
Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed.  It's not
just the function call overhead that's noticeable, it's also that registers
have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise.  All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 1000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987


FWIW, there's more we can do; with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is imo worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies.  I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?


Greetings,

Andres Freund
>From 92ca2c15ccf2adb9b7f2da9cf86b571534287136 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 15 Jul 2023 15:13:30 -0700
Subject: [PATCH v1] optimize heapgetpage()

---
 src/backend/access/heap/heapam.c | 100 ++-
 1 file changed, 72 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7ed72abe597..d662d2898a2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -366,6 +366,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 	scan->rs_numblocks = numBlks;
 }
 
+/*
+ * The loop for heapgetpage() in pagemode. Pulled out so it can be called
+ * multiple times, with constant arguments for all_visible,
+ * check_serializable.
+ */
+pg_attribute_always_inline
+static int
+heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
+	Page page, Buffer buffer,
+	BlockNumber block, int lines,
+	bool all_visible, bool check_serializable)
+{
+	int			ntup = 0;
+	OffsetNumber lineoff;
+
+	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	{
+		ItemId		lpp = PageGetItemId(page, lineoff);
+		bool		valid;
+		HeapTupleData loctup;
+
+		if (!ItemIdIsNormal(lpp))
+			continue;
+
+		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		loctup.t_len = ItemIdGetLength(lpp);
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(loctup.t_self), block, lineoff);
+
+		if (all_visible)
+			valid = true;
+		else
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+
+		if (check_serializable)
+			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+&loctup, buffer, snapshot);
+
+		if (valid)
+		{
+			scan->rs_vistuples[ntup] = lineoff;
+			ntup++;
+		}
+	}
+
+	Assert(ntup <= MaxHeapTuplesPerPage);
+
+	return ntup;
+}
+
 /*
  * heapgetpage - subroutine for heapgettup()
  *
@@ -381,9 +431,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
-	int			ntup;
-	OffsetNumber lineoff;
 	bool		all_visible;
+	bool		check_serializable;
 
 	Assert(block < scan->rs_nblocks);
 
@@ -427,7 +476,6 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	page = BufferGetPage(buffer);
 	TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page);
 	lines = PageGetMaxOffsetNumber(page);
-	ntup = 0;
 
 	/*
 	 * If the all-visible flag indicates that all tuples on the page are
@@ -450,37 +498,33 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	check_serializable =
+		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_

Re: Inefficiency in parallel pg_restore with many tables

2023-07-15 Thread Andres Freund
Hi,

On 2023-07-15 13:47:12 -0400, Tom Lane wrote:
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.

Perhaps we could keep track of where the newly inserted items are, and use
insertion sort or such when the number of new elements is much smaller than
the size of the already sorted elements?

As you say, a straight priority heap might not be easy. But we could just open
code it using two sorted arrays, one large, one for recent additions that need
to be newly sorted. And occasionally merge the small array into the big array,
once it has gotten large enough that sorting it becomes expensive.  We could go
for a heap of N>2 such arrays, but I doubt it would be worth much.
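
A self-contained toy sketch of that scheme (plain ints standing in for
TocEntry pointers, invented names, allocation and the "pop something other
than the largest item" case ignored):

    #include <stdlib.h>

    typedef struct
    {
        int        *sorted;         /* big array, kept sorted descending */
        int         nsorted;
        int        *recent;         /* small array of new additions, unsorted */
        int         nrecent;
    } ready_list;

    static int
    cmp_desc(const void *a, const void *b)
    {
        int         va = *(const int *) a;
        int         vb = *(const int *) b;

        return (vb > va) - (vb < va);
    }

    /* sort the recent additions and merge them into the big array in place */
    static void
    ready_list_merge(ready_list *l)
    {
        int         i = l->nsorted - 1;
        int         j = l->nrecent - 1;
        int         k = l->nsorted + l->nrecent - 1;

        qsort(l->recent, l->nrecent, sizeof(int), cmp_desc);

        /* backwards merge, assuming ->sorted has room for the combined result */
        while (j >= 0)
        {
            if (i >= 0 && l->sorted[i] < l->recent[j])
                l->sorted[k--] = l->sorted[i--];
            else
                l->sorted[k--] = l->recent[j--];
        }
        l->nsorted += l->nrecent;
        l->nrecent = 0;
    }

    static void
    ready_list_add(ready_list *l, int item)
    {
        l->recent[l->nrecent++] = item;

        /* only pay for sorting/merging once the unsorted tail isn't small */
        if (l->nrecent * 8 >= l->nsorted)
            ready_list_merge(l);
    }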

Greetings,

Andres Freund




XLogSaveBufferForHint() correctness and more

2023-07-14 Thread Andres Freund
Hi,

While looking at [1] I started to wonder why it is safe that
CreateCheckPoint() updates XLogCtl->RedoRecPtr after releasing the WAL
insertion lock:

/*
 * Now we can release the WAL insertion locks, allowing other xacts to
 * proceed while we are flushing disk buffers.
 */
WALInsertLockRelease();

/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);

The most important user of that is GetRedoRecPtr().

Right now I'm a bit confused why it's ok that

/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);

happens after WALInsertLockRelease().


But then I started to wonder: even if that weren't the case, how come
XLogSaveBufferForHint() and other uses of GetRedoRecPtr() aren't racy as
hell?

The reason XLogInsertRecord() can safely check if an FPW is needed is that it
holds a WAL insertion lock: the redo pointer cannot change until the insertion
lock is released.

But there's *zero* interlock in XLogSaveBufferForHint() from what I can tell?
A checkpoint could easily start between the GetRedoRecPtr() and the check
whether this buffer needs to be WAL logged?


While XLogSaveBufferForHint() makes no note of this, its sole caller,
MarkBufferDirtyHint(), tries to deal with some related concerns to some
degree:

/*
 * If the block is already dirty because we either made 
a change
 * or set a hint already, then we don't need to write a 
full page
 * image.  Note that aggressive cleaning of blocks 
dirtied by hint
 * bit setting would increase the call rate. Bulk 
setting of hint
 * bits would reduce the call rate...
 *
 * We must issue the WAL record before we mark the 
buffer dirty.
 * Otherwise we might write the page before we write 
the WAL. That
 * causes a race condition, since a checkpoint might 
occur between
 * writing the WAL record and marking the buffer dirty. 
We solve
 * that with a kluge, but one that is already in use 
during
 * transaction commit to prevent race conditions. 
Basically, we
 * simply prevent the checkpoint WAL record from being 
written
 * until we have marked the buffer dirty. We don't 
start the
 * checkpoint flush until we have marked dirty, so our 
checkpoint
 * must flush the change to disk successfully or the 
checkpoint
 * never gets written, so crash recovery will fix.
 *
 * It's possible we may enter here without an xid, so 
it is
 * essential that CreateCheckPoint waits for virtual 
transactions
 * rather than full transactionids.
 */
Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 
0);
MyProc->delayChkptFlags |= DELAY_CHKPT_START;
delayChkptFlags = true;
lsn = XLogSaveBufferForHint(buffer, buffer_std);

but I don't think that really does all that much, because the
DELAY_CHKPT_START handling in CreateCheckPoint() happens after we determine
the redo pointer.  This code isn't even reached if we wrongly skipped due to
the if (lsn <= RedoRecPtr).


I seriously doubt this can correctly be implemented outside of xlog*.c /
without the use of a WALInsertLock?

I feel like I must be missing something here; this isn't a particularly narrow race?


It looks to me like the sue of GetRedoRecPtr() in nextval_internal() is also
wrong. I think the uses in slot.c, snapbuild.c, rewriteheap.c are fine.

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20230714151626.rhgae7taigk2xrq7%40awork3.anarazel.de




Re: New WAL record to detect the checkpoint redo location

2023-07-14 Thread Andres Freund
Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote:
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index b2430f617c..a025fb91e2 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata,
>   XLogRecPtr  StartPos;
>   XLogRecPtr  EndPos;
>   boolprevDoPageWrites = doPageWrites;
> + boolcallerHoldingExlock = holdingAllLocks;
>   TimeLineID  insertTLI;
>  
>   /* we assume that all of the record header is in the first chunk */
> @@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata,
>*--
>*/
>   START_CRIT_SECTION();
> - if (isLogSwitch)
> - WALInsertLockAcquireExclusive();
> - else
> - WALInsertLockAcquire();
> +
> + /*
> +  * Acquire wal insertion lock, nothing to do if the caller is already
> +  * holding the exclusive lock.
> +  */
> + if (!callerHoldingExlock)
> + {
> + if (isLogSwitch)
> + WALInsertLockAcquireExclusive();
> + else
> + WALInsertLockAcquire();
> + }
>  
>   /*
>* Check to see if my copy of RedoRecPtr is out of date. If so, may have

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.


> @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

>*/
>   RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
>  
> + /*
> +  * Insert a special purpose CHECKPOINT_REDO record as the first record 
> at
> +  * checkpoint redo lsn.  Although we have the checkpoint record that
> +  * contains the exact redo lsn, there might have been some other records
> +  * those got inserted between the redo lsn and the actual checkpoint
> +  * record.  So when processing the wal, we cannot rely on the checkpoint
> +  * record if we want to stop at the checkpoint-redo LSN.
> +  *
> +  * This special record, however, is not required when we doing a 
> shutdown
> +  * checkpoint, as there will be no concurrent wal insertions during that
> +  * time.  So, the shutdown checkpoint LSN will be the same as
> +  * checkpoint-redo LSN.
> +  *
> +  * This record is guaranteed to be the first record at checkpoint redo 
> lsn
> +  * because we are inserting this while holding the exclusive wal 
> insertion
> +  * lock.
> +  */
> + if (!shutdown)
> + {
> + int dummy = 0;
> +
> + XLogBeginInsert();
> + XLogRegisterData((char *) &dummy, sizeof(dummy));
> + recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
> + }

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we get rid of that, we could determine the redo location based on the
LSN determined by the XLogInsert().

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

Greetings,

Andres Freund




Re: WAL Insertion Lock Improvements

2023-07-13 Thread Andres Freund
Hi,

On 2023-07-13 14:04:31 -0700, Andres Freund wrote:
> From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy 
> Date: Fri, 19 May 2023 15:00:21 +
> Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release
>
> This commit optimizes WAL insertion lock acquisition and release
> in the following way:

I think this commit does too many things at once.


> 1. WAL insertion lock's variable insertingAt is currently read and
> written with the help of lwlock's wait list lock to avoid
> torn-free reads/writes. This wait list lock can become a point of
> contention on a highly concurrent write workloads. Therefore, make
> insertingAt a 64-bit atomic which inherently provides torn-free
> reads/writes.

"inherently" is a bit strong, given that we emulate 64bit atomics where not
available...


> 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when
> there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when
> there are no waiters to avoid unnecessary locking.

I don't think there's enough of an explanation for why this isn't racy.

The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar()
will do so twice in a row, with a barrier in between. But that really relies on
what I explain in the paragraph below:



> It also adds notes on why LWLockConflictsWithVar doesn't need a
> memory barrier as far as its current usage is concerned.

I don't think:
 * NB: LWLockConflictsWithVar (which is called from
 * LWLockWaitForVar) relies on the spinlock used above 
in this
 * function and doesn't use a memory barrier.

helps to understand why any of this is safe to a meaningful degree.


The existing comments obviously aren't sufficient to explain this, but
the reason it's, I think, safe today, is that we are only waiting for
insertions that started before WaitXLogInsertionsToFinish() was called.  The
lack of memory barriers in the loop means that we might see locks as "unused"
that have *since* become used, which is fine, because they only can be for
later insertions that we wouldn't want to wait on anyway.

Not taking a lock to acquire the current insertingAt value means that we might
see an older insertingAt value. Which should also be fine, because if we read a
too old value, we'll add ourselves to the queue, which contains atomic
operations.



> diff --git a/src/backend/storage/lmgr/lwlock.c 
> b/src/backend/storage/lmgr/lwlock.c
> index 59347ab951..82266e6897 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -1547,9 +1547,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
>   * *result is set to true if the lock was free, and false otherwise.
>   */
>  static bool
> -LWLockConflictsWithVar(LWLock *lock,
> -uint64 *valptr, uint64 oldval, 
> uint64 *newval,
> -bool *result)
> +LWLockConflictsWithVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 oldval,
> +uint64 *newval, bool *result)
>  {
>   boolmustwait;
>   uint64  value;
> @@ -1572,13 +1571,11 @@ LWLockConflictsWithVar(LWLock *lock,
>   *result = false;
>
>   /*
> -  * Read value using the lwlock's wait list lock, as we can't generally
> -  * rely on atomic 64 bit reads/stores.  TODO: On platforms with a way to
> -  * do atomic 64 bit reads/writes the spinlock should be optimized away.
> +  * Reading the value atomically ensures that we don't need any explicit
> +  * locking. Note that in general, 64 bit atomic APIs in postgres 
> inherently
> +  * provide explicit locking for the platforms without atomics support.
>*/

This comment seems off to me. Using atomics doesn't guarantee not needing
locking. It just guarantees that we are reading a non-torn value.



> @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock,
>   *
>   * Note: this function ignores shared lock holders; if the lock is held
>   * in shared mode, returns 'true'.
> + *
> + * Be careful that LWLockConflictsWithVar() does not include a memory 
> barrier,
> + * hence the caller of this function may want to rely on an explicit barrier 
> or
> + * a spinlock to avoid memory ordering issues.
>   */

s/careful/aware/?

s/spinlock/implied barrier via spinlock or lwlock/?


Greetings,

Andres




Re: WAL Insertion Lock Improvements

2023-07-13 Thread Andres Freund
On 2023-07-11 09:20:45 +0900, Michael Paquier wrote:
> On Mon, Jun 05, 2023 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> > On Wed, May 31, 2023 at 5:05 PM Michael Paquier  wrote:
> >> @Andres: Were there any extra tests you wanted to be run for more
> >> input?
> > 
> > @Andres Freund please let us know your thoughts.
> 
> Err, ping.  It seems like this thread is waiting on input from you,
> Andres?

Looking. Sorry for not getting to this earlier.

- Andres




Re: DROP DATABASE is interruptible

2023-07-13 Thread Andres Freund
Hi,

On 2023-07-12 11:54:18 +0200, Daniel Gustafsson wrote:
> > On 12 Jul 2023, at 03:59, Andres Freund  wrote:
> > On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote:
> >>> On 25 Jun 2023, at 19:03, Andres Freund  wrote:
> >>> On 2023-06-21 12:02:04 -0700, Andres Freund wrote:
> 
> >>> There don't need to be explict checks, because pg_upgrade will fail, 
> >>> because
> >>> it connects to every database. Obviously the error could be nicer, but it
> >>> seems ok for something hopefully very rare. I did add a test ensuring 
> >>> that the
> >>> behaviour is caught.
> >> 
> >> I don't see any pg_upgrade test in the patch?
> > 
> > Oops, I stashed them alongside some unrelated changes... Included this time.
> 
> Looking more at this I wonder if we in HEAD should make this a bit nicer by
> extending the --check phase to catch this?  I did a quick hack along these
> lines in the 0003 commit attached here (0001 and 0002 are your unchanged
> patches, just added for consistency and to be CFBot compatible).  If done it
> could be a separate commit to make the 0002 patch backport cleaner of course.

I don't really have an opinion on that, tbh...

> >> +  errhint("Use DROP DATABASE to drop invalid databases"));
> >> Should end with a period as a complete sentence?
> > 
> > I get confused about this every time. It's not helped by this example in
> > sources.sgml:
> > 
> > 
> > Primary: could not create shared memory segment: %m
> > Detail: Failed syscall was shmget(key=%d, size=%u, 0%o).
> > Hint:   the addendum
> > 
> > 
> > Which notably does not use punctuation for the hint. But indeed, later we 
> > say:
> >   
> >Detail and hint messages: Use complete sentences, and end each with
> >a period.  Capitalize the first word of sentences.  Put two spaces after
> >the period if another sentence follows (for English text; might be
> >inappropriate in other languages).
> >   
> 
> That's not a very helpful example, and one which may give the wrong impression
> unless the entire page is read.  I've raised this with a small diff to improve
> it on -docs.

Thanks for doing that!


> > Updated patches attached.
> 
> This version of the patchset LGTM.

Backpatching indeed was no fun. Not having BackgroundPsql.pm was the worst
part. But also a lot of other conflicts in tests... Took me 5-6 hours or
so.
But I now finally pushed the fixes. Hope the buildfarm agrees with it...

Thanks for the review!

Greetings,

Andres Freund




Re: vac_truncate_clog()'s bogus check leads to bogusness

2023-07-13 Thread Andres Freund
On 2023-06-22 22:29:12 -0700, Noah Misch wrote:
> On Thu, Jun 22, 2023 at 09:45:18AM -0700, Andres Freund wrote:
> > On 2023-06-21 21:50:39 -0700, Noah Misch wrote:
> > > On Wed, Jun 21, 2023 at 03:12:08PM -0700, Andres Freund wrote:
> > > > When vac_truncate_clog() returns early
> > > ...
> > > > we haven't released the lwlock that we acquired earlier
> > > 
> > > > Until there's some cause for the session to call LWLockReleaseAll(), 
> > > > the lock
> > > > is held. Until then neither the process holding the lock, nor any other
> > > > process, can finish vacuuming.  We don't even have an assert against a
> > > > self-deadlock with an already held lock, oddly enough.
> > > 
> > > I agree with this finding.  Would you like to add the lwlock releases, or
> > > would you like me to?
> > 
> > Happy with either.  I do have code and testcase, so I guess it would make
> > sense for me to do it?
> 
> Sounds good.  Thanks.

Done.




Re: Meson build updates

2023-07-12 Thread Andres Freund
Hi,

On 2023-06-29 13:34:42 -0500, Tristan Partin wrote:
> On Thu Jun 29, 2023 at 12:35 PM CDT, Andres Freund wrote:
> > Hm. One minor disadvantage of this is that if no c++ compiler was found, you
> > can't really see anything about llvm in the output, nor in 
> > meson-log.txt,
> > making it somewhat hard to figure out why llvm was disabled.
> >
> > I think something like
> >
> > elif llvmopt.auto()
> >   message('llvm requires a C++ compiler')
> > endif
> >
> > Should do the trick?
> 
> Your patch looks great to me.
> 
> > > v5-0011-Pass-feature-option-through-to-required-kwarg.patch
> >
> > I'm a bit confused how it ended how it's looking like it is right now, but
> > ... :)
> >
> > I'm thinking of merging 0011 and relevant parts of 0012 and 0015 into one.
> 
> Your patch looks great to me.

Pushed these. Thanks!

Greetings,

Andres Freund




Re: Remove distprep

2023-07-12 Thread Andres Freund
Hi,

On 2023-06-09 11:10:14 +0200, Peter Eisentraut wrote:
> Per discussion at the unconference[0], I started to write a patch that
> removes the make distprep target.  A more detailed explanation of the
> rationale is also in the patch.

Thanks for tackling this!


> It needs some polishing around the edges, but I wanted to put it out here to
> get things moving and avoid duplicate work.

It'd be nice if we could get this in soon, so we could move ahead with the
src/tools/msvc removal.


> One thing in particular that isn't clear right now is how "make world"
> should behave if the documentation tools are not found.  Maybe we should
> make a build option, like "--with-docs", to mirror the meson behavior.

Isn't that somewhat unrelated to distprep?  I see that you removed "missing",
but I don't really understand why as part of this commit?


> -# If there are any files in the source directory that we also generate in the
> -# build directory, they might get preferred over the newly generated files,
> -# e.g. because of a #include "file", which always will search in the current
> -# directory first.
> -message('checking for file conflicts between source and build directory')

You're thinking this can be removed because distclean is now reliable? There
were some pretty annoying to debug issues early on, where people switched from
an in-tree autoconf build to meson, with some files left over from the source
build, causing problems at a *later* time (when the files should have changed,
but the wrong ones were picked up).  That's not really related to distprep etc,
so I'd change this separately, if we want to change it.


> diff --git a/src/tools/pginclude/cpluspluscheck 
> b/src/tools/pginclude/cpluspluscheck
> index 4e09c4686b..287395887c 100755
> --- a/src/tools/pginclude/cpluspluscheck
> +++ b/src/tools/pginclude/cpluspluscheck
> @@ -134,6 +134,9 @@ do
>   test "$f" = src/interfaces/ecpg/preproc/preproc.h && continue
>   test "$f" = src/test/isolation/specparse.h && continue
>
> + # FIXME
> + test "$f" = src/backend/utils/adt/jsonpath_internal.h && continue
> +
>   # ppport.h is not under our control, so we can't make it standalone.
>   test "$f" = src/pl/plperl/ppport.h && continue

Hm, what's that about?

We really ought to replace these scripts with something better, which
understands concurrency...

Greetings,

Andres Freund




Re: Forgive trailing semicolons inside of config files

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 12:21:46 -0400, Greg Sabino Mullane wrote:
> On Tue, Jul 11, 2023 at 11:04 AM Isaac Morland 
> > Or maybe there could be a "check configuration" subcommand which checks
> > the configuration.
> >
>
> There are things inside of Postgres once it has started, but yeah,
> something akin to visudo would be nice for editing config files.

You can also do it kind-of-reasonably with the server binary, like:
PGDATA=/srv/dev/pgdev-dev /path/to/postgres -C server_version; echo $?


> > I'd be more interested in improvements in visibility of errors. For
> > example, maybe if I try to start the server and there is a config file
> > problem, I could somehow get a straightforward error message right in the
> > terminal window complaining about the line of the configuration which is
> > wrong.
> >
>
> That ship has long since sailed. We already send a detailed error message
> with the line number, but in today's world of "service start", "systemctl
> start", and higher level of control such as Patroni and Kubernetes, getting
> things to show in a terminal window isn't happening. We can't work around
> 2>&1.

At least with debian's infrastructure, both systemctl start and reload show
errors reasonably well:

start with broken config:
Jul 11 19:13:40 awork3 systemd[1]: Starting postgresql@15-test.service - PostgreSQL Cluster 15-test...
Jul 11 19:13:40 awork3 postgresql@15-test[3217452]: Error: invalid line 3 in /var/lib/postgresql/15/test/postgresql.auto.conf: dd
Jul 11 19:13:40 awork3 systemd[1]: postgresql@15-test.service: Can't open PID file /run/postgresql/15-test.pid (yet?) after start: No such file or directory


reload with broken config:
Jul 11 19:10:38 awork3 systemd[1]: Reloading postgresql@15-test.service - PostgreSQL Cluster 15-test...
Jul 11 19:10:38 awork3 postgresql@15-test[3217175]: Error: invalid line 3 in /var/lib/postgresql/15/test/postgresql.auto.conf: dd
Jul 11 19:10:38 awork3 systemd[1]: postgresql@15-test.service: Control process exited, code=exited, status=1/FAILURE
Jul 11 19:10:38 awork3 systemd[1]: Reload failed for postgresql@15-test.service - PostgreSQL Cluster 15-test.

However: It looks like that's all implemented in debian specific tooling,
rather than PG itself. Oops.


Looks like we could make this easier in core postgres by adding one more
sd_notify() call, with something like
STATUS=reload failed due to syntax error in file "/srv/dev/pgdev-dev/postgresql.conf" line 821, near end of line

Greetings,

Andres Freund




Re: DROP DATABASE is interruptible

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote:
> > On 25 Jun 2023, at 19:03, Andres Freund  wrote:
> > On 2023-06-21 12:02:04 -0700, Andres Freund wrote:
> >> I'm hacking on this bugfix again,
>
> This patch LGTM from reading through and testing (manually and with your
> supplied tests in the patch), I think this is a sound approach to deal with
> this.

Thanks!


> >> I haven't yet added checks to pg_upgrade, even though that's clearly
> >> needed. I'm waffling a bit between erroring out and just ignoring the
> >> database? pg_upgrade already fails when datallowconn is set "wrongly", see
> >> check_proper_datallowconn().  Any opinions?
> >
> > There don't need to be explict checks, because pg_upgrade will fail, because
> > it connects to every database. Obviously the error could be nicer, but it
> > seems ok for something hopefully very rare. I did add a test ensuring that 
> > the
> > behaviour is caught.
>
> I don't see any pg_upgrade test in the patch?

Oops, I stashed them alongside some unrelated changes... Included this time.



> > It's somewhat odd that pg_upgrade prints errors on stdout...
>
> There are many odd things about pg_upgrade logging, updating it to use the
> common logging framework of other utils would be nice.

Indeed.


> >> I'm not sure what should be done for psql. It's probably not a good idea to
> >> change tab completion, that'd just make it appear the database is gone. 
> >> But \l
> >> could probably show dropped databases more prominently?
> >
> > I have not done that. I wonder if this is something that should be done in 
> > the
> > back branches?
>
> Possibly, I'm not sure where we usually stand on changing the output format of
> \ commands in psql in minor revisions.

I'd normally be quite careful, people do script psql.

While breaking things when encountering an invalid database doesn't actually
sound like a bad thing, I don't think it fits into any of the existing columns
that psql outputs for \l.


> A few small comments on the patch:
>
> + * Max connections allowed (-1=no limit, -2=invalid database). A database
> + * is set to invalid partway through eing dropped.  Using datconnlimit=-2
> + * for this purpose isn't particularly clean, but is backpatchable.
> Typo: s/eing/being/.

Fixed.


> A limit of -1 makes sense, but the meaning of -2 is less intuitive IMO.
> Would it make sense to add a #define with a more descriptive name to save
> folks reading this having to grep around and figure out what -2 means?

I went back and forth about this one. We don't use defines for such things in
all the frontend code today, so the majority of places won't be improved by
adding this. I added them now, which required touching a few otherwise
untouched places, but not too bad.
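
For reference, the defines could look something like this (the names here are
just the obvious spelling, not necessarily what ended up being committed):

/* special values for pg_database.datconnlimit */
#define DATCONNLIMIT_UNLIMITED		-1	/* no limit on connections */
#define DATCONNLIMIT_INVALID_DB		-2	/* database is invalid, i.e. being
										 * (or partially) dropped */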




> + errhint("Use DROP DATABASE to drop invalid databases"));
> Should end with a period as a complete sentence?

I get confused about this every time. It's not helped by this example in
sources.sgml:


Primary: could not create shared memory segment: %m
Detail: Failed syscall was shmget(key=%d, size=%u, 0%o).
Hint:   the addendum


Which notably does not use punctuation for the hint. But indeed, later we say:
   
Detail and hint messages: Use complete sentences, and end each with
a period.  Capitalize the first word of sentences.  Put two spaces after
the period if another sentence follows (for English text; might be
inappropriate in other languages).
   


> + errmsg("cannot alter invalid database \"%s\"", stmt->dbname),
> + errdetail("Use DROP DATABASE to drop invalid databases"));
> Shouldn't this be an errhint() instead? Also ending with a period.

Yep.


> + if (database_is_invalid_form((Form_pg_database) dbform))
> + continue;
> Would it make sense to stick a DEBUG2 log entry in there to signal that such a
> database exist? (The same would apply for the similar hunk in autovacuum.c.)

I don't really have an opinion on it. Added.

elog(DEBUG2,
     "skipping invalid database \"%s\" while computing relfrozenxid",
     NameStr(dbform->datname));
and
elog(DEBUG2,
     "autovacuum: skipping invalid database \"%s\"",
     NameStr(pgdatabase->datname));


Updated patches attached.


Not looking forward to fixing all the conflicts.


Does anybody have an opinion about whether we should add a dedicated field to
pg_database for representing invali

Re: Support to define custom wait events for extensions

2023-07-11 Thread Andres Freund
ents/t/001_basic.pl
> @@ -0,0 +1,34 @@
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
> +
> +my $node = PostgreSQL::Test::Cluster->new('main');
> +
> +$node->init;
> +$node->append_conf(
> + 'postgresql.conf',
> + "shared_preload_libraries = 'test_custom_wait_events'"
> +);
> +$node->start;

I think this should also test registering a wait event later.


> @@ -0,0 +1,188 @@
> +/*--
> + *
> + * test_custom_wait_events.c
> + *   Code for testing custom wait events
> + *
> + * This code initializes a custom wait event in shmem_request_hook() and
> + * provide a function to launch a background worker waiting forever
> + * with the custom wait event.

Isn't this vast overkill?  Why not just have a simple C function that waits
with a custom wait event until cancelled? That'd be maybe 1/10th of the LOC.
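
Something like this minimal sketch would do - the registration call stands in
for whatever API the patch ends up providing, and the function name is made
up:

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

PG_FUNCTION_INFO_V1(wait_custom_wait_event);

Datum
wait_custom_wait_event(PG_FUNCTION_ARGS)
{
	/* assumed registration API from the patch under discussion */
	uint32		wait_event_info = WaitEventExtensionNew("TestCustomWaitEvent");

	for (;;)
	{
		/* query cancel / backend termination ends the wait */
		CHECK_FOR_INTERRUPTS();

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L, wait_event_info);
		ResetLatch(MyLatch);
	}

	PG_RETURN_VOID();			/* not reached */
}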


Greetings,

Andres Freund




Re: Cleaning up threading code

2023-07-11 Thread Andres Freund
On 2023-07-12 08:58:29 +1200, Thomas Munro wrote:
> On Mon, Jul 10, 2023 at 10:45 AM Thomas Munro  wrote:
> > * defined ENABLE_THREAD_SAFETY 1 in c.h, for the benefit of extensions
> 
> I may lack imagination but I couldn't think of any use for that
> vestigial macro in backend/extension code, and client code doesn't see
> c.h and might not get the right answer anyway if it's dynamically
> linked which is the usual case.  I took it out for now.  Open to
> discussing further if someone can show what kinds of realistic
> external code would be affected.

WFM.




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):
>
> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
>
> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the previous
> commit of 00d1e02be2), I got a better numbers, it didn't have a better
> scalability, though:
>
> PG16: nclients = 1, execution time = 17.444

I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.

1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY FROM

This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?
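
Something along these lines would avoid the per-row allocation (illustrative
pattern only - the struct and field names are made up, not the actual
CopyFromState members):

typedef struct CopyScratch
{
	int			natts;
	bool	   *defaults;		/* one flag per attribute, reused per row */
} CopyScratch;

static void
copy_scratch_init(CopyScratch *scratch, int natts)
{
	/* allocate once per COPY statement */
	scratch->natts = natts;
	scratch->defaults = (bool *) palloc0(natts * sizeof(bool));
}

static void
copy_scratch_reset_for_row(CopyScratch *scratch)
{
	/* cheap reset instead of a palloc()/pfree() pair for every row */
	memset(scratch->defaults, 0, scratch->natts * sizeof(bool));
}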


2) pg_strtoint32_safe() got substantially slower, mainly due to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals

pinned to one cpu, turbo mode disabled, I get the following best-of-three times for
  copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)

15:
6281.107 ms

HEAD:
7000.469 ms

backing out 9f8377f7a27:
6433.516 ms

also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms


I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().



For higher concurrency numbers, I now was able to reproduce the regression, to
a smaller degree. Much smaller after fixing the above. The reason we run into
the issue here is basically that the rows in the test are very narrow and reach

#define MAX_BUFFERED_TUPLES 1000

at a small number of pages, so we go back and forth between extending with
fallocate() and not.

I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):

(this is again with turboboost disabled, to get more reproducible numbers)

clients                        1      2      4      8     16     32

15,buffered=1000           25725  13211   9232   5639   4862   4700
15,buffered=5000           26107  14550   8644   6050   4943   4766
HEAD+fixes,buffered=1000   25875  14505   8200   4900   3565   3433
HEAD+fixes,buffered=5000   25830  12975   6527   3594   2739   2642

Greetings,

Andres Freund

[1] 
https://postgr.es/m/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw%2B4hm%3D2EOFXbJQ3tOg%40mail.gmail.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
> On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > > Out of curiosity I've tried and it is reproducible as you have stated : 
> > > XFS
> > > @ 4.18.0-425.10.1.el8_7.x86_64:
> > >...
> > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > > loop of sync writes (D) by kworker/2:1H-kblockd:
> >
> > That clearly won't go well.  It's not reproducible on newer systems,
> > unfortunately :(. Or well, fortunately maybe.
> >
> >
> > I wonder if a trick to avoid this could be to memorialize the fact that we
> > bulk extended before and extend by that much going forward? That'd avoid the
> > swapping back and forth.
>
> I haven't seen this thread [1] "Question on slow fallocate", from XFS
> mailing list being mentioned here (it was started by Masahiko), but I
> do feel it contains very important hints even challenging the whole
> idea of zeroing out files (or posix_fallocate()). Please especially
> see Dave's reply.

I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.


> He also argues that posix_fallocate() != fallocate().  What's interesting is
> that it's by design and newer kernel versions should not prevent such
> behaviour, see my testing result below.

I think the problem there was that I was not targetting a different file
between the different runs, somehow assuming the test program would be taking
care of that.

I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.



> All I can add is that this those kernel versions (4.18.0) seem to very
> popular across customers (RHEL, Rocky) right now and that I've tested
> on most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
> Masahiko test.c and still got 6-7x slower time when using XFS on that
> kernel. After installing kernel-ml (6.4.2) the test.c result seems to
> be the same (note it it occurs only when 1st allocating space, but of
> course it doesnt if the same file is rewritten/"reallocated"):

test.c really doesn't reproduce postgres behaviour in any meaningful way,
using it as a benchmark is a bad idea.

Greetings,

Andres Freund




Re: Refactoring backend fork+exec code

2023-07-10 Thread Andres Freund
nds = {
  [PMC_AV_LAUNCHER] = {"autovacuum launcher", AutoVacLauncherMain, true},
  ...
}

or such should work.


I'd also use designated initializers for the fields, it's otherwise hard to
know what true means etc.

I think it might be good to put more into array. If we e.g. knew whether a
particular child type is a backend-like, and aux process or syslogger, we
could avoid the duplicated InitAuxiliaryProcess(),
MemoryContextDelete(PostmasterContext) etc calls everywhere.
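
Roughly something like the sketch below - the type and field names are
illustrative, assuming the patch's PMC_* enum and handler functions:

typedef enum PMChildKind
{
	PMCHILD_BACKEND_LIKE,
	PMCHILD_AUX_PROCESS,
	PMCHILD_SYSLOGGER,
} PMChildKind;

typedef struct PostmasterChildEntry
{
	const char *name;
	void		(*main_fn) (int argc, char *argv[]);
	PMChildKind kind;			/* drives shared setup/teardown */
	bool		shmem_attach;
} PostmasterChildEntry;

static const PostmasterChildEntry entry_kinds[] = {
	[PMC_AV_LAUNCHER] = {
		.name = "autovacuum launcher",
		.main_fn = AutoVacLauncherMain,
		.kind = PMCHILD_BACKEND_LIKE,
		.shmem_attach = true,
	},
	/* ... */
};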


> +/*
> + * SubPostmasterMain -- Get the fork/exec'd process into a state equivalent
> + *   to what it would be if we'd simply forked on Unix, and 
> then
> + *   dispatch to the appropriate place.
> + *
> + * The first two command line arguments are expected to be "--forkFOO"
> + * (where FOO indicates which postmaster child we are to become), and
> + * the name of a variables file that we can read to load data that would
> + * have been inherited by fork() on Unix.  Remaining arguments go to the
> + * subprocess FooMain() routine. XXX
> + */
> +void
> +SubPostmasterMain(int argc, char *argv[])
> +{
> + PostmasterChildType child_type;
> + char   *startup_data;
> + size_t  startup_data_len;
> +
> + /* In EXEC_BACKEND case we will not have inherited these settings */
> + IsPostmasterEnvironment = true;
> + whereToSendOutput = DestNone;
> +
> + /* Setup essential subsystems (to ensure elog() behaves sanely) */
> + InitializeGUCOptions();
> +
> + /* Check we got appropriate args */
> + if (argc < 3)
> + elog(FATAL, "invalid subpostmaster invocation");
> +
> + if (strncmp(argv[1], "--forkchild=", 12) == 0)
> + {
> + char   *entry_name = argv[1] + 12;
> + boolfound = false;
> +
> + for (int idx = 0; idx < lengthof(entry_kinds); idx++)
> + {
> + if (strcmp(entry_kinds[idx].name, entry_name) == 0)
> + {
> + child_type = idx;
> + found = true;
> + break;
> + }
> + }
> + if (!found)
> + elog(ERROR, "unknown child kind %s", entry_name);
> + }


Hm, shouldn't we error out when called without --forkchild?


> +/* Save critical backend variables into the BackendParameters struct */
> +#ifndef WIN32
> +static bool
> +save_backend_variables(BackendParameters *param, ClientSocket *client_sock)
> +#else

There's so much of this kind of thing. Could we hide it in a struct or such
instead of needing ifdefs everywhere?



> --- a/src/backend/storage/ipc/shmem.c
> +++ b/src/backend/storage/ipc/shmem.c
> @@ -144,6 +144,8 @@ InitShmemAllocation(void)
>   /*
>* Initialize ShmemVariableCache for transaction manager. (This doesn't
>* really belong here, but not worth moving.)
> +  *
> +  * XXX: we really should move this
>*/
>   ShmemVariableCache = (VariableCache)
>   ShmemAlloc(sizeof(*ShmemVariableCache));

Heh. Indeed. And probably just rename it to something less insane.


Greetings,

Andres Freund




Re: ResourceOwner refactoring

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-10 12:14:46 -0700, Andres Freund wrote:
> >  /*
> > - * Initially allocated size of a ResourceArray.  Must be power of two since
> > - * we'll use (arraysize - 1) as mask for hashing.
> > + * Size of the fixed-size array to hold most-recently remembered resources.
> >   */
> > -#define RESARRAY_INIT_SIZE 16
> > +#define RESOWNER_ARRAY_SIZE 32
>
> That's 512 bytes, pretty large to be searched sequentially. It's somewhat sad
> that the array needs to include 8 bytes of ResourceOwnerFuncs...

On that note:

It's perhaps worth noting that this change actually *increases* the size of
ResourceOwnerData. Previously:

/* size: 576, cachelines: 9, members: 19 */
/* sum members: 572, holes: 1, sum holes: 4 */

now:
/* size: 696, cachelines: 11, members: 13 */
/* sum members: 693, holes: 1, sum holes: 3 */

It won't make a difference memory-usage wise, given aset.c's rounding
behaviour. But it does mean that more memory needs to be zeroed, as
ResourceOwnerCreate() uses MemoryContextAllocZero.

As there's really no need to initialize the long ->arr and ->locks arrays, it
seems worth just initializing the other fields explicitly instead of using
MemoryContextAllocZero().


With the new representation, is there any point in having ->locks? We still
need ->nlocks to be able to switch to the lossy behaviour, but there doesn't
seem to be much point in having the separate array.

Greetings,

Andres Freund




Re: ResourceOwner refactoring

2023-07-10 Thread Andres Freund
  owner->releasing = true;

Is it a problem that a possible error would lead to ->releasing = true being
"leaked"?  I think it might be, because we haven't called ResourceOwnerSort(),
which means we'd not process resources in the right order during abort
processing.


>

> +/*
> + * Hash function for value+kind combination.
> + */
>  static inline uint32
>  hash_resource_elem(Datum value, ResourceOwnerFuncs *kind)
>  {
> - Datum   data[2];
> -
> - data[0] = value;
> - data[1] = PointerGetDatum(kind);
> -
> - return hash_bytes((unsigned char *) &data, 2 * SIZEOF_DATUM);
> + /*
> +  * Most resource kinds store a pointer in 'value', and pointers are 
> unique
> +  * all on their own.  But some resources store plain integers (Files and
> +  * Buffers as of this writing), so we want to incorporate the 'kind' in
> +  * the hash too, otherwise those resources will collide a lot.  But
> +  * because there are only a few resource kinds like that - and only a 
> few
> +  * resource kinds to begin with - we don't need to work too hard to mix
> +  * 'kind' into the hash.  Just add it with hash_combine(), it perturbs 
> the
> +  * result enough for our purposes.
> +  */
> +#if SIZEOF_DATUM == 8
> + return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
> +#else
> + return hash_combine(murmurhash32((uint32) value), (uint32) kind);
> +#endif
>  }

Why are we using a 64bit hash to then just throw out the high bits?


Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> Out of curiosity I've tried and it is reproducible as you have stated : XFS
> @ 4.18.0-425.10.1.el8_7.x86_64:
>...
> According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> the XFS issues sync writes while ext4 does not, xfs looks like constant
> loop of sync writes (D) by kworker/2:1H-kblockd:

That clearly won't go well.  It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.


I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.
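
Roughly along these lines (a hypothetical sketch - the struct and field are
made up, not actual PostgreSQL code):

typedef struct ExtendHistory
{
	BlockNumber last_extend_by;
} ExtendHistory;

static BlockNumber
choose_extend_by(ExtendHistory *hist, BlockNumber needed)
{
	/* never extend by less than we bulk-extended last time */
	BlockNumber extend_by = Max(needed, hist->last_extend_by);

	hist->last_extend_by = extend_by;
	return extend_by;
}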


One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.

I don't yet have a RHEL8 system at hand, could either of you send the result
of a
  strace -c -p $pid_of_backend_doing_copy

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
> On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  wrote:
> >
> > After further investigation, the performance degradation comes from
> > calling posix_fallocate() (called via FileFallocate()) and pwritev()
> > (called via FileZero) alternatively depending on how many blocks we
> > extend by. And it happens only on the xfs filesystem.
>
> FYI, the attached simple C program proves the fact that calling
> alternatively posix_fallocate() and pwrite() causes slow performance
> on posix_fallocate():
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total   20
> fallocate   20
> filewrite   0
>
> real0m1.305s
> user0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total   20
> fallocate   10
> filewrite   10
>
> real1m29.222s
> user0m0.139s
> sys 0m3.139s

On an xfs filesystem, with a very recent kernel:

time /tmp/msw_test /srv/dev/fio/msw 0
total   20
fallocate   0
filewrite   20

real0m0.456s
user0m0.017s
sys 0m0.439s


time /tmp/msw_test /srv/dev/fio/msw 1
total   20
fallocate   20
filewrite   0

real0m0.141s
user0m0.010s
sys 0m0.131s


time /tmp/msw_test /srv/dev/fio/msw 2
total   20
fallocate   10
filewrite   10

real0m0.297s
user0m0.017s
sys 0m0.280s


So I don't think I can reproduce your problem on that system...

I also tried adding a fdatasync() into the loop, but that just made things
uniformly slow.


I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):

Gah, RHEL with its frankenkernels, the bane of my existence.

FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote:
> On 2023-Jul-03, Masahiko Sawada wrote:
> 
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> 
> This item came up in the RMT meeting.  Andres, I think this item belongs
> to you, because of commit 00d1e02be2.

I'll take a look - I wasn't even aware of this thread until now.

Greetings,

Andres Freund




Re: Autogenerate some wait events code and documentation

2023-07-06 Thread Andres Freund
Hi,

On 2023-07-06 09:36:12 +0900, Michael Paquier wrote:
> On Wed, Jul 05, 2023 at 02:59:39PM -0700, Andres Freund wrote:
> > Rebasing a patch over this I was a bit confused because I got a bunch of
> > ""unable to parse wait_event_names.txt" errors. Took me a while to figure 
> > out
> > that that was just because I didn't include a trailing . in the description.
> > Perhaps that could be turned into a more meaningful error?
> > 
> > die "unable to parse wait_event_names.txt"
> >   unless $line =~ /^(\w+)\t+(\w+)\t+("\w+")\t+("\w.*\.")$/;
> 
> Agreed that we could at least add the $line in the error message
> generated, at least, to help with debugging.
> 
> > I also do wonder if we should invest in generating the lwlock names as
> > well. Except for a few abbreviations, the second column is always the
> > camel-cased version of what follows WAIT_EVENT_. Feels pretty tedious to 
> > write
> > that out.
> 
> And you mean getting rid of lwlocknames.txt?

No, I meant the second column in wait_event_names.txt. If you look at stuff
like:
WAIT_EVENT_ARCHIVER_MAIN"ArchiverMain"  "Waiting in main loop of 
archiver process."

It'd be pretty trivial to generate ArchiverMain from ARCHIVER_MAIN.



> The impact on dtrace or other similar tools is uncertain to me because we
> have free number on this list, and stuff like GetLWLockIdentifier() rely on
> the input ID from lwlocknames.txt.

I don't really care, tbh. If we wanted to keep the names the same in case of
abbreviations, we could just make the name optional, and auto-generated if not
explicitly specified.



> > Perhaps we should just change the case of the upper-cased names (DSM, SSL,
> > WAL, ...) to follow the other names?
> 
> So you mean renaming the existing events like WalSenderWaitForWAL to
> WalSenderWaitForWal?

Yes.


> The impact on existing monitoring queries is not zero because any changes
> would be silent, and that's the part that worried me the most even if it can
> remove one column in the txt file.

Then let's just use - or so to indicate the inferred name, with a "string"
overriding it?
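
For example, something like this in wait_event_names.txt (format illustrative
only), where "-" means "derive the name from the enum value" and an explicit
string overrides it:

WAIT_EVENT_ARCHIVER_MAIN	-	"Waiting in main loop of archiver process."
WAIT_EVENT_WAL_SENDER_WAIT_WAL	"WalSenderWaitForWAL"	"Waiting for WAL to be flushed in WAL sender process."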

Greetings,

Andres Freund



