Re: meson: Optionally disable installation of test modules

2023-02-25 Thread Andres Freund
Hi,

On 2023-02-22 10:09:10 +0100, Peter Eisentraut wrote:
> On 20.02.23 20:48, Andres Freund wrote:
> > On 2023-02-20 19:43:56 +0100, Peter Eisentraut wrote:
> > > I don't think any callers try to copy a directory source, so the
> > > shutil.copytree() stuff isn't necessary.
> > 
> > I'd like to use it for installing docs outside of the normal install
> > target. Of course we could add the ability at a later point, but that seems a
> > bit pointless back-and-forth to me.
> 
> I figured it could be useful as a general installation tool, but the current
> script has specific command-line options for this specific purpose, so I
> don't think it would work for your purpose anyway.
> 
> For the purpose here, we really just need something that does
> 
> for src in sys.argv[1:-1]:
>     shutil.copy2(src, sys.argv[-1])
> 
> But we need to call it twice for different sets of files and destinations,
> and since we can't have more than one command per test, we either need to
> write two "tests" or write a wrapper script like the one we have here.

How about making the arguments
  --install target-path list of files or directories
  --install another-path another set of files
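A minimal sketch of how a script could consume that argument shape — the script and its behavior are hypothetical, not the actual install script in the tree:

```python
import shutil
import sys
from pathlib import Path

def parse_install_groups(argv):
    """Split ['--install', dest, f1, ..., '--install', dest2, ...] into
    [(dest, [f1, ...]), (dest2, [...])]."""
    groups = []
    i = 0
    while i < len(argv):
        if argv[i] != '--install':
            raise SystemExit(f'unexpected argument: {argv[i]}')
        dest = argv[i + 1]
        i += 2
        files = []
        while i < len(argv) and argv[i] != '--install':
            files.append(argv[i])
            i += 1
        groups.append((dest, files))
    return groups

def install(argv):
    for dest, files in parse_install_groups(argv):
        Path(dest).mkdir(parents=True, exist_ok=True)
        for src in files:
            if Path(src).is_dir():
                # directory sources, as wanted for installing docs
                shutil.copytree(src, Path(dest) / Path(src).name,
                                dirs_exist_ok=True)
            else:
                shutil.copy2(src, dest)

if __name__ == '__main__':
    install(sys.argv[1:])
```

Each `--install` starts a new (destination, files...) group, so two sets of files with two destinations become a single invocation.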


> I don't know what the best way to slice this is, but it's not a lot of code
> that we couldn't move around again in the future.

That's true. The main work here is going through all the test modules, and
that won't be affected by changing the argument syntax.

Greetings,

Andres Freund




Re: windows/meson cfbot warnings

2023-02-25 Thread Andres Freund
Hi,

On 2022-12-28 17:44:47 -0800, Andres Freund wrote:
> On 2022-12-28 18:35:38 -0600, Justin Pryzby wrote:
> > Since a few days ago, the windows/meson task has been spewing messages for
> > each tap test:
> > 
> > | Unknown TAP version. The first line MUST be `TAP version <version>`. Assuming version 12.
> > 
> > I guess because the image is updated to use meson v1.0.0.
> > https://github.com/mesonbuild/meson/commit/b7a5c384a1f1ba80c09904e7ef4f5160bdae3345
> > 
> > mesonbuild/mtest.py-if version is None:
> > mesonbuild/mtest.py:self.warnings.append('Unknown TAP version. The first line MUST be `TAP version <version>`. Assuming version 12.')
> > mesonbuild/mtest.py-version = 12
> 
> I think the change is somewhat likely to be reverted in the next meson minor
> version. It apparently came about due to somewhat odd language in the tap-14
> spec...  So I think we should just wait for now.

FWIW, that did happen in 1.0.1, and the output is now clean again.

Greetings,

Andres Freund




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Andres Freund
Hi,

On 2023-02-23 13:56:56 -0500, Tom Lane wrote:
> Corey Huinker  writes:
> > My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> > FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> > necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
> > to test if the error happened.
> 
> Why do you want it in ExprEvalStep ... couldn't it be in ExprState?
> I can't see why you'd need more than one at a time during evaluation.

I don't know exactly what CAST( ... ON DEFAULT ... ) is aiming for - I guess
it wants to assign a different value when the cast fails?  Is the default
expression a constant, or does it need to be runtime evaluated?  If a const,
then the cast steps just could assign the new value. If runtime evaluation is
needed I'd expect the various coerce steps to jump to the value implementing
the default expression in case of a failure.

So I'm not sure we even need a reserror field in ExprState.

Greetings,

Andres Freund




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-23 Thread Andres Freund
Hi,

On 2023-02-23 13:39:14 -0500, Corey Huinker wrote:
> My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> necessitates adding a reserror boolean to ExprEvalStep for subsequent steps
> to test if the error happened.

I think if that requires adding a new variable to each ExprEvalStep, it's
DOA. The overhead would be too high. But I don't see why it would need to be
added to all ExprEvalSteps instead of individual steps, or perhaps ExprEvalState?


> Will that change be throwing some architectures over the 64 byte count?

It would.

I find the 'pahole' tool very useful for looking at struct layout.


struct ExprEvalStep {
        intptr_t                   opcode;               /*     0     8 */
        Datum *                    resvalue;             /*     8     8 */
        _Bool *                    resnull;              /*    16     8 */
        union {
                struct {
                        int        last_var;             /*    24     4 */
                        _Bool      fixed;                /*    28     1 */

                        /* XXX 3 bytes hole, try to pack */

                        TupleDesc  known_desc;           /*    32     8 */
                        const TupleTableSlotOps  * kind; /*    40     8 */
                } fetch;                                 /*    24    24 */
                ...
                struct {
                        Datum *    values;               /*    24     8 */
                        _Bool *    nulls;                /*    32     8 */
                        int        nelems;               /*    40     4 */
                        MinMaxOp   op;                   /*    44     4 */
                        FmgrInfo * finfo;                /*    48     8 */
                        FunctionCallInfo fcinfo_data;    /*    56     8 */
                } minmax;                                /*    24    40 */
                ...

        } d;                                             /*    24    40 */

        /* size: 64, cachelines: 1, members: 4 */
};


We don't have memory to spare in the "general" portion of ExprEvalStep
(currently 24 bytes), as several of the type-specific portions are already 40
bytes large.

Greetings,

Andres Freund




Re: buildfarm + meson

2023-02-23 Thread Andres Freund
On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote:
> 
> On 2023-02-22 We 20:20, Andres Freund wrote:
> > 
> > > There is work to do to make sure we pick up the right log files, and maybe
> > > adjust a module or two. I have adopted a design where instead of trying to
> > > know a lot about the testing regime the client needs to know a lot less.
> > > Instead, it gets meson to tell it the set of tests. I will probably work on
> > > enabling some sort of filter, but I think this makes things more
> > > future-proof. I have stuck with the design of making testing fairly
> > > fine-grained, so each suite runs separately.
> > I don't understand why you'd want to run each suite separately. Serially
> > executing the tests takes way longer than doing so in parallel. Why would we
> > want to enforce that?
> > 
> > Particularly because with meson the tests' log files and the failed tests can
> > directly be correlated? And it should be easy to figure out which log files
> > need to be kept, you can just skip the directories in testrun/ that contain
> > test.success.
> > 
> 
> We can revisit that later. For now I'm more concerned with getting a working
> setup.

My fear is that this ends up being entrenched in the design and hard to change
later.


> The requirements of the buildfarm are a bit different from those of a
> developer, though. Running things in parallel can make things faster, but
> that can also increase the compute load.

Sure, I'm not advocating using [high] concurrency by default.


> Also, running things serially makes it easier to report a failure stage that
> pinpoints the test that encountered an issue.

You're relying on running tests in a specific order. Instead you can also just
run tests in parallel and check test status in order and report the first
failed test in that order.
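That reporting scheme can be sketched in a few lines — data shapes here are hypothetical, not the buildfarm client's:

```python
# Hypothetical sketch: run all suites concurrently, then report the
# first failure in the suite's declared order, regardless of the order
# in which tests actually finished.
def first_failed(ordered_suites, results):
    """results maps suite name -> True (passed) / False (failed)."""
    for suite in ordered_suites:
        if not results.get(suite, False):
            return suite
    return None
```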


> But like I say we can come
> back to this.

> 
> > > On a Windows instance, fairly similar to what's running drongo, I can get a
> > > successful build with meson+VS2019, but I'm getting an error in the
> > > regression tests, which don't like setting lc_time to 'de_DE'. Not sure
> > > what's going on there.
> > Huh, that's odd.
> 
> 
> See my reply to Michael for details

I suspect the issue might be related to this:

+   local %ENV = (PATH => $ENV{PATH}, PGUSER => $ENV{PGUSER});
+   @makeout=run_log("meson test --logbase checklog --print-errorlogs --no-rebuild -C $pgsql --suite setup --suite regress");

Greetings,

Andres Freund




Re: buildfarm + meson

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 18:23:44 -0500, Andrew Dunstan wrote:
> Here's a progress report on adapting the buildfarm client to meson
> 
> There is a development branch where I'm working on the changes. They can be
> seen here:
> 
> <https://github.com/PGBuildFarm/client-code/compare/main...dev/meson>
> 
> On my Linux box (Fedora 37, where crake runs) I can get a complete run.

Nice!


> There is work to do to make sure we pick up the right log files, and maybe
> adjust a module or two. I have adopted a design where instead of trying to
> know a lot about the testing regime the client needs to know a lot less.
> Instead, it gets meson to tell it the set of tests. I will probably work on
> enabling some sort of filter, but I think this makes things more
> future-proof. I have stuck with the design of making testing fairly
> fine-grained, so each suite runs separately.

I don't understand why you'd want to run each suite separately. Serially
executing the tests takes way longer than doing so in parallel. Why would we
want to enforce that?

Particularly because with meson the tests' log files and the failed tests can
directly be correlated? And it should be easy to figure out which log files
need to be kept, you can just skip the directories in testrun/ that contain
test.success.
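That pruning step could look roughly like this, assuming the testrun/<test>/test.success marker layout described above (helper name is illustrative):

```python
# Sketch: keep only the log directories of failed tests, assuming meson
# leaves a test.success marker in each passing test's directory.
import shutil
from pathlib import Path

def prune_successful(testrun: Path) -> None:
    for d in testrun.iterdir():
        if d.is_dir() and (d / 'test.success').exists():
            shutil.rmtree(d)
```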


> On a Windows instance, fairly similar to what's running drongo, I can get a
> successful build with meson+VS2019, but I'm getting an error in the
> regression tests, which don't like setting lc_time to 'de_DE'. Not sure
> what's going on there.

Huh, that's odd.


> meson apparently wants touch and cp installed, although I can't see why at
> first glance. For Windows I just copied them into the path from an msys2
> installation.

Those should probably be fixed.

Greetings,

Andres Freund




Re: XLogReadBufferExtended() vs disconnected segments

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 17:01:47 -0800, Andres Freund wrote:
> One way to defend against this would be to make mdextend(), whenever it
> extends into the last block of a segment, unlink the next segment - it can't
> contain validly existing contents.  But it seems scary to just unlink entire
> segments.

Another way might be for XLOG_SMGR_TRUNCATE record, as well as smgr unlinks in
commit/abort records, to include not just the "target size", as we do today,
but to also include the current size.

I'm not sure that'd fix all potential issues, but it seems like it'd fix a lot
of the more obvious issues, because it'd prevent scenarios like a base backup
copying segment N, without copying N - 1, due to a concurrent truncate/drop,
from causing harm. Due to the range being included in the WAL record, replay
would know that N needs to be unlinked, even if smgrnblocks() thinks the
relation is much smaller.

Greetings,

Andres Freund




XLogReadBufferExtended() vs disconnected segments

2023-02-22 Thread Andres Freund
Hi,

I was trying to implement ExtendRelationBufferedTo(), responding to a review
comment by Heikki, in
https://www.postgresql.org/message-id/2023003152.rh4s75aedj65h...@awork3.anarazel.de

Which led me to stare at the P_NEW do while loop in
XLogReadBufferExtended(). I first started to reply on that thread, but it
seems like a big enough issue that it seemed worth starting a separate thread.

The relevant logic was added in 6f2aead1ffec, the relevant discussion is at
https://www.postgresql.org/message-id/32313.1392231107%40sss.pgh.pa.us

My understanding of what happened there is that we tried to extend a relation,
sized one block below a segment boundary, and after that the relation was much
larger, because the next segment file existed, and had a non-zero size. And
because we extended blkno-lastblock times, we'd potentially blow up the
relation size much more than intended.

The actual cause of that in the reported case appears to have been a bug in
wal-e. But I suspect it's possible to hit something like that without such
problems, just due to crashes on the replica, or "skew" while taking a base
backup.


I find it pretty sketchy that we just leave the contents of the previously
"disconnected" segment contents around, without using log_invalid_page() for
the range, or warning, or ...

Most of the time this issue would be fixed due to later WAL replay
initializing the later segment. But I don't think there's a guarantee for
that (see below).

It'd be one thing if we accidentally used data in such a segment, if the
situation is only caused by a bug in base backup tooling, or filesystem
corruption, or ...

But I think we can encounter the issue without anything like that being
involved. Imagine this scenario:

1) filenode A gets extended to segment 3
2) basebackup starts, including performing a checkpoint
3) basebackup ends up copying A's segment 3 first, while in progress
4) filenode A is dropped
5) checkpoint happens, allowing filenode A to be used again
6) filenode A is created newly
7) basebackup ends

At that point A will have segment 0, segment 3. The WAL replay for 4) won't
drop segment 3, because smgrnblocks() won't even see it, because segment 2
doesn't exist.

If a replica starts from this base backup, we'll be fine until A again grows
far enough to fill segment 2. At that point, we'll suddenly have completely
bogus contents in 3. Obviously accesses to those contents could trivially
crash at that point.


I suspect there's an easier to hit version of this: Consider this path in
ExecuteTruncateGuts():

        /*
         * Normally, we need a transaction-safe truncation here.  However, if
         * the table was either created in the current (sub)transaction or has
         * a new relfilenumber in the current (sub)transaction, then we can
         * just truncate it in-place, because a rollback would cause the whole
         * table or the current physical file to be thrown away anyway.
         */
        if (rel->rd_createSubid == mySubid ||
            rel->rd_newRelfilelocatorSubid == mySubid)
        {
            /* Immediate, non-rollbackable truncation is OK */
            heap_truncate_one_rel(rel);

Afaict that could easily lead to a version of the above that doesn't even
require relfilenodes getting recycled.


One way to defend against this would be to make mdextend(), whenever it
extends into the last block of a segment, unlink the next segment - it can't
contain validly existing contents.  But it seems scary to just unlink entire
segments.


Greetings,

Andres Freund




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 16:34:44 -0500, Tom Lane wrote:
> I wrote:
> > Andres Freund  writes:
> >> Maybe it's worth sticking a StaticAssert() for the struct size
> >> somewhere.
> 
> > Indeed.  I thought we had one already.
> 
> >> I'm a bit wary about that being too noisy, there are some machines with
> >> odd alignment requirements. Perhaps worth restricting the assertion to
> >> x86-64 + armv8 or such?
> 
> > I'd put it in first and only reconsider if it shows unfixable problems.
> 
> Now that we've got the sizeof(ExprEvalStep) under control, shouldn't
> we do the attached?

Indeed. Pushed.

Let's hope there's no rarely used architecture with odd alignment rules.

Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
> Vladimir Churyukin  writes:
> > It doesn't need to be perfect, but it needs to be consistent. So far you
> > proposed a rule to replace () with _. What is the plan for expressions, how
> > to convert them to names (with deduplication I guess?, because there could
> > be 2 similar expressions mapped to the same name potentially).
> 
> I do not think we need to do anything for arbitrary expressions.

Exactly. It's not like they have a useful name today.  Nor are they unique.


> The proposal so far was just to handle a function call wrapped
> around something else by converting to the function name followed
> by whatever we'd emit for the something else.

Just to showcase that better, what I think we're discussing is changing:

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────┬────────┬──────────┐
│ sum  │  sum   │ ?column? │
├──────┼────────┼──────────┤
│ 2774 │ 257896 │        2 │
└──────┴────────┴──────────┘
(1 row)

to

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──────────────┬───────────────┬──────────┐
│ sum_relpages │ sum_reltuples │ ?column? │
├──────────────┼───────────────┼──────────┤
│         2774 │        257896 │        2 │
└──────────────┴───────────────┴──────────┘
(1 row)


> You cannot realistically
> handle, say, operator expressions without emitting names that will
> require quoting, which doesn't seem attractive.

Well, it doesn't require much to be better than "?column?", which already
requires quoting...

We could just do something like printing <function>_<leftarg>_<opname>_<rightarg>. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.

Whether that's worth it, or whether column name lengths would be too painful,
IDK.
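A toy sketch of that flattening rule, matching the avg example above — the node classes are illustrative, not PostgreSQL's parse-tree types:

```python
# Hypothetical flattening rule: a column yields its name, an operator
# expression joins its arguments around the operator's name, and a
# function call prepends the function name.
from dataclasses import dataclass

@dataclass
class Col:
    name: str

@dataclass
class Op:
    opname: str          # e.g. "float48div" for float4/float8 division
    left: object
    right: object

@dataclass
class Func:
    funcname: str
    arg: object

def infer_name(node):
    if isinstance(node, Col):
        return node.name
    if isinstance(node, Op):
        return f"{infer_name(node.left)}_{node.opname}_{infer_name(node.right)}"
    if isinstance(node, Func):
        return f"{node.funcname}_{infer_name(node.arg)}"
    return "?column?"    # fall back, as PostgreSQL does today

expr = Func("avg", Op("float48div", Col("reltuples"), Col("relpages")))
```

For the earlier example, `Func("sum", Col("relpages"))` flattens to `sum_relpages`, and `expr` above to `avg_reltuples_float48div_relpages`.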


> And no, deduplication isn't on the table at all here.

+1

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-21 11:22:26 -0800, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> > ExtendRelationBuffered?
> 
Hm. That's a good point. Probably not. Perhaps it could be useful to support
> RBM_NORMAL as well? But even if, it'd just be a lock release away if we always
> used RBM_ZERO_AND_LOCK.

There's a fair number of callers using RBM_NORMAL, via ReadBuffer[Extended]()
right now. While some of them are trivial to convert, others aren't (e.g.,
brin_getinsertbuffer()).  So I'm inclined to continue allowing RBM_NORMAL.

Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
> That is a good idea for simple cases, I'm just curious how it would look
> like for more complex cases (you can have all kinds of expressions as
> parameters for aggregate function calls).
> If it works only for simple cases, I think it would be confusing and not
> very helpful.

I don't think it needs to be perfect to be helpful.


> Wouldn't it make more sense to just deduplicate the names by adding
> numerical postfixes, like sum_1, sum_2?

That'd be considerably worse than what we do today imo, because any reordering
/ added aggregate would lead to everything else changing as well.


Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
> On 11.02.23 20:24, Andres Freund wrote:
> I think we should just do it and not care about what breaks.  There has
> never been any guarantee about these.
> 
> FWIW, "most" other SQL implementations appear to generate column names like
> 
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: "SUM(reads)", "SUM(writes)"

Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns.  sum_reads imo is more usable
than "SUM(reads)".


> We had a colleague look into this a little while ago, and it got pretty
> tedious to implement this for all the expression types.

Hm, any chance that colleague could be pointed at this discussion and chime
in? It doesn't immediately look that hard to do substantially better than
today. Of course there's an approximately endless amount of effort that could
be poured into this, but even some fairly basic improvements seem like a big
win.


> And, you know, the bikeshedding ...

Indeed. I already started above :)

Greetings,

Andres Freund




Re: pg_regress: Treat child process failure as test failure

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 15:10:11 +0100, Daniel Gustafsson wrote:
> > On 26 Nov 2022, at 22:46, Daniel Gustafsson  wrote:
> 
> > I've moved the statuses[i] check before the differ check, such that there is a
> > separate block for this not mixed up with the differs check and printing.
> 
> Rebased patch to handle breakage of v2 due to bd8d453e9b.

I think we probably should just apply this? The current behaviour doesn't seem
right, and I don't see a downside of the new behaviour?

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 11:18:57 +0200, Heikki Linnakangas wrote:
> On 21/02/2023 21:22, Andres Freund wrote:
> > On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > > Is it ever possible to call this without a relcache entry? WAL redo
> > > functions do that with ReadBuffer, but they only extend a relation
> > > implicitly, by replaying a record for a particular block.
> > 
> > I think we should use it for crash recovery as well, but the patch doesn't
> > yet. We have some gnarly code there, see the loop using P_NEW in
> > XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> > expensive than doing it in bulk.
> 
> Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
> and then call ExtendRelationBuffered for the target page. Or the new
> ExtendRelationBufferedTo() function that you mentioned.

I don't think it's safe to just use smgrzeroextend(). Without the page-level
interlock from the buffer entry, a concurrent reader can read/write the
extended portion of the relation, while we're extending. That can lead to
losing writes.

It also turns out that just doing smgrzeroextend(), without filling s_b, is
often bad for performance, because it may cause reads when trying to fill the
buffers. Although hopefully that's less of an issue during WAL replay, due to
REGBUF_WILL_INIT.


> In the common case that you load a lot of data to a relation extending it,
> and then crash, the WAL replay would still extend the relation one page at a
> time, which is inefficient. Changing that would need bigger changes, to
> WAL-log the relation extension as a separate WAL record, for example. I
> don't think we need to solve that right now, it can be addressed separately
> later.

Yea, that seems indeed something for later.

There's several things we could do without adding WAL logging of relation
extension themselves.

One relatively easy thing would be to add information about the number of
blocks we're extending by to XLOG_HEAP2_MULTI_INSERT records. Compared to the
insertions themselves that'd barely be noticeable.

A slightly more complicated thing would be to peek ahead in the WAL (we have
infrastructure for that now) and extend by enough for the next few relation
extensions.

Greetings,

Andres Freund




Re: meson and sslfiles.mk in src/test/ssl/

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 09:42:54 -0800, Jacob Champion wrote:
> I'm happy to contribute cycles to a Meson port when you're ready for
> it. From a skim it seems like maybe in-source generation isn't a focus for
> Meson [1]. They might encourage us to write custom Python for it anyway?

You'd normally just invoke commands for updating sources as run_target()s, not
custom targets. Or generate them in the build tree via custom_target() and
then copy to the source tree via a run_target().

For this case I think it'd suffice to add a run target that does something
like
make -C ~/src/postgresql/src/test/ssl/ -f ~/src/postgresql/src/test/ssl/sslfiles.mk sslfiles OPENSSL=openssl

obviously with the necessary things being replaced by the relevant variables.


sslfiles.mk doesn't depend on the rest of the buildsystem, and is a rarely
executed command, so I don't see a problem with using make to update the
files. At least for a long while.
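In meson syntax, such a run target might look roughly like this — the target name and the bare `make`/`openssl` lookups are assumptions, not an actual snippet from the tree:

```meson
# Hypothetical run target to regenerate the SSL test files in the
# source tree via the existing makefile.
run_target('sslfiles',
  command: [
    'make',
    '-C', meson.current_source_dir(),
    '-f', meson.current_source_dir() / 'sslfiles.mk',
    'sslfiles',
    'OPENSSL=openssl',
  ],
)
```

Since run targets are never part of the default build, this keeps the rarely-used regeneration step out of normal builds while still making it invocable via `meson compile sslfiles`.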

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-21 Thread Andres Freund
Hi,

On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
> On 10/28/22 9:54 PM, Andres Freund wrote:
> > b) I found that it is quite beneficial to bulk-extend the relation with
> > smgrextend() even without concurrency. The reason for that is primarily
> > the aforementioned dirty buffers that our current extension method causes.
> > 
> > One bit that stumped me for quite a while is to know how much to extend the
> > relation by. RelationGetBufferForTuple() drives the decision whether / how
> > much to bulk extend purely on the contention on the extension lock, which
> > obviously does not work for non-concurrent workloads.
> > 
> > After quite a while I figured out that we actually have good information on
> > how much to extend by, at least for COPY /
> > heap_multi_insert(). heap_multi_insert() can compute how much space is
> > needed to store all tuples, and pass that on to
> > RelationGetBufferForTuple().
> > 
> > For that to be accurate we need to recompute that number whenever we use an
> > already partially filled page. That's not great, but doesn't appear to be a
> > measurable overhead.
> Some food for thought: I think it's also completely fine to extend any
> relation over a certain size by multiple blocks, regardless of concurrency.
> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
> for what algorithm would make sense here; maybe something along the lines of
> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
> extending by just a couple extra pages doesn't help much without
> concurrency).

I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention). Which can end up with a
relation being larger by a bunch if data loading stops at some point.

We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, and thus should be the baseline.

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-21 Thread Andres Freund
Hi,

On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even if, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.


> Is it ever possible to call this without a relcache entry? WAL redo
> functions do that with ReadBuffer, but they only extend a relation
> implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.


> All of the above comments are around the BulkExtendRelationBuffered()
> function's API. That needs a closer look and a more thought-out design to
> make it nice. Aside from that, this approach seems valid.

Thanks for looking!  I agree that it can stand a fair bit of polishing...

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-21 Thread Andres Freund
Hi,

On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:
> > v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
> 
> This looks straightforward. My only concern is that it changes the order
> that things happen at abort. Currently, AbortBufferIO() is called very early
> in AbortTransaction(), and this patch moves it much later. I don't see any
> immediate problems from that, but it feels scary.

Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's a
particularly realistic scenario due to TransactionAbortContext (I guess you
could have a large error context stack or such).


Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.


> > @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
> >  static void
> >  AtProcExit_Buffers(int code, Datum arg)
> >  {
> > -   AbortBufferIO();
> > UnlockBuffers();
> > CheckForBufferLeaks();
> 
> Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
> elog(FATAL)? Do we need to worry about that?

We have before_shmem_exit() callbacks that should protect against
that. InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback().


I think we'd already be in trouble if we didn't reliably end up doing resowner
cleanup during process exit.

Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list of
"active" resource owners and have a before-exit callback that ensures the list
is empty and PANICs if not?  Better a crash restart than hanging because we
didn't release some shared resource.

Greetings,

Andres Freund




Re: meson: Optionally disable installation of test modules

2023-02-20 Thread Andres Freund
On 2023-02-20 19:43:56 +0100, Peter Eisentraut wrote:
> I don't think any callers try to copy a directory source, so the
> shutil.copytree() stuff isn't necessary.

I'd like to use it for installing docs outside of the normal install
target. Of course we could add the ability at a later point, but that seems a
bit pointless back-and-forth to me.




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-20 Thread Andres Freund
Hi,


On 2023-02-21 08:33:22 +1300, David Rowley wrote:
> On Tue, 21 Feb 2023 at 07:30, Andres Freund  wrote:
> > 2) We should introduce an mcxt.c API to perform allocations that the
> >    caller promises not to individually free.
> 
> It's not just pfree. Offhand, there's also repalloc,
> GetMemoryChunkSpace and GetMemoryChunkContext too.

Sure, and all of those should assert out / crash if done with the allocation
function.


> I am interested in a bump allocator for tuplesort.c. There it would be
> used in isolation and all the code which would touch pointers
> allocated by the bump allocator would be self-contained to the
> tuplesorting code.
> 
> What use case do you have in mind?

E.g. the whole executor state tree (and likely also the plan tree) should be
allocated that way. They're never individually freed. But we also allocate
other things in the same context, and those do need to be individually
freeable. We could use a separate memory context, but that'd increase memory
usage in many cases, because there'd be two different blocks being allocated
from at the same time.

To me opting into this on a per-allocation basis seems likely to make this
more widely usable than requiring a distinct memory context type.

Greetings,

Andres Freund




Re: meson: Non-feature feature options

2023-02-20 Thread Andres Freund
Hi,

On 2023-02-20 19:53:53 +0100, Peter Eisentraut wrote:
> On 20.02.23 13:33, Nazir Bilal Yavuz wrote:
> > I added SSL and UUID patches. UUID patch has two different fixes:
> > 
> > 1 - v1-0002-meson-Refactor-UUID-option.patch: Adding 'auto' choice to
> > 'uuid' combo option.
> > 
> > 2 - v1-0002-meson-Refactor-UUID-option-with-uuid_library.patch: Making
> > 'uuid' feature option and adding new 'uuid_library' combo option with
> > ['auto', 'bsd', 'e2fs', 'ossp'] choices. If 'uuid_library' is set other
> > than 'auto' and it can't be found, build throws an error.
> > 
> > What do you think?
> 
> I like the second approach, with a 'uuid' feature option.  As you wrote
> earlier, adding an 'auto' choice to a combo option doesn't work fully like a
> real feature option.

But we can make it behave exactly like one, by checking the auto_features
option.

Greetings,

Andres Freund




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-20 Thread Andres Freund
Hi,

On 2023-02-17 09:52:01 -0800, Andres Freund wrote:
> On 2023-02-17 17:26:20 +1300, David Rowley wrote:
> Random note:
>
> I wonder if we should have a bitmap (in an int) in front of aset's
> freelist. In a lot of cases we incur plenty cache misses, just to find the
> freelist bucket empty.

Two somewhat related thoughts:

1) We should move AllocBlockData->freeptr into AllocSetContext. It's only ever
   used for the block at the head of ->blocks.

   We completely unnecessarily incur more cache line misses due to this (and
   waste a tiny bit of space).

2) We should introduce an mcxt.c API to perform allocations that the
   caller promises not to individually free.  We've talked a bunch about
   introducing a bump allocator memory context, but that requires using
   dedicated memory contexts, which incurs noticeable space overhead, whereas
   just having a separate function call for the existing memory contexts
   doesn't have that issue.

   For aset.c we should just allocate from set->freeptr, without going through
   the freelist. Obviously we'd not round up to a power of 2. And likely, at
   least outside of assert builds, we should not have a chunk header.
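Roughly what such an allocation path could look like (a sketch only — bump_region and bump_alloc are made-up names, not the proposed mcxt.c API): bump the block's free pointer with alignment rounding and no per-chunk header.

```c
#include <assert.h>
#include <stddef.h>

#define BUMP_ALIGN 8            /* stand-in for MAXALIGN */

/* Toy model of allocating straight from a block's free pointer. */
typedef struct bump_region
{
    char   *freeptr;            /* next free byte within the block */
    char   *endptr;             /* one past the last usable byte */
} bump_region;

static void *
bump_alloc(bump_region *r, size_t size)
{
    /* round up so the next allocation stays aligned */
    size_t  alloc_size = (size + BUMP_ALIGN - 1) & ~((size_t) BUMP_ALIGN - 1);
    void   *ret;

    if ((size_t) (r->endptr - r->freeptr) < alloc_size)
        return NULL;            /* real code would allocate a new block */

    ret = r->freeptr;
    r->freeptr += alloc_size;
    return ret;
}

/* Returns the offset of the second allocation, to show the rounding. */
static ptrdiff_t
bump_demo(void)
{
    static char buf[256];
    bump_region r = {buf, buf + sizeof(buf)};
    char       *a = bump_alloc(&r, 10);
    char       *b = bump_alloc(&r, 1);

    return (a == buf) ? (b - a) : -1;
}
```

Since nothing is ever individually freed, there's no freelist lookup and no chunk header; the whole region goes away when the owning context is reset.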


Greetings,

Andres Freund




Re: SQL/JSON revisited

2023-02-20 Thread Andres Freund
mod((Node *) expr);
> + var->estate = ExecInitExpr(expr, ps);
> + var->econtext = ps->ps_ExprContext;
> + var->mcxt = CurrentMemoryContext;
> + var->evaluated = false;
> + var->value = (Datum) 0;
> + var->isnull = true;
> +
> + args = lappend(args, var);
> + }
> + }
> +
> + cxt->colexprs = palloc(sizeof(*cxt->colexprs) *
> +    
> list_length(tf->colvalexprs));
> +
> + JsonTableInitScanState(cxt, &cxt->root, root, NULL, args,
> +CurrentMemoryContext);
> +
> + i = 0;
> +
> + foreach(lc, tf->colvalexprs)
> + {
> + Expr   *expr = lfirst(lc);
> +
> + cxt->colexprs[i].expr =
> + ExecInitExprWithCaseValue(expr, ps,
> +   
> &cxt->colexprs[i].scan->current,
> +   
> &cxt->colexprs[i].scan->currentIsNull);
> +
> + i++;
> + }
> +
> + state->opaque = cxt;
> +}

Evaluating N expressions for a json table isn't a good approach, in terms of
both memory and CPU efficiency.

Why don't you just emit the proper expression directly, instead of the
CaseTestExpr stuff, that you then separately evaluate?



Greetings,

Andres Freund




Re: Add WAL read stats to pg_stat_wal

2023-02-20 Thread Andres Freund
Hi,

On 2023-02-20 14:21:39 +0900, Kyotaro Horiguchi wrote:
> I totally agree that it will be useful, but I'm not quite sure how
> checkpointer would be able to let postmaster know about that state
> without requiring access to shared memory.

SendPostmasterSignal(PMSIGNAL_SHUTDOWN_CHECKPOINT_COMPLETE);
or such.

Greetings,

Andres Freund




Re: cfbot failures

2023-02-19 Thread Andres Freund
Hi,

On 2023-02-19 19:08:41 -0600, Justin Pryzby wrote:
> On 2023-02-11, Andres Freund wrote 
> 20230212004254.3lp22a7bpkcjo...@awork3.anarazel.de:
> > The windows test failure is a transient issue independent of the patch
> > (something went wrong with image permissions).
> 
> That's happening again since 3h ago.
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql

Fixed manually. This is some sort of gcp issue. Bilal tried to deploy a
workaround, but that didn't yet work.

[21:39:06.006] 2023-02-19T21:39:06Z: ==> 
windows.googlecompute.windows-ci-vs-2019: Creating image...
[21:44:08.025] 2023-02-19T21:44:08Z: ==> 
windows.googlecompute.windows-ci-vs-2019: Error waiting for image: time out 
while waiting for image to register
...

[21:44:10.990] gcloud compute images deprecate 
pg-ci-${CIRRUS_TASK_NAME}-${DATE} --state=DEPRECATED
[21:44:33.834] ERROR: (gcloud.compute.images.deprecate) Could not fetch 
resource:
[21:44:33.834]  - Required 'compute.images.deprecate' permission for 
'projects/cirrus-ci-community/global/images/pg-ci-windows-ci-vs-2019-2023-02-19t21-30-43'

Greetings,

Andres Freund




Re: Handle TEMP_CONFIG for pg_regress style tests in pg_regress.c

2023-02-19 Thread Andres Freund
Hi,

On 2023-02-19 11:13:38 -0500, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 2023-02-19 Su 02:25, Peter Eisentraut wrote:
> >> On 18.02.23 21:26, Andres Freund wrote:
> >>> My inclination is to move TEMP_CONFIG support from the Makefile to
> >>> pg_regress.c. That way it's consistent across the build tools and isn't
> >>> duplicated.
> 
> >> I'm having a hard time understanding what TEMP_CONFIG is for.
> 
> > It's used by the buildfarm to add the extra config settings from its 
> > configuration file.
> 
> I have also used it manually to inject configuration changes into
> TAP tests, for instance running them with debug_discard_caches = 1.

Similar. Explicitly turning on fsync, changing the log level to debug etc.


> It's quite handy, but I agree the lack of documentation is bad.

We have some minimal documentation for EXTRA_REGRESS_OPTS, but imo that's a
bit of a different use case, as it adds actual commandline options for
pg_regress (and thus doesn't work for tap tests).

Seems we'd need a section in regress.sgml documenting the various environment
variables?


> It looks to me like pg_regress already does implement this; that
> is, the Makefiles convert TEMP_CONFIG into a --temp-config switch
> to pg_[isolation_]regress.  So if we made pg_regress responsible
> for examining the envvar directly, very little new code would be
> needed.

It's very little, indeed - the patch upthread ends up with:
 4 files changed, 11 insertions(+), 16 deletions(-)


> (Maybe net negative code if we remove the command line
> switch, but I'm not sure if we should.)

I don't think we should - we use it for various regression tests, to specify a
config file they should load (shared_preload_libraries, wal_level, etc).

The way I implemented it now is that TEMP_CONFIG is added to the resulting
config file earlier than the contents of the file explicitly specified on the
commandline.
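To make the precedence explicit: postgres reads configuration files top to bottom and the last assignment of a setting wins. The toy helper below (hypothetical, not a postgres function) models that last-wins behavior, which is why appending the --temp-config file after TEMP_CONFIG lets it override.

```c
#include <assert.h>
#include <string.h>

/*
 * Last-wins lookup over (key, value) pairs given in file order, the way
 * later GUC assignments in postgresql.conf override earlier ones.
 */
static const char *
last_value(const char *keys[], const char *vals[], int n, const char *key)
{
    const char *result = NULL;

    for (int i = 0; i < n; i++)
    {
        if (strcmp(keys[i], key) == 0)
            result = vals[i];   /* a later entry overrides an earlier one */
    }
    return result;
}

/* TEMP_CONFIG sets wal_level first; --temp-config overrides it. */
static int
precedence_demo(void)
{
    const char *keys[] = {"wal_level", "wal_level"};
    const char *vals[] = {"replica", "logical"};

    return strcmp(last_value(keys, vals, 2, "wal_level"), "logical") == 0;
}
```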


> What we'd lose is the ability to write make TEMP_CONFIG=foo check but I
> wouldn't miss that.  Having a uniform rule that TEMP_CONFIG is an
> environment variable and nothing else seems good.

If we were concerned about it we could just add an export of TEMP_CONFIG to
src/Makefile.global.in

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-18 Thread Andres Freund
Hi,

On 2023-02-18 15:51:06 +0530, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 10:02 PM Andres Freund  wrote:
> > But there's nothing inherent in that. We know for certain which files we're
> > going to archive. And we don't need to work one-by-one. The archiver could
> > just start multiple subprocesses at the same time.
> 
> But what if it doesn't want to start multiple processes, just
> multiplex within a single process?

To me that seems even simpler? Nothing but the archiver is supposed to create
.done files and nothing is supposed to remove .ready files without archiver
having created the .done files.  So the archiver process can scan
archive_status until it's done or until N archives have been collected, and
then process them at once?  Only the creation of the .done files would be
serial, but I don't think that's commonly a problem (and could be optimized as
well, by creating multiple files and then fsyncing them in a second pass,
avoiding N filesystem journal flushes).
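A sketch of the batched flow described above (illustrative names only, not the real archiver code): collect up to a batch's worth of .ready entries, archive them together, and only then create the .done markers.

```c
#include <assert.h>
#include <string.h>

/* True if name ends in ".ready". */
static int
is_ready(const char *name)
{
    const char *suffix = ".ready";
    size_t  len = strlen(name);
    size_t  slen = strlen(suffix);

    return len >= slen && strcmp(name + len - slen, suffix) == 0;
}

/*
 * Scan status entries and collect up to max_batch .ready files; the
 * caller would archive the whole batch, then write (and fsync) the
 * .done files in a second pass.
 */
static int
collect_batch(const char *entries[], int n,
              const char *batch[], int max_batch)
{
    int     nbatch = 0;

    for (int i = 0; i < n && nbatch < max_batch; i++)
    {
        if (is_ready(entries[i]))
            batch[nbatch++] = entries[i];
    }
    return nbatch;
}

static int
batch_demo(void)
{
    const char *entries[] = {
        "000000010000000000000001.done",
        "000000010000000000000002.ready",
        "000000010000000000000003.ready",
        "000000010000000000000004.ready",
    };
    const char *batch[2];

    /* only two of the three .ready entries fit in this batch */
    return collect_batch(entries, 4, batch, 2);
}
```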

Maybe I am misunderstanding what you see as the problem?

Greetings,

Andres Freund




Re: occasional valgrind reports for handle_sig_alarm on 32-bit ARM

2023-02-18 Thread Andres Freund
Hi,

On 2023-02-18 13:56:38 +0100, Tomas Vondra wrote:
> or (somewhat weird)
> 
> ==23734== Use of uninitialised value of size 4
> ==23734==at 0x88DDC8: handle_sig_alarm (timeout.c:457)
> ==23734==by 0x: ???
> ==23734==  Uninitialised value was created by a stack allocation
> ==23734==at 0x64CE2C: EndCommand (dest.c:167)
> ==23734==
> {
>
>Memcheck:Value4
>fun:handle_sig_alarm
>obj:*
> }

I'd try using valgrind's --vgdb-error=1, and inspecting the state.

I assume this is without specifying --read-var-info=yes? Might be worth
trying, sometimes the increased detail can be really helpful.


It's certainly interesting that the error happens in timeout.c:457 - currently
that's the end of the function. And dest.c:167 is the entry of EndCommand().

Perhaps there's some confusion around the state of the stack? The fact that it
looks like the function epilogue of handle_sig_alarm() uses an uninitialized
variable created by the function prologue of EndCommand() does seem to suggest
something like that.

It'd be interesting to see the exact instruction triggering the failure +
surroundings.


> It might be a valgrind issue and/or false positive, but I don't think
> I've seen such failures before, so I'm wondering if this might be due to
> some recent changes?

Have you run 32bit arm valgrind before? It'd not surprise me if there are some
32bit arm issues in valgrind, libc, or such.

Greetings,

Andres Freund




Handle TEMP_CONFIG for pg_regress style tests in pg_regress.c

2023-02-18 Thread Andres Freund
Hi,

When building with meson, TEMP_CONFIG is supported for TAP tests, but doesn't
do anything for regress/isolation.

The reason for that is that meson's (and ninja's) architecture is to separate
"build setup" from the "build/test/whatever" stage, moving dynamism (and more
costly operations) to the "setup" phase.

In this case the implication is that the command line for the test isn't
re-computed dynamically. But pg_regress doesn't look at TEMP_CONFIG, it just
has a --temp-config=... parameter, that src/Makefile.global.in dynamically
adds if TEMP_CONFIG is set.

In contrast to that, TEMP_CONFIG support for tap tests is implemented in
Cluster.pm, and thus works transparently.

My inclination is to move TEMP_CONFIG support from the Makefile to
pg_regress.c. That way it's consistent across the build tools and isn't
duplicated. pg_regress already looks at a bunch of temporary variables
(e.g. PG_REGRESS_SOCK_DIR, PG_TEST_USE_UNIX_SOCKETS), so this isn't really
breaking new ground.

It can be implemented differently, e.g. by adding the parameter dynamically in
the wrapper around pg_regress, but I don't see an advantage in that.

Patch attached.

Greetings,

Andres Freund
>From 22d931aa350698e63e61c449d76b9ac2e3ee37b0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 18 Feb 2023 12:20:09 -0800
Subject: [PATCH v1 1/2] Move TEMP_CONFIG into pg_regress.c

That way it works for meson as well, and gets rid of a bit of duplicated
handling of TEMP_CONFIG.
---
 src/interfaces/ecpg/test/Makefile | 6 +++---
 src/test/regress/pg_regress.c | 8 
 src/Makefile.global.in| 8 
 src/tools/msvc/vcregress.pl   | 5 -
 4 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index d7a7d1d1ca5..ce4b7237b66 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -71,11 +71,11 @@ REGRESS_OPTS =  --expecteddir=$(srcdir) \
   $(EXTRA_REGRESS_OPTS)
 
 check: all
-	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check $(TEMP_CONF) --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
+	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
 
 # Connect to the server using TCP, and add a TCP-specific test.
 checktcp: all | temp-install
-	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check $(TEMP_CONF) --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule --host=localhost sql/twophase connect/test1
+	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule --host=localhost sql/twophase connect/test1
 
 installcheck: all
 	./pg_regress $(REGRESS_OPTS) --bindir='$(bindir)' $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule
@@ -89,4 +89,4 @@ installcheck-prepared-txns: all
 	./pg_regress $(REGRESS_OPTS) --bindir='$(bindir)' $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
 
 check-prepared-txns: all | temp-install
-	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check $(TEMP_CONF) --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
+	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 6cd5998b9d7..759bf00d981 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2032,6 +2032,14 @@ regression_main(int argc, char *argv[],
 	if (!use_unix_sockets)
 		hostname = "localhost";
 
+	/*
+	 * If the TEMP_CONFIG environment variable is set, use it. Give higher
+	 * precedence to an explicitly specified --temp-config by adding it later
+	 * to the list, as later config assignments override earlier ones.
+	 */
+	if (getenv("TEMP_CONFIG"))
+		add_stringlist_item(&temp_configs, getenv("TEMP_CONFIG"));
+
 	/*
 	 * We call the initialization function here because that way we can set
 	 * default parameters and let them be overwritten by the commandline.
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index fb3e197fc06..24656a1c74c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -655,12 +655,6 @@ ifdef NO_LOCALE
 NOLOCALE += --no-locale
 endif
 
-# file with extra config for temp build
-TEMP_CONF =
-ifdef TEMP_CONFIG
-TEMP_CONF += --temp-config=$(TEMP_CONFIG)
-endif
-
 pg_regress_locale_flags = $(if $(ENCODING),--encoding=$(ENCODING)) $(NOLOCA

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-02-18 Thread Andres Freund
Hi,

On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notice crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
>
> Maybe it's just the backend started for the money test has got
> the same PID (5948) that the backend for the name test had?

I somehow mashed name and money into one test in my head... So forget what I
wrote.

That doesn't really explain the assertion though.


It's too bad that the log doesn't include log_connections/log_disconnections
output. If nothing else, it makes it a lot easier
to identify problems like that. We actually do try to configure it for CI, but
it currently doesn't work for pg_regress style tests with meson. Need to fix
that. Starting a thread.



One thing that made me very suspicious when reading related code is this
remark:

bool
ReleasePostmasterChildSlot(int slot)
...
/*
 * Note: the slot state might already be unused, because the logic in
 * postmaster.c is such that this might get called twice when a child
 * crashes.  So we don't try to Assert anything about the state.
 */

That seems fragile, and potentially racy. What if we somehow end up starting
another backend in between the two ReleasePostmasterChildSlot() calls? We
could then mark a slot that, newly, has a process associated with it, as
inactive. Once the slot has been released the first time, it can be assigned
again.


ISTM that it's not a good idea that we use PM_CHILD_ASSIGNED to signal both
that a slot has not been used yet and that it's not in use anymore. I think
that makes it quite a bit harder to find state management issues.



> A simple script that I've found [1] shows that pids are reused rather often
> (for me, approximately every 300 process starts in Windows 10 H2), but maybe
> under some circumstances (many concurrent processes?) PIDs can coincide often
> enough to trigger that behavior.

It's definitely very aggressive in reusing pids - and it seems to
intentionally do work to keep pids small. I wonder if it'd be worth trying to
exercise this path aggressively by configuring a very low max pid on linux, in
an EXEC_BACKEND build.

Greetings,

Andres Freund




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> I still have no theory for how this condition was reached despite a
> lot of time thinking about it and searching for more clues.  As far as
> I can tell, the recent improvements to postmaster's signal and event
> handling shouldn't be related: the state management and logic was
> unchanged.

Yea, it's all very odd.

If you look at the log:

2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL:  No 
valid identifier after ".".
2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT:  
SELECT parse_ident('xxx.1020');
...
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), 
File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948
abort() has been called
...
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was 
terminated by exception 0xC354
2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file 
"ntstatus.h" for a description of the hexadecimal value.
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active 
server processes
2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes 
terminated; reinitializing


and that it's indeed the money test that failed:
 money... FAILED (test process exited with exit 
code 2) 7337 ms

it's very hard to understand how this stack can come to be:

0085`f03ffa40 7ff6`fd89faa8 ucrtbased!abort(void)+0x5a 
[minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
0085`f03ffa80 7ff6`fd6474dc postgres!ExceptionalCondition(
char * conditionName = 0x7ff6`fdd03ca8 
"PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", 
char * fileName = 0x7ff6`fdd03c80 
"../src/backend/storage/ipc/pmsignal.c", 
int lineNumber = 0n329)+0x78 
[c:\cirrus\src\backend\utils\error\assert.c @ 67]
0085`f03ffac0 7ff6`fd676eff 
postgres!MarkPostmasterChildActive(void)+0x7c 
[c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329]
0085`f03ffb00 7ff6`fd59aa3a postgres!InitProcess(void)+0x2ef 
[c:\cirrus\src\backend\storage\lmgr\proc.c @ 375]
0085`f03ffb60 7ff6`fd467689 postgres!SubPostmasterMain(
int argc = 0n3, 
char ** argv = 0x01c6`f3814e80)+0x33a 
[c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
0085`f03ffd90 7ff6`fda0e1c9 postgres!main(
int argc = 0n3, 
char ** argv = 0x01c6`f3814e80)+0x2f9 
[c:\cirrus\src\backend\main\main.c @ 192]

How can a process that we did notice crashing, that has already executed SQL
statements, end up in MarkPostmasterChildActive()?



> While failing to understand this, I worked[1] on CI log indexing tool
> with public reports that highlight this sort of thing[2], so I'll be
> watching out for more evidence.  Unfortunately I have no data from
> before 1 Feb (cfbot previously wasn't interested in the past at all;
> I'd need to get my hands on the commit IDs for earlier testing but I
> can't figure out how to get those out of Cirrus or Github -- anyone
> know how?).  FWIW I have a thing I call bfbot for slurping up similar
> data from the build farm.  It's not pretty enough for public
> consumption, but I do know that this assertion hasn't failed there,
> except the cases I mentioned earlier, and a load of failures on
> lorikeet which was completely b0rked until recently.

> [1] https://xkcd.com/974/
> [2] http://cfbot.cputube.org/highlights/assertion-90.html

I think this is extremely useful.

Greetings,

Andres Freund




File* argument order, argument types

2023-02-17 Thread Andres Freund
Hi,

While trying to add additional File* functions (FileZero, FileFallocate) I went
back and forth about the argument order between "amount" and "offset".

We have:

extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 
wait_event_info);
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, 
uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t 
offset, uint32 wait_event_info);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 
wait_event_info);

and I want to add (for [1])
extern int FileZero(File file, off_t amount, off_t offset, uint32 
wait_event_info);
extern int FileFallocate(File file, off_t amount, off_t offset, uint32 
wait_event_info);

The differences originate in trying to mirror the underlying functions'
signatures:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
ssize_t pread(int fd, void buf[.count], size_t count, off_t offset);
ssize_t pwrite(int fd, const void buf[.count], size_t count, off_t offset);
int ftruncate(int fd, off_t length);
int posix_fallocate(int fd, off_t offset, off_t len);
int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags);


It seems quite confusing to be this inconsistent about argument order and
argument types in the File* functions. For one, the relation to the underlying
posix functions isn't always obvious. For another, we're not actually
mirroring the signatures all that well, our argument and return types don't
actually match.


It'd be easy enough to decide on a set of types for the arguments that'd be
API compatible (though not necessarily ABI compatible, which we don't care about). But
changing the argument order would commonly lead to silent breakage, which
obviously would be bad. Or maybe it's unlikely enough that there are external
callers?

I don't know what to actually propose. I guess the least bad I can see is to
pick one type & argument order that we document to be the default, with a
caveat placed above the functions not following the argument order.

Order-wise, I think we should choose amount, offset.  For the return type we
probably should pick ssize_t?  I don't know what we should standardize on for
'amount', I'd probably be inclined to go for size_t.
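Under that convention, the shape might look like the following sketch — an in-memory stand-in for the fd layer, shown only to illustrate the suggested ssize_t return plus (amount, offset) argument order; mem_file_zero() is a made-up name, not a proposed fd.c function:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/*
 * Zero `amount` bytes at `offset` of an in-memory "file", returning the
 * number of bytes zeroed or -1 on a bogus range — mirroring a
 * hypothetical ssize_t FileZero(File, size_t amount, off_t offset, uint32).
 */
static ssize_t
mem_file_zero(char *file, size_t filesize, size_t amount, off_t offset)
{
    if (offset < 0 || (size_t) offset > filesize ||
        amount > filesize - (size_t) offset)
        return -1;

    memset(file + offset, 0, amount);
    return (ssize_t) amount;
}

static int
file_zero_demo(void)
{
    char    buf[16];

    memset(buf, 'x', sizeof(buf));
    if (mem_file_zero(buf, sizeof(buf), 4, 8) != 4)
        return 0;
    /* bytes before/after the range untouched, range itself zeroed */
    return buf[7] == 'x' && buf[8] == 0 && buf[11] == 0 && buf[12] == 'x';
}
```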

Greetings,

Andres Freund

[1] https://postgr.es/m/20221029025420.eplyow6k7tgu6...@awork3.anarazel.de




Re: Move defaults toward ICU in 16?

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 12:36:05 -0800, Jeff Davis wrote:
> > > There's already a check that the new cluster is empty, so I think
> > > it's
> > > safe to hack the pg_database locale fields.
> > 
> > I don't think we need to, we do issue the CREATE DATABASEs. So we
> > just need to
> > make sure that includes the collation provider info, and the proper
> > template
> > database, in pg_upgrade mode.
> 
> We must fixup template1/postgres in the new cluster (change it to libc
> to match the old cluster), because any objects existing in those
> databases in the old cluster may depend on the default collation. I
> don't see how we can do that without updating pg_database, so I'm not
> following your point.

I think we just drop/recreate template1 and postgres during pg_upgrade.  Yep,
looks like it. See create_new_objects():

/*
 * template1 database will already exist in the target 
installation,
 * so tell pg_restore to drop and recreate it; otherwise we 
would fail
 * to propagate its database-level properties.
 */
create_opts = "--clean --create";

and then:

/*
 * postgres database will already exist in the target 
installation, so
 * tell pg_restore to drop and recreate it; otherwise we would 
fail to
 * propagate its database-level properties.
 */
if (strcmp(old_db->db_name, "postgres") == 0)
    create_opts = "--clean --create";
else
create_opts = "--create";

Greetings,

Andres Freund




Re: Missing free_var() at end of accum_sum_final()?

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 11:48:14 +0900, Michael Paquier wrote:
> On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote:
> > But why do we need it? Most SQL callable functions don't need to be careful
> > about not leaking O(1) memory, the exception being functions backing btree
> > opclasses.
> > 
> > In fact, the detailed memory management often is *more* expensive than just
> > relying on the calling memory context being reset.
> > 
> > Of course, numeric.c doesn't really seem to have gotten that message, so
> > there's a consistency argument here.
> 
> I don't know which final result is better.  The arguments go two ways:
> 1) Should numeric.c be simplified so as its memory structure with extra
> pfree()s, making it more consistent with more global assumptions than
> just this file?  This has the disadvantage of creating more noise in
> backpatching, while increasing the risk of leaks if some of the
> removed parts are allocated in a tight loop within the same context.
> This makes memory management less complicated.  That's how I am
> understanding your point.

It's not just simplification; it's also just faster to free via context reset. I
whipped up a random query exercising numeric math a bunch:

SELECT max(a + b + '17'::numeric + c)
FROM (SELECT generate_series(1::numeric, 1000::numeric)) aa(a),
 (SELECT generate_series(1::numeric, 100::numeric)) bb(b),
 (SELECT generate_series(1::numeric, 10::numeric)) cc(c);

Removing the free_var()s from numeric_add_opt_error() speeds it up from ~361ms
to ~338ms.


This code really needs some memory management overhead reduction love. Many
allocations could be avoided by having a small on-stack "buffer" that's used
unless the numeric is large.
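A sketch of that idea (hypothetical names, not numeric.c's actual NumericVar layout): keep a small inline digit buffer in the struct and fall back to the heap only for large values, so common-case arithmetic allocates nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define INLINE_DIGITS 8

typedef struct toy_var
{
    int16_t stackbuf[INLINE_DIGITS];    /* used unless the value is large */
    int16_t *digits;                    /* stackbuf, or a malloc'd buffer */
    int     ndigits;
} toy_var;

static void
toy_var_init(toy_var *v, int ndigits)
{
    v->ndigits = ndigits;
    if (ndigits <= INLINE_DIGITS)
        v->digits = v->stackbuf;        /* no heap allocation */
    else
        v->digits = malloc(sizeof(int16_t) * ndigits);
}

static void
toy_var_free(toy_var *v)
{
    if (v->digits != v->stackbuf)
        free(v->digits);                /* only large values hit free() */
}

static int
inline_buffer_demo(void)
{
    toy_var small, big;
    int     ok;

    toy_var_init(&small, 4);
    toy_var_init(&big, 100);
    ok = (small.digits == small.stackbuf) && (big.digits != big.stackbuf);
    toy_var_free(&small);
    toy_var_free(&big);
    return ok;
}
```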


> 2) Should the style within numeric.c be more consistent?  This is how
> I am understanding this proposal.  As you quote, this makes memory
> management more complicated (not convinced about that for the internal
> of numerics?), while making the file more consistent.

> At the end, perhaps that's not worth bothering, but 2) prevails when
> it comes to the rule of making some code consistent with its
> surroundings.  1) has more risks seeing how old this code is.

I'm a bit wary that this will trigger a stream of patches to pointlessly free
things, causing churn and slowdowns. I suspect there's other places in
numeric.c where we don't free, and there certainly are a crapton in other
functions.

Greetings,

Andres Freund




Re: recovery modules

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-16 13:58:10 -0800, Nathan Bossart wrote:
> On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote:
> > I'm quite baffled by:
> > /* Close any files left open by copy_file() or compare_files() 
> > */
> > AtEOSubXact_Files(false, InvalidSubTransactionId, 
> > InvalidSubTransactionId);
> > 
> > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files()
> > completely outside the context of a transaction environment. And it only 
> > does
> > the thing you want because you pass parameters that aren't actually valid in
> > the normal use in AtEOSubXact_Files().  I really don't understand how that's
> > supposed to be ok.
> 
> Hm.  Should copy_file() and compare_files() have PG_FINALLY blocks that
> attempt to close the files instead?  What would you recommend?

I don't fully know; it's not entirely clear to me what the goals here were. I
think you'd likely need to do a bit of infrastructure work to do this
sanely. So far we just didn't have the need to handle files being released in
a way like you want to do there.

I suspect a good direction would be to use resource owners. Add a separate set
of functions that release files on resource owner release. Most of the
infrastructure is there already, for temporary files
(c.f. OpenTemporaryFile()).

Then that resource owner could be reset in case of error.


I'm not even sure that erroring out is a reasonable way to implement
copy_file() and compare_files(), particularly because you want to return via a
return code from basic_archive_file().

Greetings,

Andres Freund




Re: Move defaults toward ICU in 16?

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 10:00:41 -0800, Jeff Davis wrote:
> I guess I'm fine hacking pg_upgrade, but I think I'd like to make it
> conditional on this specific case: only perform the fixup if the old
> cluster is 15 or earlier and using libc and the newer cluster is 16 or
> later and using icu.

-1. That's just going to cause pain one major version upgrade further down the
line. Why would we want to incur that pain?


> There's already a check that the new cluster is empty, so I think it's
> safe to hack the pg_database locale fields.

I don't think we need to, we do issue the CREATE DATABASEs. So we just need to
make sure that includes the collation provider info, and the proper template
database, in pg_upgrade mode.

Greetings,

Andres Freund




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 17:26:20 +1300, David Rowley wrote:
> I didn't hear it mentioned explicitly here, but I suspect it's faster
> when increasing the initial size due to the memory context caching
> code that reuses aset MemoryContexts (see context_freelists[] in
> aset.c). Since we reset the context before caching it, then it'll
> remain fast when we can reuse a context, provided we don't need to do
> a malloc for an additional block beyond the initial block that's kept
> in the cache.

I'm not so sure this is the case. Which is one of the reasons I'd really like
to see a) memory context stats for executor context b) a CPU profile of the
problem c) a reproducer.

Jonah, did you just increase the initial size, or did you potentially also
increase the maximum block size?

And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just passed a
larger block size in CreateExecutorState()?  If the latter, the context
freelist wouldn't even come into play.


An 8MB max block size is pretty darn small if you have a workload that ends up
with a gigabytes worth of blocks.

And the problem also could just be that the default initial block size takes
too long to ramp up to a reasonable block size. I think it's 20 blocks to get
from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE.  Even if you
allocate a good bit more than 8MB, having to additionally go through 20
smaller chunks is going to be noticeable until you reach a good bit higher
number of blocks.


> Maybe we should think of a more general-purpose way of doing this
> caching which just keeps a global-to-the-process dclist of blocks
> laying around.  We could see if we have any free blocks both when
> creating the context and also when we need to allocate another block.

Not so sure about that. I suspect the problem could just as well be the
maximum block size, leading to too many blocks being allocated. Perhaps we
should scale that to a certain fraction of work_mem, by default?

Either way, I don't think we should go too deep without some data, too likely
to miss the actual problem.



> I see no reason why this couldn't be shared among the other context
> types rather than keeping this cache stuff specific to aset.c.  slab.c
> might need to be pickier if the size isn't exactly what it needs, but
> generation.c should be able to make use of it the same as aset.c
> could.  I'm unsure what we'd need in the way of size classing for
> this, but I suspect we'd need to pay attention to that rather than do
> things like hand over 16MBs of memory to some context that only wants
> a 1KB initial block.

Possible. I can see something like a generic "free block" allocator being
useful. Potentially with allocating the underlying memory with larger mmap()s
than we need for individual blocks.


Random note:

I wonder if we should have a bitmap (in an int) in front of aset's
freelist. In a lot of cases we incur plenty cache misses, just to find the
freelist bucket empty.
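Roughly what that could look like (a sketch with made-up names; aset.c's real freelist is an array of chunk pointers indexed by size class): keep one bit per bucket in an int and test the bit before dereferencing anything, so an empty bucket costs a register test instead of a cache miss.

```c
#include <assert.h>

#define NBUCKETS 11             /* aset has about this many size classes */

typedef struct toy_freelist
{
    unsigned int nonempty_bitmap;   /* bit i set => bucket i has chunks */
    void   *buckets[NBUCKETS];
} toy_freelist;

static void
bucket_mark_nonempty(toy_freelist *fl, int bucket)
{
    fl->nonempty_bitmap |= 1u << bucket;
}

static void
bucket_mark_empty(toy_freelist *fl, int bucket)
{
    fl->nonempty_bitmap &= ~(1u << bucket);
}

/* The cheap pre-check: no memory access into buckets[] when clear. */
static int
bucket_maybe_nonempty(const toy_freelist *fl, int bucket)
{
    return (fl->nonempty_bitmap >> bucket) & 1;
}

static int
bitmap_demo(void)
{
    toy_freelist fl = {0};

    bucket_mark_nonempty(&fl, 3);
    if (!bucket_maybe_nonempty(&fl, 3) || bucket_maybe_nonempty(&fl, 2))
        return 0;
    bucket_mark_empty(&fl, 3);
    return !bucket_maybe_nonempty(&fl, 3);
}
```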

Greetings,

Andres Freund




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 16:19:46 +0900, Michael Paquier wrote:
> But it looks like I misunderstood what this quote meant compared to
> what v3 does.  It is true that v3 sets iov_len and iov_base more than
> needed when writing sizes larger than BLCKSZ.

I don't think it does for writes larger than BLCKSZ; it just does more for
writes larger than PG_IOV_MAX * BLCKSZ. But in those cases CPU time is going
to be spent elsewhere.


> Seems like you think that it is not really going to matter much to track
> which iovecs have been already initialized during the first loop on
> pg_pwritev_with_retry() to keep the code shorter?

Yes. I'd bet that, in the unlikely case you're going to see any difference at
all, unconditionally initializing is going to win.

Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every
call. Even if we could do some further optimizations of what I did in the
patch, you can initialize needed IOVs repeatedly a *lot* of times, before it
shows up...
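For reference, a standalone sketch of the underlying pattern, with invented names: initialize only as many iovecs as the write actually needs, all pointing at a single zeroed buffer, with only the trailing entry shortened:

```c
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_IOV_MAX 32		/* stand-in for PG_IOV_MAX */

/*
 * Fill 'iov' with just enough entries to cover 'size' bytes, every entry
 * pointing at the same zeroed buffer; only the last entry may be shorter
 * than the buffer.  Returns the number of entries initialized.  (Writes
 * larger than SKETCH_IOV_MAX * zbuf_sz would need an outer retry loop.)
 */
static int
fill_zero_iov(struct iovec *iov, char *zbuf, size_t zbuf_sz, size_t size)
{
	int			iovcnt = 0;
	size_t		remaining = size;

	while (remaining > 0 && iovcnt < SKETCH_IOV_MAX)
	{
		size_t		this_len = (remaining < zbuf_sz) ? remaining : zbuf_sz;

		iov[iovcnt].iov_base = zbuf;
		iov[iovcnt].iov_len = this_len;
		remaining -= this_len;
		iovcnt++;
	}
	return iovcnt;
}
```

A sub-BLCKSZ write touches exactly one iovec this way, rather than memset()ing the whole array.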

I'm inclined to go with my version, with the argument order swapped to
Bharath's order.

Greetings,

Andres Freund




Re: Move defaults toward ICU in 16?

2023-02-17 Thread Andres Freund
Hi,

On 2023-02-17 09:01:54 -0800, Jeff Davis wrote:
> On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote:
> > On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:
> > > I am saying that pg_upgrade should be able to deal with the
> > > difference. The
> > > details of how to implement that, don't matter that much.
> > 
> > To clarify, you're saying that pg_upgrade should simply update
> > pg_database to set the new databases' collation fields equal to that
> > of
> > the old cluster?

Yes.

> Thinking about this more, it's not clear to me if this would be in
> scope for pg_upgrade or not.

I don't think we should consider changing the default collation provider
without making this more seamless, one way or another.


> If pg_upgrade is fixing up the new cluster rather than checking for
> compatibility, why doesn't it just take over and do the initdb for the new
> cluster itself? That would be less confusing for users, and avoid some
> weirdness (like, if you drop the database "postgres" on the original, why
> does it reappear after an upgrade?).

I've wondered about that as well. There are some initdb-time options you can
set, but pg_upgrade could forward those.

Greetings,

Andres Freund




Re: Move defaults toward ICU in 16?

2023-02-16 Thread Andres Freund
Hi, 

On February 16, 2023 9:40:17 PM PST, Michael Paquier  
wrote:
>On Fri, Feb 17, 2023 at 10:44:05AM +0530, Robert Haas wrote:
>> Uh, I had the contrary impression from the discussion upthread, but it
>> sounds like I might be misunderstanding the situation?
>
>IMO, it would be nice to be able to have the automatic detection of
>meson work in the CFbot to see how this patch goes.  Perhaps that's
>not a reason enough to hold on this patch, though..

Fwiw, the manually triggered mingw task today builds with ICU support. One 
thing the patch could do is to just comment out the "manual" piece in 
.cirrus.yml, then cfbot would run it for just this cf entry.

I am planning to build the optional libraries that are easily built, as part of 
the image build for use by CI. Just haven't gotten around to it. The patch Jeff 
linked to is part of the experimentation on the way to that. If somebody else 
wants to finish that, even better. IIRC that prototype builds all optional 
dependencies except for kerberos and ossp-uuid. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Move defaults toward ICU in 16?

2023-02-16 Thread Andres Freund
Hi, 

On February 16, 2023 9:14:05 PM PST, Robert Haas  wrote:
>On Thu, Feb 16, 2023 at 9:45 PM Andres Freund  wrote:
>> On 2023-02-16 15:05:10 +0530, Robert Haas wrote:
>> > The fact that we can't use ICU on Windows, though, weakens this
>> > argument a lot. In my experience, we have a lot of Windows users, and
>> > they're not any happier with the operating system collations than
>> > Linux users. Possibly less so.
>>
>> Why can't you use ICU on windows? It works today, afaict?
>
>Uh, I had the contrary impression from the discussion upthread, but it
>sounds like I might be misunderstanding the situation?

That was about the build environment in CI / cfbot, I think. Jeff was making 
icu a hard requirement by default, but ICU wasn't installed in a usable way, so 
the build failed. The patch he referred to was just building ICU during the CI 
run. 

I do remember encountering issues with the mkvcbuild.pl build not building 
against a downloaded modern icu build, but that was just about library naming 
or directory structure, or such. 

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16

2023-02-16 Thread Andres Freund
Hi

On 2023-02-17 08:30:09 +0530, Amit Kapila wrote:
> Thanks, I was not completely sure about whether we need to bump
> XLOG_PAGE_MAGIC for this patch as this makes the additional space just
> by changing the datatype of one of the members of the existing WAL
> record. We normally change it for the addition/removal of new fields
> aka change in WAL record format, or addition of a new type of WAL
> record. Does anyone else have an opinion/suggestion on this matter?

I'd definitely change it - changing the width of a field still means you can't
really parse the old WAL sensibly anymore.

Greetings,

Andres Freund




Should CreateExprContext() be using ALLOCSET_DEFAULT_SIZES?

2023-02-16 Thread Andres Freund
Hi,

The thread around 
https://postgr.es/m/caduqk8uqw5qauqldd-0sbcvzvncre3jmjb9+ydwo_umv_ht...@mail.gmail.com
reminded me of the following:

ISTM that we really shouldn't use ALLOCSET_DEFAULT_SIZES for expression
contexts, as they most commonly see only a few small, or no, allocations.

That's true for OLTPish queries, but it is very often true even for analytics
queries.

Just because I had it loaded, here's the executor state for TPCH-Q01, which is
pretty expression heavy:

ExecutorState: 65536 total in 4 blocks; 42512 free (11 chunks); 23024 used
  TupleSort main: 32832 total in 2 blocks; 7320 free (7 chunks); 25512 used
TupleSort sort: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Caller tuples: 8192 total in 1 blocks; 6488 free (0 chunks); 1704 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Grand total: 139328 bytes in 11 blocks; 88032 free (18 chunks); 51296 used

As you can see very little was allocated in the ExprContext's.


ISTM that we could save a reasonable amount of memory by using a smaller
initial size.

Not so sure if a smaller max size should be used though.

Greetings,

Andres Freund




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 21:34:18 -0500, Jonah H. Harris wrote:
> On Thu, Feb 16, 2023 at 7:32 PM Andres Freund  wrote:
> Given not much changed regarding that allocation context IIRC, I’d think
> all recents. It was observed in 13, 14, and 15.

We did have a fair bit of changes in related code in the last few years,
including some in 16. I wouldn't expect them to help *hugely*, but also
wouldn't be surprised if it showed.


> I’ll have to create one - it was most evident on a TPC-C or sysbench test
> using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible
> with pgbench as well.

I'd like a workload that hits a perf issue with this, because I think there
likely are some general performance improvements that we could make, without
changing the initial size or the "growth rate".

Perhaps, as a starting point, you could get
  MemoryContextStats(queryDesc->estate->es_query_cxt)
both at the end of standard_ExecutorStart() and at the beginning of
standard_ExecutorFinish(), for one of the queries triggering the performance
issues?


> > If the issue is a specific FDW needing to make a lot of allocations, I can
> > see
> > adding an API to tell a memory context that it ought to be ready to
> > allocate a
> > certain amount of memory efficiently (e.g. by increasing the block size of
> > the
> > next allocation by more than 2x).
>
>
> While I’m happy to be wrong, it seems is an inherent problem not really
> specific to FDW implementations themselves but the general expectation that
> all FDWs are using more of that context than non-FDW cases for similar
> types of operations, which wasn’t really a consideration in that allocation
> over time.

Lots of things can end up in the query context; it's really not FDW-specific
for it to be of nontrivial size. E.g. most tuples passed around end up in it.

Similar performance issues also exist for plenty of other memory contexts,
which for me is a reason *not* to make it configurable just for
CreateExecutorState. Or were you proposing to make ALLOCSET_DEFAULT_INITSIZE
configurable? That would end up with a lot of waste, I think.


The executor context case might actually be a comparatively easy case to
address. There's really two "phases" of use for es_query_ctx. First, we create
the entire executor tree in it, during standard_ExecutorStart(). Second,
during query execution, we allocate things with query lifetime (be that
because they need to live till the end, or because they are otherwise
managed, like tuples).

Even very simple queries end up with multiple blocks at the end:
E.g.
  SELECT relname FROM pg_class WHERE relkind = 'r' AND relname = 'frak';
yields:
  ExecutorState: 43784 total in 3 blocks; 8960 free (5 chunks); 34824 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Grand total: 51976 bytes in 4 blocks; 16888 free (5 chunks); 35088 used

So quite justifiably we could just increase the hardcoded size in
CreateExecutorState. I'd expect that starting a few size classes up would help
noticeably.


But I think we likely could do better here. The amount of memory that ends up
in es_query_cxt during "phase 1" strongly correlates with the complexity of
the statement, as the whole executor tree ends up in it.  Using information
about the complexity of the planned statement to influence es_query_cxt's
block sizes would make sense to me.  I suspect it's a decent enough proxy for
"phase 2" as well.


Medium-long term I really want to allocate at least all the executor nodes
themselves in a single allocation. But that's a bit further out than what
we're talking about here.

Greetings,

Andres Freund




Re: Make set_ps_display faster and easier to use

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 14:19:24 +1300, David Rowley wrote:
> After fixing up the set_ps_display()s to use set_ps_display_with_len()
> where possible, I discovered some not so nice code which appends "
> waiting" onto the process title. Basically, there's a bunch of code
> that looks like this:
> 
> const char *old_status;
> int len;
> 
> old_status = get_ps_display(&len);
> new_status = (char *) palloc(len + 8 + 1);
> memcpy(new_status, old_status, len);
> strcpy(new_status + len, " waiting");
> set_ps_display(new_status);
> new_status[len] = '\0'; /* truncate off " waiting" */

Yea, that code is atrocious...  It took me a while to figure out that no,
LockBufferForCleanup() isn't leaking memory, because it'll always reach the
cleanup path *further up* in the function.


Avoiding the allocation across loop iterations seems like a completely
pointless optimization in these paths - we add the " waiting", precisely
because it's a slow path. But of course not allocating memory would be even
better...


> Seeing that made me wonder if we shouldn't just have something more
> generic for setting a suffix on the process title.  I came up with
> set_ps_display_suffix() and set_ps_display_remove_suffix().  The above
> code can just become:
> 
> set_ps_display_suffix("waiting");
> 
> then to remove the "waiting" suffix, just:
> 
> set_ps_display_remove_suffix();

That'd definitely be better.


It's not really a topic for this patch, but somehow the fact that we have
these set_ps_display() calls all over feels wrong, particularly because most
of them are paired with a pgstat_report_activity() call.  It's not entirely
obvious how it should be instead, but it doesn't feel right.



> +/*
> + * set_ps_display_suffix
> + *   Adjust the process title to append 'suffix' onto the end with a 
> space
> + *   between it and the current process title.
> + */
> +void
> +set_ps_display_suffix(const char *suffix)
> +{
> + size_t  len;

Think this will give you an unused-variable warning in the PS_USE_NONE case.

> +#ifndef PS_USE_NONE
> + /* update_process_title=off disables updates */
> + if (!update_process_title)
> + return;
> +
> + /* no ps display for stand-alone backend */
> + if (!IsUnderPostmaster)
> + return;
> +
> +#ifdef PS_USE_CLOBBER_ARGV
> + /* If ps_buffer is a pointer, it might still be null */
> + if (!ps_buffer)
> + return;
> +#endif

This bit is now repeated three times. How about putting it into a helper?
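Such a helper could look roughly like the following standalone sketch (the globals are stand-ins for the real GUC and process state in ps_status.c, and the PS_USE_CLOBBER_ARGV conditional is folded into a plain pointer check here):

```c
#include <stdbool.h>
#include <stddef.h>

/* stand-ins for the real GUC / process state in ps_status.c */
static bool update_process_title = true;
static bool IsUnderPostmaster = true;
static char ps_buffer_storage[8];
static char *ps_buffer = ps_buffer_storage;

/*
 * One place for the guard checks currently repeated at the top of
 * set_ps_display(), set_ps_display_suffix() and
 * set_ps_display_remove_suffix().
 */
static bool
ps_display_usable(void)
{
	/* update_process_title=off disables updates */
	if (!update_process_title)
		return false;

	/* no ps display for a stand-alone backend */
	if (!IsUnderPostmaster)
		return false;

	/* under PS_USE_CLOBBER_ARGV, ps_buffer may still be unset */
	if (ps_buffer == NULL)
		return false;

	return true;
}
```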




> +#ifndef PS_USE_NONE
> +static void
> +set_ps_display_internal(void)

Very very minor nit: Perhaps this should be update_ps_display() or
flush_ps_display() instead?

Greetings,

Andres Freund




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 16:49:07 -0500, Jonah H. Harris wrote:
> I've been working on a federated database project that heavily relies on
> foreign data wrappers. During benchmarking, we noticed high system CPU
> usage in OLTP-related cases, which we traced back to multiple brk calls
> resulting from block frees in AllocSetReset upon ExecutorEnd's
> FreeExecutorState. This is because FDWs allocate their own derived
> execution states and required data structures within this context,
> exceeding the initial 8K allocation, that need to be cleaned-up.

What PG version?

Do you have a way to reproduce this with core code,
e.g. postgres_fdw/file_fdw?

What is all that memory used for? Is it possible that the real issue is too
many tiny allocations, due to some allocation growing slowly?


> Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES
> to a larger initial "appropriate size" solved the issue and almost doubled
> the throughput. However, the "appropriate size" is workload and
> implementation dependent, so making it configurable may be better than
> increasing the defaults, which would negatively impact users (memory-wise)
> who aren't encountering this scenario.
> 
> I have a patch to make it configurable, but before submitting it, I wanted
> to hear your thoughts and feedback on this and any other alternative ideas
> you may have.

This seems way too magic to expose to users. How would they ever know how to
set it? And it will depend heavily on the specific queries, so a global config
won't work well.

If the issue is a specific FDW needing to make a lot of allocations, I can see
adding an API to tell a memory context that it ought to be ready to allocate a
certain amount of memory efficiently (e.g. by increasing the block size of the
next allocation by more than 2x).
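To make that concrete, a toy standalone model of such a hint API (entirely hypothetical; none of these names exist in PostgreSQL): the caller declares roughly how much it is about to allocate, and the context jumps its next block size there instead of crawling up by doubling:

```c
#include <stddef.h>

/* toy model of a context's block-growth state */
typedef struct GrowthState
{
	size_t		nextBlockSize;
	size_t		maxBlockSize;
} GrowthState;

/* normal policy: each newly malloc'd block doubles the previous size */
static size_t
next_block_size(GrowthState *st)
{
	size_t		sz = st->nextBlockSize;

	if (st->nextBlockSize < st->maxBlockSize)
		st->nextBlockSize *= 2;
	return sz;
}

/*
 * Hypothetical hint: jump the next block size straight to (a clamped
 * power-of-two covering) 'bytes', skipping the intermediate doublings
 * and thus the intermediate malloc()/brk() traffic.
 */
static void
reserve_hint(GrowthState *st, size_t bytes)
{
	while (st->nextBlockSize < bytes &&
		   st->nextBlockSize < st->maxBlockSize)
		st->nextBlockSize *= 2;
}
```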

Greetings,

Andres Freund




Re: Introduce list_reverse() to make lcons() usage less inefficient

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-17 11:36:40 +1300, David Rowley wrote:
> While working on [1] to make improvements in the query planner around
> the speed to find EquivalenceMembers in an EquivalenceClass, because
> that patch does have a large impact in terms of performance
> improvements, some performance tests with that patch started to
> highlight some other places that bottleneck the planner's performance.
> 
> One of those places is in generate_orderedappend_paths() when we find
> that the required sort order is the same as the reverse of the
> partition order.  In this case, we build a list of paths for each
> partition using the lcons() function. Since Lists are now arrays in
> PostgreSQL, lcons() isn't as efficient as it once was and it must
> memmove the entire existing contents of the list up one element to
> make way to prepend the new element. This is effectively quadratic and
> becomes noticeable with a large number of partitions.

I have wondered before if we eventually ought to switch to embedded lists for
some planner structures, including paths. add_path() inserts/deletes at points
in the middle of the list, which isn't great.


> One way we could solve that is to just lappend() the new item and then
> just reverse the order of the list only when we need to.

That's not generally the same as lcons() ing, but I guess it's fine here,
because we build the lists from scratch, so the reversing actually yields the
correct result.

But wouldn't an even cheaper way here be to iterate over the children in
reverse order when match_partition_order_desc? We can do that efficiently
now. Looks like we don't have a ready-made helper for it, but it'd be easy
enough to add or open code.
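Open-coding that is trivial with the array representation; a toy standalone model (not the real pg_list.h types):

```c
#include <stddef.h>

/* toy stand-in for PostgreSQL's array-backed List */
typedef struct ToyList
{
	int			length;
	void	  **elements;
} ToyList;

/*
 * Return the i'th element counting from the tail.  With an array-backed
 * list this is O(1), so a descending loop visits all children in reverse
 * partition order with no copying and no memmove() at all, unlike
 * repeated lcons() which shifts the whole array on every prepend.
 */
static void *
toylist_nth_from_end(const ToyList *list, int i)
{
	return list->elements[list->length - 1 - i];
}
```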

Greetings,

Andres Freund




Re: Missing free_var() at end of accum_sum_final()?

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 15:26:26 +0900, Michael Paquier wrote:
> On Thu, Feb 16, 2023 at 06:59:13AM +0100, Joel Jacobson wrote:
> > I noticed the NumericVar's pos_var and neg_var are not free_var()'d
> > at the end of accum_sum_final().
> > 
> > The potential memory leak seems small, since the function is called
> > only once per sum() per worker (and from a few more places), but
> > maybe they should be free'd anyways for correctness? 
> 
> Indeed, it is true that any code path of numeric.c that relies on a
> NumericVar with an allocation done in its buffer is careful enough to
> free it, except for generate_series's SRF where one step of the
> computation is done.  I don't see directly why you could not do the
> following:
> @@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, NumericVar 
> *result)
> /* And add them together */
> add_var(&pos_var, &neg_var, result);
>  
> +   free_var(&pos_var);
> +   free_var(&neg_var);
> +
> /* Remove leading/trailing zeroes */
> strip_var(result);

But why do we need it? Most SQL callable functions don't need to be careful
about not leaking O(1) memory, the exception being functions backing btree
opclasses.

In fact, the detailed memory management often is *more* expensive than just
relying on the calling memory context being reset.

Of course, numeric.c doesn't really seem to have gotten that message, so
there's a consistency argument here.

Greetings,

Andres Freund




Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy

2023-02-16 Thread Andres Freund
On 2023-02-16 12:37:57 +0530, Robert Haas wrote:
> The patch creates 100_bugs.pl

What's the story behind 100_bugs.pl? This name clearly is copied from
src/test/subscription/t/100_bugs.pl - but I've never understood why that is
outside of the normal numbering space.




Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-15 09:21:48 +0100, Drouvot, Bertrand wrote:
> diff --git a/src/backend/utils/activity/pgstat_relation.c 
> b/src/backend/utils/activity/pgstat_relation.c
> index f793ac1516..b26e2a5a7a 100644
> --- a/src/backend/utils/activity/pgstat_relation.c
> +++ b/src/backend/utils/activity/pgstat_relation.c
> @@ -471,20 +471,46 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid)
>   * Find any existing PgStat_TableStatus entry for rel_id in the current
>   * database. If not found, try finding from shared tables.
>   *
> + * If an entry is found, copy it and increment the copy's counters with their
> + * subtransactions counterparts. Then return the copy. There is no need for 
> the
> + * caller to pfree the copy as the MemoryContext will be reset soon after.
> + *

The "There is no need" bit seems a bit off. Yes, that's true for the current
callers, but who says that it has to stay that way?

Otherwise this looks ready, on a casual scan.

Greetings,

Andres Freund




Re: recovery modules

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote:
> Thanks for reviewing.
> 
> On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote:
> > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote:
> >> @@ -144,10 +170,12 @@ basic_archive_configured(void)
> >>   * Archives one file.
> >>   */
> >>  static bool
> >> -basic_archive_file(const char *file, const char *path)
> >> +basic_archive_file(ArchiveModuleState *state, const char *file, const 
> >> char *path)
> >>  {
> >>sigjmp_buf  local_sigjmp_buf;
> > 
> > Not related to the things changed here, but this should never have been pushed
> > down into individual archive modules. There's absolutely no way that we're
> > going to keep this up2date and working correctly in random archive
> > modules. And it would break if archive modules are ever called outside of
> > pgarch.c.
> 
> Yeah.  IIRC I did briefly try to avoid this, but the difficulty was that
> each module will have its own custom cleanup logic.

It can use PG_TRY/CATCH for that, if the top-level sigsetjmp is in pgarch.c.
Or you could add a cleanup callback to the API, to be called after the
top-level cleanup in pgarch.c.
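A toy, standalone model of that division of labor, using plain setjmp()/longjmp() in place of PostgreSQL's sigsetjmp/PG_TRY machinery (all names here are invented): pgarch.c owns the single outermost handler, and a module that needs its own cleanup wraps its work and re-throws:

```c
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf *current_handler;	/* innermost active handler */
static bool module_cleanup_ran;

static void
throw_error(void)					/* elog(ERROR) analogue */
{
	longjmp(*current_handler, 1);
}

/* archive-module callback with its own cleanup, PG_TRY/PG_CATCH style */
static void
archive_one_file(bool fail)
{
	jmp_buf		local;
	jmp_buf	   *saved = current_handler;

	if (setjmp(local) == 0)			/* PG_TRY() */
	{
		current_handler = &local;
		if (fail)
			throw_error();
		current_handler = saved;
	}
	else							/* PG_CATCH() */
	{
		current_handler = saved;
		module_cleanup_ran = true;	/* close files, free memory, ... */
		throw_error();				/* PG_RE_THROW() back to pgarch.c */
	}
}

/* pgarch.c's loop: the only place with the outermost handler */
static bool
pgarch_archive(bool fail)
{
	jmp_buf		top;

	if (setjmp(top) != 0)
		return false;				/* error handled, module cleaned up */
	current_handler = &top;
	archive_one_file(fail);
	return true;
}
```

The real code would use sigsetjmp() and PostgreSQL's error-state handling, but the control flow is the same: the module never needs its own copy of the top-level error machinery.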


I'm quite baffled by:
/* Close any files left open by copy_file() or compare_files() 
*/
AtEOSubXact_Files(false, InvalidSubTransactionId, 
InvalidSubTransactionId);

in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files()
completely outside the context of a transaction environment. And it only does
the thing you want because you pass parameters that aren't actually valid in
the normal use in AtEOSubXact_Files().  I really don't understand how that's
supposed to be ok.

Greetings,

Andres Freund




Re: recovery modules

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote:
> On Wed, Feb 15, 2023 at 03:38:21PM +0900, Michael Paquier wrote:
> > I may tweak a bit the comments, but nothing more.  And I don't think I
> > have more to add.  Andres, do you have anything you would like to
> > mention?

Just some minor comments below. None of them needs to be addressed.


> @@ -144,10 +170,12 @@ basic_archive_configured(void)
>   * Archives one file.
>   */
>  static bool
> -basic_archive_file(const char *file, const char *path)
> +basic_archive_file(ArchiveModuleState *state, const char *file, const char 
> *path)
>  {
>   sigjmp_buf  local_sigjmp_buf;

Not related to the things changed here, but this should never have been pushed
down into individual archive modules. There's absolutely no way that we're
going to keep this up2date and working correctly in random archive
modules. And it would break if archive modules are ever called outside of
pgarch.c.


> +static void
> +basic_archive_shutdown(ArchiveModuleState *state)
> +{
> + BasicArchiveData *data = (BasicArchiveData *) (state->private_data);

The parens around (state->private_data) are imo odd.

> + basic_archive_context = data->context;
> + Assert(CurrentMemoryContext != basic_archive_context);
> +
> + if (MemoryContextIsValid(basic_archive_context))
> + MemoryContextDelete(basic_archive_context);

I guess I'd personally be paranoid and clear data->context after
this. Obviously doesn't matter right now, but at some later date it could be
that we'd error out after this point and re-enter the shutdown callback.
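A standalone sketch of that bit of paranoia, with free() standing in for MemoryContextDelete() (the real types differ):

```c
#include <stddef.h>
#include <stdlib.h>

/* toy stand-in for the basic_archive module state */
typedef struct BasicArchiveData
{
	void	   *context;		/* MemoryContext stand-in */
} BasicArchiveData;

/*
 * Shutdown callback sketch: delete the module's context, then clear the
 * pointer so that re-entering the callback (e.g. after a later error)
 * can't delete it twice.
 */
static void
basic_archive_shutdown(BasicArchiveData *data)
{
	if (data->context != NULL)	/* MemoryContextIsValid() stand-in */
	{
		free(data->context);	/* MemoryContextDelete() stand-in */
		data->context = NULL;	/* the suggested extra step */
	}
}
```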


> +
> +/*
> + * Archive module callbacks
> + *
> + * These callback functions should be defined by archive libraries and 
> returned
> + * via _PG_archive_module_init().  ArchiveFileCB is the only required 
> callback.
> + * For more information about the purpose of each callback, refer to the
> + * archive modules documentation.
> + */
> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, 
> const char *path);
> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);
> +
> +typedef struct ArchiveModuleCallbacks
> +{
> + ArchiveStartupCB startup_cb;
> + ArchiveCheckConfiguredCB check_configured_cb;
> + ArchiveFileCB archive_file_cb;
> + ArchiveShutdownCB shutdown_cb;
> +} ArchiveModuleCallbacks;

If you wanted you could just define the callback types in the struct now, as
we don't need asserts for the types.
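That would look roughly like this standalone sketch (ArchiveModuleState is reduced to an opaque stub, and the two stub callbacks are invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ArchiveModuleState ArchiveModuleState;	/* opaque here */

/*
 * The same struct with the callback types declared inline, instead of
 * via separate ArchiveStartupCB etc. typedefs.
 */
typedef struct ArchiveModuleCallbacks
{
	void		(*startup_cb) (ArchiveModuleState *state);
	bool		(*check_configured_cb) (ArchiveModuleState *state);
	bool		(*archive_file_cb) (ArchiveModuleState *state,
									const char *file, const char *path);
	void		(*shutdown_cb) (ArchiveModuleState *state);
} ArchiveModuleCallbacks;

/* minimal stubs showing the struct being filled in by a module */
static bool
stub_configured(ArchiveModuleState *state)
{
	(void) state;
	return true;
}

static bool
stub_archive(ArchiveModuleState *state, const char *file, const char *path)
{
	(void) state;
	(void) file;
	(void) path;
	return true;
}
```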

Greetings,

Andres Freund




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 14:21:01 +0900, Kyotaro Horiguchi wrote:
> I'm not sure why output plugin is involved in the delay mechanism.

+many

The output plugin absolutely never should be involved in something like
this. It was a grave mistake that OutputPluginUpdateProgress() calls were
added to the commit callback and then proliferated.


> It appears to me that it would be simpler if the delay occurred in reorder
> buffer or logical decoder instead.

This is a feature specific to walsender, so the triggering logic should either
directly live in the walsender, or in a callback set in
LogicalDecodingContext. That could be called from decode.c or such.

Greetings,

Andres Freund




Re: Add WAL read stats to pg_stat_wal

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 23:39:00 +0530, Bharath Rupireddy wrote:
> While working on [1], I was in need to know WAL read stats (number of
> times and amount of WAL data read from disk, time taken to read) to
> measure the benefit. I had to write a developer patch to capture WAL
> read stats as pg_stat_wal currently emits WAL write stats. With recent
> works on pg_stat_io which emit data read IO stats too, I think it's
> better to not miss WAL read stats. It might help others who keep an
> eye on IOPS of the production servers for various reasons. The WAL
> read stats I'm thinking useful are wal_read_bytes - total amount of
> WAL read, wal_read - total number of times WAL is read from disk,
> wal_read_time - total amount of time spent reading WAL (tracked only
> when an existing GUC track_wal_io_timing is on).

It doesn't really seem useful to have this in pg_stat_wal, because you can't
really figure out where those reads are coming from. Are they crash recovery?
Walsender? ...?

I think this'd better be handled by adding WAL support for pg_stat_io. Then
the WAL reads would be attributed to the relevant backend type, making it
easier to answer such questions.  Going forward I want to add support for
seeing pg_stat_io for individual connections, which'd then automatically
support this feature for the WAL reads as well.

Eventually I think pg_stat_wal should only track wal_records, wal_fpi,
wal_buffers_full and fill the other columns from pg_stat_io.


However, this doesn't "solve" the following issue:

> Note that the patch needs a bit more work, per [2]. With the patch,
> the WAL senders (processes exiting after checkpointer) will generate
> stats and we need to either let all or only one WAL sender to write
> stats to disk. Allowing one WAL sender to write might be tricky.
> Allowing all WAL senders to write might make too many writes to the
> stats file. And, we need a lock to let only one process write. I can't
> think of a best way here at the moment.
> 
> Thoughts?
> 
> [1] 
> https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com
> [2]
> /*
>  * Write out stats after shutdown. This needs to be called by exactly one
>  * process during a normal shutdown, and since checkpointer is shut down
>  * very late...
>  *
>  * Walsenders are shut down after the checkpointer, but currently don't
>  * report stats. If that changes, we need a more complicated solution.
>  */
> before_shmem_exit(pgstat_before_server_shutdown, 0);

I wonder if we should keep the checkpointer around for longer. If we had the
checkpointer signal postmaster after it wrote the shutdown checkpoint,
postmaster could signal walsenders to shut down, and checkpointer could do
some final work, like writing out the stats.

I suspect this could be useful for other things as well. It's awkward that we
don't have a place to put "just before shutting down" type tasks. And
checkpointer seems well suited for that.

Greetings,

Andres Freund




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 16:58:23 +0900, Michael Paquier wrote:
> On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote:
> > The v3 patch reduces initialization of iovec array elements which is a
> > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
> > many times (I assume this is what is needed for the relation extension
> > lock improvements feature). However, it increases the number of iovec
> > initialization with zerobuf for the cases when pg_pwrite_zeros is
> > called for sizes far greater than BLCKSZ (for instance, WAL file
> > initialization).

In those cases the cost of initializing the IOV doesn't matter, relative to
the other costs. The important point is to not initialize a lot of elements if
they're not even needed. Because we need to overwrite the trailing iov
element, it doesn't seem worth trying to "pre-initialize" the iov.

Referencing a static variable is more expensive than accessing an on-stack
variable. Having a first-call check is more expensive than not having it.

Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter
seems like a bad idea, because it's a compiler constant.


> It seems to me that v3 would do extra initializations only if
> pg_pwritev_with_retry() does *not* retry its writes, but that's not
> the case as it retries on a partial write as per its name.  The number
> of iov buffers is strictly capped by remaining_size.

I don't really understand this bit?

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 16:22:56 +0700, John Naylor wrote:
> On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada 
> > Right. TidStore is implemented not only for heap, so loading
> > out-of-order TIDs might be important in the future.
> 
> That's what I was probably thinking about some weeks ago, but I'm having a
> hard time imagining how it would come up, even for something like the
> conveyor-belt concept.

We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 15:18:57 +0530, Robert Haas wrote:
> On Fri, Feb 10, 2023 at 12:59 AM Andres Freund  wrote:
> > I don't think it's that hard to imagine problems. To be reasonably fast, a
> > decent restore implementation will have to 'restore ahead'. Which also
> > provides ample things to go wrong. E.g.
> >
> > - WAL source is switched, restore module needs to react to that, but 
> > doesn't,
> >   we end up with lots of wasted work, or worse, filename conflicts
> > - recovery follows a timeline, restore module doesn't catch on quickly 
> > enough
> > - end of recovery happens, restore just continues on
> 
> I don't see how you can prevent those things from happening. If the
> restore process is working in some way that requires an event loop,
> and I think that will be typical for any kind of remote archiving,
> then either it has control most of the time, so the event loop can be
> run inside the restore process, or, as Nathan proposes, we don't let
> the archiver have control and it needs to run that restore process in
> a separate background worker. The hazards that you mention here exist
> either way. If the event loop is running inside the restore process,
> it can decide not to call the functions that we provide in a timely
> fashion and thus fail to react as it should. If the event loop runs
> inside a separate background worker, then that process can fail to be
> responsive in precisely the same way. Fundamentally, if the author of
> a restore module writes code to have multiple I/Os in flight at the
> same time and does not write code to cancel those I/Os if something
> changes, then such cancellation will not occur. That remains true no
> matter which process is performing the I/O.

IDK. I think we can make that easier or harder. Right now the proposed API
doesn't provide anything that would allow addressing this.


> > > I don't quite see how you can make asynchronous and parallel archiving
> > > work if the archiver process only calls into the archive module at
> > > times that it chooses. That would mean that the module has to return
> > > control to the archiver when it's in the middle of archiving one or
> > > more files -- and then I don't see how it can get control back at the
> > > appropriate time. Do you have a thought about that?
> >
> > I don't think archiver is the hard part, that already has a dedicated
> > process, and it also has something of a queuing system already. The startup
> > process imo is the complicated one...
> >
> > If we had a 'restorer' process, startup fed some sort of a queue with things
> > to restore in the near future, it might be more realistic to do something 
> > you
> > describe?
> 
> Some kind of queueing system might be a useful part of the interface,
> and a dedicated restorer process does sound like a good idea. But the
> archiver doesn't have this solved, precisely because you have to
> archive a single file, return control, and wait to be invoked again
> for the next file. That does not scale.

But there's nothing inherent in that. We know for certain which files we're
going to archive. And we don't need to work one-by-one. The archiver could
just start multiple subprocesses at the same time. All the blocking it does
right now is artificially imposed by the use of system(). We could instead
just use something popen() like and have a configurable number of processes
running at the same time.

What I was trying to point out was that the work a "restorer" process has to
do is more speculative, because we don't know when we'll promote, whether
we'll follow a timeline increase, whether the to-be-restored WAL already
exists. That's solvable, but a bunch of the relevant work ought to be solved
in core code, instead of just in archive modules.

Greetings,

Andres Freund




Re: Move defaults toward ICU in 16?

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:
> The fact that we can't use ICU on Windows, though, weakens this
> argument a lot. In my experience, we have a lot of Windows users, and
> they're not any happier with the operating system collations than
> Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Greetings,

Andres Freund




Re: Refactor calculations to use instr_time

2023-02-16 Thread Andres Freund
Hi,

On 2023-02-16 16:19:02 +0300, Nazir Bilal Yavuz wrote:
> What do you think?

Here's a small review:

> +#define WALSTAT_ACC(fld, var_to_add) \
> + (stats_shmem->stats.fld += var_to_add.fld)
> +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \
> + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld))
> + WALSTAT_ACC(wal_records, diff);
> + WALSTAT_ACC(wal_fpi, diff);
> + WALSTAT_ACC(wal_bytes, diff);
> + WALSTAT_ACC(wal_buffers_full, PendingWalStats);
> + WALSTAT_ACC(wal_write, PendingWalStats);
> + WALSTAT_ACC(wal_sync, PendingWalStats);
> + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time);
> + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time);
>  #undef WALSTAT_ACC
> -
>   LWLockRelease(&stats_shmem->lock);

WALSTAT_ACC is undefined, but WALLSTAT_ACC_INSTR_TIME_TYPE isn't.

I'd not remove the newline before LWLockRelease().


>   /*
> diff --git a/src/include/pgstat.h b/src/include/pgstat.h
> index db9675884f3..295c5eabf38 100644
> --- a/src/include/pgstat.h
> +++ b/src/include/pgstat.h
> @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats
>   TimestampTz stat_reset_timestamp;
>  } PgStat_WalStats;
>  
> +/* Created for accumulating wal_write_time and wal_sync_time as a
> instr_time

Minor code-formatting point: In postgres we don't put code on the same line as
a multi-line comment starting with the /*. So either

/* single line comment */
or
/*
 * multi line
 * comment
 */

> + * but instr_time can't be used as a type where it ends up on-disk
> + * because its units may change. PgStat_WalStats type is used for
> + * in-memory/on-disk data. So, PgStat_PendingWalUsage is created for
> + * accumulating intervals as a instr_time.
> + */
> +typedef struct PgStat_PendingWalUsage
> +{
> + PgStat_Counter wal_buffers_full;
> + PgStat_Counter wal_write;
> + PgStat_Counter wal_sync;
> + instr_time wal_write_time;
> + instr_time wal_sync_time;
> +} PgStat_PendingWalUsage;
> +

I wonder if we should try to put pgWalUsage in here. But that's probably
better done as a separate patch.

Greetings,

Andres Freund




Re: doc: add missing "id" attributes to extension packaging page

2023-02-15 Thread Andres Freund
Hi,

On 2023-02-15 13:34:37 -0600, Karl O. Pinc wrote:
> This makes me think that it would be useful to add --nonet to the
> xsltproc invocations.  That would catch this error before it goes to
> CI.

We are doing that now :)

commit 969509c3f2e3b4c32dcf264f9d642b5ef01319f3
Author: Tom Lane 
Date:   2023-02-08 17:15:23 -0500
 
Stop recommending auto-download of DTD files, and indeed disable it.



> I'm also noticing that the existing xsl:import-s all import entire
> docbook stylesheets.  It does not hurt to do this; the output is
> unaffected, although I can't say what it means for build performance.
> It does keep it simple.  Only one import is needed no matter which
> templates we use the import mechanism to extend.  And by importing
> "everything" there's no concern about any (unlikely) changes to
> the "internals" of the catalog.
> 
> Should we import only what we need or all of docbook?  I don't know.

It couldn't hurt to check if performance improves when you avoid doing so. I
suspect it won't make much of a difference, because the time is actually spent
evaluating xslt rather than parsing it.

Greetings,

Andres Freund




Re: Minimal logical decoding on standbys

2023-02-15 Thread Andres Freund
Hi,

On 2023-02-15 18:02:11 +0530, Ashutosh Sharma wrote:
> Thanks Andres. I have one more query (both for you and Bertrand). I
> don't know if this has already been answered somewhere in this mail
> thread, if yes, please let me know the mail that answers this query.
> 
> Will there be a problem if we mandate the use of physical replication
> slots and hot_standby_feedback to support minimum LD on standby. I
> know people can do a physical replication setup without a replication
> slot or even with hot_standby_feedback turned off, but are we going to
> have any issue if we ask them to use a physical replication slot and
> turn on hot_standby_feedback for LD on standby. This will reduce the
> code changes required to do conflict handling for logical slots on
> standby which is being done by v50-0001 and v50-0002* patches
> currently.

I don't think it would. E.g. while restoring from archives we can't rely on
knowing that the slot still exists on the primary.

We can't just do corrupt things, including potentially crashing, when the
configuration is wrong. We can't ensure that the configuration is accurate all
the time. So we need to detect this case. Hence needing to detect conflicts.


> IMHO even in normal scenarios i.e. when we are not doing LD on
> standby, we should mandate the use of a physical replication slot.

I don't think that's going to fly. There are plenty of scenarios where you
don't want to use a slot, e.g. when you want to limit space use on the primary.

Greetings,

Andres Freund




Re: We shouldn't signal process groups with SIGQUIT

2023-02-15 Thread Andres Freund
Hi,

On 2023-02-15 09:57:41 -0800, Nathan Bossart wrote:
> On Tue, Feb 14, 2023 at 04:20:59PM -0800, Andres Freund wrote:
> > On 2023-02-14 14:23:32 -0800, Nathan Bossart wrote:
> >> On Tue, Feb 14, 2023 at 12:47:12PM -0800, Andres Freund wrote:
> >> > Not really related: I do wonder how often we end up self-deadlocking in
> >> > quickdie(), due to the ereport() not being reentrant. We'll "fix" it soon
> >> > after, due to postmaster's SIGKILL.  Perhaps we should turn on
> >> > send_abort_for_kill on CI?
> >> 
> >> +1, this seems like it'd be useful in general.  I'm guessing this will
> >> require a bit of work on the CI side to generate the backtrace.
> > 
> > They're already generated for all current platforms.  It's possible that 
> > debug
> > info for some system libraries is missing, but the most important one (like
> > libc) should be available.
> > 
> > Since yesterday the cfbot website allows to easily find the coredumps, too:
> > http://cfbot.cputube.org/highlights/core-7.html
> 
> Oh, that's nifty.  Any reason not to enable send_abort_for_crash, too?

I think it'd be too noisy. Right now you get just a core dump of the crashed
process, but if we set send_abort_for_crash we'd end up with a lot of core
dumps, making it harder to know what to look at.

We should never need the send_abort_for_kill path, so I don't think the noise
issue applies to the same degree.

Greetings,

Andres




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund
Hi

On 2023-02-15 10:28:37 +0900, Kyotaro Horiguchi wrote:
> At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund  wrote 
> in 
> > Hi,
> > 
> > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > > Done, PSA v2 patch.
> > 
> > This feels way too complicated to me.  How about something more like the
> > attached?
> 
> I like this one, but the parameters offset and size are in a different
> order from pwrite(fd, buf, count, offset).  I prefer the arrangement
> suggested by Bharath.

Yes, it probably is better. Not sure why I went with that order.


> And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if
> statement?

I really can't make myself care about which version is better :)

Greetings,

Andres Freund




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> On Mon, Feb 13, 2023 at 11:09 PM Andres Freund  wrote:
> >
> > > Later code assigns iov[0].iov_len thus we need to provide a separate
> > > iov non-const variable, or can we use pwrite instead there?  (I didn't
> > > find pg_pwrite_with_retry(), though)
> >
> > Given that we need to do that, and given that we already need to loop to
> > handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
> > worth avoiding iov initialization.
> >
> > But I think it's worth limiting the initialization to blocks.
>
> We can still optimize away the for loop by using a single iovec for
> remaining size, like the attached v2 patch.
>
> > I'd also try to combine the first pg_writev_* with the second one.
>
> Done, PSA v2 patch.

This feels way too complicated to me.  How about something more like the
attached?


> 2) A small test module passing in a file with the size to write isn't
> multiple of block size, meaning, the code we have in the function to
> write last remaining bytes (less than BLCKSZ) gets covered which isn't
> covered right now -

FWIW, I tested this locally by just specifying a smaller size than BLCKSZ for
the write size.

Greetings,

Andres Freund
>From ba0a8697c639870b76e92a285644a87d36cf4496 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 13 Feb 2023 17:11:29 -0800
Subject: [PATCH v3 01/12] Revise pg_pwrite_zeros()

- add offset parameter
- avoid memset() of zerobuf on every call
- don't initialize the whole IOV array unnecessarily
- don't handle the smaller trailing write in a separate write call
---
 src/include/common/file_utils.h|  2 +-
 src/common/file_utils.c| 67 ++
 src/backend/access/transam/xlog.c  |  2 +-
 src/bin/pg_basebackup/walmethods.c |  2 +-
 4 files changed, 26 insertions(+), 47 deletions(-)

diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index bda6d3a5413..95b5cc8b3a1 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -44,6 +44,6 @@ extern ssize_t pg_pwritev_with_retry(int fd,
 	 int iovcnt,
 	 off_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size);
+extern ssize_t pg_pwrite_zeros(int fd, off_t offset, size_t size);
 
 #endif			/* FILE_UTILS_H */
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152f..a161b1c7bbe 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -537,62 +537,41 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size)
+pg_pwrite_zeros(int fd, off_t offset, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
+	const static PGAlignedBlock zbuffer = {0};		/* worth BLCKSZ */
+	void *zerobuf_addr = unconstify(PGAlignedBlock *, &zbuffer)->data;
 	struct iovec iov[PG_IOV_MAX];
-	int			blocks;
-	size_t		remaining_size = 0;
-	int			i;
-	ssize_t		written;
+	size_t		remaining_size = size;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
-	{
-		iov[i].iov_base = zbuffer.data;
-		iov[i].iov_len = zbuffer_sz;
-	}
-
 	/* Loop, writing as many blocks as we can for each system call. */
-	blocks = size / zbuffer_sz;
-	remaining_size = size % zbuffer_sz;
-	for (i = 0; i < blocks;)
+	while (remaining_size > 0)
 	{
-		int			iovcnt = Min(blocks - i, lengthof(iov));
-		off_t		offset = i * zbuffer_sz;
-
-		written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
-
-		if (written < 0)
-			return written;
-
-		i += iovcnt;
-		total_written += written;
-	}
-
-	/* Now, write the remaining size, if any, of the file with zeros. */
-	if (remaining_size > 0)
-	{
-		/* We'll never write more than one block here */
-		int			iovcnt = 1;
-
-		/* Jump on to the end of previously written blocks */
-		off_t		offset = i * zbuffer_sz;
-
-		iov[0].iov_len = remaining_size;
+		int			iovcnt = 0;
+		ssize_t		written;
+
+		for (; iovcnt < PG_IOV_MAX && remaining_size > 0; iovcnt++)
+		{
+			size_t this_iov_size;
+
+			iov[iovcnt].iov_base = zerobuf_addr;
+
+			if (remaining_size < BLCKSZ)
+this_iov_size = remaining_size;
+			else
+this_iov_size = BLCKSZ;
+
+			iov[iovcnt].iov_len = this_iov_size;
+			remaining_size -= this_iov_size;
+		}
 
 		written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
 
 		if (written < 0)
 			return written;
 
+		offset += written;
 		total_written += written;
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 16:06:24 +0900, Michael Paquier wrote:
> On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote:
> > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have 
> > an
> > offset parameter.  Huh, what led to the function being so constrained?
> 
> Its current set of uses cases, where we only use it now to initialize
> with zeros with WAL segments.  If you have a case that plans to use
> that stuff with an offset, no problem with me.

Then it really shouldn't have been named pg_pwrite_zeros(). The point of the
p{write,read}{,v} family of functions is to be able to specify the offset to
read/write at. I assume the p is for position, but I'm not sure.

Greetings,

Andres Freund




Re: We shouldn't signal process groups with SIGQUIT

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 14:23:32 -0800, Nathan Bossart wrote:
> On Tue, Feb 14, 2023 at 12:47:12PM -0800, Andres Freund wrote:
> > Not really related: I do wonder how often we end up self-deadlocking in
> > quickdie(), due to the ereport() not being reentrant. We'll "fix" it soon
> > after, due to postmaster's SIGKILL.  Perhaps we should turn on
> > send_abort_for_kill on CI?
> 
> +1, this seems like it'd be useful in general.  I'm guessing this will
> require a bit of work on the CI side to generate the backtrace.

They're already generated for all current platforms.  It's possible that debug
info for some system libraries is missing, but the most important ones (like
libc) should be available.

Since yesterday the cfbot website allows to easily find the coredumps, too:
http://cfbot.cputube.org/highlights/core-7.html

There's definitely some work to be done to make the contents of the backtraces
more useful though.  E.g. printing out the program name, the current directory
(although that doesn't seem to be doable for all programs). For everything but
windows that's in src/tools/ci/cores_backtrace.sh.

Greetings,

Andres Freund




Re: We shouldn't signal process groups with SIGQUIT

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 15:38:24 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > ISTM that signal_child() should downgrade SIGQUIT to SIGTERM when sending to
> > the process group. That way we'd maintain the current behaviour for postgres
> > itself, but stop core-dumping archive/restore scripts (as well as other
> > subprocesses that e.g. trusted PLs might create).
> 
> Yeah, I had been thinking along the same lines.  One issue
> is that that means the backend itself will get SIGQUIT and SIGTERM
> in close succession.  We need to make sure that that won't cause
> problems.  It might be prudent to think about what order to send
> the two signals in.

I hope we already deal with that reasonably well - I think it's not uncommon
for that to happen, regardless of this change.

Just naively hacking this behaviour change into the current code, would yield
sending SIGQUIT to postgres, and then SIGTERM to the whole process
group. Which seems like a reasonable order?  quickdie() should _exit()
immediately in the signal handler, so we shouldn't get to processing the
SIGTERM.  Even if both signals are "reacted to" at the same time, possibly
with SIGTERM being processed first, the SIGQUIT handler should be executed
long before the next CFI().


Not really related: I do wonder how often we end up self-deadlocking in
quickdie(), due to the ereport() not being reentrant. We'll "fix" it soon
after, due to postmaster's SIGKILL.  Perhaps we should turn on
send_abort_for_kill on CI?

Greetings,

Andres Freund




We shouldn't signal process groups with SIGQUIT

2023-02-14 Thread Andres Freund
Hi,

The default reaction to SIGQUIT is to create core dumps. We use SIGQUIT to
implement immediate shutdowns. We send the signal to the entire process group.

The result of that is that we regularly produce core dumps for binaries like
sh/cp. I regularly see this on my local system, I've seen it on CI. Recently
Thomas added logic to show core dumps happening in cfbot ([1]). Plenty of
unrelated core dumps, but also lots in sh/cp ([2]).

We found a bunch of issues as part of [3], but I think the issue I'm
discussing here is separate.


ISTM that signal_child() should downgrade SIGQUIT to SIGTERM when sending to
the process group. That way we'd maintain the current behaviour for postgres
itself, but stop core-dumping archive/restore scripts (as well as other
subprocesses that e.g. trusted PLs might create).


Makes sense?


Greetings,

Andres Freund


[1] http://cfbot.cputube.org/highlights/core.html

[2] A small sample:
https://api.cirrus-ci.com/v1/task/5939902693507072/logs/cores.log
https://api.cirrus-ci.com/v1/task/5549174150660096/logs/cores.log
https://api.cirrus-ci.com/v1/task/6153817767542784/logs/cores.log
https://api.cirrus-ci.com/v1/task/6567335205535744/logs/cores.log
https://api.cirrus-ci.com/v1/task/4804998119292928/logs/cores.log

[3] https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz




Re: User functions for building SCRAM secrets

2023-02-14 Thread Andres Freund
Hi,

On 2023-01-23 00:58:40 -0500, Jonathan S. Katz wrote:
> Here is another attempt at this patch that takes into account the SCRAM code
> refactor. I addressed some of Daniel's previous feedback, but will need to
> make another pass on the docs and the assert trace as the main focus of this
> revision was bringing the code inline with the recent changes.

This reliably fails on CI:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3988

I think this is related to encoding issues. The 32bit debian task
intentionally uses LANG=C. Resulting in failures like:
https://api.cirrus-ci.com/v1/artifact/task/6696410851049472/testrun/build-32/testrun/regress/regress/regression.diffs

Windows fails with a similar issue:
https://api.cirrus-ci.com/v1/artifact/task/5676064060473344/testrun/build/testrun/regress/regress/regression.diffs

I've set the patch as waiting on author for now.

- Andres




Re: doc: add missing "id" attributes to extension packaging page

2023-02-14 Thread Andres Freund
Hi,

On 2023-01-26 21:48:54 +0100, Brar Piening wrote:
> On 18.01.2023 at 06:50, Brar Piening wrote:
> 
> > I'll give it a proper look this weekend.
> 
> It turns out I didn't get a round tuit.
> 
> ... and I'm afraid I probably will not be able to work on this until
> mid/end February so we'll have to move this to the next commitfest until
> somebody wants to take it over and push it through.

A small note: As-is this fails on CI, because we don't allow network access
during the docs build anymore (since it always fails these days):

https://cirrus-ci.com/task/5474029402849280?logs=docs_build#L297

[17:02:03.114] time make -s -j${BUILD_JOBS} -C doc
[17:02:04.092] I/O error : Attempt to load network entity 
http://cdn.docbook.org/release/xsl/current/html/sections.xsl
[17:02:04.092] warning: failed to load external entity 
"http://cdn.docbook.org/release/xsl/current/html/sections.xsl";
[17:02:04.092] compilation error: file stylesheet-html-common.xsl line 17 
element import
[17:02:04.092] xsl:import : unable to load 
http://cdn.docbook.org/release/xsl/current/html/sections.xsl

I think this is just due to the common URL in docbook packages being
http://docbook.sourceforge.net/release/xsl/current/*
Because of that the docbook catalog matching logic won't work for that file:

E.g. I have the following in /etc/xml/docbook-xsl.xml, on debian unstable:
<delegateURI uriStartString="http://docbook.sourceforge.net/release/xsl/"
  catalog="file:///usr/share/xml/docbook/stylesheet/docbook-xsl/catalog.xml"/>

As all our other references use the sourceforge address, this should too.

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-11 10:24:37 -0800, Andres Freund wrote:
> Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
> and the pg_stat_io tests.

One thing I started to wonder about since is whether we should remove the io_
prefix from io_object, io_context. The prefixes make sense on the C level, but
it's not clear to me that that's also the case on the table level.

Greetings,

Andres Freund




Re: Force testing of query jumbling code in TAP tests

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 16:04:16 +0900, Michael Paquier wrote:
> On Mon, Feb 13, 2023 at 09:45:12AM -0800, Andres Freund wrote:
> > Shouldn't there at least be some basic verification of pg_stat_statements
> > output being sane after running the test? Even if that's perhaps just 
> > actually
> > printing the statements.
> 
> There is a total of 20k entries in pg_stat_statements if the max is
> high enough to store everything.  Only dumping the first 100
> characters of each query generates at least 1MB worth of logs, which
> would bloat a lot of the buildfarm in each run.  So I would not do
> that.  One thing may be perhaps to show a count of the queries in five
> categories: select, insert, delete, update and the rest?

I didn't mean printing in the sense of outputting the statements to the tap
log. Maybe creating a temp table or such for all the queries. And yes, then
doing some top-level analysis on it like you describe sounds like a good idea.

Greetings,

Andres Freund




Re: Move defaults toward ICU in 16?

2023-02-14 Thread Andres Freund
Hi,

On 2023-02-14 09:48:08 -0800, Jeff Davis wrote:
> On Fri, 2023-02-10 at 18:00 -0800, Andres Freund wrote:
> > But, shouldn't pg_upgrade be able to deal with this? As long as the
> > databases
> > are created with template0, we can create the collations at that
> > point?
> 
> Are you saying that the upgraded cluster could have a different default
> collation for the template databases than the original cluster?

> That would be wrong to do, at least by default, but I could see it
> being a useful option.
> 
> Or maybe I misunderstand what you're saying?

I am saying that pg_upgrade should be able to deal with the difference. The
details of how to implement that don't matter that much.

FWIW, I don't think it matters much what collation template0 has, since we
allow to change the locale provider when using template0 as the template.

We could easily update template0, if we think that's necessary. But I don't
think it really is. As long as the newly created databases have the right
provider, I'd lean towards not touching template0. But whatever...

Greetings,

Andres Freund




Re: Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 18:14:58 -0800, Andrey Borodin wrote:
> On Mon, Feb 13, 2023 at 4:39 PM Andres Freund  wrote:
> >
> > The problem I'm talking about is the increased overhead in InstrStopNode(),
> > due to BufferUsageAccumDiff() getting more expensive.
> >
> 
> Thanks, now I understand the problem better. According to godbolt.com
> my patch doubles the number of instructions in this function. Unless
> we compute only tables\indexes\matviews.
> Anyway, without regarding functionality of this particular patch,
> BufferUsageAccumDiff() does not seem slow to me. It's just a
> branchless bunch of simd instructions. Performance of this function
> might matter only when called gazillion times per second.

It is called gazillions of times per second when you do an EXPLAIN (ANALYZE,
BUFFERS). Every invocation of an executor node calls it.

Here's a quick pgbench, showing todays code, with -O3, without assertions:
   298.396   0  SELECT generate_series(1, 1000) OFFSET 1000;
   397.400   0  EXPLAIN (ANALYZE, TIMING OFF) SELECT generate_series(1, 1000) OFFSET 1000;
   717.238   0  EXPLAIN (ANALYZE, TIMING ON) SELECT generate_series(1, 1000) OFFSET 1000;
   419.736   0  EXPLAIN (ANALYZE, BUFFERS, TIMING OFF) SELECT generate_series(1, 1000) OFFSET 1000;
   761.575   0  EXPLAIN (ANALYZE, BUFFERS, TIMING ON) SELECT generate_series(1, 1000) OFFSET 1000;

The effect ends up a lot larger once you add in joins etc., because they add
additional executor nodes that all have their instrumentation started/stopped.

Greetings,

Andres Freund




Re: Exit walsender before confirming remote flush in logical replication

2023-02-13 Thread Andres Freund
On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote:
> What do you think about the need for explicitly specifying the
> default?  I'm fine with specifying the default using a single word,
> such as WAIT_FOR_REMOTE_FLUSH.

We obviously shouldn't force the option to be present. Why would we want to
break existing clients unnecessarily?  Without it the behaviour should be
unchanged from today's.




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-12 09:31:36 -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
offset parameter.  Huh, what lead to the function being so constrained?

- Andres




Re: pg_walinspect memory leaks

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 15:22:02 -0800, Peter Geoghegan wrote:
> More concretely, it looks like GetWALRecordInfo() calls
> CStringGetTextDatum/cstring_to_text in a way that accumulates way too
> much memory in ExprContext.

Additionally, we leak two stringinfos for each record.


> This could be avoided by using a separate memory context that is reset
> periodically, or something else along the same lines.

Everything other than a per-row memory context that's reset each time seems
hard to manage in this case.

Somewhat funnily, GetWALRecordsInfo() then ends up being unnecessarily
diligent about cleaning up O(1) memory, after not caring about O(N) memory...

Greetings,

Andres Freund




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-14 11:38:06 +1100, Peter Smith wrote:
> No, nothing specific in mind. But maybe like these:
> - tests for causing obscure errors that would never otherwise be
> reached without something deliberately designed to fail a certain way

I think there are some cases around this that could be useful, but also a lot
that wouldn't be.


> - tests for trivial user errors apparently deemed not worth bloating
> the regression tests with -- e.g. many errorConflictingDefElem not
> being called [1].

I don't think it's worth adding tests for all of these. The likelihood of
catching a problem seems quite small.


> - timing-related or error tests where some long (multi-second) delay
> is a necessary part of the setup.

IME that's almost always a sign that the test wouldn't be stable anyway.

Greetings,

Andres Freund




Re: Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 16:36:25 -0800, Andrey Borodin wrote:
> On Mon, Feb 13, 2023 at 4:29 PM Andres Freund  wrote:
> > > 1. Some more increments on hot paths. We have to add this tiny toll to
> > > every single buffer hit, but it will be seldom of any use.
> >
> > Additionally, I bet it slows down EXPLAIN (ANALYZE, BUFFERS) noticeably. 
> > It's
> > already quite expensive...
> >
> 
> I think collection of instrumentation is done unconditionally.
> We always do that
> pgBufferUsage.shared_blks_hit++;
> when the buffer is in shared_buffers.

The problem I'm talking about is the increased overhead in InstrStopNode(),
due to BufferUsageAccumDiff() getting more expensive.

Andres




Re: Inconsistency in reporting checkpointer stats

2023-02-13 Thread Andres Freund


On 2022-12-21 17:14:12 +0530, Nitin Jadhav wrote:
> Kindly review and share your comments.

This doesn't pass the tests, because the regression tests weren't adjusted:

https://api.cirrus-ci.com/v1/artifact/task/5937624817336320/testrun/build/testrun/regress/regress/regression.diffs




Re: Make EXPLAIN generate a generic plan for a parameterized query

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-05 18:24:03 +0100, Laurenz Albe wrote:
> Anyway, attached is v7 that tries to do it that way.

This consistently fails on CI:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3962

https://api.cirrus-ci.com/v1/artifact/task/5156723929907200/testrun/build/testrun/regress/regress/regression.diffs

Greetings,

Andres Freund




Re: Buffer usage detailed by RelKind in EXPLAIN ANALYZE BUFFERS

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 16:23:30 -0800, Andrey Borodin wrote:
> But there are some caveats:
> 1. Some more increments on hot paths. We have to add this tiny toll to
> every single buffer hit, but it will be seldom of any use.

Additionally, I bet it slows down EXPLAIN (ANALYZE, BUFFERS) noticeably. It's
already quite expensive...


> All in all I do not have an opinion if this feature is a good tradeoff.
> What do you think? Does the feature look useful? Do we want a more
> polished implementation?

Unless the above issues could be avoided, I don't think so.

Greetings,

Andres Freund




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-02-13 Thread Andres Freund
Hi,

On 2023-01-26 15:27:20 -0500, Reid Thompson wrote:
> Yes, just a rebase. There is still work to be done per earlier in the
> thread.

The tests recently started to fail:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3867


> I do want to follow up and note re palloc/pfree vs malloc/free that the
> tracking code (0001-Add-tracking-...) is not tracking palloc/pfree but is
> explicitly tracking malloc/free. Not every palloc/pfree call executes the
> tracking code, only those where the path followed includes malloc() or
> free().  Routine palloc() calls fulfilled from the context's
> freelist/emptyblocks/freeblock/etc and pfree() calls not invoking free()
> avoid the tracking code.

Sure, but we create a lot of memory contexts, so that's not a whole lot of
comfort.


I marked this as waiting on author.

Greetings,

Andres Freund




Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-08 13:10:40 +0100, Alvaro Herrera wrote:
> On 2023-Feb-08, Alvaro Herrera wrote:
> 
> > I propose instead the following: each command is prepared just before
> > it's executed, as previously, and if we see a \startpipeline, then we
> > prepare all commands starting with the one just after, and until the
> > \endpipeline.
> 
> Here's the patch.

There's something wrong with the patch, it reliably fails with core dumps:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3260

Example crash:
https://api.cirrus-ci.com/v1/task/4922406553255936/logs/cores.log
https://api.cirrus-ci.com/v1/artifact/task/6611256413519872/crashlog/crashlog-pgbench.EXE_0750_2023-02-13_14-07-06-189.txt

Andres




Re: Support for dumping extended statistics

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-01 04:38:17 +, Hari krishna Maddileti wrote:
> Thanks for the update. I have attached the updated patch, made it
> meson compatible, and addressed warnings from the makefile too.

This patch consistently crashes in CI:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F4114

Example crash:
https://api.cirrus-ci.com/v1/task/4910781754507264/logs/cores.log

Greetings,

Andres Freund




Re: pglz compression performance, take two

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-08 11:16:47 +0100, Tomas Vondra wrote:
> On 2/7/23 21:18, Andres Freund wrote:
> > 
> > Independent of this failure, I'm worried about the cost/benefit analysis of 
> > a
> > pglz change that changes this much at once. It's quite hard to review.
> > 
> 
> I agree.
> 
> I think I managed to understand what the patch does during the review,
> but it's so much harder - it'd definitely be better to have this split
> into smaller parts, somehow. Interestingly enough the commit message
> actually says this:
> 
>   This patch accumulates several changes to pglz compression:
>   1. Convert macro-functions to regular functions for readability
>   2. Use more compact hash table with uint16 indexes instead of pointers
>   3. Avoid prev pointer in hash table
>   4. Use 4-byte comparisons during a search instead of 1-byte
>  comparisons
> 
> Which I think is a pretty good recipe how to split the patch. (And we
> also need a better commit message, or at least a proposal.)
> 
> This'd probably also help when investigating the extra byte issue,
> discussed yesterday. (Assuming it's not related to the invalid access
> reported by valgrind / asan).

Due to the sanitizer changes, and this feedback, I'm marking the entry as
waiting on author.

Greetings,

Andres Freund




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-14 09:26:47 +1100, Peter Smith wrote:
> I've observed suggested test cases get rejected as being overkill, or
> because they would add precious seconds to the test execution. OTOH, I
> felt such tests would still help gain some additional percentages from
> the "code coverage" stats. The kind of tests I am thinking of don't
> necessarily need a huge disk/CPU - but they just take longer to run
> than anyone has wanted to burden the build-farm with.

I'd say it depends on the test whether it's worth adding. Code coverage for its
own sake isn't that useful; tests have to actually verify something useful. And
tests have costs beyond runtime, e.g. new tests tend to fail in some edge
cases.

E.g. just having tests hit more lines, without verifying that the behaviour is
actually correct, only provides limited additional assurance.  It's also not
very useful to add a very expensive test that provides only a very small
additional amount of coverage.

IOW, even if we add more test categories, it'll still be a tradeoff.


> Sorry for the thread interruption -- but I thought this might be the
> right place to ask: What is the recommended way to deal with such
> tests intended primarily for better code coverage?

I don't think that exists today.

Do you have an example of the kind of test you're thinking of?


Greetings,

Andres Freund




Re: recovery modules

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 14:56:47 -0800, Nathan Bossart wrote:
> On Mon, Feb 13, 2023 at 04:37:10PM +0900, Michael Paquier wrote:
> > +   basic_archive_context = data->context;
> > +   if (CurrentMemoryContext == basic_archive_context)
> > +   MemoryContextSwitchTo(TopMemoryContext);
> > +
> > +   if (MemoryContextIsValid(basic_archive_context))
> > +   MemoryContextDelete(basic_archive_context);
> > 
> > This is a bit confusing, because it means that we enter in the
> > shutdown callback with one context, but exit it under
> > TopMemoryContext.  Are you sure that this will be OK when there could
> > be multiple callbacks piled up with before_shmem_exit()?  shmem_exit()
> > has nothing specific to memory contexts.
> 
> Well, we can't free the memory context while we are in it, so we have to
> switch to another one.  I agree that this is a bit confusing, though.

Why would we be in that memory context? I'd just add an assert documenting
we're not.


> On second thought, I'm not sure it's important to make sure the state is
> freed in the shutdown callback.  It's only called just before the archiver
> process exits, so we're not really at risk of leaking anything.  I suppose
> we might not always restart the archiver in this case, but I also don't
> anticipate that behavior changing in the near future.  I think this
> callback is more useful for things like shutting down background workers.

I think it's crucial. Otherwise we're ossifying the design that there's
just one archive module active at a time.


> I went ahead and removed the shutdown callback from basic_archive and the
> note about leaking from the documentation.

-1


Greetings,

Andres Freund




Re: old_snapshot_threshold bottleneck on replica

2023-02-13 Thread Andres Freund
Hi,

On 2023-01-24 10:46:28 -0500, Robert Haas wrote:
> On Mon, Jan 23, 2023 at 9:40 AM Maxim Orlov  wrote:
> > One of our customers stumbled onto a significant performance degradation 
> > while running multiple OLAP-like queries on a replica.
> > After some investigation, it became clear that the problem is in accessing 
> > old_snapshot_threshold parameter.
>
> It has been suggested that we remove that feature entirely.

Indeed. There are a lot of things wrong with it. We have reproducers for
creating wrong query results. Nobody has shown interest in fixing the
problems, for several years now. It costs performance even for users that
*do not* use the feature (*).

I think we're doing our users a disservice by claiming to have this feature.

I don't think a lot of the existing code would survive if we were to create a
newer, more maintainable / reliable version of the feature.

Greetings,

Andres Freund

(*) E.g. TestForOldSnapshot() is called in a good number of places, and emits
quite a bit of code. It's not executed, but the emitted code is large
enough to lead to worse code being generated.




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
On 2023-02-13 14:55:45 -0500, Tom Lane wrote:
> bigdisk, bigcpu?

Works for me.

I'll probably just add bigdisk as part of adding a test for bulk relation
extensions, mentioning in a comment that we might want bigcpu if we have a
test for it?




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 14:15:24 -0500, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-02-13 13:45:41 -0500, Stephen Frost wrote:
> > > Are there existing tests that we should add into that set that you're
> > > thinking of..?  I've been working with the Kerberos tests and that's
> > > definitely one that seems to fit this description...
> > 
> > I think the kerberos tests are already opt-in, so I don't think we need to
> > gate it further.
> 
> I'd like to lump them in with a bunch of other tests though, to give it
> more chance to run..  My issue currently is that they're *too* gated.

Isn't the reason that we gate them that much that the test poses a security
hazard on a multi-user system?

I don't think we should combine opting into security hazards with opting into
using disk space.


FWIW, the kerberos tests run on all CI OSs other than windows. I have
additional CI coverage for openbsd and netbsd in a separate branch, providing
further coverage - but I'm not sure we want those additional covered OSs in
core PG.

I think the tests for kerberos run frequently enough in practice. I don't know
how good the coverage they provide is, though, but that's a separate aspect to
improve anyway.


> > I guess there's an argument to be made that we should use this for e.g.
> > 002_pg_upgrade.pl or 027_stream_regress.pl - but I think both of these test
> > pretty fundamental behaviour like WAL replay, which unfortunately is pretty
> > easy to break, so I'd be hesitant.
> 
> Hm.  If you aren't playing with that part of the code though, maybe it'd
> be nice to not run them.

It's surprisingly easy to break it accidentally...


> The pg_dump tests might also make sense to segregate out as they can add up
> to be a lot, and there's more that we could and probably should be doing
> there.

IMO the main issue with the pg_dump tests is their verbosity, rather than the
runtime... ~8.8k subtests is a lot.

find . -name 'regress_log*'|xargs -n 1 wc -l|sort -nr|head -n 5|less
12712 ./testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump
5124 ./testrun/pg_rewind/002_databases/log/regress_log_002_databases
1928 ./testrun/pg_rewind/001_basic/log/regress_log_001_basic
1589 ./testrun/recovery/017_shm/log/regress_log_017_shm
1077 ./testrun/pg_rewind/004_pg_xlog_symlink/log/regress_log_004_pg_xlog_symlink


> > I guess we could stop running the full regression tests in 002_pg_upgrade.pl
> > if !large?
> 
> Perhaps... but then what are we testing?

There's plenty to pg_upgrade other than the pg_dump comparison aspect.


I'm not sure it's worth spending too much energy finding tests that we can run
less commonly than now. We've pushed back on tests using lots of resources so
far, so we don't really have them...

Greetings,

Andres Freund




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 13:54:59 -0500, Tom Lane wrote:
> Bikeshedding a bit ... is "large" the right name?  It's not awful but
> I wonder if there is a better one

I did wonder about that too. But didn't come up with something more apt.


> it seems like this class could eventually include tests that run a long time
> but don't necessarily eat disk space.  "resource-intensive" is too long.

I'm not sure we'd want to combine time-intensive and disk-space-intensive tests
in the same category. Availability of disk space and cpu cycles don't have to
correlate that well.

lotsadisk, lotsacpu? :)

Greetings,

Andres Freund




Re: Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 13:45:41 -0500, Stephen Frost wrote:
> Are there existing tests that we should add into that set that you're
> thinking of..?  I've been working with the Kerberos tests and that's
> definitely one that seems to fit this description...

I think the kerberos tests are already opt-in, so I don't think we need to
gate it further.

Maybe the pgbench tests?

I guess there's an argument to be made that we should use this for e.g.
002_pg_upgrade.pl or 027_stream_regress.pl - but I think both of these test
pretty fundamental behaviour like WAL replay, which unfortunately is pretty
easy to break, so I'd be hesitant.

I guess we could stop running the full regression tests in 002_pg_upgrade.pl
if !large?

Greetings,

Andres Freund




Adding "large" to PG_TEST_EXTRA

2023-02-13 Thread Andres Freund
Hi,

I'm working on rebasing [1], my patch to make relation extension scale
better.

As part of that I'd like to add tests for relation extension. To be able to
test the bulk write strategy path, we need to have a few backends concurrently
load > 16MB files.

It seems pretty clear that doing that on all buildfarm machines wouldn't be
nice / welcome. And it also seems likely that this won't be the last case
where that'd be useful.

So I'd like to add a 'large' class to PG_TEST_EXTRA, that we can use in tests
that we only want to execute on machines with sufficient resources.

Makes sense?

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20221029025420.eplyow6k7tgu6he3%40awork3.anarazel.de




Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 14:06:57 +0530, Amit Kapila wrote:
> > > The patch calls update_progress in change_cb_wrapper and other
> > > wrappers which will miss the case of DDLs that generates a lot of data
> > > that is not processed by the plugin. I think for that we either need
> > > to call update_progress from reorderbuffer.c similar to what the patch
> > > has removed or we need some other way to address it. Do you have any
> > > better idea?
> >
> > I don't mind calling something like update_progress() in the specific cases
> > that's needed, but I think those are just the
> >   if (!RelationIsLogicallyLogged(relation))
> >   if (relation->rd_rel->relrewrite && !rb->output_rewrites))
> >
> > To me it makes a lot more sense to call update_progress() for those, rather
> > than generally.
> >
> 
> Won't it be better to call it wherever we don't invoke any wrapper
> function like for cases REORDER_BUFFER_CHANGE_INVALIDATION, sequence
> changes, etc.? I was thinking that wherever we don't call the wrapper
> function which means we don't have a chance to invoke
> update_progress(), the timeout can happen if there are a lot of such
> messages.

ISTM that the likelihood of causing harm due to increased overhead is higher
than the gain.

Greetings,

Andres Freund




Re: Rework LogicalOutputPluginWriterUpdateProgress

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 08:22:34 +0530, Amit Kapila wrote:
> On Sat, Feb 11, 2023 at 3:04 AM Andres Freund  wrote:
> >
> > > One difference I see with the patch is that I think we will end up
> > > sending keepalive for empty prepared transactions even though we don't
> > > skip sending begin/prepare messages for those.
> >
> > With the proposed approach we reliably know whether a callback wrote
> > something, so we can tune the behaviour here fairly easily.
> >
> 
> I would like to clarify a few things about the proposed approach. In
> commit_cb_wrapper()/prepare_cb_wrapper(), the patch first did
> ctx->did_write = false;, then call the commit/prepare callback (which
> will call pgoutput_commit_txn()/pgoutput_prepare_txn()) and then call
> update_progress() which will make decisions based on ctx->did_write
> flag. Now, for this to work pgoutput_commit_txn/pgoutput_prepare_txn
> should know that the transaction has performed some writes before that
> call which is currently working because pgoutput is tracking the same
> via sent_begin_txn.

I don't really see these as being related. What pgoutput does internally to
optimize for some usecases shouldn't matter to the larger infrastructure.


> Is the intention here that we still track whether BEGIN () has been sent via
> pgoutput?

Yes. If somebody later wants to propose tracking this alongside a txn and
passing that to the output plugin callbacks, we can do that. But that's
independent of fixing the broken architecture of the progress infrastructure.

Greetings,

Andres Freund




Re: proposal: psql: psql variable BACKEND_PID

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 12:52:23 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > E.g. I fire off a query, it's slower than I'd like, I want to attach perf. Of
> > course I can establish a separate connection, query pg_stat_activity there,
> > and then perf. But that requires manually filtering pg_stat_activity to find
> > the query.
> 
> ... in this case, the problem is that the session is tied up doing the
> slow query.  You can't run "select pg_backend_pid()", but you can't
> extract a psql variable value either.  If you had the foresight to
> set up a PROMPT, or to collect the PID earlier, you're good.  But I'm
> still not seeing where a psql variable makes that easier.

I guess you could argue that referencing BACKEND_PID in PROMPT would be more
readable. But that's about it.

Greetings,

Andres Freund




Re: Force testing of query jumbling code in TAP tests

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 14:00:36 +0900, Michael Paquier wrote:
> With this addition, the query jumbling gets covered at 95%~, while
> https://coverage.postgresql.org/src/backend/nodes/index.html reports
> currently 35%.
> 
> Thoughts or comments?

Shouldn't there at least be some basic verification of pg_stat_statements
output being sane after running the test? Even if that's perhaps just actually
printing the statements.

Greetings,

Andres Freund




Re: proposal: psql: psql variable BACKEND_PID

2023-02-13 Thread Andres Freund
Hi,

On 2023-02-13 18:06:23 +0100, Peter Eisentraut wrote:
> On 09.02.23 10:11, Pavel Stehule wrote:
> > first and main (for me) - I can use psql variables tab complete - just
> > :B - it is significantly faster
> > second - I can see all connection related information by \set
> > third - there is no hook on reconnect in psql - so if you implement
> > BACKEND_PID yourself, you must ensure to re-run pg_backend_pid() after
> > any reconnect or connection change.
> > 
> > It is clean so you can run "select pg_backend_pid() AS "BACKEND_PID"
> > \gset" and you can store it to .psqlrc. But most of the time I am in
> > customer's environment, and I have the time, possibility to do a
> > complete setup of .psqlrc. It looks (for me) like a generally useful
> > feature to be everywhere.
> 
> But what do you need the backend PID for in the first place?

For me it's using gdb, pidstat, strace, perf, ...

But for those, %p in the PROMPTs is more useful.


> Of course, you might want to use it to find your own session in
> pg_stat_activity or something like that, but then you're already in a query
> and can use pg_backend_pid().  What do you need the backend PID for outside
> of such a query?

E.g. I fire off a query, it's slower than I'd like, I want to attach perf. Of
course I can establish a separate connection, query pg_stat_activity there,
and then perf. But that requires manually filtering pg_stat_activity to find
the query.

Greetings,

Andres Freund



