Re: BF animal malleefowl reported a failure in 001_password.pl

2023-01-13 Thread Tom Lane
"houzj.f...@fujitsu.com"  writes:
> I noticed one BF failure[1] when monitoring the BF for some other commit.
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=malleefowl&dt=2023-01-13%2009%3A54%3A51
> ...
> So it seems the connection happens before pg_ident.conf is actually reloaded?
> Not sure if we need to do something to make sure the reload happens, because it
> looks like a very rare failure which hasn't happened in the last 90 days.

That does look like a race condition between config reloading and
new-backend launching.  However, I can't help being suspicious about
the fact that we haven't seen this symptom before and now here it is
barely a day after 7389aad63 (Use WaitEventSet API for postmaster's
event loop).  It seems fairly plausible that that did something that
causes the postmaster to preferentially process connection-accept ahead
of SIGHUP.  I took a quick look through the code and did not see a
smoking gun, but I'm way too tired to be sure I didn't miss something.

In general, use of WaitEventSet instead of signals will tend to slot
the postmaster into non-temporally-ordered event responses in two
ways: (1) the latch.c code will report events happening at more-or-less
the same time in a specific order, and (2) the postmaster.c code will
react to signal-handler-set flags in a specific order.  AFAICS, both
of those code layers will prioritize latch events ahead of
connection-accept events, but did I misread it?

Also it seems like the various platform-specific code paths in latch.c
could diverge as to the priority order of events, which could cause
annoying platform-specific behavior.  Not sure there's much to be
done there other than to be sensitive to not letting such divergence
happen.
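The ordering concern can be sketched in isolation. The toy code below (our own names and types, not latch.c's actual API) shows the kind of fixed-priority event dispatch that would avoid the suspected race: when a latch event (e.g. a pending SIGHUP/config reload) and a connection-accept event are reported at the same time, the latch event is always serviced first, regardless of the order the kernel returned them in.

```c
#include <assert.h>

/* Hypothetical event kinds, loosely modeled on a WaitEventSet result set. */
typedef enum { EV_LATCH, EV_SOCKET_ACCEPT } EventKind;

typedef struct { EventKind kind; } Event;

/*
 * Return the index of the event to service first.  Pending latch events
 * (config reload, child exit, etc.) are drained before any connection-accept
 * event, no matter in which order they appear in the reported array.
 */
static int
pick_next_event(const Event *events, int nevents)
{
    for (int i = 0; i < nevents; i++)
        if (events[i].kind == EV_LATCH)
            return i;
    return nevents > 0 ? 0 : -1;
}
```

If both layers (latch.c's reporting order and postmaster.c's flag-checking order) follow a rule like this, a reload signal delivered before a connection can never be processed after it.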

regards, tom lane




Re: fixing CREATEROLE

2023-01-13 Thread vignesh C
On Tue, 10 Jan 2023 at 23:16, Robert Haas  wrote:
>
> On Thu, Jan 5, 2023 at 2:53 PM Robert Haas  wrote:
> > On Tue, Jan 3, 2023 at 3:11 PM Robert Haas  wrote:
> > > Committed and back-patched 0001 with fixes for the issues that you 
> > > pointed out.
> > >
> > > Here's a trivial rebase of the rest of the patch set.
> >
> > I committed 0001 and 0002 after improving the commit messages a bit.
> > Here's the remaining two patches back. I've done a bit more polishing
> > of these as well, specifically in terms of fleshing out the regression
> > tests. I'd like to move forward with these soon, if nobody's too
> > vehemently opposed to that.
>
> Done now.

I'm not sure if any work is left here. If there is nothing more to do,
can we close this?

Regards,
Vignesh




Re: Removing redundant grouping columns

2023-01-13 Thread vignesh C
On Sat, 31 Dec 2022 at 02:32, Tom Lane  wrote:
>
> I wrote:
> > Richard Guo  writes:
> >> While we are here, I wonder if we can do the same trick for
> >> distinctClause, to cope with cases like
> >> select distinct a.x, b.y from a, b where a.x = b.y;
>
> > We do that already, no?
>
> Oh, wait, I see what you mean: we are smart in code paths that rely
> on distinct_pathkeys, but not in the hash-based code paths.  Right,
> that can be fixed the same way.  0001 attached is the same as before,
> 0002 adds similar logic for the distinctClause.
>
> The plan change in expected/pg_trgm.out is surprising at first
> glance, but I believe it's correct: the item that is being
> dropped is a parameterless STABLE function, so its value is not
> supposed to change for the duration of the scan.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch ./v3-0002-remove-redundant-DISTINCT.patch

Hunk #4 FAILED at 4704.

1 out of 10 hunks FAILED -- saving rejects to file
src/backend/optimizer/plan/planner.c.rej

[1] - http://cfbot.cputube.org/patch_41_4083.log

Regards,
Vignesh




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-13 Thread vignesh C
On Wed, 4 Jan 2023 at 17:32, David Geier  wrote:
>
> I fixed the compilation error on CFBot.
> I missed adding instr_time.c to the Meson makefile.
> New patch set attached.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch

patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1] - http://cfbot.cputube.org/patch_41_3751.log

Regards,
Vignesh




Re: Pluggable toaster

2023-01-13 Thread vignesh C
On Sun, 8 Jan 2023 at 01:40, Nikita Malakhov  wrote:
>
> Hi!
>
> Thank you for your attention.
> I've rebased the patchset onto the latest master (from 07.01), but the
> second part is still in pre-patch shape - it is working, but tests are
> failing due to changes in TOAST relations logic - I haven't adapted 'em yet.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
c44f6334ca6ff6d242d9eb6742441bc4e1294067 ===
=== expanding ./0002_toaster_default_v25.patch.gz
=== expanding ./0001_toaster_interface_v25.patch.gz
=== expanding ./0004_drop_toaster_v25.patch.gz
=== expanding ./0003_pg_toastrel_control_v25.patch.gz
=== applying patch ./0001_toaster_interface_v25.patch

patching file src/include/postgres.h
Hunk #1 succeeded at 80 with fuzz 2 (offset 4 lines).
Hunk #2 FAILED at 148.
Hunk #3 FAILED at 315.
Hunk #4 FAILED at 344.
Hunk #5 FAILED at 359.
4 out of 5 hunks FAILED -- saving rejects to file src/include/postgres.h.rej

[1] - http://cfbot.cputube.org/patch_41_3490.log

Regards,
Vignesh




Re: Operation log for major operations

2023-01-13 Thread vignesh C
On Mon, 5 Dec 2022 at 13:42, Dmitry Koval  wrote:
>
> Hi!
>
>  >I think storing this in pg_control is a bad idea.  That file is
>  >extremely critical and if you break it, you're pretty much SOL on
>  >recovering your data.  I suggest that this should use a separate file.
>
> Thanks. Operation log location changed to 'global/pg_control_log' (if
> the name is not pretty, it can be changed).
>
> I attached the patch (v2-0001-Operation-log.patch) and description of
> operation log (Operation-log.txt).

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch ./v2-0001-Operation-log.patch

patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 134.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1] - http://cfbot.cputube.org/patch_41_4018.log

Regards,
Vignesh




Re: backup.sgml typo

2023-01-13 Thread Amit Kapila
On Sat, Jan 14, 2023 at 7:32 AM Tatsuo Ishii  wrote:
>
> There seems to be a small typo in backup.sgml
> (archive_command is unnecessarily
> repeated). Attached is a patch to fix that.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-13 Thread vignesh C
On Thu, 12 Jan 2023 at 21:09, Takamichi Osumi (Fujitsu)
 wrote:
>
> On Thursday, January 12, 2023 12:04 PM Kyotaro Horiguchi 
>  wrote:
> > At Wed, 11 Jan 2023 12:46:24 +, "Hayato Kuroda (Fujitsu)"
> >  wrote in
> > > them. Which version is better?
> >
> >
> > Some comments by a quick look, different from the above.
> Horiguchi-san, thanks for your review !
>
>
> > + CONNECTION 'host=192.168.1.50 port=5432 user=foo
> > dbname=foodb'
> >
> > I understand that we (not PG people, but IT people) are supposed to use in
> > documents a certain set of special addresses that is guaranteed not to be
> > routed in the field.
> >
> > > TEST-NET-1 : 192.0.2.0/24
> > > TEST-NET-2 : 198.51.100.0/24
> > > TEST-NET-3 : 203.0.113.0/24
> >
> > (I found 192.83.123.89 in the postgres_fdw doc, but it'd be another issue..)
> Fixed. If necessary we can create another thread for this.
>
> > + if (strspn(tmp, "-0123456789 ") == strlen(tmp))
> >
> > Do we need to bother spending another memory block for apparent non-digits
> > here?
> Yes. The characters are necessary to handle an issue reported in [1].
> The issue happened if the user inputs a negative value,
> then the length comparison became different between strspn and strlen
> and the input value was recognized as seconds, when
> the unit wasn't described. This led to a wrong error message for the user.
>
> Adding those characters solves the issue.
>
> > + errmsg(INT64_FORMAT " ms
> > is outside the valid range for parameter
> > +\"%s\"",
> >
> > We don't use INT64_FORMAT in translatable message strings. Cast then
> > use %lld instead.
> Thanks for teaching us. Fixed.
>
> > This message looks unfriendly as it doesn't suggest the valid range, and it
> > shows the input value in a different unit from what was in the input. I think
> > we can spell it as "\"%s\" is outside the valid range for subscription
> > parameter \"%s\" (0 ..  in millisecond)"
> Makes sense. I incorporated the valid range with the aligned format of 
> recovery_min_apply_delay.
> FYI, the physical replication's GUC doesn't write the units for the range 
> like below.
> I followed and applied this style.
>
> ---
> LOG:  -1 ms is outside the valid range for parameter 
> "recovery_min_apply_delay" (0 .. 2147483647)
> FATAL:  configuration file 
> "/home/k5user/new/pg/l/make_v15/slave/postgresql.conf" contains errors
> ---
>
> > + int64   min_apply_delay;
> > ..
> > + if (ms < 0 || ms > INT_MAX)
> >
> > Why is the variable wider than required?
> You are right. Fixed.
>
> > + errmsg("%s and %s are mutually
> > exclusive options",
> > +"min_apply_delay > 0",
> > "streaming = parallel"));
> >
> > Mmm. Couldn't we refuse 0 as min_apply_delay?
> Sorry, the previous patch's behavior wasn't consistent with this error 
> message.
>
> In the previous patch, if we conducted alter subscription
> with stream = parallel and min_apply_delay = 0 (from a positive value) at the 
> same time,
> the alter command failed, although this should succeed by this time-delayed 
> feature specification.
> We fixed this part accordingly by some more tests in AlterSubscription().
>
> By the way, we should allow users to change min_apply_delay to 0
> whenever they want from a different value. So we didn't restrict
> this kind of operation.
>
> > + sub->minapplydelay > 0)
> > ...
> > + if (opts.min_apply_delay > 0 &&
> >
> > Is there any reason for the differentiation?
> Yes. The former is the object for an existing subscription configuration.
> For example, if we alter subscription with setting streaming = 'parallel'
> for a subscription created with min_apply_delay = '1 day', we
> need to reject the alter command. The latter is new settings.
>
>
> > +
> >   errmsg("cannot set %s for subscription with %s",
> > +
> > "streaming = parallel", "min_apply_delay > 0"));
> >
> > I think that this should sound more human. Say, "cannot enable
> > min_apply_delay for subscription in parallel streaming mode" or something..
> > The same is applicable to the nearby message.
> Reworded the error messages. Please check.
>
> > +static void maybe_delay_apply(TimestampTz ts);
> >
> >   apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
> > -XLogRecPtr lsn)
> > +XLogRecPtr lsn, TimestampTz ts)
> >
> > "ts" looks too generic. Couldn't it be more specific?
> > We need an explanation for the parameter in the function comment.
> Changed it to finish_ts, since it indicates commit/prepare time.
> This terminology should be aligned with finish lsn.
>
> > + if (!am_parallel_apply_worker())
> > + {
> > + Assert(ts > 0);
> > + 
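As a side note on the strspn() check discussed earlier in this message, a minimal standalone sketch (the helper name is ours) shows why '-' has to be in the accepted character set: without it, a negative input such as "-1" fails the strspn == strlen comparison and takes the "value with a unit" parsing path, producing the wrong error message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * An interval value is treated as a bare (unit-less) number of milliseconds
 * only when it consists entirely of digits, spaces, and minus signs.  Note
 * that dropping '-' from the set would make "-1" look like it carries a
 * unit, because strspn() would stop at the minus sign.
 */
static bool
looks_like_bare_number(const char *tmp)
{
    return strspn(tmp, "-0123456789 ") == strlen(tmp);
}
```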

Re: Add BufFileRead variants with short read and EOF detection

2023-01-13 Thread Amit Kapila
On Thu, Jan 12, 2023 at 2:44 PM Peter Eisentraut
 wrote:
>
> On 10.01.23 07:20, Amit Kapila wrote:
> > Yeah, we can do that but not sure if it is worth doing any of those
> > because there are already other places that don't use the exact
> > context.
>
> Ok, updated patches attached.
>

Both the patches look good to me.

-- 
With Regards,
Amit Kapila.




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 19:39:41 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 6:09 PM Andres Freund  wrote:
> > I don't think the split is right. There's too much in 0001 - it's basically
> > introducing the terminology of 0002 already. Could you make it a much more
> > minimal change?
>
> Okay.
>
> I thought that you might say that. I really just wanted to show you
> how small the code footprint was for the more controversial part.

I don't mind you splitting this into three parts ;)



> I believe that we agree on most things when it comes to VACUUM, but
> I'm pretty sure that we still disagree about the value in setting
> autovacuum_freeze_max_age very high. I continue to believe that
> setting it over a billion or so is just a bad idea. I'm mentioning
> this only because it might give you some idea of where I'm coming from
> -- in general I believe that age-based settings often have only very
> weak relationships with what actually matters.

I think part of our difference around a high autovacuum_freeze_max_age are due
to things you're trying to address here - if no-auto-cancel is a separate
threshold from autovacuum_freeze_max_age, it's less problematic to set
autovacuum_freeze_max_age to something lower.

But yes, there's a remaining difference of opinion / experience.



> It might make sense to always give a small fixed amount of headroom
> when autovacuum_freeze_max_age is set to very high values. Maybe just
> 5 million XIDs/MXIDs. That would probably be just as effective as
> (say) 500 million in almost all cases. But then you have to accept
> another magic number.

I suspect that most systems with a high autovacuum_freeze_max_age use it
because 200m leads to too frequent autovacuums - a few million won't do much
for those.


How about a float autovacuum_no_auto_cancel_age where positive values are
treated as absolute values, and negative values are a multiple of
autovacuum_freeze_max_age? And where the "computed" age is capped at
vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.

And maybe a default setting of -1.8 or so?

If a user chooses to set autovacuum_no_auto_cancel_age and vacuum_failsafe_age
to 2.1 billion, oh well, that's really not our problem.
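That proposed rule can be sketched as follows (all names are hypothetical; this is just an illustration of the described semantics, not an implementation): positive settings are taken as an absolute XID age, negative settings as a multiple of autovacuum_freeze_max_age, and the computed age is capped at vacuum_failsafe_age.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the effective no-auto-cancel age from a float setting:
 *   setting >= 0  ->  absolute age
 *   setting <  0  ->  (-setting) * autovacuum_freeze_max_age
 * The result is capped at vacuum_failsafe_age, so that a failsafe
 * autovacuum is never auto-cancellable.
 */
static int64_t
compute_no_cancel_age(double setting, int64_t freeze_max_age,
                      int64_t failsafe_age)
{
    int64_t age;

    if (setting >= 0)
        age = (int64_t) setting;
    else
        age = (int64_t) (-setting * (double) freeze_max_age);

    return age < failsafe_age ? age : failsafe_age;
}
```

With the suggested default of -1.8 and the stock autovacuum_freeze_max_age of 200 million, this would put the no-cancel threshold at 360 million XIDs.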



> > 1) It regularly scares the crap out of users, even though it's normal.  This
> >is further confounded by failsafe autovacuums, where a scared reaction is
> >appropriate, not being visible in pg_stat_activity.
>
> The docs actually imply that when the system reaches the point of
> entering xidStopLimit mode, you might get data corruption. Of course
> that's not true (the entire point of xidStopLimit is to avoid that),
> but apparently we like to keep users on their toes.

Well, historically there wasn't all that much protection. And I suspect there
still might be some dragons. We really need to get rid of the remaining places
that cache 32bit xids across transactions.



> > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
> >also regularly causes outages, and they make it more likely that an
> >eventual age-based vacuum will take forever.
>
> Really? Outages? I imagine that you'd have to be constantly hammering
> the table with DDL before it could happen. That's possible, but it
> seems relatively obvious that doing that is asking for trouble.

Yes, due to queries slowing down due to the bloat, or due to running out of
space.

I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
even DDL, they just needed to make the table readonly, for a brief moment, a
few times a day. The problem started once the indexes grew to be large enough
that the time for (re-)finding dead tuples plus the index scan phase got large
enough that they were unlucky to be killed a few times in a row. After that
autovac never got through the index scan phase. Not doing any "new" work, just
collecting the same dead tids over and over, scanning the indexes for those
tids, never quite finishing.


You really don't need that frequent lock conflicts. It just needs to be more
frequent than what the index scans portion of vacuuming takes, which can be a
long time with large indexes.



> > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
> > into an age and anti-cancel triggers. And could be improved by reporting
> > failsafe autovacuums in pg_stat_activity.
>
> What you call aspect 2 (the issue with disastrous HW lock traffic jams
> involving TRUNCATE being run from a cron job, etc) is a big goal of
> mine for this patch series. You seem unsure of how effective my
> approach (or an equally simple approach based on table age heuristics)
> will be. Is that the case?

I was primarily concerned about situations where an admin interactively was
trying to get rid of the table holding back the xid horizon, after some
problem caused a lot of work to pile up.

If you know postgres 

Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Andrey Borodin
On Fri, Jan 13, 2023 at 7:35 PM Jose Arthur Benetasso Villanova
 wrote:
>
> Hello again. I see the change. Thanks
>

Thanks! I also found out that there was a CI complaint about amcheck.h
not including some necessary stuff. Here's a version with a fix for
that.

Best regards, Andrey Borodin.


v21-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v21-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v21-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Peter Geoghegan
On Fri, Jan 13, 2023 at 6:09 PM Andres Freund  wrote:
> I don't think the split is right. There's too much in 0001 - it's basically
> introducing the terminology of 0002 already. Could you make it a much more
> minimal change?

Okay.

I thought that you might say that. I really just wanted to show you
how small the code footprint was for the more controversial part.

> IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't
> allow to distinguish most of them from a plain autovacuum.
>
> Seems pretty easy to display the trigger from 0001 in
> autovac_report_activity()? You'd have to move the AutoVacType -> translated
> string mapping into a separate function. That seems like a good idea anyway,
> the current coding makes translators translate several largely identical
> strings that just differ in one part.

I'll look into it. It's on my TODO list now.

> seems quite confusing / non-principled. What's the logic behind the auto
> cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is
> 1 billion, the cutoff is set to 1 billion? That just makes no sense to me.

Why is the default for autovacuum_freeze_max_age 200 million? I think
that it's because 200 million is 10% of 2 billion. And so it is for
this cap. 1 billion is 50% of 2 billion. It's just numerology. It
doesn't make sense to me either.

Of course it's not *completely* arbitrary. Obviously the final auto
cancel threshold needs to be at least somewhat greater than
freeze_max_age for any of this to make any sense at all. And, we
should ideally have a comfortable amount of slack to work with, so
that things like moderate (not pathological) autovacuum worker
starvation isn't likely to defeat our attempts at avoiding the
no-auto-cancel behavior for no good reason. Finally, once we get to a
certain table age it starts to seem like a bad idea to not be
conservative about auto cancellations.

I believe that we agree on most things when it comes to VACUUM, but
I'm pretty sure that we still disagree about the value in setting
autovacuum_freeze_max_age very high. I continue to believe that
setting it over a billion or so is just a bad idea. I'm mentioning
this only because it might give you some idea of where I'm coming from
-- in general I believe that age-based settings often have only very
weak relationships with what actually matters.

It might make sense to always give a small fixed amount of headroom
when autovacuum_freeze_max_age is set to very high values. Maybe just
5 million XIDs/MXIDs. That would probably be just as effective as
(say) 500 million in almost all cases. But then you have to accept
another magic number.

> I think this'd be a lot more readable if you introduced a separate variable
> for the "no-cancel" threshold, rather than overloading freeze_max_age with
> different meanings. And you should remove the confusing "lowering" of the
> cutoff.  Maybe something like
>
> no_cancel_age = freeze_max_age;
> if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE)
> {
> /* multiply by two, but make sure to not exceed 
> ANTIWRAPAROUND_MAX_AGE */
> no_cancel_age = Min((uint32)ANTIWRAPAROUND_MAX_AGE, 
> (uint32)no_cancel_age * 2);
> }

I'm happy to do it that way, but let's decide what the algorithm
itself should be first. Or let's explore it a bit more, at least.

> > The only mechanism that the patch changes is related to "prevent
> > auto-cancellation" behaviors -- which is now what the term
> > "antiwraparound" refers to.
>
> Not sure that redefining what a long-standing name refers to is helpful. It
> might be best to retire it and come up with new names.

I generally try to avoid bike-shedding, and naming things is the
ultimate source of bike shedding. I dread having to update the docs
for this stuff, too. The docs in this area (particularly "Routine
Vacuuming") are such a mess already. But perhaps you're right.

> 1) It regularly scares the crap out of users, even though it's normal.  This
>is further confounded by failsafe autovacuums, where a scared reaction is
>appropriate, not being visible in pg_stat_activity.

The docs actually imply that when the system reaches the point of
entering xidStopLimit mode, you might get data corruption. Of course
that's not true (the entire point of xidStopLimit is to avoid that),
but apparently we like to keep users on their toes.

>I suspect that learning that "vacuum to prevent wraparound" isn't a
>problem, contributes to people later ignoring "must be vacuumed within ..."
>WARNINGS, which I've seen plenty times.

That point never occurred to me, but it makes perfect intuitive sense
that users would behave that way. This phenomenon is sometimes called
alarm fatigue. It can be quite dangerous to warn people about
non-issues "out of an abundance of caution".

> 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
>also regularly causes outages, and they make it more likely that an
>eventual 

Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Jose Arthur Benetasso Villanova



On Fri, 13 Jan 2023, Andrey Borodin wrote:


On Fri, Jan 13, 2023 at 3:46 AM Jose Arthur Benetasso Villanova
 wrote:


The only thing that I found is the gin_index_parent_check function in docs
still references the "gin_index_parent_check(index regclass,
heapallindexed boolean) returns void"



Correct! Please find the attached fixed version.

Thank you!

Best regards, Andrey Borodin.



Hello again. I see the change. Thanks

--
Jose Arthur Benetasso Villanova




Re: How to find the number of cached pages for a relation?

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 17:28:31 -0800, Amin wrote:
> Before scanning a relation, in the planner stage, I want to make a call to
> retrieve information about how many pages will be a hit for a specific
> relation. The module pg_buffercache seems to be doing a similar thing.
> Also, pg_statio_all_tables seems to be having that information, but it is
> updated after execution. However, I want the information before execution.
> Also not sure how pg_statio_all_tables is created and how I can access it
> in the code.

There's no cheap way to do that. Currently the only ways are to:

a) Do one probe of the buffer mapping table for each block of the
   relation. Thus O(#relation blocks).

b) Scan all of buffer headers, check which are for the relation. Thus
   O(#NBuffers)

Neither of which are a good idea during planning.


It might be a bit more realistic to get very rough estimates:

You could compute the table's historic cache hit ratio from pgstats (i.e. use
the data backing pg_statio_all_tables). Of course that's not going to be
specific to your query (for index scans etc), and might have changed more
recently. It'd also be completely wrong after a restart.

If we had information about *recent* cache hit patterns for the relation, it'd
be a lot better, but we don't have the infrastructure for that, and
introducing it would increase the size of the stats entries noticeably.

Another way could be to probe the buffer mapping table for a small subset of
the blocks and infer the likelihood of other blocks being in shared buffers
that way.
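A rough sketch of that sampling idea (entirely hypothetical; the array parameter stands in for real probes of the buffer mapping table): probe a small, evenly spaced subset of the relation's blocks and scale the hit count up to an estimated cached fraction.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Estimate the fraction of a relation's blocks resident in shared buffers
 * by probing nsamples evenly spaced block numbers.  cached[] simulates the
 * result of a per-block buffer mapping table lookup.
 */
static double
estimate_cached_fraction(const bool *cached, unsigned nblocks,
                         unsigned nsamples)
{
    unsigned hits = 0;

    if (nblocks == 0 || nsamples == 0)
        return 0.0;
    if (nsamples > nblocks)
        nsamples = nblocks;

    for (unsigned i = 0; i < nsamples; i++)
    {
        /* evenly spaced sample positions across the relation */
        unsigned blockno = (unsigned) (((unsigned long long) i * nblocks)
                                       / nsamples);
        if (cached[blockno])
            hits++;
    }
    return (double) hits / (double) nsamples;
}
```

The cost is O(nsamples) rather than O(#relation blocks), at the price of sampling error; whether that error is acceptable for costing purposes is exactly the open question.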

A third way could be to track the cache hit for relations in backend local
memory, likely in the relcache entry. The big disadvantage would be that query
plans would differ between connections and that connections would need to
"warm up" to have good plans. But it'd handle restarts nicely.

Greetings,

Andres Freund




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2023-01-13 Thread Thomas Munro
On Thu, Jan 5, 2023 at 2:14 PM Tom Lane  wrote:
> The rcancelrequested API is something that I devised out of whole cloth
> awhile ago.  It's not in Tcl's copy of the code, which AFAIK is the
> only other project using this regex engine.  I do still have vague
> hopes of someday seeing the engine as a standalone project, which is
> why I'd prefer to keep a bright line between the engine and Postgres.
> But there's no very strong reason to think that any hypothetical future
> external users who need a cancel API would want this API as opposed to
> one that requires exit() or longjmp() to get out of the engine.  So if
> we're changing the way we use it, I think it's perfectly reasonable to
> redesign that API to make it simpler and less of an impedance mismatch.

Thanks for that background.  Alright then, here's a new iteration
exploring this direction.  It gets rid of CANCEL_REQUESTED() ->
REG_CANCEL and the associated error and callback function, and instead
has just "INTERRUPT(re);" at those cancellation points, which is a
macro that defaults to nothing (for Tcl's benefit).  Our regcustom.h
defines it as CHECK_FOR_INTERRUPTS().  I dunno if it's worth passing
the "re" argument...  I was imagining that someone who wants to free
memory explicitly and then longjmp would probably need it?  (It might
even be possible to expand to something that sets an error and
returns, not investigated.)  Better name or design very welcome.

Another decision is to use the no-OOM version of palloc.  (Not
explored: could we use throwing palloc with attribute returns_nonnull
to teach GCC and Clang to prune the failure handling from generated
regex code?)  (As for STACK_TOO_DEEP(): why follow a function pointer,
when it could be macro-only too?  But that's getting off track.)

I split the patch in two: memory and interrupts.  I also found a place
in contrib/pg_trgm that did no-longer-needed try/finally.
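The macro-based design described above can be sketched like so (toy names; only the default-to-nothing INTERRUPT() and the regcustom.h override to CHECK_FOR_INTERRUPTS() come from the description, the counter is ours so the expansion is observable):

```c
#include <assert.h>
#include <stddef.h>

/* Default: expands to nothing, so Tcl's copy of the engine is unaffected. */
#ifndef INTERRUPT
#define INTERRUPT(re) ((void) 0)
#endif

static int interrupt_checks = 0;

/* A regcustom.h-style override; PostgreSQL would use CHECK_FOR_INTERRUPTS(). */
#undef INTERRUPT
#define INTERRUPT(re) (interrupt_checks++)

/* Stand-in for an engine loop with one cancellation point per iteration. */
static int
toy_match_step(int iterations)
{
    for (int i = 0; i < iterations; i++)
        INTERRUPT(NULL);
    return iterations;
}
```

Passing "re" keeps the door open for a consumer whose override wants to free the regex's memory explicitly before longjmp'ing out, as discussed above.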
From 159d3d0fd7894c69f20367771b2100e48e064eec Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 4 Jan 2023 14:15:40 +1300
Subject: [PATCH v5 1/4] Use MemoryContext API for regex memory management.

Previously, regex_t objects' memory was managed with malloc() and free()
directly.  Switch to palloc()-based memory management instead.
Advantages:

 * memory used by cached regexes is now visible with MemoryContext
   observability tools

 * cleanup can be done automatically in certain failure modes
   (something that later commits will take advantage of)

 * cleanup can be done in bulk

On the downside, there may be more fragmentation (wasted memory) due to
per-regex MemoryContext objects.  This is a problem shared with other
cached objects in PostgreSQL and can probably be improved with later
tuning.

Thanks to Noah Misch for suggesting this general approach, which
unblocks later work on interrupts.

Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
---
 src/backend/utils/adt/regexp.c | 57 --
 src/include/regex/regcustom.h  |  6 ++--
 2 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 810dcb85b6..81400ba150 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -96,9 +96,13 @@ typedef struct regexp_matches_ctx
 #define MAX_CACHED_RES	32
 #endif
 
+/* A parent memory context for regular expressions. */
+static MemoryContext RegexpCacheMemoryContext;
+
 /* this structure describes one cached regular expression */
 typedef struct cached_re_str
 {
+	MemoryContext cre_context;	/* memory context for this regexp */
 	char	   *cre_pat;		/* original RE (not null terminated!) */
 	int			cre_pat_len;	/* length of original RE, in bytes */
 	int			cre_flags;		/* compile flags: extended,icase etc */
@@ -145,6 +149,7 @@ RE_compile_and_cache(text *text_re, int cflags, Oid collation)
 	int			regcomp_result;
 	cached_re_str re_temp;
 	char		errMsg[100];
+	MemoryContext oldcontext;
 
 	/*
 	 * Look for a match among previously compiled REs.  Since the data
@@ -172,6 +177,13 @@ RE_compile_and_cache(text *text_re, int cflags, Oid collation)
 		}
 	}
 
+	/* Set up the cache memory on first go through. */
+	if (unlikely(RegexpCacheMemoryContext == NULL))
+		RegexpCacheMemoryContext =
+			AllocSetContextCreate(TopMemoryContext,
+  "RegexpCacheMemoryContext",
+  ALLOCSET_SMALL_SIZES);
+
 	/*
 	 * Couldn't find it, so try to compile the new RE.  To avoid leaking
 	 * resources on failure, we build into the re_temp local.
@@ -183,6 +195,18 @@ RE_compile_and_cache(text *text_re, int cflags, Oid collation)
 	   pattern,
 	   text_re_len);
 
+	/*
+	 * Make a memory context for this compiled regexp.  This is initially a
+	 * child of the current memory context, so it will be cleaned up
+	 * automatically if compilation is interrupted and throws an ERROR.
+	 * We'll re-parent it under the longer lived cache context if we make it
+	

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 16:13:45 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 2:00 PM Andres Freund  wrote:
> > I think it'd be a good idea to split off the part of the patch that 
> > introduces
> > AutoVacType / adds logging for what triggered. That's independently useful,
> > likely uncontroversial and makes the remaining patch smaller.
>
> I like that idea.

Cool.


> Attached revision v4 breaks things up mechanically, along those lines
> (no real changes to the code inself, though). The controversial parts
> of the patch are indeed a fairly small proportion of the total
> changes.

I don't think the split is right. There's too much in 0001 - it's basically
introducing the terminology of 0002 already. Could you make it a much more
minimal change?


> > I'd also add the trigger to the pg_stat_activity entry for the autovac
> > worker. Additionally I think we should add information about using failsafe
> > mode to the p_s_a entry.
>
> I agree that that's all useful, but it seems like it can be treated as
> later work.

IDK, it splits up anti-wraparound vacuums into different sub-kinds but doesn't
allow to distinguish most of them from a plain autovacuum.

Seems pretty easy to display the trigger from 0001 in
autovac_report_activity()? You'd have to move the AutoVacType -> translated
string mapping into a separate function. That seems like a good idea anyway,
the current coding makes translators translate several largely identical
strings that just differ in one part.



> > I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
> > times, FWIW. And I've seen plenty of places write their own userspace 
> > version
> > of it, because without it they run into trouble.  However, I don't like that
> > the patch infers the various thresholds using magic constants / multipliers.
>
> As I said, these details are totally negotiable, and likely could be a
> lot better without much effort.
>
> What are your concerns about the thresholds? For example, is it that
> you can't configure the behavior directly at all? Something else?

The above, but mainly that

> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> >   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> >   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age

seems quite confusing / non-principled. What's the logic behind the auto
cancel threshold being 2 x freeze_max_age, except that when freeze_max_age is
1 billion, the cutoff is set to 1 billion? That just makes no sense to me.


Maybe I'm partially confused by the Min(freeze_max_age * 2, 1 billion). As you
pointed out in [1], that doesn't actually lower the threshold for "table age"
vacuums, because we don't even get to that portion of the code if we didn't
already cross freeze_max_age.

if (freeze_max_age < ANTIWRAPAROUND_MAX_AGE)
	freeze_max_age *= 2;
freeze_max_age = Min(freeze_max_age, ANTIWRAPAROUND_MAX_AGE);

You're lowering a large freeze_max_age to ANTIWRAPAROUND_MAX_AGE - but it only
happens after all other checks of freeze_max_age, so it won't influence
those. That's confusing code.

I think this'd be a lot more readable if you introduced a separate variable
for the "no-cancel" threshold, rather than overloading freeze_max_age with
different meanings. And you should remove the confusing "lowering" of the
cutoff.  Maybe something like

no_cancel_age = freeze_max_age;
if (no_cancel_age < ANTIWRAPAROUND_MAX_AGE)
{
	/* multiply by two, but make sure to not exceed ANTIWRAPAROUND_MAX_AGE */
	no_cancel_age = Min((uint32) ANTIWRAPAROUND_MAX_AGE, (uint32) no_cancel_age * 2);
}

The uint32 bit isn't needed with ANTIWRAPAROUND_MAX_AGE at 1 billion, but at
1.2 it would be needed, so it seems better to have it.


That still doesn't explain why the cancel_age = freeze_max_age * 2
behaviour should be clamped at 1 billion, though.



> > If I understand the patch correctly, we now have the following age based
> > thresholds for av:
> >
> > - force-enable autovacuum:
> >   oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
> > - autovacuum based on age:
> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> > tableagevac = relfrozenxid < recentXid - freeze_max_age
> > - prevent auto-cancellation:
> >   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> >   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
> >   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
> >
> > Is that right?
>
> That summary looks accurate, but I'm a bit confused about why you're
> asking the question this way. I thought that it was obvious that the
> patch doesn't change most of these things.

For me it was helpful to clearly list the triggers when thinking about the
issue. I found the diff hard to read and, as noted above, the logic for the
auto cancel threshold quite confusing, so ...


> The only mechanism 

backup.sgml typo

2023-01-13 Thread Tatsuo Ishii
There seems to be a small typo in backup.sgml
(archive_command is unnecessarily
repeated). Attached is a patch to fix that.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8bab521718..be05a33205 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -956,8 +956,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
  On a standby, archive_mode must be always in order
  for pg_backup_stop to wait.
  Archiving of these files happens automatically since you have
- already configured archive_command or archive_library or
- archive_command.
+ already configured archive_command or archive_library.
  In most cases this happens quickly, but you are advised to monitor your
  archive system to ensure there are no delays.
  If the archive process has fallen behind because of failures of the


Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

2023-01-13 Thread Thomas Munro
On Tue, Jan 10, 2023 at 3:05 PM Andres Freund  wrote:
> On 2023-01-03 12:05:20 -0800, Andres Freund wrote:
> > The attached patch adds libpq-be-fe-helpers.h and uses it in 
> > libpqwalreceiver,
> > dblink, postgres_fdw.
>
> > As I made libpq-be-fe-helpers.h handle reserving external fds,
> > libpqwalreceiver now does so. I briefly looked through its users without
> > seeing cases of leaking in case of errors - which would already have been 
> > bad,
> > since we'd already have leaked a libpq connection/socket.
> >
> >
> > Given the lack of field complaints and the size of the required changes, I
> > don't think we should backpatch this, even though it's pretty clearly buggy
> > as-is.
>
> Any comments on this? Otherwise I think I'll go with this approach.

+1.  Not totally convinced about the location but we are free to
re-organise it any time, and the random CI failures are bad.




How to find the number of cached pages for a relation?

2023-01-13 Thread Amin
Hi,

Before scanning a relation, in the planner stage, I want to make a call to
retrieve information about how many pages will be a hit for a specific
relation. The module pg_buffercache seems to be doing a similar thing.
Also, pg_statio_all_tables seems to have that information, but it is
updated only after execution, whereas I want the information before execution.
I am also not sure how pg_statio_all_tables is created or how I can access it
in the code.

Thank you!


Re: pgsql: Add new GUC createrole_self_grant.

2023-01-13 Thread David G. Johnston
On Fri, Jan 13, 2023 at 4:46 PM Andres Freund  wrote:

>
> I don't really see what that has to do with the topic at hand, unless you
> want
> to suggest removing the entire section about how to write secure security
> definer functions?
>

Not remove, but I'm not seeing why the introduction of this GUC requires
any change to the documentation.

I'll leave discussion of security invoker to the other thread going on
right now.


> The point of the security definer section is to explain how to safely write
> security definer functions that you grant to less privileged users
>

Yeah, we are really good at "how".

+If the security definer function intends to create roles, and if it
+is running as a non-superuser, createrole_self_grant
+should also be set to a known value using the SET
+clause.

I'd like to know "why".  Without knowing why we are adding this I can't
give it a +1.  I want the patch to include the why.

David J.


Announcing Release 16 of the PostgreSQL Buildfarm client

2023-01-13 Thread Andrew Dunstan
Hot on the heels of Release 15 comes Release 16.

This release fixes some issues discovered in Release 15's check-for-update
feature, so that it now works correctly with the |force_every| and
|trigger_exclude| features.

It also features these items:

  * a new |--check-for-work| mode of run_branches.pl
This mode doesn't do any work but exits with a zero status if there
is work to do and 1 if there is not. It is intended for use as an
ExecCondition in |systemd| units
  * up to date filtering now works with an explicit list of branches, as
well as with key words like |ALL|
  * reduce the verbosity of |"Another process holds the lock"| messages.
These are now only emitted if the |verbose| setting is greater than 1
  * |update_personality| now has options to change the owner name and
owner email
This was in Release 15 but was accidentally omitted from the release
notes. Up to now the only way to change these was by action from
the administrators.
  * improve collection of logs in cross version upgrade testing
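A unit gated on the new |--check-for-work| mode might look along these lines; the paths and the exact option spelling for the run itself are illustrative, not taken from the release:

```ini
[Service]
Type=oneshot
# ExecCondition skips ExecStart when the command exits non-zero,
# i.e. when --check-for-work reports there is nothing to do
ExecCondition=/usr/bin/perl /home/bf/client-code/run_branches.pl --check-for-work --run-all
ExecStart=/usr/bin/perl /home/bf/client-code/run_branches.pl --run-all
```

This relies on systemd's ExecCondition= semantics (available since systemd 243): exit status 0 proceeds to ExecStart, statuses 1-254 skip it cleanly, which matches the 0 = work / 1 = no work convention described above.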

The release can be downloaded from

 or


cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fixes required for cross version update testing

2023-01-13 Thread Andrew Dunstan


On 2023-01-13 Fr 19:33, Justin Pryzby wrote:
> On Fri, Jan 13, 2023 at 05:20:41PM -0500, Andrew Dunstan wrote:
>> Over at [1] there was some discussion of moving knowledge of what's
>> required to be fixed from old branch repos to be able to upgrade them
>> into the core code, instead of having it reside in a buildfarm client
>> module.
> Is this instead of the idea for the buildfarm to use the same SQL script
> as the TAP test (upgrade_adapt.sql) ?
>
> Discussed various places:
>
> https://www.postgresql.org/message-id/flat/1575064.1615060...@sss.pgh.pa.us
>
> https://github.com/PGBuildFarm/client-code/pull/23
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0df9641d39057f431655b92b8a490b89c508a0b3
> |   The long-term plan is to make the buildfarm code re-use this new SQL
> |   file, so as committers are able to fix any compatibility issues in the
> |   tests of pg_upgrade with a refresh of the core code, without having to
> |   poke at the buildfarm client.  Note that this is only able to handle the
> |   main regression test suite, and that nothing is done yet for contrib
> |   modules yet (these have more issues like their database names).
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9814ff550046f825b751803191b29a2fbbc79283
>

I didn't adopt the PR precisely because it didn't do enough, unlike the
module I posted, which supports upgrades all the way from 9.2 forward,
and for more databases than just regression.

I frankly think this is a better approach.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fixes required for cross version update testing

2023-01-13 Thread Andrew Dunstan


On 2023-01-13 Fr 19:49, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Here's a piece of WIP for that, in the form of a perl module that
>> provides a function that takes an old version number / tag and provides
>> the set of sql statements that need to be run to make the old repo
>> upgradeable. It still needs a good deal of polish, but it's a start.
> Oh!  I've been hacking on exactly the same idea all day ...
>
> https://www.postgresql.org/message-id/891521.1673657296%40sss.pgh.pa.us
>
>   



Oh, sorry if I have wasted some of your time. I posted my outline idea
and then it itched so I scratched.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fixes required for cross version update testing

2023-01-13 Thread Tom Lane
Andrew Dunstan  writes:
> Here's a piece of WIP for that, in the form of a perl module that
> provides a function that takes an old version number / tag and provides
> the set of sql statements that need to be run to make the old repo
> upgradeable. It still needs a good deal of polish, but it's a start.

Oh!  I've been hacking on exactly the same idea all day ...

https://www.postgresql.org/message-id/891521.1673657296%40sss.pgh.pa.us

regards, tom lane




Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-13 Thread Tom Lane
This is a followup to the discussion at [1], in which we agreed that
it's time to fix the buildfarm client so that knowledge about
cross-version discrepancies in pg_dump output can be moved into
the community git tree, making it feasible for people other than
Andrew to fix problems when we change things of that sort.
The idea is to create helper files that live in the git tree and
are used by the BF client to perform the activities that are likely
to need tweaking.

Attached are two patches, one for PG git and one for the buildfarm
client, that create a working POC for this approach.  I've only
carried this as far as making a helper file for HEAD, but I believe
that helper files for the back branches would mostly just need to
be cut-down versions of this one.  I've tested it successfully with
cross-version upgrade tests down to 9.3.  (9.2 would need some more
work, and I'm not sure if it's worth the trouble --- are we going to
retire 9.2 soon?)

I'm a very mediocre Perl programmer, so I'm sure there are stylistic
and other problems, but I'm encouraged that this seems feasible.

Also, I wonder if we can't get rid of
src/bin/pg_upgrade/upgrade_adapt.sql in favor of using this code.
I tried to write adjust_database_contents() in such a way that it
could be pointed at a database by some other Perl code that's
not the buildfarm client.

regards, tom lane

[1] https://www.postgresql.org/message-id/951602.1673535249%40sss.pgh.pa.us
diff --git a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
new file mode 100644
index 00..279b2bd0e6
--- /dev/null
+++ b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
@@ -0,0 +1,421 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+=pod
+
+=head1 NAME
+
+PostgreSQL::Test::AdjustUpgrade - helper module for cross-version upgrade tests
+
+=head1 SYNOPSIS
+
+  use PostgreSQL::Test::AdjustUpgrade;
+
+  # Adjust contents of old-version database before dumping
+  adjust_database_contents($old_version, $psql, %dbnames);
+
+  # Adjust contents of old pg_dumpall output file to match newer version
+  $dump = adjust_old_dumpfile($old_version, $dump);
+
+  # Adjust contents of new pg_dumpall output file to match older version
+  $dump = adjust_new_dumpfile($old_version, $dump);
+
+=head1 DESCRIPTION
+
+C<PostgreSQL::Test::AdjustUpgrade> encapsulates various hacks needed to
+compare the results of cross-version upgrade tests.
+
+=cut
+
+package PostgreSQL::Test::AdjustUpgrade;
+
+use strict;
+use warnings;
+
+use Exporter 'import';
+
+our @EXPORT = qw(
+  adjust_database_contents
+  adjust_old_dumpfile
+  adjust_new_dumpfile
+);
+
+=pod
+
+=head1 ROUTINES
+
+=over
+
+=item adjust_database_contents($old_version, $psql, %dbnames)
+
+Perform any DDL operations against the old-version installation that are
+needed before we can pg_upgrade it into the current PostgreSQL version.
+
+Typically this involves dropping or adjusting no-longer-supported objects.
+
+Arguments:
+
+=over
+
+=item C<$old_version>: Branch we are upgrading from, e.g. 'REL_11_STABLE'
+
+=item C<$psql>: Object with a method C<old_psql($dbname, $sql)>
+to perform SQL against the old installation, returning psql's exit status
+
+=item C<%dbnames>: Hash of database names present in the old installation
+
+=back
+
+=cut
+
+sub adjust_database_contents
+{
+	my ($old_version, $psql, %dbnames) = @_;
+
+	# nothing to do for non-cross-version tests
+	return if $old_version eq 'HEAD';
+
+	# stuff not supported from release 16
+	if ($old_version ge 'REL_12_STABLE'
+		and $old_version lt 'REL_16_STABLE')
+	{
+		# Can't upgrade aclitem in user tables from pre 16 to 16+.
+		# Also can't handle child tables with locally-generated columns.
+		my $prstmt = join(';',
+			'alter table public.tab_core_types drop column aclitem',
+			'drop table public.gtest_normal_child',
+			'drop table public.gtest_normal_child2');
+
+		$psql->old_psql("regression", $prstmt);
+		return if $?;
+	}
+
+	# stuff not supported from release 14
+	if ($old_version lt 'REL_14_STABLE')
+	{
+		# postfix operators (some don't exist in very old versions)
+		my $prstmt = join(';',
+			'drop operator #@# (bigint,NONE)',
+			'drop operator #%# (bigint,NONE)',
+			'drop operator if exists !=- (bigint,NONE)',
+			'drop operator if exists #@%# (bigint,NONE)');
+
+		$psql->old_psql("regression", $prstmt);
+		return if $?;
+
+		# get rid of dblink's dependencies on regress.so
+		$prstmt = join(';',
+			'drop function if exists public.putenv(text)',
+			'drop function if exists public.wait_pid(integer)');
+
+		my $regrdb =
+		  $old_version le "REL9_4_STABLE"
+		  ? "contrib_regression"
+		  : "contrib_regression_dblink";
+
+		if ($dbnames{$regrdb})
+		{
+			$psql->old_psql($regrdb, $prstmt);
+			return if $?;
+		}
+	}
+
+	# user table OIDs are gone from release 12 on
+	if ($old_version lt 'REL_12_STABLE')
+	{
+		my $nooid_stmt = q{
+   DO $stmt$
+   DECLARE
+  rec text;
+   BEGIN
+  FOR rec in
+ select 

Re: Fixes required for cross version update testing

2023-01-13 Thread Justin Pryzby
On Fri, Jan 13, 2023 at 05:20:41PM -0500, Andrew Dunstan wrote:
> Over at [1] there was some discussion of moving knowledge of what's
> required to be fixed from old branch repos to be able to upgrade them
> into the core code, instead of having it reside in a buildfarm client
> module.

Is this instead of the idea for the buildfarm to use the same SQL script
as the TAP test (upgrade_adapt.sql) ?

Discussed various places:

https://www.postgresql.org/message-id/flat/1575064.1615060...@sss.pgh.pa.us

https://github.com/PGBuildFarm/client-code/pull/23

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0df9641d39057f431655b92b8a490b89c508a0b3
|   The long-term plan is to make the buildfarm code re-use this new SQL
|   file, so as committers are able to fix any compatibility issues in the
|   tests of pg_upgrade with a refresh of the core code, without having to
|   poke at the buildfarm client.  Note that this is only able to handle the
|   main regression test suite, and that nothing is done yet for contrib
|   modules yet (these have more issues like their database names).

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9814ff550046f825b751803191b29a2fbbc79283

-- 
Justin




Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Andrey Borodin
On Fri, Jan 13, 2023 at 3:46 AM Jose Arthur Benetasso Villanova
 wrote:
>
> The only thing that I found is the gin_index_parent_check function in docs
> still references the "gin_index_parent_check(index regclass,
> heapallindexed boolean) returns void"
>

Correct! Please find the attached fixed version.

Thank you!

Best regards, Andrey Borodin.


v20-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v20-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v20-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Peter Geoghegan
On Fri, Jan 13, 2023 at 2:00 PM Andres Freund  wrote:
> On 2023-01-12 16:08:06 -0500, Robert Haas wrote:
> > Normally, the XID age of a table never reaches autovacuum_freeze_max_age in
> > the first place.
>
> That's not at all my experience. I often see it being the primary reason for
> autovacuum vacuuming large tables on busy OLTP systems. Even without any
> longrunning transactions or such, with available autovac workers and without
> earlier autovacuums getting interrupted by locks.  Once a table is large,
> reasonable scale factors require a lot of changes to accumulate to trigger an
> autovacuum, and during a vacuum a lot of transactions complete, leading to
> large tables having a significant age by the time autovac finishes.

I've definitely seen this. I've also noticed that TPC-C's stock and
customer tables (both of which are subject to many HOT updates) only
ever receive antiwraparound autovacuums, even with my very aggressive
autovacuum settings.

Overall, I think that it's quite common among the largest tables, even
when things are running normally.

> The most common "bad" reason for reaching autovacuum_freeze_max_age that I see
> is cost limits not allowing vacuum to complete on time.

There are many problems with the statistics driving this whole
process, that I won't rehash right now. I actually think that the
whole idea of relying on statistical sampling for dead tuples is
fundamentally just bogus (that's not how statistics work in general),
though I have quite a few less fundamental and more concrete
complaints about the statistics just being wrong on their own terms.

> Perhaps we should track how often autovacuum was triggered by what reason in
> a relation's pgstats? That'd make it a lot easier to collect data, both for
> tuning the thresholds on a system, and for hacking on postgres.

That would definitely be useful.

> Tracking the number of times autovacuum was interrupted due to a lock request
> might be a good idea as well?

Also useful information worth having.

> I think it'd be a good idea to split off the part of the patch that introduces
> AutoVacType / adds logging for what triggered. That's independently useful,
> likely uncontroversial and makes the remaining patch smaller.

I like that idea.

Attached revision v4 breaks things up mechanically, along those lines
(no real changes to the code itself, though). The controversial parts
of the patch are indeed a fairly small proportion of the total
changes.

> I'd also add the trigger to the pg_stat_activity entry for the autovac
> worker. Additionally I think we should add information about using failsafe
> mode to the p_s_a entry.

I agree that that's all useful, but it seems like it can be treated as
later work.

> I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
> times, FWIW. And I've seen plenty of places write their own userspace version
> of it, because without it they run into trouble.  However, I don't like that
> the patch infers the various thresholds using magic constants / multipliers.

As I said, these details are totally negotiable, and likely could be a
lot better without much effort.

What are your concerns about the thresholds? For example, is it that
you can't configure the behavior directly at all? Something else?

> autovacuum_freeze_max_age is really a fairly random collection of things:
> 1) triggers autovacuum on tables based on age, in addition to the dead tuple /
>inserted tuples triggers
> 2) prevents auto-cancellation of autovacuum
> 3) starts autovacuum, even if autovacuum is disabled
>
> IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2)
> to make progress is a bit more concerning. 3) should rarely be needed, but is
> a good safety mechanism.
>
> I doubt that controlling all of them via one GUC is sensible.

I agree, of course, but just to be clear: I don't think it matters
that we couple together 1 and 3. In fact it's good that we do that,
because the point to the user is that they cannot disable table-age
(i.e. what we currently call antiwraparound) autovacuums -- that just
makes sense.

The only problem that I see is that item 2 is tied to the other items
from your list.

> If I understand the patch correctly, we now have the following age based
> thresholds for av:
>
> - force-enable autovacuum:
>   oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
> - autovacuum based on age:
>   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
> tableagevac = relfrozenxid < recentXid - freeze_max_age
> - prevent auto-cancellation:
>   freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
>   prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
>   prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age
>
> Is that right?

That summary looks accurate, but I'm a bit confused about why you're
asking the question this way. I thought that it was obvious that the
patch doesn't change most 

Re: pgsql: Add new GUC createrole_self_grant.

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-12 18:15:50 -0700, David G. Johnston wrote:
> On Thu, Jan 12, 2023 at 8:11 AM Robert Haas  wrote:
> 
> > On Wed, Jan 11, 2023 at 7:53 PM David G. Johnston
> >  wrote:
> > > Justed wanted to chime in and say Robert has eloquently put into words
> > much of what I have been thinking here, and that I concur that guiding the
> > DBA to use care with the power they have been provided is a sane position
> > to take.
> > >
> > > +1, and thank you.
> >
> > Thanks!
> >
> > Here's a patch. In it I make three changes, only one of which is
> > directly relevant to the topic at hand:
> >
> > 1. Add a sentence to the documentation on writing SECURITY FUNCTIONS
> > safely concerning createrole_self_grant.
> > 2. Add a sentence to the documentation on SECURITY DEFINER referring
> > to the section about writing such functions safely.
> > 3. Remove a note discussing the fact that pre-8.3 versions did not
> > have SET clauses for functions.
> >
> > I can separate this into multiple patches if desired. And of course
> > you, Tom, or others may have suggestions on which of these changes
> > should be included at all or how to word them better.
> >
> >
> I'm still not really feeling how security definer is the problematic option
> here.  Security invoker scares me much more.

I don't really see what that has to do with the topic at hand, unless you want
to suggest removing the entire section about how to write secure security
definer functions?


> postgres=# create role unpriv login;
> CREATE ROLE
> postgres=# create role creator createrole login;
> CREATE ROLE
> postgres=# grant pg_read_all_data to creator with admin option;
> postgres=#
> create procedure makeunprivpowerful() LANGUAGE sql AS $$
> grant pg_read_all_data to unpriv;
> $$;
> CREATE PROCEDURE
> postgres=# alter procedure makeunprivpowerful() owner to unpriv;
> ALTER PROCEDURE
> 
> postgres=# \c postgres unpriv
> You are now connected to database "postgres" as user "unpriv".
> postgres=> call makeunprivpowerful();
> ERROR:  must have admin option on role "pg_read_all_data"
> CONTEXT:  SQL function "makeunprivpowerful" statement 1
> 
> postgres=# \c postgres creator
> You are now connected to database "postgres" as user "creator".
> postgres=> call makeunprivpowerful();
> CALL
> 
> Personally I think the best advice for a CREATEROLE granted user is to
> never call routines they themselves don't own; or at least only those have
> reviewed thoroughly and understand the consequences of.  Regardless of
> security definer/invoker.
> 
> In short, a low-privilege user can basically run anything without much fear
> because they can't do harm anyway.  Thus security invoker applies to them,
> and having security definer gives them the ability to do some things
> without needing to have permanent higher privileges.  These are useful,
> security related attributes with respect to them.
> 
> For a high-privilege I would argue that neither security invoker nor
> definer are relevant considerations from a security point-of-view.  If you
> have high-privilege you must know what it is you are executing before you
> execute it and anything bad it causes you to do using your privileges, or
> higher if you run a security definer function blindly, is an administrative
> failure, not a security hole.

The point of the security definer section is to explain how to safely write
security definer functions that you grant to less privileged users. It's not
about whether it's safe to call a security invoker / definer function -
indeed, if you don't trust the function author / owner, it's never safe to
call the function.


Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 13:38:15 -0500, Melanie Plageman wrote:
> > I think that's marginally better, but I think having to define both
> > FIRST and NUM is excessive and doesn't make it less fragile.  Not sure
> > what anyone else will say, but I'd prefer if it started at "0".

The reason for using FIRST is to be able to define the loop variable as the
enum type, without assigning numeric values to an enum var. I prefer it
slightly.


> From f8c9077631169a778c893fd16b7a973ad5725f2a Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Fri, 9 Dec 2022 18:23:19 -0800
> Subject: [PATCH v47 1/5] pgindent and some manual cleanup in pgstat related

Applied.


> Subject: [PATCH v47 2/5] pgstat: Infrastructure to track IO operations


> diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
> index 0fa5370bcd..608c3b59da 100644
> --- a/src/backend/utils/activity/pgstat.c
> +++ b/src/backend/utils/activity/pgstat.c

Reminder to self: Need to bump PGSTAT_FILE_FORMAT_ID before commit.

Perhaps you could add a note about that to the commit message?



> @@ -359,6 +360,15 @@ static const PgStat_KindInfo 
> pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
>   .snapshot_cb = pgstat_checkpointer_snapshot_cb,
>   },
>
> + [PGSTAT_KIND_IO] = {
> + .name = "io_ops",

That should be "io" now I think?



> +/*
> + * Check that stats have not been counted for any combination of IOContext,
> + * IOObject, and IOOp which are not tracked for the passed-in BackendType. The
> + * passed-in PgStat_BackendIO must contain stats from the BackendType specified
> + * by the second parameter. Caller is responsible for locking the passed-in
> + * PgStat_BackendIO, if needed.
> + */

Other PgStat_Backend* structs are just for pending data. Perhaps we could
rename it slightly to make that clearer? PgStat_BktypeIO?
PgStat_IOForBackendType? or a similar variation?


> +bool
> +pgstat_bktype_io_stats_valid(PgStat_BackendIO *backend_io,
> +							 BackendType bktype)
> +{
> +	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
> +
> +	for (IOContext io_context = IOCONTEXT_FIRST;
> +		 io_context < IOCONTEXT_NUM_TYPES; io_context++)
> +	{
> +		for (IOObject io_object = IOOBJECT_FIRST;
> +			 io_object < IOOBJECT_NUM_TYPES; io_object++)
> +		{
> +			/*
> +			 * Don't bother trying to skip to the next loop iteration if
> +			 * pgstat_tracks_io_object() would return false here. We still
> +			 * need to validate that each counter is zero anyway.
> +			 */
> +			for (IOOp io_op = IOOP_FIRST; io_op < IOOP_NUM_TYPES; io_op++)
> +			{
> +				if ((!bktype_tracked ||
> +					 !pgstat_tracks_io_op(bktype, io_context, io_object, io_op)) &&
> +					backend_io->data[io_context][io_object][io_op] != 0)
> +					return false;

Hm, perhaps this could be broken up into multiple lines? Something like

/* no stats, so nothing to validate */
if (backend_io->data[io_context][io_object][io_op] == 0)
	continue;

/* something went wrong if we have stats for something not tracked */
if (!bktype_tracked ||
	!pgstat_tracks_io_op(bktype, io_context, io_object, io_op))
	return false;


> +typedef struct PgStat_BackendIO
> +{
> + PgStat_Counter 
> data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> +} PgStat_BackendIO;

Would it bother you if we swapped the order of iocontext and iobject here and
related places? It makes more sense to me semantically, and should now be
pretty easy, code wise.


> +/* shared version of PgStat_IO */
> +typedef struct PgStatShared_IO
> +{

Maybe /* PgStat_IO in shared memory */?



> Subject: [PATCH v47 3/5] pgstat: Count IO for relations

Nearly happy with this now. See one minor nit below.

I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
I don't have a better idea, and it doesn't seem too horrible.


> @@ -1441,6 +1474,28 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
>   UnlockBufHdr(buf, buf_state);
>
> +	if (oldFlags & BM_VALID)
> +	{
> +		/*
> +		 * When a BufferAccessStrategy is in use, blocks evicted from shared
> +		 * buffers are counted as IOOP_EVICT in the corresponding context
> +		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
> +		 * strategy in two cases: 1) while initially claiming buffers for the
> +		 * strategy ring 2) to replace an existing strategy ring buffer
> +		 * because it is pinned or in use and cannot be reused.
> +		 *
> +		 * Blocks evicted from buffers already in the 

Re: add PROCESS_MAIN to VACUUM

2023-01-13 Thread Nathan Bossart
On Fri, Jan 13, 2023 at 03:24:09PM -0800, Jeff Davis wrote:
> For completeness, did you consider CLUSTER and REINDEX options as well?

I have not, but I can put together patches for those as well.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add PROCESS_MAIN to VACUUM

2023-01-13 Thread Jeff Davis
On Thu, 2022-12-29 at 16:00 -0800, Nathan Bossart wrote:
> The motivation for adding this option is to make it easier to VACUUM
> only a
> relation's TOAST table.  At the moment, you need to find the TOAST
> table by
> examining a relation's reltoastrelid, and you need USAGE on the
> pg_toast
> schema.  This option could also help make it possible to call only
> vac_update_datfrozenxid() without processing any relations, as
> discussed
> elsewhere [2].

For completeness, did you consider CLUSTER and REINDEX options as well?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-01-13 Thread Nathan Bossart
On Fri, Jan 13, 2023 at 02:56:26PM -0800, Nathan Bossart wrote:
> Okay.  Here is a new patch set.

And here is a rebased patch set (c44f633 changed the same LOCK TABLE docs).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 48be435025ba10235c821cbd298c43763cbc5a56 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 9 Jan 2023 14:29:24 -0800
Subject: [PATCH v6 1/2] fix MAINTAIN privs for partitions

---
 doc/src/sgml/ref/analyze.sgml |  5 ++-
 doc/src/sgml/ref/cluster.sgml | 17 ++---
 doc/src/sgml/ref/lock.sgml|  5 ++-
 doc/src/sgml/ref/reindex.sgml |  6 +++-
 doc/src/sgml/ref/vacuum.sgml  |  5 ++-
 src/backend/commands/cluster.c| 35 +--
 src/backend/commands/indexcmds.c  | 22 ++--
 src/backend/commands/lockcmds.c   |  8 +
 src/backend/commands/tablecmds.c  | 30 ++--
 src/backend/commands/vacuum.c |  9 +++--
 src/include/commands/tablecmds.h  |  1 +
 .../expected/cluster-conflict-partition.out   |  6 ++--
 .../specs/cluster-conflict-partition.spec |  5 +--
 src/test/regress/expected/cluster.out |  4 ++-
 src/test/regress/expected/vacuum.out  | 18 --
 src/test/regress/sql/cluster.sql  |  2 ++
 16 files changed, 119 insertions(+), 59 deletions(-)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index a26834da4f..2f94e89cb0 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -156,7 +156,10 @@ ANALYZE [ VERBOSE ] [ table_and_columnsANALYZE can only be performed by superusers and roles
-   with privileges of pg_maintain.)
+   with privileges of pg_maintain.)  If a role has
+   permission to ANALYZE a partitioned table, it is also
+   permitted to ANALYZE each of its partitions, regardless
+   of whether the role has the aforementioned privileges on the partition.
ANALYZE will skip over any tables that the calling user
does not have permission to analyze.
   
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 145101e6a5..b9f2acb1de 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -69,10 +69,7 @@ CLUSTER [VERBOSE]
   
CLUSTER without any parameter reclusters all the
previously-clustered tables in the current database that the calling user
-   owns or has the MAINTAIN privilege for, or all such tables
-   if called by a superuser or a role with privileges of the
-   pg_maintain
-   role.  This form of CLUSTER cannot be
+   has privileges for.  This form of CLUSTER cannot be
executed inside a transaction block.
   
 
@@ -134,6 +131,18 @@ CLUSTER [VERBOSE]
  
   Notes
 
+   
+To cluster a table, one must have the MAINTAIN privilege
+on the table or be the table's owner, a superuser, or a role with
+privileges of the
+pg_maintain
+role.  If a role has permission to CLUSTER a partitioned
+table, it is also permitted to CLUSTER each of its
+partitions, regardless of whether the role has the aforementioned
+privileges on the partition.  CLUSTER will skip over any
+tables that the calling user does not have permission to cluster.
+   
+

 In cases where you are accessing single rows randomly
 within a table, the actual order of the data in the
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index 8524182211..5b3b2b793a 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -177,7 +177,10 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
 MODE (or a less-conflicting mode as described in ) is permitted. If a user has
 SELECT privileges on the table, ACCESS SHARE
-MODE is permitted.
+MODE is permitted.  If a role has permission to lock a
+partitioned table, it is also permitted to lock each of its partitions,
+regardless of whether the role has the aforementioned privileges on the
+partition.

 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 192513f34e..c6ad2546f9 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -306,7 +306,11 @@ REINDEX [ ( option [, ...] ) ] { DA
indexes on shared catalogs will be skipped unless the user owns the
catalog (which typically won't be the case), has privileges of the
pg_maintain role, or has the MAINTAIN
-   privilege on the catalog.  Of course, superusers can always reindex anything.
+   privilege on the catalog.  If a role has permission to
+   REINDEX a partitioned table, it is also permitted to
+   REINDEX each of its partitions, regardless of whether the
+   role has the aforementioned privileges on the partition.  Of course,
+   superusers can always reindex anything.
   
 
   
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 8fa8421847..545b23b54f 100644
--- 

Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-01-13 Thread Nathan Bossart
On Fri, Jan 13, 2023 at 01:30:28PM -0800, Jeff Davis wrote:
> If we care about that use case, let's do it right and have forms of
> VACUUM/CLUSTER/REINDEX that check permissions on the main table, skip
> the work on the main table, and descend directly to the toast tables.
> That doesn't seem hard, but it's a separate patch.

You may be interested in https://commitfest.postgresql.org/41/4088/.

> Right now, we should simply fix the problem.

Okay.  Here is a new patch set.  I've split the partition work out to a
separate patch, and I've removed the main relation lookups for TOAST tables
in favor of adding a skip_privs flag to vacuum_rel().  The latter patch
probably needs some additional commentary and tests, which I'll go ahead
and add if we want to proceed with this approach.  I'm assuming the session
lock should be sufficient for avoiding the case where the TOAST table's OID
is reused by the time we get to it, but I'm not sure if it's sufficient to
prevent vacuuming if the privileges on the main relation have changed
across transactions.  Even if it's not, I'm not sure that case is worth
worrying about too much.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 989674a789187a8e620bb70c2155e4e3ccacdcdf Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 9 Jan 2023 14:29:24 -0800
Subject: [PATCH v5 1/2] fix MAINTAIN privs for partitions

---
 doc/src/sgml/ref/analyze.sgml |  5 ++-
 doc/src/sgml/ref/cluster.sgml | 17 ++---
 doc/src/sgml/ref/lock.sgml|  5 ++-
 doc/src/sgml/ref/reindex.sgml |  6 +++-
 doc/src/sgml/ref/vacuum.sgml  |  5 ++-
 src/backend/commands/cluster.c| 35 +--
 src/backend/commands/indexcmds.c  | 22 ++--
 src/backend/commands/lockcmds.c   |  8 +
 src/backend/commands/tablecmds.c  | 30 ++--
 src/backend/commands/vacuum.c |  9 +++--
 src/include/commands/tablecmds.h  |  1 +
 .../expected/cluster-conflict-partition.out   |  6 ++--
 .../specs/cluster-conflict-partition.spec |  5 +--
 src/test/regress/expected/cluster.out |  4 ++-
 src/test/regress/expected/vacuum.out  | 18 --
 src/test/regress/sql/cluster.sql  |  2 ++
 16 files changed, 119 insertions(+), 59 deletions(-)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index a26834da4f..2f94e89cb0 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -156,7 +156,10 @@ ANALYZE [ VERBOSE ] [ table_and_columnsANALYZE can only be performed by superusers and roles
-   with privileges of pg_maintain.)
+   with privileges of pg_maintain.)  If a role has
+   permission to ANALYZE a partitioned table, it is also
+   permitted to ANALYZE each of its partitions, regardless
+   of whether the role has the aforementioned privileges on the partition.
ANALYZE will skip over any tables that the calling user
does not have permission to analyze.
   
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 145101e6a5..b9f2acb1de 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -69,10 +69,7 @@ CLUSTER [VERBOSE]
   
CLUSTER without any parameter reclusters all the
previously-clustered tables in the current database that the calling user
-   owns or has the MAINTAIN privilege for, or all such tables
-   if called by a superuser or a role with privileges of the
-   pg_maintain
-   role.  This form of CLUSTER cannot be
+   has privileges for.  This form of CLUSTER cannot be
executed inside a transaction block.
   
 
@@ -134,6 +131,18 @@ CLUSTER [VERBOSE]
  
   Notes
 
+   
+To cluster a table, one must have the MAINTAIN privilege
+on the table or be the table's owner, a superuser, or a role with
+privileges of the
+pg_maintain
+role.  If a role has permission to CLUSTER a partitioned
+table, it is also permitted to CLUSTER each of its
+partitions, regardless of whether the role has the aforementioned
+privileges on the partition.  CLUSTER will skip over any
+tables that the calling user does not have permission to cluster.
+   
+

 In cases where you are accessing single rows randomly
 within a table, the actual order of the data in the
diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index d9c5bf9a1d..21a9f88c70 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -176,7 +176,10 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
 or TRUNCATE privileges on the target table. All other
 forms of LOCK are allowed with
 table-level UPDATE, DELETE,
-or TRUNCATE privileges.
+or TRUNCATE privileges.  If a role has permission to
+lock a partitioned table, it is also permitted to lock each of its
+partitions, regardless of whether the role has the aforementioned
+

Re: Patch: Global Unique Index

2023-01-13 Thread Cary Huang

On 2022-11-30 2:30 p.m., Greg Stark wrote:

On Tue, 29 Nov 2022 at 21:16, Tom Lane  wrote:

I actually think that that problem should be soluble with a
slightly different approach.  The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress.  That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.

I think there`s a lot of room to manoeuvre here. This is a new feature
that doesn't need to be 100% complete or satisfy any existing
standard. There are lots of options for compromises that leave room
for future improvements.

1) We could just say sure, ATTACH is slow if you're attaching a
non-empty partition
2) We could invent a concept like convalidated and let people attach a
partition without validating the uniqueness and then validate it later
concurrently
3) We could say ATTACH doesn't work now and come up with a better
strategy in the future

Also, don't I vaguely recall something in exclusion constraints about
having some kind of in-memory "intent" list where you declared that
you're about to insert a value, you validate it doesn't violate the
constraint and then you're free to insert it because anyone else will
see your intent in memory? There might be a need for some kind of
global object that only holds inserted keys long enough that other
sessions are guaranteed to see the key in the correct index. And that
could maybe even be in memory rather than on disk.

This isn't a simple project but I don't think it's impossible as long
as we keep an open mind about the requirements.


In the current global unique index implementation, ATTACH can be slow if 
there are concurrent inserts happening. ATTACH tries to acquire 
ShareLock on all existing partitions and the partition-to-be before it 
scans and sorts them for the uniqueness check. It will release them only 
after all partitions have been checked. If there are concurrent inserts, 
ATTACH has to wait for all inserts to complete. Likewise, if ATTACH is 
in progress, inserts have to wait as well. This is an issue now.


If we were to make ATTACH acquire a lower-level lock (AccessShareLock), 
scan a partition, and then release it, there is nothing stopping a 
concurrent insert from inserting a conflict right after it finishes 
checking. This is another issue. There is no transaction-level lock 
being taken here like in the multiple-concurrent-inserts case.


Another email thread, "create index concurrently on partitioned 
index", discusses some approaches that may be used to solve the ATTACH 
issue here, basically to allow ATTACH PARTITION CONCURRENTLY...



regards

Cary Huang
-
HighGo Software Canada










Fixes required for cross version update testing

2023-01-13 Thread Andrew Dunstan
Over at [1] there was some discussion of moving knowledge of what's
required to be fixed from old branch repos to be able to upgrade them
into the core code, instead of having it reside in a buildfarm client
module.

Here's a piece of WIP for that, in the form of a perl module that
provides a function that takes an old version number / tag and provides
the set of sql statements that need to be run to make the old repo
upgradeable. It still needs a good deal of polish, but it's a start.

The advantage is that it makes it far less likely that the buildfarm
maintainer (i.e. me for now) is a bottleneck in fixing issues that arise
from development. This is by far the biggest area where we have seen
buildfarm breakage for cross version upgrade testing.


cheers


andrew


[1]  https://postgr.es/m/951602.1673535...@sss.pgh.pa.us

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


AdjustUpgrade.pm
Description: Perl program


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-12 16:08:06 -0500, Robert Haas wrote:
> Normally, the XID age of a table never reaches autovacuum_freeze_max_age in
> the first place.

That's not at all my experience. I often see it being the primary reason for
autovacuum vacuuming large tables on busy OLTP systems. Even without any
longrunning transactions or such, with available autovac workers and without
earlier autovacuums getting interrupted by locks.  Once a table is large,
reasonable scale factors require a lot of changes to accumulate to trigger an
autovacuum, and during a vacuum a lot of transactions complete, leading to
large tables having a significant age by the time autovac finishes.

The most common "bad" reason for reaching autovacuum_freeze_max_age that I see
is cost limits not allowing vacuum to complete on time.


Perhaps we should track how often autovacuum was triggered by what reason in
a relation's pgstats? That'd make it a lot easier to collect data, both for
tuning the thresholds on a system, and for hacking on postgres.

Tracking the number of times autovacuum was interrupted due to a lock request
might be a good idea as well?


I think it'd be a good idea to split off the part of the patch that introduces
AutoVacType / adds logging for what triggered the vacuum. That's independently
useful, likely uncontroversial, and makes the remaining patch smaller.

I'd also add the trigger to the pg_stat_activity entry for the autovac
worker. Additionally I think we should add information about using failsafe
mode to the p_s_a entry.


I've wished for a non-wraparound, xid age based, "autovacuum trigger" many
times, FWIW. And I've seen plenty of places write their own userspace version
of it, because without it they run into trouble.  However, I don't like that
the patch infers the various thresholds using magic constants / multipliers.


autovacuum_freeze_max_age is really a fairly random collection of things:
1) triggers autovacuum on tables based on age, in addition to the dead tuple /
   inserted tuples triggers
2) prevents auto-cancellation of autovacuum
3) starts autovacuum, even if autovacuum is disabled

IME hitting 1) isn't a reason for concern, it's perfectly normal. Needing 2)
to make progress is a bit more concerning. 3) should rarely be needed, but is
a good safety mechanism.

I doubt that controlling all of them via one GUC is sensible.


If I understand the patch correctly, we now have the following age based
thresholds for av:

- force-enable autovacuum:
  oldest_datfrozenxid + autovacuum_freeze_max_age < nextXid
- autovacuum based on age:
  freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
tableagevac = relfrozenxid < recentXid - freeze_max_age
- prevent auto-cancellation:
  freeze_max_age = Min(autovacuum_freeze_max_age, table_freeze_max_age)
  prevent_auto_cancel_age = Min(freeze_max_age * 2, 1 billion)
  prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age

Is that right?


One thing I just noticed: Isn't it completely bonkers that we compute
recentXid/recentMulti once at the start of a worker in
relation_needs_vacanalyze()?  That's fine for the calls in do_autovacuum()'s
initial loops over all tables. But seems completely wrong for the later calls
via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() ->
relation_needs_vacanalyze()?

These variables really shouldn't be globals. It makes sense to cache them
locally in do_autovacuum(), but reusing them in
recheck_relation_needs_vacanalyze() and sharing them between launcher and
worker is bad.

Greetings,

Andres Freund




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-01-13 Thread Jeff Davis
On Fri, 2023-01-13 at 12:33 -0800, Nathan Bossart wrote:
> That would fix the problem in the original complaint, but it wouldn't
> allow
> for vacuuming toast tables directly if you only have MAINTAIN
> privileges on
> the main relation.  If you can vacuum the toast table indirectly via
> the
> main relation, shouldn't it be possible to vacuum it directly?

Perhaps, but that's barely supported today: you have to awkwardly find
the internal toast table name yourself, and you need the admin to grant
you USAGE on the pg_toast schema. I don't think we're obligated to also
support this hackery for non-owners with a new MAINTAIN privilege.

If we care about that use case, let's do it right and have forms of
VACUUM/CLUSTER/REINDEX that check permissions on the main table, skip
the work on the main table, and descend directly to the toast tables.
That doesn't seem hard, but it's a separate patch.

Right now, we should simply fix the problem.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 15:25:16 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds.
>
> Cheaper, and perhaps more accurate too?  Don't recall if we have any code
> paths where the input timestamps are likely to be better-than-microsecond,
> but surely that's coming someday.

instr_time on !WIN32 uses struct timespec, so we already should have nanosecond
precision available. IOW, we could add an INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?


> I'm unsure that we want to deal with rdtsc's vagaries in general, but
> no objection to changing instr_time.

Cool.

Looking at the instr_time.h part of the change, I think it should go further,
and basically do the same thing in the WIN32 path. The only part that needs to
be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.

Greetings,

Andres Freund




Re: real/float example for testlibpq3

2023-01-13 Thread Mark Wong
On Thu, Nov 03, 2022 at 09:55:22AM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On 01.11.22 09:15, Tom Lane wrote:
> >> Agreed that the libpq manual is not the place for this, but I feel
> >> like it will also be clutter in "Data Types".  Perhaps we should
> >> invent a new appendix or the like?  Somewhere near the wire protocol
> >> docs seems sensible.
> 
> > Would that clutter the protocol docs? ;-)
> 
> I said "near", not "in".  At the time I was thinking "new appendix",
> but I now recall that the wire protocol docs are not an appendix
> but a chapter in the Internals division.  So that doesn't seem like
> quite the right place anyway.
> 
> Perhaps a new chapter under "IV. Client Interfaces" is the right
> place?
> 
> If we wanted to get aggressive, we could move most of the nitpicky details
> about datatype text formatting (e.g., the array quoting rules) there too.
> I'm not set on that, but it'd make datatype.sgml smaller which could
> hardly be a bad thing.
> 
> > I suppose figuring out exactly where to put it and how to mark it up, 
> > etc., in a repeatable fashion is part of the job here.
> 
> Yup.

How does this look?

I've simply moved things around into a new "Binary Format" section with
the few parts that I've started for some quick feedback about whether
this is looking like the right landing place.

Regards,
Mark
diff --git a/doc/src/sgml/binary-format.sgml b/doc/src/sgml/binary-format.sgml
index a297ece784..779b606ec9 100644
--- a/doc/src/sgml/binary-format.sgml
+++ b/doc/src/sgml/binary-format.sgml
@@ -6,9 +6,102 @@
  pgsql binary format
 
  
-  This chapter describes the binary format used in the wire protocol.  There
-  are a number of C examples for the data types used in PostgreSQL.  We will
-  try to be as comprehensive as possible with the native data types.
+  This chapter describes the binary representation of the native PostgreSQL
+  data types and gives examples of how to handle each data type's binary
+  format by offering C code examples for each data type.
  
 
+ 
+  We will try to cover all of the native data types...
+ 
+
+ 
+  boolean
+
+  
+   A boolean is transmitted as a single byte that, when cast to an
+   int, will be 0 for
+   false and 1 for
+   true.
+  
+
+
+
+ 
+
+ 
+  real
+
+  
+   A real is composed of 4 bytes and needs to be handled correctly
+   for byte order.
+  
+
+
+
+
+ 
+
+ 
+  timestamp without time zone
+
+  
+   A timestamp without time zone is a 64-bit data type
+   representing the number of microseconds since January 1, 2000.  It can be
+   converted into a broken-down time representation by converting the time into
+   seconds and saving the microseconds elsewhere.
+  
+
+  
+   Note that in C time is counted from January 1, 1970, so this difference
+   needs to be accounted for in addition to handling the network byte order.
+  
+
+
+
+
+  
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 0d6be9a2fa..688f947107 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -51,6 +51,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 2e271862fc..705b03f4aa 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -196,6 +196,7 @@ break is not needed in a wider output rendering.
   
   
   
+  
 
  
 


Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-01-13 Thread Nathan Bossart
On Fri, Jan 13, 2023 at 11:56:03AM -0800, Jeff Davis wrote:
> I'm hesitant to add an index to pg_class just for the privilege checks
> on toast tables, and I don't think we need to.

I bet this index will be useful for more than just these privilege checks
(e.g., autovacuum currently creates a hash table for the
toast-to-main-relation mapping), but I do understand the hesitation.

> Instead, we can just
> skip the privilege check on a toast table if it's not referenced
> directly, because we already checked the privileges on the parent, and
> we still hold the session lock so nothing strange should have happened.

That would fix the problem in the original complaint, but it wouldn't allow
for vacuuming toast tables directly if you only have MAINTAIN privileges on
the main relation.  If you can vacuum the toast table indirectly via the
main relation, shouldn't it be possible to vacuum it directly?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-13 Thread Tom Lane
Andres Freund  writes:
> On 2023-01-04 13:02:05 +0100, David Geier wrote:
>> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
>> cheaper.

> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds.

Cheaper, and perhaps more accurate too?  Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

regards, tom lane




Re: Rework of collation code, extensibility

2023-01-13 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis  wrote:
> Attached trivial rebase as v6.

Some review comments for this v6.

Comments on 0001-*:

* I think that 0002-* can be squashed with 0001-*, since there isn't
any functional reason why you'd want to commit the strcoll() and
strxfrm() changes separately.

Sometimes it can be useful to break things up, despite the fact that
it couldn't possibly make sense to commit just one of the resulting
patches on its own. However, I don't think that that's appropriate
here. There is no apparent conceptual boundary that you're
highlighting by splitting things up like this. strxfrm() and strcoll()
are defined in terms of each other -- they're siblings, joined at the
hip -- so this seems kinda jarring.

* Your commit message for 0001 (and other patches) don't set things up
by describing what the point is, and what the work anticipates. I
think that they should do that.

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

* I'm not sure that pg_strcoll() should be breaking ties itself. We
break ties using strcmp() for historical reasons, but must not do that
for nondeterministic ICU collations, which may be obscured.

That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
the same as the well known strcoll()/strxfrm() relationship. That kind
of makes pg_strcoll() somewhat more than a strcoll() shim, which is
inconsistent. Another concern is that the deterministic collation
handling isn't handled in any one layer, which would have been nice.

Do we need to do things this way? What's it adding?

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

Presumably this might have performance implications. Have you thought
about that?

Some comments on 0002-*:

* I don't see much point in this new varstr_abbrev_convert() variable:

+   const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

Comments on 0006-*:

* get_builtin_libc_library() could be indented in a way that would
make it easier to understand.

--
Peter Geoghegan




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2023-01-13 Thread Jeff Davis
On Mon, 2023-01-09 at 14:51 -0800, Nathan Bossart wrote:
> On Tue, Jan 03, 2023 at 03:45:49PM -0800, Nathan Bossart wrote:
> > I'd like to get this fixed, but I have yet to hear thoughts on the
> > suggested approach.  I'll proceed with adjusting the tests and
> > documentation shortly unless someone objects.
> 
> As promised, here is a new version of the patch with adjusted tests
> and
> documentation.

The current patch doesn't handle the case properly where you have
partitions that have toast tables. It's an easy fix by recursing, but I
think we might want to do things differently anyway.

I'm hesitant to add an index to pg_class just for the privilege checks
on toast tables, and I don't think we need to. Instead, we can just
skip the privilege check on a toast table if it's not referenced
directly, because we already checked the privileges on the parent, and
we still hold the session lock so nothing strange should have happened.

I'll look into that, but so far it looks like it should come out
cleanly for toast tables. The way you're checking privileges on the
partitioned tables is fine.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-04 13:02:05 +0100, David Geier wrote:
> From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
> From: David Geier 
> Date: Thu, 17 Nov 2022 10:22:01 +0100
> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
>  cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).

While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:

((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years

Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.

It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.


This doesn't buy a whole lot of performance - the bottleneck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.

For both timespec and nanosecond, I measured three server starts, and for each
started server three executions of
pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 1000) OFFSET 1000;")

the best result is:
timespec: 1073.431
nanosec:  957.532
a ~10% difference

Greetings,

Andres Freund




Re: properly sizing attnums in pg_get_publication_tables

2023-01-13 Thread Justin Pryzby
On Fri, Jan 13, 2023 at 07:37:29AM -0800, Ted Yu wrote:
> Hi,
> I was looking at commit b7ae03953690a1dee455ba3823cc8f71a72cbe1d .
> 
> In `pg_get_publication_tables`, attnums is allocated with size
> `desc->natts`. However, since some columns may be dropped, this size may be
> larger than necessary.
> When `nattnums > 0` is false, there is no need to allocate the `attnums`
> array. In the current formation, `attnums` should be freed in this scenario.
> 
> Please take a look at the patch which moves the allocation to inside the
> `if (nattnums > 0)` block.
> 
> Thanks

It doesn't seem worth the bother of changing it or adding 10 lines of
code, or keeping track of whether "attnums" is initialized or not.

After all, it wasn't worth pfree()ing the array (which seems to work as
intended).  The array can't be larger than ~3200 bytes, and I doubt
anybody is going to be excited about saving a couple bytes per dropped
column.

-- 
Justin




Re: Transaction timeout

2023-01-13 Thread Nikolay Samokhvalov
On Fri, Jan 13, 2023 at 10:16 AM Andrey Borodin 
wrote:

> > – it seems we could (should) have one more successful "1s wait, 3s
> sleep" iteration here, ~727ms somehow wasted in a loop, quite a lot.
>
> I think big chunk from these 727ms were spent between "BEGIN" and
> "select now(), clock_timestamp(), pg_sleep(3) \watch 1".
>

Not really – there was indeed ~2s delay between BEGIN and the first
pg_sleep query, but those ~727ms is something else.

here we measure the remainder between the beginning of the transaction
measured by "now()" and the beginning of the last successful pg_sleep()
query:

gitpod=# select timestamptz '2023-01-13 15:51:18.179579+00' - '2023-01-13
15:49:22.906924+00';
?column?
-
 00:01:55.272655
(1 row)

It already includes all delays that we had from the beginning of our
transaction.

The problem with my question was that I didn't take into attention that
'2023-01-13 15:51:18.179579+00' is when the last successful query
*started*. So the remainder of our 2-min quota – 00:00:04.727345 – includes
the last successful loop (3s of successful query + 1s of waiting), and then
we have failed after ~700ms.

In other words, there are no issues here, all good.

> Many thanks for looking into this!

many thanks for implementing it


Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-13 Thread Jacob Champion
On Fri, Jan 13, 2023 at 9:10 AM Jelte Fennema  wrote:
>
> > Just a quick single-issue review, but I agree with Maxim that having
> > one PRNG, seeded once, would be simpler
>
> I don't agree that it's simpler. Because now there's a mutex you have
> to manage, and honestly cross-platform threading in C is not simple.
> However, I attached two additional patches that implement this
> approach on top of the previous patchset. Just to make sure that
> this patch is not blocked on this.

It hadn't been my intention to block the patch on it, sorry. Just
registering a preference.

I also didn't intend to make you refactor the locking code -- my
assumption was that you could use the existing pglock_thread() to
handle it, since it didn't seem like the additional contention would
hurt too much. Maybe that's not actually performant enough, in which
case my suggestion loses an advantage.

> > The test seed could then be handled globally as well (envvar?) so that you
> > don't have to introduce a debug-only option into the connection string.
>
> Why is a debug-only envvar any better than a debug-only connection option?
> For now I kept the connection option approach, since to me they seem pretty
> much equivalent.

I guess I worry less about envvar namespace pollution than pollution
of the connection options. And my thought was that the one-time
initialization could be moved to a place that doesn't need to know the
connection options at all, to make it easier to reason about the
architecture. Say, next to the WSAStartup machinery.

But as it is now, I agree that the implementation hasn't really lost
any complexity compared to the original, and I don't feel particularly
strongly about it. If it doesn't help to make the change, then it
doesn't help.

Thanks,
--Jacob




Re: Generate pg_stat_get_xact*() functions with Macros

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 10:36:49 +0100, Drouvot, Bertrand wrote:
> > > > Although I suspect this actually hints at an architectural thing that 
> > > > could be
> > > > fixed better: Perhaps we should replace find_tabstat_entry() with a 
> > > > version
> > > > returning a fully "reconciled" PgStat_StatTabEntry?  It feels quite 
> > > > wrong to
> > > > have that intimitate knowledge of the subtransaction stuff in 
> > > > pgstatfuncs.c
> > > > and about how the different counts get combined.
> > > > 
> > > > I think that'd allow us to move the definition of PgStat_TableStatus to
> > > > PgStat_TableXactStatus, PgStat_TableCounts to pgstat_internal.h. Which 
> > > > feels a
> > > > heck of a lot cleaner.
> > > 
> > > Yeah, I think that would be for a 4th patch, agree?
> > 
> > Yea. I am of multiple minds about the ordering. I can see benefits on fixing
> > the architectural issue before reducing duplication in the accessor with a
> > macro. The reason is that if we addressed the architectural issue, the
> > difference between the xact and non-xact version will be very minimal, and
> > could even be generated by the same macro.
> > 
> 
> Yeah, I do agree and I'm in favor of this ordering:
> 
> 1) replace find_tabstat_entry() with a version returning a fully "reconciled" 
> PgStat_StatTabEntry
> 2) remove prefixes
> 3) Introduce the new macros

> I'll first look at 1).

Makes sense.


> And it looks to me that removing PgStat_BackendFunctionEntry can be done 
> independently

It's imo the function version of 1), just a bit simpler to implement due to
the much simpler reconciliation. It could be done together with it, or
separately.

Greetings,

Andres Freund




Re: Blocking execution of SECURITY INVOKER

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 10:04:19 -0800, Jeff Davis wrote:
> However, the normal use case for expressions (whether in a default
> expression or check constraint or whatever) is very simple and doesn't
> even involve table access. There must be a way to satisfy those simple
> cases without opening up a vast attack surface area, and if we do, then
> I think my proposal might look useful again.

I don't see how. I guess we could try to introduce a classification of "never
dangerous" functions (and thus operators). But that seems like a crapton of
work and hard to get right. And I think my examples pretty conclusively show
that security definer isn't a useful boundary to *reduce* privileges. So the
whole idea of preventing only security invoker functions just seems
non-viable.

I think the combination of
a) a setting that restricts evaluation of any non-trusted expressions,
   independent of the origin
b) an easy way to execute arbitrary statements within
   SECURITY_RESTRICTED_OPERATION

is promising though. In later steps we might be able to use a) to offer the
option to automatically switch to expression owners in specific places (if the
current user has the rights to do that).


An alternative to b would be a version of SET ROLE that can't be undone. But I
think we'd just miss all the other things that are prevented by
SECURITY_RESTRICTED_OPERATION.

Greetings,

Andres Freund




Re: Transaction timeout

2023-01-13 Thread Andrey Borodin
Thanks for the review Nikolay!

On Fri, Jan 13, 2023 at 8:03 AM Nikolay Samokhvalov
 wrote:
>
> 1) The current test set has only 2 simple cases – I'd suggest adding one more 
> (that one that didn't work in v1):
>
> gitpod=# set transaction_timeout to '20ms';
> SET
> gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select 
> pg_sleep(.01); commit;
I tried exactly these tests - tests were unstable on Windows. Maybe
that OS has a more coarse-grained timer resolution.
It's a tradeoff between time spent on tests, strength of the test and
probability of false failure. I chose small time without false alarms.

> – it seems we could (should) have one more successful "1s wait, 3s sleep" 
> iteration here, ~727ms somehow wasted in a loop, quite a lot.

I think big chunk from these 727ms were spent between "BEGIN" and
"select now(), clock_timestamp(), pg_sleep(3) \watch 1". I doubt patch
really contains arithmetic errors.

Many thanks for looking into this!


Best regards, Andrey Borodin.




Re: Blocking execution of SECURITY INVOKER

2023-01-13 Thread Jeff Davis
On Fri, 2023-01-13 at 00:16 -0800, Andres Freund wrote:

> Just think of set_config(), pg_read_file(), lo_create(),
> binary_upgrade_*(),
> pg_drop_replication_slot()...

Thank you for walking through the details here. I missed it from your
first example because it was an extreme case -- a superuser writing an
eval() security definer function -- so the answer was to lock such a
dangerous function away. But more mild cases are the real problem,
because there's a lot of stuff in pg_catalog.*.

> If the default values get evaluated, this is arbitrary code exec,
> even if it
> requires a few contortions. And the same is true for evaluating *any*
> expression.

Right.

However, the normal use case for expressions (whether in a default
expression or check constraint or whatever) is very simple and doesn't
even involve table access. There must be a way to satisfy those simple
cases without opening up a vast attack surface area, and if we do, then
I think my proposal might look useful again.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 09:15:10 -0500, Reid Thompson wrote:
> On Mon, 2023-01-09 at 18:31 -0800, Andres Freund wrote:
> > > Dynamic shared memory allocations are included only in the value displayed
> > > for the backend that created them, they are not included in the value for
> > > backends that are attached to them to avoid double counting.
> >
> > As mentioned before, I don't think accounting DSM this way makes sense.
>
> Understood, previously you noted 'There are a few uses of DSMs that track
> shared resources, with the biggest likely being the stats for relations
> etc'.  I'd like to come up with a solution to address this; identifying the
> long term allocations to shared state and accounting for them such that they
> don't get 'lost' when the allocating backend exits.  Any guidance or
> direction would be appreciated.

Tracking it as backend memory usage doesn't seem helpful to me, particularly
because some of it is for server wide data tracking (pgstats, some
caches). But that doesn't mean you couldn't track and report it
separately.


> > > +/* --
> > > + * pgstat_get_all_memory_allocated() -
> > > + *
> > > + *   Return a uint64 representing the current shared memory 
> > > allocated to all
> > > + *   backends.  This looks directly at the BackendStatusArray, and 
> > > so will
> > > + *   provide current information regardless of the age of our 
> > > transaction's
> > > + *   snapshot of the status array.
> > > + *   In the future we will likely utilize additional values - 
> > > perhaps limit
> > > + *   backend allocation by user/role, etc.
> > > + * --
> >
> > I think it's completely unfeasible to execute something as expensive as
> > pgstat_get_all_backend_memory_allocated() on every allocation. Like,
> > seriously, no.
>
> Ok. Do we check every nth allocation/try to implement a scheme of checking
> more often as we get closer to the declared max_total_bkend_mem?

I think it's just not acceptable to do O(connections) work as part of
something critical as memory allocation. Even if amortized imo.

What you could do is to have a single, imprecise, shared counter for the total
memory allocation, and have a backend-local "allowance". When the allowance is
used up, refill it from the shared counter (a single atomic op).

But honestly, I think we first need to have the accounting for a while before
it makes sense to go for the memory limiting patch. And I doubt a single GUC
will suffice to make this usable.



> > And we absolutely definitely shouldn't just add CHECK_FOR_INTERRUPT() calls
> > into the middle of allocator code.
>
> I'm open to guidance/suggestions/pointers to remedying these.

Well, just don't have the CHECK_FOR_INTERRUPT(). Nor the O(N) operation.

You also can't do the ereport(WARNING) there, that itself allocates memory,
and could lead to recursion in some edge cases.

Greetings,

Andres Freund




Re: PL/Python: Fix return in the middle of PG_TRY() block.

2023-01-13 Thread Nathan Bossart
On Thu, Jan 12, 2023 at 09:49:00PM -0800, Andres Freund wrote:
> On 2023-01-12 10:44:33 -0800, Nathan Bossart wrote:
>> There's another "return" later on in this PG_TRY block.  I wonder if it's
>> possible to detect this sort of thing at compile time.
> 
> Clang provides some annotations that allow to detect this kind of thing. I
> hacked up a test for this, and it finds quite a bit of prolematic
> code.

Nice!

> plpython is, uh, not being good? But also in plperl, pltcl.

Yikes.

> ../../../../home/andres/src/postgresql/src/pl/tcl/pltcl.c:1830:1: warning: 
> no_returns_in_pg_try 'no_returns_handle' is not held on every path through 
> here [-Wthread-safety-analysis]
> }
> ^
> ../../../../home/andres/src/postgresql/src/pl/tcl/pltcl.c:1809:2: note: 
> no_returns_in_pg_try acquired here
> PG_CATCH();
> ^
> ../../../../home/andres/src/postgresql/src/include/utils/elog.h:433:7: note: 
> expanded from macro 'PG_CATCH'
> no_returns_start(no_returns_handle##__VA_ARGS__)
> ^
> 
> Not perfect digestible, but also not too bad. I pushed the
> no_returns_start()/no_returns_stop() calls into all the PG_TRY related macros,
> because that causes the warning to point to block that the problem is
> in. E.g. above the first warning points to PG_TRY, the second to
> PG_CATCH. It'd work to just put it into PG_TRY and PG_END_TRY.

This seems roughly as digestible as the pg_prevent_errno_in_scope stuff.
However, on my macOS machine with clang 14.0.0, the messages say "mutex"
instead of "no_returns_in_pg_try," which is unfortunate since that's the
part that would clue me into what the problem is.  I suppose it'd be easy
enough to figure out after a grep or two, though.

> Clearly this would need a bunch more work, but it seems promising? I think
> there'd be other uses than this.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: No Callbacks on FATAL

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-13 16:14:11 +0300, Aleksander Alekseev wrote:
> > > Hm? MemoryContextDelete() unconditionally calls the
> > > callbacks. ShutdownPostgres() calls AbortOutOfAnyTransaction(). So if 
> > > there's
> > > an ongoing transaction, we'll call the reset callbacks on 
> > > TopMemoryContext and
> > > its children.
> >
> > Hmm ... I'd forgotten that we'd reach AbortOutOfAnyTransaction in
> > the FATAL code path.  It does seem like any memory contexts below
> > TopTransactionContext ought to get cleaned up then.
> 
> I wonder if this is a desired behavior. FATAL means a critical error
> local to a given backend, but not affecting shared memory, right? Is
> it generally safe to execute context memory callbacks having a FATAL
> error?

We need to roll back the in-progress transaction, otherwise we'd be in
trouble. Some resets are part of that. If the error actually corrupted local
state badly enough to break the transaction machinery, we'd need to PANIC out.

Greetings,

Andres Freund




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-13 Thread Jelte Fennema
> Just a quick single-issue review, but I agree with Maxim that having
> one PRNG, seeded once, would be simpler

I don't agree that it's simpler. Because now there's a mutex you have
to manage, and honestly cross-platform threading in C is not simple.
However, I attached two additional patches that implement this
approach on top of the previous patchset. Just to make sure that
this patch is not blocked on this.

> with the tangible benefit that it would eliminate weird behavior on
> simultaneous connections when the client isn't using OpenSSL.

This is true, but still I think in practice very few people have a libpq
that's compiled without OpenSSL support.

> I'm guessing a simple lock on a
> global PRNG would be less overhead than the per-connection
> strong_random machinery, too, but I have no data to back that up.

It might very well have less overhead, but neither of them should take
up any significant amount of time during connection establishment.

> The test seed could then be handled globally as well (envvar?) so that you
> don't have to introduce a debug-only option into the connection string.

Why is a debug-only envvar any better than a debug-only connection option?
For now I kept the connection option approach, since to me they seem pretty
much equivalent.
From 561dca3b9510464600b4da8d1397e7762f523568 Mon Sep 17 00:00:00 2001
From: Jelte Fennema 
Date: Fri, 13 Jan 2023 16:52:17 +0100
Subject: [PATCH v7 3/4] Make mutexes easier to use in libpq

For process global mutexes windows requires some different setup than
other OSes. This abstracts away that logic into the newly added pglock
and pgunlock functions. The main reason to do this refactoring is to
prepare for a new global mutex that will be added in a follow up commit.
---
 src/interfaces/libpq/fe-connect.c| 58 +++-
 src/interfaces/libpq/fe-secure-openssl.c | 33 +++---
 src/interfaces/libpq/libpq-int.h | 21 +
 3 files changed, 64 insertions(+), 48 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 18a07d810dc..69ed891703a 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -7424,36 +7424,17 @@ pqGetHomeDirectory(char *buf, int bufsize)
 static void
 default_threadlock(int acquire)
 {
-#ifdef ENABLE_THREAD_SAFETY
-#ifndef WIN32
-	static pthread_mutex_t singlethread_lock = PTHREAD_MUTEX_INITIALIZER;
-#else
-	static pthread_mutex_t singlethread_lock = NULL;
-	static long mutex_initlock = 0;
-
-	if (singlethread_lock == NULL)
-	{
-		while (InterlockedExchange(&mutex_initlock, 1) == 1)
-			 /* loop, another thread own the lock */ ;
-		if (singlethread_lock == NULL)
-		{
-			if (pthread_mutex_init(&singlethread_lock, NULL))
-				Assert(false);
-		}
-		InterlockedExchange(&mutex_initlock, 0);
-	}
-#endif
+	static pglock_t singlethread_lock = PGLOCK_INITIALIZER;
 	if (acquire)
 	{
-		if (pthread_mutex_lock(&singlethread_lock))
+		if (!pglock(&singlethread_lock))
 			Assert(false);
 	}
 	else
 	{
-		if (pthread_mutex_unlock(&singlethread_lock))
+		if (!pgunlock(&singlethread_lock))
 			Assert(false);
 	}
-#endif
 }
 
 pgthreadlock_t
@@ -7468,3 +7449,36 @@ PQregisterThreadLock(pgthreadlock_t newhandler)
 
 	return prev;
 }
+
+bool
+pglock(pglock_t * lock)
+{
+#ifdef WIN32
+	if (lock->mutex == NULL)
+	{
+		while (InterlockedExchange(&lock->mutex_initlock, 1) == 1)
+			 /* loop, another thread own the lock */ ;
+		if (lock->mutex == NULL)
+		{
+			if (pthread_mutex_init(&lock->mutex, NULL))
+				return false;
+		}
+		InterlockedExchange(&lock->mutex_initlock, 0);
+	}
+#endif
+	if (pthread_mutex_lock(&lock->mutex))
+	{
+		return false;
+	}
+	return true;
+}
+
+bool
+pgunlock(pglock_t * lock)
+{
+	if (pthread_mutex_unlock(&lock->mutex))
+	{
+		return false;
+	}
+	return true;
+}
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index ab2cbf045b8..c52c2ccf217 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -94,12 +94,7 @@ static bool ssl_lib_initialized = false;
 #ifdef ENABLE_THREAD_SAFETY
 static long crypto_open_connections = 0;
 
-#ifndef WIN32
-static pthread_mutex_t ssl_config_mutex = PTHREAD_MUTEX_INITIALIZER;
-#else
-static pthread_mutex_t ssl_config_mutex = NULL;
-static long win32_ssl_create_mutex = 0;
-#endif
+static pglock_t ssl_config_lock = PGLOCK_INITIALIZER;
 #endif			/* ENABLE_THREAD_SAFETY */
 
 static PQsslKeyPassHook_OpenSSL_type PQsslKeyPassHook = NULL;
@@ -744,21 +739,7 @@ int
 pgtls_init(PGconn *conn, bool do_ssl, bool do_crypto)
 {
 #ifdef ENABLE_THREAD_SAFETY
-#ifdef WIN32
-	/* Also see similar code in fe-connect.c, default_threadlock() */
-	if (ssl_config_mutex == NULL)
-	{
-		while (InterlockedExchange(&win32_ssl_create_mutex, 1) == 1)
-			 /* loop, another thread own the lock */ ;
-		if (ssl_config_mutex == NULL)
-		{
-			if (pthread_mutex_init(&ssl_config_mutex, NULL))
-				return -1;
-		}
-		InterlockedExchange(&win32_ssl_create_mutex, 0);
-	}
-#endif
-	if (pthread_mutex_lock(&ssl_config_mutex))
+	if 

Re: Transaction timeout

2023-01-13 Thread Nikolay Samokhvalov
On Thu, Jan 12, 2023 at 11:47 AM Andrey Borodin 
wrote:

> On Thu, Jan 12, 2023 at 11:24 AM Nathan Bossart
>  wrote:
> >
> > On Sun, Dec 18, 2022 at 12:53:31PM -0800, Andrey Borodin wrote:
> > > I've rewritten this part to correctly report all timeouts that did
> > > happen. However there's now a tricky comma-formatting code which was
> > > tested only manually.
>

Testing it again, a couple of questions

1) The current test set has only 2 simple cases – I'd suggest adding one
more (that one that didn't work in v1):

gitpod=# set transaction_timeout to '20ms';
SET
gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select
pg_sleep(.01); commit;
BEGIN
 pg_sleep
--

(1 row)

ERROR:  canceling statement due to transaction timeout


gitpod=# set statement_timeout to '20ms'; set transaction_timeout to 0; --
to test value for statement_timeout and see that it doesn't fail
SET
SET
gitpod=# begin; select pg_sleep(.01); select pg_sleep(.01); select
pg_sleep(.01); commit;
BEGIN
 pg_sleep
--

(1 row)

 pg_sleep
--

(1 row)

 pg_sleep
--

(1 row)

COMMIT


2) Testing for a longer transaction (2 min), in a gitpod VM (everything is
local, no network involved)

// not sure what's happening here, maybe some overheads that are not
related to the implementation,
// but the goal was to see how precise the limiting is for longer
transactions

gitpod=# set transaction_timeout to '2min';
SET
gitpod=# begin;
BEGIN
gitpod=*# select now(), clock_timestamp(), pg_sleep(3) \watch 1
Fri 13 Jan 2023 03:49:24 PM UTC (every 1s)

  now  |clock_timestamp| pg_sleep
---+---+--
 2023-01-13 15:49:22.906924+00 | 2023-01-13 15:49:24.088728+00 |
(1 row)

[...]

Fri 13 Jan 2023 03:51:18 PM UTC (every 1s)

  now  |clock_timestamp| pg_sleep
---+---+--
 2023-01-13 15:49:22.906924+00 | 2023-01-13 15:51:18.179579+00 |
(1 row)

ERROR:  canceling statement due to transaction timeout

gitpod=!#
gitpod=!# rollback;
ROLLBACK
gitpod=# select timestamptz '2023-01-13 15:51:18.179579+00' - '2023-01-13
15:49:22.906924+00';
?column?
-
 00:01:55.272655
(1 row)

gitpod=# select interval '2min' - '00:01:55.272655';
?column?
-
 00:00:04.727345
(1 row)

gitpod=# select interval '2min' - '00:01:55.272655' - '4s';
?column?
-
 00:00:00.727345
(1 row)

– it seems we could (should) have one more successful "1s wait, 3s sleep"
iteration here, ~727ms somehow wasted in a loop, quite a lot.


properly sizing attnums in pg_get_publication_tables

2023-01-13 Thread Ted Yu
Hi,
I was looking at commit b7ae03953690a1dee455ba3823cc8f71a72cbe1d .

In `pg_get_publication_tables`, attnums is allocated with size
`desc->natts`. However, since some columns may be dropped, this size may be
larger than necessary.
When `nattnums > 0` is false, there is no need to allocate the `attnums`
array. In the current formation, `attnums` should be freed in this scenario.

Please take a look at the patch which moves the allocation to inside the
`if (nattnums > 0)` block.

Thanks


proper-sizing-of-attnums.patch
Description: Binary data


Re: Fix condition in shm_toc and remove unused function shm_toc_freespace.

2023-01-13 Thread Zhang Mingli
Hi,

Regards,
Zhang Mingli
On Jan 12, 2023, 16:54 +0800, Richard Guo , wrote:
>
> On Thu, Jan 12, 2023 at 2:50 PM Zhang Mingli  wrote:
> > On Jan 12, 2023, 14:34 +0800, Zhang Mingli , wrote:
> > > Some conditions in shm_toc_insert and shm_toc_allocate are bogus, like:
> > >   if (toc_bytes + nbytes > total_bytes || toc_bytes + nbytes < 
> > > toc_bytes)Remove the condition `toc_bytes + nbytes < toc_bytes` and take 
> > > a sizeof(shm_entry) into account in shm_toc_allocate though
> > > shm_toc_allocate does that too.
> >   shm_toc_insert does that too, and  we can report error earlier.
>
> I don't think we should consider sizeof(shm_toc_entry) in the 'if'
> condition in shm_toc_allocate, as this function is not in charge of
> allocating a new TOC entry.  That's what shm_toc_insert does.
Thanks for review.
Make sense.
Even if we reserve a sizeof(shm_toc_entry) in shm_toc_allocate, it could happen
that there is no space left when shm_toc_insert runs, since other processes may
take space in between.
Patch updated.


v1-0001-Fix-condition-in-shm_toc-and-remove-unused-functi.patch
Description: Binary data


Re: PL/Python: Fix return in the middle of PG_TRY() block.

2023-01-13 Thread Xing Guo
Hi Nathan.

On 1/13/23, Nathan Bossart  wrote:
> On Thu, Jan 12, 2023 at 11:19:29PM +0800, Xing Guo wrote:
>> I was running static analyser against PostgreSQL and found there're 2
>> return statements in PL/Python module which is not safe. Patch is
>> attached.
>
> Is the problem that PG_exception_stack and error_context_stack aren't
> properly reset?

Yes, it is.

>
>> @@ -690,12 +690,12 @@ PLy_trigger_build_args(FunctionCallInfo fcinfo,
>> PLyProcedure *proc, HeapTuple *r
>>  PyObject   *volatile pltdata = NULL;
>>  char   *stroid;
>>
>> +pltdata = PyDict_New();
>> +if (!pltdata)
>> +return NULL;
>> +
>>  PG_TRY();
>>  {
>> -pltdata = PyDict_New();
>> -if (!pltdata)
>> -return NULL;
>> -
>>  pltname = PLyUnicode_FromString(tdata->tg_trigger->tgname);
>>  PyDict_SetItemString(pltdata, "name", pltname);
>>  Py_DECREF(pltname);
>
> There's another "return" later on in this PG_TRY block.  I wonder if it's
> possible to detect this sort of thing at compile time.

Thanks for pointing it out! I missed some return statements, because my
checker treats 'return statement in PG_TRY() block' as an error: it stops
compiling when it finds the code pattern, and I forgot to double-check.

My checker is based on AST matcher and it's possible to turn it into a
clang-tidy plugin or something similar. I want to make it easy to
integrate with scan-build, so I implement it as a static analyzer
plugin :)

If anyone is interested in the checker itself, the source code can be
found here[1]:

> Note also:
> src/pl/tcl/pltcl.c- *   PG_CATCH();
> src/pl/tcl/pltcl.c- *   {
> src/pl/tcl/pltcl.c- *   pltcl_subtrans_abort(interp, oldcontext,
> oldowner);
> src/pl/tcl/pltcl.c- *   return TCL_ERROR;
> src/pl/tcl/pltcl.c- *   }
>
> This is documented once, repeated twice:
> src/pl/plpython/plpy_spi.c- *   PG_CATCH();
> src/pl/plpython/plpy_spi.c- *   {
> src/pl/plpython/plpy_spi.c- *   
> src/pl/plpython/plpy_spi.c- *   PLy_spi_subtransaction_abort(oldcontext,
> oldowner);
> src/pl/plpython/plpy_spi.c- *   return NULL;
> src/pl/plpython/plpy_spi.c- *   }
>
> I don't quite get why this would be a sane thing to do here when
> aborting a subtransaction in pl/python, but my experience with this
> code is limited.

Hi Michael,

I'll try to understand what's going on in the code you pasted. Thanks
for pointing it out!

> Example complaints:
>
> [776/1239 42  62%] Compiling C object
> src/pl/plpython/plpython3.so.p/plpy_exec.c.o
> ../../../../home/andres/src/postgresql/src/pl/plpython/plpy_exec.c:472:1:
> warning: no_returns_in_pg_try 'no_returns_handle' is not held on every path
> through here [-Wthread-safety-analysis]
> }
> ^
> ../../../../home/andres/src/postgresql/src/pl/plpython/plpy_exec.c:417:2:
> note: no_returns_in_pg_try acquired here
> PG_TRY();
> ^
> ../../../../home/andres/src/postgresql/src/include/utils/elog.h:424:7: note:
> expanded from macro 'PG_TRY'
> no_returns_start(no_returns_handle##__VA_ARGS__)
> ^
> ...
> [785/1239 42  63%] Compiling C object src/pl/tcl/pltcl.so.p/pltcl.c.o
> ../../../../home/andres/src/postgresql/src/pl/tcl/pltcl.c:1830:1: warning:
> no_returns_in_pg_try 'no_returns_handle' is not held on every path through
> here [-Wthread-safety-analysis]
> }
> ^
> ../../../../home/andres/src/postgresql/src/pl/tcl/pltcl.c:1809:2: note:
> no_returns_in_pg_try acquired here
> PG_CATCH();
> ^
> ../../../../home/andres/src/postgresql/src/include/utils/elog.h:433:7: note:
> expanded from macro 'PG_CATCH'
> no_returns_start(no_returns_handle##__VA_ARGS__)
> ^

Hi Andres,

Your patch looks interesting and useful. I will play with it in the
next following days. I'm burried with other works recently, so my
reply may delay.

> Not perfect digestible, but also not too bad. I pushed the
> no_returns_start()/no_returns_stop() calls into all the PG_TRY related
> macros,
> because that causes the warning to point to block that the problem is
> in. E.g. above the first warning points to PG_TRY, the second to
> PG_CATCH. It'd work to just put it into PG_TRY and PG_END_TRY.
>
>
> Clearly this would need a bunch more work, but it seems promising? I think
> there'd be other uses than this.
>
>
> I briefly tried to use it for spinlocks. Mostly works and detects things
> like
> returning with a spinlock held. But it does not like dynahash's habit of
> conditionally acquiring and releasing spinlocks.
>
> Greetings,
>
> Andres Freund
>

[1] 
https://github.com/higuoxing/clang-plugins/blob/main/lib/ReturnInPgTryBlockChecker.cpp

-- 
Best Regards,
Xing









On 1/13/23, Andres Freund  wrote:
> Hi,
>
> On 2023-01-12 21:49:00 -0800, Andres Freund wrote:
>> Clearly this would need a bunch more work, but it seems promising? I
>> think
>> there'd be other uses than this.
>>
>> I briefly tried 

Re: Get relid for a relation

2023-01-13 Thread Tom Lane
Amin  writes:
> In CustomScan cost estimator, where PlannerInfo and RelOptInfo are passed,
> I want to get access to the relation stats (for example pg_stat_all_tables)
> by calling pg_stat_fetch_stat_tabentry(). However, I don't have access to
> relid to pass to this function.

Sure you do.  The existing code, eg in selfuncs.c, does it about like
this:

RangeTblEntry *rte = planner_rt_fetch(rel->relid, root);

Assert(rte->rtekind == RTE_RELATION);
relid = rte->relid;
Assert(relid != InvalidOid);
...
vardata.statsTuple = SearchSysCache3(STATRELATTINH,
 ObjectIdGetDatum(relid),
 Int16GetDatum(colnum),
 BoolGetDatum(rte->inh));

This is maybe a bit confusing, in that rel->relid is a range
table index but rte->relid is an OID.

FWIW, I seriously doubt that the numbers kept by the pg_stat mechanisms
are what you want for query planning purposes.

regards, tom lane




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-01-13 Thread Reid Thompson
On Mon, 2023-01-09 at 18:31 -0800, Andres Freund wrote:
> Hi,
> 
> On 2023-01-05 13:44:20 -0500, Reid Thompson wrote:
> > This new field displays the current bytes of memory allocated to the
> > backend process. It is updated as memory for the process is
> > malloc'd/free'd. Memory allocated to items on the freelist is included in
> > the displayed value.
> 
> It doesn't actually malloc/free. It tracks palloc/pfree.

I will update the message

> 
> > Dynamic shared memory allocations are included only in the value displayed
> > for the backend that created them, they are not included in the value for
> > backends that are attached to them to avoid double counting.
> 
> As mentioned before, I don't think accounting DSM this way makes sense.

Understood, previously you noted 'There are a few uses of DSMs that track
shared resources, with the biggest likely being the stats for relations
etc'.  I'd like to come up with a solution to address this; identifying the
long term allocations to shared state and accounting for them such that they
don't get 'lost' when the allocating backend exits.  Any guidance or
direction would be appreciated. 

> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
> > @@ -407,6 +407,9 @@ StartAutoVacLauncher(void)
> >  
> >  #ifndef EXEC_BACKEND
> > case 0:
> > +   /* Zero allocated bytes to avoid double counting parent 
> > allocation */
> > +   pgstat_zero_my_allocated_bytes();
> > +
> > /* in postmaster child ... */
> > InitPostmasterChild();
> 
> 
> 
> > @@ -1485,6 +1488,9 @@ StartAutoVacWorker(void)
> >  
> >  #ifndef EXEC_BACKEND
> > case 0:
> > +   /* Zero allocated bytes to avoid double counting parent 
> > allocation */
> > +   pgstat_zero_my_allocated_bytes();
> > +
> > /* in postmaster child ... */
> > InitPostmasterChild();
> >  
> > diff --git a/src/backend/postmaster/postmaster.c 
> > b/src/backend/postmaster/postmaster.c
> > index eac3450774..24278e5c18 100644
> > --- a/src/backend/postmaster/postmaster.c
> > +++ b/src/backend/postmaster/postmaster.c
> > @@ -4102,6 +4102,9 @@ BackendStartup(Port *port)
> > {
> > free(bn);
> >  
> > +   /* Zero allocated bytes to avoid double counting parent 
> > allocation */
> > +   pgstat_zero_my_allocated_bytes();
> > +
> > /* Detangle from postmaster */
> > InitPostmasterChild();
> 
> 
> It doesn't at all seem right to call pgstat_zero_my_allocated_bytes() here,
> before even InitPostmasterChild() is called. Nor does it seem right to add the
> call to so many places.
> 
> Note that this is before we even delete postmaster's memory, see e.g.:
>   /*
>* If the PostmasterContext is still around, recycle the space; we don't
>* need it anymore after InitPostgres completes.  Note this does not 
> trash
>* *MyProcPort, because ConnCreate() allocated that space with malloc()
>* ... else we'd need to copy the Port data first.  Also, subsidiary 
> data
>* such as the username isn't lost either; see ProcessStartupPacket().
>*/
>   if (PostmasterContext)
>   {
>   MemoryContextDelete(PostmasterContext);
>   PostmasterContext = NULL;
>   }
> 
> calling pgstat_zero_my_allocated_bytes() before we do this will lead to
> undercounting memory usage, afaict.
> 

OK - I'll trace back through these and see if I can better locate and reduce the
number of invocations.

> > +/* Enum helper for reporting memory allocated bytes */
> > +enum allocation_direction
> > +{
> > +   PG_ALLOC_DECREASE = -1,
> > +   PG_ALLOC_IGNORE,
> > +   PG_ALLOC_INCREASE,
> > +};
> 
> What's the point of this?
> 
> 
> > +/* --
> > + * pgstat_report_allocated_bytes() -
> > + *
> > + *  Called to report change in memory allocated for this backend.
> > + *
> > + * my_allocated_bytes initially points to local memory, making it safe to 
> > call
> > + * this before pgstats has been initialized. allocation_direction is a
> > + * positive/negative multiplier enum defined above.
> > + * --
> > + */
> > +static inline void
> > +pgstat_report_allocated_bytes(int64 allocated_bytes, int 
> > allocation_direction)
> 
> I don't think this should take allocation_direction as a parameter, I'd make
> it two different functions.

Originally it was two functions; a suggestion was made in the thread to
consolidate them into a single function with a direction indicator, hence
the above.  I'm fine with converting it back to two separate functions.

> 
> > +   if (allocation_direction == PG_ALLOC_DECREASE &&
> > +   pg_sub_u64_overflow(*my_allocated_bytes, allocated_bytes, 
> > ))
> > +   {
> > +   *my_allocated_bytes = 0;
> > +   ereport(LOG,
> > +   errmsg("Backend %d deallocated %lld bytes, 

Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-01-13 Thread Zhang Mingli
Hi,

On Dec 27, 2022, 19:02 +0800, Melih Mutlu , wrote:
Hi,

Having  FORCE_NULL(*) and FORCE_NOT_NULL(*) sounds good, since postgres already 
has FORCE_QUOTE(*).

I just quickly tried out your patch. It worked for me as expected.

 One little suggestion:

+ if (cstate->opts.force_notnull_all)
+ {
+     int i;
+     for(i = 0; i < num_phys_attrs; i++)
+         cstate->opts.force_notnull_flags[i] = true;
+ }

Instead of setting force_null/force_notnull flags for all columns, what about 
simply setting "attnums" list to cstate->attnumlist?
Something like the following should be enough :
if (cstate->opts.force_null_all)
   attnums = cstate->attnumlist;
else
   attnums = CopyGetAttnums(tupDesc, cstate->rel, cstate->opts.force_null);
Thanks very much for the review.

I got your point, and we have to handle the case where there are no
force_* options at all.
So the code would look like:

```
List *attnums = NIL;

if (cstate->opts.force_notnull_all)
attnums = cstate->attnumlist;
else if (cstate->opts.force_notnull)
attnums = CopyGetAttnums(tupDesc, cstate->rel, cstate->opts.force_notnull);

if (attnums != NIL)
{
// process force_notnull columns

attnums = NIL; // to process other options later
}

if (cstate->opts.force_null_all)
attnums = cstate->attnumlist;
else if (cstate->opts.force_null)
attnums = CopyGetAttnums(tupDesc, cstate->rel, cstate->opts.force_null);

if (attnums != NIL)
{
// process force_null columns

attnums = NIL; // to process other options later
}
```
That seems a little odd.

Or, we could keep attnums as a local variable, and then the code would look like:

```
if (cstate->opts.force_notnull_all || cstate->opts.force_notnull)
{
if (cstate->opts.force_notnull_all)
attnums = cstate->attnumlist;
else
attnums = CopyGetAttnums(tupDesc, cstate->rel, cstate->opts.force_notnull);
// process force_notnull columns
}
```

Any other suggestions?


Regards,
Zhang Mingli





Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-01-13 Thread Justin Pryzby
On Fri, Jan 13, 2023 at 06:15:38PM +0530, Nitin Jadhav wrote:
> Hi,
> 
> The commit 7265dbffad7feac6ea9d373828583b5d3d152e07 has added a script
> in src/backend/utils/misc/check_guc that cross-checks the consistency
> of the GUCs with postgresql.conf.sample, making sure that its format
> is in line with what guc.c has. As per the commit message, the
> parameters which are not listed as NOT_IN_SAMPLE in guc.c should be
> present in postgresql.conf.sample. But I have observed a test case
> failure when a parameter that is listed as GUC_NO_SHOW_ALL in
> guc.c is present in postgresql.conf.sample. I feel this
> behaviour is not expected and should be fixed. I spent some time
> on the analysis and found that query [1] is used to fetch all the
> parameters which are not listed as NOT_IN_SAMPLE. But the pg_settings
> view does not return the parameters listed as GUC_NO_SHOW_ALL. Hence
> these records will be missed. Please share your thoughts. I would like
> to work on the patch if a fix is required.

Looks like you're right ; show_all_settings() elides settings marked
"noshow".

Do you know how you'd implement a fix ?

-- 
Justin




BF animal malleefowl reported a failure in 001_password.pl

2023-01-13 Thread houzj.f...@fujitsu.com
Hi,

I noticed one BF failure[1] when monitoring the BF for some other commit.

#   Failed test 'authentication success for method password, connstr 
user=scram_role: log matches'
#   at t/001_password.pl line 120.
#   '2023-01-13 07:33:46.741 EST [243628:5] LOG:  received 
SIGHUP, reloading configuration files
# 2023-01-13 07:33:46.742 EST [243662:1] [unknown] LOG:  connection received: 
host=[local]
# 2023-01-13 07:33:46.744 EST [243662:2] [unknown] LOG:  connection authorized: 
user=scram_role database=postgres application_name=001_password.pl
# 2023-01-13 07:33:46.748 EST [243662:3] 001_password.pl LOG:  statement: 
SELECT $$connected with user=scram_role$$
# '
# doesn't match '(?^:connection authenticated: identity="scram_role" 
method=password)'
# Looks like you failed 1 test of 79.
[08:33:47] t/001_password.pl  

After checking the test and log, I can see the test failed at the following 
code:

# For plain "password" method, all users should also be able to connect.
reset_pg_hba($node, 'all', 'all', 'password');
test_conn($node, 'user=scram_role', 'password', 0,
log_like =>
  [qr/connection authenticated: identity="scram_role" 
method=password/]);


From the log, the expected "xxx method=password" log was not output; a simple
"connection authorized: user=scram_role database=postgres" was output instead.
So it seems the connection happens before pg_ident.conf is actually reloaded?
Not sure if we need to do something to make sure the reload happens, because it
looks like a very rare failure which hasn't happened in the last 90 days.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=malleefowl=2023-01-13%2009%3A54%3A51

Best regards,
Hou zhijie





Re: No Callbacks on FATAL

2023-01-13 Thread Aleksander Alekseev
Hi hackers,

> > Hm? MemoryContextDelete() unconditionally calls the
> > callbacks. ShutdownPostgres() calls AbortOutOfAnyTransaction(). So if 
> > there's
> > an ongoing transaction, we'll call the reset callbacks on TopMemoryContext 
> > and
> > its children.
>
> Hmm ... I'd forgotten that we'd reach AbortOutOfAnyTransaction in
> the FATAL code path.  It does seem like any memory contexts below
> TopTransactionContext ought to get cleaned up then.

I wonder if this is a desired behavior. FATAL means a critical error
local to a given backend, but not affecting shared memory, right? Is
it generally safe to execute context memory callbacks having a FATAL
error?

> As you say, we really need more details to see what's happening here.

Yep, minimal steps to reproduce the issue would be much appreciated!

-- 
Best regards,
Aleksander Alekseev




Re: PG11 to PG14 Migration Slowness

2023-01-13 Thread Aleksander Alekseev
Hi Vigneshk,

>   I'm migrating our existing PG instances from PG11.4 to PG14.3. I have 
> around 5 million tables in a single database. When migrating using 
> pg_upgrade, it takes 3 hours for the process to complete. I'm not sure if 
> it's the intended behaviour or we're missing something here.

Thanks for reporting this. I would say this is more or less expected
behaviour. That being said, I think we could do better than that.

Could you identify the bottleneck or perhaps provide the minimal
automated steps (ideally, a script) to reproduce your issue in a clean
environment?

-- 
Best regards,
Aleksander Alekseev




Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-01-13 Thread Nitin Jadhav
Hi,

The commit 7265dbffad7feac6ea9d373828583b5d3d152e07 has added a script
in src/backend/utils/misc/check_guc that cross-checks the consistency
of the GUCs with postgresql.conf.sample, making sure that its format
is in line with what guc.c has. As per the commit message, the
parameters which are not listed as NOT_IN_SAMPLE in guc.c should be
present in postgresql.conf.sample. But I have observed a test case
failure when a parameter that is listed as GUC_NO_SHOW_ALL in
guc.c is present in postgresql.conf.sample. I feel this
behaviour is not expected and should be fixed. I spent some time
on the analysis and found that query [1] is used to fetch all the
parameters which are not listed as NOT_IN_SAMPLE. But the pg_settings
view does not return the parameters listed as GUC_NO_SHOW_ALL. Hence
these records will be missed. Please share your thoughts. I would like
to work on the patch if a fix is required.


[1]:
SELECT name FROM pg_settings WHERE NOT 'NOT_IN_SAMPLE' = ANY
(pg_settings_get_flags(name)) AND name <> 'config_file' ORDER BY 1;

Thanks & Regards,
Nitin Jadhav




Re: Amcheck verification of GiST and GIN

2023-01-13 Thread Jose Arthur Benetasso Villanova



On Sun, 8 Jan 2023, Andrey Borodin wrote:


On Sun, Jan 8, 2023 at 8:05 PM Andrey Borodin  wrote:


Please find the attached new version. In this patchset heapallindexed
flag is removed from GIN checks.


Uh... sorry, git-formatted wrong branch.
Here's the correct version. Double checked.



Hello again.

I applied the patch without errors / warnings and did the same tests. All 
working as expected.


The only thing that I found is that the docs still reference the old 
signature "gin_index_parent_check(index regclass, 
heapallindexed boolean) returns void".


--
Jose Arthur Benetasso Villanova




Re: Fix pg_publication_tables to exclude generated columns

2023-01-13 Thread Amit Kapila
On Thu, Jan 12, 2023 at 12:33 PM shiy.f...@fujitsu.com
 wrote:
>
> On Wed, Jan 11, 2023 2:40 PM Amit Kapila  wrote:
> >
> > On Wed, Jan 11, 2023 at 10:07 AM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > >> On Mon, Jan 9, 2023 11:06 PM Tom Lane  wrote:
> > > >>> We could just not fix it in the back branches.  I'd argue that this is
> > > >>> as much a definition change as a bug fix, so it doesn't really feel
> > > >>> like something to back-patch anyway.
> > >
> > > > So, if we don't backpatch then it could lead to an error when it
> > > > shouldn't have which is clearly a bug. I think we should backpatch
> > > > this unless Tom or others are against it.
> > >
> > > This isn't a hill that I'm ready to die on ... but do we have any field
> > > complaints about this?  If not, I still lean against a back-patch.
> > > I think there's a significant risk of breaking case A while fixing
> > > case B when we change this behavior, and that's something that's
> > > better done only in a major release.
> > >
> >
> > Fair enough, but note that there is a somewhat related problem for
> > dropped columns [1] as well. While reviewing that it occurred to me
> > that generated columns also have a similar problem which leads to this
> > thread (it would have been better if there is a mention of the same in
> > the initial email). Now, as symptoms are similar, I think we shouldn't
> > back-patch that as well, otherwise, it will appear to be partially
> > fixed. What do you think?
> >
> > [1] - https://www.postgresql.org/message-
> > id/OSZPR01MB631087C65BA81E1FEE5A60D2FDF59%40OSZPR01MB6310.jpnpr
> > d01.prod.outlook.com
> >
>
> I agree to only fix them on HEAD.
>
> I merged this patch and the one in [1] as they are similar problems. Please
> see the attached patch.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Exit walsender before confirming remote flush in logical replication

2023-01-13 Thread Amit Kapila
On Wed, Dec 28, 2022 at 8:18 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > > Firstly I considered 2, but I thought 1 seemed to be better.
> > > PSA the updated patch.
> > >
> >
> > I think even for logical replication we should check whether there is
> > any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
> > is the point to send the done message? Also, the caller of
> > WalSndDone() already has that check which is another reason why I
> > can't see why you didn't have the same check in function WalSndDone().
>
> I did not have a strong opinion here. Fixed.
>
> > BTW, even after fixing this, I think logical replication will behave
> > differently when due to some reason (like time-delayed replication)
> > send buffer gets full and walsender is not able to send data. I think
> > this will be less of an issue with physical replication because there
> > is a separate walreceiver process to flush the WAL which doesn't wait
> > but the same is not true for logical replication. Do you have any
> > thoughts on this matter?
>
> Yes, it may happen even if this work is done. And your analysis is correct
> that the receive buffer is rarely full in physical replication, because
> walreceiver immediately flushes WAL.
>

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, then shouldn't the time delay feature
also cause a similar problem for physical replication as well?

-- 
With Regards,
Amit Kapila.




RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-13 Thread houzj.f...@fujitsu.com
On Friday, January 13, 2023 1:43 PM Masahiko Sawada  
wrote:
> On Thu, Jan 12, 2023 at 9:34 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Thursday, January 12, 2023 7:08 PM Amit Kapila
>  wrote:
> > >
> > > On Thu, Jan 12, 2023 at 4:21 PM shveta malik 
> wrote:
> > > >
> > > > On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila
> > > > 
> > > wrote:
> > > > >
> > > > > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith
> > > > > 
> > > wrote:
> > > > > >
> > > > > >
> > > > > > doc/src/sgml/monitoring.sgml
> > > > > >
> > > > > > 5. pg_stat_subscription
> > > > > >
> > > > > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type,
> > > > > > wait_event FROM pg_stat_activity WHERE wait_event i
> > > > > >
> > > > > >   
> > > > > > > > > > > role="column_definition">
> > > > > > +   apply_leader_pid
> > > integer
> > > > > > +  
> > > > > > +  
> > > > > > +   Process ID of the leader apply worker, if this process is a
> apply
> > > > > > +   parallel worker. NULL if this process is a leader apply 
> > > > > > worker
> or a
> > > > > > +   synchronization worker.
> > > > > > +  
> > > > > > + 
> > > > > > +
> > > > > > + 
> > > > > > +   > > > > > + role="column_definition">
> > > > > > relid oid
> > > > > >
> > > > > >
> > > > > > OID of the relation that the worker is synchronizing; null 
> > > > > > for
> the
> > > > > > -   main apply worker
> > > > > > +   main apply worker and the parallel apply worker
> > > > > >
> > > > > >   
> > > > > >
> > > > > > 5a.
> > > > > >
> > > > > > (Same as general comment #1 about terminology)
> > > > > >
> > > > > > "apply_leader_pid" --> "leader_apply_pid"
> > > > > >
> > > > >
> > > > > How about naming this as just leader_pid? I think it could be
> > > > > helpful in the future if we decide to parallelize initial sync
> > > > > (aka parallel
> > > > > copy) because then we could use this for the leader PID of
> > > > > parallel sync workers as well.
> > > > >
> > > > > --
> > > >
> > > > I still prefer leader_apply_pid.
> > > > leader_pid does not tell which 'operation' it belongs to. 'apply'
> > > > gives the clarity that it is apply related process.
> > > >
> > >
> > > But then do you suggest that tomorrow if we allow parallel sync
> > > workers then we have a separate column leader_sync_pid? I think that
> > > doesn't sound like a good idea and moreover one can refer to docs for
> clarification.
> >
> > I agree that leader_pid would be better not only for future parallel
> > copy sync feature, but also it's more consistent with the leader_pid column 
> > in
> pg_stat_activity.
> >
> > And here is the version patch which addressed Peter's comments and
> > renamed all the related stuff to leader_pid.
> 
> Here are two comments on v79-0003 patch.

Thanks for the comments.

> 
> +/* Force to serialize messages if stream_serialize_threshold
> is reached. */
> +if (stream_serialize_threshold != -1 &&
> +(stream_serialize_threshold == 0 ||
> + stream_serialize_threshold < parallel_stream_nchunks))
> +{
> +parallel_stream_nchunks = 0;
> +return false;
> +}
> 
> I think it would be better if we show the log message ""logical replication 
> apply
> worker will serialize the remaining changes of remote transaction %u to a 
> file"
> even in stream_serialize_threshold case.

Agreed and changed.

> 
> IIUC parallel_stream_nchunks won't be reset if pa_send_data() failed due to 
> the
> timeout.

Changed.

Best Regards,
Hou zj



RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-13 Thread houzj.f...@fujitsu.com
On Friday, January 13, 2023 1:02 PM Masahiko Sawada  
wrote:
> 
> On Fri, Jan 13, 2023 at 1:28 PM Amit Kapila  wrote:
> >
> > On Fri, Jan 13, 2023 at 9:06 AM Amit Kapila 
> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 7:56 AM Peter Smith 
> wrote:
> > > >
> > >
> > > >
> > > > 3.
> > > >
> > > > > > > role="column_definition">
> > > > +   leader_pid integer
> > > > +  
> > > > +  
> > > > +   Process ID of the leader apply worker if this process is a 
> > > > parallel
> > > > +   apply worker; NULL if this process is a leader apply worker or
> does not
> > > > +   participate in parallel apply, or a synchronization worker
> > > > +  
> > > >
> > > > I felt this change is giving too many details and ended up just
> > > > muddying the water.
> > > >
> > >
> > > I see that we give a similar description for other parameters as well.
> > > For example leader_pid in pg_stat_activity,
> > >
> >
> > BTW, shouldn't we update leader_pid column in pg_stat_activity as well
> > to display apply leader PID for parallel apply workers? It will
> > currently display for other parallel operations like a parallel
> > vacuum, so I don't see a reason to not do the same for parallel apply
> > workers.
> 
> +1
> 
> The parallel apply workers have different properties than the parallel query
> workers since they execute different transactions and don't use group locking
> but it would be a good hint for users to show the leader and parallel apply
> worker processes are related. If users want to check only parallel query 
> workers
> they can use the backend_type column.

Agreed, and changed as suggested.

Attach the new version patch set which address the comments so far.

Best Regards,
Hou zj


v80-0004-Retry-to-apply-streaming-xact-only-in-apply-work.patch
Description:  v80-0004-Retry-to-apply-streaming-xact-only-in-apply-work.patch


v80-0001-Display-the-leader-apply-worker-s-PID-for-parall.patch
Description:  v80-0001-Display-the-leader-apply-worker-s-PID-for-parall.patch


v80-0002-Stop-extra-worker-if-GUC-was-changed.patch
Description: v80-0002-Stop-extra-worker-if-GUC-was-changed.patch


v80-0003-Add-GUC-stream_serialize_threshold-and-test-seri.patch
Description:  v80-0003-Add-GUC-stream_serialize_threshold-and-test-seri.patch


RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-13 Thread houzj.f...@fujitsu.com
On Friday, January 13, 2023 2:20 PM Peter Smith  wrote:
> 
> Here are some review comments for patch v79-0002.

Thanks for your comments.

> ==
> 
> General
> 
> 1.
> 
> I saw that earlier in this thread Hou-san [1] and Amit [2] also seemed to say
> there is not much point for this patch.
> 
> So I wanted to +1 that same opinion.
> 
> I feel this patch just adds more complexity for almost no gain:
> - reducing the 'max_apply_workers_per_subscription' seems not very
> common in the first place.
> - even when the GUC is reduced, at that point in time all the workers might 
> be in
> use so there may be nothing that can be immediately done.
> - IIUC the excess workers (for a reduced GUC) are going to get freed naturally
> anyway over time as more transactions are completed so the pool size will
> reduce accordingly.

I need to think it over, and we can have a detailed discussion after committing
the first patch. So I didn't address the comments for 0002 for now.

Best Regards,
Hou zj


Re: Non-decimal integer literals

2023-01-13 Thread Dean Rasheed
On Wed, 14 Dec 2022 at 05:47, Peter Eisentraut
 wrote:
>
> committed

Now that we have this for integer types, I think it's worth doing for
numeric as well, since the parser will now pass such things through to
numeric_in() when they don't fit in an int64, and it seems plausible
that at least some people might use non-decimal integers beyond
INT64MIN/MAX. Also, without such support in numeric_in(), the feature
looks a little incomplete:

SELECT -0x8000;
   ?column?
--
 -9223372036854775808
(1 row)

SELECT 0x8000;
ERROR:  invalid input syntax for type numeric: "0x8000"
LINE 1: select 0x8000;
   ^

One concern I had was what the performance would be like. I don't
really expect people to pass in the kinds of truly huge values that
numeric supports, but it can't be ruled out. So I gave it a go, to see
how hard it would be, and what the worst-case performance looks like.
(I included underscore-handling too, so that I could measure that at
the same time.)

The base-conversion algorithm is O(N^2), and the worst case before
overflow is with hex strings with around 108,000 digits, oct strings
with around 145,000 digits, or binary strings with around 435,000
digits. Each of those takes around 400ms to parse on my machine.
That's around the level at which I might consider adding
CHECK_FOR_INTERRUPTS()'s, but I think that it's probably not worth it,
given how unrealistic such huge inputs are in practice.

The other important thing is that this shouldn't impact the
performance when parsing regular decimal inputs. The bulk of the
non-decimal integer parsing is handled by a separate function, which
is called directly from numeric_in(), since non-decimal handling isn't
required at the set_var_from_str() level (used by the float4/8 ->
numeric conversion functions). I also re-arranged the numeric_in()
code somewhat, and was able to make substantial savings by reducing
the number of pg_strncasecmp() calls, and avoiding those calls
entirely for regular numbers that aren't NaN or Inf. Testing that with
COPY with a few million numbers of different sizes, I observed a
10-15% performance increase.

So I'm feeling quite good about the end result -- I set out hoping not
to make performance noticeably worse, but ended up making it
significantly better.

Regards,
Dean
From f129bcdaeaaa62d8ddaf6a8e6441183f46097687 Mon Sep 17 00:00:00 2001
From: Dean Rasheed 
Date: Fri, 13 Jan 2023 09:20:17 +
Subject: [PATCH 1/2] Add non-decimal integer support to type numeric.

This enhances the numeric type input function, adding support for
hexadecimal, octal, and binary integers of any size, up to the limits
of the numeric type.

Since 6fcda9aba8, such non-decimal integers have been accepted by the
parser as integer literals and passed through to numeric_in(). This
commit gives numeric_in() the ability to handle them.

While at it, simplify the handling of NaN and infinities, reducing the
number of calls to pg_strncasecmp(), and arrange for pg_strncasecmp()
to not be called at all for regular numbers. This gives a significant
performance improvement for decimal inputs, more than offsetting the
small performance hit of checking for non-decimal input.
---
 src/backend/utils/adt/numeric.c  | 355 +++
 src/test/regress/expected/numeric.out|  62 +++-
 src/test/regress/expected/numerology.out |  48 +--
 src/test/regress/sql/numeric.sql |  10 +
 4 files changed, 380 insertions(+), 95 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index a6409ecbee..ed592841dc 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -500,6 +500,11 @@ static void zero_var(NumericVar *var);
 static bool set_var_from_str(const char *str, const char *cp,
 			 NumericVar *dest, const char **endptr,
 			 Node *escontext);
+static bool set_var_from_non_decimal_integer_str(const char *str,
+ const char *cp, int sign,
+ int base, NumericVar *dest,
+ const char **endptr,
+ Node *escontext);
 static void set_var_from_num(Numeric num, NumericVar *dest);
 static void init_var_from_num(Numeric num, NumericVar *dest);
 static void set_var_from_var(const NumericVar *value, NumericVar *dest);
@@ -625,6 +630,8 @@ numeric_in(PG_FUNCTION_ARGS)
 	Node	   *escontext = fcinfo->context;
 	Numeric		res;
 	const char *cp;
+	const char *numstart;
+	int			sign;
 
 	/* Skip leading spaces */
 	cp = str;
@@ -636,70 +643,130 @@ numeric_in(PG_FUNCTION_ARGS)
 	}
 
 	/*
-	 * Check for NaN and infinities.  We recognize the same strings allowed by
-	 * float8in().
+	 * Process the number's sign. This duplicates logic in set_var_from_str(),
+	 * but it's worth doing here, since it simplifies the handling of
+	 * infinities and non-decimal integers.
 	 */
-	if (pg_strncasecmp(cp, "NaN", 3) == 0)
-	{
-		res = make_result(_nan);
-		cp += 3;
-	}
-	else if 

Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-13 Thread Amit Kapila
On Thu, Jan 12, 2023 at 8:25 AM Ted Yu  wrote:

> On Wed, Jan 11, 2023 at 6:54 PM Amit Kapila  wrote:
>>
>> On Wed, Jan 11, 2023 at 9:31 AM Ted Yu  wrote:
>> >
>> > On Tue, Jan 10, 2023 at 7:55 PM houzj.f...@fujitsu.com 
>> >  wrote:
>> >>
>> >> On Wednesday, January 11, 2023 10:21 AM Ted Yu  
>> >> wrote:
>> >> > /* First time through, initialize parallel apply worker state 
>> >> > hashtable. */
>> >> > if (!ParallelApplyTxnHash)
>> >> >
>> >> > I think it would be better if `ParallelApplyTxnHash` is created by the 
>> >> > first
>> >> > successful parallel apply worker.
>> >>
>> >> Thanks for the suggestion. But I am not sure if it's worth to changing the
>> >> order here, because It will only optimize the case where user enable 
>> >> parallel
>> >> apply but never get an available worker which should be rare. And in such 
>> >> a
>> >> case, it'd be better to increase the number of workers or disable the 
>> >> parallel mode.
>> >>
>> >
>> >
>> > I think even though the chance is rare, we shouldn't leak resource.
>> >
>>
>> But that is true iff we are never able to start the worker. Anyway, I
>> think it is probably fine either way but we can change it as per your
>> suggestion to make it more robust and probably for the code clarity
>> sake. I'll push this tomorrow unless someone thinks otherwise.
>>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Generate pg_stat_get_xact*() functions with Macros

2023-01-13 Thread Drouvot, Bertrand

Hi,

On 1/12/23 7:24 PM, Andres Freund wrote:

Hi,

On 2023-01-12 08:38:57 +0100, Drouvot, Bertrand wrote:

On 1/11/23 11:59 PM, Andres Freund wrote:

Now that this patch renames some fields


I don't mind renaming the fields - the prefixes really don't provide anything
useful. But it's not clear why this is related to this patch? You could just
include the f_ prefix in the macro, no?




Right, but the idea is to take the same approach as the one used in 
8018ffbf58 (where placing the prefixes in the macro would have been 
possible too).


I'm not super happy about that patch tbh.



Probably should remove PgStat_BackendFunctionEntry.


I think that would be a 3rd patch, agree?


Yep.




I now see that PG_STAT_GET_DBENTRY_FLOAT8 already exists, defined the same
way. But the name fields misleading enough that I'd be inclined to rename it?



PG_STAT_GET_FUNCENTRY_FLOAT8_MS looks good to me. Waiting on what we'll 
decide for the existing PG_STAT_GET_DBENTRY_FLOAT8 (so that I can align 
PG_STAT_GET_FUNCENTRY_FLOAT8 with it).


+1




Although I suspect this actually hints at an architectural thing that could be
fixed better: Perhaps we should replace find_tabstat_entry() with a version
returning a fully "reconciled" PgStat_StatTabEntry?  It feels quite wrong to
have that intimitate knowledge of the subtransaction stuff in pgstatfuncs.c
and about how the different counts get combined.

I think that'd allow us to move the definition of PgStat_TableStatus to
PgStat_TableXactStatus, PgStat_TableCounts to pgstat_internal.h. Which feels a
heck of a lot cleaner.


Yeah, I think that would be for a 4th patch, agree?


Yea. I am of multiple minds about the ordering. I can see benefits on fixing
the architectural issue before reducing duplication in the accessor with a
macro. The reason is that if we addressed the architectural issue, the
difference between the xact and non-xact version will be very minimal, and
could even be generated by the same macro.



Yeah, I do agree and I'm in favor of this ordering:

1) replace find_tabstat_entry() with a version returning a fully "reconciled" 
PgStat_StatTabEntry
2) remove prefixes
3) Introduce the new macros

And it looks to me like removing PgStat_BackendFunctionEntry can be done 
independently.

I'll first look at 1).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2023-01-13 Thread Drouvot, Bertrand

Hi,

On 1/11/23 9:27 PM, Andres Freund wrote:

Hi,

On 2023-01-06 10:52:06 +0100, Drouvot, Bertrand wrote:

The problem I have with that is that I saw a lot of flakiness in the tests due
to the race condition. So introducing them in that order just doesn't make a
whole lot of sense to me. 


You are right; it does not make sense to fix the race condition after 
introducing the TAP tests and the decoding logic. I'll reorder the 
sub-patches.


It's also something that can be committed
independently, I think.


Right, but could this race condition occur outside of the context of this new 
feature?
 

That's right, it started retrieving this information from the relation.

Then, Robert made a comment in [1] saying it's not safe to call
table_open() while holding a buffer lock.


The suggested path in earlier versions to avoid doing so was to make sure that
we pass down the Relation for the table into the necessary functions. Did you
explore that any further?


So, for gistXLogPageReuse() and _bt_delitems_delete() it is "easy" to pass 
the heap relation.
This is what was done in earlier versions of this patch series.

But we would need to define a way to propagate the heap relation to these 
two functions:

_bt_log_reuse_page()
vacuumRedirectAndPlaceholder()

When I first looked at it and saw the number of places where _bt_getbuf() is 
called, I preferred to look into the current proposal instead.

I will give it another look, also because I just realized that it could be 
beneficial
for vacuumRedirectAndPlaceholder() too, as per this comment:

"
/* XXX: providing heap relation would allow more pruning */
vistest = GlobalVisTestFor(NULL);
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Get access to the whole query in CustomScan path callback

2023-01-13 Thread Amin
In the root pointer, I cannot find any information about the relations in the
query other than the one being scanned. Is there any function call which can
be used to get access to them?

On Wed, Dec 21, 2022 at 9:46 AM Tom Lane  wrote:

> Amin  writes:
> > The goal is to have access to all the tables that are being scanned or
> will
> > be scanned as a part of the query. Basically, the callback looks like
> this:
>
> > typedef void (*set_rel_pathlist_hook_type) (PlannerInfo *root,
> > RelOptInfo *rel,
> > Index rti,
> > RangeTblEntry *rte);
>
> > Now, the problem is when there is a nested query, the function will be
> > called once for the parent query and once for the subquery. However, I
> need
> > access to the whole query in this function. There seems to be no
> CustomScan
> > callback before this that has the whole query passed to it. Is there any
> > way I can get access to the complete query (or all the relations in the
> > query) by using the parameters passed to this function? Or any other
> > workaround?
>
> Everything the planner knows is accessible via the "root" pointer.
>
> I very strongly question the idea that a custom scan provider should
> be doing what you say you want to do, but the info is there.
>
> regards, tom lane
>


Get relid for a relation

2023-01-13 Thread Amin
Hi,

In CustomScan cost estimator, where PlannerInfo and RelOptInfo are passed,
I want to get access to the relation stats (for example pg_stat_all_tables)
by calling pg_stat_fetch_stat_tabentry(). However, I don't have access to
relid to pass to this function. For a sample relation, when I hardcode the
relid (for example 16385), it works. However, RelOptInfo->relid is always 1
(for whatever relation the query is scanning). Why does this happen, and how
can I get access to the correct relid (16385) as in pg_stat_all_tables?

Thank you!


Re: Blocking execution of SECURITY INVOKER

2023-01-13 Thread Jeff Davis
On Thu, 2023-01-12 at 19:38 -0800, Andres Freund wrote:
> I.e. the default arguments where evaluated with the invoker's
> permissions, not
> the definer's, despite being controlled by the less privileged user.

This is a very interesting case. It also involves tricking the
superuser into executing their own function with the attacker's inputs.
That part is the same as the other case. What's intriguing here is that
it shows the function in the default expression can be SECURITY INVOKER,
and that really means it could be any builtin function as long as the
types work out.

For example:
=> create function trick(l pg_lsn = pg_switch_wal()) returns int
language plpgsql security definer as $$ begin return 42; end; $$;

If the superuser executes that, even though it's a SECURITY DEFINER
function owned by an unprivileged user, it will still call
pg_switch_wal().


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: [EXTERNAL] Re: [PATCH] Support using "all" for the db user in pg_ident.conf

2023-01-13 Thread Jelte Fennema
> Even if folks applying quotes
> would not be able anymore to replace the pattern, the risk seems a bit
> remote?

Yeah, I agree the risk is remote. To be clear, the main pattern I'm
worried about breaking is simply "\1", where people had put quotes
around \1 for no reason. All in all, I'm fine if 0003 gets merged,
but I'd also be fine if it doesn't. Both the risk and the advantage
seem fairly small.

> I don't see how much that's different from the recent discussion with
> regexps added for databases and users to pg_hba.conf.  And consistency
> sounds pretty good to me here.

It's not much different, except that here "all" and "+" also change their
meaning (for pg_hba.conf those special cases already existed). Mainly I called
it out because I realised this discussion was referenced in that commit too.

> Regexps can have commas

That's a really good reason to allow quoted regexes indeed. Even for pg_ident
entries, commas in unquoted regexes would cause the AuthToken parsing to fail.

Is there anything you still want to see changed about any of the patches?




Re: Blocking execution of SECURITY INVOKER

2023-01-13 Thread Andres Freund
Hi,

On 2023-01-12 23:38:50 -0800, Jeff Davis wrote:
> On Thu, 2023-01-12 at 19:29 -0800, Andres Freund wrote:
> > superuser:
> > # CREATE FUNCTION exec_su(p_sql text) RETURNS text LANGUAGE plpgsql
> > SECURITY DEFINER AS $$BEGIN RAISE NOTICE 'executing %', p_sql;
> EXECUTE p_sql; RETURN p_sql; END;$$;
> > # REVOKE ALL ON FUNCTION exec_su FROM PUBLIC ;
> 
> That can be solved by creating the function in a schema where ordinary
> users don't have USAGE:
> 
> CREATE TABLE trick_superuser(value text default admin.exec_su('ALTER
> USER less_privs SUPERUSER'));
> ERROR:  permission denied for schema admin

Doubtful. Leaving aside the practicalities of using dedicated schemas and
enforcing their use, there are plenty of functions in pg_catalog that a less
privileged user can use to do bad things.

Just think of set_config(), pg_read_file(), lo_create(), binary_upgrade_*(),
pg_drop_replication_slot()...

If the default values get evaluated, this is arbitrary code execution, even if
it requires a few contortions. And the same is true for evaluating *any*
expression.



> > And the admin likely can switch into the user context of
> > the less privileged user to perform operations in a safer context.
> 
> How would the admin do that? The malicious UDF can just "RESET SESSION
> AUTHORIZATION" to pop back out of the safer context.

I thought we had a reasonably convenient way, but now I am not sure
anymore. Might have been via a C helper function. It can be hacked together,
but this is an area that should be as unhacky as possible.


> If there's not a good way to do this safely now, then we should
> probably provide one.

Yea, particularly because we do have all the infrastructure for it
(c.f. SECURITY_LOCAL_USERID_CHANGE / SECURITY_RESTRICTED_OPERATION).

Greetings,

Andres Freund




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-13 Thread David Geier

Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote:
Do you have some performance comparison between TIMING ON and TIMING 
SAMPLING?


+1 to see some numbers compared to TIMING ON.

Mostly I'm wondering if the sampling based approach gains us enough to 
be worth it, once the patch to use RDTSC has hopefully landed (see [1]). 
I believe that with the RDTSC patch the overhead of TIMING ON is lower 
than the overhead of using ANALYZE with TIMING OFF in the first place. 
Hence, to be really useful, it would be great if on top of TIMING 
SAMPLING we could also lower the overhead of ANALYZE itself further 
(e.g. by using a fast path for the default EXPLAIN (ANALYZE, TIMING ON / 
SAMPLING)). Currently, InstrStartNode() and InstrStopNode() have a ton 
of branches, and without all the typically deactivated code the 
implementation would be very small and could be placed in an inlinable 
function.


[1] 
https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de


--
David Geier
(ServiceNow)