Re: Unclear problem reports

2022-02-04 Thread Julien Rouhaud
On Fri, Feb 04, 2022 at 01:38:14PM -0500, Bruce Momjian wrote:
> On Thu, Feb  3, 2022 at 12:28:05PM +0800, Julien Rouhaud wrote:
> 
> > One thing that might help would be to work on a troubleshooting section
> > on the wiki with some common problems and some clear instructions on
> > either how to try to fix the problem or provide needed information for
> > further debugging.  At least it would save the time for those steps and
> > one could quickly respond with that link, similarly to how we do for
> > the slow query questions wiki entry.
> 
> I think the challenge there is that the problems are all different, and
> if we had a pattern we could at least supply better errors and hints,
> unless I am missing something obvious.  Does someone want to go back
> through the bugs@ archive and see if they can find a pattern?

Yes, we definitely don't want to have a massive page for every single problem,
but there are probably some really frequent reports for which we can't really
improve the error, and for which we tend to ask the same questions every time;
those could benefit from such a page.

For instance, for index corruption, we can suggest using enable_* settings and
checking the query again to highlight a problem with the index, and hint about
system collation library upgrades and so on.  I probably answered a few
reports about this over the last couple of months, and others did too.

Some immediate errors when trying to start the instance could also be covered
(like the "no valid checkpoint record" problem), at least to warn users *not*
to use pg_resetwal.
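
As a concrete illustration, here is a minimal sketch of the kind of triage
steps such a wiki entry could give for suspected index corruption (the
table and index names are hypothetical; amcheck is the existing contrib
extension):

-- Compare results with and without the suspect index.
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM orders WHERE customer_id = 42;  -- forced seq scan
RESET enable_indexscan;
RESET enable_bitmapscan;
SELECT count(*) FROM orders WHERE customer_id = 42;  -- index scan again

-- If the two counts differ, verify the B-Tree structure with amcheck.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('orders_customer_id_idx');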




Re: Deparsing rewritten query

2022-02-04 Thread Julien Rouhaud
On Fri, Feb 04, 2022 at 12:45:05PM +0100, Pavel Stehule wrote:
> 
> ok, I don't have any problem with it. Then there is not necessarily any
> change, and I'll mark this patch as ready for committer.

Thanks Pavel!

I also realized that the CF entry description wasn't accurate anymore, so I
also fixed that.




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan  wrote:
> Right - it's practically inevitable that you'll need an
> anti-wraparound VACUUM to advance relfrozenxid right now. Technically
> it's possible to advance relfrozenxid in any VACUUM, but in practice
> it just never happens on a large table. You only need to get unlucky
> with one heap page, either by failing to get a cleanup lock, or (more
> likely) by setting even one single page all-visible but not all-frozen
> just once (once in any VACUUM that takes place between anti-wraparound
> VACUUMs).

Minor correction: That's a slight exaggeration, since we won't skip
groups of all-visible pages smaller than SKIP_PAGES_THRESHOLD
(32 blocks).

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark  wrote:
> On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan  wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> While I've seen all of the above cases trigger anti-wraparound vacuums,
> by far the majority of cases are not of these pathological forms.

Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).

> By far the majority of anti-wraparound vacuums are triggered by tables
> that are very large and so don't trigger regular vacuums for "long
> periods" of time and consistently hit the anti-wraparound threshold
> first.

autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14,
but only if you tune autovacuum_freeze_min_age with that goal in mind,
which probably doesn't happen very often.
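
For example, a hedged sketch of that kind of per-table tuning (the table
name is hypothetical; both reloptions exist on 13 and 14):

-- Trigger vacuums based on inserts, and freeze aggressively in them,
-- so relfrozenxid can be advanced before the anti-wraparound threshold.
ALTER TABLE big_append_only_table SET (
    autovacuum_vacuum_insert_scale_factor = 0.05,
    autovacuum_freeze_min_age = 0
);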

> There's nothing limiting how long "long periods" is and nothing tying
> it to the rate of xid consumption. It's quite common to have some
> *very* large mostly static tables in databases that have other tables
> that are *very* busy.
>
> The worst I've seen is a table that took 36 hours to vacuum in a
> database that consumed about a billion transactions per day... That's
> extreme but these days it's quite common to see tables that get
> anti-wraparound vacuums every week or so despite having < 1% modified
> tuples. And databases are only getting bigger and transaction rates
> faster...

Sounds very much like what I've been calling the freezing cliff. An
anti-wraparound VACUUM throws things off by suddenly dirtying many
more pages than the expected amount for a VACUUM against the table,
despite there being no change in workload characteristics. If you just
had to remove the dead tuples in such a table, then it probably
wouldn't matter if it happened earlier than expected.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Greg Stark
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan  wrote:
>
> My emphasis here has been on making non-aggressive VACUUMs *always*
> advance relfrozenxid, outside of certain obvious edge cases. And so
> with all the patches applied, up to and including the opportunistic
> freezing patch, every autovacuum of every table manages to advance
> relfrozenxid during benchmarking -- usually to a fairly recent value.
> I've focussed on making aggressive VACUUMs (especially anti-wraparound
> autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> user keeps canceling autovacuums, maybe due to automated script that
> performs DDL). That has taken priority over other goals, for now.

While I've seen all of the above cases trigger anti-wraparound vacuums,
by far the majority of cases are not of these pathological forms.

By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.

There's nothing limiting how long "long periods" is and nothing tying
it to the rate of xid consumption. It's quite common to have some
*very* large mostly static tables in databases that have other tables
that are *very* busy.

The worst I've seen is a table that took 36 hours to vacuum in a
database that consumed about a billion transactions per day... That's
extreme but these days it's quite common to see tables that get
anti-wraparound vacuums every week or so despite having < 1% modified
tuples. And databases are only getting bigger and transaction rates
faster...


-- 
greg




Re: Windows now has fdatasync()

2022-02-04 Thread Thomas Munro
On Sat, Feb 5, 2022 at 12:54 PM Robert Haas  wrote:
> On Fri, Feb 4, 2022 at 4:24 PM Thomas Munro  wrote:
> > I'm not proposing we change our default to this new level, because it
> > doesn't work on non-NTFS, an annoying complication.  This patch would
> > just provide something faster to put after "Alternatively".
>
> Hmm. I thought NTFS had kind of won the filesystem war on the Windows
> side of things. No?

Against FAT, yes, but there are also SMB/CIFS (network) and the new
ReFS (which we recently broke and then unbroke[1]).  I haven't tried
those things, lacking general Windows-fu, but I suppose they'd reject
this and we'd panic, because the docs say "file systems supported:
NTFS"[2].

[1] 
https://www.postgresql.org/message-id/flat/16854-905604506e23d5c0%40postgresql.org
[2] 
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntflushbuffersfileex




Re: [PATCH] Add min() and max() aggregate functions for xid8

2022-02-04 Thread Ken Kato

> +   PG_RETURN_FULLTRANSACTIONID((U64FromFullTransactionId(fxid1) >
> U64FromFullTransactionId(fxid2)) ? fxid1 : fxid2);
>
> Shouldn't we use FullTransactionIdFollows() to compare those two fxid
> values here, instead?
>
> +   PG_RETURN_FULLTRANSACTIONID((U64FromFullTransactionId(fxid1) <
> U64FromFullTransactionId(fxid2)) ? fxid1 : fxid2);
>
> Shouldn't we use FullTransactionIdPrecedes() to compare those two fxid
> values here, instead?
>
> Could you add the regression tests for those min() and max() functions
> for xid8?

Thank you for the comments.
I sent an old version of the patch by mistake.
This is the updated one.
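
For reference, a sketch of the kind of regression tests requested above
(assuming the patch is applied; the literals are only illustrative):

SELECT min(x), max(x)
  FROM (VALUES ('0'::xid8), ('5'::xid8), ('7'::xid8)) AS t(x);
-- expected: 0 | 7

SELECT min(x), max(x) FROM (SELECT NULL::xid8) AS t(x);
-- expected: NULL | NULL, since the strict transition functions skip NULLs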

Best wishes

--
Ken Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8754f2f89b..1b064b4feb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19973,7 +19973,7 @@ SELECT NULLIF(value, '(none)') ...
 values.  Available for any numeric, string, date/time, or enum type,
 as well as inet, interval,
 money, oid, pg_lsn,
-tid,
+tid, xid8,
 and arrays of any of these types.

Yes
@@ -19992,7 +19992,7 @@ SELECT NULLIF(value, '(none)') ...
 values.  Available for any numeric, string, date/time, or enum type,
 as well as inet, interval,
 money, oid, pg_lsn,
-tid,
+tid, xid8,
 and arrays of any of these types.

Yes
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index 9b4ceaea47..32009f81ae 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -286,6 +286,24 @@ xid8cmp(PG_FUNCTION_ARGS)
 		PG_RETURN_INT32(-1);
 }
 
+Datum
+xid8_larger(PG_FUNCTION_ARGS)
+{
+	FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0);
+	FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1);
+
+	PG_RETURN_FULLTRANSACTIONID((FullTransactionIdFollowsOrEquals(fxid1, fxid2)) ? fxid1 : fxid2);
+}
+
+Datum
+xid8_smaller(PG_FUNCTION_ARGS)
+{
+	FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0);
+	FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1);
+
+	PG_RETURN_FULLTRANSACTIONID((FullTransactionIdPrecedesOrEquals(fxid1, fxid2)) ? fxid1 : fxid2);
+}
+
 /*
  *	 COMMAND IDENTIFIER ROUTINES			 *
  */
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 137f6eef69..2843f4b415 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -149,6 +149,9 @@
 { aggfnoid => 'max(pg_lsn)', aggtransfn => 'pg_lsn_larger',
   aggcombinefn => 'pg_lsn_larger', aggsortop => '>(pg_lsn,pg_lsn)',
   aggtranstype => 'pg_lsn' },
+{ aggfnoid => 'max(xid8)', aggtransfn => 'xid8_larger',
+  aggcombinefn => 'xid8_larger', aggsortop => '>(xid8,xid8)',
+  aggtranstype => 'xid8' },
 
 # min
 { aggfnoid => 'min(int8)', aggtransfn => 'int8smaller',
@@ -214,6 +217,9 @@
 { aggfnoid => 'min(pg_lsn)', aggtransfn => 'pg_lsn_smaller',
   aggcombinefn => 'pg_lsn_smaller', aggsortop => '<(pg_lsn,pg_lsn)',
   aggtranstype => 'pg_lsn' },
+{ aggfnoid => 'min(xid8)', aggtransfn => 'xid8_smaller',
+  aggcombinefn => 'xid8_smaller', aggsortop => '<(xid8,xid8)',
+  aggtranstype => 'xid8' },
 
 # count
 { aggfnoid => 'count(any)', aggtransfn => 'int8inc_any',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7024dbe10a..62f36daa98 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -203,6 +203,12 @@
 { oid => '5071', descr => 'convert xid8 to xid',
   proname => 'xid', prorettype => 'xid', proargtypes => 'xid8',
   prosrc => 'xid8toxid' },
+{ oid => '5097', descr => 'larger of two',
+  proname => 'xid8_larger', prorettype => 'xid8', proargtypes => 'xid8 xid8',
+  prosrc => 'xid8_larger' },
+{ oid => '5098', descr => 'smaller of two',
+  proname => 'xid8_smaller', prorettype => 'xid8', proargtypes => 'xid8 xid8',
+  prosrc => 'xid8_smaller' },
 { oid => '69',
   proname => 'cideq', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'cid cid', prosrc => 'cideq' },
@@ -6564,6 +6570,9 @@
 { oid => '4189', descr => 'maximum value of all pg_lsn input values',
   proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'pg_lsn',
   proargtypes => 'pg_lsn', prosrc => 'aggregate_dummy' },
+{ oid => '5099', descr => 'maximum value of all xid8 input values',
+  proname => 'max', prokind => 'a', proisstrict => 'f', prorettype => 'xid8',
+  proargtypes => 'xid8', prosrc => 'aggregate_dummy' },
 
 { oid => '2131', descr => 'minimum value of all bigint input values',
   proname => 'min', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
@@ -6631,6 +6640,9 @@
 { oid => '4190', descr => 'minimum value of all pg_lsn input values',
   proname => 'min', prokind 

Re: [BUG]Update Toast data failure in logical replication

2022-02-04 Thread Amit Kapila
On Fri, Feb 4, 2022 at 9:06 PM Alvaro Herrera  wrote:
>
> I don't have a reason not to commit this patch.
>

It is not very clear to me from this, so just checking again: are you
fine with back-patching this as well?

>
>  I have some suggestions
> on the comments and docs though.
>

Thanks, your suggestions look good to me. I'll take care of these in
the next version.

-- 
With Regards,
Amit Kapila.




Documentation about PL transforms

2022-02-04 Thread Chapman Flack
Hi,

It looks to me as if the transforms feature for PLs was added in
cac7658, and support was added to plperl and plpython at that time,
with documentation added for CREATE FUNCTION, CREATE/DROP TRANSFORM,
and the catalogs, but nothing was added at that time to plhandler.sgml
to clarify that a PL handler itself must add code to look up and apply
such transforms, or nothing happens.

Without that information, a reader not in the know (it happened to me)
could get the idea from the CREATE TRANSFORM docs that the support
is more general than it is, and an incoming argument could already
contain the 'internal' something-specific-to-the-language-implementation
returned by a specified transform. (Which does lead to follow-on questions
like "who says an arbitrary language implementation's transform result
could be safely represented in an mmgr-managed argument list?"—it really
does make better sense upon learning the PL handler has to do the deed.
But I haven't spotted any place where the docs *say* that.)

I'm thinking plhandler.sgml is the only place that really needs to be
said; readers looking up CREATE TRANSFORM and using an existing PL that
supports it don't need to know how the sausage is made. (Maybe it is
worth mentioning there, though, that it isn't possible to develop
transforms for an arbitrary PL unless that PL applies transforms.)

I noticed the CREATE TRANSFORM docs show the argument list as
(argument_type [, ...]) even though check_transform_function will reject
any argument list of length other than 1 or with type other than internal.
Is that an overly-generic way to show the syntax, or is that a style
with precedent elsewhere in the docs? (I checked a couple obvious places
like CREATE OPERATOR, CREATE FUNCTION, CREATE TYPE but did not see similar
examples).
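
To make the constraint concrete, here is a sketch in the style of the
contrib hstore transforms, with hypothetical function names -- each
from/to function must take exactly one argument of type internal:

CREATE TRANSFORM FOR hstore LANGUAGE plpython3u (
    FROM SQL WITH FUNCTION my_hstore_to_python(internal),
    TO SQL WITH FUNCTION my_python_to_hstore(internal)
);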

As long as a PL handler has the sole responsibility for invoking
its transforms, I wonder if it would make sense to allow a PL to
define what its transforms should look like, maybe replacing
check_transform_function with a transform_validator for the PL.
I'm not proposing such a patch here, but I am willing to prepare
one for plhandler.sgml and maybe pltemplate.c documenting the current
situation, if nobody tells me I'm wrong about something here.

Regards,
-Chap




Re: Windows now has fdatasync()

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 4:24 PM Thomas Munro  wrote:
> I'm not proposing we change our default to this new level, because it
> doesn't work on non-NTFS, an annoying complication.  This patch would
> just provide something faster to put after "Alternatively".

Hmm. I thought NTFS had kind of won the filesystem war on the Windows
side of things. No?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: support for CREATE MODULE

2022-02-04 Thread Tom Lane
Nathan Bossart  writes:
> On Fri, Feb 04, 2022 at 05:12:43PM -0500, Tom Lane wrote:
>> On the whole I'm kind of allergic to inventing an entire new concept
>> that has as much overlap with extensions as modules seem to.  I'd
>> rather try to understand what functional requirements we're missing
>> and see if we can add them to extensions.  Yeah, we won't end up being
>> bug-compatible with Oracle's feature, but that's not a project goal
>> anyway --- and where we have tried to emulate Oracle closely, it's
>> often not worked out well (poster child: to_date).

> If I'm understanding correctly, you are suggesting that CREATE MODULE would
> be more like creating an extension without a control file, installation
> script, etc.  Objects would be added as members with something like ALTER
> MODULE ADD, and members could share properties such as access control.  And
> this might be possible to do by enhancing CREATE EXTENSION instead of
> creating a new catalog, dependency type, etc.

> I think this could be a nice way to sidestep the naming resolution problems
> discussed upthread while still allowing folks to group objects together in
> some meaningful way.  Also, while it might be nice to have separate CREATE
> EXTENSION and CREATE MODULE commands, perhaps they would use roughly the
> same code paths behind the scenes.

Hm. If the functional requirement is "group objects without needing
any out-in-the-filesystem infrastructure", then I could see defining
a module as being exactly like an extension except there's no such
infrastructure --- and hence no concept of versions, plus pg_dump
needs to act differently.  That's probably enough semantic difference
to justify using a separate word, even if we can share a lot of
code infrastructure.

regards, tom lane




Re: Support tab completion for upper character inputs in psql

2022-02-04 Thread Tom Lane
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
> Tom Lane  writes:
>>> We could do something hacky like matching case only when there's
>>> no longer any matching object names, but that might be too magic.

>> I experimented with that, and it actually doesn't seem as weird
>> as I feared.  See if you like this ...

> That's a reasonable compromise, and the implementation is indeed less
> hacky than one might have feared.  Although I think putting the
> `num_keywords` variable before `num_other` would read better.

After a few days of using that, I'm having second thoughts about it,
because it turns out to impede completion in common cases.  For
example,

regression=# set transa
TRANSACTION transaction_isolation   
transaction_deferrable  transaction_read_only   

It won't fill in "ction" because of the case discrepancy between the
offered alternatives.  Maybe this trumps the question of whether you
should be able to distinguish keywords from non-keywords in the menus.
If we case-folded the keywords as per your original proposal, it'd do
what I expect it to.

In previous releases, this worked as expected: "set transa"
immediately completes "ction", and then tabbing produces this
menu:

transaction transaction_isolation   
transaction_deferrable  transaction_read_only   

That probably explains why these keywords were lower-cased in
the previous code.  However, I don't think we should blame
your suggestion to upcase them, because the same problem arises
in other completion contexts where we offer keywords.  We should
solve it across the board, not just for these specific queries.

regards, tom lane




Re: support for CREATE MODULE

2022-02-04 Thread Nathan Bossart
On Fri, Feb 04, 2022 at 05:12:43PM -0500, Tom Lane wrote:
> If we invent modules I think they need to work more like extensions
> naming-wise, ie they group objects but have no effect for naming.
> Alternatively, you could insist that a module *is* a schema for naming
> purposes, with some extra properties but acting exactly like a schema
> for naming.  But I don't see what that buys you that you can't get
> from an extension that owns a schema that contains all its objects.
> 
> On the whole I'm kind of allergic to inventing an entire new concept
> that has as much overlap with extensions as modules seem to.  I'd
> rather try to understand what functional requirements we're missing
> and see if we can add them to extensions.  Yeah, we won't end up being
> bug-compatible with Oracle's feature, but that's not a project goal
> anyway --- and where we have tried to emulate Oracle closely, it's
> often not worked out well (poster child: to_date).

If I'm understanding correctly, you are suggesting that CREATE MODULE would
be more like creating an extension without a control file, installation
script, etc.  Objects would be added as members with something like ALTER
MODULE ADD, and members could share properties such as access control.  And
this might be possible to do by enhancing CREATE EXTENSION instead of
creating a new catalog, dependency type, etc.

I think this could be a nice way to sidestep the naming resolution problems
discussed upthread while still allowing folks to group objects together in
some meaningful way.  Also, while it might be nice to have separate CREATE
EXTENSION and CREATE MODULE commands, perhaps they would use roughly the
same code paths behind the scenes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: support for CREATE MODULE

2022-02-04 Thread Jim Mlodgenski
On Fri, Feb 4, 2022 at 5:12 PM Tom Lane  wrote:

>
> On the whole I'm kind of allergic to inventing an entire new concept
> that has as much overlap with extensions as modules seem to.  I'd
> rather try to understand what functional requirements we're missing
> and see if we can add them to extensions.  Yeah, we won't end up being
> bug-compatible with Oracle's feature, but that's not a project goal
> anyway --- and where we have tried to emulate Oracle closely, it's
> often not worked out well (poster child: to_date).
>
>
Developers need a way to group related objects in some fashion so
that they can be more easily reasoned about. Modules are just the
way to do this in the spec, but if we want to leverage extensions,
that will work too. Many users who need this only have access through
a database connection. They wouldn't have access to the file system
to add a control file or a script to add the objects. Enhancing
CREATE EXTENSION to be able to create some sort of empty extension
and then having the ability to add and remove objects from that
extension may be the minimum amount of functionality we would need
to give users the ability to group their objects.
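
A purely hypothetical sketch of that minimal workflow (the "empty"
CREATE EXTENSION form does not exist today; ALTER EXTENSION ... ADD and
DROP already do):

CREATE EXTENSION my_group;   -- hypothetical: no control file or script
CREATE FUNCTION f1() RETURNS int LANGUAGE sql AS 'SELECT 1';
ALTER EXTENSION my_group ADD FUNCTION f1();
ALTER EXTENSION my_group DROP FUNCTION f1();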


Re: do only critical work during single-user vacuum?

2022-02-04 Thread John Naylor
On Thu, Feb 3, 2022 at 7:30 PM Robert Haas  wrote:
>
> That error comes from GetNewTransactionId(), so that function must
> either try to execute DML or do something else which causes an XID to
> be assigned. I think a plain SELECT should work just fine.

It was indeed doing writes, so that much is not a surprise anymore.

On Thu, Feb 3, 2022 at 9:08 PM Robert Haas  wrote:
>
> On Thu, Feb 3, 2022 at 8:35 PM Andres Freund  wrote:
> > Yea, I'd have no problem leaving the "hard" limit somewhere closer to 1
> > million (although 100k should be just as well), but introduce a softer "only
> > vacuum/drop/truncate" limit a good bit before that.
>
> +1.

Since there seems to be agreement on this, I can attempt a stab at it,
but it'll be another week before I can do so.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: pg_walfile_name uses XLByteToPrevSeg

2022-02-04 Thread Nathan Bossart
On Fri, Feb 04, 2022 at 09:17:54AM -0500, Robert Haas wrote:
> On Fri, Feb 4, 2022 at 9:05 AM Ashutosh Bapat
>  wrote:
>> And it gives some surprising results as well
>> ---
>> #select pg_walfile_name('0/0'::pg_lsn);
>>       pg_walfile_name
>> --------------------------
>>  00000001FFFFFFFF000000FF
>> (1 row)
>> 
> 
> Yeah, that seems wrong.

It looks like it's been this way for a while (704ddaa).
pg_walfile_name_offset() has the following comment:

 * Note that a location exactly at a segment boundary is taken to be in
 * the previous segment.  This is usually the right thing, since the
 * expected usage is to determine which xlog file(s) are ready to archive.
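
A hedged illustration of that boundary rule, assuming the default 16MB
segment size and timeline 1:

SELECT pg_walfile_name('0/1000000'::pg_lsn);  -- exactly on a boundary
-- => 000000010000000000000000   (the *previous* segment)
SELECT pg_walfile_name('0/1000001'::pg_lsn);  -- one byte past it
-- => 000000010000000000000001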

I see a couple of discussions about this as well [0] [1].

[0] 
https://www.postgresql.org/message-id/flat/1154384790.3226.21.camel%40localhost.localdomain
[1] https://www.postgresql.org/message-id/flat/15952.1154827205%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas  wrote:
> On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan  wrote:
> > Application B will already block pruning by VACUUM operations against
> > application A's table, and so effectively blocks recording of the
> > resultant free space in the FSM in your scenario. And so application A
> > and application B should be considered the same application already.
> > That's just how VACUUM works.
>
> Sure ... but that also sucks. If we consider application A and
> application B to be the same application, then we're basing our
> decision about what to do on information that is inaccurate.

I agree that it sucks, but I don't think that it's particularly
relevant to the FSM prototype patch that I included with v7 of the
patch series. A heap page cannot be considered "closed" (either in the
specific sense from the patch, or in any informal sense) when it has
recently dead tuples.

At some point we should invent a fallback path for pruning, that
migrates recently dead tuples to some other subsidiary structure,
retaining only forwarding information in the heap page. But even that
won't change what I just said about closed pages (it'll just make it
easier to return and fix things up later on).

> I don't see it that way. There are cases where avoiding writes is
> better, and cases where trying to cram everything into the fewest
> possible pages is better. With the right test case you can make either
> strategy look superior.

The cost of reads is effectively much lower than writes with modern
SSDs, in TCO terms. Plus when a FSM strategy like the one from the
patch does badly according to a naive measure such as total table
size, that in itself doesn't mean that we do worse with reads. In
fact, it's quite the opposite.

The benchmark showed that v7 of the patch did very slightly worse on
overall space utilization, but far, far better on reads. In fact, the
benefits for reads were far in excess of any efficiency gains for
writes/WAL. The greatest bottleneck is almost always latency on
modern hardware [1]. It follows that keeping logically related data
grouped together is crucial. Far more important than potentially using
very slightly more space.

The story I wanted to tell with the FSM patch was about open and
closed pages being the right long term direction. More generally, we
should emphasize managing page-level costs, and deemphasize managing
tuple-level costs, which are much less meaningful.

> What I think your test case has going for it
> is that it is similar to something that a lot of people, really a ton
> of people, actually do with PostgreSQL. However, it's not going to be
> an accurate model of what everybody does, and therein lies some
> element of danger.

No question -- agreed.

> > Then what could you have confidence in?
>
> Real-world experience. Which is hard to get if we don't ever commit
> any patches, but a good argument for (a) having them tested by
> multiple different hackers who invent test cases independently and (b)
> some configurability where we can reasonably include it, so that if
> anyone does experience problems they have an escape.

I agree.

[1] https://dl.acm.org/doi/10.1145/1022594.1022596
-- 
Peter Geoghegan




Re: support for CREATE MODULE

2022-02-04 Thread Tom Lane
Alvaro Herrera  writes:
> He said:

>> [ This patch ] [...] allows for 3-part (or 4 with the database name)
>> naming of objects within the module. 

> He then showed the following example:

>> CREATE SCHEMA foo;
>> CREATE MODULE foo.bar
>> CREATE FUNCTION hello() [...]
>> SELECT foo.bar.hello();

> Notice the three-part name there.  That's a disaster, because the name
> resolution rules become very complicated or ambiguous.

Right.  We've looked into that before --- when I made pg_namespace,
I called it that because I thought we might be able to support
nested namespaces --- but it'd really create a mess.  In particular,
the SQL standard says what a three-part name means, and this ain't it.

If we invent modules I think they need to work more like extensions
naming-wise, ie they group objects but have no effect for naming.
Alternatively, you could insist that a module *is* a schema for naming
purposes, with some extra properties but acting exactly like a schema
for naming.  But I don't see what that buys you that you can't get
from an extension that owns a schema that contains all its objects.

On the whole I'm kind of allergic to inventing an entire new concept
that has as much overlap with extensions as modules seem to.  I'd
rather try to understand what functional requirements we're missing
and see if we can add them to extensions.  Yeah, we won't end up being
bug-compatible with Oracle's feature, but that's not a project goal
anyway --- and where we have tried to emulate Oracle closely, it's
often not worked out well (poster child: to_date).

regards, tom lane




Re: Release notes for February minor releases

2022-02-04 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Feb 04, 2022 at 04:29:19PM -0500, Tom Lane wrote:
>> I'm confused about this bit.  Are we still building bogus stats for
>> inheritance parents, or has that stopped?

> To make a long story long:
> - before 859b3003de, an ERROR occurred when a stats object was created on an
>   inheritance parent.
> - To avoid the error, 859b3003de changed to no longer build "whole tree" stats
>   on the table hierarchy.  Non-inherited stats were still collected.
> - However, the stats were *also* applied to inherited queries (FROM tbl*).

> 36c4bc6 stops applying stats that shouldn't be applied (and doesn't change
> their collection during ANALYZE).

Got it.  So we collected (and still do collect) non-inherited stats
for inheritance parents, but prior to 36c4bc6 those were mistakenly
applied in estimating both inheritance and non-inheritance queries.
Now we only do the latter.

(Since 269b532ae, this is all better in HEAD, but that's not
relevant for the back-branch release notes.)

regards, tom lane




Re: support for CREATE MODULE

2022-02-04 Thread Alvaro Herrera
On 2022-Feb-04, Swaha Miller wrote:

> The POC patch Jim Mlodgenski had on that thread was similar to your
> proposed way - modules were rows in pg_namespace, with the addition of
> a new column in pg_namespace for the nspkind (module or schema.)

I don't agree that what he proposed was similar to my proposal.  The
main problem I saw in his proposal is that he was saying that modules
would be *within* schemas, which is where I think the whole thing
derailed completely.

He said:

> [ This patch ] [...] allows for 3-part (or 4 with the database name)
> naming of objects within the module. 

He then showed the following example:

> CREATE SCHEMA foo;
> CREATE MODULE foo.bar
>   CREATE FUNCTION hello() [...]
> SELECT foo.bar.hello();

Notice the three-part name there.  That's a disaster, because the name
resolution rules become very complicated or ambiguous.  What I describe
avoids that disaster, by forcing there to be two-part names only: a
module lives on its own, not in a schema, so a function name always has
at most two parts (never three), and the first part can always be
resolved down to a pg_namespace row of some kind.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)




Re: Release notes for February minor releases

2022-02-04 Thread Justin Pryzby
On Fri, Feb 04, 2022 at 04:29:19PM -0500, Tom Lane wrote:
> > +  A previous bug fix disabled building of extended statistics for
> > +  old-style inheritance trees, but any existing statistics data was
> > +  not removed, and that data would become more and more out-of-date
> > +  over time.  Adjust the planner to ignore such data.  Extended
> > +  statistics for the individual child tables are still built and used,
> > +  however.
> 
> > The issue here isn't that old stats were never updated.  For inheritance, 
> > they
> > *were* updated with non-inherited stats (for SELECT FROM ONLY).  But then
> > "SELECT FROM tbl*" used the stats anyway...
> 
> I'm confused about this bit.  Are we still building bogus stats for
> inheritance parents, or has that stopped?

To make a long story long:
- before 859b3003de, an ERROR occurred when a stats object was created on an
  inheritance parent.
- To avoid the error, 859b3003de changed to no longer build "whole tree" stats
  on the table hierarchy.  Non-inherited stats were still collected.
- However, the stats were *also* applied to inherited queries (FROM tbl*).

36c4bc6 stops applying stats that shouldn't be applied (and doesn't change
their collection during ANALYZE).

20b9fa3 then changes to collect inherited stats on partitioned tables, since
they have no non-inherited stats, and since extended stats on partitioned
tables were intended to work since v10, and did work until 859b3003de stopped
collecting them.

In back branches, pg_statistic has inherited stats for partitioned tables, and
non-inherited stats otherwise.
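
A sketch of the distinction (hypothetical names, run on a release with
these fixes applied):

CREATE TABLE parent (a int, b int);
CREATE TABLE child () INHERITS (parent);
CREATE STATISTICS parent_stats ON a, b FROM parent;
ANALYZE parent;   -- collects non-inherited extended stats

-- Non-inherited query: the extended stats may legitimately be used.
EXPLAIN SELECT * FROM ONLY parent WHERE a = 1 AND b = 1;
-- Inherited query: before 36c4bc6 the same stats were wrongly applied.
EXPLAIN SELECT * FROM parent WHERE a = 1 AND b = 1;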

-- 
Justin




Re: Release notes for February minor releases

2022-02-04 Thread Tom Lane
Michael Banck  writes:
> On Fri, Feb 04, 2022 at 02:58:59PM -0500, Tom Lane wrote:
>> +  If this seems to have affected a table, REINDEX
>> +  should repair the damage.

> I don't think this is very helpful to the reader, are their indexes
> corrupt or not? If we can't tell them a specific command to run to
> check, can we at least mention that running amcheck would detect that
> (if it actually does)? Otherwise, I guess the only way to be sure is to
> just reindex every index? Or is this at least specific to b-trees?

Yeah, I wasn't too happy with that advice either, but it seems like the
best we can do [1].  I don't think we should advise blindly reindexing
your whole installation, because it's a very low-probability bug.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/20220204041935.gf4uwbdxddq6rffh%40alap3.anarazel.de




Re: Release notes for February minor releases

2022-02-04 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Feb 04, 2022 at 02:58:59PM -0500, Tom Lane wrote:
>> Please send comments/corrections by Sunday.

[ assorted comments ]

Thanks for the corrections.

> +  A previous bug fix disabled building of extended statistics for
> +  old-style inheritance trees, but any existing statistics data was
> +  not removed, and that data would become more and more out-of-date
> +  over time.  Adjust the planner to ignore such data.  Extended
> +  statistics for the individual child tables are still built and used,
> +  however.

> The issue here isn't that old stats were never updated.  For inheritance, they
> *were* updated with non-inherited stats (for SELECT FROM ONLY).  But then
> "SELECT FROM tbl*" used the stats anyway...

I'm confused about this bit.  Are we still building bogus stats for
inheritance parents, or has that stopped?

regards, tom lane




Re: [PATCH] Add reloption for views to enable RLS

2022-02-04 Thread walther

Christoph Heiss wrote:

As part of a customer project we are looking to implement a reloption for 
views which when set, runs the subquery as invoked by the user rather than the 
view owner, as is currently the case.
The rewrite rule's table references are then checked as if the user were 
referencing the table(s) directly.

This feature is similar to so-called 'SECURITY INVOKER' views in other DBMS. 


This is a feature I have long been looking for. I tested the patch (v5) 
and found two cases that I feel need to be either fixed or documented 
explicitly.



Case 1 - Schema privileges:

create schema a;
create table a.t();

create schema b;
create view b.v with (security_invoker=true) as table a.t;

create role alice;
grant usage on schema b to alice; -- missing schema a
grant select on table a.t, b.v to alice;

set role alice;
table a.t; -- ERROR: permission denied for schema a (good)
table b.v; -- no error (good or bad?)

User alice does not have USAGE privileges on schema a, only SELECT on
table a.t. A SELECT directly on the table fails as expected, but a SELECT on 
the view succeeds. I assume the schema access is checked when the query 
is parsed - and at that stage, the user is still the view owner?
The docs mention explicitly that *all* objects are accessed with invoker 
privileges, which is not the case.


Personally I actually like this. It allows keeping a view-based API in a
separate schema, while:

- preserving full RLS capabilities and
- forcing the user to go through the API, because direct access to the
data schema is not possible.


However, since this behavior was likely unintended until now, it raises
the question of whether there are any other privilege checks that do not
take the invoking user into account properly.



Case 2 - Chained views:

create schema a;
create table a.t();

create role bob;
grant create on database postgres to bob;
grant usage on schema a to bob;
set role bob;
create schema b;
create view b.v1 with (security_invoker=true) as table a.t;
create view b.v2 with (security_invoker=false) as table b.v1;

reset role;
create role alice;
grant usage on schema a, b to alice;
grant select on table a.t to bob;
grant select on table b.v2 to alice;

set role alice;
table b.v2; -- ERROR: permission denied for table t (bad)

When alice runs the SELECT on b.v2, the query on b.v1 is made with bob's
privileges, since bob is the view owner of b.v2. This is verified by the
fact that alice does not have privileges to access b.v1, yet no such
error is thrown.


b.v1 will then access a.t - and my first assumption was, that in this 
case a.t should be accessed by bob, still as the view owner of b.v2. 
Clearly, this is not the case as the permission denied error shows.


This is not actually a problem with this patch, I think, but just 
highlighting a quirk in the current implementation of views 
(security_invoker=false) in general: While the query will be run as
the view owner, the CURRENT_USER is still the invoker, even "after" the 
view. In other words, the current implementation is *not* the same as 
"security definer". It's somewhere between "security definer" and 
"security invoker" - a strange mix really.


Afaik this mix is not documented explicitly so far. But the 
security_invoker reloption exposes it in a much less expected way, so I 
only see two options really:
a) make the current implementation of security_invoker=false a true 
"security definer", i.e. change the CURRENT_USER "after" the view for good.

b) document the "security infiner/devoker" default behavior as a feature.

I really like a), as this would make a clear cut between security 
definer and security invoker views - but this would be a big breaking 
change, which I don't think is acceptable.


Best,

Wolfgang




Re: Release notes for February minor releases

2022-02-04 Thread Michael Banck
On Fri, Feb 04, 2022 at 02:58:59PM -0500, Tom Lane wrote:
> I've pushed the first draft for $SUBJECT at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ab22eea83169c8d0eb15050ce61cbe3d7dae4de6
> 
> Please send comments/corrections by Sunday.

> + 
> +  Fix corruption of HOT chains when a RECENTLY_DEAD tuple changes
> +  state to fully DEAD during page pruning (Andres Freund)
> + 
> +
> + 
> +  This happens when the last transaction that could see
> +  the tuple ends while the page is being pruned.  It was then possible
> +  to remove a tuple that is pointed to by a redirect item elsewhere on
> +  the page.  While that causes no immediate problem, when the item slot
> +  is re-used by some new tuple, that tuple would be thought to be part
> +  of the pre-existing HOT chain, creating a form of index corruption.

Well, ouchy.

> +  If this seems to have affected a table, REINDEX
> +  should repair the damage.

I don't think this is very helpful to the reader, are their indexes
corrupt or not? If we can't tell them a specific command to run to
check, can we at least mention that running amcheck would detect that
(if it actually does)? Otherwise, I guess the only way to be sure is to
just reindex every index? Or is this at least specific to b-trees?
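
For what it's worth, a sketch of what "running amcheck" across all B-Tree
indexes could look like -- whether bt_index_check() actually detects this
particular form of corruption is exactly the open question here:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT c.relname, bt_index_check(index => c.oid, heapallindexed => true)
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
JOIN pg_am a ON a.oid = c.relam AND a.amname = 'btree'
WHERE i.indisvalid;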

> + 
> +  Enforce standard locking protocol for TOAST table updates, to prevent
> +  problems with REINDEX CONCURRENTLY (Michael Paquier)
> + 
> +
> + 
> +  If applied to a TOAST table or TOAST table's index, REINDEX
> +  CONCURRENTLY tended to produce a corrupted index.  This
> +  happened because sessions updating TOAST entries released
> +  their ROW EXCLUSIVE locks immediately, rather
> +  than holding them until transaction commit as all other updates do.
> +  The fix is to make TOAST updates hold the table lock according to the
> +  normal rule.  Any existing corrupted indexes can be repaired by
> +  reindexing again.
> + 
> +

Same, but at least here the admin can cut it down to only those indexes
which were added concurrently.


Michael

-- 
Michael Banck
Teamleiter PostgreSQL-Team
Projektleiter
Tel.: +49 2166 9901-171
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: decoupling table and index vacuum

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 1:54 PM Robert Haas  wrote:
> > The constantly modified index will be entirely dependent on index
> > vacuuming here, and so an improved VACUUM design that allows that
> > particular index to be vacuumed more frequently could really improve
> > performance.
>
> Thanks for checking my work here - I wasn't 100% sure I had the right idea.

I should perhaps have emphasized individual leaf pages, rather than
total index size. Presumably we only need to store so many extra
versions per logical row at any one time, and we have a fair amount of
free space for extra versions on leaf pages -- typically 10% - 30% of
the space on the page (at least when it isn't already inevitable that
the page will eventually split due to simple inserts). A traditional
guarantee with B-Trees is that we get `ln(2)` space utilization with
random insertions, which leaves just over 30% of the page free for
later updates (1 - ln(2) ≈ 0.31) -- that's where I got 30% from.

There is a complementary effect with deduplication, since that buys us
time before the page has to split, making it much more likely that the
split will be avoided entirely. It's very nonlinear.

As I said, the competition between older snapshots and garbage
collection can still lead to version-driven page splits (especially
when non-hot updates are concentrated in one part of the key space, or
one leaf page). But that's arguably a good thing -- it naturally
relieves contention. There are actually designs that artificially
split B-Tree pages early [1], detecting concurrency control related
contention. Other systems need concurrency control in indexes, which
we avoid by having versions live in indexes.

[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper21.pdf
-- 
Peter Geoghegan




Re: Windows now has fdatasync()

2022-02-04 Thread Thomas Munro
On Sat, Feb 5, 2022 at 2:10 AM Robert Haas  wrote:
> So my impression is that today we ship defaults that are unsafe on
> Windows. I don't really understand much of what you are saying here,
> but if there's a way we can stop doing that, +1 from me, especially if
> it allows us to retain reasonable performance.

The PostgreSQL default in combination with the Windows default is
unsafe on SATA drives, but if you read our
documentation carefully you might figure out that you need to disable
write caching on your disk.
https://www.postgresql.org/docs/14/wal-reliability.html says:

"Consumer-grade IDE and SATA drives are particularly likely to have
write-back caches that will not survive a power failure. Many
solid-state drives (SSD) also have volatile write-back caches. [...]
On Windows, if wal_sync_method is open_datasync (the default), write
caching can be disabled by unchecking My Computer\Open\disk
drive\Properties\Hardware\Properties\Policies\Enable write caching on
the disk. Alternatively, set wal_sync_method to fsync or
fsync_writethrough, which prevent write caching."

I'm not proposing we change our default to this new level, because it
doesn't work on non-NTFS, an annoying complication.  This patch would
just provide something faster to put after "Alternatively".
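
For reference, a minimal sketch of applying that documented alternative
from SQL (wal_sync_method is a reload-level setting, so a restart should
not be needed):

ALTER SYSTEM SET wal_sync_method = 'fsync_writethrough';
SELECT pg_reload_conf();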

(Actually that whole page needs a refresh.  IDE is gone.  The
discussion about "recent" support for flushing caches is a bit out of
date, and in fact the problem with open_datasync on this OS is because
of problems with drivers and
https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA), not
FLUSH CACHE EXT.)

Here's an updated patch that fixes some silly problems seen on CI.
There's something a bit clunky/weird about this HAVE_FDATASYNC stuff,
maybe I can find a tidier way, but it's enough for experimentation:

For Mingw, I unconditionally add src/port/fdatasync.o to LIBOBJS, and
I unconditionally #define HAVE_FDATASYNC in win32_port.h, and I also
changed c.h's declaration of fdatasync() because it comes before
port.h is included (I guess I could move it down instead?).

For MSVC, I unconditionally add fdatasync.o to @pgportfiles, and
HAVE_FDATASYNC is defined in Solution.pm.

It'd be interesting to see pg_test_fsync.exe output on real hardware.
Here's what a little Windows 10 VM with a virtual SATA drive says:

C:\Users\thmunro>c:\pg\bin\pg_test_fsync.exe
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync (direct) 7914.872 ops/sec 126 usecs/op
open_datasync (buffered)   6593.056 ops/sec 152 usecs/op
fdatasync   650.317 ops/sec1538 usecs/op
fsync   512.423 ops/sec1952 usecs/op
fsync_writethrough  550.881 ops/sec1815 usecs/op
open_sync (direct)  n/a
open_sync (buffered)n/a
From ea61dab5b8be308bd82e56a5063f7acac70ad291 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 12 Dec 2021 14:13:35 +1300
Subject: [PATCH v2 1/2] Fix treatment of direct I/O in pg_test_fsync.

pg_test_fsync traditionally enabled O_DIRECT for the open_datasync and
open_sync tests.  In fact current releases of PostgreSQL only do that if
wal_level=minimal (less commonly used).  So, run the test both ways.

Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
---
 src/bin/pg_test_fsync/pg_test_fsync.c | 87 ---
 1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index ddabf64c58..1be017fcd5 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -303,12 +303,12 @@ test_sync(int writes_per_op)
 	printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
 
 	/*
-	 * Test open_datasync if available
+	 * Test open_datasync direct if available
 	 */
-	printf(LABEL_FORMAT, "open_datasync");
+	printf(LABEL_FORMAT, "open_datasync (direct)");
 	fflush(stdout);
 
-#ifdef OPEN_DATASYNC_FLAG
+#if defined(OPEN_DATASYNC_FLAG) && (defined(O_DIRECT) || defined(F_NOCACHE))
 	if ((tmpfile = open_direct(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1)
 	{
 		printf(NA_FORMAT, _("n/a*"));
@@ -333,6 +333,38 @@ test_sync(int writes_per_op)
 	printf(NA_FORMAT, _("n/a"));
 #endif
 
+	/*
+	 * Test open_datasync buffered if available
+	 */
+	printf(LABEL_FORMAT, "open_datasync (buffered)");
+	fflush(stdout);
+
+#ifdef OPEN_DATASYNC_FLAG
+	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1)
+	{
+		printf(NA_FORMAT, _("n/a*"));
+		fs_warning = true;
+	}
+	else
+	{
+		START_TIMER;
+		for (ops = 0; alarm_triggered == false; ops++)
+		{
+			for 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan  wrote:
> Application B will already block pruning by VACUUM operations against
> application A's table, and so effectively blocks recording of the
> resultant free space in the FSM in your scenario. And so application A
> and application B should be considered the same application already.
> That's just how VACUUM works.

Sure ... but that also sucks. If we consider application A and
application B to be the same application, then we're basing our
decision about what to do on information that is inaccurate.

> 5% larger seems like a lot more than would be typical, based on what
> I've seen. I don't think that the regression in this scenario can be
> characterized as "infinitely worse", or anything like it. On a long
> enough timeline, the potential upside of something like this is nearly
> unlimited -- it could avoid a huge amount of write amplification. But
> the potential downside seems to be small and fixed -- which is the
> point (bounding the downside). The mere possibility of getting that
> big benefit (avoiding the costs from heap fragmentation) is itself a
> benefit, even when it turns out not to pay off in your particular
> case. It can be seen as insurance.

I don't see it that way. There are cases where avoiding writes is
better, and cases where trying to cram everything into the fewest
possible pages is better. With the right test case you can make either
strategy look superior. What I think your test case has going for it
is that it is similar to something that a lot of people, really a ton
of people, actually do with PostgreSQL. However, it's not going to be
an accurate model of what everybody does, and therein lies some
element of danger.

> Then what could you have confidence in?

Real-world experience. Which is hard to get if we don't ever commit
any patches, but a good argument for (a) having them tested by
multiple different hackers who invent test cases independently and (b)
some configurability where we can reasonably include it, so that if
anyone does experience problems they have an escape.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: should vacuum's first heap pass be read-only?

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 3:18 PM Robert Haas  wrote:
> > What about recovery conflicts? Index vacuuming WAL records don't
> > require their own latestRemovedXid field, since they can rely on
> > earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
> > vacuuming removes always point to LP_DEAD items in the heap, it's safe
> > to lean on that.
>
> Oh, that's an interesting consideration.

You'd pretty much have to do "fake pruning", performing the same
computation as pruning without actually pruning.

> > In practice HOT generally works well enough that the number of heap
> > pages that prune significantly exceeds the subset that are also
> > vacuumed during the second pass over the heap -- at least when heap
> > fill factor has been tuned (which might be rare). The latter category
> > of pages is not reported on by the enhanced autovacuum logging added
> > to Postgres 14, so you might be able to get some sense of how this
> > works by looking at that.
>
> Is there an extra "not" in this sentence? Because otherwise it seems
> like you're saying that I should look at the information that isn't
> reported, which seems hard.

Sorry, yes. I meant "now" (as in, as of Postgres 14).

> In any case, I think this might be a death knell for the whole idea.
> It might be good to cut down the number of page writes by avoiding
> writing them twice -- but not at the expense of having the second pass
> have to visit a large number of pages it could otherwise skip. I
> suppose we could write only those pages in the first pass that we
> aren't going to need to write again later, but at that point I can't
> really see that we're winning anything.

Right. I think that we *can* be more aggressive about deferring heap
page vacuuming until another VACUUM operation with the conveyor belt
stuff. You may well end up getting almost the same benefit that way.

> Yes, I wondered about that. It seems like maybe a running VACUUM
> should periodically refresh its notion of what cutoff to use.

Yeah, Andres said something about this a few months ago. Shouldn't be
very difficult.

> I think my concern here is about not having too many different code
> paths from heap vacuuming. I agree that if we're going to vacuum
> without an on-disk conveyor belt we can use an in-memory substitute.

Avoiding special cases in vacuumlazy.c seems really important to me.

> However, to Greg's point, if we're using the conveyor belt, it seems
> like we want to merge the second pass of one VACUUM into the first
> pass of the next one.

But it's only going to be safe to do that with those dead TIDs (or
distinct generations of dead TIDs) that are known to already be
removed from all indexes, including indexes that have the least need
for vacuuming (often no direct need at all). I had imagined that we'd
want to do heap vacuuming in the same way as today with the dead TID
conveyor belt stuff -- it just might take several VACUUM operations
until we are ready to do a round of heap vacuuming.

For those indexes that use bottom-up index deletion effectively, the
index structure itself never really needs to be vacuumed to avoid
index bloat. We must nevertheless vacuum these indexes at some point,
just to be able to vacuum heap pages with LP_DEAD items safely.

Overall, I think that there will typically be stark differences among
indexes on the table, in terms of how much vacuuming each index
requires. And so the thing that drives us to perform heap vacuuming
will probably be heap vacuuming itself, and not the fact that each and
every index has become "sufficiently bloated".

> If this isn't entirely making sense, it may well be because I'm a
> little fuzzy on all of it myself.

I'm in no position to judge.   :-)

-- 
Peter Geoghegan




Re: support for CREATE MODULE

2022-02-04 Thread Swaha Miller
On Fri, Feb 4, 2022 at 10:46 AM Bruce Momjian  wrote:

> On Wed, Feb  2, 2022 at 06:28:30PM -0800, Swaha Miller wrote:
> > Hi,
> >
> > I'm following up from Jim's POC for adding MODULE to PostgreSQL. [1]
> >
> > My proposal implements modules as schema objects to be stored in a new
> > system catalog pg_module with new syntax for CREATE [OR REPLACE] MODULE,
>
> You might want to consider the steps that are most successful at getting
> Postgres patches accepted:
>
> https://wiki.postgresql.org/wiki/Todo
> Desirability -> Design -> Implement -> Test -> Review -> Commit
>
> In this case, you have jumped right to Implement.  Asking about
> Desirability first can avoid a lot of effort.
>

Thanks Bruce, that's really helpful. I was building on the discussion in
Jim's original thread, which is why I went ahead with another POC
implementation, but I do consider this implementation as part of the
desirability/design aspect and am hoping to get input from the community
to shape this proposal/patch.

Swaha Miller


Re: Release notes for February minor releases

2022-02-04 Thread Justin Pryzby
On Fri, Feb 04, 2022 at 02:58:59PM -0500, Tom Lane wrote:
> I've pushed the first draft for $SUBJECT at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ab22eea83169c8d0eb15050ce61cbe3d7dae4de6
> 
> Please send comments/corrections by Sunday.

+ 
+  Build extended statistics for partitioned tables (Justin Pryzby)
+ 
+ 
+  A previous bug fix disabled building of extended statistics for
+  old-style inheritance trees, but it also prevented building them for
+  partitioned tables, which was an unnecessary restriction.
+  If you have created statistics objects for partitioned tables, you
+  may wish to explicitly ANALYZE those tables after
+  installing this update, rather than waiting for auto-analyze to do it.

Since autoanalyze still doesn't process partitioned tables, the last part
should be removed.

Probably it should say "..you *should* explicitly ANALYZE these tables..".

+ 
+  Ignore extended statistics for inheritance trees (Justin Pryzby)
+ 
+ 
+  A previous bug fix disabled building of extended statistics for
+  old-style inheritance trees, but any existing statistics data was
+  not removed, and that data would become more and more out-of-date
+  over time.  Adjust the planner to ignore such data.  Extended
+  statistics for the individual child tables are still built and used,
+  however.
+ 

The issue here isn't that old stats were never updated.  For inheritance, they
*were* updated with non-inherited stats (for SELECT FROM ONLY).  But then
"SELECT FROM tbl*" used the stats anyway...

+ 
+  Fix failure of SP-GiST indexes when indexed column's data type is
+  binary-compatible with the declared input type of the operator class
+  (Tom Lane)
+ 

maybe: when *the*

-- 
Justin




Re: support for CREATE MODULE

2022-02-04 Thread Swaha Miller
On Thu, Feb 3, 2022 at 5:42 PM Alvaro Herrera 
wrote:

> On 2022-Feb-03, Pavel Stehule wrote:
>
> > The biggest problem is coexistence of Postgres's SEARCH_PATH  object
> > identification, and local and public scopes used in MODULEs or in
> Oracle's
> > packages.
> >
> > I can imagine MODULES as third level of database unit object grouping
> with
> > following functionality
> >
> > 1. It should support all database objects like schemas
>
> I proposed a way for modules to coexist with schemas that got no reply,
> https://postgr.es/m/202106021908.ddmebx7qfdld@alvherre.pgsql
>

Yes, I arrived a little after that thread, and used that as my starting
point.

The POC patch Jim Mlodgenski had on that thread was similar to your
proposed way - modules were rows in pg_namespace, with the addition of a
new column in pg_namespace for the nspkind (module or schema).

Jim had asked about two options -- the above approach and an alternative
one of having a pg_module system catalog -- and got some input:
https://www.postgresql.org/message-id/2897116.1622642280%40sss.pgh.pa.us

Picking up from there, I am exploring the alternative approach, including
what qualified names would look like and how they get interpreted
unambiguously when schemas and modules co-exist. (Also, being new to
PostgreSQL, this has been a great learning exercise for me on some of the
internals.)

With modules being their own type of object stored in a pg_module system
catalog, deconstructing a qualified path could always give precedence to
schema over module. So when there is function f() in schema s and another
function f() in module s in schema public, then s.f() would invoke the
former.
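
To make that concrete, here is a sketch -- the CREATE MODULE syntax below
is the proposed one (nothing PostgreSQL accepts today), and I'm handwaving
the part where a function gets placed inside a module:

    CREATE SCHEMA s;
    CREATE FUNCTION s.f() RETURNS int LANGUAGE sql AS 'SELECT 1';

    CREATE MODULE s;   -- a module named s, living in schema public
    -- ...assume module s also ends up containing a function f()...

    SELECT s.f();      -- schema takes precedence: invokes the first f()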

What are some other directions we might want to take this patch?

> I still think that that idea is valuable; it would let us create
> "private" routines, for example, which are good for encapsulation.
> But the way it interacts with schemas means we don't end up with a total
> mess in the namespace resolution rules.  I argued that modules would
> only have functions, and maybe a few other useful object types, but not
> *all* object types, because we don't need all object types to become
> private.  For example, I don't think I would like to have data types or
> casts to be private, so they can only be in a schema and they cannot be
> in a module.
>
> Of course, that idea of modules would also ease porting large DB-based
> applications from other database systems.
>

Indeed. Looking at commercial databases, both Oracle and Microsoft SQL
Server have implemented module-like structures.

Swaha Miller


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas  wrote:
> While I agree that there's some case to be made for leaving settled
> pages well enough alone, your criterion for settled seems pretty much
> accidental.

I fully admit that I came up with the FSM heuristic with TPC-C in
mind. But you have to start somewhere.

Fortunately, the main benefits of this patch series (avoiding the
freeze cliff during anti-wraparound VACUUMs, often avoiding
anti-wraparound VACUUMs altogether) don't depend on the experimental
FSM patch at all. I chose to post that now because it seemed to help
with my more general point about qualitatively different pages, and
freezing at the page level.

> Imagine a system where there are two applications running,
> A and B. Application A runs all the time and all the transactions
> which it performs are short. Therefore, when a certain page is not
> modified by transaction A for a short period of time, the page will
> become all-visible and will be considered settled. Application B runs
> once a month and performs various transactions all of which are long,
> perhaps on a completely separate set of tables. While application B is
> running, pages take longer to settle not only for application B but
> also for application A. It doesn't make sense to say that the
> application is in control of the behavior when, in reality, it may be
> some completely separate application that is controlling the behavior.

Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.

VACUUM isn't a passive observer of the system -- it's another
participant. It both influences and is influenced by almost everything
else in the system.

> I can see that this could have significant advantages under some
> circumstances. But I think it could easily be far worse under other
> circumstances. I mean, you can have workloads where you do some amount
> of read-write work on a table and then go read only and sequential
> scan it an infinite number of times. An algorithm that causes the
> table to be smaller at the point where we switch to read-only
> operations, even by a modest amount, wins infinitely over anything
> else. But even if you have no change in the access pattern, is it a
> good idea to allow the table to be, say, 5% larger if it means that
> correlated data is colocated? In general, probably yes. If that means
> that the table fails to fit in shared_buffers instead of fitting, no.
> If that means that the table fails to fit in the OS cache instead of
> fitting, definitely no.

5% larger seems like a lot more than would be typical, based on what
I've seen. I don't think that the regression in this scenario can be
characterized as "infinitely worse", or anything like it. On a long
enough timeline, the potential upside of something like this is nearly
unlimited -- it could avoid a huge amount of write amplification. But
the potential downside seems to be small and fixed -- which is the
point (bounding the downside). The mere possibility of getting that
big benefit (avoiding the costs from heap fragmentation) is itself a
benefit, even when it turns out not to pay off in your particular
case. It can be seen as insurance.

> And to me, that kind of effect is why it's hard to gain much
> confidence in regards to stuff like this via laboratory testing. I
> mean, I'm glad you're doing such tests. But in a laboratory test, you
> tend not to have things like a sudden and complete change in the
> workload, or a random other application sometimes sharing the machine,
> or only being on the edge of running out of memory. I think in general
> people tend to avoid such things in benchmarking scenarios, but even
> if you include stuff like this, it's hard to know what to include that
> would be representative of real life, because just about anything
> *could* happen in real life.

Then what could you have confidence in?

-- 
Peter Geoghegan




Re: make MaxBackends available in _PG_init

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 3:13 PM Nathan Bossart  wrote:
> On Fri, Feb 04, 2022 at 08:46:39AM -0500, Robert Haas wrote:
> > For multixact.c, I think you should invent GetMaxOldestSlot() to avoid
> > confusion. Maybe it could be a static inline rather than a macro.
> >
> > Likewise, I think PROCARRAY_MAXPROCS, NumProcSignalSlots, and
> > NumBackendStatSlots should be replaced with things that look more like
> > function calls.
>
> Sorry, I did notice that it looked odd, and I should've done this in v8 and
> saved a round trip.  Here's a new revision with those macros converted to
> inline functions.  I've also dialed things back a little in some places
> where a new variable felt excessive.

Great. I'll take a look at this next week, but not right now, mostly
because it's Friday afternoon and if I commit it and anything breaks I
don't want to end up having to fix it on the weekend if I can avoid
it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: should vacuum's first heap pass be read-only?

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 3:05 PM Greg Stark  wrote:
> Whatever happened to the idea to "rotate" the work of vacuum. So all
> the work of the second pass would actually be deferred until the first
> pass of the next vacuum cycle.
>
> That would also have the effect of eliminating the duplicate work,
> both the  writes with the wal generation as well as the actual scan.
> The only heap scan would be "remove line pointers previously cleaned
> from indexes and prune dead tuples recording them to clean from
> indexes in future". The index scan would remove line pointers and
> record them to be removed from the heap in a future heap scan.

I vaguely remember previous discussions of this, but only vaguely, so
if there are threads on list feel free to send pointers. It seems to
me that in order to do this, we'd need some kind of way of storing the
TIDs that were found to be dead in one VACUUM so that they can be
marked unused in the next VACUUM - and the conveyor belt patches on
which Dilip's work is based provide exactly that machinery, which his
patches then leverage to do exactly that thing. But it feels like a
big, sudden change from the way things work now, and I'm trying to
think of ways to make it more incremental, and thus hopefully less
risky.

> The downside would mainly be in the latency before the actual tuples
> get cleaned up from the table. That is not so much of an issue as far
> as space these days with tuple pruning but is more and more of an
> issue with xid wraparound. Also, having to record the line pointers
> that have been cleaned from indexes somewhere on disk for the
> subsequent vacuum would be extra state on disk and we've learned that
> means extra complexity.

I don't think there's any XID wraparound issue here. Phase 1 does a
HOT prune, after which only dead line pointers remain, not dead
tuples. And those contain no XIDs. Phase 2 is only setting those dead
line pointers back to unused.
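
If you want to see that directly, the pageinspect extension shows it
nicely: a dead line pointer is just a slot with lp_flags = 3 (LP_DEAD) and
no tuple data, so its xmin/xmax come back NULL. For some test table t
whose page 0 has been pruned, something like:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- LP_DEAD items show lp_flags = 3 with NULL t_xmin/t_xmax: there are
    -- no XIDs left for wraparound to care about
    SELECT lp, lp_flags, t_xmin, t_xmax
    FROM heap_page_items(get_raw_page('t', 0));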

As for the other part, that's pretty much exactly the complexity that
I'm worrying about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: should vacuum's first heap pass be read-only?

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 2:16 PM Peter Geoghegan  wrote:
> On Thu, Feb 3, 2022 at 12:20 PM Robert Haas  wrote:
> > But maybe we should reconsider. What benefit do we get out of dirtying
> > the page twice like this, writing WAL each time? What if we went back
> > to the idea of having the first heap pass be read-only?
>
> What about recovery conflicts? Index vacuuming WAL records don't
> require their own latestRemovedXid field, since they can rely on
> earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
> vacuuming removes always point to LP_DEAD items in the heap, it's safe
> to lean on that.

Oh, that's an interesting consideration.

> > In fact, I'm
> > thinking we might want to go even further and try to prevent even hint
> > bit changes to the page during the first pass, especially because now
> > we have checksums and wal_log_hints. If our vacuum cost settings are
> > to believed (and I am not sure that they are) dirtying a page is 10
> > times as expensive as reading one from the disk. So on a large table,
> > we're paying 44 vacuum cost units per heap page vacuumed twice, when
> > we could be paying only 24 such cost units. What a bargain!
>
> In practice HOT generally works well enough that the number of heap
> pages that prune significantly exceeds the subset that are also
> vacuumed during the second pass over the heap -- at least when heap
> fill factor has been tuned (which might be rare). The latter category
> of pages is not reported on by the enhanced autovacuum logging added
> to Postgres 14, so you might be able to get some sense of how this
> works by looking at that.

Is there an extra "not" in this sentence? Because otherwise it seems
like you're saying that I should look at the information that isn't
reported, which seems hard.

In any case, I think this might be a death knell for the whole idea.
It might be good to cut down the number of page writes by avoiding
writing them twice -- but not at the expense of having the second pass
have to visit a large number of pages it could otherwise skip. I
suppose we could write only those pages in the first pass that we
aren't going to need to write again later, but at that point I can't
really see that we're winning anything.

> > Could we have our cake and eat it too by updating the FSM with
> > the amount of free space that the page WOULD have if we pruned it, but
> > not actually do so?
>
> Did you ever notice that VACUUM records free space after *it* prunes,
> using its own horizon? With a long running VACUUM operation, where
> unremoved "recently dead" tuples are common, it's possible that the
> amount of free space that's effectively available (available to every
> other backend) is significantly higher. And so we already record
> "subjective amounts of free space" -- though not necessarily by
> design.

Yes, I wondered about that. It seems like maybe a running VACUUM
should periodically refresh its notion of what cutoff to use.

> I'm not sure that what you're proposing here is the best way to go
> about it, but let's assume for a moment that it is. Can't you just
> simulate the conveyor belt approach, without needing a relation fork?
> Just store the same information in memory, accessed using the same
> interface, with a spillover path?

(I'm not sure it's best either.)

I think my concern here is about not having too many different code
paths from heap vacuuming. I agree that if we're going to vacuum
without an on-disk conveyor belt we can use an in-memory substitute.
However, to Greg's point, if we're using the conveyor belt, it seems
like we want to merge the second pass of one VACUUM into the first
pass of the next one. That is, if we start up a heap vacuum already
having a list of TIDs that can be marked unused, we want to do that
during the same pass of the heap that we prune and search for
newly-discovered dead TIDs. But we can't do that in the case where the
conveyor belt is only simulated, because our in-memory data structure
can't contain leftovers from a previous vacuum the way the on-disk
conveyor belt can. So it seems like the whole algorithm has to be
different. I'd like to find a way to avoid that.

If this isn't entirely making sense, it may well be because I'm a
little fuzzy on all of it myself. But I hope it's clear enough that
you can figure out what it is that I'm worrying about. If not, I'll
keep trying to explain until we both reach a sufficiently non-fuzzy
state.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Daniel Gustafsson
> On 4 Feb 2022, at 21:03, Stephen Frost  wrote:

> I wonder about the 'not in tree' bit since it is in the header files,
> certainly for NSPR which I've been poking at due to this discussion.

What I meant was that the documentation on the website isn't published from
documentation source code (in whichever format) residing in the tree.

That being said, I take that back, since a git pull just now showed that
they had done just that 6 days ago.  It's just as messy and incomplete as
what is currently on the web (important APIs like NSS_InitContext are still
not even mentioned beyond a release note), but I think it stands a better
chance of success than before.

> I had hoped that they were generating the documentation on the webpage from
> what's in the header files, is that not the case then?


Not from what I can tell, no.

--
Daniel Gustafsson   https://vmware.com/





Re: make MaxBackends available in _PG_init

2022-02-04 Thread Nathan Bossart
On Fri, Feb 04, 2022 at 08:46:39AM -0500, Robert Haas wrote:
> For multixact.c, I think you should invent GetMaxOldestSlot() to avoid
> confusion. Maybe it could be a static inline rather than a macro.
> 
> Likewise, I think PROCARRAY_MAXPROCS, NumProcSignalSlots, and
> NumBackendStatSlots should be replaced with things that look more like
> function calls.

Sorry, I did notice that it looked odd, and I should've done this in v8 and
saved a round trip.  Here's a new revision with those macros converted to
inline functions.  I've also dialed things back a little in some places
where a new variable felt excessive.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 38c21e3cf36c591c524a65668594990a0472139c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 16 Aug 2021 02:59:32 +
Subject: [PATCH v9 1/1] Disallow external access to MaxBackends.

Presently, MaxBackends is externally visible, but it may still be
uninitialized in places where it would be convenient to use (e.g.,
_PG_init()).  This change makes MaxBackends static to postinit.c to
disallow such direct access.  Instead, MaxBackends should now be
accessed via GetMaxBackends().
---
 src/backend/access/nbtree/nbtutils.c|  4 +-
 src/backend/access/transam/multixact.c  | 31 +---
 src/backend/access/transam/twophase.c   |  3 +-
 src/backend/commands/async.c| 11 +--
 src/backend/libpq/pqcomm.c  |  3 +-
 src/backend/postmaster/auxprocess.c |  2 +-
 src/backend/postmaster/postmaster.c |  4 +-
 src/backend/storage/ipc/dsm.c   |  2 +-
 src/backend/storage/ipc/procarray.c | 25 --
 src/backend/storage/ipc/procsignal.c| 37 +
 src/backend/storage/ipc/sinvaladt.c |  4 +-
 src/backend/storage/lmgr/deadlock.c | 31 
 src/backend/storage/lmgr/lock.c | 22 +++---
 src/backend/storage/lmgr/predicate.c| 10 +--
 src/backend/storage/lmgr/proc.c | 17 ++--
 src/backend/utils/activity/backend_status.c | 88 +++--
 src/backend/utils/adt/lockfuncs.c   |  4 +-
 src/backend/utils/init/postinit.c   | 50 ++--
 src/include/miscadmin.h |  3 +-
 19 files changed, 218 insertions(+), 133 deletions(-)

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 6a651d8397..84164748b3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2072,7 +2072,7 @@ BTreeShmemSize(void)
 	Size		size;
 
 	size = offsetof(BTVacInfo, vacuums);
-	size = add_size(size, mul_size(MaxBackends, sizeof(BTOneVacInfo)));
+	size = add_size(size, mul_size(GetMaxBackends(), sizeof(BTOneVacInfo)));
 	return size;
 }
 
@@ -2101,7 +2101,7 @@ BTreeShmemInit(void)
 		btvacinfo->cycle_ctr = (BTCycleId) time(NULL);
 
 		btvacinfo->num_vacuums = 0;
-		btvacinfo->max_vacuums = MaxBackends;
+		btvacinfo->max_vacuums = GetMaxBackends();
 	}
 	else
 		Assert(found);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 806f2e43ba..f00c272521 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -282,12 +282,11 @@ typedef struct MultiXactStateData
 } MultiXactStateData;
 
 /*
- * Last element of OldestMemberMXactId and OldestVisibleMXactId arrays.
- * Valid elements are (1..MaxOldestSlot); element 0 is never used.
+ * Pointers to the state data in shared memory
+ *
+ * The index of the last element of the OldestMemberMXactId and
+ * OldestVisibleMXactId arrays can be obtained with GetMaxOldestSlot().
  */
-#define MaxOldestSlot	(MaxBackends + max_prepared_xacts)
-
-/* Pointers to the state data in shared memory */
 static MultiXactStateData *MultiXactState;
 static MultiXactId *OldestMemberMXactId;
 static MultiXactId *OldestVisibleMXactId;
@@ -342,6 +341,7 @@ static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 			   int nmembers, MultiXactMember *members);
 static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
+static inline int GetMaxOldestSlot(void);
 
 /* MultiXact cache management */
 static int	mxactMemberComparator(const void *arg1, const void *arg2);
@@ -662,6 +662,17 @@ MultiXactIdSetOldestMember(void)
 	}
 }
 
+/*
+ * Retrieve the index of the last element of the OldestMemberMXactId and
+ * OldestVisibleMXactId arrays.  Valid elements are (1..MaxOldestSlot); element
+ * 0 is never used.
+ */
+static inline int
+GetMaxOldestSlot(void)
+{
+	return GetMaxBackends() + max_prepared_xacts;
+}
+
 /*
  * MultiXactIdSetOldestVisible
  *		Save the oldest MultiXactId this transaction considers possibly live.
@@ -684,6 +695,7 @@ MultiXactIdSetOldestVisible(void)
 	if (!MultiXactIdIsValid(OldestVisibleMXactId[MyBackendId]))
 	{
 		MultiXactId oldestMXact;
+		int			maxOldestSlot = GetMaxOldestSlot();
 		int			i;
 
 		

Re: should vacuum's first heap pass be read-only?

2022-02-04 Thread Greg Stark
On Thu, 3 Feb 2022 at 12:21, Robert Haas  wrote:
>
> VACUUM's first pass over the heap is implemented by a function called
> lazy_scan_heap(), while the second pass is implemented by a function
> called lazy_vacuum_heap_rel(). This seems to imply that the first pass
> is primarily an examination of what is present, while the second pass
> does the real work. This used to be more true than it now is.

I've been out of touch for a while but I'm trying to catch up with the
progress of the past few years.

Whatever happened to the idea to "rotate" the work of vacuum. So all
the work of the second pass would actually be deferred until the first
pass of the next vacuum cycle.

That would also have the effect of eliminating the duplicate work,
both the  writes with the wal generation as well as the actual scan.
The only heap scan would be "remove line pointers previously cleaned
from indexes and prune dead tuples recording them to clean from
indexes in future". The index scan would remove line pointers and
record them to be removed from the heap in a future heap scan.

The downside would mainly be in the latency before the actual tuples
get cleaned up from the table. That is not so much of an issue as far
as space these days with tuple pruning but is more and more of an
issue with xid wraparound. Also, having to record the line pointers
that have been cleaned from indexes somewhere on disk for the
subsequent vacuum would be extra state on disk and we've learned that
means extra complexity.

-- 
greg




Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Stephen Frost
Greetings,

* Daniel Gustafsson (dan...@yesql.se) wrote:
> I am writing "done" above in quotes, since the documentation also needs to be
> updated, completed, rewritten, organized etc etc.  The above is an import of
> what was found, and is in a fairly poor state.  Unfortunately, it's still not
> in the tree where I personally believe documentation stands the best chance of
> being kept up to date.  The NSPR documentation is probably the best of
> the two, but it's also much less of a moving target.

I wonder about the 'not in tree' bit since it is in the header files,
certainly for NSPR which I've been poking at due to this discussion.  I
had hoped that they were generating the documentation on the webpage
from what's in the header files, is that not the case then?  Which is
more accurate?  If it's a simple matter of spending time going through
what's in the tree and making sure what's online matches that, I suspect
we could find some folks with time to work on helping them there.

If the in-tree stuff isn't accurate then that's a bigger problem, of
course.

> It is true that the documentation is poor and currently in bad shape with lots
> of broken links and heavily disorganized etc.  It's also true that I managed 
> to
> implement full libpq support without any crystal ball or help from the NSS
> folks.  The latter doesn't mean we can brush documentation concerns aside, but
> let's be fair in our criticism.

Agreed.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Feb  3, 2022 at 02:33:37PM -0500, Robert Haas wrote:
> > As a philosophical matter, I don't think it's great for us - or the
> > Internet in general - to be too dependent on OpenSSL. Software
> > monocultures are not great, and OpenSSL has near-constant security
> > updates and mediocre documentation. Now, maybe anything else we
> 
> I don't think it is fair to be criticizing OpenSSL for its mediocre
> documentation when the alternative being considered, NSS, has no public
> documentation.  Can the source-code-defined NSS documentation be
> considered better than the mediocre OpenSSL public documentation?

This simply isn't the case and wasn't even the case at the start of this
thread.  The NSPR documentation was only available through the header
files due to it being taken down from MDN.  The NSS documentation was
actually still there.  Looks like they've now (mostly) fixed the lack of
NSPR documentation, as noted in the recent email that I sent.

> For the record, I do like the idea of adding NSS, but I am concerned
> about its long-term maintenance, as you explained.

They've come out and explicitly said that the project is active and
maintained, and they've been doing regular releases.  I don't think
there's really any reason to think that it's not being maintained at
this point.

Thanks,

Stephen


signature.asc
Description: PGP signature


Release notes for February minor releases

2022-02-04 Thread Tom Lane
I've pushed the first draft for $SUBJECT at

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ab22eea83169c8d0eb15050ce61cbe3d7dae4de6

Please send comments/corrections by Sunday.

regards, tom lane




Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Feb  3, 2022 at 01:42:53PM -0500, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > On Tue, Feb  1, 2022 at 01:52:09PM -0800, Andres Freund wrote:
> > > > There's https://hg.mozilla.org/projects/nspr/file/tip/pr/src - which
> > > > is I think the upstream source.
> > > > 
> > > > A project without even a bare-minimal README at the root does have a
> > > > "internal only" feel to it...
> > > 
> > > I agree --- it is a library --- if they don't feel the need to publish
> > > the API, it seems to mean they want to maintain the ability to change it
> > > at any time, and therefore it is inappropriate for other software to
> > > rely on that API.
> > 
> > This is really not a reasonable representation of how this library has
> > been maintained historically nor is there any reason to think that their
> > policy regarding the API has changed recently.  They do have a
> > documented API and that hasn't changed- it's just that it's not easily
> > available in web-page form any longer and that's due to something
> > independent of the library maintenance.  They've also done a good job
> 
> So they have always been bad at providing an API, not just now, or that
> their web content disappeared and they haven't fixed it, for how long? 
> I guess that is better than the v8 case, but not much.  Is posting web
> content really that hard for them?

To be clear, *part* of the web-based documentation disappeared and
hasn't been replaced yet.  The NSS-specific pieces are actually still
available, it's the NSPR (which is a lower level library used by NSS)
part that was removed from MDN and hasn't been brought back yet, but
which does still exist as comments in the source of the library.

> > with maintaining the API as one would expect from a library and so this
> > really isn't a reason to avoid using it.  If there's actual specific
> > examples of the API not being well maintained and causing issues then
> > please point to them and we can discuss if that is a reason to consider
> > not depending on NSS/NSPR.
> 
> I have no specifics.

Then I don't understand where the claim you made that "it seems to mean
they want to maintain the ability to change it at any time" has any
merit.

> > > This is not the same as Postgres extensions needing to read the Postgres
> > > source code --- they are an important but edge use case and we never saw
> > > the need to standardize or publish the internal functions that must be
> > > studied and adjusted possibly for major releases.
> > 
> > I agree that extensions and public libraries aren't entirely the same
> > but I don't think it's all that unreasonable for developers that are
> > using a library to look at the source code for that library when
> > developing against it, that's certainly something I've done for a
> > number of different libraries.
> 
> Wow, you have a much higher tolerance than I do.  How do you even know
> which functions are the public API if you have to look at the source
> code?

Because... it's documented?  They have public (and private) .h files in
the source tree and the function declarations have large comment blocks
above them which provide a documented API.  I'm not talking about having
to decipher from the actual C code what's going on but just reading the
function header comment that provides the documentation of the API for
each of the functions, and there's larger blocks of comments at the top
of those .h files which provide more insight into how the functions in
that particular part of the system work and interact with each other.
Maybe those things would be better as separate README files like what we
do, but maybe not, and I don't see it as a huge failing that they chose
to use a big comment block at the top of their .h files to explain
things rather than separate README files.

Reading comments in code that I'm calling out to, even if it's in
another library (or another part of PG where the README isn't helping me
enough, or due to there not being a README for that particular thing)
almost seems typical, to me anyway.  Perhaps the exception being when
there are good man pages.

> I frankly think we need some public statement from the NSS developers
> before moving forward --- there are just too many red flags here, and
> once we support it, it will be hard to remove support for it.

They have made public statements regarding this and it's been linked to
already in this thread:

https://github.com/mdn/content/issues/12471

where they explicitly state that the project is alive and maintained,
further, it now also links to this:

https://bugzilla.mozilla.org/show_bug.cgi?id=1753127

Which certainly seems to have had a fair bit of action taken on it.

Indeed, it looks like they've got a lot of the docs up and online now,
including the documentation for the function that started much of this:


Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Daniel Gustafsson
> On 4 Feb 2022, at 19:22, Bruce Momjian  wrote:
> 
> On Thu, Feb  3, 2022 at 02:33:37PM -0500, Robert Haas wrote:
>> As a philosophical matter, I don't think it's great for us - or the
>> Internet in general - to be too dependent on OpenSSL. Software
>> monocultures are not great, and OpenSSL has near-constant security
>> updates and mediocre documentation. Now, maybe anything else we
> 
> I don't think it is fair to be criticizing OpenSSL for its mediocre
> documentation when the alternative being considered, NSS, has no public
> documentation.  Can the source-code-defined NSS documentation..

Not that it will shift the needle either way, but to give credit where credit
is due:

Both NSS and NSPR are documented, and have been since they were published by
Netscape in 1998.  The documentation does lack things, and some parts are quite
out of date.  That's true and undisputed even by the projects themselves who
state this: "It currently is very deprecated and likely incorrect or broken in
many places".

The recent issue was that Mozilla decided to remove all 3rd party projects (why
they consider their own code 3rd party is a mystery to me) from their MDN site,
and so NSS and NSPR were deleted with no replacement.  This was said to be
worked on but didn't happen and no docs were imported into the tree.  When
Daniel from curl (the other one, not I) complained, this caused enough momentum
to get this work going and it's now been "done".

   NSS: https://firefox-source-docs.mozilla.org/security/nss/
   NSPR: https://firefox-source-docs.mozilla.org/nspr/

I am writing "done" above in quotes, since the documentation also needs to be
updated, completed, rewritten, organized etc etc.  The above is an import of
what was found, and is in a fairly poor state.  Unfortunately, it's still not
in the tree where I personally believe documentation stands the best chance of
being kept up to date.  The NSPR documentation is probably the best of the two,
but it's also much less of a moving target.

It is true that the documentation is poor and currently in bad shape with lots
of broken links and heavily disorganized etc.  It's also true that I managed to
implement full libpq support without any crystal ball or help from the NSS
folks.  The latter doesn't mean we can brush documentation concerns aside, but
let's be fair in our criticism.

> ..be considered better than the mediocre OpenSSL public documentation?

OpenSSL has gotten a lot better in recent years, it's still not great or where
I would like it to be, but a lot better.

--
Daniel Gustafsson   https://vmware.com/





Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Robert Haas
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan  wrote:
> When VACUUM sees that all remaining/unpruned tuples on a page are
> all-visible, it isn't just important because of cost control
> considerations. It's deeper than that. It's also treated as a
> tentative signal from the application itself, about the data itself.
> Which is: this page looks "settled" -- it may never be updated again,
> but if there is an update it likely won't change too much about the
> whole page.

While I agree that there's some case to be made for leaving settled
pages well enough alone, your criterion for settled seems pretty much
accidental. Imagine a system where there are two applications running,
A and B. Application A runs all the time and all the transactions
which it performs are short. Therefore, when a certain page is not
modified by transaction A for a short period of time, the page will
become all-visible and will be considered settled. Application B runs
once a month and performs various transactions all of which are long,
perhaps on a completely separate set of tables. While application B is
running, pages take longer to settle not only for application B but
also for application A. It doesn't make sense to say that the
application is in control of the behavior when, in reality, it may be
some completely separate application that is controlling the behavior.

> The application is in charge, really -- not VACUUM. This is already
> the case, whether we like it or not. VACUUM needs to learn to live in
> that reality, rather than fighting it. When VACUUM considers a page
> settled, and the physical page still has a relatively large amount of
> free space (say 45% of BLCKSZ, a borderline case in the new FSM
> patch), "losing" so much free space certainly is unappealing. We set
> the free space to 0 in the free space map all the same, because we're
> cutting our losses at that point. While the exact threshold I've
> proposed is tentative, the underlying theory seems pretty sound to me.
> The BLCKSZ/2 cutoff (and the way that it extends the general rules for
> whole-page freezing) is intended to catch pages that are qualitatively
> different, as well as quantitatively different. It is a balancing act,
> between not wasting space, and the risk of systemic problems involving
> excessive amounts of non-HOT updates that must move a successor
> version to another page.

I can see that this could have significant advantages under some
circumstances. But I think it could easily be far worse under other
circumstances. I mean, you can have workloads where you do some amount
of read-write work on a table and then go read only and sequential
scan it an infinite number of times. An algorithm that causes the
table to be smaller at the point where we switch to read-only
operations, even by a modest amount, wins infinitely over anything
else. But even if you have no change in the access pattern, is it a
good idea to allow the table to be, say, 5% larger if it means that
correlated data is colocated? In general, probably yes. If that means
that the table fails to fit in shared_buffers instead of fitting, no.
If that means that the table fails to fit in the OS cache instead of
fitting, definitely no.

And to me, that kind of effect is why it's hard to gain much
confidence in regards to stuff like this via laboratory testing. I
mean, I'm glad you're doing such tests. But in a laboratory test, you
tend not to have things like a sudden and complete change in the
workload, or a random other application sometimes sharing the machine,
or only being on the edge of running out of memory. I think in general
people tend to avoid such things in benchmarking scenarios, but even
if you include stuff like this, it's hard to know what to include that
would be representative of real life, because just about anything
*could* happen in real life.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 2:00 PM John Naylor  wrote:
> Without having looked at the latest patches, there was something in
> the back of my mind while following the discussion upthread -- the
> proposed opportunistic freezing made a lot more sense if the
> earlier-proposed open/closed pages concept was already available.

Yeah, sorry about that. The open/closed pages concept is still
something I plan on working on. My prototype (which I never posted to
the list) will be rebased, and I'll try to target Postgres 16.

> > Freezing whole pages
> > 
>
> > It's possible that a higher cutoff (for example a cutoff of 80% of
> > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> > addition to the downsides from fragmentation -- it's far from a simple
> > trade-off. (Not that you should believe that 50% is special, it's just
> > a starting point for me.)
>
> How was the space utilization with the 50% cutoff in the TPC-C test?

The picture was mixed. To get the raw numbers, compare
pg-relation-sizes-after-patch-2.out and
pg-relation-sizes-after-master-2.out files from the drive link I
provided (to repeat, get them from
https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR)

Highlights: the largest table (the bmsql_order_line table) had a total
size of x1.006 relative to master, meaning that we did slightly worse
there. However, the index on the same table was slightly smaller
instead, probably because reducing heap fragmentation tends to make
the index deletion stuff work a bit better than before.

Certain small tables (bmsql_district and bmsql_warehouse) were
actually significantly smaller (less than half their size on master),
probably just because the patch can reliably remove LP_DEAD items from
heap pages, even when a cleanup lock isn't available.

The bmsql_new_order table was quite a bit larger, but it's not that
large anyway (1250 MB on master at the very end, versus 1433 MB with
the patch). This is a clear trade-off, since we get much less
fragmentation in the same table (as evidenced by the VACUUM output,
where there are fewer pages with any LP_DEAD items per VACUUM with the
patch). The workload for that table is characterized by inserting new
orders together, and deleting the same orders as a group later on. So
we're bound to pay a cost in space utilization to lower the
fragmentation.

> > blks_hit | 174,551,067,731
> > tup_fetched  | 124,797,772,450
>
> > Here is the same pg_stat_database info for master:
>
> > blks_hit | 283,015,966,386
> > tup_fetched  | 237,052,965,901
>
> That's impressive.

Thanks!

It's still possible to get a big improvement like that with something
like TPC-C because there are certain behaviors that are clearly
suboptimal -- once you look at the details of the workload, and
compare an imaginary ideal to the actual behavior of the system. In
particular, there is really only one way that the free space
management can work for the two big tables that will perform
acceptably -- the orders have to be stored in the same place to begin
with, and stay in the same place forever (at least to the extent that
that's possible).

-- 
Peter Geoghegan




Re: configure sets GCC=yes for clang

2022-02-04 Thread Tom Lane
Tomas Vondra  writes:
> I've been doing some experiments with compilers, and I've noticed that 
> when using clang configure still ends up setting
> GCC='yes'
> which seems somewhat strange.

I'm pretty sure that's intentional.  clang tries to be a gcc-alike,
and mostly succeeds, other than variations in command line options.
It's better to set GCC=yes and deal with the small discrepancies
than not set it and have to deal with clang as a whole separate case.

> FWIW I've noticed because when building 
> with "-Ofast" (yeah, I'm just playing with stuff) it fails like this:

> checking whether the C compiler still works... yes
> configure: error: do not put -ffast-math in CFLAGS

> which is in the $GCC=yes check, and clang doesn't even have such an option.

Well, the error is correct, even if clang doesn't spell the switch the
same way: you enabled fast-math optimizations and we don't want that.

[ wanders away wondering how meson handles this stuff ... ]

regards, tom lane




configure sets GCC=yes for clang

2022-02-04 Thread Tomas Vondra

Hi,

I've been doing some experiments with compilers, and I've noticed that 
when using clang, configure still ends up setting


GCC='yes'

which seems somewhat strange. FWIW I've noticed because when building 
with "-Ofast" (yeah, I'm just playing with stuff) it fails like this:


checking whether the C compiler still works... yes
configure: error: do not put -ffast-math in CFLAGS

which is in the $GCC=yes check, and clang doesn't even have such an option.

Not a huge issue, but I wonder if this might confuse configure in other 
places, or something.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: should vacuum's first heap pass be read-only?

2022-02-04 Thread Peter Geoghegan
On Thu, Feb 3, 2022 at 12:20 PM Robert Haas  wrote:
> But maybe we should reconsider. What benefit do we get out of dirtying
> the page twice like this, writing WAL each time? What if we went back
> to the idea of having the first heap pass be read-only?

What about recovery conflicts? Index vacuuming WAL records don't
require their own latestRemovedXid field, since they can rely on
earlier XLOG_HEAP2_PRUNE records instead. Since the TIDs that index
vacuuming removes always point to LP_DEAD items in the heap, it's safe
to lean on that.

> In fact, I'm
> thinking we might want to go even further and try to prevent even hint
> bit changes to the page during the first pass, especially because now
> we have checksums and wal_log_hints. If our vacuum cost settings are
> to believed (and I am not sure that they are) dirtying a page is 10
> times as expensive as reading one from the disk. So on a large table,
> we're paying 44 vacuum cost units per heap page vacuumed twice, when
> we could be paying only 24 such cost units. What a bargain!

In practice HOT generally works well enough that the number of heap
pages that prune significantly exceeds the subset that are also
vacuumed during the second pass over the heap -- at least when heap
fill factor has been tuned (which might be rare). The latter category
of pages is not reported on by the enhanced autovacuum logging added
to Postgres 14, so you might be able to get some sense of how this
works by looking at that.

> Could we have our cake and eat it too by updating the FSM with
> the amount of free space that the page WOULD have if we pruned it, but
> not actually do so?

Did you ever notice that VACUUM records free space after *it* prunes,
using its own horizon? With a long running VACUUM operation, where
unremoved "recently dead" tuples are common, it's possible that the
amount of free space that's effectively available (available to every
other backend) is significantly higher. And so we already record
"subjective amounts of free space" -- though not necessarily by
design.

> I'm thinking about this because of the "decoupling table and index
> vacuuming" thread, which I was discussing with Dilip this morning. In
> a world where table vacuuming and index vacuuming are decoupled, it
> feels like we want to have only one kind of heap vacuum. It pushes us
> in the direction of unifying the first and second pass, and doing all
> the cleanup work at once. However, I don't know that we want to use
> the approach described there in all cases. For a small table that is,
> let's just say, not part of any partitioning hierarchy, I'm not sure
> that using the conveyor belt approach makes a lot of sense, because
> the total amount of work we want to do is so small that we should just
> get it over with and not clutter up the disk with more conveyor belt
> forks -- especially for people who have large numbers of small tables,
> the inode consumption could be a real issue.

I'm not sure that what you're proposing here is the best way to go
about it, but let's assume for a moment that it is. Can't you just
simulate the conveyor belt approach, without needing a relation fork?
Just store the same information in memory, accessed using the same
interface, with a spillover path?

Ideally VACUUM will be able to use the conveyor belt for any table.
Whether or not it actually happens should be decided at the latest
possible point during VACUUM, based on considerations about the actual
number of dead items that we now need to remove from indexes, as well
as metadata from any preexisting conveyor belt.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread John Naylor
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan  wrote:
>
> On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan  wrote:
> > I do see some value in that, too. Though it's not going to be a way of
> > turning off the early freezing stuff, which seems unnecessary (though
> > I do still have work to do on getting the overhead for that down).
>
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.
>
> The fact that I am now including this new FSM patch (v7-0006-*patch)
> may seem like a case of expanding the scope of something that could
> well do without it. But hear me out! It's true that the new FSM patch
> isn't essential. I'm including it now because it seems relevant to the
> approach taken with block-driven freezing -- it may even make my
> general approach easier to understand.

Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.

> Freezing whole pages
> 

> It's possible that a higher cutoff (for example a cutoff of 80% of
> BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> addition to the downsides from fragmentation -- it's far from a simple
> trade-off. (Not that you should believe that 50% is special, it's just
> a starting point for me.)

How was the space utilization with the 50% cutoff in the TPC-C test?

> TPC-C raw numbers
> =
>
> The single most important number for the patch might be the decrease
> in both buffer misses and buffer hits, which I believe is caused by
> the patch being able to use index-only scans much more effectively
> (with modifications to BenchmarkSQL to improve the indexing strategy
> [1]). This is quite clear from pg_stat_database state at the end.
>
> Patch:

> blks_hit | 174,551,067,731
> tup_fetched  | 124,797,772,450

> Here is the same pg_stat_database info for master:

> blks_hit | 283,015,966,386
> tup_fetched  | 237,052,965,901

That's impressive.
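
For anyone who wants to pull the same counters from their own run, they
come straight from the cumulative statistics view, e.g. when connected to
the benchmark database:

    SELECT blks_hit, tup_fetched
    FROM pg_stat_database
    WHERE datname = current_database();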

--
John Naylor
EDB: http://www.enterprisedb.com




Re: decoupling table and index vacuum

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 1:46 PM Peter Geoghegan  wrote:
> That should work. All you need is a table with several indexes, and a
> workload consisting of updates that modify a column that is the key
> column for only one of the indexes. I would expect bottom-up index
> deletion to be 100% effective for the not-logically-modified indexes,
> in the sense that there will be no page splits -- provided there are
> no long held snapshots, and provided that the index isn't very small.
> If it is small (think of something like the pgbench_branches pkey),
> then even the occasional ANALYZE will act as a "long held snapshot"
> relative to the size of the index. And so then you might get one page
> split per original leaf page, but probably not a second, and very very
> probably not a third.
>
> The constantly modified index will be entirely dependent on index
> vacuuming here, and so an improved VACUUM design that allows that
> particular index to be vacuumed more frequently could really improve
> performance.

Thanks for checking my work here - I wasn't 100% sure I had the right idea.

> BTW, it's a good idea to avoid unique indexes in test cases where
> there is an index that you don't want to set LP_DEAD bits for, since
> _bt_check_unique() tends to do a good job of setting LP_DEAD bits,
> independent of the kill_prior_tuple thing. You can avoid using
> kill_prior_tuple by forcing bitmap scans, of course.

Thanks for this tip, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: decoupling table and index vacuum

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 1:15 PM Robert Haas  wrote:
> My second thought was that perhaps we can create a test scenario
> where, in one index, the deduplication and bottom-up index deletion
> and kill_prior_tuple mechanisms are very effective, and in another
> index, it's not effective at all. For example, maybe index A is an
> index on the primary key, and index B is a non-unique index on some
> column that we're updating with ever-increasing values (so that we
> never put new tuples into a page that could be productively cleaned
> up). I think what should happen in this case is that A should not grow
> in size even if it's never vacuumed, while B will require vacuuming to
> keep the size down.

That should work. All you need is a table with several indexes, and a
workload consisting of updates that modify a column that is the key
column for only one of the indexes. I would expect bottom-up index
deletion to be 100% effective for the not-logically-modified indexes,
in the sense that there will be no page splits -- provided there are
no long held snapshots, and provided that the index isn't very small.
If it is small (think of something like the pgbench_branches pkey),
then even the occasional ANALYZE will act as a "long held snapshot"
relative to the size of the index. And so then you might get one page
split per original leaf page, but probably not a second, and very very
probably not a third.
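
Something like this should do it (untested sketch, all names made up):

    CREATE TABLE t (pk int NOT NULL, grp int, ts int);
    CREATE INDEX t_grp_idx ON t (grp); -- never logically modified below
    CREATE INDEX t_ts_idx ON t (ts);   -- key column changed by every UPDATE
    INSERT INTO t SELECT g, g % 100, g FROM generate_series(1, 1000000) g;

    -- Run this repeatedly: only ts changes, so t_grp_idx can rely on
    -- bottom-up deletion, while t_ts_idx keeps absorbing ever-increasing
    -- values and is entirely dependent on index vacuuming
    UPDATE t SET ts = ts + 1000000 WHERE pk % 10 = 0;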

The constantly modified index will be entirely dependent on index
vacuuming here, and so an improved VACUUM design that allows that
particular index to be vacuumed more frequently could really improve
performance.

BTW, it's a good idea to avoid unique indexes in test cases where
there is an index that you don't want to set LP_DEAD bits for, since
_bt_check_unique() tends to do a good job of setting LP_DEAD bits,
independent of the kill_prior_tuple thing. You can avoid using
kill_prior_tuple by forcing bitmap scans, of course.
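
In a test session that's just the usual planner GUCs, e.g. against the
sketch table above:

    SET enable_seqscan = off;
    SET enable_indexscan = off;
    SET enable_indexonlyscan = off;
    -- what remains is a bitmap scan, which never sets LP_DEAD bits in the
    -- index via kill_prior_tuple, unlike a plain index scan
    SELECT count(*) FROM t WHERE grp = 42;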

-- 
Peter Geoghegan




Re: support for CREATE MODULE

2022-02-04 Thread Bruce Momjian
On Wed, Feb  2, 2022 at 06:28:30PM -0800, Swaha Miller wrote:
> Hi,
> 
> I'm following up from Jim's POC for adding MODULE to PostgreSQL. [1]
> 
> My proposal implements modules as schema objects to be stored in a new
> system catalog pg_module with new syntax for CREATE [OR REPLACE] MODULE,

You might want to consider the steps that are most successful at getting
Postgres patches accepted:

https://wiki.postgresql.org/wiki/Todo
Desirability -> Design -> Implement -> Test -> Review -> Commit

In this case, you have jumped right to Implement.  Asking about
Desirability first can avoid a lot of effort.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Bruce Momjian
On Fri, Feb  4, 2022 at 01:33:00PM -0500, Robert Haas wrote:
> > I don't think it is fair to be criticizing OpenSSL for its mediocre
> > documentation when the alternative being considered, NSS, has no public
> > documentation.  Can the source-code-defined NSS documentation be
> > considered better than the mediocre OpenSSL public documentation?
> 
> I mean, I think it's fair to say that my experiences with trying to
> use the OpenSSL documentation have been poor. Admittedly it's been a
> few years now so maybe it's gotten better, but my experience was what
> it was. In one case, the function I needed wasn't documented at all,
> and I had to read the C code, which was weirdly-formatted and had no
> comments. That wasn't fun, and knowing that NSS could be an even worse
> experience doesn't retroactively turn that into a good one.

Oh, yeah, the OpenSSL documentation is verifiably mediocre.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Unclear problem reports

2022-02-04 Thread Bruce Momjian
On Thu, Feb  3, 2022 at 12:28:05PM +0800, Julien Rouhaud wrote:
> Hi,
> 
> On Wed, Feb 02, 2022 at 07:35:36PM -0500, Bruce Momjian wrote:
> > The Postgres community is great at diagnosing problems and giving users
> > feedback.  In most cases, we can either diagnose a problem and give a
> > fix, or at least give users a hint at finding the cause.
> > 
> > However, there is a class of problems that are very hard to help with,
> > and I have perhaps seen an increasing number of them recently, e.g.:
> > 
> > 
> > https://www.postgresql.org/message-id/17384-f50f2eedf541e512%40postgresql.org
> > 
> > https://www.postgresql.org/message-id/CALTnk7uOrMztNfzjNOZe3TdquAXDPD3vZKjWFWj%3D-Fv-gmROUQ%40mail.gmail.com
> > 
> > I consider these as problems that need digging to find the cause, and
> > users are usually unable to do sufficient digging, and we don't have
> > time to give them instructions, so they never get a reply.
> > 
> > Is there something we can do to improve this situation?  Should we just
> > tell them they need to hire a Postgres expert?  I assume these are users
> > who do not already have access to such experts.
> 
> There's also only so much you can do to help interacting by mail without
> access to the machine, even if you're willing to spend time helping people for
> free.

Agreed.

> One thing that might help would be to work on a troubleshooting section on the
> wiki with some common problems and some clear instructions on either how
> to try to fix the problem or provide needed information for further
> debugging.  At least it would save the time for those steps and one could
> quickly respond with that link, similarly to how we do for the slow query
> questions wiki entry.

I think the challenge there is that the problems are all different, and
if we had a pattern we could at least supply better errors and hints,
unless I am missing something obvious.  Does someone want to go back
through the bugs@ archive and see if they can find a pattern?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Unclear problem reports

2022-02-04 Thread Bruce Momjian
On Wed, Feb  2, 2022 at 07:21:19PM -0700, David G. Johnston wrote:
> On Wed, Feb 2, 2022 at 5:35 PM Bruce Momjian  wrote:
> 
> I consider these as problems that need digging to find the cause, and
> users are usually unable to do sufficient digging, and we don't have
> time to give them instructions, so they never get a reply.
> 
> Is there something we can do to improve this situation?  Should we just
> tell them they need to hire a Postgres expert?  I assume these are users
> who do not already have access to such experts.
> 
> 
> 
> Have pg_lister queue up a check for, say, two or three days after the bug
> reporting form is filled out.  If the report hasn't been responded to by
> someone other than the OP, send out a reply that basically says:
> 
> We're sorry your message hasn't yet attracted a response.  If your report falls
> into the category of "tech support for a malfunctioning server", and this
> includes error messages that are difficult or impossible to replicate in a
> development environment (maybe give some examples), you may wish to consider
> eliciting paid professional help.  Please see this page on our website for a
> directory of companies that provide such services.  The PostgreSQL Core Project
> itself refrains from making recommendations since many, if not all, of these
> companies contribute back to the project in order to keep it both free and open
> source.

Yes, that is an idea.  I have canned email responses for common issues
like PGAdmin questions on the bugs list, but for these cases, I don't
know if someone might actually know the answer, and I don't know how
long to wait for an answer.  Should we be going the other way and state
on the bugs submission page, https://www.postgresql.org/account/submitbug/:

If you are having a serious problem with the software and do not
receive a reply, consider additional support channels, including
professional support (https://www.postgresql.org/support/).

I have been hesitant to recommend professional support myself since I
work for a professional support company, but in these cases it is
clearly something the user should consider.

> I would also consider adding a form that directs messages to pg_admin instead
> of pg_bugs.  On that form we could put the above message upfront - basically
> saying that here is a place to ask for help but the core project (this website)
> doesn't directly provide such services: so if you don't receive a reply, or
> your needs are immediate, consider enlisting paid support from our directory.

Yes, like I stated above.  Not sure how pg_admin fits in here because we
get lots of questions that aren't "server not starting".

> The problem with the separate form is that we would need users to self-select
> to report their problem to the support list instead of using a bug reporting
> form.  Neither of the mentioned emails are good examples of bug reports. 

Yes, given the experience of many of the bugs posters, I don't think
self-selecting would help.

> Either we make it easier and hope self-selection works, or just resign
> ourselves to seeing these messages on -bugs and use the first option to at
> least be a bit more responsive.  The risks related to having a rote response,
> namely it not really being applicable to the situation, seem minimal and others
> seeing that response (assuming we'd send it to the list and not just the OP)
> would likely encourage someone to at least give better suggestions for next
> steps should that be necessary.

The problem is that I don't even know how long to wait for someone to
reply, so doing it up front makes more sense.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 1:22 PM Bruce Momjian  wrote:
> On Thu, Feb  3, 2022 at 02:33:37PM -0500, Robert Haas wrote:
> > As a philosophical matter, I don't think it's great for us - or the
> > Internet in general - to be too dependent on OpenSSL. Software
> > monocultures are not great, and OpenSSL has near-constant security
> > updates and mediocre documentation. Now, maybe anything else we
>
> I don't think it is fair to be criticizing OpenSSL for its mediocre
> documentation when the alternative being considered, NSS, has no public
> documentation.  Can the source-code-defined NSS documentation be
> considered better than the mediocre OpenSSL public documentation?

I mean, I think it's fair to say that my experiences with trying to
use the OpenSSL documentation have been poor. Admittedly it's been a
few years now so maybe it's gotten better, but my experience was what
it was. In one case, the function I needed wasn't documented at all,
and I had to read the C code, which was weirdly-formatted and had no
comments. That wasn't fun, and knowing that NSS could be an even worse
experience doesn't retroactively turn that into a good one.

> For the record, I do like the idea of adding NSS, but I am concerned
> about its long-term maintenance, as you explained.

It sounds like we come down in about the same place here, in the end.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Bruce Momjian
On Thu, Feb  3, 2022 at 02:33:37PM -0500, Robert Haas wrote:
> As a philosophical matter, I don't think it's great for us - or the
> Internet in general - to be too dependent on OpenSSL. Software
> monocultures are not great, and OpenSSL has near-constant security
> updates and mediocre documentation. Now, maybe anything else we

I don't think it is fair to be criticizing OpenSSL for its mediocre
documentation when the alternative being considered, NSS, has no public
documentation.  Can the source-code-defined NSS documentation be
considered better than the mediocre OpenSSL public documentation?

For the record, I do like the idea of adding NSS, but I am concerned
about its long-term maintenance, as you explained.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: decoupling table and index vacuum

2022-02-04 Thread Robert Haas
On Wed, Jan 26, 2022 at 8:58 AM Dilip Kumar  wrote:
> TODO:
> - This is just a POC patch to discuss the design idea and needs a lot
> of improvement and testing.
> - We are using a slightly different format for storing the dead tids
> into the conveyor belt which is explained in the patch but the
> traditional multi-pass vacuum is still using the same format (array of
> ItemPointeData), so we need to unify that format.
> - Performance testing.
> - Cleaner interfaces so that we can easily be integrated with auto
> vacuum, currently, this is not provided for the manual vacuum.
> - Add test cases.

I think this is a pretty interesting piece of work. I appreciate the
effort you've obviously put into the comments, although I do think
some of them are going to need some additional clarification. But I
think the bigger questions here at the moment are things like (1) Is
this the right idea? and if not (2) How could we change it to make it
better? and (3) Is there any way that we can make it simpler? It was
the last of these questions that prompted me to post
http://postgr.es/m/ca+tgmoy18rzqqdm2je2wdkia8ngtedhp7ulthb3a-abs+wb...@mail.gmail.com
because, if that thought were to work out, then we could have more
things in common between the conveyor-belt and non-conveyor-belt
cases, and we might be able to start with some preliminary work to
jigger more things into the second phase, and then look to integrate
the conveyor belt stuff separately.

I think what we ought to do at this point is try to figure out some
tests that might show how well this approach actually works in
practice. Now one motivation for this work was the desire to someday
have global indexes, but those don't exist yet, so it makes sense to
consider other scenarios in which the patch might (or might not) be
beneficial. And it seems to me that we should be looking for a
scenario where we have multiple indexes with different vacuuming
needs. How could that happen? Well, the first thing that occurred to
me was a table with a partial index. If we have a column t whose
values are randomly distributed between 1 and 10, and a partial index
on some other column WHERE t = 1, then the partial index should only
accumulate dead tuples about 10% as fast as a non-partial index on the
same column. On the other hand, the partial index also has a much
smaller number of total rows, so after a fixed number of updates, the
partial index should have the same *percentage* of dead tuples as the
non-partial index even though the absolute number is smaller. So maybe
that's not a great idea.

My second thought was that perhaps we can create a test scenario
where, in one index, the deduplication and bottom-up index deletion
and kill_prior_tuple mechanisms are very effective, and in another
index, it's not effective at all. For example, maybe index A is an
index on the primary key, and index B is a non-unique index on some
column that we're updating with ever-increasing values (so that we
never put new tuples into a page that could be productively cleaned
up). I think what should happen in this case is that A should not grow
in size even if it's never vacuumed, while B will require vacuuming to
keep the size down. If this example isn't exactly right, maybe we can
construct one where that does happen. Then we could try to demonstrate
that with this patch we can do less vacuuming work and still keep up
than what would be possible without the patch. We'll either be able to
show that this is true, or we will see that it's false, or we won't be
able to really see much difference. Any of those would be interesting
findings.

One thing we could try doing in order to make that easier would be:
tweak things so that when autovacuum vacuums the table, it only
vacuums the indexes if they meet some threshold for bloat. I'm not
sure exactly what happens with the heap vacuuming then - do we do
phases 1 and 2 always, or a combined heap pass, or what? But if we
pick some criteria that vacuums indexes sometimes and not other times,
we can probably start doing some meaningful measurement of whether
this patch is making bloat better or worse, and whether it's using
fewer or more resources to do it.

Do you have a git branch for this work?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2022-02-04 Thread Bruce Momjian
On Thu, Feb  3, 2022 at 01:42:53PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Tue, Feb  1, 2022 at 01:52:09PM -0800, Andres Freund wrote:
> > > There's https://hg.mozilla.org/projects/nspr/file/tip/pr/src - which is I
> > > think the upstream source.
> > > 
> > > A project without even a bare-minimal README at the root does have a 
> > > "internal
> > > only" feel to it...
> > 
> > I agree --- it is a library --- if they don't feel the need to publish
> > the API, it seems to mean they want to maintain the ability to change it
> > at any time, and therefore it is inappropriate for other software to
> > rely on that API.
> 
> This is really not a reasonable representation of how this library has
> been maintained historically nor is there any reason to think that their
> policy regarding the API has changed recently.  They do have a
> documented API and that hasn't changed- it's just that it's not easily
> available in web-page form any longer and that's due to something
> independent of the library maintenance.  They've also done a good job

So either they have always been bad at providing a public API reference, or
their web content disappeared and they haven't fixed it for who knows how
long.  I guess that is better than the v8 case, but not much.  Is posting web
content really that hard for them?

> with maintaining the API as one would expect from a library and so this
> really isn't a reason to avoid using it.  If there's actual specific
> examples of the API not being well maintained and causing issues then
> please point to them and we can discuss if that is a reason to consider
> not depending on NSS/NSPR.

I have no specifics.

> > This is not the same as Postgres extensions needing to read the Postgres
> > source code --- they are an important but edge use case and we never saw
> > the need to standardize or publish the internal functions that must be
> > studied and adjusted possibly for major releases.
> 
> I agree that extensions and public libraries aren't entirely the same
> but I don't think it's all that unreasonable for developers that are
> using a library to look at the source code for that library when
> developing against it, that's certainly something I've done for a
> number of different libraries.

Wow, you have a much higher tolerance than I do.  How do you even know
which functions are the public API if you have to look at the source
code?

> > This kind of feels like the Chrome JavaScript code that used to be able
> > to be build separately for PL/v8, but has gotten much harder to do in
> > the past few years.
> 
> This isn't at all like that case, where the maintainers made a very
> clear and intentional choice to make it quite difficult for packagers to
> pull v8 out to package it.  Nothing like that has happened with NSS and
> there isn't any reason to think that it will based on what the
> maintainers have said and what they've done across the many years that
> NSS has been around.

As far as I know, the v8 developers didn't say anything, they just
started moving things around to make it easier for them and harder for
packagers --- and they didn't care.

I frankly think we need some public statement from the NSS developers
before moving forward --- there are just too many red flags here, and
once we support it, it will be hard to remove support for it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Synchronizing slots from primary to standby

2022-02-04 Thread Ashutosh Sharma
On Sat, Jan 22, 2022 at 4:33 AM Hsu, John  wrote:
>
>> I might be missing something but isn’t it okay even if the new primary
>> server is behind the subscribers? IOW, even if two slot's LSNs (i.e.,
>> restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
>> LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
>> transactions that were committed after the remote_lsn. So the
>> subscriber can resume logical replication with the new primary without
>> any data loss.
>
> Maybe I'm misreading, but I thought the purpose of this was to make
> sure that the logical subscriber does not have data that has not been
> replicated to the new primary. The use-case I can think of would be
> if synchronous_commit were enabled and fail-over occurs. If
> we didn't have this set, isn't it possible that this logical subscriber
> has extra commits that aren't present on the newly promoted primary?
>

This is very much possible if the new primary used to be an asynchronous
standby. But it seems like the current patch is trying to hold back
logical replication until the data has been replicated to the physical
standby when synchronous_slot_names is set. This will ensure that the
logical subscriber is never ahead of the new primary. However, AFAIU
that's not the primary use-case of this patch; instead it is to
ensure that the logical subscribers continue getting data from the new
primary when the failover occurs.

>
> If a logical slot was dropped on the writer, should the worker drop logical
> slots that it was previously synchronizing but are no longer present? Or
> should we leave that to the user to manage? I'm trying to think why users
> would want to sync logical slots to a reader but not have that be dropped
> as well if it's no longer present.
>

AFAIU this should be taken care of by the background worker used to
synchronize the replication slot.

--
With Regards,
Ashutosh Sharma.




Re: [PATCH] Accept IP addresses in server certificate SANs

2022-02-04 Thread Jacob Champion
On Thu, 2022-02-03 at 16:23 +0900, Kyotaro Horiguchi wrote:
> At Wed, 2 Feb 2022 19:46:13 +, Jacob Champion  wrote in
> > On Mon, 2022-01-31 at 17:29 +0900, Kyotaro Horiguchi wrote:
> > > +#define PGSQL_AF_INET  (AF_INET + 0)
> > > +#define PGSQL_AF_INET6 (AF_INET + 1)
> > > +
> > > 
> > > Now we have the same definition thrice in frontend code. Coulnd't we
> > > define them in, say, libpq-fe.h or inet-fe.h (nonexistent) then
> > > include it from the three files?
> > 
> > I started down the inet-fe.h route, and then realized I didn't know
> > where that should go. Does it need to be included in (or part of)
> > port.h? And should it be installed as part of the logic in
> > src/include/Makefile?
> 
> I don't think it should be a part of port.h.  Though I suggested
> frontend-only header file by the name, isn't it enough to separate out
> the definitions from utils/inet.h to common/inet-common.h then include
> the inet-common.h from inet.h?

That works a lot better than what I had in my head. Done that way in
v4. Thanks!
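
For readers following along, the separated-out header could look roughly
like this (a sketch of the idea only; the actual v4 file may differ):

/* Sketch of src/include/common/inet-common.h */
#ifndef INET_COMMON_H
#define INET_COMMON_H

/*
 * PGSQL_AF_INET is defined to equal AF_INET (from <sys/socket.h>), and
 * PGSQL_AF_INET6 to one more, matching the assumption documented in
 * pg_inet_net_ntop().
 */
#define PGSQL_AF_INET	(AF_INET + 0)
#define PGSQL_AF_INET6	(AF_INET + 1)

#endif							/* INET_COMMON_H */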

--Jacob
From e54b41c722d04ea6eabe4434bdf10c92fa79a192 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 24 Nov 2021 14:46:11 -0800
Subject: [PATCH v4 1/3] Move inet_net_pton() to src/port

This will be helpful for IP address verification in libpq.
---
 src/backend/utils/adt/Makefile  |  1 -
 src/include/port.h  |  3 +++
 src/include/utils/builtins.h|  4 
 src/port/Makefile   |  1 +
 src/{backend/utils/adt => port}/inet_net_pton.c | 16 +++-
 src/tools/msvc/Mkvcbuild.pm |  2 +-
 6 files changed, 20 insertions(+), 7 deletions(-)
 rename src/{backend/utils/adt => port}/inet_net_pton.c (96%)

diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 41b486bcef..d173d52157 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -43,7 +43,6 @@ OBJS = \
 	geo_selfuncs.o \
 	geo_spgist.o \
 	inet_cidr_ntop.o \
-	inet_net_pton.o \
 	int.o \
 	int8.o \
 	json.o \
diff --git a/src/include/port.h b/src/include/port.h
index 3d103a2b31..2852e5b58b 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -515,6 +515,9 @@ extern int	pg_codepage_to_encoding(UINT cp);
 extern char *pg_inet_net_ntop(int af, const void *src, int bits,
 			  char *dst, size_t size);
 
+/* port/inet_net_pton.c */
+extern int	pg_inet_net_pton(int af, const char *src, void *dst, size_t size);
+
 /* port/pg_strong_random.c */
 extern void pg_strong_random_init(void);
 extern bool pg_strong_random(void *buf, size_t len);
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index d8f05a7df3..22ebbfda17 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -93,10 +93,6 @@ extern int	xidLogicalComparator(const void *arg1, const void *arg2);
 extern char *pg_inet_cidr_ntop(int af, const void *src, int bits,
 			   char *dst, size_t size);
 
-/* inet_net_pton.c */
-extern int	pg_inet_net_pton(int af, const char *src,
-			 void *dst, size_t size);
-
 /* network.c */
 extern double convert_network_to_scalar(Datum value, Oid typid, bool *failure);
 extern Datum network_scan_first(Datum in);
diff --git a/src/port/Makefile b/src/port/Makefile
index bfe1feb0d4..c3ae7b3d5c 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -43,6 +43,7 @@ OBJS = \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
+	inet_net_pton.o \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
diff --git a/src/backend/utils/adt/inet_net_pton.c b/src/port/inet_net_pton.c
similarity index 96%
rename from src/backend/utils/adt/inet_net_pton.c
rename to src/port/inet_net_pton.c
index d3221a1313..bae50ba67e 100644
--- a/src/backend/utils/adt/inet_net_pton.c
+++ b/src/port/inet_net_pton.c
@@ -14,14 +14,18 @@
  * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
  * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  *
- *	  src/backend/utils/adt/inet_net_pton.c
+ *	  src/port/inet_net_pton.c
  */
 
 #if defined(LIBC_SCCS) && !defined(lint)
 static const char rcsid[] = "Id: inet_net_pton.c,v 1.4.2.3 2004/03/17 00:40:11 marka Exp $";
 #endif
 
+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
 
 #include 
 #include 
@@ -29,9 +33,19 @@ static const char rcsid[] = "Id: inet_net_pton.c,v 1.4.2.3 2004/03/17 00:40:11 m
 #include 
 #include 
 
+#ifndef FRONTEND
 #include "utils/builtins.h" /* pgrminclude ignore */	/* needed on some
 		 * platforms */
 #include "utils/inet.h"
+#else
+/*
+ * In a frontend build, we can't include inet.h, but we still need to have
+ * sensible definitions of these two constants.  Note that pg_inet_net_ntop()
+ * assumes that PGSQL_AF_INET is equal to AF_INET.
+ */
+#define PGSQL_AF_INET	(AF_INET + 0)
+#define PGSQL_AF_INET6	(AF_INET + 1)
+#endif
 
 
 static int	inet_net_pton_ipv4(const char *src, u_char *dst);
diff --git 

Re: Stats collector's idx_blks_hit value is highly misleading in practice

2022-02-04 Thread John Naylor
On Fri, Feb 4, 2022 at 11:19 AM Peter Geoghegan  wrote:
>
> On Thu, Feb 3, 2022 at 7:08 PM John Naylor  wrote:
> > Is this a TODO candidate? What would be a succinct title for it?
>
> I definitely think that it's worth working on. I suppose it follows
> that it should go on the TODO list.

Added TODO item "Teach stats collector to differentiate between
internal and leaf index pages"

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Stats collector's idx_blks_hit value is highly misleading in practice

2022-02-04 Thread Peter Geoghegan
On Thu, Feb 3, 2022 at 7:08 PM John Naylor  wrote:
> Is this a TODO candidate? What would be a succinct title for it?

I definitely think that it's worth working on. I suppose it follows
that it should go on the TODO list.

-- 
Peter Geoghegan




Re: [PATCH] Add reloption for views to enable RLS

2022-02-04 Thread Laurenz Albe
On Wed, 2022-02-02 at 18:23 +0100, Christoph Heiss wrote:
> > Huh?  "duff" has no permission to insert into "tab"!
> > That really should not happen, thanks for finding that and helping me
> > investigate how to fix that!
> 
> This is now solved by checking the security_invoker property on the view 
> in rewriteTargetView().
> 
> I've also added a testcase for this in v4 to catch that in future.

I tested it, and the patch works fine now.

Some little comments:

> --- a/src/backend/rewrite/rewriteHandler.c
> +++ b/src/backend/rewrite/rewriteHandler.c
> @@ -3242,9 +3243,13 @@ rewriteTargetView(Query *parsetree, Relation view)
>0);
> 
> /*
> -* Mark the new target RTE for the permissions checks that we want to
> -* enforce against the view owner, as distinct from the query caller.  At
> -* the relation level, require the same INSERT/UPDATE/DELETE permissions
> +* If the view has security_invoker set, mark the new target RTE for the
> +* permissions checks that we want to enforce against the query caller, as
> +* distince from the view owner.

Typo: distince

diff --git a/src/test/regress/expected/create_view.out 
b/src/test/regress/expected/create_view.out
index 509e930fc7..fea893569f 100644
--- a/src/test/regress/expected/create_view.out
+++ b/src/test/regress/expected/create_view.out
@@ -261,15 +261,26 @@ CREATE VIEW mysecview3 WITH (security_barrier=false)
AS SELECT * FROM tbl1 WHERE a < 0;
 CREATE VIEW mysecview4 WITH (security_barrier)
AS SELECT * FROM tbl1 WHERE a <> 0;
-CREATE VIEW mysecview5 WITH (security_barrier=100) -- Error
+CREATE VIEW mysecview5 WITH (security_invoker=true)
+   AS SELECT * FROM tbl1 WHERE a = 100;
+CREATE VIEW mysecview6 WITH (security_invoker=false)
AS SELECT * FROM tbl1 WHERE a > 100;
+CREATE VIEW mysecview7 WITH (security_invoker)
+   AS SELECT * FROM tbl1 WHERE a < 100;
+CREATE VIEW mysecview8 WITH (security_barrier=100) -- Error
+   AS SELECT * FROM tbl1 WHERE a <> 100;
 ERROR:  invalid value for boolean option "security_barrier": 100
-CREATE VIEW mysecview6 WITH (invalid_option)   -- Error
+CREATE VIEW mysecview9 WITH (security_invoker=100) -- Error
+   AS SELECT * FROM tbl1 WHERE a = 100;
+ERROR:  invalid value for boolean option "security_invoker": 100
+CREATE VIEW mysecview10 WITH (invalid_option)  -- Error

I see no reasons to remove two of the existing tests.

+++ b/src/test/regress/expected/rowsecurity.out
@@ -8,9 +8,11 @@ DROP USER IF EXISTS regress_rls_alice;
 DROP USER IF EXISTS regress_rls_bob;
 DROP USER IF EXISTS regress_rls_carol;
 DROP USER IF EXISTS regress_rls_dave;
+DROP USER IF EXISTS regress_rls_grace;

But the name has to start with "e"!


I also see no reason to split a small patch like this into three parts.

In the attached, I dealt with the above and went over the comments.
How do you like it?

Yours,
Laurenz Albe
From 46130acdc2649718fe81e289ef86ac8bf3eae124 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 4 Feb 2022 16:58:58 +0100
Subject: [PATCH] Add new boolean reloption "security_invoker" to views

When this reloption is set to true, all permissions on the underlying
objects will be checked against the invoking user rather than the view
owner, as is currently implemented.

Author: Christoph Heiss 
Discussion: https://postgr.es/m/b66dd6d6-ad3e-c6f2-8b90-47be773da240%40cybertec.at
---
 doc/src/sgml/ref/alter_view.sgml  | 12 -
 doc/src/sgml/ref/create_view.sgml | 32 +--
 src/backend/access/common/reloptions.c| 11 
 src/backend/nodes/copyfuncs.c |  1 +
 src/backend/nodes/equalfuncs.c|  1 +
 src/backend/nodes/outfuncs.c  |  1 +
 src/backend/nodes/readfuncs.c |  1 +
 src/backend/optimizer/plan/subselect.c|  1 +
 src/backend/optimizer/prep/prepjointree.c |  1 +
 src/backend/rewrite/rewriteHandler.c  | 20 ---
 src/backend/utils/cache/relcache.c| 63 --
 src/include/nodes/parsenodes.h|  1 +
 src/include/utils/rel.h   | 11 
 src/test/regress/expected/create_view.out | 46 
 src/test/regress/expected/rowsecurity.out | 65 ++-
 src/test/regress/sql/create_view.sql  | 22 +++-
 src/test/regress/sql/rowsecurity.sql  | 44 +++
 17 files changed, 282 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_view.sgml b/doc/src/sgml/ref/alter_view.sgml
index 98c312c5bf..8bdc90a5a1 100644
--- a/doc/src/sgml/ref/alter_view.sgml
+++ b/doc/src/sgml/ref/alter_view.sgml
@@ -156,7 +156,17 @@ ALTER VIEW [ IF EXISTS ] name RESET
 
  
   Changes the security-barrier property of the view.  The value must
-  be Boolean value, such as true
+  be a Boolean value, such as true
+  or false.
+ 
+
+   
+   
+security_invoker (boolean)
+
+ 
+  Changes 

Re: [BUG]Update Toast data failure in logical replication

2022-02-04 Thread Alvaro Herrera
I don't have a reason not to commit this patch.  I have some suggestions
on the comments and docs though.

> @@ -8359,14 +8408,15 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
>   * Returns NULL if there's no need to log an identity or if there's no suitable
>   * key defined.
>   *
> - * key_changed should be false if caller knows that no replica identity
> - * columns changed value.  It's always true in the DELETE case.
> + * key_required should be false if caller knows that no replica identity
> + * columns changed value and it doesn't has any external data.  It's always
> + * true in the DELETE case.
>   *
>   * *copy is set to true if the returned tuple is a modified copy rather than
>   * the same tuple that was passed in.
>   */
>  static HeapTuple
> -ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
> +ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,

I find the new comment pretty hard to interpret.  I would say something
like "Pass key_required true if any replica identity columns changed
value, or if any of them have external data.  DELETE must always pass
true".

> diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
> index dee026e..d67ef7c 100644
> --- a/doc/src/sgml/ref/alter_table.sgml
> +++ b/doc/src/sgml/ref/alter_table.sgml
> @@ -873,8 +873,10 @@ WITH ( MODULUS  class="parameter">numeric_literal, REM
>This form changes the information which is written to the write-ahead log
>to identify rows which are updated or deleted.  This option has no effect
>except when logical replication is in use.
> -  In all cases, no old values are logged unless at least one of the columns
> -  that would be logged differs between the old and new versions of the row.
> +  In all cases except toasted values, no old values are logged unless at
> +  least one of the columns that would be logged differs between the old and
> +  new versions of the row.  We detoast the unchanged old toast values and log
> +  them.

Here we're patching with a minimal wording change with almost
incomprehensible results.  I think we should patch more extensively.
I suggest:

This form changes the information which is written to the
write-ahead log to identify rows which are updated or deleted.

In most cases, the old value of each column is only logged if
it differs from the new value; however, if the old value is
stored externally, it is always logged regardless of whether it
changed.

This option has no effect unless logical replication is in use.

I didn't get a chance to review the code, but I think this is valuable.





-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/




Re: [PATCH] Add min() and max() aggregate functions for xid8

2022-02-04 Thread Fujii Masao




On 2022/02/03 16:45, Ken Kato wrote:

Hi hackers,

Unlike xid, xid8 increases monotonically and cannot be reused.
This trait makes it possible to support min() and max() aggregate functions for xid8.
I thought they would be useful for monitoring.

So I made a patch for this.


Thanks for the patch! +1 for this feature.

+   PG_RETURN_FULLTRANSACTIONID((U64FromFullTransactionId(fxid1) > U64FromFullTransactionId(fxid2)) ? fxid1 : fxid2);

Shouldn't we use FullTransactionIdFollows() to compare those two fxid values here, instead?

+   PG_RETURN_FULLTRANSACTIONID((U64FromFullTransactionId(fxid1) < U64FromFullTransactionId(fxid2)) ? fxid1 : fxid2);

Shouldn't we use FullTransactionIdPrecedes() to compare those two fxid values here, instead?

Could you add the regression tests for those min() and max() functions for xid8?
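
For what it's worth, a minimal sketch of how the max() transition function
could read with that helper, reusing the fxid1/fxid2 naming from the quoted
patch (an illustration only, not the patch's actual code):

#include "access/transam.h"		/* FullTransactionIdFollows() */
#include "utils/xid8.h"			/* PG_GETARG_FULLTRANSACTIONID() */

Datum
xid8_larger(PG_FUNCTION_ARGS)
{
	FullTransactionId fxid1 = PG_GETARG_FULLTRANSACTIONID(0);
	FullTransactionId fxid2 = PG_GETARG_FULLTRANSACTIONID(1);

	if (FullTransactionIdFollows(fxid1, fxid2))
		PG_RETURN_FULLTRANSACTIONID(fxid1);
	else
		PG_RETURN_FULLTRANSACTIONID(fxid2);
}

The min() counterpart would be identical with FullTransactionIdPrecedes().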

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Extensible Rmgr for Table AMs

2022-02-04 Thread Julien Rouhaud
On Fri, Feb 04, 2022 at 09:53:09AM -0500, Robert Haas wrote:
> On Fri, Feb 4, 2022 at 9:48 AM Julien Rouhaud  wrote:
> > I guess the idea was to have a compromise between letting rmgr authors choose
> > arbitrary ids to avoid any conflicts, especially with private implementations,
> > without wasting too much memory.  But those approaches would be pretty much
> > incompatible with the current definition:
> >
> > +#define RM_CUSTOM_MIN_ID   128
> > +#define RM_CUSTOM_MAX_ID   UINT8_MAX
> >
> > even if you only allocate up to the max id found, nothing guarantees that you
> > won't get a quite high id.
> 
> Right, which I guess raises another question: if the maximum is
> UINT8_MAX, which BTW I find perfectly reasonable, why are we not just
> defining this as an array of size 256? There's no point in adding code
> complexity to save a few kB of memory.

Agreed, especially if combined with your suggested approach 3 (array of
pointers).




Re: Extensible Rmgr for Table AMs

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 9:48 AM Julien Rouhaud  wrote:
> I guess the idea was to have a compromise between letting rmgr authors choose
> arbitrary ids to avoid any conflicts, especially with private implementations,
> without wasting too much memory.  But those approaches would be pretty much
> incompatible with the current definition:
>
> +#define RM_CUSTOM_MIN_ID   128
> +#define RM_CUSTOM_MAX_ID   UINT8_MAX
>
> even if you only allocate up to the max id found, nothing guarantees that you
> won't get a quite high id.

Right, which I guess raises another question: if the maximum is
UINT8_MAX, which BTW I find perfectly reasonable, why are we not just
defining this as an array of size 256? There's no point in adding code
complexity to save a few kB of memory.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extensible Rmgr for Table AMs

2022-02-04 Thread Julien Rouhaud
Hi,

On Fri, Feb 04, 2022 at 09:10:42AM -0500, Robert Haas wrote:
> On Thu, Feb 3, 2022 at 12:34 AM Julien Rouhaud  wrote:
> > I agree that having dozens of custom rmgrs doesn't seem likely, but I also
> > have no idea of how much overhead you get by not doing a direct array access.
> > I think it would be informative to benchmark something like a simple OLTP
> > write workload on fast storage (or a ramdisk, or with fsync off...), with the
> > used rmgr being the 1st and the 2nd custom rmgr.  Both scenarios still seem
> > plausible and shouldn't degenerate on good hardware.
> 
> I think it would be hard to measure the overhead of this approach on a
> macrobenchmark.

Yeah that's also my initial thought, but I wouldn't be terribly surprised to be
wrong.

> That having been said, I find this a surprising
> implementation choice. I think that the approaches that are most worth
> considering are:
> 
> (1) reallocate the array if needed so that we can continue to just do
> RmgrTable[rmid]
> (2) have one array for builtins and a second array for extensions and
> do rmid < RM_CUSTOM_MIN_ID ? BuiltinRmgrTable[rmid] :
> ExtensionRmgrTable[rmid]
> (3) change RmgrTable to be an array of pointers to structs rather than
> an array of structs.  Then the structs don't move around and can be
> const, but the pointers can be moved into a larger array if required
> 
> I'm not really sure which is best. My intuition for what will be
> cheapest on modern hardware is pretty shaky. However, I can't see how
> it can be the thing the patch is doing now; a linear search seems like
> it has to be the slowest option.

I guess the idea was to have a compromise between letting rmgr authors choose
arbitrary ids to avoid any conflicts, especially with private implementations,
without wasting too much memory.  But those approaches would be pretty much
incompatible with the current definition:

+#define RM_CUSTOM_MIN_ID   128
+#define RM_CUSTOM_MAX_ID   UINT8_MAX

even if you only allocate up to the max id found, nothing guarantees that you
won't get a quite high id.




Re: pg_walfile_name uses XLByteToPrevSeg

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 9:05 AM Ashutosh Bapat  wrote:
> And it gives some surprising results as well
> ---
> #select pg_walfile_name('0/0'::pg_lsn);
>      pg_walfile_name
> --------------------------
>  00000001FFFFFFFF000000FF
> (1 row)
> 

Yeah, that seems wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extensible Rmgr for Table AMs

2022-02-04 Thread Robert Haas
On Thu, Feb 3, 2022 at 12:34 AM Julien Rouhaud  wrote:
> I agree that having dozens of custom rmgrs doesn't seem likely, but I also have
> no idea of how much overhead you get by not doing a direct array access.  I
> think it would be informative to benchmark something like a simple OLTP write
> workload on fast storage (or a ramdisk, or with fsync off...), with the used
> rmgr being the 1st and the 2nd custom rmgr.  Both scenarios still seem
> plausible and shouldn't degenerate on good hardware.

I think it would be hard to measure the overhead of this approach on a
macrobenchmark. That having been said, I find this a surprising
implementation choice. I think that the approaches that are most worth
considering are:

(1) reallocate the array if needed so that we can continue to just do
RmgrTable[rmid]
(2) have one array for builtins and a second array for extensions and
do rmid < RM_CUSTOM_MIN_ID ? BuiltinRmgrTable[rmid] :
ExtensionRmgrTable[rmid]
(3) change RmgrTable to be an array of pointers to structs rather than
an array of structs.  Then the structs don't move around and can be
const, but the pointers can be moved into a larger array if required

I'm not really sure which is best. My intuition for what will be
cheapest on modern hardware is pretty shaky. However, I can't see how
it can be the thing the patch is doing now; a linear search seems like
it has to be the slowest option.
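
To make options (2) and (3) concrete, a fixed 256-slot variant could look
roughly like this (names and error handling here are illustrative, not the
patch's actual code):

/* 256 entries, indexed directly by rmid; unregistered slots stay NULL. */
static const RmgrData *RmgrTable[UINT8_MAX + 1];

void
RegisterCustomRmgr(RmgrId rmid, const RmgrData *rmgr)
{
	if (rmid < RM_CUSTOM_MIN_ID)
		elog(ERROR, "rmgr id %d is reserved for built-in resource managers",
			 rmid);
	if (RmgrTable[rmid] != NULL)
		elog(ERROR, "rmgr id %d is already registered", rmid);
	RmgrTable[rmid] = rmgr;
}

/* Lookup stays O(1) for built-in and custom rmgrs alike. */
static inline const RmgrData *
GetRmgr(RmgrId rmid)
{
	return RmgrTable[rmid];
}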

--
Robert Haas
EDB: http://www.enterprisedb.com




pg_walfile_name uses XLByteToPrevSeg

2022-02-04 Thread Ashutosh Bapat
Hi All,
pg_walfile_name() returns the WAL file name corresponding to the given
WAL location. Per
https://www.postgresql.org/docs/14/functions-admin.html
---
pg_walfile_name ( lsn pg_lsn ) → text

Converts a write-ahead log location to the name of the WAL file
holding that location.
---

The function uses XLByteToPrevSeg(), which gives the name of the previous
WAL file if the location falls on the boundary of a WAL segment. I find
it misleading since the given LSN will fall into the segment provided
by XLByteToSeg() and not XLByteToPrevSeg().

And it gives some surprising results as well
---
#select pg_walfile_name('0/0'::pg_lsn);
     pg_walfile_name
--------------------------
 00000001FFFFFFFF000000FF
(1 row)


Comment in the code says
---
/*
 * Compute an xlog file name given a WAL location,
 * such as is returned by pg_stop_backup() or pg_switch_wal().
 */
Datum
pg_walfile_name(PG_FUNCTION_ARGS)
---
XLByteToPrevSeg() may be in line with the comment, but I don't think
that's what is conveyed by the documentation at least.
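
For reference, the two macros differ only in an off-by-one subtraction
(paraphrased from src/include/access/xlog_internal.h):

/* Segment that contains the given byte position: */
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = (xlrp) / (wal_segsz_bytes)

/* Segment that contains the byte *before* the given position: */
#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

With xlrp = 0, the unsigned subtraction in XLByteToPrevSeg() wraps around to
UINT64_MAX, which is how '0/0' ends up mapped to the very last possible
segment, hence the surprising file name above.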

--
Best Wishes,
Ashutosh




Re: make MaxBackends available in _PG_init

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 12:55 AM Nathan Bossart  wrote:
> Here is a new patch with fewer calls to GetMaxBackends().

For multixact.c, I think you should invent GetMaxOldestSlot() to avoid
confusion. Maybe it could be a static inline rather than a macro.
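
Something along these lines, perhaps (a sketch only, assuming
MaxOldestSlot's current definition over MaxBackends and max_prepared_xacts,
plus the GetMaxBackends() accessor from the patch):

/* Would replace the MaxOldestSlot macro in multixact.c. */
static inline int
GetMaxOldestSlot(void)
{
	return GetMaxBackends() + max_prepared_xacts;
}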

Likewise, I think PROCARRAY_MAXPROCS, NumProcSignalSlots, and
NumBackendStatSlots should be replaced with things that look more like
function calls.

Apart from that, I think this looks pretty good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Windows now has fdatasync()

2022-02-04 Thread Robert Haas
On Sun, Dec 12, 2021 at 4:32 PM Thomas Munro  wrote:
> One reason to consider developing this further is the problem
> discussed in the aptly named thread "Loaded footgun open_datasync on
> Windows"[1] (not the problem that was fixed in pg_test_fsync, but the
> problem with cache control, or lack thereof).  I saw 10x more TPS with
> open_datasync than with this experimental fdatasync on my little test
> VM, which was more than a little fishy, so I turned off the device
> write caching in "Device Manager" and got about the same number from
> open_datasync and fdatasync.  Clearly you can lose committed
> transactions on power loss[2] with the default OS settings and default
> PostgreSQL settings, though perhaps only if you're on SATA storage due
> to lack of FUA pass-through[3] (?).  I didn't try an NVMe stack.
>
> That suggests that fdatasync would actually be a better default ...
> except for the problems already mentioned with versions and not
> working on non-NTFS (not sure how it fails on non-NTFS, though, maybe
> it does a full flush, [4] doesn't say).

So my impression is that today we ship defaults that are unsafe on
Windows. I don't really understand much of what you are saying here,
but if there's a way we can stop doing that, +1 from me, especially if
it allows us to retain reasonable performance.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Bugs in pgoutput.c

2022-02-04 Thread Amit Kapila
On Thu, Feb 3, 2022 at 5:24 PM Amit Kapila  wrote:
>
> On Thu, Feb 3, 2022 at 8:18 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Tom, is it okay for you if I go ahead with this patch after some testing?
> >
> > I've been too busy to get back to it, so sure.
> >
>
> Thanks. I have tested the patch by generating an invalidation message
> for table DDL before accessing the syscache in
> logicalrep_write_tuple(). I see that it correctly invalidates the
> entry and rebuilds it for the next operation. I couldn't come up with
> an automatic test for it, so I used the debugger to test it. I have
> made a minor change in one of the comments. I am planning to push this
> tomorrow unless there are comments or suggestions.
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: [BUG]Update Toast data failure in logical replication

2022-02-04 Thread Amit Kapila
On Mon, Jan 31, 2022 at 9:03 AM Dilip Kumar  wrote:
>
> >
> > I have changed this and various other comments in the patch. I have
> > modified the docs as well to reflect the changes. I thought of adding
> > a test but I think the current test in toast.sql seems sufficient.
> > Kindly let me know what you think of the attached? I think we should
> > backpatch this till v10. What do you think?
>
> Looks fine to me.
>

Attached are the patches for back-branches till v10. I have made two
modifications: (a) changed heap_tuple_attr_equals() to
heap_attr_equals() as we are not passing tuple to it; (b) changed
parameter name 'check_external_cols' to 'external_cols' to make it
sound similar to existing parameter 'interesting_cols' in
HeapDetermine* function.

Let me know what you think of the attached. Do you see any reason not
to back-patch this fix?

-- 
With Regards,
Amit Kapila.


HEAD-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-at.patch
Description: Binary data


14-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-a.patch
Description: Binary data


13-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-a.patch
Description: Binary data


12-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-a.patch
Description: Binary data


11-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-a.patch
Description: Binary data


10-v7-0001-WAL-log-unchanged-toasted-replica-identity-key-a.patch
Description: Binary data


Re: Deparsing rewritten query

2022-02-04 Thread Pavel Stehule
On Fri, Feb 4, 2022 at 10:36, Julien Rouhaud  wrote:

> Hi,
>
> On Wed, Feb 02, 2022 at 07:49:41PM +0100, Pavel Stehule wrote:
> >
> > I checked this trivial patch, and I don't see any problem. Again I ran
> > check-world with success. The documentation for this feature is not
> > necessary. But I am not sure about regress tests. Without any other code,
> > enforcing printalias will be invisible. What do you think about the
> > transformation of your extension to a new module in src/test/modules? Maybe
> > it can be used for other checks in future.
>
> I'm not opposed, but previously Tom explicitly said that he thinks this
> feature is useless and is strongly opposed to making any kind of promise
> that the current interface to make it possible (if get_query_def() is
> exposed) would be maintained.  Adding such a test module would probably be
> a reason to reject the patch altogether.  I'm just hoping that this change,
> which is a no-op for any legal query, is acceptable.  It can only break
> something if you feed wrong data to get_query_def(), which would be my
> problem and not the project's problem.
>

ok, I don't have any problem with it. Then there is not necessarily any
change, and I'll mark this patch as ready for committer.

Regards

Pavel


Re: Add psql command to list constraints

2022-02-04 Thread Dag Lem
"David G. Johnston"  writes:

> On Monday, November 15, 2021, Tatsuro Yamada
>  wrote:
>
> I don't know if this is a good example, but if you look at StackOverflow,
> it seems that people who want to see a list of constraints appear
> regularly.
>
> https://stackoverflow.com/questions/62987794/how-to-list-all-constraints-of-a-table-in-postgresql
> 
>
> Given the questioner restricted their question to “for a given table”
> I’d say it supports leaving the status quo, not changing it.
>
> If, as you suppose, these come up regularly then finding one that asks
> for it “in the entire database”, ideally with some stated goal, should
> be doable.
>
> David J.
>
>

This is my review of the patch in
https://commitfest.postgresql.org/37/3468/

The patch adds the command "\dco" to list constraints in psql. This
seems useful to me.

The patch applies cleanly to HEAD, although some hunks have rather large
offsets.

As far as I can tell, the "\dco" command works as documented.

I have however found the following issues with the patch:

* A TAB character has been added to doc/src/sgml/ref/psql-ref.sgml -
  this should be replaced with spaces.
* The call to listConstraints at src/bin/psql/command.c line 794 refers
  to [2]; this should rather be [3].
* The patch kills the "\dc" command in src/bin/psql/command.c.
  This can be fixed by adding the following at line 800:
      else
          success = listConversions(pattern, show_verbose, show_system);

Another comment is that the "\dco" command outputs quite a lot of
information, which only fits in a wide terminal window. Would it be an
idea to only display the columns "Schema" and "Name" by default, and
use "+" to specify inclusion of the columns "Definition" and "Table".


Best regards

Dag Lem




Re: Add connection active, idle time to pg_stat_activity

2022-02-04 Thread Sergey Dudoladov
Hi,

> > Could you please elaborate on this idea?
> > So we have pgStatActiveTime and pgStatIdleInTransactionTime ultimately
> > used to report the respective metrics in pg_stat_database.
> > Now beentry's st_total_active_time / st_total_transaction_idle_time
> > duplicates this info, so one may get rid of the pgStat*Time counters. Is
> > the idea to instead report, at every tabstat report, the difference
> > between the last memorized value of st_total_*_time and its current
> > value?
>
> Exactly. The attached first diff is the schetch of that.

This diff actually adds more code than it removes and somewhat bloats the patch.
I decided to incorporate it anyway because the diff explicitly shows that time
differences since the last report are sent to the statistics collector, which
is not immediately evident from the existing PgStat*Time counters.  That point
may be worth further discussion though.


> And, it seems like I forgot to mention this, but as Kuntal suggested
> (in a different context and objective, though) upthread, I think that
> we can show realtime values in the two time fields by adding the time
> of the current state.  See the attached second diff.

That is exactly what we need in our infra, also included into the patch.


@Kyotaro Horiguchi
Thank you for the contribution. I included both of your diffs with
minor changes.
Should I add you to the authors of the patch, given that now half of it
is basically your code?

Regards,
Sergey
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..bc76016834 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -967,6 +967,26 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
additional types.
   
  
+
+ 
+  
+   active_time double precision
+  
+  
+   Time in milliseconds this backend spent in active and
+   fastpath states.
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time in milliseconds this backend spent in idle in transaction
+   and idle in transaction (aborted) states.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3cb69b1f87..e349709c05 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -841,7 +841,9 @@ CREATE VIEW pg_stat_activity AS
 s.backend_xmin,
 S.query_id,
 S.query,
-S.backend_type
+S.backend_type,
+S.active_time,
+S.idle_in_transaction_time
 FROM pg_stat_get_activity(NULL) AS S
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0646f53098..8b84533953 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -249,8 +249,8 @@ static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
 static PgStat_Counter pgLastSessionReportTime = 0;
-PgStat_Counter pgStatActiveTime = 0;
-PgStat_Counter pgStatTransactionIdleTime = 0;
+PgStat_Counter pgStatLastActiveTime = 0;
+PgStat_Counter pgStatLastTransactionIdleTime = 0;
 SessionEndType pgStatSessionEndCause = DISCONNECT_NORMAL;
 
 /* Record that's written to 2PC state file when pgstat state is persisted */
@@ -1026,8 +1026,12 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg, TimestampTz now)
 			TimestampDifference(pgLastSessionReportTime, now, &secs, &usecs);
 			pgLastSessionReportTime = now;
 			tsmsg->m_session_time = (PgStat_Counter) secs * 1000000 + usecs;
-			tsmsg->m_active_time = pgStatActiveTime;
-			tsmsg->m_idle_in_xact_time = pgStatTransactionIdleTime;
+
+			/* send the difference since the last report */
+			tsmsg->m_active_time =
+pgstat_get_my_active_time() - pgStatLastActiveTime;
+			tsmsg->m_idle_in_xact_time =
+pgstat_get_my_transaction_idle_time() - pgStatLastTransactionIdleTime;
 		}
 		else
 		{
@@ -1039,8 +1043,8 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg, TimestampTz now)
 		pgStatXactRollback = 0;
 		pgStatBlockReadTime = 0;
 		pgStatBlockWriteTime = 0;
-		pgStatActiveTime = 0;
-		pgStatTransactionIdleTime = 0;
+		pgStatLastActiveTime = pgstat_get_my_active_time();
+		pgStatLastTransactionIdleTime = pgstat_get_my_transaction_idle_time();
 	}
 	else
 	{
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 2fecf26a2c..ccea8e3325 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -468,6 +468,13 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
 	beentry->st_procpid = 0;	/* mark invalid */
 
+	/*
+	 * Reset per-backend counters so that accumulated values for the current
+	 * backend are not used for future backends.

Re: Deparsing rewritten query

2022-02-04 Thread Julien Rouhaud
Hi,

On Wed, Feb 02, 2022 at 07:49:41PM +0100, Pavel Stehule wrote:
>
> I checked this trivial patch, and I don't see any problem. Again I ran
> check-world with success. The documentation for this feature is not
> necessary. But I am not sure about regress tests. Without any other code,
> enforcing printalias will be invisible. What do you think about the
> transformation of your extension to a new module in src/test/modules? Maybe
> it can be used for other checks in future.

I'm not opposed, but previously Tom explicitly said that he thinks this feature
is useless and is strongly opposed to making any kind of promise that the
current interface to make it possible (if get_query_def() is exposed) would be
maintained.  Adding such a test module would probably be a reason to reject the
patch altogether.  I'm just hoping that this change, which is a no-op for
any legal query, is acceptable.  It can only break something if you feed wrong
data to get_query_def(), which would be my problem and not the project's
problem.