Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 19:35 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> > I’d certainly think “with stats” would be the preferred default of
> > our users.
>
> I'm concerned there could still be paths that lead to an error. For
> pg_restore, or when loading a SQL file, a single error isn't fatal
> (unless -e is specified), but it still could be somewhat scary to see
> errors during a reload.


I understand that point.

> Also, it's new behavior, so it may cause some minor surprises, or there
> might be minor interactions to work out. For instance, dumping stats
> doesn't make a lot of sense if pg_upgrade (or something else) is just
> going to run analyze anyway.


But we don’t expect anything to run analyze … do we?  So I’m not sure why
it makes sense to raise this as a concern.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?


What’s different, given the above arguments, about making the change in 18
instead of now?  I also suspect that if we say “we will change the default
later” … that later won’t ever come, and we will end up making our users
always have to remember to say “with-stats” instead.

The stats are important which is why the effort is being made in the first
place. If just doing an analyze after loading the data was good enough then
this wouldn’t be getting worked on.

Independently, I had a thought around doing an analyze as the data is being
loaded ... though we can’t do that for indexes (we could perhaps analyze
the indexed values as we build the index).  This works when we do a
truncate or create the table in the same transaction, so we would tie into
some of the existing logic that we have around that.  We would also adjust
COPY to accept an option that specifies the anticipated number of rows
being loaded (which we can figure out reasonably well during the dump
phase).  Perhaps this would lead to a pg_dump option to do the data load as
a transaction with a truncate before the copy (the point being to still
allow parallel load while getting the benefits of knowing that we are
completely reloading the table).  Just some other thoughts, which I don’t
intend to take away from the current effort at all; I see it as valuable
and think it should be enabled by default.
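
A rough sketch of what such a per-table reload might look like (this is
purely illustrative; the ROWS option on COPY does not exist today and
stands in for the proposed anticipated-row-count hint, and the table and
file names are made up):

```sql
-- Hypothetical sketch only: reload one table in a single transaction,
-- truncating first so COPY can take advantage of the existing
-- same-transaction optimizations.
BEGIN;
TRUNCATE ONLY foo.bar;
COPY foo.bar FROM '/path/to/bar.dat' (FORMAT text, ROWS 1000000);
COMMIT;
```

Each parallel restore worker could run one such transaction per table,
which would preserve parallel load while still signalling "this is a
complete reload of the table."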

Thanks!

Stephen



Re: Statistics Import and Export

2024-03-29 Thread Stephen Frost
Greetings,

On Fri, Mar 29, 2024 at 11:05 Jeff Davis  wrote:

> On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> > 0002:
> > - All relstats and attrstats calls are now their own statement
> > instead of a compound statement
> > - moved the archive TOC entry from post-data back to SECTION_NONE (as
> > it was modeled on object COMMENTs), which seems to work better.
> > - remove meta-query in favor of more conventional query building
> > - removed all changes to fe_utils/
>
> Can we get a consensus on whether the default should be with stats or
> without? That seems like the most important thing remaining in the
> pg_dump changes.


I’d certainly think “with stats” would be the preferred default of our
users.

Thanks!

Stephen


Re: Large block sizes support in Linux

2024-03-27 Thread Stephen Frost
Greetings,

* Pankaj Raghav (ker...@pankajraghav.com) wrote:
> On 23/03/2024 05:53, Thomas Munro wrote:
> > On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
> >  wrote:
> >> My team and I have been working on adding Large block size(LBS)
> >> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> >> able to create XFS with FS block size > page size of the system on Linux.
> >> We also gave a talk about it in Linux Plumbers conference recently[2]
> >> for more context. The initial support is only for XFS but more FSs will
> >> follow later.
> > 
> > Very cool!

Yes, this is very cool sounding and could be a real difference for PG.

> > (I used XFS on IRIX in the 90s, and it had large blocks then, a
> > feature lost in the port to Linux AFAIK.)
> 
> Yes, I heard this also from the Maintainer of XFS that they had to drop
> this functionality when they did the port. :)

I also recall the days of XFS on IRIX...  Many moons ago.

> > The short version is that we (and MySQL, via a different scheme with
> > different tradeoffs) could avoid writing all our stuff out twice if we
> > could count on atomic writes of a suitable size on power failure, so
> > the benefits are very large.  As far as I know, there are two things
> > we need from the kernel and storage to do that on "overwrite"
> > filesystems like XFS:
> > 
> > 1.  The disk must promise that its atomicity-on-power-failure is a
> > multiple of our block size -- something like NVMe AWUPF, right?  My
> > devices seem to say 0 :-(  Or I guess the filesystem has to
> > compensate, but then it's not exactly an overwrite filesystem
> > anymore...
> 
> 0 means 1 logical block, which might be 4k in your case. Typically device
> vendors have to put extra hardware to guarantee bigger atomic block sizes.

If I'm following correctly, this would mean that PG with FPW=off
(assuming everything else works) would be safe on more systems if PG
supported a 4K block size than if PG only supports 8K blocks, right?

There's been discussion and even some patches posted around the idea of
having run-time support in PG for different block sizes.  Currently,
it's a compile-time option with the default being 8K, meaning that's the
only option on a huge number of the deployed PG environments out there.
Moving it to run-time has some challenges and there are concerns about the
performance ... but if it meant we could run safely with FPW=off, that's
a pretty big deal.  On the other hand, if the expectation is that
basically everything will support atomic 8K, then we might be able to
simply keep that and not deal with supporting different page sizes at
run-time (of course, this is only one of the considerations in play, but
it could be particularly key, if I'm following correctly).

Appreciate any insights you can share on this.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Statistics Import and Export

2024-03-13 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it).  See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there are two different things we're talking about here: the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that acquiring and releasing that one lock
multiple times will cause much in the way of issues, and it keeps the
code consistent with the way ANALYZE works.

If it can be shown to be an issue then we could certainly revisit this.

Thanks,

Stephen




Re: Statistics Import and Export

2024-03-12 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > > One persistent problem is that there is no _safe equivalent to ARRAY_IN,
> > so
> > > that can always fail on us, though it should only do so if the string
> > > passed in wasn't a valid array input format, or the values in the array
> > > can't coerce to the attribute's basetype.
> >
> > That would happen before we even get to being called and there's not
> > much to do about it anyway.
> 
> Not sure I follow you here. The ARRAY_IN function calls happen once for
> every non-null stavaluesN parameter, and it's done inside the function
> because the result type could be the base type for a domain/array type, or
> could be the type itself. I suppose we could move that determination to the
> caller, but then we'd need to call get_base_element_type() inside a client,
> and that seems wrong if it's even possible.

Ah, yeah, ok, I see what you're saying here and sure, there's a risk
those might ERROR too, but that's outright invalid data then as opposed
to a NULL getting passed in.

> > > Or one compound command
> > >
> > > SELECT pg_set_relation_stats(t.oid, ...)
> > >  pg_set_attribute_stats(t.oid, 'id'::name, ...),
> > >  pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> > >  ...
> > > FROM (VALUES('foo.bar'::regclass)) AS t(oid);
> > >
> > > The second one has the feature that if any one attribute fails, then the
> > > whole update fails, except, of course, for the in-place update of
> > pg_class.
> > > This avoids having an explicit transaction block, but we could get that
> > > back by having restore wrap the list of commands in a transaction block
> > > (and adding the explicit lock commands) when it is safe to do so.
> >
> > Hm, I like this approach as it should essentially give us the
> > transaction block we had been talking about wanting but without needing
> > to explicitly do a begin/commit, which would add in some annoying
> > complications.  This would hopefully also reduce the locking concern
> > mentioned previously, since we'd get the lock needed in the first
> > function call and then the others would be able to just see that we've
> > already got the lock pretty quickly.
> 
> True, we'd get the lock needed in the first function call, but wouldn't we
> also release that very lock before the subsequent call? Obviously we'd be
> shrinking the window in which another process could get in line and take a
> superior lock, and the universe of other processes that would even want a
> lock that blocks us is nil in the case of an upgrade, identical to existing
> behavior in the case of an FDW ANALYZE, and perfectly fine in the case of
> someone tinkering with stats.

No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it).  See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.

Thanks!

Stephen




Re: Statistics Import and Export

2024-03-11 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > Having thought about it a bit more, I generally like the idea of being
> > able to just update one stat instead of having to update all of them at
> > once (and therefore having to go look up what the other values currently
> > are...).  That said, per below, perhaps making it strict is the better
> > plan.
> 
> v8 has it as strict.

Ok.

> > > > Also, in some cases we allow the function to be called with a
> > > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > > > OID ends up being NULL).
> > >
> > > Thoughts on it emitting a WARN or NOTICE before returning false?
> >
> > Eh, I don't think so?
> >
> > Where this is coming from is that we can often end up with functions
> > like these being called inside of larger queries, and having them spit
> > out WARN or NOTICE will just make them noisy.
> >
> > That leads to my general feeling of just returning NULL if called with a
> > NULL OID, as we would get with setting the function strict.
> 
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

> > Well, that code is for pg_statistic while I was looking at pg_class (in
> > vacuum.c:1428-1443, where we track if we're actually changing anything
> > and only make the pg_class change if there's actually something
> > different):
> 
> I can do that, especially since it's only 3 tuples of known types, but my
> reservations are summed up in the next comment.

> > Not sure why we don't treat both the same way though ... although it's
> > probably the case that it's much less likely to have an entire
> > pg_statistic row be identical than the few values in pg_class.
> 
> That would also involve comparing ANYARRAY values, yuk. Also, a matched
> record will never be the case when used in primary purpose of the function
> (upgrades), and not a big deal in the other future cases (if we use it in
> ANALYZE on foreign tables instead of remote table samples, users
> experimenting with tuning queries under hypothetical workloads).

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

> > Hmm, that's a valid point, so a NULL passed in would need to set that
> > value actually to NULL, presumably.  Perhaps then we should have
> > pg_set_relation_stats() be strict and have pg_set_attribute_stats()
> > handle NULLs passed in appropriately, returning NULL if the relation
> > itself, attname, or any other required (non-NULL'able) argument passed
> > in is NULL.
> >
> 
> That's how I have relstats done in v8, and could make it do that for attr
> stats.

That'd be my suggestion, at least, but as I mention above, it's not a
position I hold very strongly.

> > (What I'm trying to drive at here is a consistent interface for these
> > functions, but one which does a no-op instead of returning an ERROR on
> > values being passed in which aren't allowable; it can be quite
> > frustrating trying to get a query to work where one of the functions
> > decides to return ERROR instead of just ignoring things passed in which
> > aren't valid.)
> 
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be hig

Re: Statistics Import and Export

2024-03-11 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > > +/*
> > > + * Set statistics for a given pg_class entry.
> > > + *
> > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > > + *
> > > + * This does an in-place (i.e. non-transactional) update of pg_class,
> > just as
> > > + * is done in ANALYZE.
> > > + *
> > > + */
> > > +Datum
> > > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > > +{
> > > + const char *param_names[] = {
> > > + "relation",
> > > + "reltuples",
> > > + "relpages",
> > > + };
> > > +
> > > + Oid relid;
> > > + Relationrel;
> > > + HeapTuple   ctup;
> > > + Form_pg_class   pgcform;
> > > +
> > > + for (int i = 0; i <= 2; i++)
> > > + if (PG_ARGISNULL(i))
> > > + ereport(ERROR,
> > > +
> >  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +  errmsg("%s cannot be NULL",
> > param_names[i])));
> >
> > Why not just mark this function as strict..?  Or perhaps we should allow
> > NULLs to be passed in and just not update the current value in that
> > case?
> 
> Strict could definitely apply here, and I'm inclined to make it so.

Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
> 
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.
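
For illustration, strict functions already behave exactly this way; a
sketch with a made-up function (not the proposed API):

```sql
CREATE FUNCTION add_one(i int) RETURNS int
    LANGUAGE sql STRICT
    AS $$ SELECT i + 1 $$;

SELECT add_one(NULL);   -- returns NULL; a strict function's body never
                        -- runs when any argument is NULL, so no ERROR,
                        -- WARN, or NOTICE is emitted
```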

> >   Not sure if that makes sense here or not
> > offhand but figured I'd mention it as something to consider.
> >
> > > + pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > > + pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > > + pgcform->relpages = PG_GETARG_INT32(2);
> >
> > Shouldn't we include relallvisible?
> 
> Yes. No idea why I didn't have that in there from the start.

Ok.

> > Also, perhaps we should use the approach that we have in ANALYZE, and
> > only actually do something if the values are different rather than just
> > always doing an update.
> 
> That was how it worked back in v1, more for the possibility that there was
> no matching JSON to set values.
> 
> Looking again at analyze.c (currently lines 1751-1780), we just check if
> there is a row in the way, and if so we replace it. I don't see where we
> compare existing values to new values.

Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

vacuum.c:1531
/* If anything changed, write out the tuple. */
if (dirty)
heap_inplace_update(rd, ctup);

Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

> > > +Datum
> > > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
> >
> > > + /* names of columns that cannot be null */
> > > + const char *required_param_names[] = {
> > > + "relation",
> > > + "attname",
> > > + "stainherit",
> > > + "stanullfrac",
> > > + "stawidth",
> > > + "stadistinct",
> > > + "stakind1",
> > > + "stakind2",
> > > + "stakind3",
> > > + "stakind4",
> > > + "stakind5",
> > > + };
> >
> > Same comment here as above wrt NULL being passed in.
> 
> In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
> and are NULL more often than not.

Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handle NULLs passed in appropriately, returning NULL if the relation
itself, attname, or any other required (non-NULL'able) argument passed
in is NULL.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of j

Re: Statistics Import and Export

2024-03-08 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
> 
> It is my (incomplete and entirely second-hand) understanding that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.

No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
> 
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
> 
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
> 
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>proname => 'table_to_xml', procost => '100', provolatile => 's',
>proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>  
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>proargtypes => 'int2',
>prosrc =>

Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Stephen Frost
Greetings,

* Li, Yong (y...@ebay.com) wrote:
> > On Mar 7, 2024, at 03:09, Stephen Frost  wrote:
> > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> >> I suppose this is important to do if we ever want to move SLRUs into
> >> shared buffers.  However, I wonder about the extra time this adds to
> >> pg_upgrade.  Is this something we should be concerned about?  Is there
> >> any measurement/estimates to tell us how long this would be?  Right now,
> >> if you use a cloning strategy for the data files, the upgrade should be
> >> pretty quick ... but the amount of data in pg_xact and pg_multixact
> >> could be massive, and the rewrite is likely to take considerable time.
> > 
> > While I definitely agree that there should be some consideration of
> > this concern, it feels on-par with the visibility-map rewrite which was
> > done previously.  Larger systems will likely have more to deal with than
> > smaller systems, but it's still a relatively small portion of the data
> > overall.
> > 
> > The benefit of this change, beyond just the possibility of moving them
> > into shared buffers some day in the future, is that this would mean that
> > SLRUs will have checksums (if the cluster has them enabled).  That
> > benefit strikes me as well worth the cost of the rewrite taking some
> > time and the minor loss of space due to the page header.
> > 
> > Would it be useful to consider parallelizing this work?  There's already
> > parts of pg_upgrade which can be parallelized and so this isn't,
> > hopefully, a big lift to add, but I'm not sure if there's enough work
> > being done here CPU-wise, compared to the amount of IO being done, to
> > have it make sense to run it in parallel.  Might be worth looking into
> > though, at least, as disks have gotten to be quite fast.
> 
> Thank Alvaro and Stephen for your thoughtful comments.
> 
> I did a quick benchmark regarding pg_upgrade time, and here are the results.

> For clog, 2048 segments can host about 2 billion transactions, right at the 
> limit for wraparound.
> That’s the maximum we can have.  2048 segments are also big for pg_multixact 
> SLRUs.
> 
> Therefore, on a modern hardware, in the worst case, pg_upgrade will run for 7 
> seconds longer.

Thanks for testing!  That strikes me as perfectly reasonable and seems
unlikely that we'd get much benefit from parallelizing it, so I'd say it
makes sense to keep this code simple.

Thanks again!

Stephen




Re: improve ssl error code, 2147483650

2024-03-07 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 07/03/2024 02:12, David Zhang wrote:
> > The SSL_CTX_load_verify_locations function in OpenSSL will return NULL
> > if there is a system error, such as "No such file or directory" in this
> > case:
> > 
> > const char *ERR_reason_error_string(unsigned long e)
> > {
> >       ERR_STRING_DATA d, *p = NULL;
> >       unsigned long l, r;
> > 
> >       if (!RUN_ONCE(&err_string_init, do_err_strings_init)) {
> >       return NULL;
> >       }
> > 
> >       /*
> >    * ERR_reason_error_string() can't safely return system error strings,
> >    * since openssl_strerror_r() needs a buffer for thread safety, and we
> >    * haven't got one that would serve any sensible purpose.
> >    */
> >       if (ERR_SYSTEM_ERROR(e))
> >       return NULL;
> 
> That's pretty unfortunate. As typical with OpenSSL, this stuff is not very
> well documented, but I think we could do something like this in
> SSLerrmessage():
> 
> if (ERR_SYSTEM_ERROR(e))
> errreason = strerror(ERR_GET_REASON(e));
> 
> ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only
> documentation I could find was in this one obscure place in the man pages: 
> https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html.
> But as a best-effort thing, it would still be better than "SSL error code
> 2147483650".

Agreed that it doesn't seem well documented.  I was trying to figure out
what the 'right' answer here was myself and not having much success.  If
the above works, then +1 to that.

> > It would be better to perform a simple SSL file check before passing the
> > SSL file to OpenSSL APIs so that the system error can be captured and a
> > meaningful message provided to the end user.
> 
> That feels pretty ugly. I agree it would catch most of the common mistakes
> in practice, so maybe we should just hold our noses and do it anyway, if the
> above ERR_SYSTEM_ERROR() method doesn't work.

Yeah, seems better to try and handle this the OpenSSL way ... if that's
possible to do.

> It's sad that we cannot pass a file descriptor or in-memory copy of the file
> contents to those functions.

Agreed.

Thanks!

Stephen




Re: Provide a pg_truncate_freespacemap function

2024-03-06 Thread Stephen Frost
Greetings,

* Ronan Dunklau (ronan.dunk...@aiven.io) wrote:
> As we are currently experiencing a FSM corruption issue [1], we need to 
> rebuild FSM when we detect it. 

Ideally, we'd figure out a way to pick up on this and address it without
the user needing to intervene, however ...

> I noticed we have something to truncate a visibility map, but nothing for the 
> freespace map, so I propose the attached (liberally copied from the VM 
> counterpart) to allow to truncate a FSM without incurring downtime, as 
> currently our only options are to either VACUUM FULL the table or stop the 
> cluster and remove the FSM manually.

I agree that this would generally be a useful thing to have.

> Does that seem correct ?

Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
upgrade script, similar to what you'll find at the bottom of
pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.
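
For instance, something along these lines at the end of the extension
script (the function name and signature here are assumed for
illustration; use whatever the patch actually defines):

```sql
-- Mirror the pattern at the bottom of pg_visibility--1.1.sql so the
-- function is not executable by PUBLIC; access must then be granted
-- explicitly.
REVOKE ALL ON FUNCTION pg_truncate_freespace_map(regclass) FROM PUBLIC;
```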

Beyond that, I'd suggest a function-level comment above the definition
of the function itself (which is where we tend to put those- not at the
point where we declare the function).

Thanks!

Stephen




Re: Proposal to add page headers to SLRU pages

2024-03-06 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> I suppose this is important to do if we ever want to move SLRUs into
> shared buffers.  However, I wonder about the extra time this adds to
> pg_upgrade.  Is this something we should be concerned about?  Is there
> any measurement/estimates to tell us how long this would be?  Right now,
> if you use a cloning strategy for the data files, the upgrade should be
> pretty quick ... but the amount of data in pg_xact and pg_multixact
> could be massive, and the rewrite is likely to take considerable time.

While I definitely agree that there should be some consideration of
this concern, it feels on-par with the visibility-map rewrite which was
done previously.  Larger systems will likely have more to deal with than
smaller systems, but it's still a relatively small portion of the data
overall.

The benefit of this change, beyond just the possibility of moving them
into shared buffers some day in the future, is that this would mean that
SLRUs will have checksums (if the cluster has them enabled).  That
benefit strikes me as well worth the cost of the rewrite taking some
time and the minor loss of space due to the page header.

Would it be useful to consider parallelizing this work?  There's already
parts of pg_upgrade which can be parallelized and so this isn't,
hopefully, a big lift to add, but I'm not sure if there's enough work
being done here CPU-wise, compared to the amount of IO being done, to
have it make sense to run it in parallel.  Might be worth looking into
though, at least, as disks have gotten to be quite fast.

Thanks!

Stephen




Re: Statistics Import and Export

2024-03-06 Thread Stephen Frost
Greetings,

* Matthias van de Meent (boekewurm+postg...@gmail.com) wrote:
> On Wed, 6 Mar 2024 at 11:33, Stephen Frost  wrote:
> > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent 
> >  wrote:
> >> Or even just one VALUES for the whole statistics loading?
> > I don’t think we’d want to go beyond one relation at a time as then it can 
> > be parallelized, we won’t be trying to lock a whole bunch of objects at 
> > once, and any failures would only impact that one relation’s stats load.
> 
> That also makes sense.

Great, thanks.

> >> I suspect the main issue with combining this into one statement
> >> (transaction) is that failure to load one column's statistics implies
> >> you'll have to redo all the other statistics (or fail to load the
> >> statistics at all), which may be problematic at the scale of thousands
> >> of relations with tens of columns each.
> >
> >
> > I’m pretty skeptical that “stats fail to load and lead to a failed 
> > transaction” is a likely scenario that we have to spend a lot of effort on.
> 
> Agreed on the "don't have to spend a lot of time on it", but I'm not
> so sure on the "unlikely" part while the autovacuum daemon is
> involved, specifically for non-upgrade pg_restore. I imagine (haven't
> checked) that autoanalyze is disabled during pg_upgrade, but
> pg_restore doesn't do that, while it would have to be able to restore
> statistics of a table if it is included in the dump (and the version
> matches).

Even if autovacuum was running and it kicked off an auto-analyze of the
relation at just the time that we were trying to load the stats, there
would be appropriate locking happening to keep them from causing an
outright ERROR and transaction failure, or if not, that's a lack of
locking and should be fixed.  With the per-attribute-function-call
approach, that could lead to a situation where some stats are from the
auto-analyze and some are from the stats being loaded but I'm not sure
if that's a big concern or not.

For users of this, I would think we'd generally encourage them to
disable autovacuum on the tables they're loading as otherwise they'll
end up with the stats going back to whatever an auto-analyze ends up
finding.  That may be fine in some cases, but not in others.

A couple of questions to think about though: Should pg_dump explicitly ask
autovacuum to ignore these tables while we're loading them?
Should these functions only perform a load when there aren't any
existing stats?  Should the latter be an argument to the functions to
allow the caller to decide?
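To make that last question concrete, one hedged sketch of such a function
signature (the function and parameter names here are placeholders, not a
settled API) might be:

```sql
-- Hypothetical signature only: an explicit flag letting the caller decide
-- whether pre-existing statistics should be overwritten or the load skipped.
SELECT pg_import_rel_stats(
    'schema.relation'::regclass,
    ntuples => 1000000,
    npages  => 12345,
    replace_existing => false  -- if false, do nothing when stats already exist
);
```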

> > What are the cases where we would be seeing stats reloads failing where it 
> > would make sense to re-try on a subset of columns, or just generally, if we 
> > know that the pg_dump version matches the target server version?
> 
> Last time I checked, pg_restore's default is to load data on a
> row-by-row basis without --single-transaction or --exit-on-error. Of
> course, pg_upgrade uses its own set of flags, but if a user is
> restoring stats with pg_restore, I suspect they'd rather have some
> column's stats loaded than no stats at all; so I would assume this
> requires one separate pg_import_pg_statistic()-transaction for every
> column.

Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
run to address the missing stats for those columns, and that ANALYZE
will then blow away whatever stats were loaded previously anyway.  All
a partial stats load would have bought us is better plans for a subset
of queries in the interim, at the cost of individually loading the
stats and dealing with some of them succeeding and some failing.

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.
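For concreteness, the batched per-relation form using the VALUES construct
(the function name and column list are illustrative placeholders only, not a
settled API) could look something like:

```sql
-- Hypothetical sketch: one statement per relation, with the per-attribute
-- statistics supplied through a VALUES list, so a single failure only
-- affects that relation's stats load.
SELECT pg_import_pg_statistic('schema.relation', attname,
                              nullfrac, avgwidth, ndistinct)
FROM (VALUES
        ('id',   0.0,  4, -1.0),
        ('name', 0.01, 32, 5000.0)
     ) AS relation_stats (attname, nullfrac, avgwidth, ndistinct);
```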

Thanks!

Stephen




Re: Statistics Import and Export

2024-03-06 Thread Stephen Frost
Greetings,

On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Fri, 1 Mar 2024, 04:55 Corey Huinker,  wrote:
> >> Also per our prior discussion- this makes sense to include in post-data
> section, imv, and also because then we have the indexes we may wish to load
> stats for, but further that also means it’ll be in the paralleliziable part
> of the process, making me a bit less concerned overall about the individual
> timing.
> >
> >
> > The ability to parallelize is pretty persuasive. But is that
> per-statement parallelization or do we get transaction blocks? i.e. if we
> ended up importing stats like this:
> >
> > BEGIN;
> > LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> > LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> > SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> > SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> > SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
>
> How well would this simplify to the following:
>
> SELECT pg_import_statistic('schema.relation', attname, ...)
> FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);


Using a VALUES construct for this does seem like it might make it cleaner,
so +1 for investigating that idea.

Or even just one VALUES for the whole statistics loading?


I don’t think we’d want to go beyond one relation at a time as then it can
be parallelized, we won’t be trying to lock a whole bunch of objects at
once, and any failures would only impact that one relation’s stats load.

I suspect the main issue with combining this into one statement
> (transaction) is that failure to load one column's statistics implies
> you'll have to redo all the other statistics (or fail to load the
> statistics at all), which may be problematic at the scale of thousands
> of relations with tens of columns each.


I’m pretty skeptical that “stats fail to load and lead to a failed
transaction” is a likely scenario that we have to spend a lot of effort
on.  I’m pretty bullish on the idea that this simply won’t happen except in
very exceptional cases under a pg_upgrade (where the pg_dump that’s used
must match the target server version) and where it happens under a pg_dump
it’ll be because it’s an older pg_dump’s output and the answer will likely
need to be “you’re using a pg_dump file generated using an older version of
pg_dump and need to exclude stats entirely from the load and instead run
analyze on the data after loading it.”

What are the cases where we would be seeing stats reloads failing where it
would make sense to re-try on a subset of columns, or just generally, if we
know that the pg_dump version matches the target server version?

Thanks!

Stephen

>


Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-05 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> > Attached is an updated patch which drops the 'such as' and adds a
> > sentence mentioning that BRIN is the only in-core summarizing index.
> 
> The original patch reads more clearly to me. In v4, summarizing (the
> exception) feels like it's dominating the description.
> 
> Also, is it standard practice to backport this kind of doc update? I
> ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16
> as well.

I do think this change should be back-ported to when the change
happened, otherwise the documentation won't reflect what's in the
product for that version...

> Attached my own suggested wording that hopefully addresses Stephen and
> Alvaro's concerns. I agree that it's tricky to write so I took a more
> minimalist approach:

>  * I got rid of the "In summary" sentence because (a) it's confusing
> now that we're talking about summarizing indexes; and (b) it's not
> summarizing anything, it's just redundant.

>  * I removed the mention partial or expression indexes. It's a bit
> redundant and doesn't seem especially helpful in this context.

Just to point it out- the "In summary" did provide a bit of a summary,
before the 'partial or expression indexes' bit was removed.  That said,
I tend to still agree with these changes as I feel that users will
generally be able to infer that this applies to partial and expression
indexes without it having to be spelled out to them.

> If this is agreeable I can commit it.

Great, thanks!

Stephen




Re: Statistics Import and Export

2024-03-01 Thread Stephen Frost
Greetings,

On Fri, Mar 1, 2024 at 12:14 Nathan Bossart 
wrote:

> On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
> >> That’s certainly a fair point and my initial reaction (which could
> >> certainly be wrong) is that it’s unlikely to be an issue- but also, if
> you
> >> feel you could make it work with an array and passing all the attribute
> >> info in with one call, which I suspect would be possible but just a bit
> >> more complex to build, then sure, go for it. If it ends up being overly
> >> unwieldy then perhaps the  per-attribute call would be better and we
> could
> >> perhaps acquire the lock before the function calls..?  Doing a check to
> see
> >> if we have already locked it would be cheaper than trying to acquire a
> new
> >> lock, I’m fairly sure.
> >
> > Well the do_analyze() code was already ok with acquiring the lock once
> for
> > non-inherited stats and again for inherited stats, so the locks were
> > already not the end of the world. However, that's at most a 2x of the
> > locking required, and this would be natts * x, quite a bit more. Having the
> > procedures check for a pre-existing lock seems like a good compromise.
>
> I think this is a reasonable starting point.  If the benchmarks show that
> the locking is a problem, we can reevaluate, but otherwise IMHO we should
> try to keep it as simple/flexible as possible.


Yeah, this was my general feeling as well.  If it does become an issue, it
certainly seems like we would have ways to improve it in the future. Even
with this locking it is surely going to be better than having to re-analyze
the entire database which is where we are at now.

Thanks,

Stephen

>


Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost
Greetings,

On Thu, Feb 29, 2024 at 17:48 Corey Huinker  wrote:

> Having looked through this thread and discussed a bit with Corey
>> off-line, the approach that Tom laid out up-thread seems like it would
>> make the most sense overall- that is, eliminate the JSON bits and the
>> SPI and instead export the stats data by running queries from the new
>> version of pg_dump/server (in the FDW case) against the old server
>> with the intelligence of how to transform the data into the format
>> needed for the current pg_dump/server to accept, through function calls
>> where the function calls generally map up to the rows/information being
>> updated- a call to update the information in pg_class for each relation
>> and then a call for each attribute to update the information in
>> pg_statistic.
>>
>
> Thanks for the excellent summary of our conversation, though I do add that
> we discussed a problem with per-attribute functions: each function would be
> acquiring locks on both the relation (so it doesn't go away) and
> pg_statistic, and that lock thrashing would add up. Whether that overhead
> is judged significant or not is up for discussion. If it is significant, it
> makes sense to package up all the attributes into one call, passing in an
> array of some new pg_statistic-esque special type... the very issue that
> sent me down the JSON path.
>
> I certainly see the flexibility in having a per-attribute functions, but
> am concerned about non-binary-upgrade situations where the attnums won't
> line up, and if we're passing them by name then the function has to dig around
> looking for the right matching attnum, and that's overhead too. In the
> whole-table approach, we just iterate over the attributes that exist, and
> find the matching parameter row.
>

That’s certainly a fair point and my initial reaction (which could
certainly be wrong) is that it’s unlikely to be an issue- but also, if you
feel you could make it work with an array and passing all the attribute
info in with one call, which I suspect would be possible but just a bit
more complex to build, then sure, go for it. If it ends up being overly
unwieldy then perhaps the  per-attribute call would be better and we could
perhaps acquire the lock before the function calls..?  Doing a check to see
if we have already locked it would be cheaper than trying to acquire a new
lock, I’m fairly sure.

Also per our prior discussion- this makes sense to include in post-data
section, imv, and also because then we have the indexes we may wish to load
stats for, but further that also means it’ll be in the paralleliziable part
of the process, making me a bit less concerned overall about the individual
timing.

Thanks!

Stephen

>


Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker 
> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.

> v6 posted below.
> 
> Changes:
> 
> - Additional documentation about the overall process.
> - Rewording of SGML docs.
> - removed a fair number of columns from the transformation queries.
> - enabled require_match_oids in extended statistics, but I'm having my
> doubts about the value of that.
> - moved stats extraction functions to an fe_utils file stats_export.c that
> will be used by both pg_export_stats and pg_dump.
> - pg_export_stats now generates SQL statements rather than a tsv, and has
> boolean flags to set the validate and require_match_oids parameters in the
> calls to pg_import_(rel|ext)_stats.
> - pg_import_stats is gone, as importing can now be done with psql.

Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Part of this process would include mapping from OIDs/attrnum's to names
on the source side and then from those names to the appropriate
OIDs/attrnum's on the destination side.

As this code would be used by both pg_dump and the postgres_fdw, it
seems logical that it would go into the common library.  Further, it
would make sense to have this code be able to handle multiple major
versions for the foreign side, such as how postgres_fdw and pg_dump
already do.

In terms of working to ensure that newer versions support loading from
older dumps (that is, that v18 would be able to load a dump file created
by a v17 pg_dump against a v17 server in the face of changes having been
made to the statistics system in v18), we could have the functions take
a version parameter (to handle cases where the data structure is the
same but the contents have to be handled differently), use overloaded
functions, or have version-specific names for the functions.  I'm also
generally supportive of the idea that we, perhaps initially, only
support dumping/loading stats with pg_dump when in binary-upgrade mode,
which removes our need to be concerned with this (perhaps that would be
a good v1 of this feature?) as the version of pg_dump needs to match
that of pg_upgrade and the destination server for various other reasons.
Including a switch to exclude stats on restore might also be an
acceptable answer, or even simply excluding them by default when going
between major versions except in binary-upgrade mode.
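One hedged sketch of the version-parameter approach mentioned above (the
parameter and function names are illustrative assumptions, not a committed
interface):

```sql
-- Hypothetical only: the producing version is passed explicitly so a newer
-- server can correctly interpret stats emitted by an older pg_dump, even
-- when the data structure is the same but the contents differ.
SELECT pg_import_rel_stats(
    'schema.relation'::regclass,
    version => 170000,   -- server version that produced these stats
    ntuples => 1000000,
    npages  => 12345
);
```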

Along those same lines when it comes to a 'v1', I'd say that we may wish
to consider excluding extended statistics, which I am fairly confident
Corey's heard a number of times previously already but thought I would
add my own support for that.  To the extent that we do want to make
extended stats work down the road, we should probably have some
pre-patches to flush out the missing _in/_recv functions for those types
which don't have them today- and that would include modifying the _out
of those types to use names instead of OIDs/attrnums.  In thinking about
this, I was reviewing specifically pg_dependencies.  To the extent that
there are people who depend on the current output, I would think that
they'd actually appreciate this change.

I don't generally feel like we need to be checking that the OIDs between
the old server and the new server match- I appreciate that that should
be the case in a binary-upgrade situation but it still feels unnecessary
and complicated and clutters up the output and the function calls.

Overall, I definitely think this is a good project to work on as it's an
often, rightfully, complained about issue when it comes to pg_upgrade
and the amount of downtime required for it before the upgraded system
can be reasonably used again.

Thanks,

Stephen




Re: Atomic ops for unlogged LSN

2024-02-29 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> Here is a new version of the patch that uses the new atomic read/write
> functions with full barriers that were added in commit bd5132d.  Thoughts?

Saw that commit go in- glad to see it.  Thanks for updating this patch
too.  The changes look good to me.

Thanks again,

Stephen




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-27 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> On 2024-Feb-26, Stephen Frost wrote:
> > Here's an updated patch which tries to improve on the wording a bit by
> > having it be a bit more consistent.  Would certainly welcome feedback on
> > it though, of course.  This is a tricky bit of language to write and
> > a complex optimization to explain.
> 
> Should we try to explain what a "summarizing" index is?  Right now
> the only way to know is to look at the source code or attach a debugger
> and see if IndexAmRoutine->amsummarizing is true.  Maybe we can just say
> "currently the only in-core summarizing index is BRIN" somewhere in the
> page.  (The patch's proposal to say "... such as BRIN" strikes me as too
> vague.)

Not sure about explaining what one is, but I'd be fine adding that
language.  I was disappointed to see that there's no way to figure out
the value of amsummarizing for an access method other than looking at
the code.  Not as part of this specific patch, but I'd generally support
having a way to get that information at the SQL level (or perhaps everything
from IndexAmRoutine?).

Attached is an updated patch which drops the 'such as' and adds a
sentence mentioning that BRIN is the only in-core summarizing index.

Thanks!

Stephen
From 131f83868b947b379e57ea3dad0acf5a4f95bca8 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 26 Feb 2024 17:17:54 -0500
Subject: [PATCH] docs: Update HOT update docs for 19d8e2308b

Commit 19d8e2308b changed when the HOT update optimization is possible
but neglected to update the Heap-Only Tuples (HOT) documentation.  This
patch updates that documentation accordingly.

Author: Elizabeth Christensen
Reviewed-By: Stephen Frost, Alvaro Herrera
Backpatch-through: 16
Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com
---
 doc/src/sgml/storage.sgml | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3ea4e5526d..f2bb0283ae 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -1097,8 +1097,13 @@ data. Empty in ordinary tables.
   

 
- The update does not modify any columns referenced by the table's
- indexes, including expression and partial indexes.
+ The update only modifies columns which are not referenced by the
+ table's indexes or are only referenced by summarizing indexes and
+ does not update any columns referenced by the table's
+ non-summarizing indexes, including expression and partial
+ non-summarizing indexes.  The only summarizing index in the core
+ PostgreSQL distribution is
+ BRIN.
  


@@ -1114,7 +1119,8 @@ data. Empty in ordinary tables.
   

 
- New index entries are not needed to represent updated rows.
+ New index entries are not needed to represent updated rows, however,
+ summary indexes may still need to be updated.
 


@@ -1130,11 +1136,10 @@ data. Empty in ordinary tables.
  
 
  
-  In summary, heap-only tuple updates can only be created
-  if columns used by indexes are not updated.  You can
-  increase the likelihood of sufficient page space for
-  HOT updates by decreasing a table's fillfactor.
+  In summary, heap-only tuple updates can only be created if columns used by
+  non-summarizing indexes are not updated. You can increase the likelihood of
+  sufficient page space for HOT updates by decreasing
+  a table's fillfactor.
   If you don't, HOT updates will still happen because
   new rows will naturally migrate to new pages and existing pages with
   sufficient free space for new row versions.  The system view 



Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-26 Thread Stephen Frost
Greetings,

* Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> > On Feb 26, 2024, at 4:21 PM, Stephen Frost  wrote:
> > * Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> >> I have a small documentation patch to the HOT updates page
> >> <https://www.postgresql.org/docs/current/storage-hot.html>to add references
> >> to summary (BRIN) indexes not blocking HOT updates
> >> <https://www.postgresql.org/message-id/cafp7qwpmrgcdaqumn7onn9hjrj3u4x3zrxdgft0k5g2jwvn...@mail.gmail.com>,
> >> committed 19d8e2308b.
> > 
> > Sounds good to me, though the attached patch didn't want to apply, and
> > it isn't immediately clear to me why, but perhaps you could try saving
> > the patch from a different editor and re-sending it?
> 
> Thanks Stephen, attempt #2 here. 

Here's an updated patch which tries to improve on the wording a bit by
having it be a bit more consistent.  Would certainly welcome feedback on
it though, of course.  This is a tricky bit of language to write and
a complex optimization to explain.

Thanks!

Stephen
From bb88237d1f807572a496ff2a267ab56bef61ac8e Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 26 Feb 2024 17:17:54 -0500
Subject: [PATCH] docs: Update HOT update docs for 19d8e2308b

Commit 19d8e2308b changed when the HOT update optimization is possible
but neglected to update the Heap-Only Tuples (HOT) documentation.  This
patch updates that documentation accordingly.

Author: Elizabeth Christensen
Reviewed-By: Stephen Frost
Backpatch-through: 16
Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com
---
 doc/src/sgml/storage.sgml | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3ea4e5526d..d6c53a8d25 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -1097,8 +1097,11 @@ data. Empty in ordinary tables.
   

 
- The update does not modify any columns referenced by the table's
- indexes, including expression and partial indexes.
+ The update only modifies columns which are not referenced by the
+ table's indexes or are only referenced by summarizing indexes (such
+ as BRIN) and does not update any columns
+ referenced by the table's non-summarizing indexes, including
+ expression and partial non-summarizing indexes.
  


@@ -1114,7 +1117,8 @@ data. Empty in ordinary tables.
   

 
- New index entries are not needed to represent updated rows.
+ New index entries are not needed to represent updated rows, however,
+ summary indexes may still need to be updated.
 


@@ -1130,11 +1134,10 @@ data. Empty in ordinary tables.
  
 
  
-  In summary, heap-only tuple updates can only be created
-  if columns used by indexes are not updated.  You can
-  increase the likelihood of sufficient page space for
-  HOT updates by decreasing a table's fillfactor.
+  In summary, heap-only tuple updates can only be created if columns used by
+  non-summarizing indexes are not updated. You can increase the likelihood of
+  sufficient page space for HOT updates by decreasing
+  a table's fillfactor.
   If you don't, HOT updates will still happen because
   new rows will naturally migrate to new pages and existing pages with
   sufficient free space for new row versions.  The system view 



Re: Control your disk usage in PG: Introduction to Disk Quota Extension

2024-02-26 Thread Stephen Frost
Greetings,

* Xing Guo (higuox...@gmail.com) wrote:
> Looks that many hackers are happy with the original patch except that
> the original patch needs a simple rebase, though it has been 3 years.

I'm not completely against the idea of adding these hooks, but I have to
say that it's unfortunate to imagine having to use an extension for
something like quotas as it's really something that would ideally be in
core.

> Shall we push forward this patch so that it can benefit extensions not
> only diskquota?

Would be great to hear about other use-cases for these hooks, which
would also help us be comfortable that these are the right hooks to
introduce with the correct arguments.  What are the other extensions
that you're referring to here..?

Thanks,

Stephen




Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-26 Thread Stephen Frost
Greetings,

* Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> I have a small documentation patch to the HOT updates page to add
> references to summary (BRIN) indexes not blocking HOT updates,
> committed 19d8e2308b.

Sounds good to me, though the attached patch didn't want to apply, and
it isn't immediately clear to me why, but perhaps you could try saving
the patch from a different editor and re-sending it?

Thanks,

Stephen




Re: Set log_lock_waits=on by default

2024-02-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> >> On Tue, 2024-02-06 at 12:29 -0500, Tom Lane wrote:
> >>> I was, and remain, of the opinion that that was a bad idea that
> >>> we'll eventually revert, just like we previously got rid of most
> >>> inessential log chatter in the default configuration.
> 
> >> Unsurprisingly, I want to argue against that.
> 
> > I tend to agree with this position- log_checkpoints being on has been a
> > recommended configuration for a very long time and is valuable
> > information to have about what's been happening when someone does go and
> > look at the log.
> 
> We turned on default log_checkpoints in v15, which means that behavior
> has been in the field for about sixteen months.  I don't think that
> that gives it the status of a settled issue; my bet is that most
> users still have not seen it.

Apologies for not being clear- log_checkpoints being on has been a
configuration setting that I (and many others I've run into) have been
recommending since as far back as I can remember.

I was not referring to the change in the default.

Thanks,

Stephen




Re: Set log_lock_waits=on by default

2024-02-07 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2024-02-06 at 12:29 -0500, Tom Lane wrote:
> > Nathan Bossart  writes:
> > > It looks like there are two ideas:
> > > [...]
> > > * Set log_lock_waits on by default.  The folks on this thread seem to
> > >   support this idea, but given the lively discussion for enabling
> > >   log_checkpoints by default [0], I'm hesitant to commit something like
> > >   this without further community discussion.
> > 
> > I was, and remain, of the opinion that that was a bad idea that
> > we'll eventually revert, just like we previously got rid of most
> > inessential log chatter in the default configuration.  So I doubt
> > you'll have much trouble guessing my opinion of this one.  I think
> > the default logging configuration should be chosen with the
> > understanding that nobody ever looks at the logs of most
> > installations, and we should be more worried about their disk space
> > consumption than anything else.  Admittedly, log_lock_waits is less
> > bad than log_checkpoints, because no such events will occur in a
> > well-tuned configuration ... but still, it's going to be useless
> > chatter in the average installation.
> 
> Unsurprisingly, I want to argue against that.

I tend to agree with this position- log_checkpoints being on has been a
recommended configuration for a very long time and is valuable
information to have about what's been happening when someone does go and
look at the log.

Having log_lock_waits on by default is likely to be less noisy and even
more useful for going back in time to figure out what happened.

> You say that it is less bad than "log_checkpoints = on", and I agree.
> I can't remember seeing any complaints by users about
> "log_checkpoints", and I predict that the complaints about
> "log_lock_waits = on" will be about as loud.

Yeah, agreed.

> I am all for avoiding useless chatter in the log.  In my personal
> experience, that is usually "database typo does not exist" and
> constraint violation errors.  I always recommend people to enable
> "log_lock_waits", and so far I have not seen it spam the logs.

I really wish we could separate out the messages about typos and
constraint violations from these logs about processes waiting a long
time for locks or about checkpoints or even PANIC's or other really
important messages.  That said, that's a different problem and not
something this change needs to concern itself with.

As for if we want to separate out log_lock_waits from deadlock_timeout-
no, I don't think we do, for the reasons that Tom mentioned.  I don't
see it as necessary either for enabling log_lock_waits by default.
Waiting deadlock_timeout amount of time for a lock certainly is a
problem already and once we've waited that amount of time, I can't see
the time spent logging about it as being a serious issue.

+1 for simply enabling log_lock_waits by default.

All that said ... if we did come up with a nice way to separate out the
timing for deadlock_timeout and log_lock_waits, I wouldn't necessarily
be against it.  Perhaps one approach would be to set a single timer for
the lower of the two timeouts and then, when it fires (if there's more
waiting to do), set another for the remainder and handle it when that
one fires...  Again, I'd view this as an independent improvement though,
and not a requirement for just enabling log_lock_waits by default.
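
For concreteness, the settings under discussion look like this in
postgresql.conf (the values below are illustrative, not recommendations;
log_checkpoints has defaulted to on since PostgreSQL 15):

    log_lock_waits = on       # log a message when a lock wait exceeds deadlock_timeout
    deadlock_timeout = 1s     # today this single GUC also controls when log_lock_waits fires
    log_checkpoints = on      # the default since PostgreSQL 15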

Thanks,

Stephen




Re: Detecting some cases of missing backup_label

2023-12-18 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I recently mentioned to Robert (and also Heikki earlier), that I think I
> > see a way to detect an omitted backup_label in a relevant subset of the
> > cases (it'd apply to the pg_control as well, if we moved to that).
> > Robert encouraged me to share the idea, even though it does not provide
> > complete protection.
> 
> That would certainly be nice.
> 
> > The subset I think we can address is the following:
> > 
> > a) An omitted backup_label would lead to corruption, i.e. without the
> >    backup_label we won't start recovery at the right position. Obviously
> >    it'd be better to also catch a wrong procedure when it'd not cause
> >    corruption - perhaps my idea can be extended to handle that, with a
> >    small bit of overhead.
> > 
> > b) The backup has been taken from a primary. Unfortunately that probably
> >    can't be addressed - but the vast majority of backups are taken from a
> >    primary, so I think it's still a worthwhile protection.
> 
> Agreed that this is a worthwhile set to try and address, even if we
> can't address other cases.
> 
> > Here's my approach
> > 
> > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
> >    primary, emitted just *after* the checkpoint completed
> > 
> > 2) When replaying a base backup start record, we create a state file that
> >    includes the corresponding LSN in the filename
> > 
> > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created
> >    at that time. Instead the state file is created during pg_backup_stop().
> > 
> > 4) When replaying a XLOG_BACKUP_END record, we verify that the state file
> >    created by XLOG_BACKUP_START is present, and error out if not.  Backups
> >    that started before the redo LSN from backup_label are ignored
> >    (necessitates remembering that LSN, but we've been discussing that
> >    anyway).
> > 
> > 
> > Because the backup state file on the primary is only created during
> > pg_backup_stop(), a copy of the data directory taken between
> > pg_backup_start() and pg_backup_stop() does *not* contain the
> > corresponding "backup state file". Because of this, an omitted
> > backup_label is detected if recovery does not start early enough -
> > recovery won't encounter the XLOG_BACKUP_START record and thus would not
> > create the state file, leading to an error in 4).
> 
> While I see the idea here, I think, doesn't it end up being an issue if
> things happen like this:
> 
> pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
> happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash
> 
> In that scenario, we'd go back to the new checkpoint (the one *after*
> the checkpoint that happened before we wrote XLOG_BACKUP_START), start
> replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
> Even though we're actually doing crash recovery and everything should be
> fine as long as we replay all of the WAL.

Andres and I discussed this in person at PGConf.eu.  The idea is that,
if we find a XLOG_BACKUP_STOP record, we can check whether the state
file was written out; if it was, we can conclude that we are doing crash
recovery rather than restoring from a backup, and therefore we don't
error out.  This also implies that, if the state file exists, we don't
consider PG to be recovered at the XLOG_BACKUP_STOP point but instead
have to be sure to replay all WAL that's been written.  Perhaps we would
even explicitly refuse to use restore_command in this case?

We do error out if we hit a XLOG_BACKUP_STOP and the state file
doesn't exist, as that implies that we started replaying from a point
after a XLOG_BACKUP_START record was written but are working from a copy
of the data directory which didn't include the state file.

Of course, we need to actually implement and test these different cases
to make sure it all works but I'm at least feeling better about the idea
and wanted to share that here.
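
To make the protocol concrete, here is a toy model of it in plain Python
(this is not PostgreSQL code; the record names and file layout are just a
sketch of the scheme described above):

```python
import os
import tempfile

class Replayer:
    """Toy model of WAL replay for the proposed backup state files."""

    def __init__(self, statedir):
        self.statedir = statedir

    def _state_path(self, start_lsn):
        # The state file name includes the backup's start LSN (step 2).
        return os.path.join(self.statedir, "backup_state.%X" % start_lsn)

    def replay(self, kind, start_lsn):
        if kind == "XLOG_BACKUP_START":
            # Step 2: replaying a backup-start record creates the state file.
            open(self._state_path(start_lsn), "w").close()
        elif kind == "XLOG_BACKUP_STOP":
            # Step 4: the state file must exist by now -- either created
            # while replaying XLOG_BACKUP_START, or already on disk because
            # pg_backup_stop() wrote it on the primary (the crash-recovery
            # case).  If it's missing, recovery began *after* the start
            # record from a copy taken mid-backup: an omitted backup_label.
            if not os.path.exists(self._state_path(start_lsn)):
                raise RuntimeError("backup_label appears to be missing")

# Restore that correctly used backup_label: replay starts before the
# XLOG_BACKUP_START record, so the stop record finds the state file.
with tempfile.TemporaryDirectory() as d:
    r = Replayer(d)
    r.replay("XLOG_BACKUP_START", 0x10)
    r.replay("XLOG_BACKUP_STOP", 0x10)      # no error

# Omitted backup_label: replay starts after the start record.
with tempfile.TemporaryDirectory() as d:
    r = Replayer(d)
    try:
        r.replay("XLOG_BACKUP_STOP", 0x10)
    except RuntimeError:
        print("detected missing backup_label")
```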

Thanks,

Stephen




Re: Detecting some cases of missing backup_label

2023-12-05 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> I recently mentioned to Robert (and also Heikki earlier), that I think I see a
> way to detect an omitted backup_label in a relevant subset of the cases (it'd
> apply to the pg_control as well, if we moved to that).  Robert encouraged me
> to share the idea, even though it does not provide complete protection.

That would certainly be nice.

> The subset I think we can address is the following:
> 
> a) An omitted backup_label would lead to corruption, i.e. without the
>backup_label we won't start recovery at the right position. Obviously it'd
>be better to also catch a wrong procedure when it'd not cause corruption -
>perhaps my idea can be extended to handle that, with a small bit of
>overhead.
> 
> b) The backup has been taken from a primary. Unfortunately that probably can't
>be addressed - but the vast majority of backups are taken from a primary,
>so I think it's still a worthwhile protection.

Agreed that this is a worthwhile set to try and address, even if we
can't address other cases.

> Here's my approach
> 
> 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
>primary, emitted just *after* the checkpoint completed
> 
> 2) When replaying a base backup start record, we create a state file that
>includes the corresponding LSN in the filename
> 
> 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
>that time. Instead the state file is created during pg_backup_stop().
> 
> 4) When replaying a XLOG_BACKUP_END record, we verify that the state file
>created by XLOG_BACKUP_START is present, and error out if not.  Backups
>that started before the redo LSN from backup_label are ignored
>(necessitates remembering that LSN, but we've been discussing that anyway).
> 
> 
> Because the backup state file on the primary is only created during
> pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
> and pg_backup_stop() does *not* contain the corresponding "backup state
> file". Because of this, an omitted backup_label is detected if recovery does
> not start early enough - recovery won't encounter the XLOG_BACKUP_START record
> and thus would not create the state file, leading to an error in 4).

While I see the idea here, doesn't it end up being an issue if things
happen like this:

pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash

In that scenario, we'd go back to the new checkpoint (the one *after*
the checkpoint that happened before we wrote XLOG_BACKUP_START), start
replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
Even though we're actually doing crash recovery and everything should be
fine as long as we replay all of the WAL.

Perhaps we can make pg_backup_stop and/or the writing out of
XLOG_BACKUP_STOP wait until just before the next checkpoint and
hopefully minimize that window ... but I'm not sure we could make
that window zero, and what happens if someone does end up hitting it?
There doesn't seem to be any way around it, which seems like it might
be a problem.  I suppose it wouldn't be hard to add some option to tell
PG to ignore the XLOG_BACKUP_STOP ... but then that's akin to removing
backup_label, which possibly lands us back in the issue of people
mis-using that.

> It is not a problem that the primary does not create the state file before the
> pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
> XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
> sequence differs between normal processing and recovery, but I think that's
> nothing a good comment couldn't explain.

Right, crashing before pg_backup_stop() is fine, but crashing *after*
would be an issue, I think, as outlined above, until the next checkpoint
completes, so we've moved the window but not eliminated it.

> I haven't worked out the details, but I think we might be able extend this to
> catch errors even if there is no checkpoint during the base backup, by
> emitting the WAL record *before* the RequestCheckpoint(), and creating the
> corresponding state file during backup_label processing at the start of
> recovery.  That'd probably make the logic for when we can remove the backup
> state files a bit more complicated, but I think we could deal with that.

Not entirely following this- are you saying that we might be able to
make something here work in the case where we don't have
pg_backup_start() wait for a checkpoint to happen (which I have some
serious doubts about), or that the above doesn't work unless there's
at least one post-pg_backup_start() checkpoint?  I don't immediately
see why that would be the case though.  Also, if we wrote out the
XLOG_BACKUP_START before the checkpoint that we start replay from and
instead moved that logic to backup_label processing ... 

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 12:24 PM Stephen Frost  wrote:
> > I don’t know what they’re doing now, as you don’t say, and so I really 
> > couldn’t say if ldap is better or worse for them. In some cases, sure, 
> > perhaps ldap is better than … something else,
> 
> That's EXACTLY right. You can't say whether LDAP is better or worse in
> every scenario. And therefore you should not be proposing to remove
> it.

I had been hoping you might shed some light on just what use cases you
were referring to so that we could have a constructive discussion about
if ldap is actually a reasonable solution.  I even explicitly pointed
out that there may still exist some environments, such as those with
OpenLDAP only and which don't have Kerberos that could be a valid reason
to keep ldap, but I haven't heard of any such in quite a long time.

Until more recently, an argument could be made that we needed to keep
ldap around because pgAdmin didn't support Kerberos, but that's been
addressed for a couple years now.

At some point, once there aren't any good use-cases for it to be kept,
we should remove it or at least deprecate and discourage its use, though
as I suspect everyone already knows, I really don't like deprecating
things because it means we just end up keeping them around forever.

As I tried to get at in my prior response, we should be encouraging
the use of secure options when they're available and that's something
which I actively do, on these lists and in working with clients.  That I
continue to find cases of perfectly good AD deployments which could be
using Kerberos but instead are using ldap is frustrating.

What I don't understand is this push-back against even talking about the
issues with ldap or appreciation that passing around cleartext passwords
is a security issue (modern password-based authentication methods, like
SCRAM, don't even do this anymore, because it's not a good idea).

If my opinion that we should remove it is that offensive, then fine,
let's drop that part of the discussion and move on to the question of:
why are people still using ldap?

I'm pretty sure the answer is that far too often people think that's
just how you integrate PG into an AD environment- and they think it's
either the only option or sometimes believe it's a secure option.  We've
had people claim on these lists that using ldap doesn't pass around
cleartext passwords, which I suspect means that maybe they thought our
'ldap' auth was actually somehow doing Kerberos under the hood (I really
wish it was, but sadly that's obviously not the case).

For some users, the issue is that our Kerberos/pg_ident/et al system
isn't as good as it could be in an AD environment, which is one of the
things that I was trying to allude to in my prior reply.  In particular,
we don't currently have support for working with AD's ldap for what it
was actually meant for, which is using it to query for information like
group membership.  There had been some work on this not too long ago but
sadly I haven't seen recent activity around that.  What I've heard
requested are things like having an option to say "users in group X are
allowed to log in to role Y", or "users in group X are allowed to log
into database Z".  Would also be great to have built-in support for
syncing from ldap the list of users which should exist in PG
(pg_ldap_sync does an alright job of this today but we could certainly
do better by having it built-in, and ideally with a background worker
that connects to ldap, does an initial sync, and then listens for
changes, instead of having to have a frequent cronjob doing the sync).
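
As a concrete illustration of the direction being argued for here, an AD
environment can use GSSAPI today via pg_hba.conf and pg_ident.conf along
these lines (EXAMPLE.COM and the map name are placeholders, and the
group-based rules described above do not exist yet):

    # pg_hba.conf -- require encrypted GSSAPI rather than ldap
    hostgssenc  all  all  0.0.0.0/0  gss  include_realm=1  map=adusers

    # pg_ident.conf -- strip the realm when mapping principals to roles
    # MAPNAME   SYSTEM-USERNAME          PG-USERNAME
    adusers     /^(.*)@EXAMPLE\.COM$     \1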

> >> I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> >> MD5 authentication has a pass-the-hash vulnerability, and that sucks.
> >
> > So, given that we all agree with the CVE-worthy issue that exists with 
> > every release where we include md5 auth, we should be applying for a CVE 
> > prior to each release, no?
> 
> You know, I said in my previous email that you were so sure that you
> were right that there was no point in trying to have a serious
> discussion, and I can't really see how you could have proved that
> point more thoroughly than you did here. You twisted my words around
> to make it seem like I was agreeing with your point when you know full
> well that I was doing the exact opposite of that.

I quoted the part where you agreed that md5 has a known vulnerability to
point out that, given that, we should be making our users aware of that
through the normal process that we use for that.  I wasn't claiming that
you agreed that we should remove md5, unless you're referring to some
other part of my resp

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

On Tue, Nov 28, 2023 at 20:19 Robert Haas  wrote:

> On Tue, Nov 28, 2023 at 10:16 AM Stephen Frost  wrote:
> > We pass a completely cleartext password to the server from the client.
> > Yes, we might encrypt it on the way with TLS, but even SSH realized how
> > terrible that is long, long ago and strongly discourages it these days.
> >
> > The problem with ldap as an auth method is that a single compromised PG
> > server in an AD/ldap environment can collect up those username+password
> > credentials and gain those users' *domain*-level access.  The 
> > CFO logging into a PG server with LDAP auth is giving up their complete
> > access credentials to the entire AD domain.  That's terrible.
>
> Good grief. I really think you need to reassess your approach to this
> whole area. It is massively unhelpful to be so prescriptive about what
> people are, or are not, allowed to be comfortable with from a security
> perspective. And it seems like you're so convinced that you're right
> that you're completely closed off to anyone else's input, so there's
> not even any point in arguing.


You asked what the issue is with our ldap auth method. I explained it
pretty clearly. Is there something incorrect about the issues I’ve pointed
out with ldap auth?  You basically came back at me saying that I’m being
unhelpful for responding with what the issues are.

In the reality I inhabit, many people could *massively* improve their
> security by switching to LDAP from the really lame things that they're
> doing right now. They would be FAR better off. It would not even be
> close. I pretty much know that you aren't going to believe that and
> are going to offer reasons why it isn't and can't possibly be true, or
> else just say that those people are so hopeless that we shouldn't even
> care about them. All I can say is that you're worrying about security
> problems that are so minute as to be invisible compared to what I see.


I don’t know what they’re doing now, as you don’t say, and so I really
couldn’t say if ldap is better or worse for them. In some cases, sure,
perhaps ldap is better than … something else, but in 99% of cases, ldap is
being used because it seems simpler than setting up GSSAPI … but that’s a
lack of clearer documentation on our side on how to set up GSSAPI with AD
and PG, imv.   I’ve tried to improve on that situation by explaining
clearly how to set up GSSAPI authentication with AD and then consistently
pointing people to that.  I don’t put this out there as fiat without any
effort behind it, I’ve been working to move folks in the right direction,
I’ve put in that effort on the lists and continue to do so, feel free to
review the archives.

What I don’t understand is the argument for using ldap in an AD environment
instead of GSSAPI/SSPI. Is there some reason you feel ldap is better?
When there are two such options and one is clearly much more secure,
shouldn’t we push people to the more secure option… possibly even to the
exclusion and removal of the insecure option?  Ditto below for md5 …

> I disagree- it's a known pass-the-hash vulnerability and frankly every
> > release we do with it still existing is deserving of an immediate CVE
> > (I've been asked off-list why we don't do this, in fact).
>
> I agree that the issues with MD5 are way worse than the ones with
> LDAP, but you're arguing here, as you did with exclusive backup mode,
> that the existence of options that are bad, or that you consider to be
> bad, is a problem even if nothing whatever compels people to use them.


I’m not impressed with the conflation of this discussion and that of
exclusive backup mode.  I wasn’t the one who deprecated exclusive backup
mode, tho I did agree with that move, and the reason for removing it was as
much about the complications of documenting it as for the fact that it was
actively broken.

I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> MD5 authentication has a pass-the-hash vulnerability, and that sucks.


So, given that we all agree with the CVE-worthy issue that exists with
every release where we include md5 auth, we should be applying for a CVE
prior to each release, no?

So let's say we remove it. Then presumably we should also remove
> password authentication, which has an even easier-to-exploit
> vulnerability, and trust, which is easier to exploit still. And
> presumably it would be right to do all of that even if SCRAM
> authentication had never been implemented, so then we'd have no
> support for any form of password authentication at all. And we'd
> remove LDAP, too, so the most obvious way of doing password
> authentication without putting the passwords in the database would
> also not exist an

Re: Add recovery to pg_control and remove backup_label

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost  wrote:
> > What would really be helpful would be hearing from these individuals
> > directly as to what the issues are with the changes, such that perhaps
> > we can do things better in the future to avoid whatever the issue is
> > they're having with the changes.  Simply saying we shouldn't make
> > changes in this area isn't workable and the constant push-back is
> > actively discouraging to folks trying to make improvements.  Obviously
> > it's a biased view, but we've not had issues making the necessary
> > adjustments in pgbackrest with each release and I feel like if the
> > authors of wal-g or barman did that they would have spoken up.
> 
> I'm happy if people show up to comment on proposed changes, but I
> think you're being a little bit unrealistic here. I have had to help
> plenty of people who have screwed up their backups in one way or
> another, generally by using some home-grown script, sometimes by
> misusing some existing backup tool. Those people are EDB customers;
> they don't read and participate in discussions here. If they did,
> perhaps they wouldn't be paying EDB to have me and my colleagues sort
> things out for them when it all goes wrong. I'm not trying to say that
> EDB doesn't have customers who participate in mailing list
> discussions, because we do, but it's a small minority, and I don't
> think that should surprise anyone. Moreover, the people who don't
> wouldn't necessarily have the background, expertise, or *time* to
> assess specific proposals in detail. If your point is that my
> perspective on what's helpful or unhelpful is not valid because I've
> only helped 30 people who had problems in this area, but that the
> perspective of those 30 people who were helped would be more valid,
> well, I don't agree with that. I think your perspective and David's
> are valid precisely *because* you've worked a lot on pgbackrest and no
> doubt interacted with lots of users; I think Andres's perspective is
> valid precisely *because* of his experience working with the fleet at
> Microsoft and individual customers at EDB and 2Q before that; and I
> think my perspective is valid for the same kinds of reasons.

I didn't mean to imply that anyone's perspective wasn't valid.  I was
simply trying to get at the root question of: what *is* the issue with
the changes that are being made?  If the answer to that is: we made
this change, which was hard for folks to deal with, and could have
been avoided by doing X, then I really, really want to hear what X
was!  If the answer is, well, the changes weren't hard, but we didn't
like having to make any changes at all ... then I just don't have any
sympathy for that.  People who write backup software for PG, be it
pgbackrest authors, wal-g authors, or homegrown script authors, will
need to adapt between major versions as we discover things that are
broken (such as exclusive mode, and such as the clear risk that's been
demonstrated of a torn copy of pg_control getting copied, resulting in
a completely invalid backup) and fix them.

> I am more in agreement with the idea that it would be nice to hear
> from backup tool authors, but I think even that has limited value.
> Surely we can all agree that if the backup tool is correctly written,
> none of this matters, because you'll make the tool do the right things
> and then you'll be fine. The difficulty here, and the motivation
> behind this proposal and others like it, is that too many users fail
> to follow the procedure correctly. If we hear from the authors of
> well-written backup tools, I expect they will tell us they can adapt
> their tool to whatever we do. And if we hear from the authors of
> poorly-written tools, well, I don't think their opinions would form a
> great basis for making decisions.

Uhhh.  No, I disagree with this- I'd argue that pgbackrest was broken
until the most recent releases, where we implemented a check to ensure
that the pg_control we copy has a valid PG CRC.  Did we know it was
broken before this discussion?  No, but that doesn't change the fact
that we certainly could have ended up copying an invalid pg_control and
thus have an invalid backup.  Even our 'pgbackrest verify' wouldn't have
caught that, because it only checks that the checksum pgbackrest
calculates for every file hasn't changed since we copied it- it did
nothing for pg_control having an invalid internal checksum due to a
torn write when we copied it.

So, yes, it does matter.  We didn't make pgbackrest do the right thi

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 9:55 AM Stephen Frost  wrote:
> > I do think we should use the correct terminology in our documentation
> > and would support your working on improving things in this area.
> 
> +1.
> 
> > I do wonder if perhaps we would be better off by having someone spend
> > time on removing terribly insecure authentication methods like md5 and
> > ldap though ...
> 
> Wait, what's insecure about LDAP?

We pass a completely cleartext password to the server from the client.
Yes, we might encrypt it on the way with TLS, but even SSH realized how
terrible that is long, long ago and strongly discourages it these days.

The problem with ldap as an auth method is that a single compromised PG
server in an AD/ldap environment can collect up those username+password
credentials and gain those users' *domain*-level access.  The
CFO logging into a PG server with LDAP auth is giving up their complete
access credentials to the entire AD domain.  That's terrible.

> I think we should eventually remove MD5, but I think there's no rush.

I disagree- it's a known pass-the-hash vulnerability and frankly every
release we do with it still existing is deserving of an immediate CVE
(I've been asked off-list why we don't do this, in fact).
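
To make the pass-the-hash point concrete: in the md5 protocol the server
sends a salt and the client answers with md5(md5(password || username) ||
salt), so the stored hash alone is sufficient to authenticate.  A minimal
sketch (the username, password, and salt here are made up):

```python
import hashlib

def md5_hex(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

user, password = "alice", "s3cret"          # made-up credentials

# What the server stores for md5 auth: "md5" || md5(password || username).
stored = "md5" + md5_hex((password + user).encode())

# Challenge/response: the server sends a random salt, and the client
# answers with md5(inner_hash || salt).  Note the cleartext password is
# not needed to compute this -- only the inner hash is.
salt = b"\x01\x02\x03\x04"
inner = md5_hex((password + user).encode())
client_response = "md5" + md5_hex(inner.encode() + salt)

# An attacker holding only `stored` (say, from a leaked pg_authid dump)
# can produce exactly the same response without ever knowing the password:
attacker_response = "md5" + md5_hex(stored[3:].encode() + salt)

assert attacker_response == client_response
```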

> People who care about security will have already switched, and people
> who don't care about security are not required to start caring.

I wish it were this simple.  It's just not though.

> Eventually the maintenance burden will become large enough that it
> makes sense to phase it out for that reason, but I haven't seen any
> evidence that we're anywhere close to that point.

This seems to invite the idea that what people who care about this need
to do is make it painful for us to continue to keep it around, which I
really don't think is best for anyone.  We know it's bad, we know it is
broken, we need to remove it, not pretend like it's not broken or not
bad.

Thanks,

Stephen




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> Is there any interest in fixing our documentation that says encrypted
> when it means hashed?  Should I pursue this?

I do think we should use the correct terminology in our documentation
and would support your working on improving things in this area.

I do wonder if perhaps we would be better off by having someone spend
time on removing terribly insecure authentication methods like md5 and
ldap though ...

Thanks,

Stephen




Re: Partial aggregates pushdown

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Nov 27, 2023 at 4:23 PM Tom Lane  wrote:
> > Well, one of the founding principles of postgres_fdw was to be able
> > to talk to PG servers that are not of the same version as yours.
> > If we break that in the name of performance, we are going to have
> > a lot of unhappy users.  Even the ones who do get the benefit of
> > the speedup are going to be unhappy when it breaks because they
> > didn't upgrade local and remote at exactly the same time.
> 
> I agree with this.

+1.  We do want to continue to make this work- to the extent possible.
I don't think there's any problem with saying that when talking to an
older server, you don't get the same capabilities as you do when talking
to a newer server.

> > Just because we'd like to have it doesn't make the patch workable
> > in the real world.
> 
> And also with this in concept - I'd like to plan arbitrarily
> complicated queries perfectly and near-instantly, and then execute
> them at faster-than-light speed, but we can't. However, I don't
> understand the fatalism with respect to the feature at hand. As I said
> before, it's not like no other product has made this work. Sure, some
> of those products may not have the extensible system of data types
> that we do, or may not care about cross-version communication, but
> those don't seem like good enough reasons to just immediately give up.

Certainly there are other projects out there which are based on PG that
have managed to make this work and work really quite well.

> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

Yes, Citus[1] and Greenplum[2], to just name two.

I certainly understand the concern around the security of this and would
have thought the approach we'd use would be to not just take internal
state and pass it along but rather to provide a way for aggregates to
opt-in to supporting this and have them serialize/deserialize with
new dedicated functions that have appropriate checks to avoid bad things
happening.  That could also be versioned, perhaps, if we feel that's
necessary (I'm a bit skeptical, but it would hopefully address the
concern about different versions having different data that they want to
pass along).
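
As a point of reference for the opt-in idea: internal-state aggregates
already declare serialization support for parallel query, mirroring how
the built-in avg(numeric) is defined in the catalogs.  A cross-server
variant would amount to analogous serial/deserial functions that validate
(and perhaps version-tag) their input; sketched here with the existing
parallel-query functions:

    CREATE AGGREGATE my_avg(numeric) (
        sfunc        = numeric_avg_accum,
        stype        = internal,
        finalfunc    = numeric_avg,
        combinefunc  = numeric_avg_combine,
        serialfunc   = numeric_avg_serialize,
        deserialfunc = numeric_avg_deserialize,
        parallel     = safe
    );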

> One of the things that I think is a problem in this area is that the
> ways we have to configure FDW connections are just not very rich.

Agreed.

> We're trying to cram everything into a set of strings that can be
> attached to the foreign server or the user mapping, but that's not a
> very good fit for something like how all the local SQL functions that
> might exist map onto all of the remote SQL functions that might exist.
> Now you might well say that we don't want the act of configuring a
> foreign data wrapper to be insanely complicated, and I would agree
> with that. But, on the other hand, as Larry Wall once said, a good
> programming language makes simple things simple and complicated things
> possible. I think our current configuration system is only
> accomplishing the first of those goals.

We've already got issues in this area with extensions- there's no way
for a user to say what version of an extension exists on the remote side
and no way for an extension to do anything different based on that
information.  Perhaps we could work on a solution to both of these
issues, but at the least I don't see holding back on this effort for a
problem that already exists but which we've happily accepted because of
the benefit it provides, like being able to push-down postgis bounding
box conditionals to allow for indexed lookups.

Thanks,

Stephen

[1]: https://docs.citusdata.com/en/v11.1/develop/reference_sql.html
[2]: 
https://postgresconf.org/conferences/Beijing/program/proposals/implementation-of-distributed-aggregation-in-greenplum


signature.asc
Description: PGP signature


Re: Add recovery to pg_control and remove backup_label

2023-11-26 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 11/21/23 12:41, Andres Freund wrote:
> > Sure. They also receive a backup_label today. If an external solution
> > forgets to replace pg_control copied as part of the filesystem copy,
> > they won't get an error after the removal of backup_label, just like
> > they don't get one today if they don't put backup_label in the data
> > directory.  Given that users don't do the right thing with backup_label
> > today, why can we rely on them doing the right thing with pg_control?
> 
> I think reliable backup software does the right thing with backup_label, but
> if the user starts getting errors on recovery they then decide to remove
> backup_label. I know we can't do much about bad backup software, but we can
> at least make this a bit more resistant to user error after the fact.
> 
> It doesn't help that one of our hints suggests removing backup_label. In
> highly automated systems, the user might not even know they just restored
> from a backup. They are only in the loop because the restore failed and they
> are trying to figure out what is going wrong. When they remove backup_label
> the cluster comes up just fine. Victory!

Yup, this is exactly the issue.

> This is the scenario I've seen most often -- not the backup/restore process
> getting it wrong but the user removing backup_label on their own initiative.
> And because it yields such a positive result, at least initially, they
> remember in the future that the thing to do is to remove backup_label
> whenever they see the error.
> 
> If they only have pg_control, then their only choice is to get it right or
> run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take
> them longer to get there. Also, I think that tool makes it pretty clear that
> corruption will result and the only thing to do is a logical dump and
> restore after using it.

Agreed.

> There are plenty of ways a user can mess things up. What I'd like to prevent
> is the appearance of everything being OK when in fact they have corrupted
> their cluster. That's the situation we have now with backup_label. Is this
> new solution perfect? No, but I do think it checks several boxes, and is a
> worthwhile improvement.

+1.

As for the complaint about 'operators' having issue with the changes
we've been making in this area- where are those people complaining,
exactly?  Who are they?  I feel like we keep getting this kind of
push-back in this area from folks on this list but not from actual
backup software authors; all the complaints seem to either be 
speculative or unattributed pass-through from someone else.

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes.  Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements.  Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman did that they would have spoken up.

Making a change as suggested which only helps pg_basebackup (and tools
like pgbackrest, since it's in C and can also make this particular
change) ends up leaving tools like wal-g and barman potentially still
with an easy way for users of those tools to corrupt their databases-
even though we've not heard anything from the authors of those tools
about issues with the proposed change, nor have there been a lot of
complaints from them about the prior changes to indicate that they'd
even have an issue with the more involved change.  Given the lack of
complaint about past changes, I'd certainly rather err on the side of
improved safety for users than on the side of the authors of these tools
possibly complaining.

What these changes have done is finally break things like omnipitr
completely, which hasn't been maintained in a very long time.  The
changes in v12 broke recovery with omnipitr but not backup, and folks
were trying to use omnipitr as recently as with v13[1].  Certainly
having a backup tool that only works for backup (fsvo works, anyway, as
it still used exclusive backup mode meaning that a crash during a backup
would cause the system to not come back up after...) but doesn't work
for recovery isn't exactly great and I'm glad that, now, an attempt to
use omnipitr to perform a backup will fail.  As with lots of other areas
of PG, folks need to read the release notes and potentially update their
code for new major versions.  If anything, the backup area is less of an
issue for this because the authors of the backup tools are able to make
the change (and who are often the ones pushing for these changes) and
the end-user isn't impacted at all.

Much the same can be said for wal-e, with users st

Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread Stephen Frost
Greetings,

On Mon, Nov 13, 2023 at 16:53 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Mon, Nov 13, 2023 at 2:52 PM Andres Freund  wrote:
>
>>
>> > This scheme would open up space per page that would now be available for
>> > plenty of other things; the encoding in the header and the corresponding
>> > available space in the footer would seem to open up quite a few options
>> > now, no?
>>
>> Sure, if you're willing to rewrite the whole cluster to upgrade and
>> willing to
>> permanently sacrifice some data density.  If the stored data is actually
>> specific to the page - that is the place to put the data. If not, then the
>> tradeoff is much more complicated IMO.
>>
>> Of course this isn't a new problem - storing the page size on each page
>> was
>> just silly, it's never going to change across the cluster and even more
>> definitely not going to change within a single relation.
>>
>
> Crazy idea; since stored pagesize is already a fixed cost that likely
> isn't going away, what if instead of the pd_checksum field, we instead
> reinterpret pd_pagesize_version; 4 would mean "no page features", but
> anything 5 or higher could be looked up as an external page feature set,
> with storage semantics outside of the realm of the page itself (other than
> what the page features code itself needs to know); i.e,. move away from the
> on-page bitmap into a more abstract representation of features which could
> be something along the lines of what you were suggesting, including
> extension support.
>
> It seems like this could also support adding/removing features on page
> read/write as long as there was sufficient space in the reserved_page
> space; read the old feature set on page read, convert to the new feature
> set which will write out the page with the additional/changed format.
> Obviously there would be bookkeeping to be done in terms of making sure all
> pages had been converted from one format to another, but for the page level
> this would be straightforward.
>
> Just thinking aloud here...
>

In other crazy idea space … if the page didn’t have enough space to allow
for the desired features then make any insert/update actions forcibly have
to choose a different page for the new tuple, while allowing deletes to do
their usual thing, and then when vacuum comes along and is able to clean up
the page and remove all the dead tuples, it could then enable the features
on the page that are desired…
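David's reinterpretation idea upthread could be sketched roughly like this (all names, values, and the lookup table are made up for illustration; the page-size bits that share the real pd_pagesize_version field are ignored to keep the sketch simple):

```c
/* Illustrative only: treat pd_pagesize_version values above the current
 * layout version (4) as an index into an externally stored table of page
 * feature sets (e.g. kept in pg_control), rather than as a new on-page
 * bitmap, so the page header itself does not grow. */
#include <stdint.h>

#define PG_PAGE_LAYOUT_VERSION  4

typedef uint32_t PageFeatureSet;    /* bitmask of enabled features */

#define PF_EXTENDED_CHECKSUMS   0x0001
#define PF_ENCRYPTION           0x0002

/* Hypothetical table, loaded at startup from wherever the cluster (or
 * tablespace, or fork) level feature configuration lives. */
static const PageFeatureSet feature_sets[] = {
    0,                                      /* version 5: no features yet */
    PF_EXTENDED_CHECKSUMS,                  /* version 6 */
    PF_EXTENDED_CHECKSUMS | PF_ENCRYPTION,  /* version 7 */
};

static PageFeatureSet
page_feature_set(uint16_t version)
{
    uint32_t    idx;

    if (version <= PG_PAGE_LAYOUT_VERSION)
        return 0;               /* a "version 4" page has no page features */
    idx = (uint32_t) (version - PG_PAGE_LAYOUT_VERSION - 1);
    if (idx >= sizeof(feature_sets) / sizeof(feature_sets[0]))
        return 0;               /* unknown feature set: treat as none */
    return feature_sets[idx];
}
```

The page-at-a-time conversion David describes then becomes: read the old version's feature set, re-emit the page stamped with the new version, with bookkeeping outside the page tracking which pages remain unconverted.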

Thanks,

Stephen



Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost
Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:
>
>> * Andres Freund (and...@anarazel.de) wrote:
>> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
>> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
>> > > From: David Christensen 
>> > > Date: Tue, 9 May 2023 16:56:15 -0500
>> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>> > >
>> > > This space is reserved for extended data on the Page structure which
>> > > will be ultimately used for encrypted data, extended checksums, and
>> > > potentially other things.  This data appears at the end of the Page,
>> > > after any `pd_special` area, and will be calculated at runtime based
>> > > on specific ControlFile features.
>> > >
>> > > No effort is made to ensure this is backwards-compatible with existing
>> > > clusters for `pg_upgrade`, as we will require logical replication to
>> > > move data into a cluster with different settings here.
>> >
>> > The first part of the last paragraph makes it sound like pg_upgrade
>> > won't be supported across this commit, rather than just between
>> > different settings...
>>
>
> Yeah, that's vague, but you picked up on what I meant.
>
>
>> > I think as a whole this is not an insane idea. A few comments:
>>
>> Thanks for all the feedback!
>>
>> > - Why is it worth sacrificing space on every page to indicate which
>> >   features were enabled?  I think there'd need to be some convincing
>> >   reasons for introducing such overhead.
>>
>> In conversations with folks (my memory specifically is a discussion with
>> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
>> there was a pretty strong push that a page should be able to 'stand
>> alone' and not depend on something else (eg: pg_control, or whatever) to
>> provide info needed to be able to interpret the page.  For my part, I don't
>> have a particularly strong feeling on that, but that's what led to this
>> design.
>>
>
> Unsurprisingly, I agree that it's useful to keep these features on the
> page itself; from a forensic standpoint this seems much easier to interpret
> what is happening here, as well it would allow you to have different
> features on a given page or type of page depending on need.  The initial
> patch utilizes pg_control to store the cluster page features, but there's
> no reason it couldn't be dependent on fork/page type or stored in
> pg_tablespace to utilize different features.
>

When it comes to authenticated encryption, it’s also the case that it’s
unclear what value the checksum field has, if any…  it’s certainly not
directly needed as a checksum, as the auth tag is much better for the
purpose of seeing if the page has been changed in some way. It’s also not
big enough to serve as an auth tag per NIST guidelines regarding the size
of the authenticated data vs. the size of the tag.  Using it to indicate
what features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen



Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen 
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will
> > be ultimately used for encrypted data, extended checksums, and potentially
> > other things.  This data appears at the end of the Page, after any
> > `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing
> > clusters for `pg_upgrade`, as we will require logical replication to move
> > data into a cluster with different settings here.
> 
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...
> 
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide info needed to be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what led to this
design.

Getting a consensus on if that's a requirement or not would definitely
be really helpful.

Thanks,

Stephen




Re: Moving forward with TDE [PATCH v3]

2023-11-07 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Nov  6, 2023 at 09:56:37AM -0500, Stephen Frost wrote:
> > The gist is, without a suggestion of things to try, we're left
> > to our own devices to try and figure out things which might be
> > successful, only to have those turned down too when we come back with
> > them, see [1] for what feels like an example of this.  Given your
> > feedback overall, which I'm very thankful for, I'm hopeful that you see
> > that this is, indeed, a useful feature that people are asking for and
> > therefore are willing to spend some time on it, but if the feedback is
> > that nothing on the page is acceptable to use for the nonce, we can't
> > put the nonce somewhere else, and we can't change the page format, then
> > everything else is just moving deck chairs around on the titanic that
> > has been this effort.
> 
> Yeah, I know the feeling, though I thought XTS was immune enough to
> nonce/LSN reuse that it was acceptable.

Ultimately it depends on the attack vector you're trying to address, but
generally, I think you're right about the XTS tweak reuse not being that
big of a deal.  XTS isn't CTR or GCM.

With FDE (full disk encryption) you're expecting the attacker to steal
the physical laptop, hard drive, etc, generally, and so the downside of
using the same tweak with XTS over and over again isn't that bad (and is
what most FDE solutions do, aiui, by simply using the sector number; we
could do something similar to that by using the relfilenode + block
number) because that re-use is a problem if the attacker is able to see
multiple copies of the same block over time where the block has been
encrypted with different data but the same key and tweak.
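For illustration, packing such a tweak from the relfilenode and block number might look like this (the layout is purely hypothetical, not anything from an actual patch):

```c
/* Hypothetical layout for a 16-byte XTS tweak built the way FDE schemes
 * use the sector number: relfilenode plus block number, zero-padded out
 * to the AES block size. */
#include <stdint.h>
#include <string.h>

typedef struct XtsTweak
{
    uint8_t     data[16];       /* AES-XTS takes a 128-bit tweak */
} XtsTweak;

static XtsTweak
make_xts_tweak(uint32_t relfilenode, uint32_t blkno)
{
    XtsTweak    t;

    memset(t.data, 0, sizeof(t.data));
    memcpy(t.data, &relfilenode, sizeof(relfilenode));
    memcpy(t.data + sizeof(relfilenode), &blkno, sizeof(blkno));
    return t;
}
```

Every (relation, block) pair gets a distinct tweak, but unlike the LSN the tweak never varies across rewrites of the same block, which is exactly the trade-off described above.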

Using the LSN actually is better than what the FDE solutions do because
the LSN varies much more often than the sector number.  Sure, it doesn't
change with every write and maybe an attacker could glean something from
that, but that's quite narrow.  The downside from the LSN based approach
with XTS is probably more that using the LSN means that we can't encrypt
the LSN itself and that is a leak too- but then again, we leak that
through the simple WAL filenames too, to some extent, so it doesn't
strike me as a huge issue.

XTS as a block cipher doesn't suffer from the IV-reuse issue that you
have with streaming ciphers, where the same key+IV and different data
leads to being able to trivially retrieve the plaintext, though, and I
worry that maybe that's what people were thinking.
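The difference is easy to see with a toy model (this is NOT real cryptography; the "keystream" below is a trivial PRNG standing in for AES-CTR): with a stream cipher, reusing a key+IV pair means the XOR of two ciphertexts equals the XOR of the two plaintexts, with no key material needed at all.

```c
/* Toy model (NOT real crypto): a CTR-style stream cipher XORs the
 * plaintext with a keystream derived only from (key, nonce).  Reuse the
 * pair and the keystream cancels out across two messages. */
#include <stddef.h>
#include <stdint.h>

static void
toy_ctr_xor(uint64_t key, uint64_t nonce, const uint8_t *in, uint8_t *out,
            size_t len)
{
    uint64_t    state = key ^ (nonce * 0x9E3779B97F4A7C15ULL);

    for (size_t i = 0; i < len; i++)
    {
        /* stand-in keystream generator; AES-CTR would go here */
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        out[i] = in[i] ^ (uint8_t) (state >> 56);
    }
}
```

XTS has no keystream to cancel: reusing a tweak against a block cipher only reveals whether a given block's contents changed between two snapshots, not the contents themselves.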

The README and comments I don't think were terribly clear on this and I
think may have even been from back when CTR was being considered, where
IV reuse would have resulted in plaintext being trivially available.

> What got me sunk on the feature was the complexity of adding temporary
> file encryption support and that tipped the scales in the negative for
> me in community value of the feature vs. added complexity. (Yeah, I used
> a Titanic reference in the last sentence. ;-) )  However, I am open to
> the community value and complexity values changing over time.  My blog
> post on the topic:

We do need to address the temporary file situation too and we do have a
bit of an issue that how we deal with temporary files today in PG isn't
very consistent and there's too many ways to do that.  There's a patch
that works on that, though it has some bitrot that we're working on
addressing currently.

There is value in simply fixing that situation wrt temporary file
management independent of encryption, though of course then encryption
of those temporary files becomes much simpler.

Thanks,

Stephen




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-07 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-11-06 13:02:50 -0500, Stephen Frost wrote:
> > > > The max_total_memory limit is checked whenever the global counters are
> > > > updated. There is no special error handling if a memory allocation
> > > > exceeds the global limit. That allocation returns a NULL for malloc
> > > > style allocations or an ENOMEM for shared memory allocations. Postgres
> > > > has existing mechanisms for dealing with out of memory conditions.
> > >
> > > I still think it's extremely unwise to do tracking of memory and
> > > limiting of memory in one patch.  You should work towards an acceptable
> > > patch that just tracks memory usage in an as simple and low overhead
> > > way as possible.  Then we later can build on that.
> > 
> > Frankly, while tracking is interesting, the limiting is the feature
> > that's needed more urgently imv.
> 
> I agree that we need limiting, but that the tracking needs to be very robust
> for that to be usable.

Is there an issue with the tracking in the patch that you saw?  That's
certainly an area that we've tried hard to get right and to match up to
numbers from the rest of the system, such as the memory context system.

> > We could possibly split it up but there's an awful lot of the same code that
> > would need to be changed and that seems less than ideal.  Still, we'll look
> > into this.
> 
> Shrug. IMO keeping them together just makes it very likely that neither goes
> in.

I'm happy to hear your support for the limiting part of this- that's
encouraging.

> > > > For sanity checking, pgstat now includes the
> > > > pg_backend_memory_allocation view showing memory allocations made by
> > > > the backend process. This view includes a scan of the top memory
> > > > context, so it compares memory allocations reported through pgstat
> > > > with actual allocations. The two should match.
> > > 
> > > Can't you just do this using the existing pg_backend_memory_contexts view?
> > 
> > Not and get a number that you can compare to the local backend number
> > due to the query itself happening and performing allocations and
> > creating new contexts.  We wanted to be able to show that we are
> > accounting correctly and exactly matching to what the memory context
> > system is tracking.
> 
> I think creating a separate view for this will be confusing for users, without
> really much to show for. Excluding the current query would be useful for other
> cases as well, why don't we provide a way to do that with
> pg_backend_memory_contexts?

Both of these feel very much like power-user views, so I'm not terribly
concerned about users getting confused.  That said, we could possibly
drop this as a view and just have the functions which are then used in
the regression tests to catch things should the numbers start to
diverge.

Having a way to get the memory contexts which don't include the
currently running query might be interesting too but is rather
independent of what this patch is trying to do.  The only reason we
collected up the memory-context info is as a cross-check to the tracking
that we're doing and while the existing memory-context view is just fine
for a lot of other things, it doesn't work for that specific need.

Thanks,

Stephen




Re: Atomic ops for unlogged LSN

2023-11-07 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Tue, Nov 07, 2023 at 12:57:32AM +, John Morris wrote:
> > I incorporated your suggestions and added a few more. The changes are
> > mainly related to catching potential errors if some basic assumptions
> > aren’t met.
> 
> Hm.  Could we move that to a separate patch?  We've lived without these
> extra checks for a very long time, and I'm not aware of any related issues,
> so I'm not sure it's worth the added complexity.  And IMO it'd be better to
> keep it separate from the initial atomics conversion, anyway.

I do see the value in adding in an Assert though I don't want to throw
away the info about what the recent unlogged LSN was when we crash.  As
that basically boils down to a one-line addition, I don't think it
really needs to be in a separate patch.

> > I found the comment about cache coherency a bit confusing. We are dealing
> > with a single address, so there should be no memory ordering or coherency
> > issues. (Did I misunderstand?) I see it more as a race condition. Rather
> > than merely explaining why it shouldn’t happen, the new version verifies
> > the assumptions and throws an Assert() if something goes wrong.
> 
> I was thinking of the comment for pg_atomic_read_u32() that I cited earlier
> [0].  This comment also notes that pg_atomic_read_u32/64() has no barrier
> semantics.  My interpretation of that comment is that these functions
> provide no guarantee that the value returned is the most up-to-date value.

There seems to be some serious misunderstanding about what is happening
here.  The value written into the control file for unlogged LSN during
normal operation does *not* need to be the most up-to-date value and
talking about it as if it needs to be the absolutely most up-to-date and
correct value is, if anything, adding to the confusion, not reducing
confusion.  The reason to write in anything other than a zero during
these routine checkpoints for unlogged LSN is entirely for forensics
purposes, not because we'll ever actually use the value- during crash
recovery and backup/restore, we're going to reset the unlogged LSN
counter anyway and we're going to throw away all of unlogged table
contents across the entire system.

We only care about the value of the unlogged LSN being correct during
normal shutdown when we're writing out the shutdown checkpoint, but by
that time everything else has been shut down and the value absolutely
should not be changing.
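In code terms the scheme is roughly this (a sketch using C11 atomics rather than PG's pg_atomic_* wrappers, with simplified declarations):

```c
/* Sketch of the atomic unlogged-LSN scheme: fetch-add hands out unique
 * values under concurrency, while the checkpointer's plain read needs no
 * barrier because a slightly stale value is acceptable -- the counter is
 * reset during crash recovery anyway, and at shutdown nothing else runs. */
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static _Atomic XLogRecPtr unloggedLSN = 1;  /* cf. FirstNormalUnloggedLSN */

XLogRecPtr
GetFakeLSNForUnloggedRel(void)
{
    /* each concurrent caller gets a distinct value */
    return atomic_fetch_add(&unloggedLSN, 1);
}

XLogRecPtr
ReadUnloggedLSNForCheckpoint(void)
{
    /* relaxed read: may lag concurrent advances, which is fine for the
     * forensic value written by routine checkpoints */
    return atomic_load_explicit(&unloggedLSN, memory_order_relaxed);
}
```

The one-line Assert mentioned above would simply verify, at shutdown-checkpoint time, that the read value matches expectations once all other activity has stopped.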

Thanks,

Stephen




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-06 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-31 17:11:26 +, John Morris wrote:
> > Postgres memory reservations come from multiple sources.
> > 
> >   *   Malloc calls made by the Postgres memory allocators.
> >   *   Static shared memory created by the postmaster at server startup,
> >   *   Dynamic shared memory created by the backends.
> >   *   A fixed amount (1Mb) of “initial” memory reserved whenever a
> >       process starts up.
> > 
> > Each process also maintains an accurate count of its actual memory
> > allocations. The process-private variable “my_memory” holds the total
> > allocations for that process. Since there can be no contention, each process
> > updates its own counters very efficiently.
> 
> I think this will introduce measurable overhead in low concurrency cases and
> very substantial overhead / contention when there is a decent amount of
> concurrency.  This makes all memory allocations > 1MB contend on a single
> atomic.  Massive amount of energy have been spent writing multi-threaded
> allocators that have far less contention than this - the current state is to
> never contend on shared resources on any reasonably common path. This gives
> away one of the few major advantages our process model has.

We could certainly adjust the size of each reservation to reduce the
frequency of having to hit the atomic.  Specific suggestions about how
to benchmark and see the regression that's being worried about here
would be great.  Certainly my hope has generally been that when we do a
larger allocation, it's because we're about to go do a bunch of work,
meaning that hopefully the time spent updating the atomic is minor
overall.
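To make the shape of that concrete, here is a simplified sketch (C11 atomics and illustrative names; the actual patch differs) of the thresholded accounting, where the shared atomic is touched only once the locally unreported delta crosses the threshold:

```c
/* Simplified sketch of thresholded memory accounting: each process keeps
 * an exact private total and only updates the shared atomic once its
 * unreported delta crosses a threshold, so the global total is off by at
 * most RESERVATION_THRESHOLD bytes per process and the atomic is touched
 * rarely. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RESERVATION_THRESHOLD (1024 * 1024)     /* 1 MB, per the patch */

static _Atomic int64_t global_reserved;         /* would live in shared memory */
static int64_t max_total_memory;                /* 0 means "no limit" */

static int64_t my_memory;                       /* exact per-process total */
static int64_t my_unreported;                   /* delta not yet pushed globally */

/* Returns false when the allocation would exceed the global limit; the
 * caller then behaves as if malloc() had returned NULL. */
bool
reserve_memory(int64_t nbytes)
{
    my_memory += nbytes;
    my_unreported += nbytes;

    if (my_unreported >= RESERVATION_THRESHOLD ||
        my_unreported <= -RESERVATION_THRESHOLD)
    {
        int64_t     delta = my_unreported;
        int64_t     newtotal = atomic_fetch_add(&global_reserved, delta) + delta;

        my_unreported = 0;
        if (max_total_memory > 0 && newtotal > max_total_memory)
        {
            /* back this allocation out and fail it */
            atomic_fetch_sub(&global_reserved, nbytes);
            my_memory -= nbytes;
            return false;
        }
    }
    return true;
}
```

Summing per-backend totals at allocation time instead would put far more work on the hot path, which is the argument for a single, rarely touched atomic when limiting is enabled.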

> The patch doesn't just introduce contention when limiting is enabled - it
> introduces it even when memory usage is just tracked. It makes absolutely no
> sense to have a single contended atomic in that case - just have a per-backend
> variable in shared memory that's updated. It's *WAY* cheaper to compute the
> overall memory usage during querying than to keep a running total in shared
> memory.

Agreed that we should avoid the contention when limiting isn't being
used, certainly easy to do so, and had actually intended to but that
seems to have gotten lost along the way.  Will fix.

Other than that change inside update_global_reservation though, the code
for reporting per-backend memory usage and querying it does work as
you're outlining above inside the stats system.

That said- I just want to confirm that you would agree that querying the
amount of memory used by every backend, to add it all up to enforce an
overall limit, surely isn't something we're going to want to do during
an allocation and that having a global atomic for that is better, right?

> > Pgstat now includes global memory counters. These shared memory counters
> > represent the sum of all reservations made by all Postgres processes. For
> > efficiency, these global counters are only updated when new reservations
> > exceed a threshold, currently 1 Mb for each process. Consequently, the
> > global reservation counters are approximate totals which may differ from the
> > actual allocation totals by up to 1 Mb per process.
> 
> I see that you added them to the "cumulative" stats system - that doesn't
> immediately makes sense to me - what you're tracking here isn't an
> accumulating counter, it's something showing the current state, right?

Yes, this is current state, not an accumulation.

> > The max_total_memory limit is checked whenever the global counters are
> > updated. There is no special error handling if a memory allocation exceeds
> > the global limit. That allocation returns a NULL for malloc style
> > allocations or an ENOMEM for shared memory allocations. Postgres has
> > existing mechanisms for dealing with out of memory conditions.
> 
> I still think it's extremely unwise to do tracking of memory and limiting of
> memory in one patch.  You should work towards an acceptable patch that just
> tracks memory usage in an as simple and low overhead way as possible. Then we
> later can build on that.

Frankly, while tracking is interesting, the limiting is the feature
that's needed more urgently imv.  We could possibly split it up but
there's an awful lot of the same code that would need to be changed and
that seems less than ideal.  Still, we'll look into this.

> > For sanity checking, pgstat now includes the pg_backend_memory_allocation
> > view showing memory allocations made by the backend process. This view
> > includes a scan of the top memory context, so it compares memory allocations
> > reported through pgstat with actual allocations. The two should match.
> 
> Can't you just do this using the existing pg_backend_memory_contexts view?

Not and get a number that you can compare to the local backend number
due to the query itself happening and performing allocations and
creating new contexts.  We wanted to be able to show that we are
accounting correctly and exactly matching to what the memory context
system is tracking.

Re: Moving forward with TDE [PATCH v3]

2023-11-06 Thread Stephen Frost
Greetings,

Thanks for your feedback on this.

* Andres Freund (and...@anarazel.de) wrote:
> I still am quite quite unconvinced that using the LSN as a nonce is a good
> design decision.

This is a really important part of the overall path to moving this
forward, so I wanted to jump to it and have a specific discussion
around this.  I agree that there are downsides to using the LSN, some of
which we could possibly address (eg: include the timeline ID in the IV),
but others that would be harder to deal with.

The question then is- what's the alternative?

One approach would be to change the page format to include space for an
explicit nonce.  I don't see the community accepting such a new page
format as the only format we support though as that would mean no
pg_upgrade support along with wasted space if TDE isn't being used.
Ideally, we'd instead be able to support multiple page formats where
users could decide when they create their cluster what features they
want- and luckily we even have such an effort underway with patches
posted for review [1].  Certainly, with the base page-special-feature
patch, we could have an option for users to choose that they want a
better nonce than the LSN, or we could bundle that assumption in with,
say, the authenticated-encryption feature (if you want authenticated
encryption, then you want more from the encryption system than the
basics, and therefore we presume you also want a better nonce than the
LSN).

Another approach would be a separate fork, but that then has a number of
downsides too- every write has to touch that too, and a single page of
nonces would cover a pretty large number of pages also.

Ultimately, I agree with you that the LSN isn't perfect and we really
shouldn't be calling it 'ideal' as it isn't, and we can work to fix that
language in the patch, but the lack of any alternative being proposed
that might be acceptable makes this feel a bit like rock management [2].

My apologies if there's something good that's already been specifically
pushed and I just missed it; if so, a link to that suggestion and
discussion would be greatly appreciated.

Thanks again!

Stephen

[1]: https://commitfest.postgresql.org/45/3986/
[2]: https://en.wikipedia.org/wiki/Wikipedia:Bring_me_a_rock ; though
that isn't great for a quick summary (which I tried to find on an
isolated page somewhere and didn't).

The gist is, without a suggestion of things to try, we're left
to our own devices to try and figure out things which might be
successful, only to have those turned down too when we come back with
them, see [1] for what feels like an example of this.  Given your
feedback overall, which I'm very thankful for, I'm hopeful that you see
that this is, indeed, a useful feature that people are asking for and
therefore are willing to spend some time on it, but if the feedback is
that nothing on the page is acceptable to use for the nonce, we can't
put the nonce somewhere else, and we can't change the page format, then
everything else is just moving deck chairs around on the titanic that
has been this effort.




Re: Fwd: Annoying corruption in PostgreSQL.

2023-10-30 Thread Stephen Frost
Greetings,

Please don't top-post on these lists.

* Kirill Reshke (reshkekir...@gmail.com) wrote:
> We have physical backups and we can PITR. But restoring a cluster to some
> point in the past is a bit of a different task: we need our client's
> approval for these operations, since we are a Managed DBs Cloud Provider.
> Will try to ask someone.

Do you have page-level checksums enabled for these PG instances?

Are you able to see if these clusters which are reporting the corruption
have been restored in the past from a backup?  What are you using to
perform your backups and perform your restores?

Are you able to see if these clusters have ever crashed and come back up
afterwards by doing WAL replay?

Where I'm heading with these questions is essentially: I suspect either
your backup/restore procedure is broken or you're running on a system
that doesn't properly fsync data.  Or possibly both.

Oh, and you should probably have checksums enabled.
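To make those checks concrete, here is a minimal sketch of the queries involved; it assumes PostgreSQL 12 or later, where the per-database checksum failure counters were added:

```sql
-- Is the cluster running with page-level checksums?  Returns "on" or "off".
SHOW data_checksums;

-- Have any checksum failures been recorded?
-- (checksum_failures / checksum_last_failure were added in PG 12.)
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;
```

If data_checksums is off, storage-level corruption can go undetected until a page fails to parse, which fits the failure mode being discussed here.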

Thanks,

Stephen




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Stephen Frost
Greetings,

On Tue, Oct 24, 2023 at 14:42 Robert Haas  wrote:

> On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis  wrote:
> > Perhaps the idea is that if there are constraints involved, the failure
> > or success of an INSERT/UPDATE/DELETE could leak information that you
> > don't have privileges to read.
>
> My recollection of this topic is pretty hazy, but like Tom, I seem to
> remember it being intentional, and I think the reason had something to
> do with wanting the slice of a RLS-protect table that you can see to
> feel like a complete table. When you update a row in a table all of
> which is visible to you, the updated row can never vanish as a result
> of that update, so it was thought, if I remember correctly, that this
> should also be true here. It's also similar to what happens if an
> updatable view has WITH CHECK OPTION, and I think that was part of the
> precedent as well. I don't know whether or not the constraint issue
> that you mention here was also part of the concern, but it may have
> been. This was all quite a while ago...


Yes, having it be similar to a view WITH CHECK OPTION was intentional,
as was not wishing for things to be able to disappear or to not get
saved.  The risk of a constraint possibly leaking information is better
than either having data just thrown away or having the constraint not
provide the guarantee it’s supposed to …

Thanks,

Stephen

(On my phone at an event currently, sorry for not digging in deeper on
this..)



Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-20 Thread Stephen Frost
Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> The only issue I worry about is the uncertainty and clutter that can be
> created by this feature. In the worst case, when we have a complex error
> stack (including the extension's CATCH sections, exceptions in stored
> procedures, etc.), the backend will throw the memory limit error repeatedly.

I'm not seeing what additional uncertainty or clutter there is- this is,
again, exactly the same as what happens today on a system with
overcommit disabled, and we don't seem to get a lot of complaints about
that.

> Of course, one failed backend looks better than a surprisingly killed
> postmaster, but the mix of different error reports and details looks
> terrible and challenging to debug in the case of trouble. So, may we throw a
> FATAL error if we reach this limit while handling an exception?

I don't see why we'd do that when we can do better- we just fail
whatever the ongoing query or transaction is and allow further requests
on the same connection.  We already support exactly that and it works
really rather well and I don't see why we'd throw that away because
there's a different way to get an OOM error.

If you want to make the argument that we should throw FATAL on OOM when
handling an exception, that's something you could argue independently of
this effort already today, but I don't think you'll get agreement that
it's an improvement.

Thanks,

Stephen




Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-20 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Wed, 2023-10-18 at 14:48 -0400, Stephen Frost wrote:
> > Right, we need more observability, agreed, but that's not strictly
> > necessary of this patch and could certainly be added independently.
> > Is there really a need to make this observability a requirement of
> > this particular change?
> 
> I won't draw a line in the sand, but it feels like something should be
> there to help the user keep track of which password they might want to
> keep. At least a "created on" date or something.

Sure, no objection to adding that and seems like it should be fairly
easy ... but then again, I tend to feel that we should do that for all
of the objects in the system and we've got some strong feelings against
doing that from others.  Perhaps this case is different to them, in
which case, great, but if it's not, it'd be unfortunate for this feature
to get bogged down due to that.

> > > (Aside: is the uniqueness of the salt enforced in the current
> > > patch?)
> > 
> > Err, the salt has to be *identical* for each password of a given
> > user,
> > not unique, so I'm a bit confused here.
> 
> Sorry, my mistake.

Sure, no worries.

> If the client needs to use the same salt as existing passwords, can you
> still use PQencryptPasswordConn() on the client to avoid sending the
> plaintext password to the server?

Short answer, yes ... but it seems that wasn't actually done yet.  Requires
a bit of additional work, since the client needs to get the existing
salt (note that as part of the SCRAM typical exchange, the client is
provided with the salt, so this isn't exposing anything new) to use to
construct what is then sent to the server to store.  We'll also need to
decide how to handle the case if the client tries to send a password
that doesn't have the same salt as the existing ones (regardless of how
many passwords we end up supporting).  Perhaps we require the user,
through the grammar, to make clear if they want to add a password, and
then error if they don't provide a matching salt, or if they want to
remove existing passwords and replace with the new one.
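For reference, the current client-side API looks like the sketch below; note that PQencryptPasswordConn() today has libpq generate the salt itself, which is exactly the gap described above when all of a user's passwords must share one salt.  This is an illustration only- it needs libpq at build time and a live connection at run time, and makes no claim about what the eventual patched API would look like:

```c
#include <stdio.h>
#include <libpq-fe.h>

/*
 * Encrypt a password on the client so the plaintext never reaches the
 * server.  With scram-sha-256, libpq currently picks a random salt
 * internally; supporting multiple passwords per role would need a
 * variant that accepts the user's existing salt, as discussed above.
 */
static int
print_scram_verifier(PGconn *conn, const char *user, const char *password)
{
	char	   *verifier;

	verifier = PQencryptPasswordConn(conn, password, user, "scram-sha-256");
	if (verifier == NULL)
	{
		fprintf(stderr, "encryption failed: %s\n", PQerrorMessage(conn));
		return 1;
	}

	/* This string is what would be sent in ALTER ROLE ... PASSWORD '...' */
	printf("SCRAM verifier: %s\n", verifier);
	PQfreemem(verifier);
	return 0;
}
```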

Thanks,

Stephen




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-19 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-19 18:06:10 -0400, Stephen Frost wrote:
> > Ignoring such would defeat much of the point of this effort- which is to
> > get to a position where we can say with some confidence that we're not
> > going to go over some limit that the user has set and therefore not
> > allow ourselves to end up getting OOM killed.
> 
> I think that is a good medium to long term goal. I do however think that we'd
> be better off merging the visibility of memory allocations soon-ish and
> implement the limiting later. There's a lot of hairy details to get right for
> the latter, and even just having visibility will be a huge improvement.

I agree that having the visibility will be a great improvement and
perhaps could go in separately, but I don't know that I agree that the
limits are going to be that much of an issue.  In any case, there's been
work ongoing on this and that'll be posted soon.  I was just trying to
address the general comment raised in this sub-thread here.

> I think even patch 1 is doing too much at once. I doubt the DSM stuff is
> quite right.

Getting DSM right has certainly been tricky, along with other things,
but we've been working towards, and continue to work towards, getting
everything to line up nicely between memory context allocations of
various types and the amounts which are being seen as malloc'd/free'd.
There's been parts of this also reworked to allow us to see per-backend
reservations as well as total reserved and to get those numbers able to
be matched up inside of a given transaction using the statistics system.

> I'm unconvinced it's a good idea to split the different types of memory
> contexts out. That just exposes too much implementation detail stuff without a
> good reason.

DSM needs to be independent anyway ... as for the others, perhaps we
could combine them, though that's pretty easily done later and for now
it's been useful to see them split out as we've been working on the
patch.

> I think the overhead even just the tracking implies right now is likely too
> high and needs to be optimized. It should be a single math operation, not
> tracking things in multiple fields. I don't think pg_sub_u64_overflow() should
> be in the path either, that suddenly adds conditional branches.  You really
> ought to look at the difference in assembly for the hot functions.

This has been improved in the most recent work and we'll have that
posted soon, probably best to hold off from larger review of this right
now- as mentioned, I was just trying to address the specific question in
this sub-thread since a new patch is coming soon.

Thanks,

Stephen




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-19 Thread Stephen Frost
Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> On 19/10/2023 02:00, Stephen Frost wrote:
> > * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> > > On 29/9/2023 09:52, Andrei Lepikhov wrote:
> > > > On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote:
> > > > > Attach patches updated to master.
> > > > > Pulled from patch 2 back to patch 1 a change that was also pertinent
> > > > > to patch 1.
> > > > +1 to the idea, have doubts on the implementation.
> > > > 
> > > > I have a question. I see the feature triggers ERROR on the exceeding of
> > > > the memory limit. The superior PG_CATCH() section will handle the error.
> > > > As I see, many such sections use memory allocations. What if some
> > > > routine, like the CopyErrorData(), exceeds the limit, too? In this case,
> > > > we could repeat the error until the top PG_CATCH(). Is this correct
> > > > behaviour? Maybe to check in the exceeds_max_total_bkend_mem() for
> > > > recursion and allow error handlers to slightly exceed this hard limit?
> > 
> > > By the patch attached I try to show which sort of problems I'm worried
> > > about. In some PG_CATCH() sections we do CopyErrorData() (allocating some
> > > memory) before aborting the transaction, so the allocation error can move
> > > us out of this section before the abort. We expect a soft ERROR message
> > > but will face harder consequences.
> > 
> > While it's an interesting idea to consider making exceptions to the
> > limit, and perhaps we'll do that (or have some kind of 'reserve' for
> > such cases), this isn't really any different than today, is it?  We
> > might have a malloc() failure in the main path, end up in PG_CATCH() and
> > then try to do a CopyErrorData() and have another malloc() failure.
> > 
> > If we can rearrange the code to make this less likely to happen, by
> > doing a bit more work to free() resources used in the main path before
> > trying to do new allocations, then, sure, let's go ahead and do that,
> > but that's independent from this effort.
> 
> I agree that rearranging efforts can be made independently. The code in the
> letter above was shown just as a demo of the case I'm worried about.
> IMO, the thing that should be implemented here is a recursion level for the
> memory limit. If processing the error, we fall into recursion with this
> limit - we should ignore it.
> I imagine custom extensions that use PG_CATCH() and allocate some data
> there. At least we can raise the level of error to FATAL.

Ignoring such would defeat much of the point of this effort- which is to
get to a position where we can say with some confidence that we're not
going to go over some limit that the user has set and therefore not
allow ourselves to end up getting OOM killed.  These are all the same
issues that already exist today on systems which don't allow overcommit
too, there isn't anything new here in regards to these risks, so I'm not
really keen to complicate this to deal with issues that are already
there.

Perhaps once we've got the basics in place then we could consider
reserving some space for handling such cases..  but I don't think it'll
actually be very clean and what if we have an allocation that goes
beyond what that reserved space is anyway?  Then we're in the same spot
again where we have the choice of either failing the allocation in a
less elegant way than we might like to handle that error, or risk
getting outright kill'd by the kernel.  Of those choices, sure seems
like failing the allocation is the better way to go.

Thanks,

Stephen




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-19 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-18 15:53:30 -0400, Stephen Frost wrote:
> > > Here how pg_backend_memory_contexts would look like with this patch:
> > > 
> > > postgres=# SELECT name, id, parent, parent_id, path
> > > FROM pg_backend_memory_contexts
> > > ORDER BY total_bytes DESC LIMIT 10;
> > >   name   | id  |  parent  | parent_id | path
> > > -+-+--+---+--
> > >  CacheMemoryContext  |  27 | TopMemoryContext | 0 | {0}
> > >  Timezones   | 124 | TopMemoryContext | 0 | {0}
> > >  TopMemoryContext|   0 |  |   |
> > >  MessageContext  |   8 | TopMemoryContext | 0 | {0}
> > >  WAL record construction | 118 | TopMemoryContext | 0 | {0}
> > >  ExecutorState   |  18 | PortalContext|17 | {0,16,17}
> > >  TupleSort main  |  19 | ExecutorState|18 | {0,16,17,18}
> > >  TransactionAbortContext |  14 | TopMemoryContext | 0 | {0}
> > >  smgr relation table |  10 | TopMemoryContext | 0 | {0}
> > >  GUC hash table  | 123 | GUCMemoryContext |   122 | {0,122}
> > > (10 rows)
> > > 
> > > An example query to calculate the total_bytes including its children for a
> > > context (say CacheMemoryContext) would look like this:
> > > 
> > > WITH contexts AS (
> > > SELECT * FROM pg_backend_memory_contexts
> > > )
> > > SELECT sum(total_bytes)
> > > FROM contexts
> > > WHERE ARRAY[(SELECT id FROM contexts WHERE name = 'CacheMemoryContext')] <@
> > > path;
> > 
> > I wonder if we should perhaps just include
> > "total_bytes_including_children" as another column?  Certainly seems
> > like a very useful thing that folks would like to see.
> 
> The "issue" is where to stop - should we also add that for some of the other
> columns? They are a bit less important, but not that much.

I'm not sure the others really make sense to aggregate in this way as
free space isn't able to be moved between contexts.  That said, if
someone wants it then I'm not against that.  I'm actively in support of
adding an aggregated total though as that, at least to me, seems to be
very useful to have.

Thanks,

Stephen




Re: pg_dump needs SELECT privileges on irrelevant extension table

2023-10-18 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I wrote:
> > Why are we marking extension member objects as being subject to SECLABEL
> > or POLICY dumping?  As the comment notes, that isn't really sensible
> > unless what we are dumping is a delta from the extension's initial
> > assignments.  But we have no infrastructure for that, and none seems
> > likely to appear in the near future.
> 
> Here's a quick patch that does it that way.  The test changes
> are identical to Jacob's v3-0001.

What the comment is talking about is that we don't support initial
policies, not that we don't support policies on extension tables at all.
That said, even the claim that we don't support such policies isn't
enforced by the code, and there are people out there doing it, which
creates its own set of problems (ones we should really try to find
solutions to, though..).

This change would mean that policies added by a user after the extension
is created would just be lost on a pg_dump/reload, wouldn't it?

Thanks,

Stephen




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-18 Thread Stephen Frost
Greetings,

* Melih Mutlu (m.melihmu...@gmail.com) wrote:
> Melih Mutlu , 16 Haz 2023 Cum, 17:03 tarihinde şunu
> yazdı:
> 
> > With this change, here's a query to find how much space used by each
> > context including its children:
> >
> > > WITH RECURSIVE cte AS (
> > > SELECT id, total_bytes, id as root, name as root_name
> > > FROM memory_contexts
> > > UNION ALL
> > > SELECT r.id, r.total_bytes, cte.root, cte.root_name
> > > FROM memory_contexts r
> > > INNER JOIN cte ON r.parent_id = cte.id
> > > ),
> > > memory_contexts AS (
> > > SELECT * FROM pg_backend_memory_contexts
> > > )
> > > SELECT root as id, root_name as name, sum(total_bytes)
> > > FROM cte
> > > GROUP BY root, root_name
> > > ORDER BY sum DESC;
> 
> Given that the above query to get total bytes including all children is
> still a complex one, I decided to add additional info in
> pg_backend_memory_contexts.
> The new "path" field displays an integer array that consists of ids of all
> parents for the current context. This way it's easier to tell whether a
> context is a child of another context, and we don't need to use recursive
> queries to get this info.

Nice, this does seem quite useful.

> Here how pg_backend_memory_contexts would look like with this patch:
> 
> postgres=# SELECT name, id, parent, parent_id, path
> FROM pg_backend_memory_contexts
> ORDER BY total_bytes DESC LIMIT 10;
>   name   | id  |  parent  | parent_id | path
> -+-+--+---+--
>  CacheMemoryContext  |  27 | TopMemoryContext | 0 | {0}
>  Timezones   | 124 | TopMemoryContext | 0 | {0}
>  TopMemoryContext|   0 |  |   |
>  MessageContext  |   8 | TopMemoryContext | 0 | {0}
>  WAL record construction | 118 | TopMemoryContext | 0 | {0}
>  ExecutorState   |  18 | PortalContext|17 | {0,16,17}
>  TupleSort main  |  19 | ExecutorState|18 | {0,16,17,18}
>  TransactionAbortContext |  14 | TopMemoryContext | 0 | {0}
>  smgr relation table |  10 | TopMemoryContext | 0 | {0}
>  GUC hash table  | 123 | GUCMemoryContext |   122 | {0,122}
> (10 rows)
> 
> An example query to calculate the total_bytes including its children for a
> context (say CacheMemoryContext) would look like this:
> 
> WITH contexts AS (
> SELECT * FROM pg_backend_memory_contexts
> )
> SELECT sum(total_bytes)
> FROM contexts
> WHERE ARRAY[(SELECT id FROM contexts WHERE name = 'CacheMemoryContext')] <@
> path;

I wonder if we should perhaps just include
"total_bytes_including_children" as another column?  Certainly seems
like a very useful thing that folks would like to see.  We could do that
either with C, or even something as simple as changing the view to do
something like:

WITH contexts AS MATERIALIZED (
  SELECT * FROM pg_get_backend_memory_contexts()
)
SELECT
  *,
  coalesce
  (
(
  (SELECT sum(total_bytes) FROM contexts WHERE ARRAY[a.id] <@ path)
  + total_bytes
),
total_bytes
  ) AS total_bytes_including_children
FROM contexts a;
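Assuming the view were redefined along those lines, pulling out the biggest subtrees would then be trivial.  This is hypothetical usage- the total_bytes_including_children column does not exist in the shipped view today:

```sql
-- Top memory-context subtrees by aggregated size (hypothetical column).
SELECT name, parent, total_bytes_including_children
FROM pg_backend_memory_contexts
ORDER BY total_bytes_including_children DESC
LIMIT 5;
```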

> We still need to use cte since ids are not persisted and might change in
> each run of pg_backend_memory_contexts. Materializing the result can
> prevent any inconsistencies due to id change. Also it can be good for
> performance reasons.

I don't think we really want this to be materialized, do we?  Where this
is particularly interesting is when it's being dumped to the log
(though I wish we could do better than that and hope we do in the
future) while something is ongoing in a given backend; if we do that a
few times we are able to see what's changing in terms of allocations,
whereas if we materialized it (when?  transaction start?  first time
it's asked for?) then we'd only ever get the one view from whenever the
snapshot was taken.

> Any thoughts?

Generally +1 from me for working on improving this.

Thanks!

Stephen




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-18 Thread Stephen Frost
Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> On 29/9/2023 09:52, Andrei Lepikhov wrote:
> > On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote:
> > > Attach patches updated to master.
> > > Pulled from patch 2 back to patch 1 a change that was also pertinent
> > > to patch 1.
> > +1 to the idea, have doubts on the implementation.
> > 
> > I have a question. I see the feature triggers ERROR on the exceeding of
> > the memory limit. The superior PG_CATCH() section will handle the error.
> > As I see, many such sections use memory allocations. What if some
> > routine, like the CopyErrorData(), exceeds the limit, too? In this case,
> > we could repeat the error until the top PG_CATCH(). Is this correct
> > behaviour? Maybe to check in the exceeds_max_total_bkend_mem() for
> > recursion and allow error handlers to slightly exceed this hard limit?

> By the patch attached I try to show which sort of problems I'm worried
> about. In some PG_CATCH() sections we do CopyErrorData() (allocating some
> memory) before aborting the transaction, so the allocation error can move
> us out of this section before the abort. We expect a soft ERROR message
> but will face harder consequences.

While it's an interesting idea to consider making exceptions to the
limit, and perhaps we'll do that (or have some kind of 'reserve' for
such cases), this isn't really any different than today, is it?  We
might have a malloc() failure in the main path, end up in PG_CATCH() and
then try to do a CopyErrorData() and have another malloc() failure.

If we can rearrange the code to make this less likely to happen, by
doing a bit more work to free() resources used in the main path before
trying to do new allocations, then, sure, let's go ahead and do that,
but that's independent from this effort.

Thanks!

Stephen




Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-18 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2023-10-17 at 22:52 -0400, Stephen Frost wrote:
> 
> > Reading back through the thread, from a user perspective, the primary
> > one seems to be that passwords are expected to be named.  I'm
> > surprised
> > this is being brought up as such a serious concern.  Lots and lots
> > and
> > lots of things in the system require naming, after all, and the idea
> > that this is somehow harder or more of an issue is quite odd to me.
> 
> In the simplest intended use case, the names will be arbitrary and
> temporary. It's easy for me to imagine someone wondering "was I
> supposed to delete 'bear' or 'lion'?". For indexes and other objects,
> there's a lot more to go on, easily visible with \d.

Sure, agreed.

> Now, obviously that is not the end of the world, and the user could
> prevent that problem a number of different ways. And we can do things
> like improve the monitoring of password use, and store the password
> creation time, to help users if they are confused. So I don't raise
> concerns about naming as an objection to the feature overall, but
> rather a concern that we aren't getting it quite right.

Right, we need more observability, agreed, but that's not strictly
necessary of this patch and could certainly be added independently.  Is
there really a need to make this observability a requirement of this
particular change?

> Maybe a name should be entirely optional, more like a comment, and the
> passwords can be referenced by the salt? The salt needs to be unique
> for a given user anyway.

I proposed an approach in the email you replied to explicitly suggesting
a way we could make the name be optional, so, sure, I'm generally on
board with that idea.  Note that it'd be optional for the user to
provide and then we'd simply generate one for them and then that name is
what would be used to refer to that password later.

> (Aside: is the uniqueness of the salt enforced in the current patch?)

Err, the salt has to be *identical* for each password of a given user,
not unique, so I'm a bit confused here.

Thanks,

Stephen




Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-17 Thread Stephen Frost
Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2023-10-17 at 16:20 -0400, Stephen Frost wrote:
> > Agreed that it's a bad idea to design to support 2 and only 2.
> 
> I don't disagree, but it's difficult to come up with syntax that:
> 
>  (a) supports N passwords
>  (b) makes the ordinary cases simple and documentable
>  (c) helps users avoid mistakes (at least in the simple cases)
>  (d) makes sense passwords with and without validity period
>  (e) handles the challenging cases

Undoubtably biased ... but I don't agree that this is so difficult.
What points have been raised as a downside of the originally proposed
approach, specifically?

Reading back through the thread, from a user perspective, the primary
one seems to be that passwords are expected to be named.  I'm surprised
this is being brought up as such a serious concern.  Lots and lots and
lots of things in the system require naming, after all, and the idea
that this is somehow harder or more of an issue is quite odd to me.

> One challenging case is that we cannot allow the mixing of password
> protocols (e.g. MD5 & SCRAM), because the authentication exchange only
> gets one chance at success. If someone ends up with 7 MD5 passwords,
> and they'd like to migrate to SCRAM, then we can't allow them to
> migrate one password at a time (because then the other passwords would
> break). I'd like to see what the SQL for doing this should look like.

I've got absolutely no interest in supporting md5- it's high time to rip
that out and be happy that it's gone.  We nearly did it last year and
I'm really hoping we accomplish that this year.

I'm open to the idea that we may need to support some new SCRAM version
or an alternative mechanism in the future, but it's long past time to
spend any time worrying about md5.  As for how difficult it is to deal
with supporting an alternative in the future- that's going to depend a
great deal on what that alternative is and I don't know that we can
really code to handle that as part of this effort in a sensible way, and
trying to code for "anything" is going to likely make this much more
complicated, and probably rife with security issues too.

> >   If
> > nothing else, there's the very simple case that the user needs to do
> > another password rotation ... and they look and discover that the old
> > password is still being used and that if they took it away, things
> > would
> > break, but they need time to run down which system is still using it
> > while still needing to perform the migration for the other systems
> > that
> > are correctly being updated- boom, need 3 for that case.
> 
> That sounds like a reasonable use case. I don't know if we should make
> it a requirement, but if we come up with something reasonable that
> supports this case I'm fine with it. Ideally, it would still be easy to
> see when you are making a mistake (e.g. forgetting to ever remove
> passwords).

We have monitoring for many, many parts of the system and this would be
a good thing to monitor also- not just at a per-password level but also
at an overall role/user level, as you have a similar issue there and we
don't properly provide users with any way to check reasonably "hey, when
was the last time this user logged in?".  No, trawling through ancient
logs, if you even have them, isn't a proper solution to this problem.

> >   There's other
> > use-cases that could be interesting though- presumably we'd log which
> > password is used to authenticate and then users could have a fleet of
> > web servers which each have their own password but log into the same
> > PG
> > user and they could happily rotate the passwords independently for
> > all
> > of those systems.
> > 
> > if they're all
> > logging in with the same role and just a different password, that's
> > not
> > going to happen.
> 
> I'm not sure this is a great idea. Can you point to some precedent
> here?

It's already the case that lots and lots and lots of systems out there
log into PG using the same username/password.  With this, we're at least
offering them the ability to vary the password and keep the user the
same.  We've even seen this be asked for in other ways- the ability to
use distinct Kerberos or LDAP identities to log into the same user in
the database, see pg_ident.conf and various suggestions for how to bring
that to LDAP too.  Other systems also support the ability to have a
group of users in LDAP, or such, be allowed to log into a specific
database user.  One big reason for this is that then you know everyone
logging into that account has exactly the same access to th

Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-17 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Wed, Oct 04, 2023 at 10:41:15PM -0700, Gurjeet Singh wrote:
> > The patches posted in this thread so far attempt to add the ability to
> > allow the user to have an arbitrary number of passwords. I believe
> > that allowing arbitrary number of passwords is not only unnecessary,
> > but the need to name passwords, the need to store them in a shared
> > catalog, etc. may actually create problems in the field. The
> > users/admins will have to choose names for passwords, which they
> > didn't have to previously. The need to name them may also lead to
> > users storing password-hints in the password names (e.g. 'mom''s
> > birthday', 'ex''s phone number', 'third password'), rendering the
> > passwords weak.
> > 
> > Moreover, allowing an arbitrarily many number of passwords will
> > require us to provide additional infrastructure to solve problems like
> > observability (which passwords are currently in use, and which ones
> > have been effectively forgotten by applications), or create a nuisance
> > for admins that can create more problems than it solves.
> 
> IMHO neither of these problems seems insurmountable.  Besides advising
> against using hints as names, we could also automatically generate safe
> names, or even disallow user-provided names entirely.  And adding
> observability for passwords seems worthwhile anyway.

Agreed, particularly on adding observability for password use.
Regardless of what we do, I feel pretty strongly that we need that.
That said, having this handled in a separate catalog feels like a just
generally better idea than shoving it all into pg_authid as we extend
things to include information like "last used date", "last used source
IP", etc.

Providing this observability purely through logging strikes me as a
terrible idea.

I don't find the concern about names as 'hints' to be at all compelling.
Having a way to avoid having names may have some value, but only if we
can come up with something reasonable.

> > So I propose that the feature should allow no more than 2 passwords
> > for a role, each with their own validity periods. This eliminates the
> > need to name passwords, because at any given time there are no more
> > than 2 passwords; current one, and old one. This also eliminates the
> > need for a shared catalog to hold passwords, because with the limit of
> > 2 imposed, we can store the old password and its validity period in
> > additional columns in the pg_authid table.
> 
> Another approach could be to allow an abritrary number of passwords but to
> also allow administrators to limit how many passwords can be associated to
> each role.  That way, we needn't restrict this feature to 2 passwords for
> everyone.  Perhaps 2 should be the default, but in any case, IMO we
> shouldn't design to only support 2.

Agreed that it's a bad idea to design to support 2 and only 2.  If
nothing else, there's the very simple case where the user needs to do
another password rotation ... and they discover that the old password is
still being used and that taking it away would break things, so they
need time to run down which system is still using it while still
performing the migration for the other systems that are correctly being
updated- boom, you need 3 for that case.  There's other
use-cases that could be interesting though- presumably we'd log which
password is used to authenticate and then users could have a fleet of
web servers which each have their own password but log into the same PG
user and they could happily rotate the passwords independently for all
of those systems.
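To make the stalled-rotation case concrete, here's a toy sketch in Python (purely illustrative- the data model, names, and slot limit are made up for this example, not the proposed catalog changes or any real syntax):

```python
from datetime import date

# Toy model of a role carrying multiple passwords, each with its own
# validity window.  Hypothetical; nothing here reflects the actual
# pg_authid layout or any proposed SQL syntax.

class RolePasswords:
    def __init__(self, max_passwords=2):
        self.max_passwords = max_passwords
        self.passwords = []  # list of (secret, valid_until-or-None)

    def add(self, secret, valid_until):
        if len(self.passwords) >= self.max_passwords:
            raise ValueError("password slots exhausted; drop one first")
        self.passwords.append((secret, valid_until))

    def drop(self, secret):
        self.passwords = [p for p in self.passwords if p[0] != secret]

    def check(self, secret, today):
        # Returning the matched secret is what would let us log *which*
        # password authenticated a given connection.
        for s, valid_until in self.passwords:
            if s == secret and (valid_until is None or today <= valid_until):
                return s
        return None

# The scenario described above: a hard limit of 2 blocks the next
# rotation while the old password is still being run down.
role = RolePasswords(max_passwords=2)
role.add("old-secret", date(2024, 1, 31))
role.add("new-secret", date(2024, 7, 31))
try:
    role.add("newer-secret", date(2025, 1, 31))
    stuck = False
except ValueError:
    stuck = True  # need a third slot for exactly this case
```

With max_passwords raised (or made configurable per role, as suggested), the same rotation simply adds a third entry.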

I don't propose this use-case merely for the sake of argument- not
sharing passwords across a bunch of systems is absolutely a good stance
when it comes to security, and due to the way permissions and roles work
in PG, being able to have distinct passwords with an explicitly logged
indication of which system used which password to log in, without having
to deal with possible permission differences from using actually
independent roles, is valuable.  That is, each server using a separate
role to log in could pretty easily lead to some servers having access to
something or other while others don't- if they're all logging in with
the same role and just a different password, that's not going to happen.

* Jeff Davis (pg...@j-davis.com) wrote:
> Using a number seems weird to me because either:
> 
>  (a) if the number is always increasing you'd have to look to find the
> number of the new slot to add and the old slot to remove; or
>  (b) if switched between two numbers (say 0 and 1), it would be error
> prone because you'd have to remember which is the old one that can be
> safely replaced

Yeah, a number doesn't strike me as very good either.

> Maybe a password is best described by its validity period rather than a
> name? But what about passwords that don't expire?

The validity idea is interesting but falls down w

Re: The danger of deleting backup_label

2023-10-17 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 10/16/23 15:06, Robert Haas wrote:
> > On Mon, Oct 16, 2023 at 1:00 PM David Steele  wrote:
> > > After some agonizing (we hope) they decide to delete backup_label and,
> > > wow, it just works! So now they merrily go on their way with a corrupted
> > > cluster. They also remember for the next time that deleting backup_label
> > > is definitely a good procedure.
> > > 
> > > The idea behind this patch is that deleting backup_label would produce a
> > > hard error because pg_control would be missing as well (if the backup
> > > software did its job). If both pg_control and backup_label are present
> > > (but pg_control has not been loaded with the contents of backup_label,
> > > i.e. it is the running copy from the backup cluster) we can also error.
> > 
> > I mean, I think we're just going in circles, here. I did and do
> > understand, but I didn't and don't agree. You're hypothesizing a user
> > who is willing to do ONE thing that they shouldn't do during backup
> > restoration (namely, remove backup_label) but who won't be willing to
> > do a SECOND thing that they shouldn't do during backup restoration
> > (namely, run pg_resetwal).
> 
> In my experience the first case is much more likely than the second. Your
> experience may vary.

My experience (though perhaps not a surprise) mirrors David's.

> Anyway, I think they are pretty different. Deleting backup label appears to
> give a perfectly valid restore. Running pg_resetwal is more clearly (I
> think) the nuclear solution.

Right, and a delete of backup_label is just an 'rm' that folks may think
"oh, this is just some leftover thing that isn't actually needed".

OTOH, pg_resetwal has an online documentation page and a man page that's
very clear that it's only to be used as a last resort (perhaps we should
pull that into the --help output too..?).  It's also pretty clear that
pg_resetwal is actually changing things about the cluster while nuking
backup_label doesn't *seem* to be in that same category, even though we
all know it is because it's needed once recovery begins.

I'd also put out there that while people don't do restore testing
nearly as much as they should, they tend to at _least_ try to do it once
after taking their first backup and if that fails then they try to figure
out why and what they're not doing right.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Disabling Heap-Only Tuples

2023-09-20 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2023-09-19 at 12:52 -0400, Robert Haas wrote:
> > On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> > wrote:
> > > I was thinking something vaguely like "a table size that's roughly what
> > > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > > vacuum_scale_factor.  You would determine the absolute minimum size for
> > > the table given the current live tuples in the table, then add 20% to
> > > account for a steady state of dead tuples and vacuumed space.  So it's
> > > not 1.2x of the "current" table size at the time the local_update_limit
> > > feature is installed, but 1.2x of the optimal table size.
> > 
> > Right, that would be great. And honestly if that's something we can
> > figure out, then why does the parameter even need to be an integer
> > instead of a Boolean? If the system knows the optimal table size, then
> > the user can just say "try to compact this table" and need not say to
> > what size. The 1.2 multiplier is probably situation dependent and
> > maybe the multiplier should indeed be a configuration parameter, but
> > we would be way better off if the absolute size didn't need to be.
> 
> I don't have high hopes for a reliable way to automatically determine
> the target table size.  There are these queries floating around to estimate
> table bloat, which are used by various monitoring systems.  I find that they
> get it right a lot of the time, but sometimes they get it wrong.  Perhaps
> we can do better than that, but I vastly prefer a setting that I can control
> (even at the danger that I can misconfigure it) over an automatism that I
> cannot control and that sometimes gets it wrong.

Not completely against a setting- but would certainly prefer that this
be done in a more automated way, if possible.

To that end, my thought would be some kind of regular review of the FSM,
or maybe actual review by walking through the table (as VACUUM already
does...) to get an idea of where there's space and where there's used up
areas and then use that to inform various operations (either VACUUM
itself or perhaps UPDATEs from SQL).  We could also try to 'start
simple' and look for cases that we can say "well, that's definitely not
good" and address those initially.

Consider (imagine as a histogram; X is used space, . is empty):

 1: XXX
 2: XXX
 3: XXX
 4: XXX
 5: X
 6: X
 7: .
 8: .
 9: .
10: .
11: .
12: .
13: .
14: .
15: .
16: .
17: .
18: .
19: .
20: X

Well, obviously there's tons of free space in the middle and if we could
just move those few tuples/pages/whatever that are near the end to
earlier in the table then we'd be able to truncate off and shrink a
lot of the table.

> I like Alvaro's idea to automatically reset "local_update_limit" when the
> table has shrunk enough.  Why not perform that task during vacuum truncation?
> If vacuum truncation has taken place, check if the table size is no bigger
> than "local_update_limit" * (1 + "autovacuum_vacuum_scale_factor"), and if
> it is no bigger, reset "local_update_limit".  That way, we would not have
> to worry about a lock, because vacuum truncation already has the table locked.

Agreed on this too.  Essentially, once we've done some truncation, we
should 'reset'.

I've no doubt that there's some better algorithm for this, but I keep
coming back to something as simple as- if the entire second half of the
table will fit into the entire first half then the table is twice as
large as it needs to be and perhaps that triggers a preference for
placing tuples in the first half of the table.  As for what handles
this- maybe have both UPDATE and VACUUM able to, but prefer for UPDATE
to do so and only have VACUUM kick in once the tuples at the end of the
relation are older than some xid-based threshold (perhaps all of the
tuples on a given page have to be old enough?).
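That heuristic is simple enough to sketch; assuming we can get per-page used-space fractions (roughly what the FSM tracks), a toy Python check might look like this (illustrative only, not actual VACUUM logic):

```python
# Sketch of the "table is twice as large as it needs to be" test above.
# page_fill is a list of used-space fractions (0.0-1.0), one per page;
# numbers and thresholds here are made up for illustration.

def second_half_fits_in_first_half(page_fill):
    mid = len(page_fill) // 2
    free_in_first_half = sum(1.0 - f for f in page_fill[:mid])
    used_in_second_half = sum(page_fill[mid:])
    return used_in_second_half <= free_in_first_half

# The histogram from the earlier message: a mostly-empty tail with one
# straggler page at the very end.
fill = [1.0] * 4 + [0.33] * 2 + [0.0] * 13 + [1.0]
assert second_half_fits_in_first_half(fill)  # prefer earlier pages now
```

A table that is uniformly 90% full everywhere would (correctly) not trigger the preference.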

While it feels a bit 'late' in terms of when to start taking this
action, we could possibly start with 'all frozen' as an indicator of
'old enough'?  Then, between the FSM and the VM, VACUUM could decide
that pages at the end of the table should be moved to be earlier and go
about making that happen.  I'm a bit concerned about the risk of some
kind of deadlock or similar happening between VACUUM and user processes
if we're trying to do this with multiple tuples at a time but hopefully
we could come up with a way to avoid that.  This process naturally would
have to involve updating indexes and the VM and FSM as the tuples get
moved.

In terms of what this would look like, my thinking is that VACUUM would
scan the table and the FSM and perhaps the VM and then say "ok, this
table is bigger than it needs to be, let's try to fix that" and then set
a flag on the table, which a user could also explicitly set to give them
control over this process happening sooner or not happening at all, and
that would indicate to UPDATE to prefer earlier pages over the current
page or HOT updates, w

Re: Possibility to disable `ALTER SYSTEM`

2023-09-11 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Sep 11, 2023 at 1:56 PM Martín Marqués  
> wrote:
> > > I would like to propose a patch that allows administrators to disable 
> > > `ALTER SYSTEM` via either a run-time option to pass to the Postgres 
> > > server process at startup (e.g. `--disable-alter-system=true`, false by 
> > > default) or a new GUC (or even both), without changing the current 
> > > default method of the server.
> >
> > I'm actually going to put a strong +1 to Gabriele's proposal. It's an
> > undeniable problem (I'm only seeing arguments regarding other ways the
> > system would be insecure), and there might be real use cases for users
> > outside kubernetes for having this feature and using it (meaning
> > disabling the use of ALTER SYSTEM).
> 
> If enough people are in favor of it *given the known issues with it*,
> I can drop my objection to a "meh, but I still don't think it's a good
> idea".

A lot of the objections seem to be on the grounds of returning a
'permission denied' kind of error and I generally agree with that being
the wrong approach.

As an alternative idea- what if we had something in postgresql.conf
along the lines of:

include_alter_system = true/false

and use that to determine if the postgresql.auto.conf is included or
not..?

> But to do that, there would need to be a very in-your-face warning in
> the documentation about it like "note that this only disables the
> ALTER SYSTEM command. It does not prevent a superuser from changing
> the configuration remotely using other means".

With the above, we could throw a WARNING or maybe just NOTICE when ALTER
SYSTEM is run that 'include_alter_system is false and therefore these
changes won't be included in the running configuration' or similar.

What this does cause problems with is that pg_basebackup and other tools
(eg: pgbackrest) write into postgresql.auto.conf currently and we'd want
those to still work.  That's an opportunity, imv, though, since I don't
really think where ALTER SYSTEM writes to and where backup/restore
tools are writing to should really be the same place anyway.  Therefore,
perhaps we add a 'postgresql.system.conf' or similar and maybe a
corresponding option in postgresql.conf to include it or not.
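To sketch what that might look like in postgresql.conf (both directive names are hypothetical- neither exists today):

```
# Hypothetical knobs, illustrating the proposal only:
include_alter_system = off   # skip postgresql.auto.conf (ALTER SYSTEM's file)
include_system_config = on   # load postgresql.system.conf (backup/restore tools)
```

The actual names and defaults would obviously be up for discussion.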

> For example, in the very simplest, wth the POC patch out there now, I
> can still run:
> postgres=# CREATE TEMP TABLE x(t text);
> CREATE TABLE
> postgres=# INSERT INTO x VALUES ('work_mem=1TB');
> INSERT 0 1
> postgres=# COPY x TO 
> '/home/mha/postgresql/inst/head/data/postgresql.auto.conf';
> COPY 1
> postgres=# SELECT pg_reload_conf();
>  pg_reload_conf
> 
>  t
> (1 row)
> postgres=# show work_mem;
>  work_mem
> --
>  1TB
> (1 row)
> 
> Or anything similar to that.

This is an issue if you're looking at it as a security thing.  This
isn't an issue if you don't view it that way.  Still, I do see some merit in
making it so that you can't actually change the config that's loaded at
system start from inside the data directory or as the PG superuser,
which my proposal above would support- just configure in postgresql.conf
to not include any of the alter-system or generated config.  The actual
postgresql.conf could be owned by root then too.

> > In Patroni for example, the PostgreSQL service is controlled on all
> > nodes by Patroni, and these kinds of changes could end up breaking the
> > cluster if there was a failover. For this reason Patroni starts
> > postgres with some GUC options as CMD arguments so that values in
> > postgresql.conf or postgresql.auto.conf are ignored. The values in the
> > DCS are the ones that matter.
> 
> Right. And patroni would need to continue to do that even with this
> patch, because it also needs to override if somebody puts something in
> postgresql.conf, no? Removing the defence against that seems like a
> bad idea...
> 
> 
> > (see more about how Patroni manages this here:
> > https://patroni.readthedocs.io/en/latest/patroni_configuration.html#postgresql-parameters-controlled-by-patroni
> >
> > But let's face it, that's a hack, not something to be proud of, even
> > if it does what is intended. And this is a product and we shouldn't be
> > advertising hacks to overcome limitations.
> 
> It's in a way a hack. But it's not the fault of ALTER SYSTEM, as you'd
> also have to manage postgresql.conf. One slightly less hacky part
> might be to have patroni generate a config file of its own and
> include it with the highest priority -- at that point it *would*
> become a hack around ALTER SYSTEM, because ALTER SYSTEM has a higher
> priority than any manual user config file. But it is not today.

I suppose we could invent a priority control thing as part of the above
proposal too.. but I would think just having include_alter_system and
include_auto_config (or whatever we name them) would be enough, with the
auto-config bit being loaded last and therefore having the 'highest'
priority.

> Another idea to solve the problem could 

Re: Adding a pg_get_owned_sequence function?

2023-09-08 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Fri, Sep 01, 2023 at 01:31:43PM -0400, Tom Lane wrote:
> > Nathan Bossart  writes:
> >> I wonder if it'd be possible to just remove pg_get_serial_sequence().
> > 
> > A quick search at http://codesearch.debian.net/ finds uses of it
> > in packages like gdal, qgis, rails, ...  We could maybe get rid of
> > it after a suitable deprecation period, but I think we can't just
> > summarily remove it.

I don't agree with this- we would only be removing it from the next
major release which is a year away and our other major releases will be
supported for years to come.  If we do remove it, it'd be great to
mention it to those projects and ask them to update ahead of the next
release.

> Given that, I'd still vote for marking it deprecated, but I don't feel
> strongly about actually removing it anytime soon (or anytime at all,
> really).

Why would we mark it as deprecated then?  If we're not going to get rid
of it, then we're supporting it and we'll fix issues with it and that
basically means it's not deprecated.  If it's broken and we're not going
to fix it, then we should get rid of it.

If we're going to actually mark it deprecated then it should be, at
least, a yearly discussion about removing it.  I'm generally more in
favor of either just keeping it, or just removing it, though.  We've had
very little success marking things as deprecated as a way of getting
everyone to stop using it- some folks will stop using it right away and
those are the same people who would just adapt to it being gone in the
next major version quickly, and then there's folks who won't do anything
until it's actually gone (and maybe not even then).  There really isn't
a serious middle-ground where deprecation is helpful given our yearly
release cycle and long major version support period.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: SLRUs in the main buffer pool - Page Header definitions

2023-09-08 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Aug 24, 2023 at 3:28 PM Stephen Frost  wrote:
> > Agreed that we'd certainly want to make sure it's all correct and to do
> > performance testing but in terms of how many buffers... isn't much of
> > the point of this that we have data in memory because it's being used
> > and if it's not then we evict it?  That is, I wouldn't think we'd have
> > set parts of the buffer pool for SLRUs vs. regular data but would let
> > the actual demand drive what pages are in the cache and I thought that
> > was part of this proposal and part of the reasoning behind making this
> > change.
> 
> I think that it's not quite that simple. In the regular buffer pool,
> access to pages is controlled by buffer pins and buffer content locks,
> but these mechanisms don't exist in the same way in the SLRU code. But
> buffer pins drive usage counts which drive eviction decisions. So if
> you move SLRU data into the main buffer pool, you either need to keep
> the current locking regime and use some new logic to decide how much
> of shared_buffers to bequeath to the SLRU pools, OR you need to make
> SLRU access use buffer pins and buffer content locks. If you do the
> latter, I think you substantially increase the cost of an uncontended
> SLRU buffer access, because you now need to pin the buffer, and
> then take and release the content lock, and then release the pin;
> whereas today you can do all that by just taking and release the
> SLRU's lwlock. That's more atomic operations, and hence more costly, I
> think. But even if not, it could perform terribly if SLRU buffers
> become more vulnerable to eviction than they are at present, or
> alternatively if they take over too much of the buffer pool and force
> other important data out.

An SLRU buffer access does also update the cur_lru_count for the SLRU,
along with the per-page page_lru_count, but those are 32-bit and we don't
enforce that they're done in order, so presumably those are less
expensive than the pinning and usage count updates.

This thread started with the issue that our current SLRUs are relatively
small though and simply increasing their size would lead to issues as
we're just doing simple things like a linear search through them all at
times, or at least that's more-or-less what I understood from [1].  More
details on the specific 'scaling and sizing challenges' would be nice to
have.  The current patches were at least claimed to improve performance
while also using ReadBuffer_common [2].  It would be interesting to know
what, specifically, is leading to that improvement, though with all these
changes that's likely non-trivial to determine.  pgbench may not be the
best way to measure this, but it's still interesting to see an
improvement like that.

Certainly one concern about using the regular buffer pool is that
finding a victim page can be expensive and having that as part of an
SLRU access could be pretty painful.  Though we also have that problem
elsewhere too.
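To illustrate the victim-search cost, here's a very rough clock-sweep sketch in Python, ignoring pins, locking, and the freelist (so only loosely modeled on the real buffer manager): each sweep step decrements a usage count until a zero is found, so a busy pool can make a hot lookup pay for a long walk.

```python
# Toy clock-sweep victim search.  Each buffer access would normally bump
# its usage count; the sweep decrements counts as it passes, evicting
# the first buffer it finds at zero.  The 'visited' counter shows how
# much work an unlucky search can do.

def find_victim(usage_counts, hand=0):
    n = len(usage_counts)
    visited = 0
    while True:
        if usage_counts[hand] == 0:
            return hand, visited
        usage_counts[hand] -= 1          # give the buffer another chance
        hand = (hand + 1) % n
        visited += 1

counts = [3, 1, 2, 0, 5]
victim, steps = find_victim(counts)
# buffer 3 was already at zero; buffers 0, 1, and 2 got decremented on the way
```

With no zero-count buffer anywhere, the sweep makes multiple full passes before settling- the worry for putting SLRU lookups on this path.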

If we're going to effectively segregate the buffer pool into SLRU parts
vs. everything else and then use the existing strategies for SLRUs and
have that be different from what everything else is using ... then
that doesn't seem like it's really changing things.  What would be the
point of moving the SLRUs into the main buffer pool then?

> SLRUs are a performance hotspot, so even relatively minor changes to
> their performance characteristics can, I believe, have pretty
> noticeable effects on performance overall.

Agreed, we certainly need to have a plan for how to performance test
this change and should try to come up with some 'worst case' tests.

Thanks,

Stephen

[1]: https://postgr.es/m/EFAAC0BE-27E9-4186-B925-79B7C696D5AC%40amazon.com
[2]: https://postgr.es/m/A09EAE0D-0D3F-4A34-ADE9-8AC1DCBE7D57%40amazon.com


signature.asc
Description: PGP signature


Re: Let's make PostgreSQL multi-threaded

2023-08-25 Thread Stephen Frost
Greetings,

* David Geier (geidav...@gmail.com) wrote:
> On 8/11/23 14:05, Merlin Moncure wrote:
> > Hm, noted this upthread, but asking again: does this
> > help/benefit interactions with the operating system, making oom kill
> > situations less likely?   These things are the bane of my existence, and
> > I'm having a hard time finding a solution that prevents them other than
> > running pgbouncer and lowering max_connections, which adds complexity. 
> > I suspect I'm not the only one dealing with this.   What's really scary
> > about these situations is they come without warning.  Here's a pretty
> > typical example per sar -r.
> > 
> > The conjecture here is that lots of idle connections make the server
> > appear to have less memory available than it looks, and sudden transient
> > demands can cause it to destabilize.
> 
> It does in the sense that your server will have more memory available in
> case you have many long living connections around. Every connection has less
> kernel memory overhead if you will. Of course even then a runaway query will
> be able to invoke the OOM killer. The unfortunate thing with the OOM killer
> is that, in my experience, it often kills the checkpointer. That's because
> the checkpointer will touch all of shared buffers over time which makes it
> likely to get selected by the OOM killer. Have you tried disabling memory
> overcommit?

This is getting a bit far afield in terms of this specific thread, but
there's an ongoing effort to give PG administrators knobs to be able to
control how much actual memory is used rather than depending on the
kernel to actually tell us when we're "out" of memory.  There'll be new
patches for the September commitfest posted soon.  If you're interested
in this issue, it'd be great to get more folks involved in review and
testing.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: SLRUs in the main buffer pool - Page Header definitions

2023-08-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Aug 18, 2023 at 12:15 PM Nathan Bossart
>  wrote:
> > I think I agree with Stephen.  We routinely make changes that require
> > updates to extensions, and I doubt anyone is terribly wild about
> > maintaining two SLRU systems for several years.
> 
> Yeah, maintaining two systems doesn't sound like a good idea.
> 
> However, this would be a big change. I'm not sure how we validate a
> change of this magnitude. There are both correctness and performance
> considerations. I saw there had been a few performance results on the
> thread from Thomas that spawned this thread; but I guess we'd want to
> do more research. One question is: how do you decide how many buffers
> to use for each SLRU, and how many to leave available for regular
> data?

Agreed that we'd certainly want to make sure it's all correct and to do
performance testing but in terms of how many buffers... isn't much of
the point of this that we have data in memory because it's being used
and if it's not then we evict it?  That is, I wouldn't think we'd have
set parts of the buffer pool for SLRUs vs. regular data but would let
the actual demand drive what pages are in the cache and I thought that
was part of this proposal and part of the reasoning behind making this
change.

There's certainly an argument to be made that our current cache
management strategy doesn't account very well for the value of pages
(eg: btree root pages vs. random heap pages, perhaps) and that'd
certainly be a good thing to improve on, but that's independent of this.
If it's not, then that's certainly moving the goal posts a very long way
in terms of this effort.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Logging of matching pg_hba.conf entry during auth skips trust auth, potential security issue

2023-08-17 Thread Stephen Frost
Greetings,

On Thu, Aug 17, 2023 at 15:23 Robert Haas  wrote:

> On Thu, Aug 17, 2023 at 12:54 PM Jacob Champion 
> wrote:
> > On Thu, Aug 17, 2023 at 9:46 AM Stephen Frost 
> wrote:
> > > Don't like 'skipped' but that feels closer.
> > >
> > > How about 'connection bypassed authentication'?
> >
> > Works for me; see v2.
>
> For what it's worth, my vote would be for "connection authenticated:
> ... method=trust".


I don’t have any particular objection to this language and agree that it’s
actually closer to how we talk about the trust auth method in our
documentation.

Maybe if we decided to rework the documentation … or perhaps just ripped
“trust” out entirely … but those are whole different things from what we
are trying to accomplish here.

Thanks,

Stephen


Re: Logging of matching pg_hba.conf entry during auth skips trust auth, potential security issue

2023-08-17 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On Thu, Aug 17, 2023 at 9:01 AM Stephen Frost  wrote:
> > Maybe 'connection allowed' instead..?
> 
> Hm. It hasn't really been allowed yet, either. To illustrate what I mean:
> 
> LOG:  connection received: host=[local]
> LOG:  connection allowed: user="jacob" method=trust
> (/home/jacob/src/data/pg16/pg_hba.conf:117)
> LOG:  connection authorized: user=jacob database=postgres
> application_name=psql
> 
> Maybe "unauthenticated connection:"? "connection without
> authentication:"? "connection skipped authentication:"?

Don't like 'skipped' but that feels closer.

How about 'connection bypassed authentication'?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Logging of matching pg_hba.conf entry during auth skips trust auth, potential security issue

2023-08-17 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> Maybe something like the attached?

> - I used the phrasing "connection not authenticated" in the hopes that
> it's a bit more greppable than just "connection", especially in
> combination with the existing "connection authenticated" lines.

That doesn't seem quite right ... admittedly, 'trust' isn't performing
authentication, though there can certainly be an argument made that the
basic 'matched a line in pg_hba.conf' is a form of authentication.
Worse, really, saying 'not authenticated' would seem to imply that we
didn't allow the connection when, really, we did, and that could be
confusing to someone.

Maybe 'connection allowed' instead..?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-24 Thread Stephen Frost
Greetings,

(Adding David Steele into the CC on this one...)

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> This is a frustrating thread, because despite the last patch solving
> most of the problems we discussed, it doesn't address the
> low-level-backup procedure in a nice way.  We'd have to tell users
> they have to flock that file, or add a new step "pg_controldata --raw >
> pg_control", which seems weird when they already have a connection
> to the server.

flock'ing the file strikes me as dangerous to ask external tools to do
due to the chances of breaking the running system if they don't do it
right.  I'm generally open to the idea of having the backup tool have to
do more complicated work to be correct but that just seems likely to
cause issues.  Also- haven't looked yet, but I'm not sure that's even
possible if your backup tool is running as a user who only has read
access to the data directory?  I don't want us to give up on that
feature.

> Maybe it just doesn't matter if eg the pg_controldata program can
> spuriously fail if pointed at a running system, and I was being too
> dogmatic trying to fix even that.  Maybe we should just focus on
> fixing backups.  Even there, I am beginning to suspect we are solving
> this problem in the wrong place when a higher level change could
> simplify the problem away.

For a running system.. perhaps pg_controldata should be connecting to
the database and calling functions there?  Or just complain that the
system is online and tell the user to do that?

> Idea for future research:  Perhaps pg_backup_stop()'s label-file
> output should include the control file image (suitably encoded)?  Then
> the recovery-from-label code could completely ignore the existing
> control file, and overwrite it using that copy.  It's already
> partially ignoring it, by using the label file's checkpoint LSN
> instead of the control file's.  Perhaps the captured copy could
> include the correct LSN already, simplifying that code, and the low
> level backup procedure would not need any additional steps or caveats.
> No more atomicity problem for low-level-backups... but probably not
> something we would back-patch, for such a rare failure mode.

I like this general direction and wonder if we could possibly even push
a bit harder on it: have the backup_label include the control file's
contents in some form that is understood and then tell tools to *not*
copy the running pg_control file ... and maybe even complain if a
pg_control file exists when we detect that backup_label has the control
file's image.  We've certainly had problems in the past where people
would just nuke the backup_label file, even though they were restoring
from a backup, because they couldn't start the system up since their
restore command wasn't set up properly or their WAL archive was missing.

Being able to get rid of the control file being in the backup at all
would make it harder for someone to get to a corrupted-but-running
system and that seems like it's a good thing.

> Here's a new minimal patch that solves only the bugs in basebackup +
> the simple SQL-facing functions that read the control file, by simply
> acquiring ControlFileLock in the obvious places.  This should be
> simple enough for back-patching?

I don't particularly like fixing this in a way that only works for
pg_basebackup and means that the users of other backup tools don't have
a solution to this problem.  What are we supposed to tell users of
pgBackRest when they see this fix for pg_basebackup in the next set of
minor releases and they ask us if we've addressed this risk?

We might be able to accept the 're-read on CRC failure' approach, if it
were being used for pg_controldata and we documented that external
tools and especially backup tools using the low-level API are required
to check the CRC and to re-read on failure if accessing a running
system.
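A sketch of that re-read loop, in Python against an invented file layout (payload plus a trailing little-endian CRC32- not the real pg_control format):

```python
import os
import struct
import tempfile
import zlib

def read_with_retry(path, attempts=5):
    """Re-read a small control-style file until its trailing CRC32
    matches, on the theory that a mismatch means a concurrent writer
    tore our read.  Layout here is invented for illustration."""
    for _ in range(attempts):
        with open(path, "rb") as f:
            data = f.read()
        payload, stored = data[:-4], struct.unpack("<I", data[-4:])[0]
        if zlib.crc32(payload) & 0xFFFFFFFF == stored:
            return payload            # got a consistent snapshot
        # Torn read suspected; try again.
    raise OSError("CRC mismatch persisted; file may really be corrupt")

# Demo: write a well-formed file and read it back.
payload = b"fake control data"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF))
assert read_with_retry(path) == payload
os.unlink(path)
```

The point is just that the reader, not the writer, absorbs the race- which is why tools using the low-level API would each have to implement it.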

While it's not ideal, maybe we could get away with changing the contents
of the backup_label as part of a back-patch?  The documentation, at
least on a quick look, says to copy the second field from pg_backup_stop
into a backup_label file but doesn't say anything about what those
contents are or if they can change.  That would at least address the
concern of backups ending up with a corrupt pg_control file and not
being able to be restored, even if tools aren't updated to verify the
CRC or similar.  Of course, that's a fair bit of code churn for a
backpatch, which I certainly understand as being risky.  If we can't
back-patch that, it might still be the way to go moving forward, while
also telling tools to check the CRC.  (I'm not going to try to figure
out some back-patchable pretend solution for this for shell scripts that
pretend to be able to backup running PG databases; this issue is
probably the least of their issues anyway...)

A couple of interesting notes on this though- pgBackRest doesn't only
read the pg_control file at backup time, we also check it at
archive_command time, to 

Re: Dumping policy on a table belonging to an extension is expected?

2023-07-17 Thread Stephen Frost
Greetings,

* Amul Sul (sula...@gmail.com) wrote:
> I have a ROW LEVEL SECURITY policy on a table that is part of an extension,
> and while dumping the database where that extension is installed, pg_dump
> dumps the policy of that table even though it does not dump the table
> itself.
> 
> Here is a quick test, where I have added the following SQL to the
> adminpack--1.0.sql extension file:
> 
> CREATE TABLE public.foo (i int CHECK(i > 0));
> ALTER TABLE public.foo ENABLE ROW LEVEL SECURITY;
> CREATE POLICY foo_policy ON public.foo USING (true);
> 
> After installing and creating this extension, the dump output will have the
> policy without that table:
> 
> --
> -- Name: foo; Type: ROW SECURITY; Schema: public; Owner: amul
> --
> 
> ALTER TABLE public.foo ENABLE ROW LEVEL SECURITY;
> 
> --
> -- Name: foo foo_policy; Type: POLICY; Schema: public; Owner: amul
> --
> 
> CREATE POLICY foo_policy ON public.foo USING (true);
> 
> 
> I am not sure if that is expected behaviour. The code comment in
> checkExtensionMembership() suggests it is intentional:
> 
>  * In 9.6 and above, mark the member object to have any non-initial ACL,
>  * policies, and security labels dumped.
> 
> The question is why are we doing this? Shouldn't we skip this policy if it
> is part of the create-extension script?
> 
> Also, if you try to drop this policy, it gets dropped without any warning or
> error, unlike tables or other objects, which cannot be dropped at all.

At least at the time, it wasn't really envisioned that policies would be
part of an extension's creation script.  That was probably short-sighted
and it'd be better if we handled that cleanly, but to do so we'd need
something akin to pg_init_privs where we track what policies are created
as part of the creation script vs. what are created afterwards and then
dump out the post-installation policy changes (note that we'd need to
track if any installation-time policies were dropped or changed too...)
as part of the pg_dump.

It'd be helpful if you could maybe provide some use-cases around this?
Permission changes such as those handled by pg_init_privs are a bit more
understandable since an extension script might want to revoke access
from PUBLIC for functions or to GRANT access to PUBLIC for other things,
or to leverage predefined roles, but how does that work for policies?
Most extensions aren't likely to be creating their own roles or
depending on non-predefined roles to already exist in the system as
otherwise they'd end up failing to install if those roles didn't
exist...

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Atomic ops for unlogged LSN

2023-07-17 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Mon, Jun 12, 2023 at 07:24:18PM -0400, Stephen Frost wrote:
> > * Nathan Bossart (nathandboss...@gmail.com) wrote:
> >> Is it?  I see uses in GiST indexing (62401db), so it's not immediately
> >> obvious to me how it is debugging-only.  If it is, then I think this patch
> >> ought to clearly document it so that nobody else tries to use it for
> >> non-debugging-only stuff.
> > 
> > I don't see it as a debugging value.  I'm not sure where that came
> > from..?  We do use it in places and if anything, I expect it to be used
> > more, not less, in the future as a persistent generally increasing
> > value (could go backwards on a crash-restart but in such case all
> > unlogged tables are truncated).
> 
> This is my understanding as well.
> 
> >> My concern would be whether GetFakeLSNForUnloggedRel or CreateCheckPoint
> >> might see an old value of unloggedLSN.  From the following note in
> >> README.barrier, it sounds like this would be a problem even if we ensured
> >> full barrier semantics:
> 
> Never mind.  I think I'm forgetting that the atomics support in Postgres
> deals with cache coherency.
> 
> > The concern around unlogged LSN, imv anyway, is less about shared memory
> > access and more about making sure that all callers understand that it can
> > move backwards on a crash/restart.  I don't think that's an issue for
> > current users but we just need to make sure to comment sufficiently so
> > that new users understand that about this particular value.
> 
> Right.

Awesome.  Was there any other feedback on this change which basically is
just getting rid of a spinlock and replacing it with using atomics?
It's still in needs-review status but there's been a number of
comments/reviews (drive-by, at least) but without any real ask for any
changes to be made.
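For readers skimming the thread, the contract being relied on can be
sketched as follows.  This is a Python stand-in, not the patch itself
(the actual change replaces a spinlock around the shared counter with an
atomic fetch-add in C): values handed out only need to be distinct and
generally increasing, and the counter may restart after a crash because
unlogged relations are truncated at that point.

```python
import itertools

FIRST_NORMAL_UNLOGGED_LSN = 1000  # illustrative starting value

class UnloggedLSN:
    """Stand-in for the shared counter behind GetFakeLSNForUnloggedRel().
    itertools.count increments in C, so under CPython it approximates an
    atomic fetch-add without any explicit lock."""
    def __init__(self):
        self._counter = itertools.count(FIRST_NORMAL_UNLOGGED_LSN)

    def get_fake_lsn(self):
        # One fetch-and-increment per call; no spinlock needed.
        return next(self._counter)

    def crash_restart(self):
        # The value is not durable across a crash: it restarts, which is
        # safe only because unlogged relations are truncated then.
        self._counter = itertools.count(FIRST_NORMAL_UNLOGGED_LSN)

lsn = UnloggedLSN()
a, b = lsn.get_fake_lsn(), lsn.get_fake_lsn()
lsn.crash_restart()
c = lsn.get_fake_lsn()
```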

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: SLRUs in the main buffer pool - Page Header definitions

2023-07-17 Thread Stephen Frost
Greetings,

[snipped quoted bits]

It would really be helpful to keep the attribution of each quoted snippet
when you quote them; dropping it makes it look like one person wrote all
of them, which is confusing.

* Bagga, Rishu (bagri...@amazon.com) wrote:
> The third patch brings back the SLRU control structure, to keep it 
> as an extensible feature for now, and renames the handler for the 
> components we are moving into the buffercache to NREL (short for non 
> relational). nrel.c is essentially a copy of Munro’s modified slru.c, 
> and I have restored the original slru.c. This allows for existing 
> extensions utilizing SLRUs to keep working, and the test_slru unit tests 
> to pass, as well as introducing a more accurate name for the handling of 
> components (CLOG, Multixact Offsets/Members, Async, Serial, Subtrans) 
> that are no longer handled by an SLRU, but are still non relational 
> components. To address Andres’s concern - I modified the slru stats test 
> code to still track all these current components and maintain the 
> behavior, and confirmed that those tests pass as well.

Haven't really looked over the patches yet but I wanted to push back on
this a bit- you're suggesting that we'd continue to maintain and update
slru.c for the benefit of extensions which use it while none of the core
code uses it?  For how long?  For my 2c, at least, I'd rather we tell
extension authors that they need to update their code instead.  There's
reasons why we're moving the SLRUs into the main buffer pool and having
page headers for them and using the existing page code to read/write
them and extension authors should be eager to gain those advantages too.
Not sure how much concern to place on extensions that aren't willing to
adjust to changes like these.

> The fourth patch adds the page headers to these Non Relational (NREL) 
> components, and provides the upgrade story to rewrite the clog and 
> multixact files with page headers across upgrades.

Nice.

> With the changes from all four patches, they pass all tests with make 
> installcheck-world, as well as test_slru.

Awesome, will try to take a look soon.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-07-07 Thread Stephen Frost
Greetings,

* Nikolay Samokhvalov (n...@postgres.ai) wrote:
> On Fri, Jun 30, 2023 at 14:33 Bruce Momjian  wrote:
> > On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote:
> > > I'm not quite clear on how Nikolay got into trouble here. I don't
> > > think I understand under exactly what conditions the procedure is
> > > reliable and under what conditions it isn't. But there is no way in
> > > heck I would ever advise anyone to use this procedure on a database
> > > they actually care about. This is a great party trick or something to
> > > show off in a lightning talk at PGCon, not something you ought to be
> > > doing with valuable data that you actually care about.
> >
> > Well, it does get used, and if we remove it perhaps we can have it on
> > our wiki and point to it from our docs.

I was never a fan of having it actually documented because it's a pretty
complex and involved process that really requires someone doing it to have
a strong understanding of how PG works.

> In my case, we performed some additional writes on the primary before
> running "pg_upgrade -k" and we did it *after* we shut down all the
> standbys. So those changes were not replicated and then "rsync --size-only"
> ignored them. (By the way, that cluster has wal_log_hints=on to allow
> Patroni run pg_rewind when needed.)

That's certainly going to cause problems..

> But this can happen with anyone who follows the procedure from the docs as
> is and doesn't do any additional steps, because in step 9 "Prepare for
> standby server upgrades":
> 
> 1) there is no requirement to follow a specific order when shutting down
> the nodes - "Streaming replication and log-shipping standby servers can
> remain running until a later step" should probably be changed to a
> requirement-like "keep them running"

Agreed that it would be good to clarify that the primary should be shut
down first, to make sure everything written by the primary has been
replicated to all of the replicas.

> 2) checking the latest checkpoint position with pg_controldata now looks
> like a thing that is good to do, but with an unclear purpose -- it does
> not seem to be used to support any decision
>   - "There will be a mismatch if old standby servers were shut down before
> the old primary or if the old standby servers are still running" should
> probably be rephrased saying that if there is a mismatch, it's a big problem

Yes, it's absolutely a big problem and that's the point of the check.
Slightly surprised that we need to explicitly say "if they don't match
then you need to figure out what you did wrong and don't move forward
until you get everything shut down and with matching values", but that's
also why it isn't a great idea to try and do this without a solid
understanding of how PG works.
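For anyone following the procedure, the check under discussion amounts to
something like the following ("Latest checkpoint location" is the label
pg_controldata prints; the data directory paths and LSN values here are
made up for illustration):

```
# Shut down the primary first, then the standbys, then compare:
$ pg_controldata /pgdata/primary | grep 'Latest checkpoint location'
Latest checkpoint location:           2B/7A000028

$ pg_controldata /pgdata/standby | grep 'Latest checkpoint location'
Latest checkpoint location:           2B/7A000028

# If these values do not match, stop here -- do not proceed with
# pg_upgrade -k plus rsync --size-only until you understand why.
```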

> So following the steps as is, if some writes on the primary are not
> replicated (due to whatever reason) before execution of pg_upgrade -k +
> rsync --size-only, then those writes are going to be silently lost on
> standbys.

Yup.

> I wonder, if we ensure that standbys are fully caught up before upgrading
> the primary, if we check the latest checkpoint positions, are we good to
> use "rsync --size-only", or there are still some concerns? It seems so to
> me, but maybe I'm missing something.

I've seen a lot of success with it.

Ultimately, when I was describing this process, it was always with the
idea that it would be performed by someone quite familiar with the
internals of PG or, ideally, could be an outline of how an interested PG
hacker could write a tool to do it.  Hard to say, but I do feel like
having it documented has actually reduced the interest in writing a tool
to do it, which, if that's the case, is quite unfortunate.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Atomic ops for unlogged LSN

2023-06-12 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Thu, May 25, 2023 at 07:41:21AM +0900, Michael Paquier wrote:
> > On Wed, May 24, 2023 at 02:49:58PM -0700, Andres Freund wrote:
> >> So we indeed loose some "barrier strength" - but I think that's fine. For 
> >> one,
> >> it's a debugging-only value.
> 
> Is it?  I see uses in GiST indexing (62401db), so it's not immediately
> obvious to me how it is debugging-only.  If it is, then I think this patch
> ought to clearly document it so that nobody else tries to use it for
> non-debugging-only stuff.

I don't see it as a debugging value.  I'm not sure where that came
from..?  We do use it in places and if anything, I expect it to be used
more, not less, in the future as a persistent generally increasing
value (could go backwards on a crash-restart but in such case all
unlogged tables are truncated).

> >> But more importantly, I don't see what reordering
> >> the barrier could prevent - a barrier is useful for things like sequencing 
> >> two
> >> memory accesses to happen in the intended order - but unloggedLSN doesn't
> >> interact with another variable that's accessed within the ControlFileLock
> >> afaict.
> > 
> > This stuff is usually tricky enough that I am never completely sure
> > whether it is fine without reading again README.barrier, which is
> > where unloggedLSN is saved in the control file under its LWLock.
> > Something that I find confusing in the patch is that it does not
> > document the reason why this is OK.
> 
> My concern would be whether GetFakeLSNForUnloggedRel or CreateCheckPoint
> might see an old value of unloggedLSN.  From the following note in
> README.barrier, it sounds like this would be a problem even if we ensured
> full barrier semantics:
> 
> 3. No ordering guarantees.  While memory barriers ensure that any given
> process performs loads and stores to shared memory in order, they don't
> guarantee synchronization.  In the queue example above, we can use memory
> barriers to be sure that readers won't see garbage, but there's nothing to
> say whether a given reader will run before or after a given writer.  If 
> this
> matters in a given situation, some other mechanism must be used instead of
> or in addition to memory barriers.
> 
> IIUC we know that shared memory accesses cannot be reordered to precede
> aquisition or follow release of a spinlock (thanks to 0709b7e), which is
> why this isn't a problem in the current implementation.

The concern around unlogged LSN, imv anyway, is less about shared memory
access and more about making sure that all callers understand that it can
move backwards on a crash/restart.  I don't think that's an issue for
current users but we just need to make sure to comment sufficiently so
that new users understand that about this particular value.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Let's make PostgreSQL multi-threaded

2023-06-09 Thread Stephen Frost
Greetings,

* Dave Cramer (davecramer@postgres.rocks) wrote:
> One thing I can think of is upgrading. AFAIK dump and restore is the only
> way to change the on disk format.
> Presuming that eventually we will be forced to change the on disk format it
> would be nice to be able to do so in a manner which does not force long
> down times

There is an ongoing effort moving in this direction.  The $subject isn't
great, but this patch set (which we are currently working on
updating...): https://commitfest.postgresql.org/43/3986/ attempts
changing a lot of currently compile-time block-size pieces to be
run-time which would open up the possibility to have a different page
format for, eg, different tablespaces.  Possibly even different block
sizes.  We'd certainly welcome discussion from others who are
interested.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Docs: Encourage strong server verification with SCRAM

2023-05-25 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> On 5/24/23 05:04, Daniel Gustafsson wrote:
> >> On 23 May 2023, at 23:02, Stephen Frost  wrote:
> >> Perhaps more succinctly- maybe we should be making adjustments to the
> >> current language instead of just adding a new paragraph.
> > 
> > +1
> 
> I'm 100% on board for doing that as well, but the "instead of"
> suggestion makes me think I didn't explain my goal very well. For
> example, Stephen, you said
> 
> > I have to admit that the patch as presented strikes me as a warning
> > without really providing steps for how to address the issues mentioned
> > though; there's a reference to what was just covered at the bottom but
> > nothing really new.
> 
> but the last sentence of my patch *is* the suggested step:
> 
> > +  ... It's recommended to employ strong server
> > +  authentication, using one of the above anti-spoofing measures, to 
> > prevent
> > +  these attacks.

I was referring specifically to that ordering as not being ideal or in
line with the rest of the flow of that section.  We should integrate the
concerns higher in the section where we outline the reason these things
matter and then follow those with the specific steps for how to address
them, not give a somewhat unclear reference from the very bottom back up
to something above.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Large files for relations

2023-05-25 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 24.05.23 02:34, Thomas Munro wrote:
> > Thanks all for the feedback.  It was a nice idea and it *almost*
> > works, but it seems like we just can't drop segmented mode.  And the
> > automatic transition schemes I showed don't make much sense without
> > that goal.
> > 
> > What I'm hearing is that something simple like this might be more 
> > acceptable:
> > 
> > * initdb --rel-segsize (cf --wal-segsize), default unchanged
> 
> makes sense

Agreed, this seems alright in general.  Having more initdb-time options
to help with certain use-cases rather than having things be compile-time
is definitely just generally speaking a good direction to be going in,
imv.

> > * pg_upgrade would convert if source and target don't match
> 
> This would be good, but it could also be an optional or later feature.

Agreed.

> Maybe that should be a different mode, like --copy-and-adjust-as-necessary,
> so that users would have to opt into what would presumably be slower than
> plain --copy, rather than being surprised by it, if they unwittingly used
> incompatible initdb options.

I'm curious as to why it would be slower than a regular copy..?

> > I would probably also leave out those Windows file API changes, too.
> > --rel-segsize would simply refuse larger sizes until someone does the
> > work on that platform, to keep the initial proposal small.
> 
> Those changes from off_t to pgoff_t?  Yes, it would be good to do without
> those.  Apart of the practical problems that have been brought up, this was
> a major annoyance with the proposed patch set IMO.
> 
> > I would probably leave the experimental copy_on_write() ideas out too,
> > for separate discussion in a separate proposal.
> 
> right

You mean copy_file_range() here, right?

Shouldn't we just add support for that today into pg_upgrade,
independently of this?  Seems like a worthwhile improvement even without
the benefit it would provide to changing segment sizes.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Tue, May 23, 2023 at 09:46:58PM -0400, Stephen Frost wrote:
> > Not without breaking things we support today and for what seems like an
> > unclear benefit given that we've got channel binding today (though
> > perhaps we need to make sure there's ways to force it on both sides to
> > be on and to encourage everyone to do that- which is what this change is
> > generally about, I think).
> > 
> > As I recall, the reason we do it the way we do is because the SASL spec
> > that SCRAM is implemented under requires the username to be utf8 encoded
> > while we support other encodings, and I don't think we want to break
> > non-utf8 usage.
> 
> Yup, I remember this one, the encoding not being enforced by the
> protocol has been quite an issue when this was implemented, still I
> was wondering whether that's something that could be worth revisiting
> at some degree.

To the extent that there was an issue when it was implemented ... it's
now been implemented and so that was presumably overcome (though I don't
specifically recall what the issues were there; it seems like it wouldn't
matter at this point anyway), so I don't understand why we would wish to
revisit it.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Tue, May 23, 2023 at 05:02:50PM -0400, Stephen Frost wrote:
> > * Jacob Champion (jchamp...@timescale.com) wrote:
> >> As touched on in past threads, our SCRAM implementation is slightly
> >> nonstandard and doesn't always protect the entirety of the
> >> authentication handshake:
> >>
> >> - the username in the startup packet is not covered by the SCRAM
> >> crypto and can be tampered with if channel binding is not in effect,
> >> or by an attacker holding only the server's key
> >
> > Encouraging channel binding when using TLS certainly makes a lot of
> > sense, particularly in environments where the trust anchors are not
> > strongly managed (that is- if you trust the huge number of root
> > certificates that a system may have installed).  Perhaps explicitly
> > encouraging users to *not* trust every root server installed on their
> > client for their PG connections would also be a good idea.  We should
> > probably add language to that effect to this section.
> 
> Currently the user name is not treated by the backend, like that in
> auth-scram.c:
> 
> /*
>  * Read username.  Note: this is ignored.  We use the username from the
>  * startup message instead, still it is kept around if provided as it
>  * proves to be useful for debugging purposes.
>  */
> state->client_username = read_attr_value(&p, 'n');
> 
> Could it make sense to cross-check that with the contents of the
> startup package instead, with or without channel binding?

Not without breaking things we support today and for what seems like an
unclear benefit given that we've got channel binding today (though
perhaps we need to make sure there's ways to force it on both sides to
be on and to encourage everyone to do that- which is what this change is
generally about, I think).

As I recall, the reason we do it the way we do is because the SASL spec
that SCRAM is implemented under requires the username to be utf8 encoded
while we support other encodings, and I don't think we want to break
non-utf8 usage.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> As touched on in past threads, our SCRAM implementation is slightly
> nonstandard and doesn't always protect the entirety of the
> authentication handshake:
> 
> - the username in the startup packet is not covered by the SCRAM
> crypto and can be tampered with if channel binding is not in effect,
> or by an attacker holding only the server's key

Encouraging channel binding when using TLS certainly makes a lot of
sense, particularly in environments where the trust anchors are not
strongly managed (that is- if you trust the huge number of root
certificates that a system may have installed).  Perhaps explicitly
encouraging users to *not* trust every root server installed on their
client for their PG connections would also be a good idea.  We should
probably add language to that effect to this section.

We do default to having channel binding being used though.  Breaking
down exactly the cases at risk (perhaps including what version certain
things were introduced) and what direct action administrators can take
to ensure their systems are as secure as possible (and maybe mention
what things that doesn't address still, so they're aware of what risks
still exist) would be good.

> - low iteration counts accepted by the client make it easier than it
> probably should be for a MITM to brute-force passwords (note that
> PG16's scram_iterations GUC, being server-side, does not mitigate
> this)

This would be good to improve on.

> - by default, a SCRAM exchange can be exited by the server prior to
> sending its verifier, skipping the client's server authentication step
> (this is mitigated by requiring channel binding, and PG16 adds
> require_auth=scram-sha-256 to help as well)

Yeah, encouraging this would also be good and should be mentioned as
something to do when one is using SCRAM.  Clearly that would go into a
PG16 version of this.
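For reference, the client-side combination being discussed can be
expressed with libpq connection parameters along these lines (the host,
database, user, and certificate path are placeholders; require_auth needs
a PG16 libpq, channel_binding a v13 or later one):

```
psql "host=db.example.com dbname=app user=alice \
      sslmode=verify-full sslrootcert=/etc/ssl/pg-root.crt \
      channel_binding=require require_auth=scram-sha-256"
```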

> These aren't currently considered security vulnerabilities, but it'd
> be good for the documentation to call them out, considering mutual
> authentication is one of the design goals of the SCRAM RFC. (I'd also
> like to shore up these problems, eventually, to make SCRAM-based
> mutual authn viable with Postgres. But that work has stalled a bit on
> my end.)

Improving the documentation certainly is a good plan.

> Here's a patch to explicitly warn people away from SCRAM as a form of
> server authentication, and nudge them towards a combination with
> verified TLS or gssenc. I've tried to keep the text version-agnostic,
> to make a potential backport easier. Is this a good place for the
> warning to go? Should I call out that GSS can't use channel binding,
> or promote the use of TLS versus GSS for SCRAM, or just keep it
> simple?

Adding channel binding to GSS (which does support it, we just don't
implement it today..) when using TLS or another encryption method for
transport would be a good improvement to make, particularly right now as
we don't yet support encryption with SSPI, meaning that you need to use
TLS to get session encryption when one of the systems is on Windows.  I
do hope to add support for encryption with SSPI during this next release
cycle and would certainly welcome help from anyone else who is
interested in this.

I have to admit that the patch as presented strikes me as a warning
without really providing steps for how to address the issues mentioned
though; there's a reference to what was just covered at the bottom but
nothing really new.  The current documentation leads with the warnings
and then goes into the steps to address the risk; I'd say it would make
more sense to put this wording about SCRAM not being a complete solution
above the steps one can take to reduce the spoofing risk, matching how
the documentation is currently structured.

Perhaps more succinctly- maybe we should be making adjustments to the
current language instead of just adding a new paragraph.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Naming of gss_accept_deleg

2023-05-21 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Abhijit Menon-Sen  writes:
> > At 2023-05-20 23:21:57 -0400, t...@sss.pgh.pa.us wrote:
> >> I thought the plan was to also rename the libpq "gssdeleg" connection
> >> parameter and so on?  I can look into that tomorrow, if nobody beats
> >> me to it.
> 
> > I was trying the change to see if it would be better to name it
> > "gssdelegate" instead (as in delegate on one side, and accept the
> > delegation on the other), but decided that "gssdelegation=enable"
> > reads better than "gssdelegate=enable".
> 
> Yeah, agreed.
> 
> > Here's the diff.
> 
> Thanks for doing that legwork!  I found a couple other places where
> "deleg" had escaped notice, and changed the lot.  Watching the
> buildfarm now ...

Thanks all for taking this up over a weekend.

Stephen


signature.asc
Description: PGP signature


Re: Adding SHOW CREATE TABLE

2023-05-20 Thread Stephen Frost
Greetings,

On Sat, May 20, 2023 at 13:32 David G. Johnston wrote:

> On Sat, May 20, 2023 at 10:26 AM Stephen Frost  wrote:
>
>> > A server function can be conveniently called from any client code.
>>
>> Clearly any client using libpq can conveniently call code which is in
>> libpq.
>>
>
> Clearly there are clients that don't use libpq.  JDBC comes to mind.
>

Indeed … as I mentioned up-thread already.

Are we saying that we want this to be available server side, and largely
duplicated, specifically to cater to non-libpq users?  I’ll put out there,
again, the idea that perhaps we put it into the common library then and
make it available via both libpq and as a server side function ..?

We also have similar code in postgres_fdw.. ideally, imv anyway, we’d not
end up with three copies of it.

Thanks,

Stephen


Re: Adding SHOW CREATE TABLE

2023-05-20 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Fri, 2023-05-19 at 13:08 -0400, Andrew Dunstan wrote:
> > On 2023-05-18 Th 19:53, Stephen Frost wrote:
> > > * Kirk Wolak (wol...@gmail.com) wrote:
> > > > My approach for now is to develop this as the \st command.
> > > > After reviewing the code/output from the 3 sources (psql, fdw, and
> > > > pg_dump).
> > >
> > > Having this only available via psql seems like the least desirable
> > > option as then it wouldn't be available to any other callers..
> > > 
> > > Having it in libpq, on the other hand, would make it available to psql
> > > as well as any other utility, library, or language / driver which uses
> > > libpq, including pg_dump..
> > 
> > I think the ONLY place we should have this is in server side functions.
> 
> +1

... but it already exists in pg_dump, so I'm unclear why folks seem to
be pushing to have a duplicate of it in core?  We certainly can't remove
it from pg_dump even if we add it to core because pg_dump has to
understand the changes between major versions.

> A function "pg_get_tabledef" would blend nicely into the following list:
> 
> \df pg_get_*def
>                                       List of functions
>    Schema   │              Name              │ Result data type │ Argument data types   │ Type
> ────────────┼────────────────────────────────┼──────────────────┼───────────────────────┼──────
>  pg_catalog │ pg_get_constraintdef           │ text             │ oid                   │ func
>  pg_catalog │ pg_get_constraintdef           │ text             │ oid, boolean          │ func
>  pg_catalog │ pg_get_functiondef             │ text             │ oid                   │ func
>  pg_catalog │ pg_get_indexdef                │ text             │ oid                   │ func
>  pg_catalog │ pg_get_indexdef                │ text             │ oid, integer, boolean │ func
>  pg_catalog │ pg_get_partition_constraintdef │ text             │ oid                   │ func
>  pg_catalog │ pg_get_partkeydef              │ text             │ oid                   │ func
>  pg_catalog │ pg_get_ruledef                 │ text             │ oid                   │ func
>  pg_catalog │ pg_get_ruledef                 │ text             │ oid, boolean          │ func
>  pg_catalog │ pg_get_statisticsobjdef        │ text             │ oid                   │ func
>  pg_catalog │ pg_get_triggerdef              │ text             │ oid                   │ func
>  pg_catalog │ pg_get_triggerdef              │ text             │ oid, boolean          │ func
>  pg_catalog │ pg_get_viewdef                 │ text             │ oid                   │ func
>  pg_catalog │ pg_get_viewdef                 │ text             │ oid, boolean          │ func
>  pg_catalog │ pg_get_viewdef                 │ text             │ oid, integer          │ func
>  pg_catalog │ pg_get_viewdef                 │ text             │ text                  │ func
>  pg_catalog │ pg_get_viewdef                 │ text             │ text, boolean         │ func
> (17 rows)
> 
> 
> A server function can be conveniently called from any client code.

Clearly any client using libpq can conveniently call code which is in
libpq.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: How to ensure that SSPI support (Windows) enabled?

2023-05-19 Thread Stephen Frost
Greetings,

Please don't top-post.

* Dimitry Markman (dmark...@mathworks.com) wrote:
> I was asking our 3p library people how to add windows support to gss and they 
> said that on windows we should use SSPI

They're correct.

> I’m not really familiar with either gssapi or SSPI

Kerberos support is provided through SSPI on Windows.  On Linux and Unix
systems in general, it's provided through GSSAPI.  On the wire, the two
are (mostly) compatible.

> I see that macOS has builtin support for gssapi, so all I need is to use 
> –with-gssapi

On most Unix-based systems (and certainly for MacOS), you should be
installing MIT Kerberos and using that for your GSSAPI library.  The
GSSAPI library included with MacOS has not been properly maintained by
Apple and is woefully out of date and using it will absolutely cause you
undue headaches.

> On linux I use MIT Kerberos that we build in our 3p environment (only linux)

Yes, MIT Kerberos on Linux makes sense.

> When I asked about building MIT Kerberos on Windows, I was advised simply
> to use SSPI

That's correct, you should be using SSPI on Windows in the vast majority
of cases.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Naming of gss_accept_deleg

2023-05-19 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> Why is the new PG 16 GUC called "gss_accept_deleg" and not
> "gss_accept_delegation"?  The abbreviation here seems atypical.

At the time it felt natural to me but I don't feel strongly about it,
happy to change it if folks would prefer it spelled out.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Adding SHOW CREATE TABLE

2023-05-18 Thread Stephen Frost
Greetings,

* Kirk Wolak (wol...@gmail.com) wrote:
> My approach for now is to develop this as the \st command.
> After reviewing the code/output from the 3 sources (psql, fdw, and
> pg_dump).  This trivializes the approach,
> and requires the smallest set of changes (psql is already close with
> existing queries, etc).
> 
> And frankly, I would rather have an \st feature that handles 99% of the use
> cases then go 15yrs waiting for a perfect solution.
> Once this works well for the group.  Then, IMO, that would be the time to
> discuss moving it.

Having this only available via psql seems like the least desirable
option as then it wouldn't be available to any other callers..

Having it in libpq, on the other hand, would make it available to psql
as well as any other utility, library, or language / driver which uses
libpq, including pg_dump..

Using libpq would also make sense from the perspective that libpq can be
used to connect to a number of different major versions of PG and this
could work for them all in much the way that pg_dump does.

The downside with this approach is that drivers which don't use libpq
(eg: JDBC) would have to re-implement this if they wanted to keep
feature parity with libpq..

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [PATCHES] Post-special page storage TDE support

2023-05-12 Thread Stephen Frost
Greetings,

* David Christensen (david.christen...@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.

Thanks for updating this!

> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> 
> This space is reserved for extended data on the Page structure which will be 
> ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data 
> appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime 
> based on specific
> ControlFile features.
> 
> No effort is made to ensure this is backwards-compatible with existing 
> clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with 
> different settings here.

This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.

Looking at the changes to bufpage.h, in particular ...

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h

> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>  
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?

> @@ -36,10 +44,10 @@
>   * |  v pd_upper   |
>   * +-++
>   * |  | tupleN ... |
> - * +-+--+-+
> - * |... tuple3 tuple2 tuple1 | "special space" |
> - * ++-+
> - *                                   ^ pd_special
> + * +-+-++++
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +---++++
> + *                     ^ pd_special     ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.

> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */

Would be good to do. ;)

> @@ -325,7 +335,7 @@ static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>   Assert(page);
> - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>   Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }

This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout?  Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time.  Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting.  That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB.  This would likely require multiple shared buffer
pools, of course...

> diff --git a/src/backend/storage/page/bufpage.c 
> b/src/backend/storage/page/bufpage.c
> index 9a302ddc30..a93cd9df9f 100644
> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c
> @@ -26,6 +26,8 @@
>  /* GUC variable */
>  bool ignore_checksum_failure = false;
>  
> +int  reserved_page_size = 0; /* how much page space to 
> reserve for extended unencrypted metadata */
> +
>  
>  /* 
>   *   Page support functions
> @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
>  {
>   PageHeader  p = (PageHeader) page;
>  
> - specialSize = MAXALIGN(specialSize);
> + specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently.  Eg, the later pd_upper setting
would be done by:

p->pd_upper = pageSize - specialSize - reserved_page_size;

etc.
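As a rough sketch of what I mean (hypothetical names and a hypothetical
16-byte reservation purely for illustration — the real PageInit in
bufpage.c does considerably more):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192
#define MAXALIGN(LEN) (((uintptr_t) (LEN) + 7) & ~(uintptr_t) 7)

/* Hypothetical runtime-configured reservation, as in the patch. */
static int reserved_page_size = 16;

/*
 * Compute pd_upper with specialSize and reserved_page_size treated
 * independently, rather than folding the reservation into specialSize.
 * The reserved area sits at the very end of the page, after the
 * MAXALIGN'd special space.
 */
static size_t
compute_pd_upper(size_t pageSize, size_t specialSize)
{
	return pageSize - MAXALIGN(specialSize) - reserved_page_size;
}
```

With specialSize kept separate like this, PageGetSpecialPointer-style
math stays unchanged and only the trailing reservation moves pd_upper.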

> @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int 
> flags)
>   *   one that is both unused and deallocated.
>   *
>   *   If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> - *   MaxHeapTuplesPerPage

Re: Large files for relations

2023-05-12 Thread Stephen Frost
Greetings,

* Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote:
> Thomas Munro  writes:
> > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
> >>> I am not aware of any modern/non-historic filesystem[2] that can't do
> >>> large files with ease.  Anyone know of anything to worry about on that
> >>> front?
> >>
> >> There is some trouble in the ambiguity of what we mean by "modern" and
> >> "large files". There are still a large number of users of ext4 where
> >> the max file size is 16TB. Switching to a single large file per
> >> relation would effectively cut the max table size in half for those
> >> users. How would a user with say a 20TB table running on ext4 be
> >> impacted by this change?
> […]
> > A less aggressive version of the plan would be that we just keep the
> > segment code for the foreseeable future with no planned cut off, and
> > we make all of those "piggy back" transformations that I showed in the
> > patch set optional.  For example, I had it so that CLUSTER would
> > quietly convert your relation to large format, if it was still in
> > segmented format (might as well if you're writing all the data out
> > anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> > base backup.  Etc.  Then someone concerned about hitting the 16TB
> > limit on ext4 could opt out.  Or something like that.  It seems funny
> > though, that's exactly the user who should want this feature (they
> > have 16,000 relation segment files).
> 
> If we're going to have to keep the segment code for the foreseeable
> future anyway, could we not get most of the benefit by increasing the
> segment size to something like 1TB?  The vast majority of tables would
> fit in one file, and there would be less risk of hitting filesystem
> limits.

While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much.  Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things.  50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.
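The arithmetic behind that is straightforward (a sketch; ceiling
division over a hypothetical relation size):

```c
#include <assert.h>
#include <stdint.h>

/* Number of segment files needed for a relation of a given size. */
static uint64_t
segment_count(uint64_t rel_bytes, uint64_t seg_bytes)
{
	/* ceiling division: round up to cover a partial final segment */
	return (rel_bytes + seg_bytes - 1) / seg_bytes;
}
```

For a 20TB relation this gives 20480 open-able files at 1GB segments,
2048 at 10GB, and 410 at 50GB — so most of the FD reduction is already
had at 10GB.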

Just some thoughts.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: base backup vs. concurrent truncation

2023-05-11 Thread Stephen Frost
Greetings,

* Greg Stark (st...@mit.edu) wrote:
> Including the pre-truncation length in the wal record is the obviously
> solid approach and I none of the below is a good substitution for it.

I tend to agree with the items mentioned in Andres's recent email on
this thread too in terms of improving the WAL logging around this.

> On Tue, 25 Apr 2023 at 13:30, Andres Freund  wrote:
> > It isn't - but the alternatives aren't great either. It's not that easy to 
> > hit
> > this scenario, so I think something along these lines is more palatable than
> > adding a pass through the entire data directory.
> 
> Doing one pass through the entire data directory on startup before
> deciding the directory is consistent doesn't sound like a crazy idea.

We're already making a pass through the entire data directory on
crash-restart (and fsync'ing everything too), which includes when
restoring from backup; see src/backend/access/transam/xlog.c:5155.
Extending that to check for oddities like segments following a not-1GB
segment certainly seems like a good idea to me.
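The check could be as simple as the following sketch (hypothetical
function, operating on an already-gathered list of segment sizes rather
than the directory itself): any segment shorter than the full segment
size that is followed by a later segment is suspect, since only the
final segment may legitimately be short.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_BYTES ((uint64_t) 1 << 30)	/* 1GB relation segment size */

/*
 * Given the sizes of relation segments 0..nsegs-1 (in segment order),
 * return the index of the first oddity: a segment shorter than
 * SEG_BYTES while a later segment exists.  Returns -1 if the layout
 * looks sane.  The final segment is allowed to be short.
 */
static int
find_short_interior_segment(const uint64_t *sizes, int nsegs)
{
	for (int i = 0; i < nsegs - 1; i++)
		if (sizes[i] < SEG_BYTES)
			return i;
	return -1;
}
```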

> It's pretty easy to imagine bugs in backup software that leave out
> files in the middle of tables -- some of us don't even have to
> imagine...

Yup.

> Similarly checking for a stray next segment whenever extending a file
> to maximum segment size seems like a reasonable thing to check for
> too.

Yeah, that definitely seems like a good idea.  Extending a relation is
already expensive and we've taken steps to deal with that, so detecting
that the file we were expecting to create is already there certainly
seems like a good idea and I wouldn't expect (?) it to add a lot of
extra time in the normal case.

> These kinds of checks are the kind of paranoia that catches filesystem
> bugs, backup software bugs, cron jobs, etc that we don't even know to
> watch for.

Agreed, and would also help in cases where such a situation already
exists out there somewhere and which no amount of new WAL records would
make go away..

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Large files for relations

2023-05-09 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro  wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro 
> > wrote:
> > > rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself
unless you're using it with archiving and with start/stop backup and
with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup).  Something
> > like --copy-range-dest.  That'd work for large-file relations
> > (assuming a file system that has block sharing, like XFS and ZFS).
> > You wouldn't get the "mtime is enough, I don't even need to read the
> > bytes" optimisation, which I assume makes all database hackers feel a
> > bit queasy anyway, but you'd get the space savings via the usual
> > rolling checksum or a cheaper version that only looks for strong
> > checksum matches at the same offset, or whatever other tricks rsync
> > might have up its sleeve.

There's also really good reasons to have multiple full backups and not
just a single full backup and then lots and lots of incrementals, which
basically boils down to "are you really sure that one copy of that one
really important file won't ever disappear from your backup
repository..?"

That said, pgbackrest does now have block-level incremental backups
(where we define our own block size ...) and there's reasons we decided
against going down the LSN-based approach (not the least of which is
that the LSN isn't always updated...), but long story short, moving to
larger than 1G files should be something that pgbackrest will be able
to handle without as much impact as there would have been previously in
terms of incremental backups.  There is a loss in the ability to use
mtime to scan just the parts of the relation that changed and that's
unfortunate but I wouldn't see it as really a game changer (and yes,
there's certainly an argument for not trusting mtime, though I don't
think we've yet had a report where there was an mtime issue that our
mtime-validity checking didn't catch and force pgbackrest into
checksum-based revalidation automatically which resulted in an invalid
backup... of course, not enough people test their backups...).

> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small file sizes.

I'm also generally in favor of reducing the number of open file handles
that we have to deal with.  Addressing the concerns raised nearby about
weird corner-cases of non-1G length ABCDEF.1 files existing while
ABCDEF.2, and more, files exist is certainly another good argument in
favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in
> file file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number, whether that be at every
> block, or every megabyte, or some other interval, and whether we store it
> in-file or in a separate file to accompany the first non-segmented. Having
> such versioning information would allow blocks of different formats to
> co-exist in the same table, which could be critical to future changes such
> as 64 bit XIDs, etc.

To the extent you're interested in this, there are patches posted which
are already trying to move us in a direction that would allow for
different page formats that add in space for other features such as
64bit XIDs, better checksums, and TDE tags to be supported.

https://commitfest.postgresql.org/43/3986/

Currently those patches are expecting it to be declared at initdb time,
but the way they're currently written that's more of a soft requirement
as you can tell on a per-page basis what features are enabled for that
page.  Might make sense to support it in that form first anyway though,
before going down the more ambitious route of allowing different pages
to have different sets of features enabled for them concurrently.

When it comes to 'a separate file', we do have forks already and those
serve a very valuable but distinct use-case where you can get
information from the much smaller fork (be it the FSM or the VM or some
future thing) while something like 64bit XIDs or a stronger checksum is
something you'd really need on every page.  I have serious doubts about
a proposal where we'd store information needed on every page read in
some far away block that's still in the same file such as using
something every 1MB as that would turn every block access into two..

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-18 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@enterprisedb.com) wrote:
> On 13.04.23 04:45, Yurii Rashkovskii wrote:
> > But getting your agreement is important to get this in; I am willing to
> > play along and resolve both (1) and (2) in one go. As for the
> > implementation approach for (2), which of the following options would
> > you prefer?
> > 
> > a) Document postmaster.pid as it stands today
> > b) Expose the port number through pg_ctl (*my personal favorite)
> > c) Redesign its content below line 1 to make it extensible (make unnamed
> > lines named, for example)
> > 
> > If none of the above options suit you, do you have a strategy you'd prefer?
> 
> You could just drop another file into the data directory that just contains
> the port number ($PGDATA/port).  However, if we ever do multiple ports, that
> would still require a change in the format of that file, so I don't know if
> that's actually better than a).

If we did a port per line then it wouldn't be changing the format of the
first line, so that might not be all that bad.
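For what it's worth, the port already lives on the fourth line of
postmaster.pid in current releases, so a client-side parse is not hard
today — a minimal sketch (hypothetical helper, parsing an in-memory
copy of the file's contents):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the port number from a postmaster.pid-style buffer, or -1 if
 * the buffer has fewer than four lines.  Line layout assumed: PID,
 * data directory, start time, port, ...
 */
static long
pidfile_port(const char *contents)
{
	const char *p = contents;

	/* skip the first three lines to reach the port line */
	for (int line = 1; line < 4; line++)
	{
		p = strchr(p, '\n');
		if (p == NULL)
			return -1;
		p++;
	}
	return atol(p);
}
```

Appending one port per line after the existing content would leave this
kind of parsing untouched while still allowing multiple ports later.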

> I don't think involving pg_ctl is necessary or desirable, since it would
> make any future changes like that even more complicated.

I'm a bit confused by this- if pg_ctl is invoked then we have
more-or-less full control over parsing and reporting out the answer, so
while it might be a bit more complicated for us, it seems surely simpler
for the end user.  Or maybe you're referring to something here that I'm
not thinking of?

Independent of the above though ... this hand-wringing about what we
might do in the relative near-term when we haven't done much in the past
many-many years regarding listen_addresses or port strikes me as
unlikely to be necessary.  Let's pick something and get it done and
accept that we may have to change it at some point in the future, but
that's kinda what major releases are for, imv anyway.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm guessing it's not really an issue, but it does make changing
> > configure a bit annoying: on my Ubuntu 22.04, when I run autoconf2.69, I
> > end up with this additional hunk as changed from what our configure
> > currently has.
> 
> Not surprising.  Thanks to autoconf's long release cycles, individual
> distros often are carrying local patches that affect its output.
> To ensure consistent results across committers, our policy is that you
> should use built-from-upstream-source autoconf not a vendor's version.
> (In principle that could bite us sometime, but it hasn't yet.)

... making me more excited about the idea of getting over to meson.

> Also, you should generally run autoheader after autoconf.
> Checking things here, I notice that pg_config.h.in hasn't been
> updated for the last few gssapi-related commits:
> 
> $ git diff
> diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
> index 3665e79..6d572c3 100644
> *** a/src/include/pg_config.h.in
> --- b/src/include/pg_config.h.in
> ***
> *** 196,201 
> --- 196,207 
>   /* Define to 1 if you have the `getpeerucred' function. */
>   #undef HAVE_GETPEERUCRED
>   
> + /* Define to 1 if you have the  header file. */
> + #undef HAVE_GSSAPI_EXT_H
> + 
> + /* Define to 1 if you have the  header file. */
> + #undef HAVE_GSSAPI_GSSAPI_EXT_H
> + 
>   /* Define to 1 if you have the  header file. */
>   #undef HAVE_GSSAPI_GSSAPI_H
>   
> Shall I push that, or do you want to?

Hrmpf.  Sorry about that.  Please feel free and thanks for pointing it
out.

Thanks again,

Stephen


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-17 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Stephen Frost  writes:
> > > Done that way.
> > 
> > Looks like you neglected to update the configure script proper?
> 
> Pah, indeed.  Will fix.  Sorry about that.

Fixed.

I'm guessing it's not really an issue, but it does make changing
configure a bit annoying: on my Ubuntu 22.04, when I run autoconf2.69, I
end up with this additional hunk as changed from what our configure
currently has.

Thoughts?

Thanks,

Stephen
diff --git a/configure b/configure
index 82efa0d3f1..5489cddce2 100755
--- a/configure
+++ b/configure
@@ -15348,7 +15348,7 @@ else
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		   && LARGE_OFF_T % 2147483647 == 1)
 		  ? 1 : -1];
@@ -15394,7 +15394,7 @@ else
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		   && LARGE_OFF_T % 2147483647 == 1)
 		  ? 1 : -1];
@@ -15418,7 +15418,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		   && LARGE_OFF_T % 2147483647 == 1)
 		  ? 1 : -1];
@@ -15463,7 +15463,7 @@ else
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		   && LARGE_OFF_T % 2147483647 == 1)
 		  ? 1 : -1];
@@ -15487,7 +15487,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 We can't simply define LARGE_OFF_T to be 9223372036854775807,
 since some C++ compilers masquerading as C compilers
 incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		   && LARGE_OFF_T % 2147483647 == 1)
 		  ? 1 : -1];


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Done that way.
> 
> Looks like you neglected to update the configure script proper?

Pah, indeed.  Will fix.  Sorry about that.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > How about the attached which just switches from testing for
> > gss_init_sec_context to testing for gss_store_cred_into?
> 
> WFM.

Done that way.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Pushed, thanks again to everyone.
> > I'll monitor the buildfarm and assuming there isn't anything unexpected
> > then I'll mark the open item as resolved now.
> 
> The Debian 7 (Wheezy) members of the buildfarm (lapwing, skate, snapper)
> are all getting past the gssapi_ext.h check you added and then failing
> like this:
> 
> ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith 
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels 
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv 
> -fexcess-precision=standard -g -O2 -Werror -I../../../src/include  
> -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE 
> -I/usr/include/libxml2  -I/usr/include/et  -c -o be-gssapi-common.o 
> be-gssapi-common.c
> be-gssapi-common.c: In function 'pg_store_delegated_credential':
> be-gssapi-common.c:110:2: error: unknown type name 
> 'gss_key_value_element_desc'
> be-gssapi-common.c:111:2: error: unknown type name 'gss_key_value_set_desc'
> be-gssapi-common.c:113:4: error: request for member 'key' in something not a 
> structure or union
> be-gssapi-common.c:114:4: error: request for member 'value' in something not 
> a structure or union
> be-gssapi-common.c:115:7: error: request for member 'count' in something not 
> a structure or union
> be-gssapi-common.c:116:7: error: request for member 'elements' in something 
> not a structure or union
> be-gssapi-common.c:119:2: error: implicit declaration of function 
> 'gss_store_cred_into' [-Werror=implicit-function-declaration]
> 
> Debian 7 has been EOL five years or so, so I don't mind saying "get a
> newer OS or disable gssapi".  However, is it worth adding another
> configure check to fail a little faster with whatever Kerberos
> version this is?  Checking that gss_store_cred_into() exists
> seems like the most obvious one of these things to test for.

Sure, I can certainly do that and agreed that it makes sense to check
for gss_store_cred_into().

How about the attached which just switches from testing for
gss_init_sec_context to testing for gss_store_cred_into?

Thanks!

Stephen
diff --git a/configure.ac b/configure.ac
index c53a9c788e..1362f57a27 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1340,8 +1340,8 @@ fi
 
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
-AC_SEARCH_LIBS(gss_init_sec_context, [gssapi_krb5 gss 'gssapi -lkrb5 -lcrypto'], [],
-   [AC_MSG_ERROR([could not find function 'gss_init_sec_context' required for GSSAPI])])
+AC_SEARCH_LIBS(gss_store_cred_into, [gssapi_krb5 gss 'gssapi -lkrb5 -lcrypto'], [],
+   [AC_MSG_ERROR([could not find function 'gss_store_cred_into' required for GSSAPI])])
   else
 LIBS="$LIBS -lgssapi32"
   fi
diff --git a/meson.build b/meson.build
index 3405cc07ee..f1db5455b0 100644
--- a/meson.build
+++ b/meson.build
@@ -634,14 +634,14 @@ if not gssapiopt.disabled()
   endif
 
   if not have_gssapi
-  elif cc.has_function('gss_init_sec_context', dependencies: gssapi,
+  elif cc.has_function('gss_store_cred_into', dependencies: gssapi,
   args: test_c_args, include_directories: postgres_inc)
 cdata.set('ENABLE_GSS', 1)
 
 krb_srvtab = 'FILE:/@0@/krb5.keytab)'.format(get_option('sysconfdir'))
 cdata.set_quoted('PG_KRB_SRVTAB', krb_srvtab)
   elif gssapiopt.enabled()
-error('''could not find function 'gss_init_sec_context' required for GSSAPI''')
+error('''could not find function 'gss_store_cred_into' required for GSSAPI''')
   else
 have_gssapi = false
   endif


signature.asc
Description: PGP signature


Re: PGBuildfarm member pollock Branch HEAD Failed at Stage Make

2023-04-13 Thread Stephen Frost
Greetings,

* buildfarm-adm...@lists.postgresql.org (buildfarm-adm...@lists.postgresql.org) 
wrote:
> The PGBuildfarm member pollock had the following event on branch HEAD:
> Failed at Stage: Make
> The snapshot timestamp for the build is: 2023-04-13 13:06:34
> The specs of this machine are:
>   OS:  OmniOS / illumos / r151038
>   Arch: amd64
>   Comp: gcc / 10.2.0

I've reached out to Andy to ask about dropping --with-gssapi from
pollock's configure line.  It's not entirely clear to me what is
installed on that system as it seems to have a gssapi_ext.h but not the
credential store feature, which would seem to indicate a decade+ old
version of MIT Kerberos or some other GSSAPI library which hasn't kept
pace, at the least, with ongoing Kerberos development.

As they recently also dropped GSSAPI support for OpenSSH for OmniOS[1],
it seems unlikely to be an issue for them to remove it for PG too.

Thanks,

Stephen

[1]: https://github.com/omniosorg/omnios-build/blob/r151038/doc/ReleaseNotes.md


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-13 Thread Stephen Frost
Greetings,

* Jonathan S. Katz (jk...@postgresql.org) wrote:
> On 4/12/23 12:22 PM, Stephen Frost wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > Stephen Frost  writes:
> > > > Updated patch set attached.
> > > 
> > > LGTM
> > 
> > Great, thanks.
> > 
> > I cleaned up the commit messages a bit more and added links to the
> > discussion.  If there isn't anything more then I'll plan to push these
> > later today or tomorrow.
> 
> Great -- thanks for your attention to this. I'm glad we have an opportunity
> to de-revert (devert?).

Pushed, thanks again to everyone.

I'll monitor the buildfarm and assuming there isn't anything unexpected
then I'll mark the open item as resolved now.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: longfin missing gssapi_ext.h

2023-04-12 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Updated patch set attached.
> 
> LGTM

Great, thanks.

I cleaned up the commit messages a bit more and added links to the
discussion.  If there isn't anything more then I'll plan to push these
later today or tomorrow.

Thanks again!

Stephen
From a73e19808c8cb0e5075e6c86ba0d8d29538b1eeb Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 12 Apr 2023 09:33:39 -0400
Subject: [PATCH 1/2] De-Revert "Add support for Kerberos credential
 delegation"

This reverts commit 3d03b24c3 (Revert Add support for Kerberos
credential delegation) which was committed on the grounds of concern
about portability, but on further review and discussion, it's clear that
we are better off explicitly requiring MIT Kerberos as that appears to
be the only GSSAPI library currently that's under proper maintenance
and ongoing development.  The API used for storing credentials was added
to MIT Kerberos over a decade ago while for the other libraries which
appear to be mainly based on Heimdal, which exists explicitly to be a
re-implementation of MIT Kerberos, the API never made it to a released
version (even though it was added to the Heimdal git repo over 5 years
ago..).

This post-feature-freeze change was approved by the RMT.

Discussion: https://postgr.es/m/ZDDO6jaESKaBgej0%40tamriel.snowman.net
---
 contrib/dblink/dblink.c   | 127 ---
 contrib/dblink/expected/dblink.out|   4 +-
 contrib/postgres_fdw/connection.c |  72 +++-
 .../postgres_fdw/expected/postgres_fdw.out|  19 +-
 contrib/postgres_fdw/option.c |   6 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |   3 +-
 doc/src/sgml/config.sgml  |  17 +
 doc/src/sgml/dblink.sgml  |   5 +-
 doc/src/sgml/libpq.sgml   |  41 +++
 doc/src/sgml/monitoring.sgml  |   9 +
 doc/src/sgml/postgres-fdw.sgml|   7 +-
 src/backend/catalog/system_views.sql  |   3 +-
 src/backend/foreign/foreign.c |   1 +
 src/backend/libpq/auth.c  |  13 +-
 src/backend/libpq/be-gssapi-common.c  |  53 +++
 src/backend/libpq/be-secure-gssapi.c  |  26 +-
 src/backend/utils/activity/backend_status.c   |   1 +
 src/backend/utils/adt/pgstatfuncs.c   |  21 +-
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/libpq-be.h  |   2 +
 src/include/utils/backend_status.h|   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-auth.c|  15 +-
 src/interfaces/libpq/fe-connect.c |  17 +
 src/interfaces/libpq/fe-secure-gssapi.c   |  23 +-
 src/interfaces/libpq/libpq-fe.h   |   1 +
 src/interfaces/libpq/libpq-int.h  |   2 +
 src/test/kerberos/Makefile|   3 +
 src/test/kerberos/t/001_auth.pl   | 331 --
 src/test/perl/PostgreSQL/Test/Utils.pm|  27 ++
 src/test/regress/expected/rules.out   |  11 +-
 36 files changed, 755 insertions(+), 136 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 78a8bcee6e..55f75eff36 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -48,6 +48,7 @@
 #include "funcapi.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be.h"
 #include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -111,7 +112,8 @@ static HeapTuple get_tuple_of_interest(Relation rel, int *pkattnums, int pknumat
 static Relation get_rel_from_relname(text *relname_text, LOCKMODE lockmode, AclMode aclmode);
 static char *generate_relation_name(Relation rel);
 static void dblink_connstr_check(const char *connstr);
-static void dblink_security_check(PGconn *conn, remoteConn *rconn);
+static bool dblink_connstr_has_pw(const char *connstr);
+static void dblink_security_check(PGconn *conn, remoteConn *rconn, const char *connstr);
 static void dblink_res_error(PGconn *conn, const char *conname, PGresult *res,
 			 bool fail, const char *fmt,...) pg_attribute_printf(5, 6);
 static char *get_connect_string(const char *servername);
@@ -213,7 +215,7 @@ dblink_get_conn(char *conname_or_str,
 	 errmsg("could not establish connection"),
 	 errdetail_internal("%s", msg)));
 		}
-		dblink_security_check(conn, rconn);
+		dblink_security_check(conn, rconn, connstr);
 		if (PQclientEncoding(conn) != GetDatabaseEncoding())
 			PQsetCli
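The hunk above declares a new `dblink_connstr_has_pw` helper, which presumably scans the connection string for an explicitly supplied password. A rough, self-contained Python sketch of that idea (the function name and the naive whitespace-split `keyword=value` parser are stand-ins for illustration; libpq's real conninfo parsing also handles quoting, escapes, and URI syntax):

```python
def connstr_has_pw(connstr: str) -> bool:
    """Return True if the connection string carries a non-empty
    password option.  Naive keyword=value split; a stand-in for
    libpq's real conninfo parser."""
    for item in connstr.split():
        key, sep, val = item.partition("=")
        if sep and key == "password" and val != "":
            return True
    return False
```

The point of such a check in dblink is to distinguish connections authenticated by an explicit password from those riding on the server's own credentials, which is what the security check guards against.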

Re: longfin missing gssapi_ext.h

2023-04-12 Thread Stephen Frost
Greetings,

* Daniel Gustafsson (dan...@yesql.se) wrote:
> > On 12 Apr 2023, at 16:33, Stephen Frost  wrote:
> > Sure, reworked that way and attached.
> 
> While not changed in this hunk, does the comment regarding Heimdal still 
> apply?
> 
> @@ -918,6 +919,7 @@ pg_GSS_recvauth(Port *port)
>   int mtype;
>   StringInfoData buf;
>   gss_buffer_desc gbuf;
> + gss_cred_id_t delegated_creds;
>  
>   /*
>* Use the configured keytab, if there is one.  Unfortunately, Heimdal

Good catch.  No, it doesn't.  I'm not anxious to actually change that
code at this point but we could certainly consider changing it in the
future.  I'll update this comment (and the identical one in
secure_open_gssapi) accordingly.

Thanks!

Stephen




Re: longfin missing gssapi_ext.h

2023-04-12 Thread Stephen Frost
Greetings,

* Jonathan S. Katz (jk...@postgresql.org) wrote:
> On 4/12/23 10:33 AM, Stephen Frost wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > Stephen Frost  writes:
> > > > Understood.  Please find attached the updated patch with changes to the
> > > > commit message to indicate that we now require MIT Kerberos, an
> > > > additional explicit check for gssapi_ext.h in configure.ac/configure,
> > > > along with updated documentation explicitly saying we require MIT
> > > > Kerberos for GSSAPI support.
> > > 
> > > Um ... could you package this as a straight un-revert of the
> > > previous commit, then a delta patch?  Would be easier to review.
> > 
> > Sure, reworked that way and attached.
> 
> Docs read well. A few questions/comments:
> 
> * On [1] -- do we want to add a note that it's not just Kerberos, but MIT
> Kerberos?

Yes, makes sense, updated.

> * On [2] -- we mention "kadmin tool of MIT-compatible Kerberos 5" which
> AIUI is still technically correct, but do we want to drop the "-compatible?"
> (precedent in [3])

Yup, cleaned that up also.

Updated patch set attached.

Thanks!

Stephen
From 9acb2ef2e6a35544e28d690978ba8d5e8c062f7e Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 12 Apr 2023 09:33:39 -0400
Subject: [PATCH 1/2] Revert "Revert "Add support for Kerberos credential
 delegation""

This reverts commit 3d03b24c3 (Add support for Kerberos credential
delegation) which was reverted on the grounds of concern about
portability, but on further review and discussion, it's clear that
we are better off simply explicitly requiring MIT Kerberos as that seems
to be the only GSSAPI library currently that's under proper maintenance
and ongoing development.  The API used for storing credentials was added
to MIT Kerberos over a decade ago while for the other libraries which
appear to be mainly based on Heimdal, which exists explicitly to be a
re-implementation of MIT Kerberos, the API never made it to a released
version (even though it was added to the Heimdal git repo over 5 years
ago..).

This post-feature-freeze change was approved by the RMT.

Discussion: https://postgr.es/m/ZDDO6jaESKaBgej0%40tamriel.snowman.net
---
 contrib/dblink/dblink.c   | 127 ---
 contrib/dblink/expected/dblink.out|   4 +-
 contrib/postgres_fdw/connection.c |  72 +++-
 .../postgres_fdw/expected/postgres_fdw.out|  19 +-
 contrib/postgres_fdw/option.c |   6 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |   3 +-
 doc/src/sgml/config.sgml  |  17 +
 doc/src/sgml/dblink.sgml  |   5 +-
 doc/src/sgml/libpq.sgml   |  41 +++
 doc/src/sgml/monitoring.sgml  |   9 +
 doc/src/sgml/postgres-fdw.sgml|   7 +-
 src/backend/catalog/system_views.sql  |   3 +-
 src/backend/foreign/foreign.c |   1 +
 src/backend/libpq/auth.c  |  13 +-
 src/backend/libpq/be-gssapi-common.c  |  53 +++
 src/backend/libpq/be-secure-gssapi.c  |  26 +-
 src/backend/utils/activity/backend_status.c   |   1 +
 src/backend/utils/adt/pgstatfuncs.c   |  21 +-
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/libpq-be.h  |   2 +
 src/include/utils/backend_status.h|   1 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-auth.c|  15 +-
 src/interfaces/libpq/fe-connect.c |  17 +
 src/interfaces/libpq/fe-secure-gssapi.c   |  23 +-
 src/interfaces/libpq/libpq-fe.h   |   1 +
 src/interfaces/libpq/libpq-int.h  |   2 +
 src/test/kerberos/Makefile|   3 +
 src/test/kerberos/t/001_auth.pl   | 331 --
 src/test/perl/PostgreSQL/Test/Utils.pm|  27 ++
 src/test/regress/expected/rules.out   |  11 +-
 36 files changed, 755 insertions(+), 136 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 78a8bcee6e..55f75eff36 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -48,6 +48,7 @@
 #include "funcapi.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be.h"
 #include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -111,7 +112,8 @@ static HeapTuple get_tuple_of_interest(Relation rel, int *pkattnu

Re: longfin missing gssapi_ext.h

2023-04-12 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Understood.  Please find attached the updated patch with changes to the
> > commit message to indicate that we now require MIT Kerberos, an
> > additional explicit check for gssapi_ext.h in configure.ac/configure,
> > along with updated documentation explicitly saying we require MIT
> > Kerberos for GSSAPI support.
> 
> Um ... could you package this as a straight un-revert of the
> previous commit, then a delta patch?  Would be easier to review.

Sure, reworked that way and attached.

Thanks,

Stephen

Re: longfin missing gssapi_ext.h

2023-04-11 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> Greetings,
> 
> * Justin Pryzby (pry...@telsasoft.com) wrote:
> > >  configure |  27 ++
> > >  configure.ac  |   2 +
> > 
> > Does meson.build need the corresponding change ?
> 
> Ah, yes, presumably.

No, more like attached actually.  Picks up on the dependency properly
with this when I ran meson/ninja, at least.

I'll include this then (along with any other suggestions, of course).

Thanks!

Stephen
diff --git a/meson.build b/meson.build
index b69aaddb1f..3405cc07ee 100644
--- a/meson.build
+++ b/meson.build
@@ -623,6 +623,16 @@ if not gssapiopt.disabled()
 have_gssapi = false
   endif
 
+  if not have_gssapi
+  elif cc.check_header('gssapi/gssapi_ext.h', dependencies: gssapi, required: false,
+  args: test_c_args, include_directories: postgres_inc)
+cdata.set('HAVE_GSSAPI_GSSAPI_EXT_H', 1)
+  elif cc.check_header('gssapi_ext.h', args: test_c_args, dependencies: gssapi, required: gssapiopt)
+cdata.set('HAVE_GSSAPI_EXT_H', 1)
+  else
+have_gssapi = false
+  endif
+
   if not have_gssapi
   elif cc.has_function('gss_init_sec_context', dependencies: gssapi,
   args: test_c_args, include_directories: postgres_inc)
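The elif chain above implements a two-step probe: prefer the namespaced `gssapi/gssapi_ext.h`, fall back to a bare `gssapi_ext.h`, and disable GSSAPI only if neither header is found. The same control flow, sketched in Python with a hypothetical `have_header()` callback standing in for meson's `cc.check_header()`:

```python
def probe_gssapi_ext(have_gssapi, have_header):
    """Mirror the meson elif chain: try the namespaced header first,
    then the bare one, else give up on GSSAPI entirely.  Returns the
    updated have_gssapi flag and the config defines to set."""
    defines = {}
    if not have_gssapi:
        pass  # an earlier probe already failed; nothing more to do
    elif have_header("gssapi/gssapi_ext.h"):
        defines["HAVE_GSSAPI_GSSAPI_EXT_H"] = 1
    elif have_header("gssapi_ext.h"):
        defines["HAVE_GSSAPI_EXT_H"] = 1
    else:
        have_gssapi = False  # neither header present
    return have_gssapi, defines
```

The leading empty `if not have_gssapi` branch is the meson idiom for short-circuiting the remaining probes once an earlier one has failed, since meson has no early return inside a build file.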



