from:"Stephen Frost"

Re: Set log_lock_waits=on by default

2024-02-07 Thread Stephen Frost

Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2024-02-06 at 12:29 -0500, Tom Lane wrote:
> > Nathan Bossart  writes:
> > > It looks like there are two ideas:
> > > [...]
> > > * Set log_lock_waits on by default.  The folks on this thread seem to
> > >   support this idea, but given the lively discussion for enabling
> > >   log_checkpoints by default [0], I'm hesitant to commit something like
> > >   this without further community discussion.
> > 
> > I was, and remain, of the opinion that that was a bad idea that
> > we'll eventually revert, just like we previously got rid of most
> > inessential log chatter in the default configuration.  So I doubt
> > you'll have much trouble guessing my opinion of this one.  I think
> > the default logging configuration should be chosen with the
> > understanding that nobody ever looks at the logs of most
> > installations, and we should be more worried about their disk space
> > consumption than anything else.  Admittedly, log_lock_waits is less
> > bad than log_checkpoints, because no such events will occur in a
> > well-tuned configuration ... but still, it's going to be useless
> > chatter in the average installation.
> 
> Unsurprisingly, I want to argue against that.

I tend to agree with this position- log_checkpoints being on has been a
recommended configuration for a very long time and is valuable
information to have about what's been happening when someone does go and
look at the log.

Having log_lock_waits on by default is likely to be less noisy and even
more useful for going back in time to figure out what happened.

> You say that it is less bad than "log_checkpoints = on", and I agree.
> I can't remember seeing any complaints by users about
> "log_checkpoints", and I predict that the complaints about
> "log_lock_waits = on" will be about as loud.

Yeah, agreed.

> I am all for avoiding useless chatter in the log.  In my personal
> experience, that is usually "database typo does not exist" and
> constraint violation errors.  I always recommend people to enable
> "log_lock_waits", and so far I have not seen it spam the logs.

I really wish we could separate out the messages about typos and
constraint violations from these logs about processes waiting a long
time for locks or about checkpoints or even PANIC's or other really
important messages.  That said, that's a different problem and not
something this change needs to concern itself with.

As for if we want to separate out log_lock_waits from deadlock_timeout-
no, I don't think we do, for the reasons that Tom mentioned.  I don't
see it as necessary either for enabling log_lock_waits by default.
Waiting deadlock_timeout amount of time for a lock certainly is a
problem already and once we've waited that amount of time, I can't see
the time spent logging about it as being a serious issue.

+1 for simply enabling log_lock_waits by default.

All that said ... if we did come up with a nice way to separate out the
timing for deadlock_timeout and log_lock_waits, I wouldn't necessarily
be against it.  Perhaps one approach to that would be to set just one
timer but have it be the lower of the two, and then set another when
that fires (if there's more waiting to do) and then catch it when it
happens...  Again, I'd view this as some independent improvement though
and not a requirement for just enabling log_lock_waits by default.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Set log_lock_waits=on by default

2024-02-07 Thread Stephen Frost

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> >> On Tue, 2024-02-06 at 12:29 -0500, Tom Lane wrote:
> >>> I was, and remain, of the opinion that that was a bad idea that
> >>> we'll eventually revert, just like we previously got rid of most
> >>> inessential log chatter in the default configuration.
> 
> >> Unsurprisingly, I want to argue against that.
> 
> > I tend to agree with this position- log_checkpoints being on has been a
> > recommended configuration for a very long time and is valuable
> > information to have about what's been happening when someone does go and
> > look at the log.
> 
> We turned on default log_checkpoints in v15, which means that behavior
> has been in the field for about sixteen months.  I don't think that
> that gives it the status of a settled issue; my bet is that most
> users still have not seen it.

Apologies for not being clear- log_checkpoints being on has been a
configuration setting that I (and many others I've run into) have been
recommending since as far back as I can remember.

I was not referring to the change in the default.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-26 Thread Stephen Frost

Greetings,

* Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> I have a small documentation patch to the HOT updates page
> to add references
> to summary (BRIN) indexes not blocking HOT updates
> ,
> committed 19d8e2308b.

Sounds good to me, though the attached patch didn't want to apply, and
it isn't immediately clear to me why, but perhaps you could try saving
the patch from a different editor and re-sending it?

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Control your disk usage in PG: Introduction to Disk Quota Extension

2024-02-26 Thread Stephen Frost

Greetings,

* Xing Guo (higuox...@gmail.com) wrote:
> Looks that many hackers are happy with the original patch except that
> the original patch needs a simple rebase, though it has been 3 years.

I'm not completely against the idea of adding these hooks, but I have to
say that it's unfortunate to imagine having to use an extension for
something like quotas as it's really something that would ideally be in
core.

> Shall we push forward this patch so that it can benefit extensions not
> only diskquota?

Would be great to hear about other use-cases for these hooks, which
would also help us be comfortable that these are the right hooks to
introduce with the correct arguments.  What are the other extensions
that you're referring to here..?

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-26 Thread Stephen Frost

Greetings,

* Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> > On Feb 26, 2024, at 4:21 PM, Stephen Frost  wrote:
> > * Elizabeth Christensen (elizabeth.christen...@crunchydata.com) wrote:
> >> I have a small documentation patch to the HOT updates page
> >> <https://www.postgresql.org/docs/current/storage-hot.html>to add references
> >> to summary (BRIN) indexes not blocking HOT updates
> >> <https://www.postgresql.org/message-id/cafp7qwpmrgcdaqumn7onn9hjrj3u4x3zrxdgft0k5g2jwvn...@mail.gmail.com>,
> >> committed 19d8e2308b.
> > 
> > Sounds good to me, though the attached patch didn't want to apply, and
> > it isn't immediately clear to me why, but perhaps you could try saving
> > the patch from a different editor and re-sending it?
> 
> Thanks Stephen, attempt #2 here. 

Here's an updated patch which tries to improve on the wording a bit by
having it be a bit more consistent.  Would certainly welcome feedback on
it though, of course.  This is a tricky bit of language to write and
a complex optimization to explain.

Thanks!

Stephen
From bb88237d1f807572a496ff2a267ab56bef61ac8e Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 26 Feb 2024 17:17:54 -0500
Subject: [PATCH] docs: Update HOT update docs for 19d8e2308b

Commit 19d8e2308b changed when the HOT update optimization is possible
but neglected to update the Heap-Only Tuples (HOT) documentation.  This
patch updates that documentation accordingly.

Author: Elizabeth Christensen
Reviewed-By: Stephen Frost
Backpatch-through: 16
Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com
---
 doc/src/sgml/storage.sgml | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3ea4e5526d..d6c53a8d25 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -1097,8 +1097,11 @@ data. Empty in ordinary tables.
   

 
- The update does not modify any columns referenced by the table's
- indexes, including expression and partial indexes.
+ The update only modifies columns which are not referenced by the
+ table's indexes or are only referenced by summarizing indexes (such
+ as BRIN) and does not update any columns
+ referenced by the table's non-summarizing indexes, including
+ expression and partial non-summarizing indexes.
  


@@ -1114,7 +1117,8 @@ data. Empty in ordinary tables.
   

 
- New index entries are not needed to represent updated rows.
+ New index entries are not needed to represent updated rows, however,
+ summary indexes may still need to be updated.
 


@@ -1130,11 +1134,10 @@ data. Empty in ordinary tables.
  
 
  
-  In summary, heap-only tuple updates can only be created
-  if columns used by indexes are not updated.  You can
-  increase the likelihood of sufficient page space for
-  HOT updates by decreasing a table's fillfactor.
+  In summary, heap-only tuple updates can only be created if columns used by
+  non-summarizing indexes are not updated. You can increase the likelihood of
+  sufficient page space for HOT updates by decreasing
+  a table's fillfactor.
   If you don't, HOT updates will still happen because
   new rows will naturally migrate to new pages and existing pages with
   sufficient free space for new row versions.  The system view 

signature.asc
Description: PGP signature

Re: [PATCH] updates to docs about HOT updates for BRIN

2024-02-27 Thread Stephen Frost

Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> On 2024-Feb-26, Stephen Frost wrote:
> > Here's an updated patch which tries to improve on the wording a bit by
> > having it be a bit more consistent.  Would certainly welcome feedback on
> > it though, of course.  This is a tricky bit of language to write and
> > a complex optimization to explain.
> 
> Should we try to explain what is a "summarizing" index is?  Right now
> the only way to know is to look at the source code or attach a debugger
> and see if IndexAmRoutine->amsummarizing is true.  Maybe we can just say
> "currently the only in-core summarizing index is BRIN" somewhere in the
> page.  (The patch's proposal to say "... such as BRIN" strikes me as too
> vague.)

Not sure about explaining what one is, but I'd be fine adding that
language.  I was disappointed to see that there's no way to figure out
the value of amsummarizing for an access method other than looking at
the code.  Not as part of this specific patch, but I'd generally support
having a way to that information at the SQL level (or perhaps everything
from IndexAmRoutine?).

Attached is an updated patch which drops the 'such as' and adds a
sentence mentioning that BRIN is the only in-core summarizing index.

Thanks!

Stephen
From 131f83868b947b379e57ea3dad0acf5a4f95bca8 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 26 Feb 2024 17:17:54 -0500
Subject: [PATCH] docs: Update HOT update docs for 19d8e2308b

Commit 19d8e2308b changed when the HOT update optimization is possible
but neglected to update the Heap-Only Tuples (HOT) documentation.  This
patch updates that documentation accordingly.

Author: Elizabeth Christensen
Reviewed-By: Stephen Frost, Alvaro Herrera
Backpatch-through: 16
Discussion: https://postgr.es/m/CABoUFXRjisr58Ct_3VsFEdQx+fJeQTWTdJnM7XAp=8mubto...@mail.gmail.com
---
 doc/src/sgml/storage.sgml | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3ea4e5526d..f2bb0283ae 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -1097,8 +1097,13 @@ data. Empty in ordinary tables.
   

 
- The update does not modify any columns referenced by the table's
- indexes, including expression and partial indexes.
+ The update only modifies columns which are not referenced by the
+ table's indexes or are only referenced by summarizing indexes and
+ does not update any columns referenced by the table's
+ non-summarizing indexes, including expression and partial
+ non-summarizing indexes.  The only summarizing index in the core
+ PostgreSQL distribution is
+ BRIN.
  


@@ -1114,7 +1119,8 @@ data. Empty in ordinary tables.
   

 
- New index entries are not needed to represent updated rows.
+ New index entries are not needed to represent updated rows, however,
+ summary indexes may still need to be updated.
 


@@ -1130,11 +1136,10 @@ data. Empty in ordinary tables.
  
 
  
-  In summary, heap-only tuple updates can only be created
-  if columns used by indexes are not updated.  You can
-  increase the likelihood of sufficient page space for
-  HOT updates by decreasing a table's fillfactor.
+  In summary, heap-only tuple updates can only be created if columns used by
+  non-summarizing indexes are not updated. You can increase the likelihood of
+  sufficient page space for HOT updates by decreasing
+  a table's fillfactor.
   If you don't, HOT updates will still happen because
   new rows will naturally migrate to new pages and existing pages with
   sufficient free space for new row versions.  The system view 

signature.asc
Description: PGP signature

Re: Atomic ops for unlogged LSN

2024-02-29 Thread Stephen Frost

Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> Here is a new version of the patch that uses the new atomic read/write
> functions with full barriers that were added in commit bd5132d.  Thoughts?

Saw that commit go in- glad to see it.  Thanks for updating this patch
too.  The changes look good to me.

Thanks again,

Stephen

signature.asc
Description: PGP signature

Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost

Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker 
> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.

> v6 posted below.
> 
> Changes:
> 
> - Additional documentation about the overall process.
> - Rewording of SGML docs.
> - removed a fair number of columns from the transformation queries.
> - enabled require_match_oids in extended statistics, but I'm having my
> doubts about the value of that.
> - moved stats extraction functions to an fe_utils file stats_export.c that
> will be used by both pg_export_stats and pg_dump.
> - pg_export_stats now generates SQL statements rather than a tsv, and has
> boolean flags to set the validate and require_match_oids parameters in the
> calls to pg_import_(rel|ext)_stats.
> - pg_import_stats is gone, as importing can now be done with psql.

Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Part of this process would include mapping from OIDs/attrnum's to names
on the source side and then from those names to the appropriate
OIDs/attrnum's on the destination side.

As this code would be used by both pg_dump and the postgres_fdw, it
seems logical that it would go into the common library.  Further, it
would make sense to have this code be able to handle multiple major
versions for the foreign side, such as how postgres_fdw and pg_dump
already do.

In terms of working to ensure that newer versions support loading from
older dumps (that is, that v18 would be able to load a dump file created
by a v17 pg_dump against a v17 server in the face of changes having been
made to the statistics system in v18), we could have the functions take
a version parameter (to handle cases where the data structure is the
same but the contents have to be handled differently), use overloaded
functions, or have version-specific names for the functions.  I'm also
generally supportive of the idea that we, perhaps initially, only
support dumping/loading stats with pg_dump when in binary-upgrade mode,
which removes our need to be concerned with this (perhaps that would be
a good v1 of this feature?) as the version of pg_dump needs to match
that of pg_upgrade and the destination server for various other reasons.
Including a switch to exclude stats on restore might also be an
acceptable answer, or even simply excluding them by default when going
between major versions except in binary-upgrade mode.

Along those same lines when it comes to a 'v1', I'd say that we may wish
to consider excluding extended statistics, which I am fairly confident
Corey's heard a number of times previously already but thought I would
add my own support for that.  To the extent that we do want to make
extended stats work down the road, we should probably have some
pre-patches to flush out the missing _in/_recv functions for those types
which don't have them today- and that would include modifying the _out
of those types to use names instead of OIDs/attrnums.  In thinking about
this, I was reviewing specifically pg_dependencies.  To the extent that
there are people who depend on the current output, I would think that
they'd actually appreciate this change.

I don't generally feel like we need to be checking that the OIDs between
the old server and the new server match- I appreciate that that should
be the case in a binary-upgrade situation but it still feels unnecessary
and complicated and clutters up the output and the function calls.

Overall, I definitely think this is a good project to work on as it's an
often, rightfully, complained about issue when it comes to pg_upgrade
and the amount of downtime required for it before the upgraded system
can be reasonably used again.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost

Greetings,

On Thu, Feb 29, 2024 at 17:48 Corey Huinker  wrote:

> Having looked through this thread and discussed a bit with Corey
>> off-line, the approach that Tom laid out up-thread seems like it would
>> make the most sense overall- that is, eliminate the JSON bits and the
>> SPI and instead export the stats data by running queries from the new
>> version of pg_dump/server (in the FDW case) against the old server
>> with the intelligence of how to transform the data into the format
>> needed for the current pg_dump/server to accept, through function calls
>> where the function calls generally map up to the rows/information being
>> updated- a call to update the information in pg_class for each relation
>> and then a call for each attribute to update the information in
>> pg_statistic.
>>
>
> Thanks for the excellent summary of our conversation, though I do add that
> we discussed a problem with per-attribute functions: each function would be
> acquiring locks on both the relation (so it doesn't go away) and
> pg_statistic, and that lock thrashing would add up. Whether that overhead
> is judged significant or not is up for discussion. If it is significant, it
> makes sense to package up all the attributes into one call, passing in an
> array of some new pg_statistic-esque special typethe very issue that
> sent me down the JSON path.
>
> I certainly see the flexibility in having a per-attribute functions, but
> am concerned about non-binary-upgrade situations where the attnums won't
> line up, and if we're passing them by name then the function has dig around
> looking for the right matching attnum, and that's overhead too. In the
> whole-table approach, we just iterate over the attributes that exist, and
> find the matching parameter row.
>

That’s certainly a fair point and my initial reaction (which could
certainly be wrong) is that it’s unlikely to be an issue- but also, if you
feel you could make it work with an array and passing all the attribute
info in with one call, which I suspect would be possible but just a bit
more complex to build, then sure, go for it. If it ends up being overly
unwieldy then perhaps the  per-attribute call would be better and we could
perhaps acquire the lock before the function calls..?  Doing a check to see
if we have already locked it would be cheaper than trying to acquire a new
lock, I’m fairly sure.

Also per our prior discussion- this makes sense to include in post-data
section, imv, and also because then we have the indexes we may wish to load
stats for, but further that also means it’ll be in the paralleliziable part
of the process, making me a bit less concerned overall about the individual
timing.

Thanks!

Stephen

>

Re: Statistics Import and Export

2024-03-01 Thread Stephen Frost

Greetings,

On Fri, Mar 1, 2024 at 12:14 Nathan Bossart 
wrote:

> On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
> >> That’s certainly a fair point and my initial reaction (which could
> >> certainly be wrong) is that it’s unlikely to be an issue- but also, if
> you
> >> feel you could make it work with an array and passing all the attribute
> >> info in with one call, which I suspect would be possible but just a bit
> >> more complex to build, then sure, go for it. If it ends up being overly
> >> unwieldy then perhaps the  per-attribute call would be better and we
> could
> >> perhaps acquire the lock before the function calls..?  Doing a check to
> see
> >> if we have already locked it would be cheaper than trying to acquire a
> new
> >> lock, I’m fairly sure.
> >
> > Well the do_analyze() code was already ok with acquiring the lock once
> for
> > non-inherited stats and again for inherited stats, so the locks were
> > already not the end of the world. However, that's at most a 2x of the
> > locking required, and this would natts * x, quite a bit more. Having the
> > procedures check for a pre-existing lock seems like a good compromise.
>
> I think this is a reasonable starting point.  If the benchmarks show that
> the locking is a problem, we can reevaluate, but otherwise IMHO we should
> try to keep it as simple/flexible as possible.


Yeah, this was my general feeling as well.  If it does become an issue, it
certainly seems like we would have ways to improve it in the future. Even
with this locking it is surely going to be better than having to re-analyze
the entire database which is where we are at now.

Thanks,

Stephen

>

Re: [PATCH] updates to docs about HOT updates for BRIN

2024-03-05 Thread Stephen Frost

Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2024-02-27 at 09:48 -0500, Stephen Frost wrote:
> > Attached is an updated patch which drops the 'such as' and adds a
> > sentence mentioning that BRIN is the only in-core summarizing index.
> 
> The original patch reads more clearly to me. In v4, summarizing (the
> exception) feels like it's dominating the description.
> 
> Also, is it standard practice to backport this kind of doc update? I
> ordinarily wouldn't be inclined to do so, but v4 seems targeted at 16
> as well.

I do think this change should be back-ported to when the change
happened, otherwise the documentation won't reflect what's in the
product for that version...

> Attached my own suggested wording that hopefully addresses Stephen and
> Alvaro's concerns. I agree that it's tricky to write so I took a more
> minimalist approach:

>  * I got rid of the "In summary" sentence because (a) it's confusing
> now that we're talking about summarizing indexes; and (b) it's not
> summarizing anything, it's just redundant.

>  * I removed the mention partial or expression indexes. It's a bit
> redundant and doesn't seem especially helpful in this context.

Just to point it out- the "In summary" did provide a bit of a summary,
before the 'partial or expression indexes' bit was removed.  That said,
I tend to still agree with these changes as I feel that users will
generally be able to infer that this applies to partial and expression
indexes without it having to be spelled out to them.

> If this is agreeable I can commit it.

Great, thanks!

Stephen

signature.asc
Description: PGP signature

Re: Statistics Import and Export

2024-03-06 Thread Stephen Frost

Greetings,

On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Fri, 1 Mar 2024, 04:55 Corey Huinker,  wrote:
> >> Also per our prior discussion- this makes sense to include in post-data
> section, imv, and also because then we have the indexes we may wish to load
> stats for, but further that also means it’ll be in the paralleliziable part
> of the process, making me a bit less concerned overall about the individual
> timing.
> >
> >
> > The ability to parallelize is pretty persuasive. But is that
> per-statement parallelization or do we get transaction blocks? i.e. if we
> ended up importing stats like this:
> >
> > BEGIN;
> > LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> > LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> > SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> > SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> > SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
>
> How well would this simplify to the following:
>
> SELECT pg_import_statistic('schema.relation', attname, ...)
> FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Using a VALUES construct for this does seem like it might make it cleaner,
so +1 for investigating that idea.

Or even just one VALUES for the whole statistics loading?

I don’t think we’d want to go beyond one relation at a time as then it can
be parallelized, we won’t be trying to lock a whole bunch of objects at
once, and any failures would only impact that one relation’s stats load.

I suspect the main issue with combining this into one statement
> (transaction) is that failure to load one column's statistics implies
> you'll have to redo all the other statistics (or fail to load the
> statistics at all), which may be problematic at the scale of thousands
> of relations with tens of columns each.

I’m pretty skeptical that “stats fail to load and lead to a failed
transaction” is a likely scenario that we have to spend a lot of effort
on.  I’m pretty bullish on the idea that this simply won’t happen except in
very exceptional cases under a pg_upgrade (where the pg_dump that’s used
must match the target server version) and where it happens under a pg_dump
it’ll be because it’s an older pg_dump’s output and the answer will likely
need to be “you’re using a pg_dump file generated using an older version of
pg_dump and need to exclude stats entirely from the load and instead run
analyze on the data after loading it.”

What are the cases where we would be seeing stats reloads failing where it
would make sense to re-try on a subset of columns, or just generally, if we
know that the pg_dump version matches the target server version?

Thanks!

Stephen

>

Re: Statistics Import and Export

2024-03-06 Thread Stephen Frost

Greetings,

* Matthias van de Meent (boekewurm+postg...@gmail.com) wrote:
> On Wed, 6 Mar 2024 at 11:33, Stephen Frost  wrote:
> > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent 
> >  wrote:
> >> Or even just one VALUES for the whole statistics loading?
> > I don’t think we’d want to go beyond one relation at a time as then it can 
> > be parallelized, we won’t be trying to lock a whole bunch of objects at 
> > once, and any failures would only impact that one relation’s stats load.
> 
> That also makes sense.

Great, thanks.

> >> I suspect the main issue with combining this into one statement
> >> (transaction) is that failure to load one column's statistics implies
> >> you'll have to redo all the other statistics (or fail to load the
> >> statistics at all), which may be problematic at the scale of thousands
> >> of relations with tens of columns each.
> >
> >
> > I’m pretty skeptical that “stats fail to load and lead to a failed 
> > transaction” is a likely scenario that we have to spend a lot of effort on.
> 
> Agreed on the "don't have to spend a lot of time on it", but I'm not
> so sure on the "unlikely" part while the autovacuum deamon is
> involved, specifically for non-upgrade pg_restore. I imagine (haven't
> checked) that autoanalyze is disabled during pg_upgrade, but
> pg_restore doesn't do that, while it would have to be able to restore
> statistics of a table if it is included in the dump (and the version
> matches).

Even if autovacuum was running and it kicked off an auto-analyze of the
relation at just the time that we were trying to load the stats, there
would be appropriate locking happening to keep them from causing an
outright ERROR and transaction failure, or if not, that's a lack of
locking and should be fixed.  With the per-attribute-function-call
approach, that could lead to a situation where some stats are from the
auto-analyze and some are from the stats being loaded but I'm not sure
if that's a big concern or not.

For users of this, I would think we'd generally encourage them to
disable autovacuum on the tables they're loading as otherwise they'll
end up with the stats going back to whatever an auto-analyze ends up
finding.  That may be fine in some cases, but not in others.

A couple questions to think about though: Should pg_dump explicitly ask
autovacuum to ignore these tables while we're loading them? 
Should these functions only perform a load when there aren't any
existing stats?  Should the latter be an argument to the functions to
allow the caller to decide?

> > What are the cases where we would be seeing stats reloads failing where it 
> > would make sense to re-try on a subset of columns, or just generally, if we 
> > know that the pg_dump version matches the target server version?
> 
> Last time I checked, pg_restore's default is to load data on a
> row-by-row basis without --single-transaction or --exit-on-error. Of
> course, pg_upgrade uses it's own set of flags, but if a user is
> restoring stats with  pg_restore, I suspect they'd rather have some
> column's stats loaded than no stats at all; so I would assume this
> requires one separate pg_import_pg_statistic()-transaction for every
> column.

Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Proposal to add page headers to SLRU pages

2024-03-06 Thread Stephen Frost

Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> I suppose this is important to do if we ever want to move SLRUs into
> shared buffers.  However, I wonder about the extra time this adds to
> pg_upgrade.  Is this something we should be concerned about?  Is there
> any measurement/estimates to tell us how long this would be?  Right now,
> if you use a cloning strategy for the data files, the upgrade should be
> pretty quick ... but the amount of data in pg_xact and pg_multixact
> could be massive, and the rewrite is likely to take considerable time.

While I definitely agree that there should be some consideration of
this concern, it feels on-par with the visibility-map rewrite which was
done previously.  Larger systems will likely have more to deal with than
smaller systems, but it's still a relatively small portion of the data
overall.

The benefit of this change, beyond just the possibility of moving them
into shared buffers some day in the future, is that this would mean that
SLRUs will have checksums (if the cluster has them enabled).  That
benefit strikes me as well worth the cost of the rewrite taking some
time and the minor loss of space due to the page header.

Would it be useful to consider parallelizing this work?  There's already
parts of pg_upgrade which can be parallelized and so this isn't,
hopefully, a big lift to add, but I'm not sure if there's enough work
being done here CPU-wise, compared to the amount of IO being done, to
have it make sense to run it in parallel.  Might be worth looking into
though, at least, as disks have gotten to be quite fast.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Provide a pg_truncate_freespacemap function

2024-03-06 Thread Stephen Frost

Greetings,

* Ronan Dunklau (ronan.dunk...@aiven.io) wrote:
> As we are currently experiencing a FSM corruption issue [1], we need to 
> rebuild FSM when we detect it. 

Ideally, we'd figure out a way to pick up on this and address it without
the user needing to intervene, however ...

> I noticed we have something to truncate a visibility map, but nothing for the 
> freespace map, so I propose the attached (liberally copied from the VM 
> counterpart) to allow to truncate a FSM without incurring downtime, as 
> currently our only options are to either VACUUM FULL the table or stop the 
> cluster and remove the FSM manually.

I agree that this would generally be a useful thing to have.

> Does that seem correct ?

Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
upgrade script, similar to what you'll find at the bottom of
pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.

Beyond that, I'd suggest a function-level comment above the definition
of the function itself (which is where we tend to put those- not at the
point where we declare the function).

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: improve ssl error code, 2147483650

2024-03-07 Thread Stephen Frost

Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 07/03/2024 02:12, David Zhang wrote:
> > The SSL_CTX_load_verify_locations function in OpenSSL will return NULL
> > if there is a system error, such as "No such file or directory" in this
> > case:
> > 
> > const char *ERR_reason_error_string(unsigned long e)
> > {
> >       ERR_STRING_DATA d, *p = NULL;
> >       unsigned long l, r;
> > 
> >       if (!RUN_ONCE(&err_string_init, do_err_strings_init)) {
> >       return NULL;
> >       }
> > 
> >       /*
> >    * ERR_reason_error_string() can't safely return system error strings,
> >    * since openssl_strerror_r() needs a buffer for thread safety, and we
> >    * haven't got one that would serve any sensible purpose.
> >    */
> >       if (ERR_SYSTEM_ERROR(e))
> >       return NULL;
> 
> That's pretty unfortunate. As typical with OpenSSL, this stuff is not very
> well documented, but I think we could do something like this in
> SSLerrmessage():
> 
> if (ERR_SYSTEM_ERROR(e))
> errreason = strerror(ERR_GET_REASON(e));
> 
> ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only
> documentation I could find was in this one obscure place in the man pages: 
> https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html.
> But as a best-effort thing, it would still be better than "SSL error code
> 2147483650".

Agreed that it doesn't seem well documented.  I was trying to figure out
what the 'right' answer here was myself and not having much success.  If
the above works, then +1 to that.

> > It would be better to perform a simple SSL file check before passing the
> > SSL file to OpenSSL APIs so that the system error can be captured and a
> > meaningful message provided to the end user.
> 
> That feels pretty ugly. I agree it would catch most of the common mistakes
> in practice, so maybe we should just hold our noses and do it anyway, if the
> above ERR_SYSTEM_ERROR() method doesn't work.

Yeah, seems better to try and handle this the OpenSSL way ... if that's
possible to do.

> It's sad that we cannot pass a file descriptor or in-memory copy of the file
> contents to those functions.

Agreed.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Stephen Frost

Greetings,

* Li, Yong (y...@ebay.com) wrote:
> > On Mar 7, 2024, at 03:09, Stephen Frost  wrote:
> > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> >> I suppose this is important to do if we ever want to move SLRUs into
> >> shared buffers.  However, I wonder about the extra time this adds to
> >> pg_upgrade.  Is this something we should be concerned about?  Is there
> >> any measurement/estimates to tell us how long this would be?  Right now,
> >> if you use a cloning strategy for the data files, the upgrade should be
> >> pretty quick ... but the amount of data in pg_xact and pg_multixact
> >> could be massive, and the rewrite is likely to take considerable time.
> > 
> > While I definitely agree that there should be some consideration of
> > this concern, it feels on-par with the visibility-map rewrite which was
> > done previously.  Larger systems will likely have more to deal with than
> > smaller systems, but it's still a relatively small portion of the data
> > overall.
> > 
> > The benefit of this change, beyond just the possibility of moving them
> > into shared buffers some day in the future, is that this would mean that
> > SLRUs will have checksums (if the cluster has them enabled).  That
> > benefit strikes me as well worth the cost of the rewrite taking some
> > time and the minor loss of space due to the page header.
> > 
> > Would it be useful to consider parallelizing this work?  There's already
> > parts of pg_upgrade which can be parallelized and so this isn't,
> > hopefully, a big lift to add, but I'm not sure if there's enough work
> > being done here CPU-wise, compared to the amount of IO being done, to
> > have it make sense to run it in parallel.  Might be worth looking into
> > though, at least, as disks have gotten to be quite fast.
> 
> Thank Alvaro and Stephen for your thoughtful comments.
> 
> I did a quick benchmark regarding pg_upgrade time, and here are the results.

> For clog, 2048 segments can host about 2 billion transactions, right at the 
> limit for wraparound.
> That’s the maximum we can have.  2048 segments are also big for pg_multixact 
> SLRUs.
> 
> Therefore, on a modern hardware, in the worst case, pg_upgrade will run for 7 
> seconds longer.

Thanks for testing!  That strikes me as perfectly reasonable and seems
unlikely that we'd get much benefit from parallelizing it, so I'd say it
makes sense to keep this code simple.

Thanks again!

Stephen


signature.asc
Description: PGP signature

Re: Statistics Import and Export

2024-03-08 Thread Stephen Frost

Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
> 
> It is my (incomplete and entirely second-hand) understanding is that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.

No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
> 
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
> 
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
> 
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>proname => 'table_to_xml', procost => '100', provolatile => 's',
>proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>  
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>proargtypes => 'int2',
>prosrc =>

Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread Stephen Frost

Greetings,

On Mon, Nov 13, 2023 at 16:53 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Mon, Nov 13, 2023 at 2:52 PM Andres Freund  wrote:
>
>>
>> > This scheme would open up space per page that would now be available for
>> > plenty of other things; the encoding in the header and the corresponding
>> > available space in the footer would seem to open up quite a few options
>> > now, no?
>>
>> Sure, if you're willing to rewrite the whole cluster to upgrade and
>> willing to
>> permanently sacrifice some data density.  If the stored data is actually
>> specific to the page - that is the place to put the data. If not, then the
>> tradeoff is much more complicated IMO.
>>
>> Of course this isn't a new problem - storing the page size on each page
>> was
>> just silly, it's never going to change across the cluster and even more
>> definitely not going to change within a single relation.
>>
>
> Crazy idea; since stored pagesize is already a fixed cost that likely
> isn't going away, what if instead of the pd_checksum field, we instead
> reinterpret pd_pagesize_version; 4 would mean "no page features", but
> anything 5 or higher could be looked up as an external page feature set,
> with storage semantics outside of the realm of the page itself (other than
> what the page features code itself needs to know); i.e,. move away from the
> on-page bitmap into a more abstract representation of features which could
> be something along the lines of what you were suggesting, including
> extension support.
>
> It seems like this could also support adding/removing features on page
> read/write as long as there was sufficient space in the reserved_page
> space; read the old feature set on page read, convert to the new feature
> set which will write out the page with the additional/changed format.
> Obviously there would be bookkeeping to be done in terms of making sure all
> pages had been converted from one format to another, but for the page level
> this would be straightforward.
>
> Just thinking aloud here...
>

In other crazy idea space … if the page didn’t have enough space to allow
for the desired features then make any insert/update actions forcibly have
to choose a different page for the new tuple, while allowing delete’s to do
their usual thing, and then when vacuum comes along and is able to clean up
the page and remove the all dead tuples, it could then enable the features
on the page that are desired…

Thanks,

Stephen

>

Re: Add recovery to pg_control and remove backup_label

2023-11-26 Thread Stephen Frost

Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 11/21/23 12:41, Andres Freund wrote:
> > Sure. They also receive a backup_label today. If an external solution 
> > forgets
> > to replace pg_control copied as part of the filesystem copy, they won't get 
> > an
> > error after the remove of backup_label, just like they don't get one today 
> > if
> > they don't put backup_label in the data directory.  Given that users don't 
> > do
> > the right thing with backup_label today, why can we rely on them doing the
> > right thing with pg_control?
> 
> I think reliable backup software does the right thing with backup_label, but
> if the user starts getting errors on recovery they the decide to remove
> backup_label. I know we can't do much about bad backup software, but we can
> at least make this a bit more resistant to user error after the fact.
> 
> It doesn't help that one of our hints suggests removing backup_label. In
> highly automated systems, the user might not even know they just restored
> from a backup. They are only in the loop because the restore failed and they
> are trying to figure out what is going wrong. When they remove backup_label
> the cluster comes up just fine. Victory!

Yup, this is exactly the issue.

> This is the scenario I've seen most often -- not the backup/restore process
> getting it wrong but the user removing backup_label on their own initiative.
> And because it yields such a positive result, at least initially, they
> remember in the future that the thing to do is to remove backup_label
> whenever they see the error.
> 
> If they only have pg_control, then their only choice is to get it right or
> run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take
> them longer to get there. Also, I think that tool make it pretty clear that
> corruption will result and the only thing to do is a logical dump and
> restore after using it.

Agreed.

> There are plenty of ways a user can mess things up. What I'd like to prevent
> is the appearance of everything being OK when in fact they have corrupted
> their cluster. That's the situation we have now with backup_label. Is this
> new solution perfect? No, but I do think it checks several boxes, and is a
> worthwhile improvement.

+1.

As for the complaint about 'operators' having issue with the changes
we've been making in this area- where are those people complaining,
exactly?  Who are they?  I feel like we keep getting this kind of
push-back in this area from folks on this list but not from actual
backup software authors; all the complaints seem to either be 
speculative or unattributed pass-through from someone else.

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes.  Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements.  Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman did that they would have spoken up.

Making a change as suggested which only helps pg_basebackup (and tools
like pgbackrest, since it's in C and can also make this particular
change) ends up leaving tools like wal-g and barman potentially still
with an easy way for users of those tools to corrupt their databases-
even though we've not heard anything from the authors of those tools
about issues with the proposed change, nor have there been a lot of
complaints from them about the prior changes to indicate that they'd
even have an issue with the more involved change.  Given the lack of
complaint about past changes, I'd certainly rather err on the side of
improved safety for users than on the side of the authors of these tools
possibly complaining.

What these changes have done is finally break things like omnipitr
completely, which hasn't been maintained in a very long time.  The
changes in v12 broke recovery with omnipitr but not backup, and folks
were trying to use omnipitr as recently as with v13[1].  Certainly
having a backup tool that only works for backup (fsvo works, anyway, as
it still used exclusive backup mode meaning that a crash during a backup
would cause the system to not come back up after...) but doesn't work
for recovery isn't exactly great and I'm glad that, now, an attempt to
use omnipitr to perform a backup will fail.  As with lots of other areas
of PG, folks need to read the release notes and potentially update their
code for new major versions.  If anything, the backup area is less of an
issue for this because the authors of the backup tools are able to make
the change (and who are often the ones pushing for these changes) and
the end-user isn't impacted at all.

Much the same can be said for wal-e, with users st

Re: Partial aggregates pushdown

2023-11-28 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Nov 27, 2023 at 4:23 PM Tom Lane  wrote:
> > Well, one of the founding principles of postgres_fdw was to be able
> > to talk to PG servers that are not of the same version as yours.
> > If we break that in the name of performance, we are going to have
> > a lot of unhappy users.  Even the ones who do get the benefit of
> > the speedup are going to be unhappy when it breaks because they
> > didn't upgrade local and remote at exactly the same time.
> 
> I agree with this.

+1.  We do want to continue to make this work- to the extent possible.
I don't think there's any problem with saying that when talking to an
older server, you don't get the same capabilities as you do when talking
to a newer server.

> > Just because we'd like to have it doesn't make the patch workable
> > in the real world.
> 
> And also with this in concept - I'd like to plan arbitrarily
> complicated queries perfectly and near-instantly, and then execute
> them at faster-than-light speed, but we can't. However, I don't
> understand the fatalism with respect to the feature at hand. As I said
> before, it's not like no other product has made this work. Sure, some
> of those products may not have the extensible system of data types
> that we do, or may not care about cross-version communication, but
> those don't seem like good enough reasons to just immediately give up.

Certainly there are other projects out there which are based on PG that
have managed to make this work and work really quite well.

> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

Yes, Citus[1] and Greenplum[2], to just name two.

I certainly understand the concern around the security of this and would
have thought the approach we'd use would be to not just take internal
state and pass it along but rather to provide a way for aggregates to
opt-in to supporting this and have them serialize/deserialize with
new dedicated functions that have appropriate checks to avoid bad things
happening.  That could also be versioned, perhaps, if we feel that's
necessary (I'm a bit skeptical, but it would hopefully address the
concern about different versions having different data that they want to
pass along).

> One of the things that I think is a problem in this area is that the
> ways we have to configure FDW connections are just not very rich.

Agreed.

> We're trying to cram everything into a set of strings that can be
> attached to the foreign server or the user mapping, but that's not a
> very good fit for something like how all the local SQL functions that
> might exist map onto all of the remote SQL functions that might exist.
> Now you might well say that we don't want the act of configuring a
> foreign data wrapper to be insanely complicated, and I would agree
> with that. But, on the other hand, as Larry Wall once said, a good
> programming language makes simple things simple and complicated things
> possible. I think our current configuration system is only
> accomplishing the first of those goals.

We've already got issues in this area with extensions- there's no way
for a user to say what version of an extension exists on the remote side
and no way for an extension to do anything different based on that
information.  Perhaps we could work on a solution to both of these
issues, but at the least I don't see holding back on this effort for a
problem that already exists but which we've happily accepted because of
the benefit it provides, like being able to push-down postgis bounding
box conditionals to allow for indexed lookups.

Thanks,

Stephen

[1]: https://docs.citusdata.com/en/v11.1/develop/reference_sql.html
[2]: 
https://postgresconf.org/conferences/Beijing/program/proposals/implementation-of-distributed-aggregation-in-greenplum

signature.asc
Description: PGP signature

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost

Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> Is there any interest in fixing our documentation that says encrypted
> when it means hashed?  Should I pursue this?

I do think we should use the correct terminology in our documentation
and would support your working on improving things in this area.

I do wonder if perhaps we would be better off by having someone spend
time on removing terribly insecure authentication methods like md5 and
ldap though ...

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 9:55 AM Stephen Frost  wrote:
> > I do think we should use the correct terminology in our documentation
> > and would support your working on improving things in this area.
> 
> +1.
> 
> > I do wonder if perhaps we would be better off by having someone spend
> > time on removing terribly insecure authentication methods like md5 and
> > ldap though ...
> 
> Wait, what's insecure about LDAP?

We pass a completely cleartext password to the server from the client.
Yes, we might encrypt it on the way with TLS, but even SSH realized how
terrible that is long, long ago and strongly discourages it these days.

The problem with ldap as an auth method is that a single compromised PG
server in an AD/ldap environment can collect up those username+password
credentials and gain access to those users *domain* level access.  The
CFO logging into a PG server with LDAP auth is giving up their complete
access credentials to the entire AD domain.  That's terrible.

> I think we should eventually remove MD5, but I think there's no rush.

I disagree- it's a known pass-the-hash vulnerability and frankly every
release we do with it still existing is deserving of an immediate CVE
(I've been asked off-list why we don't do this, in fact).

> People who care about security will have already switched, and people
> who don't care about security are not required to start caring.

I wish it were this simple.  It's just not though.

> Eventually the maintenance burden will become large enough that it
> makes sense to phase it out for that reason, but I haven't seen any
> evidence that we're anywhere close to that point.

This seems to invite the idea that what people who care about this need
to do is make it painful for us to continue to keep it around, which I
really don't think is best for anyone.  We know it's bad, we know it is
broken, we need to remove it, not pretend like it's not broken or not
bad.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Add recovery to pg_control and remove backup_label

2023-11-28 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost  wrote:
> > What would really be helpful would be hearing from these individuals
> > directly as to what the issues are with the changes, such that perhaps
> > we can do things better in the future to avoid whatever the issue is
> > they're having with the changes.  Simply saying we shouldn't make
> > changes in this area isn't workable and the constant push-back is
> > actively discouraging to folks trying to make improvements.  Obviously
> > it's a biased view, but we've not had issues making the necessary
> > adjustments in pgbackrest with each release and I feel like if the
> > authors of wal-g or barman did that they would have spoken up.
> 
> I'm happy if people show up to comment on proposed changes, but I
> think you're being a little bit unrealistic here. I have had to help
> plenty of people who have screwed up their backups in one way or
> another, generally by using some home-grown script, sometimes by
> misusing some existing backup tool. Those people are EDB customers;
> they don't read and participate in discussions here. If they did,
> perhaps they wouldn't be paying EDB to have me and my colleagues sort
> things out for them when it all goes wrong. I'm not trying to say that
> EDB doesn't have customers who participate in mailing list
> discussions, because we do, but it's a small minority, and I don't
> think that should surprise anyone. Moreover, the people who don't
> wouldn't necessarily have the background, expertise, or *time* to
> assess specific proposals in detail. If your point is that my
> perspective on what's helpful or unhelpful is not valid because I've
> only helped 30 people who had problems in this area, but that the
> perspective of those 30 people who were helped would be more valid,
> well, I don't agree with that. I think your perspective and David's
> are valid precisely *because* you've worked a lot on pgbackrest and no
> doubt interacted with lots of users; I think Andres's perspective is
> valid precisely *because* of his experience working with the fleet at
> Microsoft and individual customers at EDB and 2Q before that; and I
> think my perspective is valid for the same kinds of reasons.

I didn't mean to imply that anyone's perspective wasn't valid.  I was
simply trying to get at the root question of: what *is* the issue with
the changes that are being made?  If the answer to that is: we made
this change, which was hard for folks to deal with, and could have
been avoided by doing X, then I really, really want to hear what X
was!  If the answer is, well, the changes weren't hard, but we didn't
like having to make any changes at all ... then I just don't have any
sympathy for that.  People who write backup software for PG, be it
pgbackrest authors, wal-g authors, or homegrown script authors, will
need to adapt between major versions as we discover things that are
broken (such as exclusive mode, and such as the clear risk that's been
demonstrated of a torn copy of pg_control getting copied, resulting in
a completely invalid backup) and fix them.

> I am more in agreement with the idea that it would be nice to hear
> from backup tool authors, but I think even that has limited value.
> Surely we can all agree that if the backup tool is correctly written,
> none of this matters, because you'll make the tool do the right things
> and then you'll be fine. The difficulty here, and the motivation
> behind this proposal and others like it, is that too many users fail
> to follow the procedure correctly. If we hear from the authors of
> well-written backup tools, I expect they will tell us they can adapt
> their tool to whatever we do. And if we hear from the authors of
> poorly-written tools, well, I don't think their opinions would form a
> great basis for making decisions.

Uhhh.  No, I disagree with this- I'd argue that pgbackrest was broken
until the most recently releases where we implemented a check to ensure
that the pg_control we copy has a valid PG CRC.  Did we know it was
broken before this discussion?  No, but that doesn't change the fact
that we certainly could have ended up copying an invalid pg_control and
thus have an invalid backup, which even our 'pgbackrest verify' wouldn't
have caught because that just checks that the checksum that pgbackrest
calculates for every file hasn't changed since we copied it- but that
didn't do anything for the issue about pg_control having an invalid
internal checksum due to a torn write when we copied it.

So, yes, it does matter.  We didn't make pgbackrest do the right thi

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost

Greetings,

On Tue, Nov 28, 2023 at 20:19 Robert Haas  wrote:

> On Tue, Nov 28, 2023 at 10:16 AM Stephen Frost  wrote:
> > We pass a completely cleartext password to the server from the client.
> > Yes, we might encrypt it on the way with TLS, but even SSH realized how
> > terrible that is long, long ago and strongly discourages it these days.
> >
> > The problem with ldap as an auth method is that a single compromised PG
> > server in an AD/ldap environment can collect up those username+password
> > credentials and gain access to those users *domain* level access.  The
> > CFO logging into a PG server with LDAP auth is giving up their complete
> > access credentials to the entire AD domain.  That's terrible.
>
> Good grief. I really think you need to reassess your approach to this
> whole area. It is massively unhelpful to be so prescriptive about what
> people are, or are not, allowed to be comfortable with from a security
> perspective. And it seems like you're so convinced that you're right
> that you're completely closed off to anyone else's input, so there's
> not even any point in arguing.

You asked what the issue is with our ldap auth method. I explained it
pretty clearly. Is there something incorrect about the issues I’ve pointed
out with ldap auth?  You basically came back at me saying that I’m being
unhelpful for responding with what the issues are.

In the reality I inhabit, many people could *massively* improve their
> security by switching to LDAP from the really lame things that they're
> doing right now. They would be FAR better off. It would not even be
> close. I pretty much know that you aren't going to believe that and
> are going to offer reasons why it isn't and can't possibly be true, or
> else just say that those people are so hopeless that we shouldn't even
> care about them. All I can say is that you're worrying about security
> problems that are so minute as to be invisible compared to what I see.

I don’t know what they’re doing now, as you don’t say, and so I really
couldn’t say if ldap is better or worse for them. In some cases, sure,
perhaps ldap is better than … something else, but in 99% of cases, ldap is
being used because it seems simpler than setting up GSSAPI … but that’s a
lack of clearer documentation on our side on how to set up GSSAPI with AD
and PG, imv.   I’ve tried to improve on that situation by explaining
clearly how to set up GSSAPI authentication with AD and then consistently
pointing people to that.  I don’t put this out there as fiat without any
effort behind it, I’ve been working to move folks in the right direction,
I’ve put in that effort on the lists and continue to do so, feel free to
review the archives.

What I don’t understand is the argument for using ldap in an AD environment
instead of GSSAPI/sspi. Is there some reason you feel makes ldap better?
When there are two such options and one is clearly much more secure,
shouldn’t we push people to the more secure option… possibly even to the
exclusion and removal of the insecure option?  Ditto below for md5 …

> I disagree- it's a known pass-the-hash vulnerability and frankly every
> > release we do with it still existing is deserving of an immediate CVE
> > (I've been asked off-list why we don't do this, in fact).
>
> I agree that the issues with MD5 are way worse than the ones with
> LDAP, but you're arguing here, as you did with exclusive backup mode,
> that the existence of options that are bad, or that you consider to be
> bad, is a problem even if nothing whatever compels people to use them.

I’m not impressed with the conflation of this discussion and that of
exclusive backup mode.  I wasn’t the one who deprecated exclusive backup
mode, tho I did agree with that move, and the reason for removing it was as
much about the complications of documenting it as for the fact that it was
actively broken.

I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> MD5 authentication has a pass-the-hash vulnerability, and that sucks.

So, given that we all agree with the CVE-worthy issue that exists with
every release where we include md5 auth, we should be applying for q CVE
prior to each release, no?

So let's say we remove it. Then presumably we should also remove
> password authentication, which has an even easier-to-exploit
> vulnerability, and trust, which is easier to exploit still. And
> presumably it would be right to do all of that even if SCRAM
> authentication had never been implemented, so then we'd have no
> support for any form of password authentication at all. And we'd
> remove LDAP, too, so the most obvious way of doing password
> authentication without putting the passwords in the database would
> also not exist an

Re: [HACKERS] Changing references of password encryption to hashing

2023-11-28 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Nov 28, 2023 at 12:24 PM Stephen Frost  wrote:
> > I don’t know what they’re doing now, as you don’t say, and so I really 
> > couldn’t say if ldap is better or worse for them. In some cases, sure, 
> > perhaps ldap is better than … something else,
> 
> That's EXACTLY right. You can't say whether LDAP is better or worse in
> every scenario. And therefore you should not be proposing to remove
> it.

I had been hoping you might shed some light on just what use cases you
were referring to so that we could have a constructive discussion about
if ldap is actually a reasonable solution.  I even explicitly pointed
out that there may still exist some environments, such as those with
OpenLDAP only and which don't have Kerberos that could be a valid reason
to keep ldap, but I haven't heard of any such in quite a long time.

Until more recently, an argument could be made that we needed to keep
ldap around because pgAdmin didn't support Kerberos, but that's been
addressed for a couple years now.

At some point, once there aren't any good use-cases for it to be kept,
we should remove it or at least deprecate and discourage its use, though
as I suspect everyone already knows, I really don't like deprecating
things because it means we just end up keeping them around forever.

As I tried to get at in my prior response, we should be encouraging
the use of secure options when they're available and that's something
which I actively do, on these lists and in working with clients.  That I
continue to find cases of perfectly good AD deployments which could be
using Kerberos but instead are using ldap is frustrating.

What I don't understand is this push-back against even talking about the
issues with ldap or appreciation that passing around cleartext passwords
is a security issue (modern password-based authentication methods, like
SCRAM, don't even do this anymore, because it's not a good idea).

If my opinion that we should remove it is that offensive, then fine,
let's drop that part of the discussion and move on to the question of:
why are people still using ldap?

I'm pretty sure the answer is that far too often people think that's
just how you integrate PG into an AD environment- and they think it's
either the only option or sometimes believe it's a secure option.  We've
had people claim on these lists that using ldap doesn't pass around
cleartext passwords, which I suspect means that maybe they thought our
'ldap' auth was actually somehow doing Kerberos under the hood (I really
wish it was, but sadly that's obviously not the case).

For some users, the issue is that our Kerberos/pg_ident/et al system
isn't as good as it could be in an AD environment, which is one of the
things that I was trying to allude to in my prior reply.  In particular,
we don't currently have support for working with AD's ldap for what it
was actually meant for, which is using it to query for information like
group membership.  There had been some work on this not too long ago but
sadly I haven't seen recent activity around that.  What I've heard
requested are things like having an option to say "users in group X are
allowed to log in to role Y", or "users in group X are allowed to log
into database Z".  Would also be great to have built-in support for
syncing from ldap the list of users which should exist in PG
(pg_ldap_sync does an alright job of this today but we could certainly
do better by having it built-in, and ideally with a background worker
that connects to ldap, does an initial sync, and then listens for
changes, instead of having to have a frequent cronjob doing the sync).

> >> I think that is, to borrow a phrase from Tom, arrant nonsense. Sure,
> >> MD5 authentication has a pass-the-hash vulnerability, and that sucks.
> >
> > So, given that we all agree with the CVE-worthy issue that exists with 
> > every release where we include md5 auth, we should be applying for q CVE 
> > prior to each release, no?
> 
> You know, I said in my previous email that you were so sure that you
> were right that there was no point in trying to have a serious
> discussion, and I can't really see how you could have proved that
> point more thoroughly than you did here. You twisted my words around
> to make it seem like I was agreeing with your point when you know full
> well that I was doing the exact opposite of that.

I quoted the part where you agreed that md5 has a known vulnerability to
point out that, given that, we should be making our users aware of that
through the normal process that we use for that.  I wasn't claiming that
you agreed that we should remove md5, unless you're referring to some
other part of my resp

Re: Detecting some cases of missing backup_label

2023-12-05 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> I recently mentioned to Robert (and also Heikki earlier), that I think I see a
> way to detect an omitted backup_label in a relevant subset of the cases (it'd
> apply to the pg_control as well, if we moved to that).  Robert encouraged me
> to share the idea, even though it does not provide complete protection.

That would certainly be nice.

> The subset I think we can address is the following:
> 
> a) An omitted backup_label would lead to corruption, i.e. without the
>backup_label we won't start recovery at the right position. Obviously it'd
>be better to also catch a wrong procedure when it'd not cause corruption -
>perhaps my idea can be extended to handle that, with a small bit of
>overhead.
> 
> b) The backup has been taken from a primary. Unfortunately that probably can't
>be addressed - but the vast majority of backups are taken from a primary,
>so I think it's still a worthwhile protection.

Agreed that this is a worthwhile set to try and address, even if we
can't address other cases.

> Here's my approach
> 
> 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
>primary, emitted just *after* the checkpoint completed
> 
> 2) When replaying a base backup start record, we create a state file that
>includes the corresponding LSN in the filename
> 
> 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
>that time. Instead the state file is created during pg_backup_stop().
> 
> 4) When replaying a XLOG_BACKUP_END record, we verif that the state file
>created by XLOG_BACKUP_START is present, and error out if not.  Backups
>that started before the redo LSN from backup_label are ignored
>(necessitates remembering that LSN, but we've been discussing that anyway).
> 
> 
> Because the backup state file on the primary is only created during
> pg_backup_stop(), a copy of the data directory taken between pg_backup_start()
> and pg_backup_stop() does *not* contain the corresponding "backup state
> file". Because of this, an omitted backup_label is detected if recovery does
> not start early enough - recovery won't encounter the XLOG_BACKUP_START record
> and thus would not create the state file, leading to an error in 4).

While I see the idea here, I think, doesn't it end up being an issue if
things happen like this:

pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash

In that scenario, we'd go back to the new checkpoint (the one *after*
the checkpoint that happened before we wrote XLOG_BACKUP_START), start
replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
Even though we're actually doing crash recovery and everything should be
fine as long as we replay all of the WAL.

Perhaps we can make the pg_backup_stop and(/or?) the writing out of
XLOG_BACKUP_STOP wait until just before the next checkpoint and
hopefully minimize that window ... but I'm not sure if we could make
that window zero and what happens if someone does end up hitting it?
Doesn't seem like there's any way around it, which seems like it might
be a problem.  I suppose it wouldn't be hard to add some option to tell
PG to ignore the XLOG_BACKUP_STOP ... but then that's akin to removing
backup_label which lands us possibly back into the issue of people
mis-using that.

> It is not a problem that the primary does not create the state file before the
> pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no
> XLOG_BACKUP_END and thus no error will be raised.  It's a bit odd that the
> sequence differs between normal processing and recovery, but I think that's
> nothing a good comment couldn't explain.

Right, crashing before pg_backup_stop() is fine, but crashing *after*
would be an issue, I think, as outlined above, until the next checkpoint
completes, so we've moved the window but not eliminated it.

> I haven't worked out the details, but I think we might be able extend this to
> catch errors even if there is no checkpoint during the base backup, by
> emitting the WAL record *before* the RequestCheckpoint(), and creating the
> corresponding state file during backup_label processing at the start of
> recovery.  That'd probably make the logic for when we can remove the backup
> state files a bit more complicated, but I think we could deal with that.

Not entirely following this- are you meaning that we might be able to
make something here work in the case where we don't have
pg_backup_start() wait for a checkpoint to happen (which I have some
serious doubts about?), or are you saying that the above doesn't work
unless there's at least one post-pg_backup_start() checkpoint?  I don't
immediately see why that would be the case though.  Also, if we wrote
out the XLOG_BACKUP_START before the checkpoint that we start replay
from and instead move that logic to backup_label processing ...

Re: Detecting some cases of missing backup_label

2023-12-18 Thread Stephen Frost

Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I recently mentioned to Robert (and also Heikki earlier), that I think I 
> > see a
> > way to detect an omitted backup_label in a relevant subset of the cases 
> > (it'd
> > apply to the pg_control as well, if we moved to that).  Robert encouraged me
> > to share the idea, even though it does not provide complete protection.
> 
> That would certainly be nice.
> 
> > The subset I think we can address is the following:
> > 
> > a) An omitted backup_label would lead to corruption, i.e. without the
> >backup_label we won't start recovery at the right position. Obviously 
> > it'd
> >be better to also catch a wrong procedure when it'd not cause corruption 
> > -
> >perhaps my idea can be extended to handle that, with a small bit of
> >overhead.
> > 
> > b) The backup has been taken from a primary. Unfortunately that probably 
> > can't
> >be addressed - but the vast majority of backups are taken from a primary,
> >so I think it's still a worthwhile protection.
> 
> Agreed that this is a worthwhile set to try and address, even if we
> can't address other cases.
> 
> > Here's my approach
> > 
> > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a
> >primary, emitted just *after* the checkpoint completed
> > 
> > 2) When replaying a base backup start record, we create a state file that
> >includes the corresponding LSN in the filename
> > 
> > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at
> >that time. Instead the state file is created during pg_backup_stop().
> > 
> > 4) When replaying a XLOG_BACKUP_END record, we verif that the state file
> >created by XLOG_BACKUP_START is present, and error out if not.  Backups
> >that started before the redo LSN from backup_label are ignored
> >(necessitates remembering that LSN, but we've been discussing that 
> > anyway).
> > 
> > 
> > Because the backup state file on the primary is only created during
> > pg_backup_stop(), a copy of the data directory taken between 
> > pg_backup_start()
> > and pg_backup_stop() does *not* contain the corresponding "backup state
> > file". Because of this, an omitted backup_label is detected if recovery does
> > not start early enough - recovery won't encounter the XLOG_BACKUP_START 
> > record
> > and thus would not create the state file, leading to an error in 4).
> 
> While I see the idea here, I think, doesn't it end up being an issue if
> things happen like this:
> 
> pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint
> happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash
> 
> In that scenario, we'd go back to the new checkpoint (the one *after*
> the checkpoint that happened before we wrote XLOG_BACKUP_START), start
> replaying, and then hit the XLOG_BACKUP_STOP and then error out, right?
> Even though we're actually doing crash recovery and everything should be
> fine as long as we replay all of the WAL.

Andres and I discussed this in person at PGConf.eu and the idea is that
if we find a XLOG_BACKUP_STOP record then we can check if the state file
was written out and if so then we can conclude that we are doing crash
recovery and not restoring from a backup and therefore we don't error
out.  This also implies that we don't consider PG to be recovered at the
XLOG_BACKUP_STOP point, if the state file exists, but instead we have to
be sure to replay all WAL that's been written.  Perhaps we even
explicitly refuse to use restore_command in this case?

We do error out if we hit a XLOG_BACKUP_STOP and the state file
doesn't exist, as that implies that we started replaying from a point
after a XLOG_BACKUP_START record was written but are working from a copy
of the data directory which didn't include the state file.

Of course, we need to actually implement and test these different cases
to make sure it all works but I'm at least feeling better about the idea
and wanted to share that here.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: The danger of deleting backup_label

2023-10-17 Thread Stephen Frost

Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 10/16/23 15:06, Robert Haas wrote:
> > On Mon, Oct 16, 2023 at 1:00 PM David Steele  wrote:
> > > After some agonizing (we hope) they decide to delete backup_label and,
> > > wow, it just works! So now they merrily go on their way with a corrupted
> > > cluster. They also remember for the next time that deleting backup_label
> > > is definitely a good procedure.
> > > 
> > > The idea behind this patch is that deleting backup_label would produce a
> > > hard error because pg_control would be missing as well (if the backup
> > > software did its job). If both pg_control and backup_label are present
> > > (but pg_control has not been loaded with the contents of backup_label,
> > > i.e. it is the running copy from the backup cluster) we can also error.
> > 
> > I mean, I think we're just going in circles, here. I did and do
> > understand, but I didn't and don't agree. You're hypothesizing a user
> > who is willing to do ONE thing that they shouldn't do during backup
> > restoration (namely, remove backup_label) but who won't be willing to
> > do a SECOND thing that they shouldn't do during backup restoration
> > (namely, run pg_resetwal).
> 
> In my experience the first case is much more likely than the second. Your
> experience may vary.

My experience (though perhaps not a surprise) mirrors David's.

> Anyway, I think they are pretty different. Deleting backup label appears to
> give a perfectly valid restore. Running pg_resetwal is more clearly (I
> think) the nuclear solution.

Right, and a delete of backup_label is just an 'rm' that folks may think
"oh, this is just some leftover thing that isn't actually needed".

OTOH, pg_resetwal has an online documentation page and a man page that's
very clear that it's only to be used as a last resort (perhaps we should
pull that into the --help output too..?).  It's also pretty clear that
pg_resetwal is actually changing things about the cluster while nuking
backup_label doesn't *seem* to be in that same category, even though we
all know it is because it's needed once recovery begins.

I'd also put out there that while people don't do restore testing
nearly as much as they should, they tend to at _least_ try to do it once
after taking their first backup and if that fails then they try to figure
out why and what they're not doing right.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-17 Thread Stephen Frost

Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Wed, Oct 04, 2023 at 10:41:15PM -0700, Gurjeet Singh wrote:
> > The patches posted in this thread so far attempt to add the ability to
> > allow the user to have an arbitrary number of passwords. I believe
> > that allowing arbitrary number of passwords is not only unnecessary,
> > but the need to name passwords, the need to store them in a shared
> > catalog, etc. may actually create problems in the field. The
> > users/admins will have to choose names for passwords, which they
> > didn't have to previously. The need to name them may also lead to
> > users storing password-hints in the password names (e.g. 'mom''s
> > birthday', 'ex''s phone number', 'third password'), rendering the
> > passwords weak.
> > 
> > Moreover, allowing an arbitrarily many number of passwords will
> > require us to provide additional infrastructure to solve problems like
> > observability (which passwords are currently in use, and which ones
> > have been effectively forgotten by applications), or create a nuisance
> > for admins that can create more problems than it solves.
> 
> IMHO neither of these problems seems insurmountable.  Besides advising
> against using hints as names, we could also automatically generate safe
> names, or even disallow user-provided names entirely.  And adding
> observability for passwords seems worthwhile anyway.

Agreed, particularly on adding observability for password use.
Regardless of what we do, I feel pretty strongly that we need that.
That said, having this handled in a separate catalog feels like a just
generally better idea than shoving it all into pg_authid as we extend
things to include information like "last used date", "last used source
IP", etc.

Providing this observability purely through logging strikes me as a
terrible idea.

I don't find the concern about names as 'hints' to be at all compelling.
Having a way to avoid having names may have some value, but only if we
can come up with something reasonable.

> > So I propose that the feature should allow no more than 2 passwords
> > for a role, each with their own validity periods. This eliminates the
> > need to name passwords, because at any given time there are no more
> > than 2 passwords; current one, and old one. This also eliminates the
> > need for a shared catalog to hold passwords, because with the limit of
> > 2 imposed, we can store the old password and its validity period in
> > additional columns in the pg_authid table.
> 
> Another approach could be to allow an abritrary number of passwords but to
> also allow administrators to limit how many passwords can be associated to
> each role.  That way, we needn't restrict this feature to 2 passwords for
> everyone.  Perhaps 2 should be the default, but in any case, IMO we
> shouldn't design to only support 2.

Agreed that it's a bad idea to design to support 2 and only 2.  If
nothing else, there's the very simple case that the user needs to do
another password rotation ... and they look and discover that the old
password is still being used and that if they took it away, things would
break, but they need time to run down which system is still using it
while still needing to perform the migration for the other systems that
are correctly being updated- boom, need 3 for that case.  There's other
use-cases that could be interesting though- presumably we'd log which
password is used to authenticate and then users could have a fleet of
web servers which each have their own password but log into the same PG
user and they could happily rotate the passwords independently for all
of those systems.

I don't propose this as some use-case just for the purpose of argument-
not sharing passwords across a bunch of systems is absolutely a good
stance when it comes to security, and due to the way permissions and
roles work in PG, being able to have both distinct passwords with
explicitly logged indication of which system used what password to log
in while not having to deal with possibly permission differences due to
using actually independent roles is valuable.  That is, each server
using a separate role to log in could lead to some servers having access
to something or other while others don't pretty easily- if they're all
logging in with the same role and just a different password, that's not
going to happen.

* Jeff Davis (pg...@j-davis.com) wrote:
> Using a number seems weird to me because either:
> 
>  (a) if the number is always increasing you'd have to look to find the
> number of the new slot to add and the old slot to remove; or
>  (b) if switched between two numbers (say 0 and 1), it would be error
> prone because you'd have to remember which is the old one that can be
> safely replaced

Yeah, a number doesn't strike me as very good either.

> Maybe a password is best described by its validity period rather than a
> name? But what about passwords that don't expire?

The validity idea is interesting but falls down w

Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-17 Thread Stephen Frost

Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2023-10-17 at 16:20 -0400, Stephen Frost wrote:
> > Agreed that it's a bad idea to design to support 2 and only 2.
> 
> I don't disagree, but it's difficult to come up with syntax that:
> 
>  (a) supports N passwords
>  (b) makes the ordinary cases simple and documentable
>  (c) helps users avoid mistakes (at least in the simple cases)
>  (d) makes sense passwords with and without validity period
>  (e) handles the challenging cases

Undoubtably biased ... but I don't agree that this is so difficult.
What points have been raised as a downside of the originally proposed
approach, specifically?

Reading back through the thread, from a user perspective, the primary
one seems to be that passwords are expected to be named.  I'm surprised
this is being brought up as such a serious concern.  Lots and lots and
lots of things in the system require naming, after all, and the idea
that this is somehow harder or more of an issue is quite odd to me.

> One challenging case is that we cannot allow the mixing of password
> protocols (e.g. MD5 & SCRAM), because the authentication exchange only
> gets one chance at success. If someone ends up with 7 MD5 passwords,
> and they'd like to migrate to SCRAM, then we can't allow them to
> migrate one password at a time (because then the other passwords would
> break). I'd like to see what the SQL for doing this should look like.

I've got absolutely no interest in supporting md5- it's high time to rip
that out and be happy that it's gone.  We nearly did it last year and
I'm really hoping we accomplish that this year.

I'm open to the idea that we may need to support some new SCRAM version
or an alternative mechanism in the future, but it's long past time to
spend any time worrying about md5.  As for how difficult it is to deal
with supporting an alternative in the future- that's going to depend a
great deal on what that alternative is and I don't know that we can
really code to handle that as part of this effort in a sensible way, and
trying to code for "anything" is going to likely make this much more
complicated, and probably rife with security issues too.

> >   If
> > nothing else, there's the very simple case that the user needs to do
> > another password rotation ... and they look and discover that the old
> > password is still being used and that if they took it away, things
> > would
> > break, but they need time to run down which system is still using it
> > while still needing to perform the migration for the other systems
> > that
> > are correctly being updated- boom, need 3 for that case.
> 
> That sounds like a reasonable use case. I don't know if we should make
> it a requirement, but if we come up with something reasonable that
> supports this case I'm fine with it. Ideally, it would still be easy to
> see when you are making a mistake (e.g. forgetting to ever remove
> passwords).

We have monitoring for many, many parts of the system and this would be
a good thing to monitor also- not just at a per-password level but also
at an overall role/user level, as you have a similar issue there and we
don't properly provide users with any way to check reasonably "hey, when
was the last time this user logged in?".  No, trawling through ancient
logs, if you even have them, isn't a proper solution to this problem.

> >   There's other
> > use-cases that could be interesting though- presumably we'd log which
> > password is used to authenticate and then users could have a fleet of
> > web servers which each have their own password but log into the same
> > PG
> > user and they could happily rotate the passwords independently for
> > all
> > of those systems.
> > 
> > if they're all
> > logging in with the same role and just a different password, that's
> > not
> > going to happen.
> 
> I'm not sure this is a great idea. Can you point to some precedent
> here?

It's already the case that lots and lots and lots of systems out there
log into PG using the same username/password.  With this, we're at least
offering them the ability to vary the password and keep the user the
same.  We've even seen this be asked for in other ways- the ability to
use distinct Kerberos or LDAP identifies to log into the same user in
the database, see pg_ident.conf and various suggestions for how to bring
that to LDAP too.  Other systems also support the ability to have a
group of users in LDAP, or such, be allowed to log into a specific
database user.  One big reason for this is that then you know everyong
logging into that account has exactly the same access to th

Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-18 Thread Stephen Frost

Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Tue, 2023-10-17 at 22:52 -0400, Stephen Frost wrote:
> 
> > Reading back through the thread, from a user perspective, the primary
> > one seems to be that passwords are expected to be named.  I'm
> > surprised
> > this is being brought up as such a serious concern.  Lots and lots
> > and
> > lots of things in the system require naming, after all, and the idea
> > that this is somehow harder or more of an issue is quite odd to me.
> 
> In the simplest intended use case, the names will be arbitrary and
> temporary. It's easy for me to imagine someone wondering "was I
> supposed to delete 'bear' or 'lion'?". For indexes and other objects,
> there's a lot more to go on, easily visible with \d.

Sure, agreed.

> Now, obviously that is not the end of the world, and the user could
> prevent that problem a number of different ways. And we can do things
> like improve the monitoring of password use, and store the password
> creation time, to help users if they are confused. So I don't raise
> concerns about naming as an objection to the feature overall, but
> rather a concern that we aren't getting it quite right.

Right, we need more observability, agreed, but that's not strictly
necessary of this patch and could certainly be added independently.  Is
there really a need to make this observability a requirement of this
particular change?

> Maybe a name should be entirely optional, more like a comment, and the
> passwords can be referenced by the salt? The salt needs to be unique
> for a given user anyway.

I proposed an approach in the email you replied to explicitly suggesting
a way we could make the name be optional, so, sure, I'm generally on
board with that idea.  Note that it'd be optional for the user to
provide and then we'd simply generate one for them and then that name is
what would be used to refer to that password later.

> (Aside: is the uniqueness of the salt enforced in the current patch?)

Err, the salt has to be *identical* for each password of a given user,
not unique, so I'm a bit confused here.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-18 Thread Stephen Frost

Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> On 29/9/2023 09:52, Andrei Lepikhov wrote:
> > On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote:
> > > Attach patches updated to master.
> > > Pulled from patch 2 back to patch 1 a change that was also pertinent
> > > to patch 1.
> > +1 to the idea, have doubts on the implementation.
> > 
> > I have a question. I see the feature triggers ERROR on the exceeding of
> > the memory limit. The superior PG_CATCH() section will handle the error.
> > As I see, many such sections use memory allocations. What if some
> > routine, like the CopyErrorData(), exceeds the limit, too? In this case,
> > we could repeat the error until the top PG_CATCH(). Is this correct
> > behaviour? Maybe to check in the exceeds_max_total_bkend_mem() for
> > recursion and allow error handlers to slightly exceed this hard limit?

> By the patch in attachment I try to show which sort of problems I'm worrying
> about. In some PП_CATCH() sections we do CopyErrorData (allocate some
> memory) before aborting the transaction. So, the allocation error can move
> us out of this section before aborting. We await for soft ERROR message but
> will face more hard consequences.

While it's an interesting idea to consider making exceptions to the
limit, and perhaps we'll do that (or have some kind of 'reserve' for
such cases), this isn't really any different than today, is it?  We
might have a malloc() failure in the main path, end up in PG_CATCH() and
then try to do a CopyErrorData() and have another malloc() failure.

If we can rearrange the code to make this less likely to happen, by
doing a bit more work to free() resources used in the main path before
trying to do new allocations, then, sure, let's go ahead and do that,
but that's independent from this effort.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-18 Thread Stephen Frost

Greetings,

* Melih Mutlu (m.melihmu...@gmail.com) wrote:
> Melih Mutlu , 16 Haz 2023 Cum, 17:03 tarihinde şunu
> yazdı:
> 
> > With this change, here's a query to find how much space used by each
> > context including its children:
> >
> > > WITH RECURSIVE cte AS (
> > > SELECT id, total_bytes, id as root, name as root_name
> > > FROM memory_contexts
> > > UNION ALL
> > > SELECT r.id, r.total_bytes, cte.root, cte.root_name
> > > FROM memory_contexts r
> > > INNER JOIN cte ON r.parent_id = cte.id
> > > ),
> > > memory_contexts AS (
> > > SELECT * FROM pg_backend_memory_contexts
> > > )
> > > SELECT root as id, root_name as name, sum(total_bytes)
> > > FROM cte
> > > GROUP BY root, root_name
> > > ORDER BY sum DESC;
> 
> Given that the above query to get total bytes including all children is
> still a complex one, I decided to add an additional info in
> pg_backend_memory_contexts.
> The new "path" field displays an integer array that consists of ids of all
> parents for the current context. This way it's easier to tell whether a
> context is a child of another context, and we don't need to use recursive
> queries to get this info.

Nice, this does seem quite useful.

> Here how pg_backend_memory_contexts would look like with this patch:
> 
> postgres=# SELECT name, id, parent, parent_id, path
> FROM pg_backend_memory_contexts
> ORDER BY total_bytes DESC LIMIT 10;
>   name   | id  |  parent  | parent_id | path
> -+-+--+---+--
>  CacheMemoryContext  |  27 | TopMemoryContext | 0 | {0}
>  Timezones   | 124 | TopMemoryContext | 0 | {0}
>  TopMemoryContext|   0 |  |   |
>  MessageContext  |   8 | TopMemoryContext | 0 | {0}
>  WAL record construction | 118 | TopMemoryContext | 0 | {0}
>  ExecutorState   |  18 | PortalContext|17 | {0,16,17}
>  TupleSort main  |  19 | ExecutorState|18 | {0,16,17,18}
>  TransactionAbortContext |  14 | TopMemoryContext | 0 | {0}
>  smgr relation table |  10 | TopMemoryContext | 0 | {0}
>  GUC hash table  | 123 | GUCMemoryContext |   122 | {0,122}
> (10 rows)
> 
> An example query to calculate the total_bytes including its children for a
> context (say CacheMemoryContext) would look like this:
> 
> WITH contexts AS (
> SELECT * FROM pg_backend_memory_contexts
> )
> SELECT sum(total_bytes)
> FROM contexts
> WHERE ARRAY[(SELECT id FROM contexts WHERE name = 'CacheMemoryContext')] <@
> path;

I wonder if we should perhaps just include
"total_bytes_including_children" as another column?  Certainly seems
like a very useful thing that folks would like to see.  We could do that
either with C, or even something as simple as changing the view to do
something like:

WITH contexts AS MATERIALIZED (
  SELECT * FROM pg_get_backend_memory_contexts()
)
SELECT
  *,
  coalesce
  (
(
  (SELECT sum(total_bytes) FROM contexts WHERE ARRAY[a.id] <@ path)
  + total_bytes
),
total_bytes
  ) AS total_bytes_including_children
FROM contexts a;

> We still need to use cte since ids are not persisted and might change in
> each run of pg_backend_memory_contexts. Materializing the result can
> prevent any inconsistencies due to id change. Also it can be even good for
> performance reasons as well.

I don't think we really want this to be materialized, do we?  Where this
is particularly interesting is when it's being dumped to the log ( ...
though I wish we could do better than that and hope we do in the future)
while something is ongoing in a given backend and if we do that a few
times we are able to see what's changing in terms of allocations,
whereas if we materialized it (when?  transaction start?  first time
it's asked for?) then we'd only ever get the one view from whenever the
snapshot was taken.

> Any thoughts?

Generally +1 from me for working on improving this.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: pg_dump needs SELECT privileges on irrelevant extension table

2023-10-18 Thread Stephen Frost

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I wrote:
> > Why are we marking extension member objects as being subject to SECLABEL
> > or POLICY dumping?  As the comment notes, that isn't really sensible
> > unless what we are dumping is a delta from the extension's initial
> > assignments.  But we have no infrastructure for that, and none seems
> > likely to appear in the near future.
> 
> Here's a quick patch that does it that way.  The test changes
> are identical to Jacob's v3-0001.

What the comment is talking about is that we don't support initial
policies, not that we don't support policies on extension tables at all.
That said ... even the claim that we don't support such policies isn't
supported by code and there are people out there doing it, which creates
its own set of problems (ones we should really try to find solutions to
though..).

This change would mean that policies added by a user after the extension
is created would just be lost by a pg_dump/reload, doesn't it?

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-19 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-18 15:53:30 -0400, Stephen Frost wrote:
> > > Here how pg_backend_memory_contexts would look like with this patch:
> > > 
> > > postgres=# SELECT name, id, parent, parent_id, path
> > > FROM pg_backend_memory_contexts
> > > ORDER BY total_bytes DESC LIMIT 10;
> > >   name   | id  |  parent  | parent_id | path
> > > -+-+--+---+--
> > >  CacheMemoryContext  |  27 | TopMemoryContext | 0 | {0}
> > >  Timezones   | 124 | TopMemoryContext | 0 | {0}
> > >  TopMemoryContext|   0 |  |   |
> > >  MessageContext  |   8 | TopMemoryContext | 0 | {0}
> > >  WAL record construction | 118 | TopMemoryContext | 0 | {0}
> > >  ExecutorState   |  18 | PortalContext|17 | {0,16,17}
> > >  TupleSort main  |  19 | ExecutorState|18 | 
> > > {0,16,17,18}
> > >  TransactionAbortContext |  14 | TopMemoryContext | 0 | {0}
> > >  smgr relation table |  10 | TopMemoryContext | 0 | {0}
> > >  GUC hash table  | 123 | GUCMemoryContext |   122 | {0,122}
> > > (10 rows)
> > > 
> > > An example query to calculate the total_bytes including its children for a
> > > context (say CacheMemoryContext) would look like this:
> > > 
> > > WITH contexts AS (
> > > SELECT * FROM pg_backend_memory_contexts
> > > )
> > > SELECT sum(total_bytes)
> > > FROM contexts
> > > WHERE ARRAY[(SELECT id FROM contexts WHERE name = 'CacheMemoryContext')] 
> > > <@
> > > path;
> > 
> > I wonder if we should perhaps just include
> > "total_bytes_including_children" as another column?  Certainly seems
> > like a very useful thing that folks would like to see.
> 
> The "issue" is where to stop - should we also add that for some of the other
> columns? They are a bit less important, but not that much.

I'm not sure the others really make sense to aggregate in this way as
free space isn't able to be moved between contexts.  That said, if
someone wants it then I'm not against that.  I'm actively in support of
adding an aggregated total though as that, at least to me, seems to be
very useful to have.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-19 Thread Stephen Frost

Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> On 19/10/2023 02:00, Stephen Frost wrote:
> > * Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> > > On 29/9/2023 09:52, Andrei Lepikhov wrote:
> > > > On 22/5/2023 22:59, reid.thomp...@crunchydata.com wrote:
> > > > > Attach patches updated to master.
> > > > > Pulled from patch 2 back to patch 1 a change that was also pertinent
> > > > > to patch 1.
> > > > +1 to the idea, have doubts on the implementation.
> > > > 
> > > > I have a question. I see the feature triggers ERROR on the exceeding of
> > > > the memory limit. The superior PG_CATCH() section will handle the error.
> > > > As I see, many such sections use memory allocations. What if some
> > > > routine, like the CopyErrorData(), exceeds the limit, too? In this case,
> > > > we could repeat the error until the top PG_CATCH(). Is this correct
> > > > behaviour? Maybe to check in the exceeds_max_total_bkend_mem() for
> > > > recursion and allow error handlers to slightly exceed this hard limit?
> > 
> > > By the patch in attachment I try to show which sort of problems I'm 
> > > worrying
> > > about. In some PП_CATCH() sections we do CopyErrorData (allocate some
> > > memory) before aborting the transaction. So, the allocation error can move
> > > us out of this section before aborting. We await for soft ERROR message 
> > > but
> > > will face more hard consequences.
> > 
> > While it's an interesting idea to consider making exceptions to the
> > limit, and perhaps we'll do that (or have some kind of 'reserve' for
> > such cases), this isn't really any different than today, is it?  We
> > might have a malloc() failure in the main path, end up in PG_CATCH() and
> > then try to do a CopyErrorData() and have another malloc() failure.
> > 
> > If we can rearrange the code to make this less likely to happen, by
> > doing a bit more work to free() resources used in the main path before
> > trying to do new allocations, then, sure, let's go ahead and do that,
> > but that's independent from this effort.
> 
> I agree that rearranging efforts can be made independently. The code in the
> letter above was shown just as a demo of the case I'm worried about.
> IMO, the thing that should be implemented here is a recursion level for the
> memory limit. If processing the error, we fall into recursion with this
> limit - we should ignore it.
> I imagine custom extensions that use PG_CATCH() and allocate some data
> there. At least we can raise the level of error to FATAL.

Ignoring such would defeat much of the point of this effort- which is to
get to a position where we can say with some confidence that we're not
going to go over some limit that the user has set and therefore not
allow ourselves to end up getting OOM killed.  These are all the same
issues that already exist today on systems which don't allow overcommit
too, there isn't anything new here in regards to these risks, so I'm not
really keen to complicate this to deal with issues that are already
there.

Perhaps once we've got the basics in place then we could consider
reserving some space for handling such cases..  but I don't think it'll
actually be very clean and what if we have an allocation that goes
beyond what that reserved space is anyway?  Then we're in the same spot
again where we have the choice of either failing the allocation in a
less elegant way than we might like to handle that error, or risk
getting outright kill'd by the kernel.  Of those choices, sure seems
like failing the allocation is the better way to go.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-19 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-19 18:06:10 -0400, Stephen Frost wrote:
> > Ignoring such would defeat much of the point of this effort- which is to
> > get to a position where we can say with some confidence that we're not
> > going to go over some limit that the user has set and therefore not
> > allow ourselves to end up getting OOM killed.
> 
> I think that is a good medium to long term goal. I do however think that we'd
> be better off merging the visibility of memory allocations soon-ish and
> implement the limiting later. There's a lot of hairy details to get right for
> the latter, and even just having visibility will be a huge improvement.

I agree that having the visibility will be a great improvement and
perhaps could go in separately, but I don't know that I agree that the
limits are going to be that much of an issue.  In any case, there's been
work ongoing on this and that'll be posted soon.  I was just trying to
address the general comment raised in this sub-thread here.

> I think even patch 1 is doing too much at once. I doubt the DSM stuff is
> quite right.

Getting DSM right has certainly been tricky, along with other things,
but we've been working towards, and continue to work towards, getting
everything to line up nicely between memory context allocations of
various types and the amounts which are being seen as malloc'd/free'd.
There's been parts of this also reworked to allow us to see per-backend
reservations as well as total reserved and to get those numbers able to
be matched up inside of a given transaction using the statistics system.

> I'm unconvinced it's a good idea to split the different types of memory
> contexts out. That just exposes too much implementation detail stuff without a
> good reason.

DSM needs to be independent anyway ... as for the others, perhaps we
could combine them, though that's pretty easily done later and for now
it's been useful to see them split out as we've been working on the
patch.

> I think the overhead even just the tracking implies right now is likely too
> high and needs to be optimized. It should be a single math operation, not
> tracking things in multiple fields. I don't think pg_sub_u64_overflow() should
> be in the path either, that suddenly adds conditional branches.  You really
> ought to look at the difference in assembly for the hot functions.

This has been improved in the most recent work and we'll have that
posted soon, probably best to hold off from larger review of this right
now- as mentioned, I was just trying to address the specific question in
this sub-thread since a new patch is coming soon.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: [PoC/RFC] Multiple passwords, interval expirations

2023-10-20 Thread Stephen Frost

Greetings,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Wed, 2023-10-18 at 14:48 -0400, Stephen Frost wrote:
> > Right, we need more observability, agreed, but that's not strictly
> > necessary of this patch and could certainly be added independently. 
> > Is
> > there really a need to make this observability a requirement of this
> > particular change?
> 
> I won't draw a line in the sand, but it feels like something should be
> there to help the user keep track of which password they might want to
> keep. At least a "created on" date or something.

Sure, no objection to adding that and seems like it should be fairly
easy ... but then again, I tend to feel that we should do that for all
of the objects in the system and we've got some strong feelings against
doing that from others.  Perhaps this case is different to them, in
which case, great, but if it's not, it'd be unfortunate for this feature
to get bogged down due to that.

> > > (Aside: is the uniqueness of the salt enforced in the current
> > > patch?)
> > 
> > Err, the salt has to be *identical* for each password of a given
> > user,
> > not unique, so I'm a bit confused here.
> 
> Sorry, my mistake.

Sure, no worries.

> If the client needs to use the same salt as existing passwords, can you
> still use PQencryptPasswordConn() on the client to avoid sending the
> plaintext password to the server?

Short answer, yes ... but seems that wasn't actually done yet.  Requires
a bit of additional work, since the client needs to get the existing
salt (note that as part of the SCRAM typical exchange, the client is
provided with the salt, so this isn't exposing anything new) to use to
construct what is then sent to the server to store.  We'll also need to
decide how to handle the case if the client tries to send a password
that doesn't have the same salt as the existing ones (regardless of how
many passwords we end up supporting).  Perhaps we require the user,
through the grammar, to make clear if they want to add a password, and
then error if they don't provide a matching salt, or if they want to
remove existing passwords and replace with the new one.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-10-20 Thread Stephen Frost

Greetings,

* Andrei Lepikhov (a.lepik...@postgrespro.ru) wrote:
> The only issue I worry about is the uncertainty and clutter that can be
> created by this feature. In the worst case, when we have a complex error
> stack (including the extension's CATCH sections, exceptions in stored
> procedures, etc.), the backend will throw the memory limit error repeatedly.

I'm not seeing what additional uncertainty or clutter there is- this is,
again, exactly the same as what happens today on a system with
overcommit disabled and I don't feel like we get a lot of complaints
about this today.

> Of course, one failed backend looks better than a surprisingly killed
> postmaster, but the mix of different error reports and details looks
> terrible and challenging to debug in the case of trouble. So, may we throw a
> FATAL error if we reach this limit while handling an exception?

I don't see why we'd do that when we can do better- we just fail
whatever the ongoing query or transaction is and allow further requests
on the same connection.  We already support exactly that and it works
really rather well and I don't see why we'd throw that away because
there's a different way to get an OOM error.

If you want to make the argument that we should throw FATAL on OOM when
handling an exception, that's something you could argue independently of
this effort already today, but I don't think you'll get agreement that
it's an improvement.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Stephen Frost

Greetings,

On Tue, Oct 24, 2023 at 14:42 Robert Haas  wrote:

> On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis  wrote:
> > Perhaps the idea is that if there are constraints involved, the failure
> > or success of an INSERT/UPDATE/DELETE could leak information that you
> > don't have privileges to read.
>
> My recollection of this topic is pretty hazy, but like Tom, I seem to
> remember it being intentional, and I think the reason had something to
> do with wanting the slice of a RLS-protect table that you can see to
> feel like a complete table. When you update a row in a table all of
> which is visible to you, the updated row can never vanish as a result
> of that update, so it was thought, if I remember correctly, that this
> should also be true here. It's also similar to what happens if an
> updatable view has WITH CHECK OPTION, and I think that was part of the
> precedent as well. I don't know whether or not the constraint issue
> that you mention here was also part of the concern, but it may have
> been. This was all quite a while ago...


Yes, having it be similar to a view WITH CHECK OPTION was intentional, also
on not wishing for things to be able to disappear or to not get saved. The
risk of a constraint possibly causing the leak of information is better
than either having data just thrown away or having the constraint not
provide the guarantee it’s supposed to …

Thanks,

Stephen

(On my phone at an event currently, sorry for not digging in deeper on
this..)

>

Re: Fwd: Annoying corruption in PostgreSQL.

2023-10-30 Thread Stephen Frost

Greetings,

Please don't top-post on these lists.

* Kirill Reshke (reshkekir...@gmail.com) wrote:
> We have physical backups and we can PITR. But restoring a cluster to some
> point in the past is a bit of a different task: we need our client's
> approval for these operations, since we are a Managed DBs Cloud Provider.
> Will try to ask someone.

Do you have page-level checksums enabled for these PG instances?

Are you able to see if these clusters which are reporting the corruption
have been restored in the past from a backup?  What are you using to
perform your backups and perform your restores?

Are you able to see if these clusters have ever crashed and come back up
after by doing WAL replay?

Where I'm heading with these questions is essentially: I suspect either
your backup/restore procedure is broken or you're running on a system
that doesn't properly fsync data.  Or possibly both.

Oh, and you should probably have checksums enabled.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Moving forward with TDE [PATCH v3]

2023-11-06 Thread Stephen Frost

Greetings,

Thanks for your feedback on this.

* Andres Freund (and...@anarazel.de) wrote:
> I still am quite quite unconvinced that using the LSN as a nonce is a good
> design decision.

This is a really important part of the overall path to moving this
forward, so I wanted to jump to it and have a specific discussion
around this.  I agree that there are downsides to using the LSN, some of
which we could possibly address (eg: include the timeline ID in the IV),
but others that would be harder to deal with.

The question then is- what's the alternative?

One approach would be to change the page format to include space for an
explicit nonce.  I don't see the community accepting such a new page
format as the only format we support though as that would mean no
pg_upgrade support along with wasted space if TDE isn't being used.
Ideally, we'd instead be able to support multiple page formats where
users could decide when they create their cluster what features they
want- and luckily we even have such an effort underway with patches
posted for review [1].  Certainly, with the base page-special-feature
patch, we could have an option for users to choose that they want a
better nonce than the LSN, or we could bundle that assumption in with,
say, the authenticated-encryption feature (if you want authenticated
encryption, then you want more from the encryption system than the
basics, and therefore we presume you also want a better nonce than the
LSN).

Another approach would be a separate fork, but that then has a number of
downsides too- every write has to touch that too, and a single page of
nonces would cover a pretty large number of pages also.

Ultimately, I agree with you that the LSN isn't perfect and we really
shouldn't be calling it 'ideal' as it isn't, and we can work to fix that
language in the patch, but the lack of any alternative being proposed
that might be acceptable makes this feel a bit like rock management [2].

My apologies if there's something good that's already been specifically
pushed and I just missed it; if so, a link to that suggestion and
discussion would be greatly appreciated.

Thanks again!

Stephen

[1]: https://commitfest.postgresql.org/45/3986/
[2]: https://en.wikipedia.org/wiki/Wikipedia:Bring_me_a_rock ; though
that isn't great for a quick summary (which I tried to find on an
isolated page somewhere and didn't).

The gist is, without a suggestion of things to try, we're left
to our own devices to try and figure out things which might be
successful, only to have those turned down too when we come back with
them, see [1] for what feels like an example of this.  Given your
feedback overall, which I'm very thankful for, I'm hopeful that you see
that this is, indeed, a useful feature that people are asking for and
therefore are willing to spend some time on it, but if the feedback is
that nothing on the page is acceptable to use for the nonce, we can't
put the nonce somewhere else, and we can't change the page format, then
everything else is just moving deck chairs around on the titanic that
has been this effort.

signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-06 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-10-31 17:11:26 +, John Morris wrote:
> > Postgres memory reservations come from multiple sources.
> > 
> >   *   Malloc calls made by the Postgres memory allocators.
> >   *   Static shared memory created by the postmaster at server startup,
> >   *   Dynamic shared memory created by the backends.
> >   *   A fixed amount (1Mb) of “initial” memory reserved whenever a process 
> > starts up.
> > 
> > Each process also maintains an accurate count of its actual memory
> > allocations. The process-private variable “my_memory” holds the total
> > allocations for that process. Since there can be no contention, each process
> > updates its own counters very efficiently.
> 
> I think this will introduce measurable overhead in low concurrency cases and
> very substantial overhead / contention when there is a decent amount of
> concurrency.  This makes all memory allocations > 1MB contend on a single
> atomic.  Massive amount of energy have been spent writing multi-threaded
> allocators that have far less contention than this - the current state is to
> never contend on shared resources on any reasonably common path. This gives
> away one of the few major advantages our process model has away.

We could certainly adjust the size of each reservation to reduce the
frequency of having to hit the atomic.  Specific suggestions about how
to benchmark and see the regression that's being worried about here
would be great.  Certainly my hope has generally been that when we do a
larger allocation, it's because we're about to go do a bunch of work,
meaning that hopefully the time spent updating the atomic is minor
overall.

> The patch doesn't just introduce contention when limiting is enabled - it
> introduces it even when memory usage is just tracked. It makes absolutely no
> sense to have a single contended atomic in that case - just have a per-backend
> variable in shared memory that's updated. It's *WAY* cheaper to compute the
> overall memory usage during querying than to keep a running total in shared
> memory.

Agreed that we should avoid the contention when limiting isn't being
used, certainly easy to do so, and had actually intended to but that
seems to have gotten lost along the way.  Will fix.

Other than that change inside update_global_reservation though, the code
for reporting per-backend memory usage and querying it does work as
you're outlining above inside the stats system.

That said- I just want to confirm that you would agree that querying the
amount of memory used by every backend, to add it all up to enforce an
overall limit, surely isn't something we're going to want to do during
an allocation and that having a global atomic for that is better, right?

> > Pgstat now includes global memory counters. These shared memory counters
> > represent the sum of all reservations made by all Postgres processes. For
> > efficiency, these global counters are only updated when new reservations
> > exceed a threshold, currently 1 Mb for each process. Consequently, the
> > global reservation counters are approximate totals which may differ from the
> > actual allocation totals by up to 1 Mb per process.
> 
> I see that you added them to the "cumulative" stats system - that doesn't
> immediately makes sense to me - what you're tracking here isn't an
> accumulating counter, it's something showing the current state, right?

Yes, this is current state, not an accumulation.

> > The max_total_memory limit is checked whenever the global counters are
> > updated. There is no special error handling if a memory allocation exceeds
> > the global limit. That allocation returns a NULL for malloc style
> > allocations or an ENOMEM for shared memory allocations. Postgres has
> > existing mechanisms for dealing with out of memory conditions.
> 
> I still think it's extremely unwise to do tracking of memory and limiting of
> memory in one patch.  You should work towards and acceptable patch that just
> tracks memory usage in an as simple and low overhead way as possible. Then we
> later can build on that.

Frankly, while tracking is interesting, the limiting is the feature
that's needed more urgently imv.  We could possibly split it up but
there's an awful lot of the same code that would need to be changed and
that seems less than ideal.  Still, we'll look into this.

> > For sanity checking, pgstat now includes the pg_backend_memory_allocation
> > view showing memory allocations made by the backend process. This view
> > includes a scan of the top memory context, so it compares memory allocations
> > reported through pgstat with actual allocations. The two should match.
> 
> Can't you just do this using the existing pg_backend_memory_contexts view?

Not and get a number that you can compare to the local backend number
due to the query itself happening and performing allocations and
creating new contexts.  We wanted to be able to show that we are
accounting correctl

Re: Atomic ops for unlogged LSN

2023-11-07 Thread Stephen Frost

Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Tue, Nov 07, 2023 at 12:57:32AM +, John Morris wrote:
> > I incorporated your suggestions and added a few more. The changes are
> > mainly related to catching potential errors if some basic assumptions
> > aren’t met.
> 
> Hm.  Could we move that to a separate patch?  We've lived without these
> extra checks for a very long time, and I'm not aware of any related issues,
> so I'm not sure it's worth the added complexity.  And IMO it'd be better to
> keep it separate from the initial atomics conversion, anyway.

I do see the value in adding in an Assert though I don't want to throw
away the info about what the recent unlogged LSN was when we crash.  As
that basically boils down to a one-line addition, I don't think it
really needs to be in a separate patch.

> > I found the comment about cache coherency a bit confusing. We are dealing
> > with a single address, so there should be no memory ordering or coherency
> > issues. (Did I misunderstand?) I see it more as a race condition. Rather
> > than merely explaining why it shouldn’t happen, the new version verifies
> > the assumptions and throws an Assert() if something goes wrong.
> 
> I was thinking of the comment for pg_atomic_read_u32() that I cited earlier
> [0].  This comment also notes that pg_atomic_read_u32/64() has no barrier
> semantics.  My interpretation of that comment is that these functions
> provide no guarantee that the value returned is the most up-to-date value.

There seems to be some serious misunderstanding about what is happening
here.  The value written into the control file for unlogged LSN during
normal operation does *not* need to be the most up-to-date value and
talking about it as if it needs to be the absolutely most up-to-date and
correct value is, if anything, adding to the confusion, not reducing
confusion.  The reason to write in anything other than a zero during
these routine checkpoints for unlogged LSN is entirely for forensics
purposes, not because we'll ever actually use the value- during crash
recovery and backup/restore, we're going to reset the unlogged LSN
counter anyway and we're going to throw away all of unlogged table
contents across the entire system.

We only care about the value of the unlogged LSN being correct during
normal shutdown when we're writing out the shutdown checkpoint, but by
that time everything else has been shut down and the value absolutely
should not be changing.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-07 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-11-06 13:02:50 -0500, Stephen Frost wrote:
> > > > The max_total_memory limit is checked whenever the global counters are
> > > > updated. There is no special error handling if a memory allocation 
> > > > exceeds
> > > > the global limit. That allocation returns a NULL for malloc style
> > > > allocations or an ENOMEM for shared memory allocations. Postgres has
> > > > existing mechanisms for dealing with out of memory conditions.
> > > 
> > > I still think it's extremely unwise to do tracking of memory and limiting 
> > > of
> > > memory in one patch.  You should work towards and acceptable patch that 
> > > just
> > > tracks memory usage in an as simple and low overhead way as possible. 
> > > Then we
> > > later can build on that.
> > 
> > Frankly, while tracking is interesting, the limiting is the feature
> > that's needed more urgently imv.
> 
> I agree that we need limiting, but that the tracking needs to be very robust
> for that to be usable.

Is there an issue with the tracking in the patch that you saw?  That's
certainly an area that we've tried hard to get right and to match up to
numbers from the rest of the system, such as the memory context system.

> > We could possibly split it up but there's an awful lot of the same code that
> > would need to be changed and that seems less than ideal.  Still, we'll look
> > into this.
> 
> Shrug. IMO keeping them together just makes it very likely that neither goes
> in.

I'm happy to hear your support for the limiting part of this- that's
encouraging.

> > > > For sanity checking, pgstat now includes the 
> > > > pg_backend_memory_allocation
> > > > view showing memory allocations made by the backend process. This view
> > > > includes a scan of the top memory context, so it compares memory 
> > > > allocations
> > > > reported through pgstat with actual allocations. The two should match.
> > > 
> > > Can't you just do this using the existing pg_backend_memory_contexts view?
> > 
> > Not and get a number that you can compare to the local backend number
> > due to the query itself happening and performing allocations and
> > creating new contexts.  We wanted to be able to show that we are
> > accounting correctly and exactly matching to what the memory context
> > system is tracking.
> 
> I think creating a separate view for this will be confusing for users, without
> really much to show for. Excluding the current query would be useful for other
> cases as well, why don't we provide a way to do that with
> pg_backend_memory_contexts?

Both of these feel very much like power-user views, so I'm not terribly
concerned about users getting confused.  That said, we could possibly
drop this as a view and just have the functions which are then used in
the regression tests to catch things should the numbers start to
diverge.

Having a way to get the memory contexts which don't include the
currently running query might be interesting too but is rather
independent of what this patch is trying to do.  The only reason we
collected up the memory-context info is as a cross-check to the tracking
that we're doing and while the existing memory-context view is just fine
for a lot of other things, it doesn't work for that specific need.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Moving forward with TDE [PATCH v3]

2023-11-07 Thread Stephen Frost

Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Nov  6, 2023 at 09:56:37AM -0500, Stephen Frost wrote:
> > The gist is, without a suggestion of things to try, we're left
> > to our own devices to try and figure out things which might be
> > successful, only to have those turned down too when we come back with
> > them, see [1] for what feels like an example of this.  Given your
> > feedback overall, which I'm very thankful for, I'm hopeful that you see
> > that this is, indeed, a useful feature that people are asking for and
> > therefore are willing to spend some time on it, but if the feedback is
> > that nothing on the page is acceptable to use for the nonce, we can't
> > put the nonce somewhere else, and we can't change the page format, then
> > everything else is just moving deck chairs around on the titanic that
> > has been this effort.
> 
> Yeah, I know the feeling, though I thought XTS was immune enough to
> nonce/LSN reuse that it was acceptable.

Ultimately it depends on the attack vector you're trying to address, but
generally, I think you're right about the XTS tweak reuse not being that
big of a deal.  XTS isn't CTR or GCM.

With FDE (full disk encryption) you're expecting the attacker to steal
the physical laptop, hard drive, etc, generally, and so the downside of
using the same tweak with XTS over and over again isn't that bad (and is
what most FDE solutions do, aiui, by simply using the sector number; we
could do something similar to that by using the relfilenode + block
number) because that re-use is a problem if the attacker is able to see
multiple copies of the same block over time where the block has been
encrypted with different data but the same key and tweak.

Using the LSN actually is better than what the FDE solutions do because
the LSN varies much more often than the sector number.  Sure, it doesn't
change with every write and maybe an attacker could glean something from
that, but that's quite narrow.  The downside from the LSN based approach
with XTS is probably more that using the LSN means that we can't encrypt
the LSN itself and that is a leak too- but then again, we leak that
through the simple WAL filenames too, to some extent, so it doesn't
strike me as a huge issue.

XTS as a block cipher doesn't suffer from the IV-reuse issue that you
have with streaming ciphers where the same key+IV and different data
leads to being able to trivally retrive the plaintext though and I worry
that maybe that's what people were thinking.

The README and comments I don't think were terribly clear on this and I
think may have even been from back when CTR was being considered, where
IV reuse would have resulted in plaintext being trivially available.

> What got me sunk on the feature was the complexity of adding temporary
> file encryption support and that tipped the scales in the negative for
> me in community value of the feature vs. added complexity. (Yeah, I used
> a Titanic reference in the last sentence. ;-) )  However, I am open to
> the community value and complexity values changing over time.  My blog
> post on the topic:

We do need to address the temporary file situation too and we do have a
bit of an issue that how we deal with temporary files today in PG isn't
very consistent and there's too many ways to do that.  There's a patch
that works on that, though it has some bitrot that we're working on
addressing currently.

There is value in simply fixing that situation wrt temporary file
management independent of encryption, though of course then encryption
of those temporary files becomes much simpler.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen 
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will 
> > be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This 
> > data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime 
> > based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing 
> > clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
> 
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...
> 
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide info needed be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what lead to this
design.

Getting a consensus on if that's a requirement or not would definitely
be really helpful.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost

Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:
>
>> * Andres Freund (and...@anarazel.de) wrote:
>> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
>> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
>> > > From: David Christensen 
>> > > Date: Tue, 9 May 2023 16:56:15 -0500
>> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>> > >
>> > > This space is reserved for extended data on the Page structure which
>> will be ultimately used for
>> > > encrypted data, extended checksums, and potentially other things.
>> This data appears at the end of
>> > > the Page, after any `pd_special` area, and will be calculated at
>> runtime based on specific
>> > > ControlFile features.
>> > >
>> > > No effort is made to ensure this is backwards-compatible with
>> existing clusters for `pg_upgrade`, as
>> > > we will require logical replication to move data into a cluster with
>> > > different settings here.
>> >
>> > The first part of the last paragraph makes it sound like pg_upgrade
>> won't be
>> > supported across this commit, rather than just between different
>> settings...
>>
>
> Yeah, that's vague, but you picked up on what I meant.
>
>
>> > I think as a whole this is not an insane idea. A few comments:
>>
>> Thanks for all the feedback!
>>
>> > - Why is it worth sacrificing space on every page to indicate which
>> features
>> >   were enabled?  I think there'd need to be some convincing reasons for
>> >   introducing such overhead.
>>
>> In conversations with folks (my memory specifically is a discussion with
>> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
>> there was a pretty strong push that a page should be able to 'stand
>> alone' and not depend on something else (eg: pg_control, or whatever) to
>> provide info needed be able to interpret the page.  For my part, I don't
>> have a particularly strong feeling on that, but that's what lead to this
>> design.
>>
>
> Unsurprisingly, I agree that it's useful to keep these features on the
> page itself; from a forensic standpoint this seems much easier to interpret
> what is happening here, as well it would allow you to have different
> features on a given page or type of page depending on need.  The initial
> patch utilizes pg_control to store the cluster page features, but there's
> no reason it couldn't be dependent on fork/page type or stored in
> pg_tablespace to utilize different features.
>

When it comes to authenticated encryption, it’s also the case that it’s
unclear what value the checksum field has, if any…  it’s certainly not
directly needed as a checksum, as the auth tag is much better for the
purpose of seeing if the page has been changed in some way. It’s also not
big enough to serve as an auth tag per NIST guidelines regarding the size
of the authenticated data vs. the size of the tag.  Using it to indicate
what features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen

>

Re: [PATCH v20] GSSAPI encryption support

2019-02-11 Thread Stephen Frost

Greetings Robbie,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Attached please find version 20 of the GSSAPI encryption support.  This
> has been rebased onto master (thanks Stephen for calling out ab69ea9).
> Other changes since v19 from Stephen's review:
> 
> - About 100 lines of new comments

Thanks for that.

> - pgindent run over code (only the stuff I'm changing; it reports other
>   problems on master, but if I understand correctly, that's not on me to
>   fix here)

That's correct.

> Thanks for the feedback!  And speaking of feedback, this patch is in the
> "Needs Review" state so if people have additional feedback, that would
> be appreciated.  I've CC'd people who I can remember reviewing before;
> they should of course feel no obligation to review again if time
> commitments/interests do not allow.

I've looked over this again and have been playing with it off-and-on for
a while now.  One thing I was actually looking at implementing before we
push to commit this was to have similar information to what SSL provides
on connection, specifically what printSSLInfo() prints by way of
cipher, bits, etc for the client-side and then something like
pg_stat_ssl for the server side.

I went ahead and added a printGSSENCInfo(), and then PQgetgssSASLSSF(),
which calls down into gss_inquire_sec_context_by_oid() for
GSS_C_SEC_CONTEXT_SASL_SSF to get the bits (which also required
including gssapi_ext.h), and now have psql producing:

---
psql (12devel)
GSS Encrypted connection (bits: 256)
Type "help" for help.
---

I was curious though- is there a good way to get at the encryption type
used..?  I haven't had much luck in reading the RFPs and poking around,
but perhaps I'm just not looking in the right place.  Also, what
information do you think would be particularly useful on the server
side?  I was thinking that the princ used to authenticate, the
encryption type/cipher, and bits used, at least, would be meaningful.
I'm also guessing that we may need to add something to make these
functions not fail on older systems which don't provide the SASL SSF
OID..?  I haven't looked into that yet but we should.

I don't think these things are absolutely required to push the patch
forward, just to be clear, but it's helping my confidence level, at
least, and would make it closer to on-par with our current SSL
implementation.  They also seem, well, reasonably straight-forward to
add and quite useful.

I've attached my WIP patch for adding the bits, patched over-top of your
v20 (would be neat if there was a way to tell cfbot to ignore it...).

Thoughts?

Thanks!

Stephen
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
new file mode 100644
index ab259c4..2599330
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
*** static void print_with_linenumbers(FILE
*** 159,164 
--- 159,165 
  static void minimal_error_message(PGresult *res);

  static void printSSLInfo(void);
+ static void printGSSENCInfo(void);
  static bool printPsetInfo(const char *param, struct printQueryOpt *popt);
  static char *pset_value_string(const char *param, struct printQueryOpt *popt);

*** exec_command_conninfo(PsqlScanState scan
*** 621,626 
--- 622,628 
  		   db, PQuser(pset.db), host, PQport(pset.db));
  			}
  			printSSLInfo();
+ 			printGSSENCInfo();
  		}
  	}

*** connection_warnings(bool in_startup)
*** 3184,3189 
--- 3186,3192 
  		checkWin32Codepage();
  #endif
  		printSSLInfo();
+ 		printGSSENCInfo();
  	}
  }

*** printSSLInfo(void)
*** 3216,3221 
--- 3219,3244 
  		   (compression && strcmp(compression, "off") != 0) ? _("on") : _("off"));
  }

+ /*
+  * printGSSENCInfo
+  *
+  * Prints information about the current GSS-Encrypted connection, if GSS
+  * encryption is in use.
+  */
+ static void
+ printGSSENCInfo(void)
+ {
+ 	const char *bits;
+ 
+ 	if (!PQgssEncInUse(pset.db))
+ 		return;	/* no GSS Encryption */
+ 
+ 	bits = PQgetgssSASLSSF(pset.db);
+ 
+ 	printf(_("GSS Encrypted connection (bits: %s)\n"),
+ 		   bits ? bits : _("unknown"));
+ }
+ 

  /*
   * checkWin32Codepage
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
new file mode 100644
index cc9ee9c..3db936f
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQresultVerboseErrorMessage 171
*** 174,176 
--- 174,179 
  PQencryptPasswordConn 172
  PQresultMemorySize173
  PQhostaddr174
+ PQgssEncInUse 175
+ PQgetgssctx   176
+ PQgetgssSASLSSF   177
diff --git a/src/interfaces/libpq/fe-secure-gssapi.c b/src/interfaces/libpq/fe-secure-gssapi.c
new file mode 100644
index 5d9c1f8..df63882
*** a/src/interfaces/libpq/fe-secure-gssapi.c
--- b/src/interfaces/libpq/fe-secure-gssapi.c
***
*** 17,22 
--- 17,24 
  #include "libpq-int.h"
  #include "fe-gssapi-common.h"

+ #include "port/pg_bswap.h"
+ 
  /*

Re: [PATCH v20] GSSAPI encryption support

2019-02-11 Thread Stephen Frost

Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> > * Robbie Harwood (rharw...@redhat.com) wrote:
> >> Attached please find version 20 of the GSSAPI encryption support.
> >> This has been rebased onto master (thanks Stephen for calling out
> >> ab69ea9).
> >
> > I've looked over this again and have been playing with it off-and-on
> > for a while now.  One thing I was actually looking at implementing
> > before we push to commit this was to have similar information to what
> > SSL provides on connection, specifically what printSSLInfo() prints by
> > way of cipher, bits, etc for the client-side and then something like
> > pg_stat_ssl for the server side.
> >
> > I went ahead and added a printGSSENCInfo(), and then
> > PQgetgssSASLSSF(), which calls down into
> > gss_inquire_sec_context_by_oid() for GSS_C_SEC_CONTEXT_SASL_SSF to get
> > the bits (which also required including gssapi_ext.h), and now have
> > psql producing:
> 
> Neat!  Two things here:
> 
> First, because this is a SASL extension, it requires existing mechanism
> integration with SASL.  For instance, NTLM (through gss-ntlmssp) doesn't
> implement it.  PQgetgssSASLSSF() logs an error message here, which I
> don't think is quite right - probably you should only log an error
> message if you don't get back GSS_S_COMPLETE or GSS_S_UNAVAILABLE.

Ok, that's easy enough to fix.

> (NTLM is an example here; I cannot in good conscience recommend its use,
> and neither does Microsoft.)

Agreed!

> Also, this is an MIT-specific extension: Heimdal doesn't support it.

Is that simply because they've not gotten around to it..?  From what I
saw, this approach is in an accepted RFC, so if they want to implement
it, they could do so..?

> While I (a krb5 developer) am fine with a hard MIT dependency, the
> project doesn't currently have this.  (And if we went that road, I'd
> like to use gss_acquire_cred_from() instead of the setenv() in
> be-secure-gssapi.c.)

We certainly don't want a hard MIT dependency, but if it's following an
RFC and we can gracefully fall-back if it isn't available then I think
it's acceptable.  If there's a better approach which would work with
both MIT and Heimdal and ideally is defined through an RFC, that'd be
better, but I'm guessing there isn't...

> > ---
> > psql (12devel)
> > GSS Encrypted connection (bits: 256)
> > Type "help" for help.
> > ---
> >
> > I was curious though- is there a good way to get at the encryption
> > type used..?  I haven't had much luck in reading the RFPs and poking
> > around, but perhaps I'm just not looking in the right place.
> 
> It's something of a layering issue.  GSSAPI might not be using Kerberos,
> and even if it is, Kerberos has algorithm agility.

Right, I'm aware that Kerberos could be using different encryption
algorithms, similar to SSL and that GSSAPI itself would allow for
different algorithms through different mechanisms- but the question is,
how do we know what the encryption algorithm ultimately used, regardless
of the rest, is?

> A call to gss_inquire_context() produces mech type, so you could print
> something like "GSS Encrypted connection (Kerberos 256 bits)" or "GSS
> encrypted connection (NTLM)".  I think this'd be pretty reasonable.

I'm... not impressed with that approach, since "Kerberos" as an
encryption algorithm doesn't mean squat- that could be DES or RC4 for
all we know.

> For Kerberos-specific introspection, MIT krb5 and Heimdal both support
> an interface called lucid contexts that allows this kind of
> introspection (and some other gross layering violations) for use with
> the kernel.  Look at gssapi_krb5.h (it's called that in both).  I don't
> think it's worth supporting rfc1964 at this point (it's 1DES-only).

I agree that there's no sense in supporting 1DES-only.

I also, frankly, don't agree with the characterization that wanting to
know what the agreed-upon encryption algorithm is constitutes a gross
layering violation, but that's really an independent discussion that we
can have in a pub somewhere. ;)

To get where I'd like us to be, however, it sounds like we need to:

a) determine if we've got encryption (that's easy, assuming I've done it
   correctly so far)

b) Figure out if the encryption is being provided by Kerberos (using
   gss_inquire_context() would do that, right?)

c) If we're using Kerberos, use the lucid contexts to ask what the
   actual encryption algorithm used is

That's a bit obnoxious but it at least seems doable.  I don't

Re: [PATCH v20] GSSAPI encryption support

2019-02-12 Thread Stephen Frost

Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> > * Robbie Harwood (rharw...@redhat.com) wrote:
> >> Stephen Frost  writes:
> >>> * Robbie Harwood (rharw...@redhat.com) wrote:
> >>>
> >>>> Attached please find version 20 of the GSSAPI encryption support.
> >>>> This has been rebased onto master (thanks Stephen for calling out
> >>>> ab69ea9).
> >>>
> >>> I've looked over this again and have been playing with it off-and-on
> >>> for a while now.  One thing I was actually looking at implementing
> >>> before we push to commit this was to have similar information to
> >>> what SSL provides on connection, specifically what printSSLInfo()
> >>> prints by way of cipher, bits, etc for the client-side and then
> >>> something like pg_stat_ssl for the server side.
> >>>
> >>> I went ahead and added a printGSSENCInfo(), and then
> >>> PQgetgssSASLSSF(), which calls down into
> >>> gss_inquire_sec_context_by_oid() for GSS_C_SEC_CONTEXT_SASL_SSF to
> >>> get the bits (which also required including gssapi_ext.h), and now
> >>> have psql producing:
> >> 
> >> Also, this is an MIT-specific extension: Heimdal doesn't support it.
> >
> > Is that simply because they've not gotten around to it..?  From what I
> > saw, this approach is in an accepted RFC, so if they want to implement
> > it, they could do so..?
> 
> Yes.  Heimdal is less active these days than MIT; they were dormant for
> a while, though are active again.  They do have an issue open to track
> this feature: https://github.com/heimdal/heimdal/issues/400

Ok, good, when they add it then hopefully it'll be picked up
more-or-less 'for free'.

> >> While I (a krb5 developer) am fine with a hard MIT dependency, the
> >> project doesn't currently have this.  (And if we went that road, I'd
> >> like to use gss_acquire_cred_from() instead of the setenv() in
> >> be-secure-gssapi.c.)
> >
> > We certainly don't want a hard MIT dependency, but if it's following an
> > RFC and we can gracefully fall-back if it isn't available then I think
> > it's acceptable.  If there's a better approach which would work with
> > both MIT and Heimdal and ideally is defined through an RFC, that'd be
> > better, but I'm guessing there isn't...
> 
> While I think the concept of SASL SSF is standardized, I don't think the
> semantics of this OID are - the OID itself is in the MIT gssapi
> extension space.

We can adjust for that pretty easily, presumably, if another OID ends up
being assigned (though if one already exists, isn't it something that
Heimdal, for example, could potentially decide to just support..?).

> As a historical note, the reason this interface exists in krb5 is of
> course for SASL.  In particular, cyrus was reporting treating the SSF of
> all krb5 types as if they were DES.  So while technically "assume DES"
> could be a fallback... we probably shouldn't do that.

Agreed, we shouldn't do that.

> >> A call to gss_inquire_context() produces mech type, so you could
> >> print something like "GSS Encrypted connection (Kerberos 256 bits)"
> >> or "GSS encrypted connection (NTLM)".  I think this'd be pretty
> >> reasonable.
> >
> > I'm... not impressed with that approach, since "Kerberos" as an
> > encryption algorithm doesn't mean squat- that could be DES or RC4 for
> > all we know.
> 
> This is a fair concern.

I will say that I'm alright with listing Kerberos in the message if we
think it's useful- but we should also list the actual encryption
algorithm and bits, if possible (which it sounds like it is, so that's
definitely good).

> > To get where I'd like us to be, however, it sounds like we need to:
> >
> > a) determine if we've got encryption (that's easy, assuming I've done it
> >correctly so far)
> 
> Yes.
> 
> > b) Figure out if the encryption is being provided by Kerberos (using
> >gss_inquire_context() would do that, right?)
> 
> Right - check and log mech_out there, then proceed if it's
> gss_mech_krb5.  Probably not worth handling _old and _wrong, especially
> since Heimdal doesn't have them and they shouldn't be in use anyway.
> 
> > c) If we're using Kerberos, use the lucid contexts to ask what the
> >actual encryption algorithm used is
> 
> Yes.  This'll involve some i

Re: WAL insert delay settings

2019-02-13 Thread Stephen Frost

Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.

Agreed, though I think VACUUM should certainly be included in this.

I'm all for the idea though it seems like a different approach is needed
based on the down-thread discussion.  Ultimately, having a way to have
these activities happen without causing issues for replicas is a great
idea and would definitely address a practical issue that a lot of people
run into.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-14 Thread Stephen Frost

Greetings,

On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 14/02/2019 11:03, Tomas Vondra wrote:
> > But if you add extra sleep() calls somewhere (say because there's also
> > limit on WAL throughput), it will affect how fast VACUUM works in
> > general. Yet it'll continue with the cost-based throttling, but it will
> > never reach the limits. Say you do another 20ms sleep somewhere.
> > Suddenly it means it only does 25 rounds/second, and the actual write
> > limit drops to 4 MB/s.
>
> I think at a first approximation, you probably don't want to add WAL
> delays to vacuum jobs, since they are already slowed down, so the rate
> of WAL they produce might not be your first problem.  The problem is
> more things like CREATE INDEX CONCURRENTLY that run at full speed.
>
> That leads to an alternative idea of expanding the existing cost-based
> vacuum delay system to other commands.
>
> We could even enhance the cost system by taking WAL into account as an
> additional factor.


This is really what I was thinking- let’s not have multiple independent
ways of slowing down maintenance and similar jobs to reduce their impact on
I/o to the heap and to WAL.

Thanks!

Stephen

Re: WAL insert delay settings

2019-02-15 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> > peter.eisentr...@2ndquadrant.com> wrote:
> > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > general. Yet it'll continue with the cost-based throttling, but it will
> > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > limit drops to 4 MB/s.
> > >
> > > I think at a first approximation, you probably don't want to add WAL
> > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > of WAL they produce might not be your first problem.  The problem is
> > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > >
> > > That leads to an alternative idea of expanding the existing cost-based
> > > vacuum delay system to other commands.
> > >
> > > We could even enhance the cost system by taking WAL into account as an
> > > additional factor.
> > 
> > This is really what I was thinking- let’s not have multiple independent
> > ways of slowing down maintenance and similar jobs to reduce their impact on
> > I/o to the heap and to WAL.
> 
> I think that's a bad idea. Both because the current vacuum code is
> *terrible* if you desire higher rates because both CPU and IO time
> aren't taken into account. And it's extremely hard to control.  And it
> seems entirely valuable to be able to limit the amount of WAL generated
> for replication, but still try go get the rest of the work done as
> quickly as reasonably possible wrt local IO.

I'm all for making improvements to the vacuum code and making it easier
to control.

I don't buy off on the argument that there is some way to segregate the
local I/O question from the WAL when we're talking about these kinds of
operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
do I think we do our users a service by giving them independent knobs
for both that will undoubtably end up making it more difficult to
understand and control what's going on overall.

Even here, it seems, you're arguing that the existing approach for
VACUUM is hard to control; wouldn't adding another set of knobs for
controlling the amount of WAL generated by VACUUM make that worse?  I
have a hard time seeing how it wouldn't.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Regarding participating in GSOC 2019 with PostgreSQL

2019-02-15 Thread Stephen Frost

Greetings,

* Abhishek Agrawal (aagrawal...@gmail.com) wrote:
> I am interested in doing GSOC with PostgreSQL if you guys are applying for it 
> this year. If anyone could direct me to proper links or some channel to 
> prepare for the same then it will be really helpful. I have some questions 
> like: What will be the projects this year, What are the requirement for the 
> respective projects, who will be the mentors, etc.

We have applied but there is no guarantee that we'll be accepted.
Google has said they're announcing on Feb 26 the projects which have
been accepted.  Also, it's up to Google to decide how many slots we are
given to have GSoC students, and we also won't know that until they tell
us.

I'd suggest you wait until the announcement is made by Google, at which
time they'll also tell us how many slots we have, and will provide
direction to prospective students regarding the next steps.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: [PATCH v20] GSSAPI encryption support

2019-02-18 Thread Stephen Frost

Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Andres Freund  writes:
> 
> > On 2018-12-18 14:12:46 -0500, Robbie Harwood wrote:
> >
> >> Subject: [PATCH] libpq GSSAPI encryption support
> >
> > Could some of these be split into separate patches that could be more
> > eagerly merged? This is a somewhat large patch...
> 
> What splits do you propose?  (It's been a single patch since v3 as per
> your request in id:20151003161810.gd30...@alap3.anarazel.de and Michael
> Paquier's request in
> id:cab7npqtjd-ttrm1vu8p55_4kkvedx8dfz9v1d_xsqqvr_xa...@mail.gmail.com).

Yeah, I tend to agree with the earlier comments that this can be a
single patch.  There's a little bit that could maybe be split out (when
it comes to the bits that are mostly just moving things around and
removing things) but I'm not convinced it's necessary or, really,
particularly worth it.

> >> diff --git a/src/interfaces/libpq/fe-secure.c 
> >> b/src/interfaces/libpq/fe-secure.c
> >> index a06fc7dc82..f4f196e3b4 100644
> >> --- a/src/interfaces/libpq/fe-secure.c
> >> +++ b/src/interfaces/libpq/fe-secure.c
> >> @@ -220,6 +220,13 @@ pqsecure_read(PGconn *conn, void *ptr, size_t len)
> >>n = pgtls_read(conn, ptr, len);
> >>}
> >>else
> >> +#endif
> >> +#ifdef ENABLE_GSS
> >> +  if (conn->gssenc)
> >> +  {
> >> +  n = pg_GSS_read(conn, ptr, len);
> >> +  }
> >> +  else
> >>  #endif
> >>{
> >>n = pqsecure_raw_read(conn, ptr, len);
> >> @@ -287,7 +294,7 @@ pqsecure_raw_read(PGconn *conn, void *ptr, size_t len)
> >>   * to determine whether to continue/retry after error.
> >>   */
> >>  ssize_t
> >> -pqsecure_write(PGconn *conn, const void *ptr, size_t len)
> >> +pqsecure_write(PGconn *conn, void *ptr, size_t len)
> >>  {
> >>ssize_t n;
> >>  
> >> @@ -297,6 +304,13 @@ pqsecure_write(PGconn *conn, const void *ptr, size_t 
> >> len)
> >>n = pgtls_write(conn, ptr, len);
> >>}
> >>else
> >> +#endif
> >> +#ifdef ENABLE_GSS
> >> +  if (conn->gssenc)
> >> +  {
> >> +  n = pg_GSS_write(conn, ptr, len);
> >> +  }
> >> +  else
> >>  #endif
> >>{
> >>n = pqsecure_raw_write(conn, ptr, len);
> >
> > Not a fan of this. Seems like we'll grow more and more such branches
> > over time?  Wouldn't the right thing be to have callbacks in PGconn
> > (and similarly in the backend)?
> 
> Is that really a problem?  Each branch is only seven lines, which is a
> lot less than adding callback support will be.  And we'll only get a new
> branch when we want to support a new encryption protocol - unlike
> authentication, there aren't too many of those.

Considering this is only the second encryption protocol in the project's
lifetime, I agree that using callbacks would be overkill here.  What
other encryption protocols are you thinking we would be adding here?  I
think most would be quite hard-pressed to name a second general-purpose
one beyond TLS/SSL, and those who can almost certainly would say GSS,
but a third?  Certainly OpenSSH has its own, but it's not intended to be
general purpose and I can't see us adding support for OpenSSH's
encryption protocol to PG.

Adding support for different libraries which implement TLS/SSL wouldn't
be changing this part of the code, if that was perhaps what was being
considered..?

> > Seems like if that's done reasonably it'd also make integration of
> > compression easier, because that could just layer itself between
> > encryption and wire?

Building a general-purpose filtering mechanism for streams is actually
quite a bit of work (we did it for pgBackRest, feel free to check it
out- and all that code is in C these days too) and I don't think it's
really necessary when the options are "optionally compress and
optionally encrypt using one of these two protocols".  I don't see any
reason why we'd need to, say, compress a stream multiple times, or
encrypt it multiple times (like with TLS first and then GSS).

> The current interface would allow a compress/decompress call in a way
> that makes sense to me (here for write, ignoring ifdefs):

[...]

> (pqsecure_read would look similarly, with decompression as the last step
> instead of the first.)

That certainly seems reasonable to me.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-18 Thread Stephen Frost

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 2/15/19 7:41 PM, Andres Freund wrote:
> > On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
> >> * Andres Freund (and...@anarazel.de) wrote:
> >>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> >>>> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> >>>> peter.eisentr...@2ndquadrant.com> wrote:
> >>>>> On 14/02/2019 11:03, Tomas Vondra wrote:
> >>>>>> But if you add extra sleep() calls somewhere (say because there's also
> >>>>>> limit on WAL throughput), it will affect how fast VACUUM works in
> >>>>>> general. Yet it'll continue with the cost-based throttling, but it will
> >>>>>> never reach the limits. Say you do another 20ms sleep somewhere.
> >>>>>> Suddenly it means it only does 25 rounds/second, and the actual write
> >>>>>> limit drops to 4 MB/s.
> >>>>>
> >>>>> I think at a first approximation, you probably don't want to add WAL
> >>>>> delays to vacuum jobs, since they are already slowed down, so the rate
> >>>>> of WAL they produce might not be your first problem.  The problem is
> >>>>> more things like CREATE INDEX CONCURRENTLY that run at full speed.
> >>>>>
> >>>>> That leads to an alternative idea of expanding the existing cost-based
> >>>>> vacuum delay system to other commands.
> >>>>>
> >>>>> We could even enhance the cost system by taking WAL into account as an
> >>>>> additional factor.
> >>>>
> >>>> This is really what I was thinking- let’s not have multiple independent
> >>>> ways of slowing down maintenance and similar jobs to reduce their impact 
> >>>> on
> >>>> I/o to the heap and to WAL.
> >>>
> >>> I think that's a bad idea. Both because the current vacuum code is
> >>> *terrible* if you desire higher rates because both CPU and IO time
> >>> aren't taken into account. And it's extremely hard to control.  And it
> >>> seems entirely valuable to be able to limit the amount of WAL generated
> >>> for replication, but still try go get the rest of the work done as
> >>> quickly as reasonably possible wrt local IO.
> >>
> >> I'm all for making improvements to the vacuum code and making it easier
> >> to control.
> >>
> >> I don't buy off on the argument that there is some way to segregate the
> >> local I/O question from the WAL when we're talking about these kinds of
> >> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
> >> do I think we do our users a service by giving them independent knobs
> >> for both that will undoubtably end up making it more difficult to
> >> understand and control what's going on overall.
> >>
> >> Even here, it seems, you're arguing that the existing approach for
> >> VACUUM is hard to control; wouldn't adding another set of knobs for
> >> controlling the amount of WAL generated by VACUUM make that worse?  I
> >> have a hard time seeing how it wouldn't.
> > 
> > I think it's because I see them as, often, having two largely 
> > independent use cases. If your goal is to avoid swamping replication 
> > with WAL, you don't necessarily care about also throttling VACUUM
> > (or REINDEX, or CLUSTER, or ...)'s local IO.  By forcing to combine
> > the two you just make the whole feature less usable.
> 
> I agree with that.

I can agree that they're different use-cases but one does end up
impacting the other and that's what I had been thinking about from the
perspective of "if we could proivde just one knob for this."

VACUUM is a pretty good example- if we're dirtying a page with VACUUM
then we're also writing that page into the WAL (at least, if the
relation isn't unlogged).  Now, VACUUM does do other things (such as
read pages), as does REINDEX or CLUSTER, so maybe there's a way to think
about this feature in those terms- cost for doing local read i/o, vs.
cost for doing write i/o (to both heap and WAL) vs. cost for doing
"local" write i/o (just to heap, ie: unlogged tables).

What I was trying to say I didn't like previously was the idea of having
a "local i/o write" cost for VACUUM, to a logged table, *and* a "WAL
write" cost for VACUUM, since those are very tightly co

Re: Some thoughts on NFS

2019-02-19 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Feb 19, 2019 at 2:03 AM Thomas Munro  wrote:
> > How can we achieve that, without writing our
> > own NFS client?
> 
> 
> 
> Instead of writing our own NFS client, how about writing our own
> network storage protocol?  Imagine a stripped-down postmaster process
> running on the NFS server that essentially acts as a block server.
> Through some sort of network chatter, it can exchange blocks with the
> real postmaster running someplace else.  The mini-protocol would
> contain commands like READ_BLOCK, WRITE_BLOCK, EXTEND_RELATION,
> FSYNC_SEGMENT, etc. - basically whatever the relevant operations at
> the smgr layer are.  And the user would see the remote server as a
> tablespace mapped to a special smgr.

In reading this, I honestly thought somewhere along the way you'd say
"and then you have WAL, so just run a replica and forget this whole
network filesystem business."

The practical issue of WAL replay being single-process is an issue
though.  It seems like your mini-protocol was going in a direction that
would have allowed multiple processes to be working between the PG
system and the storage system concurrently, avoiding the single-threaded
issue with WAL but also making it such that the replica wouldn't be able
to be used for read-only queries (without some much larger changes
happening anyway).  I'm not sure the use-case is big enough but it does
seem to me that we're getting to a point where people are generating
enough WAL with systems that they care an awful lot about that they
might be willing to forgo having the ability to perform read-only
queries on the replica as long as they know that they can flip traffic
over to the replica without losing data.

So, what this all really boils down to is that I think this idea of a
different protocol that would allow PG to essentially replicate to a
remote system, or possibly run entirely off of the remote system without
any local storage, could be quite interesting in some situations.

On the other hand, I pretty much agree 100% with Magnus that the NFS
use-case is almost entirely because someone bought a big piece of
hardware that talks NFS and no, you don't get to run whatever code you
want on it.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Compressed TOAST Slicing

2019-02-20 Thread Stephen Frost

Greetings,

* Paul Ramsey (pram...@cleverelephant.ca) wrote:
> On Wed, Feb 20, 2019 at 10:50 AM Daniel Verite  
> wrote:
> >
> > Paul Ramsey wrote:
> >
> > > Oddly enough, I couldn't find many/any things that were sensitive to
> > > left-end decompression. The only exception is "LIKE this%" which
> > > clearly would be helped, but unfortunately wouldn't be a quick
> > > drop-in, but a rather major reorganization of the regex handling.
> >
> > What about starts_with(string, prefix)?
> >
> > text_starts_with(arg1,arg2) in varlena.c does a full decompression
> > of  arg1 when it could limit itself to the length of the smaller arg2:
> 
> Nice catch, I didn't find that one as it's not user visible, seems to
> be only called in spgist (!!)
> ./backend/access/spgist/spgtextproc.c:
> DatumGetBool(DirectFunctionCall2(text_starts_with
> 
> Thanks, I'll add that.

That sounds good to me, I look forward to an updated patch.  As Andres
mentioned, he and I chatted a bit about this approach vs the iterator
approach at FOSDEM and convinced me that this is worthwhile to do even
if we might add an iterator approach later- which also seems to be the
consensus of this thread, so once the patch has been updated to catch
this case, I'll review it (again) with an eye on committing it soon.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 2/19/19 8:40 PM, Andres Freund wrote:
> > On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> >> On 2/19/19 8:22 PM, Andres Freund wrote:
> >>> And my main point is that even if you implement a proper bytes/sec limit
> >>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> >>> meaningfully more confusing than right now.
> >>
> >> So, why not to modify autovacuum to also use this approach? I wonder if
> >> the situation there is more complicated because of multiple workers
> >> sharing the same budget ...
> > 
> > I think the main reason is that implementing a scheme like this for WAL
> > rate limiting isn't a small task, but it'd be aided by the fact that
> > it'd probably not on by default, and that there'd not be any regressions
> > because the behaviour didn't exist before. I contrast, people are
> > extremely sensitive to autovacuum behaviour changes, even if it's to
> > improve autovacuum. I think it makes more sense to build the logic in a
> > lower profile case first, and then migrate autovacuum over it. Even
> > leaving the maturity issue aside, reducing the scope of the project into
> > more bite sized chunks seems to increase the likelihood of getting
> > anything substantially.
> 
> Maybe.

I concur with that 'maybe'. :)

> I guess the main thing I'm advocating for here is to aim for a unified
> throttling approach, not multiple disparate approaches interacting in
> ways that are hard to understand/predict.

Yes, agreed.

> The time-based approach you described looks fine, an it's kinda what I
> was imagining (and not unlike the checkpoint throttling). I don't think
> it'd be that hard to tweak autovacuum to use it too, but I admit I have
> not thought about it particularly hard and there's stuff like per-table
> settings which might make it more complex.

When reading Andres' proposal, I was heavily reminded of how checkpoint
throttling is handled and wondered if there might be some way to reuse
or generalize that existing code/technique/etc and make it available to
be used for WAL, and more-or-less any/every other bulk operation (CREATE
INDEX, REINDEX, CLUSTER, VACUUM...).

> So maybe doing it for WAL first makes sense. But I still think we need
> to have at least a rough idea how it interacts with the existing
> throttling and how it will work in the end.

Well, it seems like Andres explained how it'll work with the existing
throttling, no?  As for how all of this will work in the end, that's a
good question but also a rather difficult one to answer, I suspect.

Just to share a few additional thoughts after pondering this for a
while, but the comment Andres made up-thread really struck a chord- we
don't necessairly want to throttle anything, what we'd really rather do
is *prioritize* things, whereby foreground work (regular queries and
such) have a higher priority than background/bulk work (VACUUM, REINDEX,
etc) but otherwise we use the system to its full capacity.  We don't
actually want to throttle a VACUUM run any more than a CREATE INDEX, we
just don't want those to hurt the performance of regular queries that
are happening.

The other thought I had was that doing things on a per-table basis, in
particular, isn't really addressing the resource question appropriately.
WAL is relatively straight-forward and independent of a resource from
the IO for the heap/indexes, so getting an idea from the admin of how
much capacity they have for WAL makes sense.  When it comes to the
capacity for the heap/indexes, in terms of IO, that really goes to the
underlying storage system/channel, which would actually be a tablespace
in properly set up environments (imv anyway).

Wrapping this up- it seems unlikely that we're going to get a
priority-based system in place any time particularly soon but I do think
it's worthy of some serious consideration and discussion about how we
might be able to get there.  On the other hand, if we can provide a way
for the admin to say "these are my IO channels (per-tablespace values,
plus a value for WAL), here's what their capacity is, and here's how
much buffer for foreground work I want to have (again, per IO channel),
so, PG, please arrange to not use more than 'capacity-buffer' amount of
resources for background/bulk tasks (per IO channel)" then we can at
least help them address the issue that foreground tasks are being
stalled or delayed due to background/bulk work.  This approach means
that they won't be utilizing the system to its full capacity, but
they'll know that and they'll know that it's because, for them, it's
more important that they have that low latency for foreground tasks.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 2/20/19 10:43 PM, Stephen Frost wrote:
> > Just to share a few additional thoughts after pondering this for a
> > while, but the comment Andres made up-thread really struck a chord- we
> > don't necessairly want to throttle anything, what we'd really rather do
> > is *prioritize* things, whereby foreground work (regular queries and
> > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > etc) but otherwise we use the system to its full capacity.  We don't
> > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > just don't want those to hurt the performance of regular queries that
> > are happening.
> 
> I think you're forgetting the motivation of this very patch was to
> prevent replication lag caused by a command generating large amounts of
> WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> do with prioritization or foreground/background split.
> 
> I'm not arguing against ability to prioritize stuff, but I disagree it
> somehow replaces throttling.

Why is replication lag an issue though?  I would contend it's an issue
because with sync replication, it makes foreground processes wait, and
with async replication, it makes the actions of foreground processes
show up late on the replicas.

If the actual WAL records for the foreground processes got priority and
were pushed out earlier than the background ones, that would eliminate
both of those issues with replication lag.  Perhaps there's other issues
that replication lag cause but which aren't solved by prioritizing the
actual WAL records that you care about getting to the replicas faster,
but if so, I'd like to hear what those are.

> > The other thought I had was that doing things on a per-table basis, in
> > particular, isn't really addressing the resource question appropriately.
> > WAL is relatively straight-forward and independent of a resource from
> > the IO for the heap/indexes, so getting an idea from the admin of how
> > much capacity they have for WAL makes sense.  When it comes to the
> > capacity for the heap/indexes, in terms of IO, that really goes to the
> > underlying storage system/channel, which would actually be a tablespace
> > in properly set up environments (imv anyway).
> > 
> > Wrapping this up- it seems unlikely that we're going to get a
> > priority-based system in place any time particularly soon but I do think
> > it's worthy of some serious consideration and discussion about how we
> > might be able to get there.  On the other hand, if we can provide a way
> > for the admin to say "these are my IO channels (per-tablespace values,
> > plus a value for WAL), here's what their capacity is, and here's how
> > much buffer for foreground work I want to have (again, per IO channel),
> > so, PG, please arrange to not use more than 'capacity-buffer' amount of
> > resources for background/bulk tasks (per IO channel)" then we can at
> > least help them address the issue that foreground tasks are being
> > stalled or delayed due to background/bulk work.  This approach means
> > that they won't be utilizing the system to its full capacity, but
> > they'll know that and they'll know that it's because, for them, it's
> > more important that they have that low latency for foreground tasks.
> 
> I think it's mostly orthogonal feature to throttling.

I'm... not sure that what I was getting at above really got across.

What I was saying above, in a nutshell, is that if we're going to
provide throttling then we should give users a way to configure the
throttling on a per-IO-channel basis, which means at the tablespace
level, plus an independent configuration option for WAL since we allow
that to be placed elsewhere too.

Ideally, the configuration parameter would be in the same units as the
actual resource is too- which would probably be IOPS+bandwidth, really.
Just doing it in terms of bandwidth ends up being a bit of a mismatch
as compared to reality, and would mean that users would have to tune it
down farther than they might otherwise and therefore give up that much
more in terms of system capability.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > Just to share a few additional thoughts after pondering this for a
> > > > while, but the comment Andres made up-thread really struck a chord- we
> > > > don't necessairly want to throttle anything, what we'd really rather do
> > > > is *prioritize* things, whereby foreground work (regular queries and
> > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > > etc) but otherwise we use the system to its full capacity.  We don't
> > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > > just don't want those to hurt the performance of regular queries that
> > > > are happening.
> > > 
> > > I think you're forgetting the motivation of this very patch was to
> > > prevent replication lag caused by a command generating large amounts of
> > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > > do with prioritization or foreground/background split.
> > > 
> > > I'm not arguing against ability to prioritize stuff, but I disagree it
> > > somehow replaces throttling.
> > 
> > Why is replication lag an issue though?  I would contend it's an issue
> > because with sync replication, it makes foreground processes wait, and
> > with async replication, it makes the actions of foreground processes
> > show up late on the replicas.
> 
> I think reaching the bandwidth limit of either the replication stream,
> or of the startup process is actually more common than these. And for
> that prioritization doesn't help, unless it somehow reduces the total
> amount of WAL.

The issue with hitting those bandwidth limits is that you end up with
queues outside of your control and therefore are unable to prioritize
the data going through them.  I agree, that's an issue and it might be
necessary to ask the admin to provide what the bandwidth limit is, so
that we could then avoid running into issues with downstream queues that
are outside of our control causing unexpected/unacceptable lag.

> > If the actual WAL records for the foreground processes got priority and
> > were pushed out earlier than the background ones, that would eliminate
> > both of those issues with replication lag.  Perhaps there's other issues
> > that replication lag cause but which aren't solved by prioritizing the
> > actual WAL records that you care about getting to the replicas faster,
> > but if so, I'd like to hear what those are.
> 
> Wait, what? Are you actually suggesting that different sources of WAL
> records should be streamed out in different order? You're blowing a
> somewhat reasonably doable project up into "let's rewrite a large chunk
> of all of the durability layer in postgres".
> 
> Stephen, we gotta stop blowing up projects into something that can't
> ever realistically be finished.

I started this sub-thread specifically the way I did because I was
trying to make it clear that these were just ideas for possible
discussion- I'm *not* suggesting, nor saying, that we have to go
implement this right now instead of implementing the throttling that
started this thread.  I'm also, to be clear, not objecting to
implementing the throttling discussed (though, as mentioned but
seemingly ignored, I'd see it maybe configurable in different ways than
originally suggested).

If there's a way I can get that across more clearly than saying "Just to
share a few additional thoughts", I'm happy to try and do so, but I
don't agree that I should be required to simply keep such thoughts to
myself; indeed, I'll admit that I don't know how large a project this
would actually be and while I figured it'd be *huge*, I wanted to share
the thought in case someone might see a way that we could implement it
with much less work and have a better solution as a result.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: WAL insert delay settings

2019-02-21 Thread Stephen Frost

Greetings,

* Ants Aasma (ants.aa...@eesti.ee) wrote:
> On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost  wrote:
> > The issue with hitting those bandwidth limits is that you end up with
> > queues outside of your control and therefore are unable to prioritize
> > the data going through them.  I agree, that's an issue and it might be
> > necessary to ask the admin to provide what the bandwidth limit is, so
> > that we could then avoid running into issues with downstream queues that
> > are outside of our control causing unexpected/unacceptable lag.
> 
> If there is a global rate limit on WAL throughput it could be adjusted by a
> control loop, measuring replication queue length and/or apply delay. I
> don't see any sane way how one would tune a per command rate limit, or even
> worse, a cost-delay parameter. It would have the same problems as work_mem
> settings.

Yeah, having some kind of feedback loop would be interesting.  I agree
that a per-command rate limit would have similar problems to work_mem,
and that's definitely one problem we have with the way VACUUM is tuned
today but the ship has more-or-less sailed on that- I don't think we're
going to be able to simply remove the VACUUM settings.  Avoiding adding
new settings that are per-command would be good though, if we can sort
out a way how.

> Rate limit in front of WAL insertion would allow for allocating the
> throughput between foreground and background tasks, and even allow for
> priority inheritance to alleviate priority inversion due to locks.

I'm not sure how much we have to worry about priority inversion here as
you need to have conflicts for that and if there's actually a conflict,
then it seems like we should just press on.

That is, a non-concurrent REINDEX is going to prevent an UPDATE from
modifying anything in the table, which if the UPDATE is a higher
priority than the REINDEX would be priority inversion, but that doesn't
mean we should slow down the REINDEX to allow the UPDATE to happen
because the UPDATE simply can't happen until the REINDEX is complete.
Now, we might slow down the REINDEX because there's UPDATEs against
*other* tables that aren't conflicting and we want those UPDATEs to be
prioritized over the REINDEX but then that isn't priority inversion.

Basically, I'm not sure that there's anything we can do, or need to do,
differently from what we do today when it comes to priority inversion
risk, at least as it relates to this discussion.  There's an interesting
discussion to be had about if we should delay the REINDEX taking the
lock at all when there's an UPDATE pending, but you run the risk of
starving the REINDEX from ever getting the lock and being able to run in
that case.  A better approach is what we're already working on- arrange
for the REINDEX to not require a conflicting lock, so that both can run
concurrently.

> There is also an implicit assumption here that a maintenance command is a
> background task and a normal DML query is a foreground task. This is not
> true for all cases, users may want to throttle transactions doing lots of
> DML to keep synchronous commit latencies for smaller transactions within
> reasonable limits.

Agreed, that was something that I was contemplating too- and one could
possibly argue in the other direction as well (maybe that REINDEX is on
a small table but has a very high priority and we're willing to accept
that some regular DML is delayed a bit to allow that REINDEX to finish).
Still, I would think we'd basically want to use the heuristic that DDL
is bulk and DML is a higher priority for a baseline/default position,
but then provide users with a way to change the priority on a
per-session level, presumably with a GUC or similar, if they have a case
where that heuristic is wrong.

Again, just to be clear, this is all really 'food for thought' and
interesting discussion and shouldn't keep us from doing something simple
now, if we can, to help alleviate the immediate practical issue that
bulk commands can cause serious WAL lag.  I think it's good to have
these discussions since they may help us to craft the simple solution in
a way that could later be extended (or at least won't get in the way)
for these much larger changes, but even if that's not possible, we
should be open to accepting a simpler, short-term, improvement, as these
larger changes would very likely be multiple major releases away if
they're able to be done at all.

> As a wild idea for how to handle the throttling, what if when all our wal
> insertion credits are used up XLogInsert() sets InterruptPending and the
> actual sleep is done inside ProcessInterrupts()?

This comment might be better if it was made higher up in the thread,
closer to where the discussion was happening about the issues with

Re: WAL insert delay settings

2019-02-21 Thread Stephen Frost

Greetings,

* Ants Aasma (ants.aa...@eesti.ee) wrote:
> On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost  wrote:
> 
> > > Rate limit in front of WAL insertion would allow for allocating the
> > > throughput between foreground and background tasks, and even allow for
> > > priority inheritance to alleviate priority inversion due to locks.
> >
> > I'm not sure how much we have to worry about priority inversion here as
> > you need to have conflicts for that and if there's actually a conflict,
> > then it seems like we should just press on.
> >
> > That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> > modifying anything in the table, which if the UPDATE is a higher
> > priority than the REINDEX would be priority inversion, but that doesn't
> > mean we should slow down the REINDEX to allow the UPDATE to happen
> > because the UPDATE simply can't happen until the REINDEX is complete.
> > Now, we might slow down the REINDEX because there's UPDATEs against
> > *other* tables that aren't conflicting and we want those UPDATEs to be
> > prioritized over the REINDEX but then that isn't priority inversion.
> 
> I was thinking along the lines that each backend gets a budget of WAL
> insertion credits per time interval, and when the credits run out the
> process sleeps. With this type of scheme it would be reasonably
> straightforward to let UPDATEs being blocked by REINDEX to transfer their
> WAL insertion budgets to the REINDEX, making it get a larger piece of the
> total throughput pie.

Sure, that could possibly be done if it's a per-backend throttle
mechanism, but that's got more-or-less the same issue as a per-command
mechanism and work_mem as discussed up-thread.

Also seems like if we've solved for a way to do this transferring and
delay and such that we could come up with a way to prioritize (or 'give
more credits') to foreground and less to background (there was another
point made elsewhere in the thread that background processes should
still be given *some* amount of credits, always, so that they don't end
up starving completely, and I agree with that).

There's actually a lot of similarity or parallels between this and basic
network traffic shaping, it seems to me anyway, where you have a pipe of
a certain size and you want to prioritize some things (like interactive
SSH) while de-prioritizing other things (bulk SCP) but also using the
pipe fully.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: [PATCH v20] GSSAPI encryption support

2019-02-21 Thread Stephen Frost

Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> > * Robbie Harwood (rharw...@redhat.com) wrote:
> >> Sure!  I'll go ahead and hack up the checks and lucid stuff and get
> >> back to you.
> >
> > Great!  I'll finish fleshing out the basics of how to have this work
> > server-side and once the checks and lucid stuff are in on the psql
> > side, it should be pretty straight-forward to copy that same
> > information into beentry alongside the SSL info, and expose it through
> > pg_stat_get_activity() into a pg_stat_gss view.
> 
> When the mech is gss_mech_krb5 under MIT krb5:
> 
> psql (12devel)
> GSSAPI encrypted connection (krb5 using aes256-cts-hmac-sha384-192)
> Type "help" for help.
> 
> And the same under Heimdal:
> 
> psql (12devel)
> GSSAPI encrypted connection (krb5 using aes256-cts-hmac-sha1-96)
> Type "help" for help
> 
> If the mech weren't gss_krb5, or Lucid introspection failed, but we're a
> SASL-aware mech (and using MIT):
> 
> psql (12devel)
> GSSAPI encrypted connection (~256 bits)
> Type "help" for help.
> 
> The third case (no lucid, no SASL SSF):
> 
> psql (12devel)
> GSSAPI encrypted connection (unknown mechanism)
> Type "help" for help.
> 
> Feel free to tweak these strings as needed.

That all looks fantastic!  Do you see any reason to not say:

GSSAPI encrypted connection (SASL SSF: ~256 bits)

instead, since that's what we are technically reporting there?

> I've also adjusted the layering somewhat and moved the actual printf()
> call down into fe-secure-gssapi.c  I don't know whether this model makes
> more sense in the long run, but for PoC code it was marginally easier to
> reason about.

No, I think we need to provide a way for libpq-using applications to get
at that information easily..

> Another thing I've been thinking about here is whether the SASL SSF
> logic is useful.  It'll only get invoked when the mechanism isn't
> gss_mech_krb5, and only under MIT.  SPNEGO (krb5), NTLM (gss-ntlmssp),
> IAKERB (krb5), and EAP (moonshot) all don't support
> GSS_C_SEC_CONTEXT_SASL_SSF; I actually couldn't come up with another
> mechanism that does.  I defer to you on whether to remove that, though.

Oh, hmm, I see.  Yeah, since there's no case where that could actually
end up being used today then perhaps it makes sense to remove it- it's
not a terribly good interface anyway since it doesn't provide the
actual encryption algorithm, I had just gone down that route because I
saw how to.  The lucid stuff is much better. :)

> I did some testing with Heimdal since we're using some extensions here,
> and found out that the current build machinery doesn't support Heimdal.

Hah!

> (The configure.in detection logic only seems to work for MIT and
> Solaris.)  However, it's possible to explicitly pass in CFLAGS/LDFLAGS
> and it worked prior to my changes, so I've preserved that behavior.

Alright.  That seems like an independent change, if we decide to make
it easier for people to build with heimdal, but there hasn't been much
call for it, clearly.

> Finally, as this patch touches configure.in, configure needs to be
> regenerated; I'm not sure what project policy is on when that happens
> (and it produces rather a lot of churn on my machine).

There's some magic there due to vendor changes to autoconf, as I recall,
which is likely why you're seeing a lot of churn.

> Patch attached after the break; apply on top of -20.

Ok.  I'm pretty amazed at how little code it was to do..  Is there
somewhere that these functions are publicly documented and how they can
be used from a GSSAPI handle is documented?

Thanks a lot for this work!

Stephen


signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-22 Thread Stephen Frost

Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Feb 18, 2019 at 6:13 AM David Steele  wrote:
> > On 2/16/19 5:57 AM, Andres Freund wrote:
> > > On 2019-01-05 13:19:20 -0500, Stephen Frost wrote:
> > >> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> > >>> On 12/12/2018 05:31, Robert Haas wrote:
> > >>>> Most of the features I've been involved in removing have been
> > >>>> deprecated for 5+ years.  The first release where this one was
> > >>>> deprecated was only 2 years ago.  So it feels dramatically faster to
> > >>>> me than what I think we have typically done.
> > >>>
> > >>> I was just looking this up as well, and I find it too fast.  The
> > >>> nonexclusive backups were introduced in 9.6.  So I'd say that we could
> > >>> remove the exclusive ones when 9.5 goes EOL.  (That would mean this
> > >>> patch could be submitted for PostgreSQL 13, since 9.5 will go out of
> > >>> support around the time PG13 would be released.)
> > >>
> > >> I don't agree with either the notion that we have to wait 5 years in
> > >> this case or that we've only had a good alternative to the exclusive
> > >> backup mode since 9.5 as we've had pg_basebackup since 9.1.
> > >
> > > I don't agree with a general 5 year deprecation window either. But it
> > > seems pretty clear that there's no majority for removing exclusive
> > > backups in v12.   I think it'd be good to make the warning about its
> > > impending death more explicit, but otherwise mark this CF entry either
> > > as rejected or returned with feedback.
> >
> > I think there is support for the patch in PG13 so I was planning to move
> > it out of the March CF to the first PG13 CF as soon as the app will
> > allow it, i.e., once there is only a single open CF.
>
> Agreed, and I think we should also update the documentation for 12 with the
> suggested more explicit mention of the deprecation.

+1.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-22 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao  wrote:
> > -1 for the removal. I think that there are still many users of an exclusive
> > backup API, and it's not so easy to migrate their backup scripts so that
> > only non-exclusive one is used because of the reason I wrote in another 
> > thread.
> > https://postgr.es/m/cahgqgwhukebkvexvfwnljmq2rzos_shymiect+kbn-cbpq5...@mail.gmail.com
> 
> Yeah, those are interesting points.  Unfortunately, I think something
> as simple as your proposed...
> 
> psql -c "SELECT pg_start_backup()"
> rsync, cp, tar, storage backup, or something
> psql -c "SELECT pg_stop_backup()"
> 
> ...doesn't have much chance of being correct.  If you aren't checking
> whether pg_start_backup() throws an error, for example, you have got a
> very failure-prone set-up.  That being said, it's possible to fix that
> problem using some shell logic, whereas the problem of keeping a
> connection open for the duration of the backup from the shell is much
> harder.  I recently ran across a case where someone attempted it but
> did not get the logic entirely right, with rather painful
> consequences.
> 
> Now you might say that people should not write their own tools and
> just use professionally produced ones.  But I don't really agree with
> that, and whatever we think people SHOULD do, there are in fact a lot
> of people using their own tools.

Yes, they are- but I'd argue that's specifically because the
old/deprecated/broken API makes it *look* safe, even though it isn't.

Getting a solid and resiliant backup to work from a shell script is, imv
anyway (though I might have a bit of experience, having tried numerous
times myself and then realizing that it just isn't practical...), a
downright fool's errand.

Perhaps there's some way we can fix that so that people can write a lazy
rsync-based backup script and use it, but I don't see anyone seriously
working on that and I don't think that's because they just really like
to make things difficult for people but rather because it really is
*hard* to do, if not downright impossible.

What we need is a backup tool included in core that users feel
comfortable using instead of trying to write their own.  I don't think
they write their own because it's somehow better than the existing tools
or because they really think it's a good idea- it's just what looks like
the "thing to do" based on our documentation and the API we provide
today.  Sure, pg_basebackup is good for smaller environments and I
always encourage people to use it when they want something that's in
core that is safe to use and that does things correctly, but it's got
some pretty serious limitations due to using the replication protocol.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Journal based VACUUM FULL

2019-02-22 Thread Stephen Frost

Greetings,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> On 2/22/19 2:15 PM, Andres Freund wrote:
> > On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
> >> I have not heard many requests for bringing back the old behavior, but I
> >> could easily have missed them. Either way I do not think there would be 
> >> much
> >> demand for an in-place VACUUM FULL unless the index bloat problem is also
> >> solved.
> > Yea, I don't think there's much either. What I think there's PLENTY need
> > for is something like pg_repack in core.  And could argue that the
> > trigger based logging it does to catch up to changes made concurrently
> > with the rewrite, to the old table, is a form of journaling...
> 
> +1. Maybe this is something that should be on the agenda of the next
> developers' meeting.

Seems more appropriate to the developer unconference, though perhaps
that's what you meant..?

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: oddity with ALTER ROLE/USER

2019-02-23 Thread Stephen Frost

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Joe Conway  writes:
> > I noticed that ALTER ROLE/USER succeeds even when called without any
> > options:
> 
> > postgres=# alter user foo;
> > ALTER ROLE
> > postgres=# alter role foo;
> > ALTER ROLE
> > postgres=# alter group foo;
> > ERROR:  syntax error at or near ";"
> > LINE 1: alter group foo;
> 
> > That seems odd, does nothing useful, and is inconsistent with, for
> > example, ALTER GROUP as shown above.
> 
> > Proposed patch attached.
> 
> If you want to make it act like alter group, why not make it act
> like alter group?  That is, the way to fix this is to change the
> grammar so that AlterOptRoleList doesn't permit an expansion with
> zero list elements.
> 
> Having said that, I can't get excited about changing this at all.
> Nobody will thank us for it, and someone might complain.

Is there no chance that someone's issueing such an ALTER ROLE foo;
statement and thinking it's actually doing something?  I suppose it's
pretty unlikely, but we do complain in some cases where we've been asked
to do something and we end up not actually doing anything for various
reasons ("no privileges were granted..."), though that's only a warning.
Maybe that'd be useful to have here though?  At least then we're making
it clear to the user that nothing is happening and we don't break
existing things.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: [PATCH v20] GSSAPI encryption support

2019-02-23 Thread Stephen Frost

Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> I don't know much about GSSAPI, but from what I can tell, this seems an
> attractive feature, and the implementation is compact enough.  I have
> done a bit of work on the internal SSL API refactoring, so I have some
> thoughts on this patch.
> 
> Looking at the file structure, we would have
> 
> be-secure.c
> be-secure-openssl.c
> be-secure-[othersslimpl].c
> be-secure-gssapi.c
> be-secure-common.c
> 
> This implies a code structure that isn't really there.
> be-secure-common.c is used by SSL implementations but not by the GSSAPI
> implementation.

be-secure-common.c seems very clearly mis-named, I mean, look at the
comment at the top of the file:

* common implementation-independent SSL support code

Seems we've been conflating SSL and "Secure" and we should probably fix
that.

What I also really don't like is that "secure_read()" is really only
*maybe* secure.  be-secure.c is really just an IO-abstraction layer that
lets other things hook in and implement the read/write themselves.

> Perhaps we should rename be-secure-openssl.c to be-ssl-openssl.c and
> be-secure-common.c to be-ssl-common.c.

This might be overly pedantic, but what we do in other parts of the tree
is use these things called directories..

I do think we need to rename be-secure-common since it's just flat out
wrong as-is, but that's independent of the GSSAPI encryption work,
really.

> Or maybe we avoid that, and you rename be-secure-gssapi.c to just
> be-gssapi.c and also combine that with the contents of be-gssapi-common.c.

I don't know why we would need to, or want to, combine
be-secure-gssapi.c and be-gssapi-common.c, they do have different roles
in that be-gssapi-common.c is used even if you aren't doing encryption,
while bs-secure-gssapi.c is specifically for handling the encryption
side of things.

Then again, be-gssapi-common.c does currently only have the one function
in it, and it's not like there's an option to compile for *just* GSS
authentication (and not encryption), or at least, I don't think that
would be a useful option to have..  Robbie?

> (And similarly in libpq.)

I do agree that we should try to keep the frontend and backend code
structures similar in this area, that seems to make sense.

> About pg_hba.conf: The "hostgss" keyword seems a bit confusing.  It only
> applies to encrypted gss-using connections, not all of them.  Maybe
> "hostgssenc" or "hostgsswrap"?

Not quite sure what you mean here, but 'hostgss' seems to be quite well
in-line with what we do for SSL...  as in, we have 'hostssl', we don't
say 'hostsslenc'.  I feel like I'm just not understanding what you mean
by "not all of them".

> I don't see any tests in the patch.  We have a Kerberos test suite at
> src/test/kerberos/ and an SSL test suite at src/test/ssl/.  You can get
> some ideas there.

Yeah, I was going to comment on that as well.  We definitely should
implement tests around this.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-24 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 22, 2019, at 15:18, Stephen Frost  wrote:
> > Getting a solid and resiliant backup to work from a shell script is, imv
> > anyway (though I might have a bit of experience, having tried numerous
> > times myself and then realizing that it just isn't practical...), a
> > downright fool's errand.
> 
> The reality, though, is that there are a lot of organizations who have 
> invested time and effort into getting a backup strategy working using the 
> existing APIs, and there will be quite a bit of pushback against the version 
> in which the existing exclusive API is removed.

Do they realize how that existing backup strategy is flawed?  I doubt
it, and when they discover how it's broken, I don't think they're going
to be thanking us for letting them continue to use it.

> Some of those will be able to move to non-exclusive backups easily; others 
> won't.  For the ones that can't move easily, the reaction will not be, 
> "PostgreSQL version x has a safer backup API"; it will be "PostgreSQL version 
> x broke our backups, so we're not upgrading to it."

We don't cater to this line of argument when it comes to breaking
changes in the backend, or when we break monitoring scripts, and I don't
see a reason why we should do so here.  They aren't required to upgrade
immediately- we provide 5 *years* of support for major versions which we
release, and we're going to give advance warning of this change that's
even beyond that 5 years.  I have essentially zero sympathy for
organizations which refuse to allocate even the bit of time necessary to
address this change for over 5+ years.

> Rather than deprecate the existing API, I'd rather see the documentation 
> updated to discuss the danger cases.

Ok, then please do so, and please be prepared to continue to maintain
the documentation of both methods moving forward, because others have
tried and have (rightfully, in my opinion) decided that it's frankly not
worth the effort and ultimately just terribly confusing for users that
we have these two different backup methods and even just updating the
documentation for one or the other is downright painful (to the point
that people litterally give up on it).  That really isn't a good place
to be in.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-24 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 24, 2019, at 12:00, Stephen Frost  wrote:
> > 
> > Do they realize how that existing backup strategy is flawed?
> 
> Undoubtedly, some do, some don't.  However, given that it has been the *only* 
> backup API for a very long time, many organizations have spent a lot of time 
> closing all of the holes.

No, it *hasn't* been the only backup API for a long time- that was only
true up until 9.6 was released, since then we've had both, and it's made
everything a downright mess because of exactly these arguments.

> It's not impossible to do safe backups with the existing API.  
> Unquestionably, there are installations doing backups that might end up with 
> a silently badly one, but it's entirely possible to do that unsafely (in many 
> of the same ways) with the non-exclusively API.

Yes, it *is* impossible to do safe backups with the existing API.  There
is an unquestionable race condition where a system restart will cause
your system to not come back up without you going in and removing the
backup_label file- and the only way you make that race window small is
to remove the backup_label file right after you run pg_start_backup and
copy it, and then PUT IT BACK at the end before you call pg_stop_backup,
which is insane, but otherwise the 'race window' is the ENTIRE length of
the backup.

This is the reason the non-exclusive backup mode *exists*.  All of the
effort that went into building the non-exclusive backup mode wasn't done
just for the fun of it.

> The installations that need to fix the scripts are also exactly the ones that 
> can't use pg_basebackup or another pre-packaged solution, usually because 
> they have a specific way of taking the file system copy (SAN snapshot, etc.) 
> that those don't support.

I'm not going to bother going into the issues that SAN snapshots have
here, because it's largely orthogonal to this discussion.  What I'll say
is that they'll have over 5 years to adjust those scripts and I don't
see an issue with that.

> > We don't cater to this line of argument when it comes to breaking
> > changes in the backend, or when we break monitoring scripts, and I don't
> > see a reason why we should do so here.
> 
> Those cases are not analogous.

I disagree *entirely*.

> 1. Backend APIs are declared non-stable, and do not have a wide audience 
> compared to backing up the database.

*ALL* of our APIs are declared non-stable between major releases.
That's why we have *major* and *minor* releases.

> 2. Monitoring scripts, while important, are not as critical as the backup 
> system.  (And, in fact, I didn't agree with breaking those views either, but 
> that's another discussion.)

Agreed, we should care even more about making sure that the tools we
provide to perform a backup are reliable, which is a good reason to
remove the broken and deprecated exclusive API.

> > Ok, then please do so, and please be prepared to continue to maintain
> > the documentation of both methods moving forward, because others have
> > tried and have (rightfully, in my opinion) decided that it's frankly not
> > worth the effort and ultimately just terribly confusing for users that
> > we have these two different backup methods and even just updating the
> > documentation for one or the other is downright painful (to the point
> > that people litterally give up on it).
> 
> We're going to have to do that anyway.  For as long as we are maintaining the 
> documentation on a version that has both APIs, we're going to have to say, 
> "Don't use this one, and *here's why*."  Saying, "Don't use this one because 
> we said so" when it is an API of long standing that works just as it always 
> did isn't going to cut it.

We have existing documentation for the versions that have both APIs.  It
sucks, it's horribly complicated, and no one wants to work on it because
of that.

Sure, it'd be nice if someone would fix it to list out the problems with
the exclusive API, but that requires someone wanting to spend time on
it.  If that's not going to happen, I certainly don't think that means
we should *keep* it.

THanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-24 Thread Stephen Frost

Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> Stephen Frost wrote:
> > Yes, it *is* impossible to do safe backups with the existing API.  There
> > is an unquestionable race condition where a system restart will cause
> > your system to not come back up without you going in and removing the
> > backup_label file- and the only way you make that race window small is
> > to remove the backup_label file right after you run pg_start_backup and
> > copy it, and then PUT IT BACK at the end before you call pg_stop_backup,
> > which is insane, but otherwise the 'race window' is the ENTIRE length of
> > the backup.
> 
> I just have an idea:
> 
> What about an option to keep WAL around for the duration of an exclusive 
> backup?
> 
> That way PostgreSQL can still restart after a crash.  It will take longer than
> expected, but it will work.  But then, perhaps the long recovery time is only
> marginally better than having to manually delete the backup_label file...

I'm afraid that we'd end up with many, many complaints about people
running out of disk space on WAL when they are trying to take a backup..

I do applaud your efforts to think of a better solution but I'm afraid
that isn't really workable.  While crashing with a backup_label in place
definitely sucks and makes recovering from that not fun, it's probably
better than having people run out of disk space and having the system
PANIC from that during what would otherwise be perfectly normal
operation.

That would also seem like a bit of an odd difference between the
exclusive and non-exclusive backup methods... and another things we'd
have to write up documentation for if we kept both methods around to try
and explain to users and that is just not a pleasant thought.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-24 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 24, 2019, at 13:00, Stephen Frost  wrote:
> > 
> > No, it *hasn't* been the only backup API for a long time- that was only
> > true up until 9.6 was released, since then we've had both, and it's made
> > everything a downright mess because of exactly these arguments.
> 
> 9.6 has been out for about 2.5 years.  How long was the old version of 
> pg_start_backup() the only interface?  Considerably longer.

Right, and PG12 will be out for another *5* years beyond that, meaning
people will have had 8.5 years to move from the exclusive API to the
non-exclusive one.

> > Yes, it *is* impossible to do safe backups with the existing API.
> 
> That's an overstatement.  "The existing interface has failure conditions that 
> are hard to work around" is true.  Saying "it's impossible to do safe 
> backups" implies that no pre-9.6 system has ever been backed up.  If we take 
> that line, it's just going to create a "Oh, come *on*" reaction from users.

Every one of the exclusive backups that was taken wasn't *safe*, and a
crash during any of them (which we've certainly heard plenty about
through various support channels over the years...) would have resulted
in a failure to properly restart.

No, I'm not saying that such backups are corrupted, *that* is an
overstatement, but it isn't actually what I'm saying, just a strawman
you've come up with to argue against.

> > What I'll say
> > is that they'll have over 5 years to adjust those scripts and I don't
> > see an issue with that.
> 
> That's not quite the situation.  What they will have is a choice between 
> upgrading, and or their existing scripts continuing to work.  That choice 
> will start ticking the instant a release without the old interface comes out. 
>  The back-pressure on upgrading is already very high; this is going to put a 
> lot of installations into a bind, and a lot will make the decision not to 
> upgrade because of that.

That's fine- they can push off upgrading, for as long as 5 years, and
we'll still support them.  That's the model we operate under for
everything, this isn't any different in that regard and I don't have
sympathy for people who insist on saying that *this*, just *this* one
API of *all* the ones we have simply *can't* be changed *ever*.

> > *ALL* of our APIs are declared non-stable between major releases.
> > That's why we have *major* and *minor* releases.
> 
> I note that a word game is being played here.  We're calling 
> pg_start_backup() "an API", as if it is the same as, say, the GiST C API, but 
> it's a SQL level construct.  SQL level constructs are not promises we need to 
> keep forever, of course, but they're not the same class of promise as a C API.

I'm not following here, at all...  We're actually better, historically,
about not changing SQL-level constructs than C-level APIs, but we can
and do still change them, regularly, with every major release- including
making new reserved keywords and other very clearly potentially breaking
changes.

> > Sure, it'd be nice if someone would fix it to list out the problems with
> > the exclusive API, but that requires someone wanting to spend time on
> > it.
> 
> I'll take a turn at it.

Just to be clear- I don't agree that simply documenting the issues with
it is a sufficient solution to make us keep it.  I'll certainly continue
to advocate for its removal, but it would be nice for people who are
currently using it to be aware of the issues, so I do think it'd be good
to have the documentation improved in that regard, just make sure to not
invalidate the documentation around non-exclusive backup while you're at
it.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-24 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 24, 2019, at 13:44, Stephen Frost  wrote:
> > Right, and PG12 will be out for another *5* years beyond that, meaning
> > people will have had 8.5 years to move from the exclusive API to the
> > non-exclusive one.
> 
> The thing is that for 90% of installations, the clock will start ticking when 
> the deprecation.  Until then, most of them are not going to feel any pressure 
> to make the change.  Most installations have a lot of other things on their 
> minds.  They have the option of not upgrading, and that will be the option a 
> lot of them take.

Just to be clear- it was *already* deprecated, we are just talking here
about making that more clear/obvious.

And, again, they're welcome to not upgrade for as long as they'd like,
frankly, but we'll only provide back-patches for 5 years.

> > Every one of the exclusive backups that was taken wasn't *safe*, and a
> > crash during any of them (which we've certainly heard plenty about
> > through various support channels over the years...) would have resulted
> > in a failure to properly restart.
> 
> Most installations don't routinely encounter this.  I know it happens, and it 
> is definitely a problem, but the field is not strewn with corpses here.  The 
> new interface is unquestionably an improvement, but let's not get our 
> messaging amped up to the point that it hurts our credibility.
> 
> > No, I'm not saying that such backups are corrupted, *that* is an
> > overstatement, but it isn't actually what I'm saying, just a strawman
> > you've come up with to argue against.
> 
> That may not be what you are saying, but statements like "it *is* impossible 
> to do safe backups with the existing API" will be heard that way.  Let's keep 
> the messaging we give users accurate.
> 
> > this isn't any different in that regard and I don't have
> > sympathy for people who insist on saying that *this*, just *this* one
> > API of *all* the ones we have simply *can't* be changed *ever*.
> 
> Sure, we can change anything.  It's whether this particular deprecation is a 
> good idea.

You say above that the new interface is unquestionably an improvement
and here say that we shouldn't deprecate the old one in favor of it
(even though we actually already have... but that's beside the point I'm
trying to make here), so what you're advocating for is that we keep an
old and known broken interface that we know causes real issues even
after we've developed a new and unquestionably better one.

For my 2c, I think keeping the old interface around at all was a mistake
from the start, just the same as things like pg_user have been kept
around forever because of fear of breaking things, even as we rip out
the recovery.conf file or rename tons of things in the catalog across
major versions.

I really, honestly, believe what we need to *stop* doing is trying to
drag along support for old/deprecated interfaces after we've introduced
new ones, thus avoiding these arguments and the time spent on them, and
the time spent dealing with implementing and documenting new APIs around
old ones.  The only thing it seems to unquestionably do is make us argue
about when we are going to *finally* rip out the old/deprecated API (or
interface, or whatever you want to call it).

This is the new "should we call it PG 10.0 or PG 9.7 this time?"
argument, all over again...

> > I'm not following here, at all...  We're actually better, historically,
> > about not changing SQL-level constructs than C-level APIs
> 
> That's my point.  Calling this an "API" makes the change sound more routine 
> than it is.  It's an "API" that a *lot* of installations depend on.

A lot of them depended on pg_wal being named pg_xlog too, but we seem to
have managed reasonably well through that, not to mention all the
catalog changes that caused issues for monitoring, etc.

> > I don't agree that simply documenting the issues with
> > it is a sufficient solution to make us keep it.
> 
> Understood, but I think we need to document it no matter what.

Sure, go for it.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Adrien NAYRAT (adrien.nay...@anayrat.info) wrote:
> On 2/22/19 7:10 PM, Robert Haas wrote:
> >On Fri, Feb 22, 2019 at 12:35 PM Fujii Masao  wrote:
> >>-1 for the removal. I think that there are still many users of an exclusive
> >>backup API, and it's not so easy to migrate their backup scripts so that
> >>only non-exclusive one is used because of the reason I wrote in another 
> >>thread.
> >>https://postgr.es/m/cahgqgwhukebkvexvfwnljmq2rzos_shymiect+kbn-cbpq5...@mail.gmail.com
> >
> >Yeah, those are interesting points.  Unfortunately, I think something
> >as simple as your proposed...
> >
> >psql -c "SELECT pg_start_backup()"
> >rsync, cp, tar, storage backup, or something
> >psql -c "SELECT pg_stop_backup()"
> >
> >...doesn't have much chance of being correct.  If you aren't checking
> >whether pg_start_backup() throws an error, for example, you have got a
> >very failure-prone set-up.  That being said, it's possible to fix that
> >problem using some shell logic, whereas the problem of keeping a
> >connection open for the duration of the backup from the shell is much
> >harder.  I recently ran across a case where someone attempted it but
> >did not get the logic entirely right, with rather painful
> >consequences.
> >
> >Now you might say that people should not write their own tools and
> >just use professionally produced ones.  But I don't really agree with
> >that, and whatever we think people SHOULD do, there are in fact a lot
> >of people using their own tools.
> 
> FWIW +1 for not remove exclusive backup.
> 
> Maintain a connection during the backup is a hard pre-requisite. In my
> previous jobs I already done custom scripts to perform backup by using
> pre/post backup hook to control backup software: With vmware to do PITR
> backup with VM snapshot or with another file backup tool which perform block
> deduplication. I do not see where is the problem if you check
> pg_start_backup()/pg_stop_backup() went well?

This, really, is a large part of the problem- people don't realize and
just simply don't want to believe, it seems at least, that there's any
problem with what they're doing because it works some, or even most, of
the time.

Making a running PG cluster *look* like it was restored from a backup is
really bad and leads to a lot of confusion and cases where PG won't
start up, but that's what the exclusive pg_start_backup/pg_stop_backup
method does and is more-or-less *required* of any solution that wants to
just be able to tar up the data directory as the backup method, which is
why it ends up just being a bad interface overall.

And then there's other issues- things like making sure that you actually
got all of the WAL, without which the backup is inconsistent being the
leading one but there's others.

> It will be annoying if after this removal, companies must change their
> backup strategy by using specific postgres tools (pgbackrest, barman).

You don't write your own database system using CSV files and shell
magic, do you?  I have to say that it continues to boggle my mind that
people insist that *this* part of the system has to be able to be
implementable using shell scripts.

Folks, these are your backups we're talking about, your last resort if
everything else goes up in flames, why do you want to risk that by
implementing your own one-off solution, particularly when there's known
serious issues using that interface, and you want to just use shell
scripts to do it...?

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote:
> David Steele  writes:
> > On 2/25/19 12:35 AM, Christophe Pettus wrote:
> >>> On Feb 24, 2019, at 14:19, Stephen Frost  wrote:
> >>> You say above that the new interface is unquestionably an improvement
> >>> and here say that we shouldn't deprecate the old one in favor of it
> >>> (even though we actually already have... but that's beside the point I'm
> >>> trying to make here), so what you're advocating for is that we keep an
> >>> old and known broken interface that we know causes real issues even
> >>> after we've developed a new and unquestionably better one.
> >>
> >> Yes, I am advocating exactly that.  The reason that I think we need
> >> to keep the old one (or, at least, not remove it as soon as 12)...
> >
> > Exclusive backup will not be removed for PG12.  There wasn't support for
> > it so I push it out to PG13.
> 
> How about making the deprecation more obvious in PG12, by making
> pg_start_backup() in exclusive mode issue a WARNING that it will be
> removed in PG13?

So..  We've more-or-less come full circle back to where the thread had
left off last time- the plan is to update the documentation to make it
clearer that the exclusive mode is deprecated and that it's going to be
removed, and then we'll actually remove it in PG13.

We could also add such a warning but I'm not entirely convinced it's a
good idea.  There was some earlier discussion about it that I'd probably
want to go back and reread first.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-24 17:19:22 -0500, Stephen Frost wrote:
> > You say above that the new interface is unquestionably an improvement
> 
> FWIW, I think we didn't design it even remotely as well as we ought to
> have. It was both a mistake to not offer a version of non-exclusive
> backups that works with a client connection (for integration with the
> TSMs of this world), and it was a mistake not to offer a commandline
> tool that does the non-exclusive pg/start backup.

I'm not really following- did you mean to say "without a client
connection" above?

We could still add some kind of commandline tool to help with the
non-exclusive pg start/stop backup but I'm not really sure exactly what
that would look like and I honestly don't have terribly much interest in
spending resources on implementing it since it would lack a great many
features that we'd then end up wanting to add on... and then we end up
backing into building yet another backup tool.

> > I really, honestly, believe what we need to *stop* doing is trying to
> > drag along support for old/deprecated interfaces after we've introduced
> > new ones, thus avoiding these arguments and the time spent on them, and
> > the time spent dealing with implementing and documenting new APIs around
> > old ones.  The only thing it seems to unquestionably do is make us argue
> > about when we are going to *finally* rip out the old/deprecated API (or
> > interface, or whatever you want to call it).
> 
> While I agree that we should remove non-exclusive backups, I VEHEMENTLY
> oppose the unconditionality of the above statement. Yes, it's annoying
> to keep interfaces around, but unnecessarily inflicting pain to everyone
> also is annoying.

Alright, then how about we provide a bit of help for everyone who's got
a system built around recovery.conf today, instead of just completely
ripping that out?

If we're happy to do that then I really can't agree with these arguments
that there's things we should try to maintain when it comes to
interfaces, and that's far from the only example of a pretty massive
change that just went into version X with zero notice to users about
what they're using today being deprecated.

> That's not to say we shouldn't be better at announcing and then
> following through on the deprecation of old interfaces.

We announce things have changed every year, and people have five years
from that point, during which we'll continue to support them and fix
bugs and deal with security issues, to make whatever changes they need
to for the new interface.

The argument seems to be that people want new features but they don't
want to have to do any work to get those new features.  Instead, they
expect the community to take on the burden of maintaining those old
interfaces for them, so that they can get those new features.  That
seems like a pretty raw deal for the community and not one that I
really think we should be supporting, and it isn't like we're actually
consistent with it at all- except that we don't break the back-branches
and we do make breaking changes in major versions.

When is something too big of a change to make in a given major version
and we have to, instead, provide backwards compatibility for it?  What
is that policy?  How many releases do we have to support it for?  How do
we communicate that to our users, so that they know what they can depend
on being there across major releases and what they can't?

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> Stephen Frost wrote:
> > > It will be annoying if after this removal, companies must change their
> > > backup strategy by using specific postgres tools (pgbackrest, barman).
> > 
> > You don't write your own database system using CSV files and shell
> > magic, do you?  I have to say that it continues to boggle my mind that
> > people insist that *this* part of the system has to be able to be
> > implementable using shell scripts.
> > 
> > Folks, these are your backups we're talking about, your last resort if
> > everything else goes up in flames, why do you want to risk that by
> > implementing your own one-off solution, particularly when there's known
> > serious issues using that interface, and you want to just use shell
> > scripts to do it...?
> 
> If we didn't think that simplicity in backup has some value, why does
> PostgreSQL provide pg_basebackup?

I'm all for people using pg_basebackup and encourage them to do so, but,
really, that tool *is* too simple by itself: you need to combine it with
something else that handles data retention, backup validation, etc.

This could be some independent backup solution though, since
pg_basebackup gives you a simple set of files for the backup that can
then be backed up using those other backup tools.  (The same is true of
pgbackrest, btw, and people certainly do use pgbackrest to back up to a
repository and then back that repo up using other tools, but in this
case pgbackrest does have some of those data retention and backup
validation capabilities itself, so it isn't necessary to do that.)

> Not everybody wants to use tools like pgbackrest and barman because it
> takes some effort to set them up properly.  It seems that you think that
> people who want something simpler are irresponsible.

I don't mind simpler (we actually try pretty hard to make pgbackrest
simple to use, in fact...  the simplest config only requires a 4-line
config file), provided that it actually does everything that you
need a backup tool to do.  I do think it's an unfortunate reality that
not enough people take the time to think about what they really need
from a backup solution and spend the time to either find or build a
proper solution.

> Incidentally, both barman and pgbackrest stream the backup to a backup server.

I'm pretty sure both can backup to a local repository too (I know
pgbackrest can), and not just stream to a backup server, so I'm not
really sure what you're getting at here.

> I think the most important use case for the low level backup API is when
> your database is so large that you want to leverage storage techniques
> like snapshots for backup, because you can't or don't want to stream all
> those terabytes across the network.

Snapshots place an awful lot of faith in your storage layer-
particularly when you aren't doing any kind of validation that the
snapshot is in good shape and that there hasn't been any in-place latent
corruption, ever.  If you think about it, a snapshot is actually very
similar to just always doing incremental backups with pgbackrest and
never doing a new full backup or even a check of those old files that
haven't changed to see if they're still valid.  It's not an approach I'd
recommend depending on exclusively.

There's ways to address that, of course, such as building a manifest of
checksums of the files in the data directory, such that you can then
validate that the snapshot is correct, and detect if any corruption ends
up happening (and checksum the manifest as well), though you need to
also track the LSN of the pg_start_backup, so you can know that changes
which happened after that LSN are in the WAL that's captured during the
checksum, and you need a place to put all of this for when you want to
go back and re-validate an old snapshot..  and likely other things that
I'm forgetting at the moment, but it's definitely something we've
thought about and have considered how we might add support to pgbackrest
for working with snapshots (and it wouldn't surprise me if that got
implemented before v13 even..).

If you are happy to place complete faith in snapshots and trust that
they're entirely atomic, you can always forgo dealing with any of this
and just snapshot PGDATA and let PG go through crash recovery when you
restore.

> I'm not playing devil's advocate here to annoy you.  I see the problems
> with the exclusive backup, and I see how it can hurt people.
> I just think that removing exclusive backup without some kind of help
> like Andres sketched above will make people unhappy.

Keeping it will definitely make me unhappy. :)

I'm not against something simple being added to help with the new API,
or w

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 25, 2019, at 05:18, Stephen Frost  wrote:
> > So..  We've more-or-less come full circle back to where the thread had
> > left off last time- the plan is to update the documentation to make it
> > clearer that the exclusive mode is deprecated and that it's going to be
> > removed, and then we'll actually remove it in PG13.
> 
> That's not my position, certainly; I still object to its removal.

It wasn't my intent to outline that as your position, my apologies if it
came across that way.

I do believe that's still the plan though, based on both this and the
prior discussion, unless something else happens to address the known
issues.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sun, Feb 24, 2019 at 3:00 PM Stephen Frost  wrote:
> > Ok, then please do so, and please be prepared to continue to maintain
> > the documentation of both methods moving forward, because others have
> > tried and have (rightfully, in my opinion) decided that it's frankly not
> > worth the effort and ultimately just terribly confusing for users that
> > we have these two different backup methods and even just updating the
> > documentation for one or the other is downright painful (to the point
> > that people litterally give up on it).  That really isn't a good place
> > to be in.
> 
> This, to me, is downright rude.  You don't get to tell other people
> what to do.  Nor do you get to tell other people "hey, I'm going to
> make this change that you don't like unless you agree to do the work
> which I specify."  If you want to make a change, you have to build
> consensus for that change.  If you can't get people to agree with your
> proposed change, what happens is that you don't get to make that
> change.  Whether other people choose to do any work that you and they
> might happen to agree is valuable is up to them.

It wasn't my intent to be rude, and my apologies to you and Christophe
for it coming across that way.

I do think we can ask that people who wish to have a capability included
in PG (or continue to be included when there are serious known issues
with it...) be prepared to either build and maintain it themselves or to
convince someone else to do so (or both, and have a committer agree to
it).  That's how we've long operated and it wasn't my intent to imply
otherwise, but I agree that I could have said it in a nicer way to avoid
it coming across as telling Christophe what to do.

I'm also on board with building a consensus for making a change, but a
consensus does not mean that everyone must agree or be supportive of the
change.  There's also a practical side to things which is that if the
consensus seems to be "we are going to get rid of X" and someone wants
to work on X, we should probably make it pretty clear to them that
there's a consensus to remove it to allow them the option to decide if
they wish to still work on X, or not.  Similairly, I think it's good to
let people know who want to work on Y when Y is really hard, so that
they have some idea what they're getting into.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Feb 25, 2019 at 8:42 AM Stephen Frost  wrote:
> > Alright, then how about we provide a bit of help for everyone who's got
> > a system built around recovery.conf today, instead of just completely
> > ripping that out?
> >
> > If we're happy to do that then I really can't agree with these arguments
> > that there's things we should try to maintain when it comes to
> > interfaces, and that's far from the only example of a pretty massive
> > change that just went into version X with zero notice to users about
> > what they're using today being deprecated.
> 
> It's not the same thing.  First of all, the recovery.conf changes have
> been under discussion for quite a few years now, whereas this change
> has been under discussion for only a few months.  

This argument doesn't work for me because it doesn't make *any*
difference to our users- the vast majority of whom don't follow
-hackers.  Our users will simply see that in v12 the system they've
built around recovery.conf won't work and they'll have to learn the new
way and adjust everything to deal with it.

If that argument did matter, we could go back and find the prior
discussions about the issues around the exclusive backup mode and about
removing it, or next year we could point to this thread about it, or the
year after, and say "well, we talked about it a lot, so now let's just
do it", but it all ends up looking the same to our users unless we
actually write into some kind of user-facing documentation or WARNINGs
or similar that something is going away.

> It is also worth
> noting that those recovery.conf changes would've been made years ago
> if people hadn't objected to breaking backward compatibility; people
> have just as much of a right to object to these changes on those
> grounds as they did to those changes.  Second, the recovery.conf
> changes are intended to provide a tangible benefit - namely, getting
> the recovery.conf settings into the GUC system, so that they can be
> queried and changed using methods available for other GUCs.  But the
> change proposed here breaks things for people who are using the
> mechanism in question and it doesn't give them anything else instead.

This isn't a fair argument either because we're having this *after* the
new API was implemented for backup- namely non-exclusive mode, which
*did* give people something better to use instead.

If we went back and added back into the v12 branch support for a
recovery.conf file, then we'd be having exactly this same argument next
year about if we should remove the recovery.conf file support or leave
it in- and you could make the same argument then that removing the
recovery.conf support doesn't give the user anything tangible to replace
it with- because that would have already been done.

> Indeed, it seems quite obvious that you don't have the LEAST interest
> in putting any effort into actually making things in this area better
> for users; it appears that you just want to break what they're doing
> now and shove the work of coping with those changes onto them.

What will make things better for users is actually moving to the new API
that's been available for years now, so that they can avoid the pitfalls
with the old API.

> Christophe is quite right to question whether that will cause users
> not to upgrade.  I think it's overtly user-hostile.  You have referred
> to the effort of maintaining this code, but you haven't pointed to any
> tangible improvement that you'd like to make in this area that is
> actually being blocked by the existence of exclusive backup mode.

I disagree quite a bit with this statement- the existing documentation
is absolutely horrid and needs to be completely ripped out and rewritten
and maintaining the exclusive backup mode in such a rewrite would
absolutely be a *lot* of additional work.

We actually took a shot at trying to improve the documentation while
continuing to cover both the exclusive and the non-exclusive mode and
it's hugely painful.

> It's all just griping about how terrible the whole thing is, and it
> seems that there are actually quite a few people here who find it less
> terrible than you think it is.

I used to be one of those people.  I know that it looks fine and it
certainly seems appealing but having gone through bad experiences with
it, and seen others stumble through those same experiences time and time
again, I've learned that it really is an issue, and one that I would
very much like to avoid causing future users to stumble over.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Feb 22, 2019 at 6:18 PM Stephen Frost  wrote:
> > What we need is a backup tool included in core that users feel
> > comfortable using instead of trying to write their own.
> 
> I agree.  That's a great idea.  Let's talk about how to make that
> happen.  Providing a tool that gives people MORE AND BETTER options
> than what they have today is likely to make users more happy.

I've been working for years on making this happen, and am currently
putting quite a large amount of resources into what I hope will be the
final step in the process to get a solution that will have a great many
more and better options than what's available today.  Maybe it'll end up
happening, which I think would be great, maybe it won't, which would be
sad, but at least we'll have tried.

> Removing an option that people are currently using, and which they
> find better than other available options for reasons with which I
> understand that you disagree, will make users more sad.  Happy is
> better.

I don't want to see more users stumbling over the issues with the
exclusive backup interface.  A better interface exists, and has existed
since 9.6.  The exclusive backup method was deprecated in 2016.  One of
the really bad things about the exclusive backup method is that it
*looks* like it works well and that there aren't any issues with it, but
when users hit the issues, they get very sad.  We might have 1000s of
happy users who never run into those issues and therefore don't want to
see us change anything, but what about the 10s or 100s of users who do
hit those issues, what do we tell them?  Seems like we're saying "well,
sorry, we knew those issues existed and it's unfortunate you hit them,
but there's this better thing that you *should* have been using and then
you wouldn't have hit those issues."  That doesn't seem likely to make
people happy with us.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Feb 25, 2019 at 11:23 AM Stephen Frost  wrote:
> > If that argument did matter, we could go back and find the prior
> > discussions about the issues around the exclusive backup mode and about
> > removing it, or next year we could point to this thread about it, or the
> > year after, and say "well, we talked about it a lot, so now let's just
> > do it", but it all ends up looking the same to our users unless we
> > actually write into some kind of user-facing documentation or WARNINGs
> > or similar that something is going away.
> 
> That's true.  But nobody's objecting to strengthening the warnings in
> the documentation.  People are objecting to removing the thing itself.

Then how long will we carry it forward?  How much warning do we need to
provide?

> > This isn't a fair argument either because we're having this *after* the
> > new API was implemented for backup- namely non-exclusive mode, which
> > *did* give people something better to use instead.
> 
> There are several people who are saying that doesn't meet all their
> needs, giving reasons why it's problematic, and suggesting things that
> could be done to make it less problematic.  It's not OK to say "we can
> remove exclusive backup mode because now we have non-exclusive backup
> mode" unless other people actually agree that all the use cases are
> covered, and it appears that they don't.

I haven't heard anyone say it doesn't meet their needs, just that it's
not as easy to use, which is a fair critcism, but not the same thing.
I'm all for making the non-exclusive mode less problematic if we're able
to.  If that's something that would help us move forward with getting
rid of the exclusive backup mode than that's an area that I'd be willing
to put resources.

> > I disagree quite a bit with this statement- the existing documentation
> > is absolutely horrid and needs to be completely ripped out and rewritten
> > and maintaining the exclusive backup mode in such a rewrite would
> > absolutely be a *lot* of additional work.
> >
> > We actually took a shot at trying to improve the documentation while
> > continuing to cover both the exclusive and the non-exclusive mode and
> > it's hugely painful.
> 
> Well, perhaps that's pain you need to incur.  The alternative seems to
> be to rip out something that people don't want ripped out.

... now I feel like I'm being told what to do.

The point is that keeping it actually *is* work, it isn't free, not even
in our tree.  Making our users stumble over the issues with it also
isn't free, that's another cost of keeping it.

> > I used to be one of those people.  I know that it looks fine and it
> > certainly seems appealing but having gone through bad experiences with
> > it, and seen others stumble through those same experiences time and time
> > again, I've learned that it really is an issue, and one that I would
> > very much like to avoid causing future users to stumble over.
> 
> Sure, that sounds great.  But the way to do that is to continue
> improving the system until exclusive-mode backups really are not a
> useful thing any more, not to remove it while there are still a lot of
> people relying on it who can offer tangible explanations for their
> choice to do so.
> 
> It feels to me like you are portraying the increasing number of people
> objecting to this change as naive or foolish or at least not as
> enlightened as you are, and I really object to that.  I happen to
> think that people like Christophe Pettus and Fujii Masao and Laurenz
> Albe are smart people whose opinions ought to be taken as just as
> valid as your own.

I honestly do doubt that they have had the same experiences that I have
had, but that doesn't mean that they're not smart or that I don't value
their opinion- I agree that they're all smart people and I certainly do
value their opinions.  That doesn't mean that I can't disagree with them
or can't explain why I disagree.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Feb 25, 2019 at 11:33 AM Stephen Frost  wrote:
> > > Removing an option that people are currently using, and which they
> > > find better than other available options for reasons with which I
> > > understand that you disagree, will make users more sad.  Happy is
> > > better.
> >
> > I don't want to see more users stumbling over the issues with the
> > exclusive backup interface.  A better interface exists, and has existed
> > since 9.6.  The exclusive backup method was deprecated in 2016.  One of
> > the really bad things about the exclusive backup method is that it
> > *looks* like it works well and that there aren't any issues with it, but
> > when users hit the issues, they get very sad.  We might have 1000s of
> > happy users who never run into those issues and therefore don't want to
> > see us change anything, but what about the 10s or 100s of users who do
> > hit those issues, what do we tell them?  Seems like we're saying "well,
> > sorry, we knew those issues existed and it's unfortunate you hit them,
> > but there's this better thing that you *should* have been using and then
> > you wouldn't have hit those issues."  That doesn't seem likely to make
> > people happy with us.
> 
> Sure, nobody wants users to keep hitting the same problems over and
> over again, but the above is still reductionist.  If we take the view
> that anything that can cause data corruption in any scenario, no
> matter how remote, is more than everything else, then we should just
> stop shipping minor releases and halt all other development work until
> we deal with https://commitfest.postgresql.org/22/528/ that has been
> open for 14 CommitFests and until we've fully dealt with every aspect
> of fsync-gate.  I don't see you treating either of those issues with
> the same sense of urgency with which you're treating this one, so
> evidently you feel entitled to decide which
> potentially-data-corrupting issues you want to attack first.

I'm not advocating for removing it today or tomorrow, or in some point
release.  We're already to the stage where we're talking about
something that wouldn't hit until late 2020.  Also, frankly, while there
might be something I could do to help, the whole madness with fsync gate
seems to be getting dealt with by some very smart people who know a
whole lot more about the internals of Linux and FreeBSD and I'd rather
not get in their way.  On the other hand, dealing with backups and
issues around backups has been something that I at least hope I've
gotten decent at.

> And I agree that you should have exactly that right.  Along similar
> lines, I'd like our users to have the right to decide when they want
> to upgrade from using exclusive backup mode to non-exclusive backup
> mode, instead of being forced into making a change at a time decided
> by our fiat.

Our users *have* that option, and have for over two years now, and
it'll have been 4 years by the time v13 comes out.  I agree that the
non-exclusive backup mode isn't as simple as the exclusive backup mode-
to use your example, what we *used* to be doing with fsync() was sure a
lot simpler than what we're going to be doing in the future, but
sometimes people have to make changes, even to more complicated APIs,
to make sure that they're doing things the right way to ensure that
their data doesn't get eaten.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Feb 25, 2019 at 11:09 AM Stephen Frost  wrote:
> > I do think we can ask that people who wish to have a capability included
> > in PG (or continue to be included when there are serious known issues
> > with it...) be prepared to either build and maintain it themselves or to
> > convince someone else to do so (or both, and have a committer agree to
> > it).  That's how we've long operated and it wasn't my intent to imply
> > otherwise, but I agree that I could have said it in a nicer way to avoid
> > it coming across as telling Christophe what to do.
> 
> I think it is certainly true that if you want a new capability added,
> you have to either write it yourself or get someone to do it for you.
> There is SOME precedent for the proposition that if you want an
> obsolete capability not to be removed, you should be prepared to help
> fix it.  But frankly I think the latter proposition is one we've taken
> only occasionally, and only for things that were a whole lot cruftier
> and more problematic than this is.  For example, if a library is no
> longer available and we have a contrib module that depends on that
> library, it's reasonable to say that we can't keep the contrib module
> unless somebody rewrites it not to depend on that library any more.
> But the current situation is completely different.  The code
> maintenance burden resulting from keeping exclusive-mode backups is
> mostly hypothetical; the code works as well as it ever did.  We rarely
> take the position that we're going to rip out older code that doesn't
> work as nicely as newer code does just because nobody's prepared to
> e.g. better-document the old code.

I'd argue that if something which has been deprecated for years is
seriously getting in the way of making progress that we should be
willing to accept a proposal to rip it out after giving people
sufficient time to adjust, and that if someone really wants to keep that
deprecated API then they need to be willing to spend the time to address
the reasons why it was deprecated.

> For example, we still have FEBE protocol 2 support floating around.
> If you're going to talk about badly-designed footguns that ought to be
> excised with extreme prejudice, that would IMHO be a far better place
> to start than this.  Unlike this code, which it's now obvious is used
> by quite a number of people even just among those who read this list
> regularly, that code is probably nearly unused and untested and, if we
> broke it, we probably wouldn't know.

You can probably guess my feelings with regard to the FEBE v2 support.

At least, in that case, people aren't really using it though.  What I
really dislike is that we've got a *lot* of people using something that
we *know* has some pretty serious potential data-eating problems.
Saying that we should keep it around because a lot of people are using
it isn't the way to persuade me that we should keep it.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Christophe Pettus (x...@thebuild.com) wrote:
> > On Feb 25, 2019, at 08:55, Stephen Frost  wrote:
> > 
> > I honestly do doubt that they have had the same experiences that I have
> > had
> 
> Well, I guarantee you that no two people on this list have had identical 
> experiences. :)  

Indeed! :)

> I certainly have been bitten by the problems with the current system.  But 
> the resistance to major version upgrades is *huge*, and I'm strongly biased 
> against anything that will make that harder.  I'm not sure I'm communicating 
> how big a problem telling many large installations, "If you move to 
> v12/13/etc., you will have to change your backup system" is going to be.

Aren't they going to need to make a change for v12 now anyway?

Hopefully they're regularly testing their backups by doing a restore of
them, and dropping a recovery.conf into the directory of a v12 system
after restore will do exactly nothing and they'll get errors complaining
about how they need to provide a restore_command.

That's actually what prompted bringing this painful topic up again-
there's a large change being done to how backup/restore with PG is done
(the recovery.conf file is going away), and people who have written any
kind of automation (or even just documented procedures...) around that
will have to update their systems.  At least from my perspective, making
them have to do such an update once, instead of once now and another
time in the future when we remove exclusive backup (or figure out a way
to do it that's safe and update the instructions for how to do it
right...), is the better option.

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Feb 25, 2019 at 11:33:33AM -0500, Stephen Frost wrote:
> > I don't want to see more users stumbling over the issues with the
> > exclusive backup interface.  A better interface exists, and has existed
> > since 9.6.
> 
> Do you really think we would be having this discussion if the
> non-exclusive backup method was unequivocally better?  It is better for
> some use-cases, and impossible for others.

Based on Christophe's comment above, anything which required users to
make a change on upgrade to their backup system would be cause to have
this discussion, which likely includes most possible solutions to the
issues with exclusive backup too, unfortunately..

> Also, you can't say it will have no impact for five years on people who
> do not upgrade.  The impact will be that they will have no new Postgres
> features for five years.

I don't think I made the claim that there wouldn't be any impact for
five years, I said they would continue to have support for five years.

Also, this is exactly what we tell them for any other breaking change
(such as removal of recovery.conf).

> I am not taking a stance on this issue, but making unclear statements
> isn't helping the discussion.

It's not my intent to make unclear statements, so I certainly appreicate
you, and anyone else, pointing out when I do.  I'm happy to clarify.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 2/25/19 7:50 PM, Fujii Masao wrote:
> >Thought?
> 
> Here's the really obvious bad thing: if users do not update their procedures
> and we ignore backup_label.pending on startup then they will end up with a
> corrupt database because it will not replay from the correct checkpoint.  If
> we error on the presence of backup_label.pending then we are right back to
> where we started.

Right, we definitely can't make a change which would cause non-updated
systems using the same APIs to suddenly end up with corruption.  If we
require a change be made to the scripts that run the process then we
must make it something the user is essentially opt'ing into- where
they've told us explicitly that they're ready to use the new interface,
which is how the non-exclusive backup mode was added.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-25 08:14:16 -0500, Stephen Frost wrote:
> > > It will be annoying if after this removal, companies must change their
> > > backup strategy by using specific postgres tools (pgbackrest, barman).
> > 
> > You don't write your own database system using CSV files and shell
> > magic, do you?  I have to say that it continues to boggle my mind that
> > people insist that *this* part of the system has to be able to be
> > implementable using shell scripts.
> > 
> > Folks, these are your backups we're talking about, your last resort if
> > everything else goes up in flames, why do you want to risk that by
> > implementing your own one-off solution, particularly when there's known
> > serious issues using that interface, and you want to just use shell
> > scripts to do it...?
> 
> FWIW, if you weren't selling backrest quite so hard everywhere backups
> are mentioned, I'd find this thread a lot more convicing.

I push pgbackrest, just like I push PG, because I view it as the best
open source tool out there for the job, well architected, designed from
the ground-up to do what a backup tool needs to, and it has a strong
community around it with people contributing back, all openly and under
a permissive license which allows anyone to use it, hack on it, and do
what they want with it- basically, I see it as the PG of backup tools
for PG.  What PG is to relational databases, pgbackrest is to PG backup
tools, imv.

As David mentioned already, this really doesn't change things for
pgbackrest one way or the other- we don't use the old API and haven't
since the non-exclusive API was available.  In addition, pgbackrest
hasn't got a solution for the "we just want to take a snapshot" or "we
just want to use start/stop hooks".  If I was really trying to push
people to pgbackrest because of the flaws in the exclusive backup mode,
I'd surely develop answers to those issues first and make them part of
pgbackrest in short order, and then start yelling from the tops of
buildings about how broken the exclusive backup API is and about how
I've got an easy solution for everyone to move to, maybe do a blog post
on planet.  I'm not doing that though, instead I'm discussing the
issues here, with a tiny fraction of our community, and pushing for us
to agree to get rid of a dangerous API in favor of an API that anyone
can use w/o pgbackrest, if they want to.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-25 08:42:02 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > On 2019-02-24 17:19:22 -0500, Stephen Frost wrote:
> > > > You say above that the new interface is unquestionably an improvement
> > > 
> > > FWIW, I think we didn't design it even remotely as well as we ought to
> > > have. It was both a mistake to not offer a version of non-exclusive
> > > backups that works with a client connection (for integration with the
> > > TSMs of this world), and it was a mistake not to offer a commandline
> > > tool that does the non-exclusive pg/start backup.
> > 
> > I'm not really following- did you mean to say "without a client
> > connection" above?
> 
> Yes.

Ah, alright, that makes more sense then.

You might bring that up with the individual who wrote it.

> > We could still add some kind of commandline tool to help with the
> > non-exclusive pg start/stop backup
> 
> That doesn't help, as long as the connection requirement is there. As
> explained elsewhere in this thread, a lot of larger external backup
> infrastructure just gives you a start/stop hook, and that makes using a
> persistent connection hard and fragile.

I tend to agree w/ Magnus in that I'm not sure that this is actually
as common these days as claimed.  That said, I'm not against someone
trying to implement something that doesn't require a connection, though
I understand from discussion with David and his comments elsewhere on
this thread that it isn't that simple a thing to do- and people would
have to migrate to whatever it is anyway.

> > but I'm not really sure exactly what that would look like and I
> > honestly don't have terribly much interest in spending resources on
> > implementing it since it would lack a great many features that we'd
> > then end up wanting to add on... and then we end up backing into
> > building yet another backup tool.
> 
> Meh. You are proposing to remove a feature (even if a dangerous
> one). Requiring some minimal compat work to make htat more palpaple
> isn't crazy.  Nor does using backrest or whatnot do *ANYTHING* about the
> users of large company wide backup tools.

Yes, I'm supporting the proposal to remove a dangerous feature.  I
really don't think much more needs to be said than that.  If the feature
wasn't dangerous then I wouldn't be argueing for its removal.  I do feel
bad that pgbackrest hasn't got a solution already to this, but we've had
other priorities.  I do think we'll get there, eventually, but not any
time soon.  Being told that I have to implement another feature so that
we can remove the dangerous one is something I don't agree with.

> > > > I really, honestly, believe what we need to *stop* doing is trying to
> > > > drag along support for old/deprecated interfaces after we've introduced
> > > > new ones, thus avoiding these arguments and the time spent on them, and
> > > > the time spent dealing with implementing and documenting new APIs around
> > > > old ones.  The only thing it seems to unquestionably do is make us argue
> > > > about when we are going to *finally* rip out the old/deprecated API (or
> > > > interface, or whatever you want to call it).
> > > 
> > > While I agree that we should remove non-exclusive backups, I VEHEMENTLY
> > > oppose the unconditionality of the above statement. Yes, it's annoying
> > > to keep interfaces around, but unnecessarily inflicting pain to everyone
> > > also is annoying.
> > 
> > Alright, then how about we provide a bit of help for everyone who's got
> > a system built around recovery.conf today, instead of just completely
> > ripping that out?
> 
> Oh, comeon. Obviously we're sometimes going to have to make breaking
> changes. But you're essentially arguing that we shouldn't provide compat
> where we can come up with a reasonable way to provide backward
> compatibility. And I think that's crazy and will hurt postgres.

This was discussed, *extensively*, previously on this list with a whole
lot of proposals trying to come up with easy ways to "fix" exclusive
backups and I don't really think that this thread has proposed anything
novel in that regard.  Basically, I just don't think it's that easy.
Perhaps I'm wrong, in which case, great, I'd be happy to review a simple
patch that solves it.

> > > That's not to say we shouldn't be better at announcing and then
> > > following through on the depre

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> Hmm, so what you're saying is that you'd like to disable an API that
> some non-backrest users are relying upon but which no backrest users
> are relying upon.  And you don't understand why some non-backrest
> users are opposed to that plan.  Is that a correct summary of your
> position?

This topic has come up more than once and on multiple threads (entirely
independent of pgbackrest and any of the other individuals on this
thread) and I think it needs to be a discussion held at the PGCon
Developer meeting.

I'll propose it.

If you'd like to propose something to fix the, as well discussed,
dangerous exclusive backup method, or further litigate at what point a
dangerous and misleading interface should be removed, I'm sure you'll
find we'd be happy to talk about that.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-25 Thread Stephen Frost

Greetings Mark,

* Mark Kirkwood (mark.kirkw...@catalyst.net.nz) wrote:
> ISTM that the onus should be on the patch submitter to provide additions to
> pg_basebackup that make it as painless as possible for those people *not*
> using pgBackRest to continue making backups. Breaking this is just not
> right. Submitting patches that mean that people *must* use pgBackRest is
> also not right IMHO.

I'm sorry that there's some confusion here- to be clear, no one is
required to use pgBackRest.  pg_basebackup works quite well and wouldn't
be impacted by the changes proposed no this thread.  The arguments
against removing the exclusive backup feature don't have anything to do
with pg_basebackup.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Stephen Frost

Greetings,

* Mark Kirkwood (mark.kirkw...@catalyst.net.nz) wrote:
> On 26/02/19 5:41 PM, Stephen Frost wrote:
> >* Mark Kirkwood (mark.kirkw...@catalyst.net.nz) wrote:
> >>ISTM that the onus should be on the patch submitter to provide additions to
> >>pg_basebackup that make it as painless as possible for those people *not*
> >>using pgBackRest to continue making backups. Breaking this is just not
> >>right. Submitting patches that mean that people *must* use pgBackRest is
> >>also not right IMHO.
> >I'm sorry that there's some confusion here- to be clear, no one is
> >required to use pgBackRest.  pg_basebackup works quite well and wouldn't
> >be impacted by the changes proposed no this thread.  The arguments
> >against removing the exclusive backup feature don't have anything to do
> >with pg_basebackup.
>
> Ah yes (checks pg_basbackup code), you are correct! Reading this thread I
> thought I saw a comment to the effect that pg_basebackup was being broken,
> hence the less than impressed post.

No, I don't see breaking pg_basebackup as ok, but I do have concerns
about how it doesn't provide for any validation of the backup unless you
use tar+gzip.  Unlike with exclusive mode, at least you don't run the
risk with pg_basebackup that a crash will mean that PG won't come back
up without intervention.  I can understand how the comments which you
were responding to might have lead to that confusion.

> Your relentless bashing of people doing their own backups and heavy
> marketing of pgBackRest - unfortunately - made it easy for me to believe
> that this was a possibility that you might see as ok. So - apologies for the
> misunderstanding, however less marketing of your own product would avoid me
> jumping to the wrong conclusion.

As for my personal recommendations, if you go back far enough, you can
see where I used to discuss how to use the exclusive API for doing
backups with people, because maybe they didn't like the tools available
at the time (I didn't, after all), but after seeing how many issues come
from doing that and realizing how bad it is that things like our
documented archive_command doesn't ensure that WAL is actually written
out to disk, I realized why dedicated tools had been written and started
to recommend that people pick one of the dedicated tools instead of
writing their own (and I used to recommend the other solution like
barman right alongside pgBackRest, and I still do make the generic
recommendation at times that people not try to build their own backup
tool) but then after seeing things like CIFS mounts happily throwing
kernel errors while returning success on (obviously, actually failed)
fsync() calls, resulting in corrupted backups that would silently be
corrupt if restored from, I realized that having an actual manifest of
all files backed up and their checksums wasn't a "nice to have" kind of
feature for a backup tool but something critical to a backup solution
(thankfully, David knew that right from the start and so it was always
part of pgBackRest) and so I have a pretty hard time recommending
solutions that don't have that these days.

As for back to the main topic of this thread, do I hope that removing
the exclusive backup mode will cause people to look at and consider
existing, purpose-build, backup solutions for PG, rather than trying to
update their one-off backup shell scripts?  Yes, of course I do.  Am I
trying to sneak something in that's going to force them to use
pgBackRest?  No, of course not.  Note that pg_basebackup, barman and
pgBackRest have had support for the non-exclusive API since it was
introduced (or, for barman, even before then if one installed the
pgespresso extension, and pg_basebackup since it was written because it
uses the replication protocol), and so does WAL-G (though I'm not sure
exactly when it got it, or maybe it always had it), and pg_probackup
(which appears to have gotten it in May 2017, less than a year after 9.6
was released), and even the shell-based pitery (which got it at some
point before pg_probackup, it looks like, based on the commits).

Considering there's at least *5* other backup tools which already
support the non-exclusive API, all of which have done so since the
non-exclusive mode was introduced or shortly after, I really have to
question the angle of attack being used here which is claiming that
this thread is motivated by a goal of forcing people to pgBackRest.

The main group who would be impacted by the exclusive mode going away
are those who aren't using an existing solution and who either didn't
get the memo about exclusive mode being deprecated, or decided just to
ignore it, and those are the installations which really do concern me
the most because they're the highest risk for people ending up losing
their data.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-26 Thread Stephen Frost

Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> I think the fundamental problem with all these approaches is that there is
> no safe way to distinguish a server crashed in backup mode from a restored
> backup.  This is what makes the problem so hard.

Right- if you want to just call start/stop and take a snapshot in the
middle and then be able to restore that directly and start up the
database, then there *can't* be any way to distinguish between the two,
which is, I'm pretty sure, where this whole discussion ended up back
during the 9.6 development cycle and why it's still an issue.  If there
was an easy way to fix this, I feel like we would have already.

> The existing exclusive backup is in my opinion the safest variant: it refuses
> to create a corrupted cluster without manual intervention and gives you a dire
> warning to consider if you are doing the right thing.

... it's the least dangerous if you limit yourself to that method, but
that doesn't make it safe. :(

In the end, you basically *have* to have a way of extracting out the
data needed for the backup (start/stop WAL and such) that doesn't make
the running cluster look like it's a backup being restored, and you
*have* to make that information available to the database cluster when
it's restored somehow, and notify PG that it's doing backup recovery and
*not* crash recovery, to eliminate this risk, and that's pretty hard to
manage if all you want to do is snapshot the filesystem.

Of course, you have to have a solution for WAL too and the thought has
crossed my mind that maybe there's something we could do when it comes
to stash all the info needed in the WAL archive, but I'm still not sure
how we'd solve for knowing if we're doing backup recovery or crash
recovery in that case without some kind of marker or something external
telling us that's what we're doing.  As you proposed previously, but
with a bit of a twist, maybe we could just always do backup recovery if
we find a .backup (or whatever) file in the WAL that, when compared to
pg_control, shows that we were in the process of doing a backup...  That
would require that everyone always have a restore_command set, which
wasn't possible before because that went into recovery.conf, but it's
possible to just always have that set now, and that would eliminate the
risk of us running the system out of disk space by keeping all the WAL
that's generated during the backup local.

Obviously, a lot of this is pretty hand-wavy, and you still have the
unfortunate situation that if you're actually recoverying from a crash
that just happened to happen while you were taking a backup then you
could be replaying a heck of a lot more WAL than you needed to, and you
have to have a working restore_command on the primary, and you'd have to
figure out a way for PG to check for these files .backup or whatever
files on startup that doesn't take forever or require stepping through
every WAL segment or something, but maybe those concerns could be
addressed.

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Remove Deprecated Exclusive Backup Mode

2019-02-28 Thread Stephen Frost

Greetings,

* Fujii Masao (masao.fu...@gmail.com) wrote:
> On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe  wrote:
> >
> > Fujii Masao wrote:
> > > So, let me clarify the situations;
> > >
> > > (3) If backup_label.pending exists but recovery.signal doesn't, the server
> > >ignores (or removes) backup_label.pending and do the recovery
> > >starting the pg_control's REDO location. This case can happen if
> > >the server crashes while an exclusive backup is in progress.
> > >So crash-in-the-middle-of-backup doesn't prevent the recovery from
> > >starting in this idea
> >
> > That's where I see the problem with your idea.
> >
> > It is fairly easy for someone to restore a backup and then just start
> > the server without first creating "recovery.signal", and that would
> > lead to data corruption.
> 
> Isn't this case problematic even when a backup was taken by pg_basebackup?
> Because of lack of recovery.signal, no archived WAL files are replayed and
> the database may not reach to the latest point.

There's a number of problems, imv, that I think have just been
ignored/missed when it comes to the recovery.conf removal and tools like
pg_basebackup, see this msg also:

https://postgr.es/m/20181127153405.gx3...@tamriel.snowman.net

At this point, I think we need to put these issues on the v12 open items
page so that we don't forget about them and get these things fixed
before v12 comes out.

Thanks!

Stephen


signature.asc
Description: PGP signature

PostgreSQL Participates in GSoC 2019!

2019-02-28 Thread Stephen Frost

Greetings,

I'm pleased to announce that we have been accepted by Google to
participate in the Summer of Code (GSoC) 2019 program. This will be the
12th time that the PostgreSQL Project will provide mentorship for
students to help develop new features for PostgreSQL. We have the chance
to accept student projects that will be developed from May to August.

If you are a student, and want to participate in this year's GSoC,
please watch this Wiki page: https://wiki.postgresql.org/wiki/GSoC

If you are interested in mentoring a student, you can add your own idea
to the project list. Please reach out to the PG GSoC admins, listed
here: https://wiki.postgresql.org/wiki/GSoC

And finally, we ask everyone to reach out to students and universities
and let them know about GSoC.

Thanks!

Stephen
PostgreSQL GSoC 2019 Administrator


signature.asc
Description: PGP signature

Re: Online verification of checksums

2019-03-02 Thread Stephen Frost

Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >  wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> > 
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
> 
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?
> 
> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> pg_verify_checksums in tight loops) with the current patch version, and
> I am seeing short reads very, very rarely (maybe every 1000th run) with
> a warning like:
> 
> |1174
> |pg_verify_checksums: warning: could not read block 374 in file 
> "data/base/18032/18045": read 4096 of 8192
> |pg_verify_checksums: warning: could not read block 375 in file 
> "data/base/18032/18045": read 4096 of 8192
> |Files skipped: 2
> 
> The 1174 is the sequence number, the first 1173 runs of
> pg_verify_checksums only skipped blocks.
> 
> However, the fact it shows two warnings for the same file means there is
> something wrong here. It was continueing to the next block while I think
> it should just skip to the next file on read failures. So I have changed
> that now, new patch attached.

I'm confused- if previously it was continueing to the next block instead
of doing the re-read on the same block, why don't we just change it to
do the re-read on the same block properly and see if that fixes the
retry, instead of just giving up and skipping..?  I'm not necessairly
against skipping to the next file, to be clear, but I think I'd be
happier if we kept reading the file until we actually get EOF.

(I've not looked at the actual patch, just read what you wrote..)

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: GSoC 2019

2019-03-02 Thread Stephen Frost

Greetings,

* Sumukha Pk (sumukhap...@gmail.com) wrote:
> I am Sumukha PK a student of NITK. I am interested in the WAL-G backup tool. 
> I haven’t been able to catch hold of anyone through the IRC channels so I 
> need someone to point me to appropriate resources so that I can be introduced 
> to it. I am proficient in Golang an would be interested to work on this 
> project.

Please work with Andrey on coming up with a project plan to be submitted
formally through the GSoC website.

Thanks!

Stephen


signature.asc
Description: PGP signature

Re: Online verification of checksums

2019-03-05 Thread Stephen Frost

Greetings,

On Tue, Mar 5, 2019 at 18:36 Michael Paquier  wrote:

> On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> > Based on quickly skimming that thread the main issue seems to be
> > deciding which files in the data directory are expected to have
> > checksums. Which is a valid issue, of course, but I was expecting
> > something about partial read/writes etc.
>
> I remember complaining about partial write handling as well for the
> base backup checks...  There should be an email about it on the list,
> cannot find it now ;p
>
> > My understanding is that:
> >
> > (a) The checksum verification should not generate false positives (same
> > as for basebackup).
> >
> > (b) The partial reads do emit warnings, which might be considered false
> > positives I guess. Which is why I'm arguing for changing it to do the
> > same thing basebackup does, i.e. ignore this.
>
> Well, at least that's consistent...  Argh, I really think that we
> ought to make the failures reported harder because that's easier to
> detect within a tool and some deployments set log_min_messages >
> WARNING so checksum failures would just be lost.  For base backups we
> don't care much about that as files are just blindly copied so they
> could have torn pages, which is fine as that's fixed at replay.  Now
> we are talking about a set of tools which could have reliable
> detection mechanisms for those problems.

I’m traveling but will try to comment more in the coming days but in
general I agree with Tomas on these items. Also, pg_basebackup has to
handle torn pages when it comes to checksums just like the verify tool
does, and having them be consistent (along with external tools) would
really be for the best, imv.  I still feel like a retry of a short read
(try reading more to get the whole page..) would be alright and reading
until we hit eof and then moving on. I’m not sure it’s possible but I do
worry a bit that we might get a short read from a network file system or
something that isn’t actually at eof and then we would skip a significant
remaining portion of the file...   another thought might be to stat the
file after we have opened it to see it’s length...

Just a few thoughts since I’m on my phone.  Will try to write up something
more in a day or two.

Thanks!

Stephen

< 2 3 4 5 6 7 8 9 10 11 >

601 - 700 of 1608 matches

Mail list logo