Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-20 Thread Chris Travers
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:        not tested

I have a very serious concern about the current patch set, as someone who has 
faced transaction ID wraparound in the past.

I can start by saying I think it would be helpful (if the other issues are 
approached reasonably) to have 64-bit xids, but there is an important piece of 
context in preventing xid wraparounds that seems to be missing from this patch, 
unless I missed something.

XID wraparound is a symptom, not an underlying problem.  It usually occurs when 
autovacuum or other vacuum strategies hit unexpected stalls and therefore fail 
to work as expected.  Shifting to 64-bit XIDs dramatically changes the sorts of 
problems that these stalls pose to operational teams: you can find yourself 
running out of storage rather than facing an imminent database shutdown.  
Worse, this patch delays the problem until some (possibly far later!) time, 
when vacuum will take far longer to finish and the options for resolving the 
problem are diminished.  As a result, I am concerned that merely changing xids 
from 32-bit to 64-bit will lead to a smaller number of far more serious 
outages.

What would make a big difference from my perspective would be to combine this 
with an inverse warning system: let the administrator configure a threshold on 
the number of xids consumed since the last vacuum, and emit warnings once it is 
crossed.  We could default this to two billion, which would surface operational 
warnings not much later than they appear now.
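
For reference, the kind of check operators must script by hand today looks 
something like this (standard catalogs; a threshold GUC would effectively 
build this warning into the server):

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;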

Otherwise, I can imagine cases where a table that takes 30 hours to vacuum 
today instead takes 300 hours, on a database that is short on space.  I would 
not want to be facing such a situation.

The new status of this patch is: Waiting on Author


Re: Fix comments atop pg_get_replication_slots

2022-11-20 Thread sirisha chamarthi
Amit, thanks for looking into this!

On Sun, Nov 20, 2022 at 11:38 PM Amit Kapila wrote:

> On Mon, Nov 21, 2022 at 12:45 PM sirisha chamarthi wrote:
> >
> > Hi Hackers,
> >
> > The comment atop pg_get_replication_slots seems to indicate that it shows
> > only active replication slots. That is ambiguous, as the function actually
> > shows all slots, including lost and inactive ones. Attached a small patch
> > to fix it.
> >
>
> I agree that it is a bit confusing. How about "SQL SRF showing all
> replication slots that currently exist on the database cluster"?
>

Looks good to me. Attached a patch for the same.


>
> --
> With Regards,
> Amit Kapila.
>


0002-Fix-atop-pg_get_replication_slots-function-to-reflec.patch
Description: Binary data


Re: Reducing power consumption on idle servers

2022-11-20 Thread Simon Riggs
On Sun, 20 Nov 2022 at 22:55, Nathan Bossart  wrote:
>
> On Mon, Nov 21, 2022 at 10:31:15AM +1300, Thomas Munro wrote:
> > > On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs wrote:
> >> As a 3rd patch, I will work on making logical workers hibernate.
> >
> > Duelling patch warning: Nathan mentioned[1] that he's hacking on a
> > patch for that, along the lines of the recent walreceiver change IIUC.
>
> I coded something up last week, but, like the walreceiver patch, it caused
> check-world to take much longer [0], and I haven't looked into whether it
> could be easily fixed.  I'm hoping to make some time for this again in the
> near future.
>
> [0] https://postgr.es/m/20221113222644.GA1269110%40nathanxps13

OK, Nathan, will leave this one to you - remembering that we need to
fix ALL processes to get a useful power reduction when idle.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Fix comments atop pg_get_replication_slots

2022-11-20 Thread Amit Kapila
On Mon, Nov 21, 2022 at 12:45 PM sirisha chamarthi wrote:
>
> Hi Hackers,
>
> The comment atop pg_get_replication_slots seems to indicate that it shows
> only active replication slots. That is ambiguous, as the function actually
> shows all slots, including lost and inactive ones. Attached a small patch
> to fix it.
>

I agree that it is a bit confusing. How about "SQL SRF showing all
replication slots that currently exist on the database cluster"?

-- 
With Regards,
Amit Kapila.




Re: Reducing power consumption on idle servers

2022-11-20 Thread Simon Riggs
On Mon, 21 Nov 2022 at 05:07, Laurenz Albe  wrote:
>
> On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > I'll wait 24 hours before committing, to
> > provide a last chance for anyone who wants to complain about dropping
> > promote_trigger_file.
>
> Remove "promote_trigger_file"?  Now I have never seen anybody use that
> parameter, but I don't think that it is a good idea to deviate from our
> usual standard of deprecating a feature for about five years before
> actually removing it.

We aren't removing the ability to promote, just enforcing a change to
a better mechanism, hence I don't see a reason for a long(er)
deprecation period than we have already had.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-20 Thread Amit Kapila
On Sat, Nov 19, 2022 at 6:35 AM Andres Freund  wrote:
>
> On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:
> > Okay, updated the patch accordingly.
>
> Assuming it passes tests etc, this'd work for me.
>

Thanks, Pushed.

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't
> hold 256 for the largest node kind. That's not an issue in practice, since
> we never need to grow it, and we only compare that value with the count in
> an Assert(), so I just set it to zero. That does break an invariant, so
> it's not great. We could use 2 bytes to be strictly correct in all cases,
> but that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.

Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).

node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327


node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:

node4: 5 + 4 + (7 padding) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32 bytes
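
To make the arithmetic concrete, here is a standalone sketch (field names
invented here, not taken from the patch), assuming the 5-byte header is
fully packed single bytes, on a typical 64-bit platform:

#include <stdint.h>

typedef struct rt_node3_sketch
{
	uint8_t		kind;			/* 5-byte common header, fully packed */
	uint8_t		shift;
	uint8_t		chunk;
	uint8_t		count_lo;
	uint8_t		count_hi;
	uint8_t		chunks[3];		/* 5 + 3 = 8: pointer array lands aligned */
	void	   *children[3];	/* 24 bytes -> sizeof() == 32, no padding */
} rt_node3_sketch;

typedef struct rt_node4_sketch
{
	uint8_t		kind;
	uint8_t		shift;
	uint8_t		chunk;
	uint8_t		count_lo;
	uint8_t		count_hi;
	uint8_t		chunks[4];		/* 5 + 4 = 9: compiler pads 7 bytes here */
	void	   *children[4];	/* 32 bytes -> sizeof() == 48 */
} rt_node4_sketch;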

--
John Naylor
EDB: http://www.enterprisedb.com


Fix comments atop pg_get_replication_slots

2022-11-20 Thread sirisha chamarthi
Hi Hackers,

The comment atop pg_get_replication_slots seems to indicate that it shows
only active replication slots. That is ambiguous, as the function actually
shows all slots, including lost and inactive ones. Attached a small patch
to fix it.

Thanks,
Sirisha


0001-Fix-atop-pg_get_replication_slots-function-to-reflec.patch
Description: Binary data


Catalog_xmin is not advanced when a logical slot is lost

2022-11-20 Thread sirisha chamarthi
Hi Hackers,

Forking this thread from the discussion [1], as suggested by Amit.

Catalog_xmin is not advanced when a logical slot is invalidated (lost)
until the invalidated slot is dropped. This patch ignores invalidated slots
while computing the oldest xmin. Attached is a small patch to address this;
the output after the patch is shown below.

postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 | 0/30466368  | 0/304663A0          | reserved   |      28903824 | f
(1 row)

postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          2
(1 row)

postgres=# select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+------------
(0 rows)

postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
 slot_name |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 s2        | test_decoding | logical   |      5 | postgres | f         | f      |            |      |          771 |             | 0/304663A0          | lost       |               | f
(1 row)

postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname = 'pg_class';
 n_dead_tup
------------
          0
(1 row)


[1]
https://www.postgresql.org/message-id/flat/CAKrAKeW-sGqvkw-2zKuVYiVv%3DEOG4LEqJn01RJPsHfS2rQGYng%40mail.gmail.com

Thanks,
Sirisha


0001-Ignore-invalidated-slots-while-computing-the-oldest-.patch
Description: Binary data


Re: Reducing power consumption on idle servers

2022-11-20 Thread Bharath Rupireddy
On Mon, Nov 21, 2022 at 2:43 AM Thomas Munro  wrote:
>
> On Mon, Nov 21, 2022 at 3:35 AM Simon Riggs wrote:
> > On Sat, 19 Nov 2022 at 10:59, Simon Riggs wrote:
> > > New version attached.
> >
> > Fix for doc xref
>
> I removed a stray variable declaration from xlogrecovery.h, and wrote
> a draft commit message.  I'll wait 24 hours before committing, to
> provide a last chance for anyone who wants to complain about dropping
> promote_trigger_file.
>
> (We could of course change it so that the timeout based wakeup only
> happens if you have actually set promote_trigger_file, but I think
> we've established that we just don't want the filesystem polling
> feature so I'm whispering this in parentheses.)

Thanks. The v11 patch mostly looks good to me and it passes cirrus-ci
tests - 
https://github.com/BRupireddy/postgres/tree/do_away_with_promote_trigger_file.

I have a comment:

- * Wait for more WAL to arrive. Time out after 5 seconds
- * to react to a trigger file promptly and to check if the
- * WAL receiver is still active.
+ * Wait for more WAL to arrive, when we will be woken
+ * immediately by the WAL receiver. Use of trigger file
+ * via promote_trigger_file is now fully removed.
  */
Do we need to introduce a reference to the removal of promote_trigger_file
in the code? If left as-is, it'll linger there for many years. Isn't it
enough to mention it in appendix-obsolete-recovery-config.sgml?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor wrote:

> > > That means my idea for the pointer struct might have some problems,
> > > at least as currently implemented. Maybe in the course of separating out
> > > and polishing that piece, an inefficiency will fall out. Or, it might be
> > > another reason to template local and shared separately. Not sure yet. I
> > > also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:

     3,343,352      branch-misses:u           #    0.85% of all branches
   393,204,959      branches:u

Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation
> > tomorrow.
> >
>
> FYI I've not tested the patch you shared today, but here are the
> benchmark results I got with the v9 patch in my environment (I used
> the second filter). I split the 0004 patch into two: a pure
> refactoring patch to introduce rt_node_ptr, and a patch to do the
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.

[1]
https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-20 Thread Peter Smith
On Fri, Nov 18, 2022 at 6:03 PM Peter Smith  wrote:
>
> Here are some review comments for v47-0001
>
> (This review is a WIP - I will post more comments for this patch next week)
>

Here are the rest of my comments for v47-0001

==

doc/src/sgml/monitoring.sgml

1.

@@ -1851,6 +1851,11 @@ postgres   27093  0.0  0.0  30096  2752 ?
 Ss   11:34   0:00 postgres: ser
   Waiting to acquire an advisory user lock.
  
  
+  applytransaction
+  Waiting to acquire acquire a lock on a remote transaction being
+  applied on the subscriber side.
+ 
+ 

1a.
Typo "acquire acquire"

~

1b.
Maybe "on the subscriber side" does not mean much without any context.
Maybe better to word it as below.

SUGGESTION
Waiting to acquire a lock on a remote transaction being applied by a
logical replication subscriber.

==

doc/src/sgml/system-views.sgml

2.

@@ -1361,8 +1361,9 @@
virtualxid,
spectoken,
object,
-   userlock, or
-   advisory.
+   userlock,
+   advisory or
+   applytransaction.

This change removed the Oxford comma that was there before. I assume
it was unintended.

==

.../replication/logical/applyparallelworker.c

3. globals

The parallel_apply_XXX functions were all shortened to pa_XXX.

I wondered if the same simplification should be done also to the
global statics...

e.g.
ParallelApplyWorkersHash -> PAWorkerHash
ParallelApplyWorkersList -> PAWorkerList
ParallelApplyMessagePending -> PAMessagePending
etc...

~~~

4. pa_get_free_worker

+ foreach(lc, active_workers)
+ {
+ ParallelApplyWorkerInfo *winfo = NULL;
+
+ winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

No need to assign NULL because the next line just overwrites that anyhow.

~

5.

+ /*
+ * Try to free the worker first, because we don't wait for the rollback
+ * command to finish so the worker may not be freed at the end of the
+ * transaction.
+ */
+ if (pa_free_worker(winfo, winfo->shared->xid))
+ continue;
+
+ if (!winfo->in_use)
+ return winfo;

Shouldn't the (!winfo->in_use) check be done first as well -- e.g. why
are we trying to free a worker which is maybe not even in_use?

SUGGESTION (this will need some comment to explain what it is doing)
if (!winfo->in_use || !pa_free_worker(winfo, winfo->shared->xid) &&
!winfo->in_use)
return winfo;

~~~

6. pa_free_worker

+/*
+ * Remove the parallel apply worker entry from the hash table. Stop the work if
+ * there are enough workers in the pool.
+ *

Typo? "work" -> "worker"

~

7.

+ /* Are there enough workers in the pool? */
+ if (napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
+ {

IMO that comment should be something more like "Don't detach/stop the
worker unless..."

~~~

8. pa_send_data

+ /*
+ * Retry after 1s to reduce the cost of getting the system time and
+ * calculating the time difference.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 1000L,
+ WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);

8a.
I am not sure you need to explain the reason in the comment. Just
saying "Wait before retrying." seems sufficient to me.

~

8b.
Instead of the hardwired "1s" in the comment, and 1000L in the code,
maybe better to just have another constant.

SUGGESTION
#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 1
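
For example, the retry call above would then read (sketch only):

(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 SHM_SEND_RETRY_INTERVAL_MS,
				 WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);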

~

9.

+ if (startTime == 0)
+ startTime = GetCurrentTimestamp();
+ else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),

IMO the initial startTime should be set at the top of the function,
otherwise the timeout calculation seems wrong.

==

src/backend/replication/logical/worker.c

10. handle_streamed_transaction

+ * In streaming case (receiving a block of streamed transaction), for
+ * SUBSTREAM_ON mode, simply redirect it to a file for the proper toplevel
+ * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to parallel
+ * apply workers (LOGICAL_REP_MSG_RELATION or LOGICAL_REP_MSG_TYPE changes
+ * will be applied by both leader apply worker and parallel apply workers).

I'm not sure this function comment should be referring to SUBSTREAM_ON
and SUBSTREAM_PARALLEL because the function body does not use those
anywhere in the logic.

~~~

11. apply_handle_stream_start

+ /*
+ * Increment the number of messages waiting to be processed by
+ * parallel apply worker.
+ */
+ pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
+

The &() parens are not needed. Just write &winfo->shared->pending_message_count.

Also, search/replace others like this -- there are a few of them.

~~~

12. apply_handle_stream_stop

+ if (!abort_toplevel_transaction &&
+ pg_atomic_sub_fetch_u32(&(MyParallelShared->pending_message_count), 1) == 0)
+ {
+ pa_lock_stream(MyParallelShared->xid, AccessShareLock);
+ pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
+ }

That lock/unlock seems like it is done just as a way of
testing/waiting for an exclusive lock held on the xid to be released.
But the code is too tricky -- IMO it 

Re: Reducing power consumption on idle servers

2022-11-20 Thread Bharath Rupireddy
On Mon, Nov 21, 2022 at 10:37 AM Laurenz Albe  wrote:
>
> On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > I'll wait 24 hours before committing, to
> > provide a last chance for anyone who wants to complain about dropping
> > promote_trigger_file.
>
> Remove "promote_trigger_file"?  Now I have never seen anybody use that
> parameter, but I don't think that it is a good idea to deviate from our
> usual standard of deprecating a feature for about five years before
> actually removing it.

I'm not sure what the guidelines are here; however, years have gone by
since pg_ctl promote [1] in 2011 and pg_promote() [2] in 2018 were
added. With two such alternatives in place for many years, this was a
sort of undeclared deprecation of the promote_trigger_file GUC. And the
changes required to move from the GUC to the newer mechanisms aren't
hard for those who still rely on it. Therefore, I think it's now time
for us to do away with the GUC.
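
For reference, the two remaining mechanisms are as simple as:

$ pg_ctl promote

postgres=# SELECT pg_promote();  -- waits for promotion to finish by default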

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4695da5ae97bbb58d274887fd68edbe88d03ebcb
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10074651e3355e2405015f6253602be8344bc829

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Error-safe user functions

2022-11-20 Thread Tom Lane
Corey Huinker  writes:
> On Tue, Nov 15, 2022 at 11:36 AM Andrew Dunstan  wrote:
>> Nikita,
>> just checking in, are you making progress on this? I think we really
>> need to get this reviewed and committed ASAP if we are to have a chance
>> to get the SQL/JSON stuff reworked to use it in time for release 16.

> I'm making an attempt at this or something very similar to it. I don't yet
> have a patch ready.

Cool.  We can't delay too much longer on this if we want to have
a credible feature in v16.  Although I want a minimal initial
patch, there will still be a ton of incremental work to do after
the core capability is reviewed and committed, so there's no
time to lose.

regards, tom lane




Re: Error-safe user functions

2022-11-20 Thread Corey Huinker
On Tue, Nov 15, 2022 at 11:36 AM Andrew Dunstan  wrote:

>
> On 2022-10-07 Fr 13:37, Tom Lane wrote:
>
>
> [ lots of detailed review ]
>
> > Basically, this patch set should be a lot smaller and not have ambitions
> > beyond "get the API right" and "make one or two datatypes support COPY
> > NULL_ON_ERROR".  Add more code once that core functionality gets reviewed
> > and committed.
> >
> >
>
>
> Nikita,
>
> just checking in, are you making progress on this? I think we really
> need to get this reviewed and committed ASAP if we are to have a chance
> to get the SQL/JSON stuff reworked to use it in time for release 16.
>
>
I'm making an attempt at this or something very similar to it. I don't yet
have a patch ready.


Re: Multitable insert syntax support on Postgres?

2022-11-20 Thread Corey Huinker
>
> WITH data_src AS (SELECT * FROM source_tbl),
>   insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
>   insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5)
> INSERT INTO c SELECT * FROM data_src WHERE d < 5
>

I suppose you could just do a dummy SELECT at the bottom to make it look
more symmetrical

WITH data_src AS (SELECT * FROM source_tbl),
  insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5),
  insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5),
  insert_c AS (INSERT INTO c SELECT * FROM data_src WHERE d < 5)
SELECT true AS inserts_complete;

Or maybe get some diagnostics out of it:

WITH data_src AS (SELECT * FROM source_tbl),
  insert_a AS (INSERT INTO a SELECT * FROM data_src WHERE d < 5 RETURNING NULL),
  insert_b AS (INSERT INTO b SELECT * FROM data_src WHERE d >= 5 RETURNING NULL),
  insert_c AS (INSERT INTO c SELECT * FROM data_src WHERE d < 5 RETURNING NULL)
SELECT
  (SELECT COUNT(*) FROM insert_a) AS new_a_rows,
  (SELECT COUNT(*) FROM insert_b) AS new_b_rows,
  (SELECT COUNT(*) FROM insert_c) AS new_c_rows;


Re: Reducing power consumption on idle servers

2022-11-20 Thread Laurenz Albe
On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> I'll wait 24 hours before committing, to
> provide a last chance for anyone who wants to complain about dropping
> promote_trigger_file.

Remove "promote_trigger_file"?  Now I have never seen anybody use that
parameter, but I don't think that it is a good idea to deviate from our
usual standard of deprecating a feature for about five years before
actually removing it.

Yours,
Laurenz Albe




Re: Multitable insert syntax support on Postgres?

2022-11-20 Thread Corey Huinker
On Mon, Nov 14, 2022 at 7:06 PM Alexandre hadjinlian guerra <alexhgue...@gmail.com> wrote:

> Hello
> Are there any plans to incorporate a formal multitable/conditional insert
> syntax, similar to the syntax below? Snowflake does have the same feature.
>
> https://oracle-base.com/articles/9i/multitable-inserts
>
> Today, I'm resorting to a function that receives the necessary parameters
> from the attribute definition/selection area of a SQL SELECT query, called
> for each tuple retrieved. A proper syntax would be really cool.
>
> Thanks!
>

I'm not aware of any efforts to implement this at this time, mostly because
I don't think it's supported in the SQL Standard. Being in the standard
would change the question from "why" to "why not".

I've used that feature when I worked with Oracle in a data warehouse
situation. I found it most useful when migrating data dumps from mainframes
where the data file contained subrecords and in cases where one field in a
row changes the meaning of subsequent fields in the same row. That may
sound like a First Normal Form violation, and it is, but such data formats
are common in the IBM VSAM world, or at least they were in the data dumps
that I had to import.


Re: Unstable regression test for contrib/pageinspect

2022-11-20 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Nov 20, 2022 at 12:37 PM Tom Lane  wrote:
>> The core reloptions.sql and vacuum.sql tests are two places that are
>> also using this option, but they are applying it to temp tables,
>> which I think makes it safe (and the lack of failures, seeing that
>> they run within parallel test groups, reinforces that).  Can we apply
>> that idea in pageinspect?

> I believe so. The temp table horizons guarantee isn't all that old, so
> the tests may well have been written before it was possible.

Ah, right, I see that that only dates back to v14 (cf a7212be8b).
So we can fix pageinspect's issue by making that table be temp,
but only as far back as v14.

That's probably good enough in terms of reducing the buildfarm
noise level, seeing that mamba has only reported this failure
on HEAD so far.  I'd be tempted to propose back-patching
a7212be8b, but there would be ABI-stability issues, and it's
probably not worth dealing with that.

>> contrib/amcheck and contrib/pg_visibility are also using
>> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.

> I think that most use of DISABLE_PAGE_SKIPPING by the regression tests
> just isn't necessary.

Apparently not -- see followup discussion with Mark Dilger.

regards, tom lane




Re: Unstable regression test for contrib/pageinspect

2022-11-20 Thread Tom Lane
Mark Dilger  writes:
> On Nov 20, 2022, at 12:37 PM, Tom Lane  wrote:
>> contrib/amcheck and contrib/pg_visibility are also using
>> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
>> I haven't seen them fall over, though.

> In the amcheck regression test case, it's because the test isn't
> sensitive to whether the freeze actually happens.  You can comment
> out that line, and the only test difference is the comment:

Interesting.  I tried that with pg_visibility, with the same result:
removing its VACUUM commands altogether changes nothing else in the
test output.  I'm not sure this is a good thing.  It makes one wonder
whether these tests really test what they claim to.  But it certainly
explains the lack of failures.

> The amcheck TAP test is sensitive to commenting out the freeze, though:
> ...
> But the TAP test also disables autovacuum, so a background
> auto-analyze shouldn't be running.  Maybe that's why you haven't
> seen amcheck fall over?

Ah, right, I see

$node->append_conf('postgresql.conf', 'autovacuum=off');

in 001_verify_heapam.pl.  So that one's okay too.

Bottom line seems to be that converting pageinspect's test table
to a temp table should fix this.  If no objections, I'll do that
tomorrow.

regards, tom lane




Re: Unstable regression test for contrib/pageinspect

2022-11-20 Thread Mark Dilger



> On Nov 20, 2022, at 12:37 PM, Tom Lane  wrote:
> 
> contrib/amcheck and contrib/pg_visibility are also using
> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
> I haven't seen them fall over, though.

In the amcheck regression test case, it's because the test isn't sensitive to 
whether the freeze actually happens.  You can comment out that line, and the 
only test difference is the comment:

@@ -108,8 +108,8 @@
 ERROR:  ending block number must be between 0 and 0
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 1, 
endblock := 11000);
 ERROR:  starting block number must be between 0 and 0
--- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
+-- -- Vacuum freeze to change the xids encountered in subsequent tests
+-- VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');


The amcheck TAP test is sensitive to commenting out the freeze, though:

t/001_verify_heapam.pl .. 42/? 
#   Failed test 'all-frozen corrupted table skipping all-frozen'
#   at t/001_verify_heapam.pl line 58.
#  got: '0|3||line pointer redirection to item at offset 21840 exceeds 
maximum offset 38
# 0|4||line pointer to page offset 21840 with length 21840 ends beyond maximum 
page offset 8192
# 0|5||line pointer redirection to item at offset 0 precedes minimum offset 1
# 0|6||line pointer length 0 is less than the minimum tuple header size 24
# 0|7||line pointer to page offset 15 is not maximally aligned
# 0|8||line pointer length 15 is less than the minimum tuple header size 24'
# expected: ''
t/001_verify_heapam.pl .. 211/? # Looks like you failed 1 test of 272.
t/001_verify_heapam.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/272 subtests 
t/002_cic.pl  ok   
t/003_cic_2pc.pl  ok   

Test Summary Report
---
t/001_verify_heapam.pl (Wstat: 256 (exited 1) Tests: 272 Failed: 1)
  Failed test:  80
  Non-zero exit status: 1
Files=3, Tests=280, 10 wallclock secs ( 0.05 usr  0.02 sys +  3.84 cusr  3.10 
csys =  7.01 CPU)
Result: FAIL
make: *** [check] Error 1


But the TAP test also disables autovacuum, so a background auto-analyze 
shouldn't be running.  Maybe that's why you haven't seen amcheck fall over?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-11-20 Thread Justin Pryzby
I finally found time to digest and integrate your changes into my local
branch.  This fixes the three issues you reported: FORCE_RELEASE, the issue
with INVALID partitions (for which I adapted your patch into an earlier
patch in my series), and progress reporting.  Also rebased.

-- 
Justin
>From 4ba360eaaac5e1ac169d41c26cf6213b0c6a2432 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.
---
 doc/src/sgml/ddl.sgml  |   4 +-
 doc/src/sgml/ref/create_index.sgml |   9 --
 src/backend/commands/indexcmds.c   | 214 +++--
 src/include/catalog/index.h|   1 +
 src/test/regress/expected/indexing.out | 136 +++-
 src/test/regress/sql/indexing.sql  |  26 ++-
 6 files changed, 320 insertions(+), 70 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 03c01937094..fd56e21ef49 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4131,9 +4131,7 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
  so that they are applied automatically to the entire hierarchy.
  This is very
  convenient, as not only will the existing partitions become indexed, but
- also any partitions that are created in the future will.  One limitation is
- that it's not possible to use the CONCURRENTLY
- qualifier when creating such a partitioned index.  To avoid long lock
+ also any partitions that are created in the future will.  To avoid long lock
  times, it is possible to use CREATE INDEX ON ONLY
  the partitioned table; such an index is marked invalid, and the partitions
  do not get the index applied automatically.  The indexes on partitions can
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 40986aa502f..fc8cda655f0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -692,15 +692,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 91cee27743d..bb98e745267 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -71,6 +71,7 @@
 
 
 /* non-export function prototypes */
+static void reindex_invalid_child_indexes(Oid indexRelationId);
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
 static void ComputeIndexAttrs(IndexInfo *indexInfo,
@@ -104,7 +105,9 @@ static void reindex_error_callback(void *arg);
 static void ReindexPartitions(Oid relid, ReindexParams *params,
 			  bool isTopLevel);
 static void ReindexMultipleInternal(List *relids,
-	ReindexParams *params);
+	ReindexParams *params,
+	Oid parent,
+	int npart);
 static bool ReindexRelationConcurrently(Oid relationOid,
 		ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
@@ -697,17 +700,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-			RelationGetRelationName(rel))));
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1147,6 +1139,11 @@ DefineIndex(Oid relationId,
 		if (pd->nparts != 0)
 			flags |= INDEX_CREATE_INVALID;
 	}
+	else if (concurrent && OidIsValid(parentIndexId))
+	{
+		/* If concurrent, initially build index partitions as "invalid" */
+		flags |= INDEX_CREATE_INVALID;
+	}
 
 	if (stmt->deferrable)
 		constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -1212,17 +1209,30 @@ DefineIndex(Oid relationId,
 		partdesc = RelationGetPartitionDesc(rel, true);
 		if ((!stmt->relation || stmt->relation->inh) && partdesc->nparts > 0)
 		{
+			/*
+			 * Need to close the relation before recursing into children, so
+			 * copy needed data into a longlived context.
+			 */
+
+			MemoryContext	ind_context = 

Re: Unable to reset stats using pg_stat_reset_replication_slot

2022-11-20 Thread Kyotaro Horiguchi
This doesn't quite seem to fit here, but...

At Fri, 18 Nov 2022 18:38:56 +0530, Satya Thirumani wrote in
> I'm unable to reset stats. Please help me to fix this?
> 
> testdb => select * from pg_stat_reset_replication_slot(NULL);
> ERROR: permission denied for function pg_stat_reset_replication_slot

Yeah, that user doesn't seem to be allowed to do that. Only superusers
can do it by default.

https://www.postgresql.org/docs/devel/monitoring-stats.html
>   This function is restricted to superusers by default, but other
>   users can be granted EXECUTE to run the function.
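
So, for example, a superuser can grant this to a role (the role name here
is just an example):

GRANT EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text)
  TO monitoring_user;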

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: when the startup process doesn't (logging startup delays)

2022-11-20 Thread Kyotaro Horiguchi
At Fri, 18 Nov 2022 15:55:00 +0530, Bharath Rupireddy wrote in
> On Fri, Nov 18, 2022 at 12:42 AM Robert Haas  wrote:
> >
> > On Thu, Nov 17, 2022 at 2:22 AM Bharath Rupireddy wrote:
> > > Duplication is a problem that I agree with and I have an idea here -
> > > how about introducing a new function, say EnableStandbyMode() that
> > > sets StandbyMode to true and disables the startup progress timeout,
> > > something like the attached?
> >
> > That works for me, more or less. But I think that
> > enable_startup_progress_timeout should be amended to either say if
> > (log_startup_progress_interval == 0 || StandbyMode) return; or else it
> > should at least Assert(!StandbyMode), so that we can't accidentally
> > re-enable the timer after we shut it off.
> 
> Hm, an assertion may not help on typical production servers, which run
> non-assert builds. I've modified the if condition; please see the
> attached v5 patch.

I prefer Robert's approach, as it is more robust against future changes
and simpler. I'd rather avoid this kind of piggy-backing, and it doesn't
seem to be needed in this case. XLogShutdownWalRcv() looks like a
similar case to me, and honestly I don't like it from a robustness
standpoint, but it is simpler than checking walreceiver status at every
site that refers to the flag.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Understanding WAL - large amount of activity from removing data

2022-11-20 Thread David G. Johnston
On Sun, Nov 20, 2022 at 6:24 PM Isaac Morland wrote:

> What I'm finding is that the UPDATE is taking over an hour for 5000
> records, and tons of WAL is being generated, several files per minute.
> Selecting the non-PDF columns from the entire table takes a few
> milliseconds, and the only thing I'm doing with the records is updating
> them to much smaller values. Why so much activity just to remove data? The
> new rows are tiny.
>

Simplistic answer (partly because the second part of this isn't spelled out
explicitly in the docs that I could find): when you UPDATE, two things
happen. The old record is modified to indicate it has been deleted, and a
new record is inserted.  Both of these are written to the WAL, and a record
is always written to the WAL as a self-contained unit, so the old record is
full-sized in the newly written WAL.  TOAST apparently has an optimization
if you don't change the TOASTed value, but here you are changing it, so
that optimization doesn't apply.
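
If you want to measure it, the standard WAL-position functions make this
easy from psql (using \gset to stash the starting position):

SELECT pg_current_wal_lsn() AS before \gset
UPDATE table1 SET pdfcolumn = 'redacted' WHERE pdfcolumn IS NOT NULL;
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'before') AS wal_bytes;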

David J.


Understanding WAL - large amount of activity from removing data

2022-11-20 Thread Isaac Morland
I'm encountering some surprising (to me) behaviour related to WAL, and I'm
wondering if anybody can point me at an article that might help me
understand what is happening, or give a brief explanation.

I'm trying to make a slimmed down version of my database for testing
purposes. As part of this, I'm running a query something like this:

UPDATE table1
SET pdfcolumn = 'redacted'
WHERE pdfcolumn IS NOT NULL;

(literally 'redacted', not redacted here for your benefit)

The idea is to replace the actual contents of the column, which are PDF
documents totalling 70GB, with just a short placeholder value, without
affecting the other columns, which are a more ordinary collection - a few
integers and short strings.

The end result will be a database which is way easier to copy around but
which still has all the records of the original; the only change is that an
attempt to access one of the PDFs will not return the actual PDF but rather
a garbage value. For most testing this will make little to no difference.

What I'm finding is that the UPDATE is taking over an hour for 5000
records, and tons of WAL is being generated, several files per minute.
Selecting the non-PDF columns from the entire table takes a few
milliseconds, and the only thing I'm doing with the records is updating
them to much smaller values. Why so much activity just to remove data? The
new rows are tiny.


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-11-20 Thread Andres Freund
Hi,


One good follow-up patch would be to rip out the accounting for
pg_stat_bgwriter's buffers_backend, buffers_backend_fsync, and perhaps
buffers_alloc, and replace it with a subselect getting the equivalent data
from pg_stat_io.  It might not be quite worth doing for buffers_alloc
because of the way that's tied into bgwriter pacing.


On 2022-11-03 13:00:24 -0400, Melanie Plageman wrote:
> > + again. A high number of repossessions is a sign of contention for the
> > + blocks operated on by the strategy operation.
> >
> > This (and in general the repossession description) makes sense, but
> > I'm not sure what to do with the information. Maybe Andres is right
> > that we could skip this in the first version?
>
> I've removed repossessed and rejected in attached v37. I am a bit sad
> about this because I don't see a good way forward and I think those
> could be useful for users.

Let's get the basic patch in and then check whether we can find a way to have
something providing at least some more information like repossessed and
rejected. I think it'll be easier to analyze in isolation.


> I have added the new column Andres recommended in [1] ("io_object") to
> clarify temp and local buffers and pave the way for bypass IO (IO not
> done through a buffer pool), which can be done on temp or permanent
> files for temp or permanent relations, and spill file IO which is done
> on temporary files but isn't related to temporary tables.

> IOObject has increased the memory footprint and complexity of the code
> around tracking and accumulating the statistics, though it has not
> increased the number of rows in the view.

It doesn't look too bad from here. Is there a specific portion of the code
where it concerns you the most?


> One question I still have about this additional dimension is how much
> enumeration we need of the various combinations of IO operations, IO
> objects, IO ops, and backend types which are allowed and not allowed.
>
> Currently because it is only valid to operate on both IOOBJECT_RELATION
> and IOOBJECT_TEMP_RELATION in IOCONTEXT_BUFFER_POOL, the changes to the
> various functions asserting and validating what is "allowed" in terms of
> combinations of ops, objects, contexts, and backend types aren't much
> different than they were without IO Object. However, once we begin
> adding other objects and contexts, we will need to make this logic more
> comprehensive. I'm not sure whether or not I should do that
> preemptively.

I'd not do it preemptively.



> @@ -833,6 +836,22 @@ ReadBuffer_common(SMgrRelation smgr, char 
> relpersistence, ForkNumber forkNum,
>
>   isExtend = (blockNum == P_NEW);
>
> + if (isLocalBuf)
> + {
> + /*
> +  * Though a strategy object may be passed in, no strategy is 
> employed
> +  * when using local buffers. This could happen when doing, for example,
> +  * CREATE TEMPORARY TABLE AS ...
> +  */
> + io_context = IOCONTEXT_BUFFER_POOL;
> + io_object = IOOBJECT_TEMP_RELATION;
> + }
> + else
> + {
> + io_context = IOContextForStrategy(strategy);
> + io_object = IOOBJECT_RELATION;
> + }

I think given how frequently ReadBuffer_common() is called in some workloads,
it'd be good to make IOContextForStrategy inlinable. But I guess that's not
easily doable, because struct BufferAccessStrategyData is only defined in
freelist.c.

Could we defer this until later, given that we don't currently need this in
case of buffer hits afaict?


> @@ -1121,6 +1144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, 
> ForkNumber forkNum,
>   BufferAccessStrategy strategy,
>   bool *foundPtr)
>  {
> + boolfrom_ring;
> + IOContext   io_context;
>   BufferTag   newTag; /* identity of requested block 
> */
>   uint32  newHash;/* hash value for newTag */
>   LWLock *newPartitionLock;   /* buffer partition lock for it */
> @@ -1187,9 +1212,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, 
> ForkNumber forkNum,
>*/
>   LWLockRelease(newPartitionLock);
>
> + io_context = IOContextForStrategy(strategy);

Hm - doesn't this mean we do IOContextForStrategy() twice? Once in
ReadBuffer_common() and then again here?


>   /* Loop here in case we have to try another victim buffer */
>   for (;;)
>   {
> +
>   /*
>* Ensure, while the spinlock's not yet held, that there's a 
> free
>* refcount entry.
> @@ -1200,7 +1228,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, 
> ForkNumber forkNum,
>* Select a victim buffer.  The buffer is returned with its 
> header
>* spinlock still held!
>*/
> - buf = StrategyGetBuffer(strategy, &buf_state);
> + buf = StrategyGetBuffer(strategy, &buf_state, 

Re: Polyphase merge is obsolete

2022-11-20 Thread Heikki Linnakangas

On 19/11/2022 13:00, Peter Eisentraut wrote:

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?

I'm not 100% sure of the "preallocate" argument. If I understand 
correctly, you should pass true if you are writing multiple tapes at the 
same time, and false otherwise. HashAgg passed true, tuplesort passes 
false. However, it's not clear to me why we couldn't just always do the 
preallocation. It seems pretty harmless even if it's not helpful. Or do 
it when there are multiple writer tapes, and not otherwise. The 
parameter was added in commit 0758964963 so presumably there was a 
reason, but at a quick glance at the thread that led to that commit, I 
couldn't see what it was.


- Heikki
From 1c5d4e28a21f6052d22667e79b5bd443aec3a5b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 21 Nov 2022 01:55:05 +0200
Subject: [PATCH 1/1] Fix and clarify function comment on LogicalTapeSetCreate.

Commit c4649cce39 removed the "shared" and "ntapes" arguments, but the
comment still talked about "shared". It also talked about "a shared
file handle", which was technically correct because even before
commit c4649cce39, the "shared file handle" referred to the "fileset"
argument, not "shared". But it was very confusing. Improve the
comment.

Also add a comment on what the "preallocate" argument does.

Discussion: https://www.postgresql.org/message-id/af989685-91d5-aad4-8f60-1d066b5ec...@enterprisedb.com
---
 src/backend/utils/sort/logtape.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..b518a4b9c5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -544,14 +544,20 @@ ltsInitReadBuffer(LogicalTape *lt)
  * The tape set is initially empty. Use LogicalTapeCreate() to create
  * tapes in it.
  *
- * Serial callers pass NULL argument for shared, and -1 for worker.  Parallel
- * worker callers pass a shared file handle and their own worker number.
+ * In a single-process sort, pass NULL argument for fileset, and -1 for
+ * worker.
  *
- * Leader callers pass a shared file handle and -1 for worker. After creating
- * the tape set, use LogicalTapeImport() to import the worker tapes into it.
+ * In a parallel sort, parallel workers pass the shared fileset handle and
+ * their own worker number.  After the workers have finished, create the
+ * tape set in the leader, passing the shared fileset handle and -1 for
+ * worker, and use LogicalTapeImport() to import the worker tapes into it.
  *
  * Currently, the leader will only import worker tapes into the set, it does
  * not create tapes of its own, although in principle that should work.
+ *
+ * If preallocate is true, blocks for each individual tape are allocated in
+ * batches.  This avoids fragmentation when writing multiple tapes at the
+ * same time.
  */
 LogicalTapeSet *
 LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
-- 
2.30.2



Re: Split index and table statistics into different types of stats

2022-11-20 Thread Andres Freund
Hi,

On 2022-11-18 12:18:38 +0100, Drouvot, Bertrand wrote:
> On 11/16/22 9:12 PM, Andres Freund wrote:
> > This still leaves a fair bit of boilerplate. ISTM that the function body
> > really should just be a single line.
> > 
> > Might even be worth defining the whole function via a macro. Perhaps 
> > something like
> > 
> > PGSTAT_DEFINE_REL_FIELD_ACCESSOR(PGSTAT_KIND_INDEX, pg_stat_get_index, 
> > numscans);
> 
> Thanks for the feedback!
> 
> Right, what about something like the following?
> 
> "
> #define PGSTAT_FETCH_STAT_ENTRY(pgstat_entry_kind, 
> pgstat_fetch_stat_function, relid, stat_name) \
>   do { \
>   pgstat_entry_kind *entry = pgstat_fetch_stat_function(relid); \
>   PG_RETURN_INT64(entry == NULL ? 0 : (int64) 
> (entry->stat_name)); \
>   } while (0)
> 
> Datum
> pg_stat_get_index_numscans(PG_FUNCTION_ARGS)
> {
>   PGSTAT_FETCH_STAT_ENTRY(PgStat_StatIndEntry, 
> pgstat_fetch_stat_indentry, PG_GETARG_OID(0), numscans);
> }
> "

That's better, but still seems like quite a bit of repetition, given the
number of accessors. I think I like my idea of a macro defining the whole
function a bit better.

I'd define a "base" macro and then a version that's specific to tables and
indexes each, so that the pieces related to that don't have to be repeated as
often.
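
E.g. something like this for the index side (an untested sketch, reusing
the names from your snippet):

#define PGSTAT_DEFINE_INDEX_FIELD_ACCESSOR(fn, stat) \
Datum \
fn(PG_FUNCTION_ARGS) \
{ \
	PgStat_StatIndEntry *entry = pgstat_fetch_stat_indentry(PG_GETARG_OID(0)); \
\
	PG_RETURN_INT64(entry == NULL ? 0 : (int64) (entry->stat)); \
}

PGSTAT_DEFINE_INDEX_FIELD_ACCESSOR(pg_stat_get_index_numscans, numscans);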


> > This should probably be done in a preparatory commit.
> 
> Proposal submitted in [1].

Now merged.

Greetings,

Andres Freund




Re: Getting rid of SQLValueFunction

2022-11-20 Thread Ted Yu
On Sun, Nov 20, 2022 at 3:12 PM Michael Paquier  wrote:

> On Sun, Nov 20, 2022 at 08:21:10AM -0800, Ted Yu wrote:
> > For get_func_sql_syntax(), the code for cases
> > of F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP
> is
> > mostly the same.
> > Maybe we can introduce a helper so that code duplication is reduced.
>
> It would.  Thanks for the suggestion.
>
> Do you like something like the attached patch 0002?  This reduces the
> overall size of the patch a bit.  Both ought to be merged in the same
> commit; still, splitting them makes the simplification easier to see.
> --
> Michael
>
Hi,
Thanks for the quick response.

+ * timestamp.  These require a specific handling with their typmod is given
+ * by the function caller through their SQL keyword.

typo: typmod is given -> typmod given

Other than the above, code looks good to me.

Cheers


Re: Add LZ4 compression in pg_dump

2022-11-20 Thread Michael Paquier
On Sun, Nov 20, 2022 at 11:26:11AM -0600, Justin Pryzby wrote:
> I think this patch record should be closed for now.  You can re-open the
> existing patch record once a patch is ready to be reviewed.

Indeed.  As things stand, this is just a dead entry in the CF, which
would be confusing.  I have marked it as RwF.
--
Michael




Re: Getting rid of SQLValueFunction

2022-11-20 Thread Michael Paquier
On Sun, Nov 20, 2022 at 08:21:10AM -0800, Ted Yu wrote:
> For get_func_sql_syntax(), the code for cases
> of F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP is
> mostly the same.
> Maybe we can introduce a helper so that code duplication is reduced.

It would.  Thanks for the suggestion.

Do you like something like the attached patch 0002?  This reduces the
overall size of the patch a bit.  Both ought to be merged in the same
commit; still, splitting them makes the simplification easier to see.
--
Michael
From 3611cfa6f0171dcd65eeb88461f4c48989cd3f1a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sun, 20 Nov 2022 11:53:28 +0900
Subject: [PATCH v5 1/2] Replace SQLValueFunction by direct function calls

This impacts 9 patterns where the SQL grammar takes priority:
- localtime, that can take a typmod.
- localtimestamp, that can take a typmod.
- current_time, that can take a typmod.
- current_timestamp, that can take a typmod.
- current_date.

Five new entries are added to pg_proc to compensate the removal of
SQLValueFunction to keep compatibility (when a keyword is specified in a
SELECT or in a FROM clause without an alias), with tests to cover all
that.

XXX: bump catalog version.

Reviewed-by: Corey Huinker
Discussion: https://postgr.es/m/yzag3morycguu...@paquier.xyz
---
 src/include/catalog/pg_proc.dat   |  15 +++
 src/include/executor/execExpr.h   |   8 --
 src/include/nodes/primnodes.h |  33 ---
 src/include/utils/date.h  |   4 -
 src/include/utils/timestamp.h |   4 -
 src/backend/catalog/system_functions.sql  |  26 ++
 src/backend/executor/execExpr.c   |  11 ---
 src/backend/executor/execExprInterp.c |  46 -
 src/backend/jit/llvm/llvmjit_expr.c   |   6 --
 src/backend/jit/llvm/llvmjit_types.c  |   1 -
 src/backend/nodes/nodeFuncs.c |  27 +-
 src/backend/optimizer/path/costsize.c |   1 -
 src/backend/optimizer/util/clauses.c  |  39 ++--
 src/backend/parser/gram.y |  59 +++-
 src/backend/parser/parse_expr.c   |  52 ---
 src/backend/parser/parse_target.c |  25 -
 src/backend/utils/adt/date.c  |  81 +---
 src/backend/utils/adt/ruleutils.c | 109 +-
 src/backend/utils/adt/timestamp.c |  72 --
 src/backend/utils/misc/queryjumble.c  |   9 --
 src/test/regress/expected/expressions.out |   2 +-
 src/test/regress/sql/expressions.sql  |   2 +-
 src/tools/pgindent/typedefs.list  |   2 -
 23 files changed, 245 insertions(+), 389 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fd2559442e..35dc369728 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1520,6 +1520,21 @@
 { oid => '9977', descr => 'system user name',
   proname => 'system_user', provolatile => 's', prorettype => 'text',
   proargtypes => '', prosrc => 'system_user' },
+{ oid => '9978', descr => 'current date',
+  proname => 'current_date', provolatile => 's', prorettype => 'date',
+  proargtypes => '', prosrc => 'current_date' },
+{ oid => '9979', descr => 'current time',
+  proname => 'current_time', provolatile => 's', prorettype => 'timetz',
+  proargtypes => 'int4', prosrc => 'current_time', proisstrict => 'f' },
+{ oid => '9980', descr => 'current timestamp',
+  proname => 'current_timestamp', provolatile => 's', prorettype => 'timestamptz',
+  proargtypes => 'int4', prosrc => 'current_timestamp', proisstrict => 'f' },
+{ oid => '9981', descr => 'local time',
+  proname => 'localtime', provolatile => 's', prorettype => 'time',
+  proargtypes => 'int4', prosrc => 'sql_localtime', proisstrict => 'f' },
+{ oid => '9982', descr => 'local timestamp',
+  proname => 'localtimestamp', provolatile => 's', prorettype => 'timestamp',
+  proargtypes => 'int4', prosrc => 'sql_localtimestamp', proisstrict => 'f' },
 
 { oid => '744',
   proname => 'array_eq', prorettype => 'bool',
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index e14f15d435..0557302b92 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -170,7 +170,6 @@ typedef enum ExprEvalOp
 	EEOP_DISTINCT,
 	EEOP_NOT_DISTINCT,
 	EEOP_NULLIF,
-	EEOP_SQLVALUEFUNCTION,
 	EEOP_CURRENTOFEXPR,
 	EEOP_NEXTVALUEEXPR,
 	EEOP_ARRAYEXPR,
@@ -416,12 +415,6 @@ typedef struct ExprEvalStep
 			FunctionCallInfo fcinfo_data_in;
 		}			iocoerce;
 
-		/* for EEOP_SQLVALUEFUNCTION */
-		struct
-		{
-			SQLValueFunction *svf;
-		}			sqlvaluefunction;
-
 		/* for EEOP_NEXTVALUEEXPR */
 		struct
 		{
@@ -741,7 +734,6 @@ extern void ExecEvalParamExec(ExprState *state, ExprEvalStep *op,
 			  ExprContext *econtext);
 extern void ExecEvalParamExtern(ExprState *state, ExprEvalStep *op,
 ExprContext *econtext);
-extern void ExecEvalSQLValueFunction(ExprState *state, ExprEvalStep *op);
 extern 

Re: perform_spin_delay() vs wait events

2022-11-20 Thread Andres Freund
Hi,

On 2022-11-20 17:26:11 -0500, Robert Haas wrote:
> On Sun, Nov 20, 2022 at 3:43 PM Andres Freund  wrote:
> > I couldn't quite decide what wait_event_type to best group this under? In 
> > the
> > attached patch I put it under timeouts, which doesn't seem awful.
> 
> I think it would be best to make it its own category, like we do with
> buffer pins.

I was wondering about that too - but decided against it because it would only
show a single wait event, and it wouldn't really describe spinlocks as a whole,
just the "extreme" delays. If we wanted to report the spin waits at a finer
granularity, we'd presumably have to fit the wait events into the lwlock and
buffer categories, plus some new category where we name individual spinlocks.

But I guess a single spinlock wait event type with an ExponentialBackoff wait
event or such wouldn't be too bad.
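
For illustration, once the delay is reported through the standard wait-event
machinery, a contended spinlock becomes observable from another session. A
minimal query (added here; with the patch as posted the event appears under
the "Timeout" type as "SpinDelay"):

    -- Sample the backends currently sleeping inside perform_spin_delay().
    SELECT pid, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE wait_event = 'SpinDelay';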

Greetings,

Andres Freund




Re: Reducing power consumption on idle servers

2022-11-20 Thread Nathan Bossart
On Mon, Nov 21, 2022 at 10:31:15AM +1300, Thomas Munro wrote:
> On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs
>  wrote:
>> As a 3rd patch, I will work on making logical workers hibernate.
> 
> Duelling patch warning: Nathan mentioned[1] that he's hacking on a
> patch for that, along the lines of the recent walreceiver change IIUC.

I coded something up last week, but, like the walreceiver patch, it caused
check-world to take much longer [0], and I haven't looked into whether it
could be easily fixed.  I'm hoping to make some time for this again in the
near future.

[0] https://postgr.es/m/20221113222644.GA1269110%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: More efficient build farm animal wakeup?

2022-11-20 Thread Thomas Munro
On Sun, Nov 20, 2022 at 2:44 AM Andrew Dunstan  wrote:
> It might not suit your use case, but one of the things I do to reduce
> fetch load is to run a local mirror which runs
>
>git fetch -q --prune
>
> every 5 minutes. It also runs a git daemon, and several of my animals
> point at that.

Thanks.  I understand now that my configuration without a local mirror
is super inefficient (it spends the first ~25s of each minute running
git commands).  Still, even though that can be improved by me setting
up more stuff, I'd like something event-driven rather than short
polling-based for lower latency.

> If there's a better git API I'll be happy to try to use it.

Cool.  Seems like we just have to invent something first...

FWIW I'm also trying to chase the short polling out of cfbot.  It
regularly harasses the git servers at one end (could be fixed with
this approach), and wastes a percentage of our allotted CPU slots on
the other end by scheduling periodically (could be fixed with webhooks
from Cirrus).




Re: perform_spin_delay() vs wait events

2022-11-20 Thread Robert Haas
On Sun, Nov 20, 2022 at 3:43 PM Andres Freund  wrote:
> The lwlock wait queue scalability issue [1] was quite hard to find because of
> the exponential backoff and because we adjust spins_per_delay over time within
> a backend.
>
> I think the least we could do to make this easier would be to signal spin
> delays as wait events. We surely don't want to do so for non-contended spins
> because of the overhead, but once we get to the point of calling pg_usleep()
> that's not an issue.
>
> I don't think it's worth trying to hand down more detailed information about
> the specific spinlock we're waiting on, at least for now. We'd have to invent
> a whole lot of new wait events because most spinlocks don't have ones yet.
>
> I couldn't quite decide what wait_event_type to best group this under? In the
> attached patch I put it under timeouts, which doesn't seem awful.

I think it would be best to make it its own category, like we do with
buffer pins.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Question concerning backport of CVE-2022-2625

2022-11-20 Thread Roberto C . Sánchez
Hi Tom,

On Sun, Nov 20, 2022 at 11:43:41AM -0500, Tom Lane wrote:
> Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?=  writes:
> >  -- this makes a shell "point <<@@ polygon" operator too
> >  CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt,
> >LEFTARG = polygon, RIGHTARG = point,
> >COMMUTATOR = <<@@ );
> >  CREATE EXTENSION test_ext_cor;  -- fail
> >  ERROR:  operator <<@@(point,polygon) is not a member of extension 
> > "test_ext_cor"
> >  DETAIL:  An extension is not allowed to replace an object that it does not 
> > own.
> >  DROP OPERATOR <<@@ (point, polygon);
> >  CREATE EXTENSION test_ext_cor;  -- now it should work
> > +ERROR:  operator 16427 is not a member of extension "test_ext_cor"
> > +DETAIL:  An extension is not allowed to replace an object that it does not 
> > own.
> 
> That is ... odd.  Since 9.4 is long out of support I'm unenthused
> about investigating it myself.  (Why is it that people will move heaven
> and earth to fix "security" bugs in dead branches, but ignore even
> catastrophic data-loss bugs?)  But if you're stuck with pursuing
> this exercise, I think you'd better figure out exactly what's
> happening.  I agree that it smells like c94959d41 could be related,
> but I don't see just how that'd produce this symptom.  Before that
> commit, the DROP OPERATOR <<@@ would have left a dangling link
> behind in @@>> 's oprcom field, but there doesn't seem to be a
> reason why that'd affect the test_ext_cor extension: it will not
> be re-using the same operator OID, nor would it have any reason to
> touch @@>>, since there's no COMMUTATOR clause in the extension.
> 
I understand your reluctance to dive into a branch that is long dead from
your perspective.  That said, I am grateful for the insights you
provided here.

> It'd likely be a good idea to reproduce this with a gdb breakpoint
> set at errfinish, and see exactly what's leading up to the error.
> 
Thanks for this suggestion.  I will see if I am able to isolate the
precise cause of the failure with this.

Regards,

-Roberto

-- 
Roberto C. Sánchez




Re: More efficient build farm animal wakeup?

2022-11-20 Thread Thomas Munro
On Mon, Nov 21, 2022 at 10:31 AM Magnus Hagander  wrote:
> Um, branches of interest will only pick up when it gets a new *branch*, not a 
> new *commit*, so I think that would be a very different problem to solve. And 
> I don't think we have new branches *that* often...

Sure, could be done with an extra different request you make from time
to time or keeping the existing list.  No strong opinions on that, I
was just observing that it could also be combined, something like:

Client: I have 14@1234, 15@1234, HEAD@1234; what should I do now, boss?
Server: You should fetch 14 (it has a new commit) and 16 (it's a new
branch you didn't mention).

> I'd imagine something like a
> GET https://git.postgresql.org/buildfarm-branchtips
> X-branch-master: a4adc31f69
> X-branch-REL_14_STABLE: b33283cbd3
> X-longpoll: 120
>
> For that one it would check branch master and rel 14, and if either branchtip 
> doesn't match what was in the header, it'd return immediately with a textfile 
> that's basically
> master: <new tip>
>
> if master has changed and not REL_14.
>
> If nothing has changed, go into longpoll for 120 seconds based on the header, 
> and if nothing at all has changed in that time, return a 304.

LGTM, that's exactly the sort of thing I was imagining.

> We could also use something like a websocket to just stream the changes out 
> over.

True.  The reason I started on about long polling instead of
websockets is that I was imagining that the simpler, dumber protocol
where the client doesn't even really know it's participating in a new
kind of magic would be more cromulent in ye olde perl script (no new
cpan dependencies).

> In either case it would also need to change the buildfarm client to run as a 
> daemon rather than a cronjob I think? (obviously optional, we don't have to 
> remove the current abilities)

Given that the point of the build farm is (these days) to test on
weird computers and operating systems, I expect that proper 'run like
a service' support would be painful or not get done.  It'd be nice if
there were some way to make this work with simple crontab entries...




Re: Precedence of bitwise operators

2022-11-20 Thread Tom Lane
Bauyrzhan Sakhariyev  writes:
> Hi! Do I get it right, that bitwise operations have the same precedence?

Yes, that is what the documentation says:

https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-PRECEDENCE

Operator precedence is hard-wired into our parser, so we don't get
to have a lot of flexibility in assigning precedences for any except
a very small set of operator names.

regards, tom lane




Re: Reducing power consumption on idle servers

2022-11-20 Thread Thomas Munro
On Mon, Nov 21, 2022 at 9:00 AM Simon Riggs
 wrote:
> As a 3rd patch, I will work on making logical workers hibernate.

Duelling patch warning: Nathan mentioned[1] that he's hacking on a
patch for that, along the lines of the recent walreceiver change IIUC.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com




Re: More efficient build farm animal wakeup?

2022-11-20 Thread Magnus Hagander
On Sun, Nov 20, 2022 at 4:56 AM Thomas Munro  wrote:

> On Sun, Nov 20, 2022 at 1:35 AM Magnus Hagander 
> wrote:
> > tl;dr: it's not there now, but yes, if we can find a smart way for the bf
> > clients to consume it, it is something we could build and deploy fairly
> > easily.
>
> Cool -- it sounds a lot like you've thought about this already :-)
>
> About the client: currently run_branches.pl makes an HTTP request for
> the "branches of interest" list.  Seems like a candidate point for a
> long poll?  I don't think it'd have to be much smarter than it is
> today, it'd just have to POST the commits it already has, I think.
>

Um, branches of interest will only pick up when it gets a new *branch*, not
a new *commit*, so I think that would be a very different problem to solve.
And I don't think we have new branches *that* often...


> Perhaps as a first step, the server could immediately report which
> branches to bother fetching, considering the client's existing
> commits.  That'd almost always be none, but ~11.7 times per day a new
> commit shows up, and once a year there's a new interesting branch.
> That would avoid the need for the 6 git fetches that usually follow in
> the common case, which admittedly might not be a change worth making
> on its own.  After all, the git fetches are probably quite similar
> HTTP requests themselves, except that there 6 of them, one per branch,
> and they hit the public git server instead of some hypothetical
> buildfarm endpoint.
>

As Andres mentioned downthread, that's not a lot more lightweight than what
"git fetch" does.

The thing we'd want to avoid is having to do that so much and often. And
getting to that is going to require modification of the buildfarm client to
make it more "smart" regardless. In particular, making it do this "right"
in the face of multiple branches is probably going to be a big win.


> Then you could switch to long polling by letting the client say "if
> currently none, I'm prepared to wait up to X seconds for a different
> answer", assuming you know how to build the server side of that
> (insert magic here).  Of course, you can't make it too long or your
> session might be dropped in the badlands between client and server,
> but that's just a reason to make X configurable.  I think RFC6202 says
> that 120 seconds probably works fine across most kinds of links, which
> means that you lower the total poll rate hitting the server, but--more
> interestingly for me as a client--you minimise latency when something
> finally happens.  (With various keepalive tricks and/or heartbeat
> streaming tricks you could possibly make it much higher, who knows...
> but you'd have to set it very very low to do worse than what we're
> doing today in total request count).  Or maybe there is some existing
> easy perl library that could be used for this (joke answer: cpan
> install Twitter::API and follow @pg_commits).
>

I also honestly wonder how big a problem a timeout much longer than 120
seconds would be in practice. Since we own both the client and the server in
this case, we'd only be at the mercy of network equipment in between, and I
think we're much less exposed to weirdness there than "the average browser".
Thus, as long as it's configurable, I think we could go for something much
longer by default.

I'd imagine something like a
GET https://git.postgresql.org/buildfarm-branchtips
X-branch-master: a4adc31f69
X-branch-REL_14_STABLE: b33283cbd3
X-longpoll: 120

For that one it would check branch master and rel 14, and if either
branchtip doesn't match what was in the header, it'd return immediately
with a textfile that's basically
master: <new tip>

if master has changed and not REL_14.

If nothing has changed, go into longpoll for 120 seconds based on the
header, and if nothing at all has changed in that time, return a 304.


We could also use something like a websocket to just stream the changes out
over.

In either case it would also need to change the buildfarm client to run as
a daemon rather than a cronjob I think? (obviously optional, we don't have
to remove the current abilities)


> However, when I started this thread I was half expecting such a thing
> to exist already, somewhere, I just haven't been able to find it
> myself...  Don't other people have this problem?  Maybe everybody who
> has this problem uses webhooks (git server post commit hook opens
> connection to client) as you mentioned, but as you also mentioned
> that'd never fly for our topology.
>

Yeah, webhook seems to be what most people use.

FWIW, an implementation for us would be a small daemon that receives such
webhooks from our git server and redistributes it for the long polling.
That's still the easiest way to get the data out of git itself...

//Magnus


Precedence of bitwise operators

2022-11-20 Thread Bauyrzhan Sakhariyev
Hi! Do I get it right that bitwise operators have the same precedence?

Query *SELECT 1 & 2 | 3, 3 | 1 & 2*
returns 3 and 2 respectively. See also
https://www.db-fiddle.com/f/iZHd8zG7A1HjbB6J2y8R7k/1. It looks like the
result is calculated from left to right and operators have
the same precedence.

I checked relevant documentation pages (
https://www.postgresql.org/docs/current/functions-bitstring.html and
https://www.postgresql.org/docs/current/sql-syntax-lexical.html) and
couldn't find any information about bitwise operations precedence, only
information about logical operations precedence.

I'm not saying it's a bug, rather trying to clarify, as the precedence of
bitwise operators differs between programming languages, say c++ (
https://en.cppreference.com/w/c/language/operator_precedence) or java (
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/operators.html)
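
A quick way to see this at the prompt (example added for illustration; the
parenthesized form is not in the original queries):

    -- These operators share one precedence level and associate left-to-right,
    -- so parentheses are required to force any other grouping.
    SELECT 1 & 2 | 3   AS implicit,       -- parsed as (1 & 2) | 3 = 3
           1 & (2 | 3) AS parenthesized;  -- 1 & 3 = 1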


Re: Reducing power consumption on idle servers

2022-11-20 Thread Thomas Munro
On Mon, Nov 21, 2022 at 3:35 AM Simon Riggs
 wrote:
> On Sat, 19 Nov 2022 at 10:59, Simon Riggs  
> wrote:
> > New version attached.
>
> Fix for doc xref

I removed a stray variable declaration from xlogrecovery.h, and wrote
a draft commit message.  I'll wait 24 hours before committing, to
provide a last chance for anyone who wants to complain about dropping
promote_trigger_file.

(We could of course change it so that the timeout based wakeup only
happens if you have actually set promote_trigger_file, but I think
we've established that we just don't want the filesystem polling
feature so I'm whispering this in parentheses.)
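
For anyone still relying on the trigger file, the two remaining mechanisms,
per the docs changes below, are pg_ctl promote and pg_promote().  A minimal
illustration (not part of the patch):

    -- From SQL on the standby; by default this waits for promotion to finish.
    SELECT pg_promote();
    -- Or, equivalently, from the shell:
    --   pg_ctl promote -D $PGDATA
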
From 6a6db8862ceb6618fb5f98c6245dc7d834fe89db Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 21 Nov 2022 09:37:47 +1300
Subject: [PATCH v11] Remove promote_trigger_file.

Previously an otherwise idle startup (recovery) process would wake up
every 5 seconds to have a chance to poll for promote_trigger_file, even
if that GUC was not configured.  That signaling mechanism was superseded
by pg_ctl promote and pg_promote() a long time ago.  There probably
aren't many users left and it's very easy to change to the modern
mechanisms, so let's remove that feature.

This is part of a campaign to reduce wakeups on idle systems.

Author: Simon Riggs 
Reviewed-by: Bharath Rupireddy 
Reviewed-by: Robert Haas 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/CANbhV-FsjnzVOQGBpQ589%3DnWuL1Ex0Ykn74Nh1hEjp2usZSR5g%40mail.gmail.com
---
 .../appendix-obsolete-recovery-config.sgml|  9 ++
 doc/src/sgml/config.sgml  | 18 ---
 doc/src/sgml/high-availability.sgml   | 24 ++
 src/backend/access/transam/xlogrecovery.c | 32 ---
 src/backend/utils/misc/guc_tables.c   | 10 --
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/access/xlogrecovery.h |  1 -
 7 files changed, 18 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/appendix-obsolete-recovery-config.sgml b/doc/src/sgml/appendix-obsolete-recovery-config.sgml
index 1cf4913114..8ca519b5f1 100644
--- a/doc/src/sgml/appendix-obsolete-recovery-config.sgml
+++ b/doc/src/sgml/appendix-obsolete-recovery-config.sgml
@@ -35,13 +35,8 @@
 

 The
-trigger_file
-
- trigger_file
- promote_trigger_file
-
-setting has been renamed to
-.
+trigger_file and promote_trigger_file
+have been removed.

 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e48..211dfc27ac 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4610,24 +4610,6 @@ ANY num_sync ( 
-promote_trigger_file (string)
-
-  promote_trigger_file configuration parameter
-
-
-
- 
-  Specifies a trigger file whose presence ends recovery in the
-  standby.  Even if this value is not set, you can still promote
-  the standby using pg_ctl promote or calling
-  pg_promote().
-  This parameter can only be set in the postgresql.conf
-  file or on the server command line.
- 
-
-   
-
  
   hot_standby (boolean)
   
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b2b3129397..cac8c71a44 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -653,11 +653,11 @@ protocol to make nodes agree on a serializable transactional order.
 

 Standby mode is exited and the server switches to normal operation
-when pg_ctl promote is run,
-pg_promote() is called, or a trigger file is found
-(promote_trigger_file). Before failover,
-any WAL immediately available in the archive or in pg_wal will be
-restored, but no attempt is made to connect to the primary.
+when pg_ctl promote is run, or
+pg_promote() is called.  The parameter
+promote_trigger_file has been removed. Before failover,
+any WAL immediately available in the archive or in pg_wal
+will be restored, but no attempt is made to connect to the primary.

   
 
@@ -1483,15 +1483,11 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
 

 To trigger failover of a log-shipping standby server, run
-pg_ctl promote, call pg_promote(),
-or create a trigger file with the file name and path specified by the
-promote_trigger_file. If you're planning to use
-pg_ctl promote or to call
-pg_promote() to fail over,
-promote_trigger_file is not required. If you're
-setting up the reporting servers that are only used to offload read-only
-queries from the primary, not for high availability purposes, you don't
-need to promote it.
+pg_ctl promote or call pg_promote().
+The parameter promote_trigger_file has been removed.
+If you're setting up the reporting servers that are only used to offload
+read-only queries from the primary, not for high 

Re: Unstable regression test for contrib/pageinspect

2022-11-20 Thread Peter Geoghegan
On Sun, Nov 20, 2022 at 12:37 PM Tom Lane  wrote:
> The core reloptions.sql and vacuum.sql tests are two places that are
> also using this option, but they are applying it to temp tables,
> which I think makes it safe (and the lack of failures, seeing that
> they run within parallel test groups, reinforces that).  Can we apply
> that idea in pageinspect?

I believe so. The temp table horizons guarantee isn't all that old, so
the tests may well have been written before it was possible.

> contrib/amcheck and contrib/pg_visibility are also using
> DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
> I haven't seen them fall over, though.

DISABLE_PAGE_SKIPPING forces aggressive mode (which is also possible
with FREEZE), but unlike FREEZE it also forces VACUUM to scan even
all-frozen pages. The other difference is that DISABLE_PAGE_SKIPPING
doesn't affect FreezeLimit/freeze_min_age, whereas FREEZE sets it to
0.

I think that most use of DISABLE_PAGE_SKIPPING by the regression tests
just isn't necessary. Especially where it's combined with FREEZE like
this, as it often seems to be. Why should the behavior around skipping
all-frozen pages (the only thing changed by using
DISABLE_PAGE_SKIPPING on top of FREEZE) actually matter to these
tests?

-- 
Peter Geoghegan




perform_spin_delay() vs wait events

2022-11-20 Thread Andres Freund
Hi,

The lwlock wait queue scalability issue [1] was quite hard to find because of
the exponential backoff and because we adjust spins_per_delay over time within
a backend.

I think the least we could do to make this easier would be to signal spin
delays as wait events. We surely don't want to do so for non-contended spins
because of the overhead, but once we get to the point of calling pg_usleep()
that's not an issue.

I don't think it's worth trying to hand down more detailed information about
the specific spinlock we're waiting on, at least for now. We'd have to invent
a whole lot of new wait events because most spinlocks don't have ones yet.

I couldn't quite decide what wait_event_type to best group this under? In the
attached patch I put it under timeouts, which doesn't seem awful.

I reverted a4adc31f690 and indeed it shows SpinDelay wait events before the
fix and not after.

Greetings,

Andres Freund

[1] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de
From 6d7c62e5ebbe9bf906e5f4e1682ece6fff69cd37 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 20 Nov 2022 12:39:14 -0800
Subject: [PATCH] Add WAIT_EVENT_SPIN_DELAY

---
 src/include/utils/wait_event.h  | 3 ++-
 src/backend/storage/lmgr/s_lock.c   | 8 
 src/backend/utils/activity/wait_event.c | 3 +++
 doc/src/sgml/monitoring.sgml| 4 
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 6f2d5612e06..3d87d550119 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -146,7 +146,8 @@ typedef enum
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
 	WAIT_EVENT_REGISTER_SYNC_REQUEST,
 	WAIT_EVENT_VACUUM_DELAY,
-	WAIT_EVENT_VACUUM_TRUNCATE
+	WAIT_EVENT_VACUUM_TRUNCATE,
+	WAIT_EVENT_SPIN_DELAY
 } WaitEventTimeout;
 
 /* --
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 4e473ec27ec..b04930db760 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -53,6 +53,7 @@
 #include "common/pg_prng.h"
 #include "port/atomics.h"
 #include "storage/s_lock.h"
+#include "utils/wait_event.h"
 
 #define MIN_SPINS_PER_DELAY 10
 #define MAX_SPINS_PER_DELAY 1000
@@ -136,7 +137,14 @@ perform_spin_delay(SpinDelayStatus *status)
 		if (status->cur_delay == 0) /* first time to delay? */
 			status->cur_delay = MIN_DELAY_USEC;
 
+		/*
+		 * Once we start sleeping, the overhead of reporting a wait event is
+		 * justified. Actively spinning easily stands out in profilers, but
+		 * sleeping with an exponential backoff is harder to spot...
+		 */
+		pgstat_report_wait_start(WAIT_EVENT_SPIN_DELAY);
 		pg_usleep(status->cur_delay);
+		pgstat_report_wait_end();
 
 #if defined(S_LOCK_TEST)
 		fprintf(stdout, "*");
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 92f24a6c9bc..02e953420e5 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -503,6 +503,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_VACUUM_TRUNCATE:
 			event_name = "VacuumTruncate";
 			break;
+		case WAIT_EVENT_SPIN_DELAY:
+			event_name = "SpinDelay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d5147..0599eac71a6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2298,6 +2298,10 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   Waiting to acquire an exclusive lock to truncate off any
empty pages at the end of a table vacuumed.
  
+ 
+  SpinDelay
+  Waiting to acquire a contended spinlock.
+ 
 

   
-- 
2.38.0



Unstable regression test for contrib/pageinspect

2022-11-20 Thread Tom Lane
My very slow buildfarm animal mamba has failed pageinspect
several times [1][2][3][4] with this symptom:

diff -U3 /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out	2022-11-20 10:12:51.780935488 -0500
+++ /home/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out	2022-11-20 14:00:25.818743985 -0500
@@ -92,9 +92,9 @@
 SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
  LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
- t_infomask | t_infomask2 |                         raw_flags                          |   combined_flags   
-------------+-------------+------------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+ t_infomask | t_infomask2 |                raw_flags                 | combined_flags 
+------------+-------------+------------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID}  | {}
 (1 row)
 
 -- tests for decoding of combined flags

It's not hard to guess what the problem is here: the immediately preceding
bit is hopelessly optimistic.

-- If we freeze the only tuple on test1, the infomask should
-- always be the same in all test runs.
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;

The fact that you asked for a freeze doesn't mean you'll get a freeze.
I suppose that a background auto-analyze is holding back global xmin
so that the tuple doesn't actually get frozen.

The core reloptions.sql and vacuum.sql tests are two places that are
also using this option, but they are applying it to temp tables,
which I think makes it safe (and the lack of failures, seeing that
they run within parallel test groups, reinforces that).  Can we apply
that idea in pageinspect?
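
A sketch of what that might look like, assuming the test keeps its current
shape (the column list here is illustrative, not the actual test schema):

    -- Temp tables aren't visible to other sessions, so VACUUM can freeze
    -- their tuples regardless of the global xmin held back by other backends.
    CREATE TEMP TABLE test1 (a int8, b text);
    INSERT INTO test1 VALUES (72057594037927937, 'x');
    VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;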

contrib/amcheck and contrib/pg_visibility are also using
DISABLE_PAGE_SKIPPING, so I wonder if they have similar hazards.
I haven't seen them fall over, though.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-20%2015%3A13%3A19
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-10-31%2013%3A33%3A35
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-10-19%2016%3A34%3A07
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-08-29%2017%3A49%3A02




Re: Reducing power consumption on idle servers

2022-11-20 Thread Simon Riggs
On Thu, 24 Mar 2022 at 16:21, Robert Haas  wrote:
>
> On Thu, Mar 24, 2022 at 12:02 PM Simon Riggs

> > What changes will be acceptable for bgwriter, walwriter and logical worker?
>
> Hmm, I think it would be fine to introduce some kind of hibernation
> mechanism for logical workers. bgwriter and wal writer already have a
> hibernation mechanism, so I'm not sure what your concern is there
> exactly. In your initial email you said you weren't proposing changes
> there, but maybe that changed somewhere in the course of the
> subsequent discussion. If you could catch me up on your present
> thinking that would be helpful.

Now that we seem to have solved the problem for the startup process, let's
circle back to the others.

Bgwriter does hibernate currently, but only at 50x the bgwriter_delay.
At default values that is 5s, but could be much less. So we need to
move that up to 60s, same as others.

WALwriter also hibernates currently, but only at 25x the
wal_writer_delay. At default values that is 2.5s, but could be much
less. So we need to move that up to 60s, same as others. At the same
time, make sure that when we hibernate we use a new WaitEvent,
similarly to the way bgwriter reports its hibernation state (which
also helps test the patch).


As a 3rd patch, I will work on making logical workers hibernate.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


hibernate_bgwriter_walwriter.v5.patch
Description: Binary data


Re: HOT chain validation in verify_heapam()

2022-11-20 Thread Peter Geoghegan
On Tue, Nov 15, 2022 at 10:55 PM Andres Freund  wrote:
> I'm quite certain that it's possible to end up freezing an earlier row
> version in a hot chain in < 14; I got there with careful gdb
> orchestration. Of course it's possible I screwed something up, given I did
> it once, interactively. Not sure if trying to fix it is worth the risk of
> backpatching all the necessary changes to switch to the retry approach.

There is code in heap_prepare_freeze_tuple() that treats a raw xmax as
"xmax_already_frozen = true", even when the raw xmax value isn't
already set to InvalidTransactionId. I'm referring to this code:

if ( ... ) // process raw xmax

else if (TransactionIdIsNormal(xid))

else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
 !TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
{
freeze_xmax = false;
xmax_already_frozen = true;
/* No need for relfrozenxid_out handling for already-frozen xmax */
}
else
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
 errmsg_internal("found xmax %u (infomask 0x%04x) not
frozen, not multi, not normal",
 xid, tuple->t_infomask)));

Why should it be okay to not process xmax during this call (by setting
"xmax_already_frozen = true"), just because HEAP_XMAX_INVALID happens
to be set? Isn't HEAP_XMAX_INVALID purely a hint? (HEAP_XMIN_FROZEN is
*not* a hint, but we're dealing with xmax here.)
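
(For anyone wanting to poke at this from SQL, the xmax hint bits can be
inspected with pageinspect; an illustrative query, not from the discussion:

    SELECT lp, t_xmax, raw_flags
    FROM heap_page_items(get_raw_page('some_table', 0)),
         LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);

HEAP_XMAX_INVALID shows up in raw_flags when the hint bit is set.)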

I'm not sure how relevant this is to the concerns you have about
frozen xmax, or even if it's any kind of problem, but it still seems
worth fixing. It seems to me that there should be clear rules on what
special transaction IDs can appear in xmax. Namely: the only special
transaction ID that can ever appear in xmax is InvalidTransactionId.
(Also, it's not okay to see *any* other XID in the
"xmax_already_frozen = true" path, nor would it be okay to leave any
other XID behind in xmax in the nearby "freeze_xmax = true" path.)

-- 
Peter Geoghegan




Re: heavily contended lwlocks with long wait queues scale badly

2022-11-20 Thread Andres Freund
Hi,

On 2022-11-09 17:03:13 -0800, Andres Freund wrote:
> On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
> > I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's 
> > OK
> > to push it to HEAD if I get it done in the next few hours. Bigger issues,
> > which I do not expect, should show up before tomorrow afternoon. Smaller
> > things could wait till Sunday if necessary.
> 
> I didn't get to it in time, so I'll leave it for when I'm back.

Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.

I pushed it to HEAD now.

I still think it might be worth backpatching in a bit, but so far the votes
on that weren't clear enough for me to feel comfortable doing so.

Regards,

Andres




Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-20 Thread Andres Freund
On 2022-11-20 10:10:38 -0500, Andrew Dunstan wrote:
> OK, pushed with a little more tweaking.

Thanks!


> I didn't upcase top_builddir
> because the existing prove_installcheck recipes already export it and I
> wanted to stay consistent with those.

Makes sense.


> If it works ok I will backpatch in a couple of days.

+1




Re: Avoid double lookup in pgstat_fetch_stat_tabentry()

2022-11-20 Thread Andres Freund
Hi,

On 2022-11-19 09:38:26 +0100, Drouvot, Bertrand wrote:
> I'd vote for V3 for readability, size and "backward compatibility" with 
> current code.

Pushed that. Thanks for the patch and evaluation.

Greetings,

Andres Freund




Re: Allow placeholders in ALTER ROLE w/o superuser

2022-11-20 Thread Alexander Korotkov
On Sun, Nov 20, 2022 at 8:48 PM Alexander Korotkov
 wrote:
> I've drafted a patch implementing ALTER ROLE ... SET ... TO ... USER SET 
> syntax.
>
> These options are working only for USERSET GUC variables, but require
> less privileges to set.  I think there is no problem to implement
>
> Also it seems that this approach doesn't conflict with future
> privileges for USERSET GUCs [1].  I expect that USERSET GUCs should be
> available unless explicitly REVOKEd.  That mean we should be able to
> check those privileges during ALTER ROLE.
>
> Opinions on the patch draft?
>
> Links
> 1. 
> https://mail.google.com/mail/u/0/?ik=a20b091faa=om=msg-f%3A1749871710745577015

Uh, sorry for the wrong link.  I meant
https://www.postgresql.org/message-id/2271988.1668807...@sss.pgh.pa.us

--
Regards,
Alexander Korotkov




Re: Allow placeholders in ALTER ROLE w/o superuser

2022-11-20 Thread Alexander Korotkov
On Sat, Nov 19, 2022 at 4:02 AM Alexander Korotkov  wrote:
> On Sat, Nov 19, 2022 at 12:41 AM Tom Lane  wrote:
> > ... BTW, re-reading the commit message for a0ffa885e:
> >
> > One caveat is that PGC_USERSET GUCs are unaffected by the SET privilege
> > --- one could wish that those were handled by a revocable grant to
> > PUBLIC, but they are not, because we couldn't make it robust enough
> > for GUCs defined by extensions.
> >
> > it suddenly struck me to wonder if the later 13d838815 changed the
> > situation enough to allow revisiting that problem, and/or if storing
> > the source role's OID in pg_db_role_setting would help.
> >
> > I don't immediately recall all the problems that led us to leave USERSET
> > GUCs out of the feature, so maybe this is nuts; but maybe it isn't.
> > It'd be worth considering if we're trying to improve matters here.
>
> I think if we implement the user-visible USERSET flag for ALTER ROLE,
> then we might just check permissions for such parameters from the
> target role.

I've drafted a patch implementing ALTER ROLE ... SET ... TO ... USER SET syntax.

These options work only for USERSET GUC variables, but require
fewer privileges to set.  I think there is no problem to implement

Also it seems that this approach doesn't conflict with future
privileges for USERSET GUCs [1].  I expect that USERSET GUCs should be
available unless explicitly REVOKEd.  That means we should be able to
check those privileges during ALTER ROLE.
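
To make the proposed syntax concrete (example names and values are mine,
not from the patch):

    -- With the draft applied, the USER SET flavor requires fewer privileges,
    -- since the value can only take effect with USERSET semantics:
    ALTER ROLE test_role SET work_mem TO '64MB' USER SET;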

Opinions on the patch draft?

Links
1. 
https://mail.google.com/mail/u/0/?ik=a20b091faa=om=msg-f%3A1749871710745577015

--
Regards,
Alexander Korotkov


0001-ALTER-ROLE-.-SET-.-TO-.-USER-SET-v1.patch
Description: Binary data


Re: Add LZ4 compression in pg_dump

2022-11-20 Thread Justin Pryzby
On Fri, Aug 05, 2022 at 02:23:45PM +, Georgios Kokolatos wrote:
> Thank you for your work during commitfest.
> 
> The patch is still in development. Given vacation status, expect the next 
> patches to be ready for the November commitfest.
> For now it has moved to the September one. Further action will be taken then 
> as needed.

On Sun, Nov 06, 2022 at 02:53:12PM +, gkokola...@pm.me wrote:
> On Wed, Nov 2, 2022 at 14:28, Justin Pryzby  wrote:
> > Checking if you'll be able to submit new patches soon ?
> 
> Thank you for checking up. Expect new versions within this commitfest cycle.

Hi,

I think this patch record should be closed for now.  You can re-open the
existing patch record once a patch is ready to be reviewed.

The commitfest is a time for committing/reviewing patches that were
previously submitted, but there's been no new patch since July.  Making a
patch available for review at the start of the commitfest seems like a
requirement for current patch records (same as for new patch records).

I wrote essentially the same patch as your early patches 2 years ago
(before postgres was ready to consider new compression algorithms), so
I'm happy to review a new patch when it's available, regardless of its
status in the cfapp.

BTW, some of my own review comments from March weren't addressed.
Please check.  Also, in February, I asked if you knew how to use
cirrusci to run checks, but the patches still had
compilation errors and warnings on various OSes.

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/39/3571

-- 
Justin




Re: predefined role(s) for VACUUM and ANALYZE

2022-11-20 Thread Nathan Bossart
On Sat, Nov 19, 2022 at 10:50:04AM -0800, Nathan Bossart wrote:
> another rebase

Another rebase for cfbot.
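
For context, the new pg_upgrade check added below essentially scans user
tables for columns of type aclitem; an illustrative standalone query (added
here, not taken from the patch):

    -- Find user-table columns whose type is aclitem.
    SELECT c.relnamespace::regnamespace AS schemaname, c.relname, a.attname
    FROM pg_catalog.pg_attribute a
    JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
    WHERE a.atttypid = 'pg_catalog.aclitem'::regtype
      AND c.relkind IN ('r', 'm')
      AND a.attnum > 0
      AND NOT a.attisdropped;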

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 5f716f33e93187491686381b2180894ab2b1b92c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Sep 2022 22:25:29 -0700
Subject: [PATCH v12 1/4] Change AclMode from a uint32 to a uint64.

---
 src/backend/nodes/outfuncs.c|  2 +-
 src/bin/pg_upgrade/check.c  | 35 +
 src/include/catalog/pg_type.dat |  4 ++--
 src/include/nodes/parsenodes.h  |  6 +++---
 src/include/utils/acl.h | 28 +-
 5 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f05e72f0dc..8f150e9a2e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -560,7 +560,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 	WRITE_BOOL_FIELD(lateral);
 	WRITE_BOOL_FIELD(inh);
 	WRITE_BOOL_FIELD(inFromCl);
-	WRITE_UINT_FIELD(requiredPerms);
+	WRITE_UINT64_FIELD(requiredPerms);
 	WRITE_OID_FIELD(checkAsUser);
 	WRITE_BITMAPSET_FIELD(selectedCols);
 	WRITE_BITMAPSET_FIELD(insertedCols);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index f1bc1e6886..615a53a864 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
+	/*
+	 * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk
+	 * format for existing data.
+	 */
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_for_aclitem_data_type_usage(&old_cluster);
+
 	/*
 	 * PG 14 changed the function signature of encoding conversion functions.
 	 * Conversions from older versions cannot be upgraded automatically
@@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_for_aclitem_data_type_usage
+ *
+ *	aclitem changed its storage format in 16, so check for it.
+ */
+static void
+check_for_aclitem_data_type_usage(ClusterInfo *cluster)
+{
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for incompatible aclitem data type in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
+
+	if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path))
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n"
+ "The internal format of \"aclitem\" changed in PostgreSQL version 16\n"
+ "so this cluster cannot currently be upgraded.  You can drop the\n"
+ "problem columns and restart the upgrade.  A list of the problem\n"
+ "columns is in the file:\n"
+ "%s", output_path);
+	}
+	else
+		check_ok();
+}
 
 /*
  * check_for_jsonb_9_4_usage()
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index df45879463..0763dfde39 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/include/catalog/pg_type.dat
@@ -267,9 +267,9 @@
 # OIDS 1000 - 1099
 
 { oid => '1033', array_type_oid => '1034', descr => 'access control list',
-  typname => 'aclitem', typlen => '12', typbyval => 'f', typcategory => 'U',
+  typname => 'aclitem', typlen => '16', typbyval => 'f', typcategory => 'U',
   typinput => 'aclitemin', typoutput => 'aclitemout', typreceive => '-',
-  typsend => '-', typalign => 'i' },
+  typsend => '-', typalign => 'd' },
 { oid => '1042', array_type_oid => '1014',
   descr => 'char(length), blank-padded string, fixed storage length',
   typname => 'bpchar', typlen => '-1', typbyval => 'f', typcategory => 'S',
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7caff62af7..f4ed9bbff9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -73,12 +73,12 @@ typedef enum SetQuantifier
 
 /*
  * Grantable rights are encoded so that we can OR them together in a bitmask.
- * The present representation of AclItem limits us to 16 distinct rights,
- * even though AclMode is defined as uint32.  See utils/acl.h.
+ * The present representation of AclItem limits us to 32 distinct rights,
+ * even though AclMode is defined as uint64.  See utils/acl.h.
  *
  * Caution: changing these codes breaks stored ACLs, hence forces initdb.
  

Re: Question concerning backport of CVE-2022-2625

2022-11-20 Thread Tom Lane
Roberto =?iso-8859-1?Q?C=2E_S=E1nchez?=  writes:
>  -- this makes a shell "point <<@@ polygon" operator too
>  CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt,
>LEFTARG = polygon, RIGHTARG = point,
>COMMUTATOR = <<@@ );
>  CREATE EXTENSION test_ext_cor;  -- fail
>  ERROR:  operator <<@@(point,polygon) is not a member of extension 
> "test_ext_cor"
>  DETAIL:  An extension is not allowed to replace an object that it does not 
> own.
>  DROP OPERATOR <<@@ (point, polygon);
>  CREATE EXTENSION test_ext_cor;  -- now it should work
> +ERROR:  operator 16427 is not a member of extension "test_ext_cor"
> +DETAIL:  An extension is not allowed to replace an object that it does not 
> own.

That is ... odd.  Since 9.4 is long out of support I'm unenthused
about investigating it myself.  (Why is it that people will move heaven
and earth to fix "security" bugs in dead branches, but ignore even
catastrophic data-loss bugs?)  But if you're stuck with pursuing
this exercise, I think you'd better figure out exactly what's
happening.  I agree that it smells like c94959d41 could be related,
but I don't see just how that'd produce this symptom.  Before that
commit, the DROP OPERATOR <<@@ would have left a dangling link
behind in @@>> 's oprcom field, but there doesn't seem to be a
reason why that'd affect the test_ext_cor extension: it will not
be re-using the same operator OID, nor would it have any reason to
touch @@>>, since there's no COMMUTATOR clause in the extension.

It'd likely be a good idea to reproduce this with a gdb breakpoint
set at errfinish, and see exactly what's leading up to the error.

regards, tom lane




Re: Slow standby snapshot

2022-11-20 Thread Simon Riggs
On Sun, 20 Nov 2022 at 13:45, Michail Nikolaev
 wrote:

> If such approach looks committable - I could do more careful
> performance testing to find the best value for
> WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.

Nice patch.

We seem to have replaced one magic constant with another, so not sure
if this is autotuning, but I like it much better than what we had
before (i.e. better than my prev patch).

Few thoughts

1. I was surprised that you removed the limits on size and just had
the wasted work limit. If there is no read traffic that will mean we
hardly ever compress, which means the removal of xids at commit will
get slower over time.  I would prefer that we forced compression on a
regular basis, such as every time we process an XLOG_RUNNING_XACTS
message (every 15s), as well as when we hit certain size limits.

2. If there is lots of read traffic but no changes flowing, it would
also make sense to force compression when the startup process goes
idle rather than wait for the work to be wasted first.

Quick patch to add those two compression events also.

That should favour the smaller wasted work limits.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


events_that_force_compression.v1.patch
Description: Binary data


Re: Getting rid of SQLValueFunction

2022-11-20 Thread Ted Yu
On Sat, Nov 19, 2022 at 7:01 PM Michael Paquier  wrote:

> On Fri, Nov 18, 2022 at 10:23:58AM +0900, Michael Paquier wrote:
> > Please note that in order to avoid tweaks when choosing the attribute
> > name of function call, this needs a total of 8 new catalog functions
> > mapping to the SQL keywords, which is what the test added by 2e0d80c
> > is about:
> > - current_role
> > - user
> > - current_catalog
> > - current_date
> > - current_time
> > - current_timestamp
> > - localtime
> > - localtimestamp
> >
> > Any objections?
>
> Hearing nothing, I have gone through 0001 again and applied it as
> fb32748 to remove the dependency between names and SQLValueFunction.
> Attached is 0002, to bring back the CI to a green state.
> --
> Michael
>

Hi,
For get_func_sql_syntax(), the code for cases
of F_CURRENT_TIME, F_CURRENT_TIMESTAMP, F_LOCALTIME and F_LOCALTIMESTAMP is
mostly the same.
Maybe we can introduce a helper so that code duplication is reduced.

Cheers


Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-20 Thread Andrew Dunstan


On 2022-11-19 Sa 15:16, Andres Freund wrote:
> Hi,
>
> On 2022-11-19 10:56:33 -0500, Andrew Dunstan wrote:
>>> Perhaps we should just export a directory in configure instead of this
>>> guessing game?
>> I think the obvious candidate would be to export top_builddir from
>> src/Makefile.global. That would remove the need to infer it from
>> TESTDATADIR.
> I think that'd be good. I'd perhaps rename it in the process so it's
> exported uppercase, but whatever...
>

OK, pushed with a little more tweaking. I didn't upcase top_builddir
because the existing prove_installcheck recipes already export it and I
wanted to stay consistent with those.

If it works ok I will backpatch in a couple of days.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Reducing power consumption on idle servers

2022-11-20 Thread Simon Riggs
On Sat, 19 Nov 2022 at 10:59, Simon Riggs  wrote:

> New version attached.

Fix for doc xref

-- 
Simon Riggs                http://www.EnterpriseDB.com/


hibernate_startup.v10.patch
Description: Binary data


Question concerning backport of CVE-2022-2625

2022-11-20 Thread Roberto C . Sánchez
Greetings PGSQL hackers,

I am working on a backport of CVE-2022-2625 to PostgreSQL 9.6 and 9.4.
I am starting from commit 5919bb5a5989cda232ac3d1f8b9d90f337be2077.

The backport to 9.6 was relatively straightforward, the principal change
being to omit some of the hunks related to commands in 9.6 that did not
have support for 'IF NOT EXISTS'.  When it came to 9.4, things got a
little more interesting.  There were additional instances of commands
that did not have support for 'IF NOT EXISTS' and some of the
constructions were slightly different as well, but nothing insurmountable
there.

I did have to hack at the 9.4 test harness a bit since the
test_extensions sub-directory seems to have been introduced post-9.4 and
it seemed like a good idea to have the actual tests from the
aforementioned commit to help guard against some sort of unintended
change on my part.  However, after I got through the CINE changes and
started dealing with the COR changes I ran into something fairly
peculiar.  The test output included this:

 DROP VIEW ext_cor_view; 
 CREATE TYPE test_ext_type;
 CREATE EXTENSION test_ext_cor;  -- fail
 ERROR:  type test_ext_type is not a member of extension "test_ext_cor"
 DETAIL:  An extension is not allowed to replace an object that it does not own.
 DROP TYPE test_ext_type;
 -- this makes a shell "point <<@@ polygon" operator too
 CREATE OPERATOR @@>> ( PROCEDURE = poly_contain_pt,
   LEFTARG = polygon, RIGHTARG = point,
   COMMUTATOR = <<@@ );
 CREATE EXTENSION test_ext_cor;  -- fail
 ERROR:  operator <<@@(point,polygon) is not a member of extension 
"test_ext_cor"
 DETAIL:  An extension is not allowed to replace an object that it does not own.
 DROP OPERATOR <<@@ (point, polygon);
 CREATE EXTENSION test_ext_cor;  -- now it should work
+ERROR:  operator 16427 is not a member of extension "test_ext_cor"
+DETAIL:  An extension is not allowed to replace an object that it does not own.
 SELECT ext_cor_func();

This made me suspect that there was an issue with 'DROP OPERATOR'.
After a little scavenger hunt, I located a commit which appears to be
related, c94959d4110a1965472956cfd631082a96f64a84, and which was made
post-9.4.  So then, my question: is the existing behavior that produces
"ERROR:  operator ... is not a member of extension ..." a sufficient
guard against the CVE-2022-2625 vulnerability when it comes to
operators?  (My thought is that it might be sufficient, and if it is I
would need to add something like 'DROP OPERATOR @@>> (point, polygon);'
to allow the extension creation to work and the test to complete.)

If the apparently buggy behavior is not a sufficient guard, then is a
backport of c94959d4110a1965472956cfd631082a96f64a84 in conjunction with
the CVE-2022-2625 fix the correct solution?

Regards,

-Roberto

-- 
Roberto C. Sánchez




Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oh, seems like it is not my day :) The image is fixed again.


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oops, wrong image; this is the correct one. But it is a 1-run test, so it
shows only basic correlation.


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Hello.

On Wed, Nov 16, 2022 at 3:44 AM Andres Freund  wrote:
> Approach 1:

> We could have an atomic variable in ProcArrayStruct that counts the amount of
> wasted effort and have processes update it whenever they've wasted a
> meaningful amount of effort.  Something like counting the skipped elements in
> KnownAssignedXidsGetAndSetXmin in a function local static variable and
> updating the shared counter whenever that reaches

I made a WIP patch for that approach and ran some initial tests. It
seems like it works pretty well.
At least it is better than the previous approaches for standbys without a
high read-only load.

Both the patch and a graph are attached. The strange numbers are the limits
of wasted work at which compression is performed.
I have used the same [1] testing script and configuration as before
(two 16-CPU machines, a long transaction on the primary at the 60th second,
simple-update and select-only for pgbench).

If such an approach looks committable, I could do more careful
performance testing to find the best value for
WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.

[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
--
Michail Nikolaev
Index: src/backend/storage/ipc/procarray.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c	(revision fb32748e32e2b6b2fcb32220980b93d5436f855e)
+++ b/src/backend/storage/ipc/procarray.c	(date 1668950653116)
@@ -272,6 +272,7 @@
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsWastedSnapshotWork;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -451,6 +452,10 @@
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsWastedSnapshotWork = (pg_atomic_uint32 *)
+			ShmemInitStruct("KnownAssignedXidsWastedSnapshotWork",
+			sizeof(pg_atomic_uint32), &found);
+		pg_atomic_init_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 	}
 }
 
@@ -4616,20 +4621,9 @@
 
 	if (!force)
 	{
-		/*
-		 * If we can choose how much to compress, use a heuristic to avoid
-		 * compressing too often or not often enough.
-		 *
-		 * Heuristic is if we have a large enough current spread and less than
-		 * 50% of the elements are currently in use, then compress. This
-		 * should ensure we compress fairly infrequently. We could compress
-		 * less often though the virtual array would spread out more and
-		 * snapshots would become more expensive.
-		 */
-		int			nelements = head - tail;
-
-		if (nelements < 4 * PROCARRAY_MAXPROCS ||
-			nelements < 2 * pArray->numKnownAssignedXids)
+#define WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS 111
+		if (pg_atomic_read_u32(KnownAssignedXidsWastedSnapshotWork)
+	< WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS)
 			return;
 	}
 
@@ -4650,6 +4644,8 @@
 
 	pArray->tailKnownAssignedXids = 0;
 	pArray->headKnownAssignedXids = compress_index;
+	/* Reset wasted work counter */
+	pg_atomic_write_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 }
 
 /*
@@ -5031,6 +5027,7 @@
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 			   TransactionId xmax)
 {
+	ProcArrayStruct *pArray = procArray;
 	int			count = 0;
 	int			head,
 tail;
@@ -5078,6 +5075,10 @@
 		}
 	}
 
+	/* Add number of invalid items scanned to wasted work counter */
+	pg_atomic_add_fetch_u32(KnownAssignedXidsWastedSnapshotWork,
+			(head - tail) - pArray->numKnownAssignedXids);
+
 	return count;
 }
 


Re: How to *really* quit psql?

2022-11-20 Thread Fabien COELHO



Hello David,

Question: is there any way to really abort a psql script from an 
included file?


Under what circumstances would it be appropriate for a script to take
it on itself to decide that?  It has no way of knowing what the next -f
option is or what the user intended.


Can we add an exit code argument to the \quit meta-command that could be
set to non-zero and, combined with ON_ERROR_STOP, produces the desired
effect of aborting everything just like an error under ON_ERROR_STOP does
(which is the workaround here I suppose, but an ugly one that involves the
server).


I like the simple idea of adding an optional exit status argument to 
\quit. I'm unsure whether "ON_ERROR_STOP" should or should not change the 
behavior, or whether it should just exit(n) with \quit n.


Note that \quit is already used to abort a psql script when loading
extensions, to prevent them from being run directly by psql:


-- from some sql files in "contrib/pg_stat_statements/":
\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.10'" to load this file. \quit

But the same trick would fail if the guard is reached via an include.
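
With an optional exit status, the guard could also report failure to the
caller; a sketch of the proposed (not yet implemented) form:

    -- Hypothetical: abort the whole psql run with exit status 2.
    \echo This file is not meant to be run directly.
    \quit 2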

--
Fabien.