Proposal: Allow user with pg_monitor role to call pg_stat_reset* functions

2022-11-21 Thread sirisha chamarthi
Hi Hackers,

At present, calling pg_stat_reset* functions requires superuser access
unless execute permission is explicitly granted on them. In this thread, I am
proposing to grant execute on them to users with the pg_monitor role. This
comes in handy for monitoring users (members of pg_monitor) who want to reset
the stats and analyze them afresh. Do you see any concerns with this
approach?
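
For example, a rough sketch of what this would simplify (the role name
"metrics_user" is only an example):

    -- today, a superuser has to grant execute explicitly, per function:
    GRANT pg_monitor TO metrics_user;
    GRANT EXECUTE ON FUNCTION pg_stat_reset() TO metrics_user;
    GRANT EXECUTE ON FUNCTION pg_stat_reset_shared(text) TO metrics_user;

    -- with the proposal, pg_monitor membership alone would be enough to run:
    -- SELECT pg_stat_reset();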

Thanks,
Sirisha


Re: Reducing power consumption on idle servers

2022-11-21 Thread Laurenz Albe
On Mon, 2022-11-21 at 11:42 +0530, Bharath Rupireddy wrote:
> On Mon, Nov 21, 2022 at 10:37 AM Laurenz Albe  
> wrote:
> > 
> > On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > > I'll wait 24 hours before committing, to
> > > provide a last chance for anyone who wants to complain about dropping
> > > promote_trigger_file.
> > 
> > Remove "promote_trigger_file"?  Now I have never seen anybody use that
> > parameter, but I don't think that it is a good idea to deviate from our
> > usual standard of deprecating a feature for about five years before
> > actually removing it.
> 
> I'm not sure what the guidelines are here, however years have gone by
> since pg_ctl promote [1] in 2011 and pg_promote() [2] in 2018 were
> added. With two such alternatives in place for many years, it was sort
> of an undeclared deprecation of promote_trigger_file GUC. And the
> changes required to move to newer ways from the GUC aren't that hard
> for those who're still relying on the GUC. Therefore, I think it's now
> time for us to do away with the GUC.

I disagree.  With the same argument, you could rip out "serial", since we
have supported identity columns since v11.

I understand the desire to avoid needless wakeups, but isn't it possible
to only perform the regular poll if "promote_trigger_file" is set?

Yours,
Laurenz Albe




Re: Reducing power consumption on idle servers

2022-11-21 Thread Laurenz Albe
On Mon, 2022-11-21 at 07:36 +, Simon Riggs wrote:
> On Mon, 21 Nov 2022 at 05:07, Laurenz Albe  wrote:
> > 
> > On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > > I'll wait 24 hours before committing, to
> > > provide a last chance for anyone who wants to complain about dropping
> > > promote_trigger_file.
> > 
> > Remove "promote_trigger_file"?  Now I have never seen anybody use that
> > parameter, but I don't think that it is a good idea to deviate from our
> > usual standard of deprecating a feature for about five years before
> > actually removing it.
> 
> We aren't removing the ability to promote, just enforcing a change to
> a better mechanism, hence I don't see a reason for a long(er)
> deprecation period than we have already had.

We have had a deprecation period?  I looked at the documentation, but found
no mention of a deprecation.  How hard can it be to leave the GUC and only
poll for the existence of the file if it is set?

I personally don't need the GUC, and I know nobody who does, but I think
we should not be cavalier about introducing unnecessary compatibility breaks.

Yours,
Laurenz Albe




Operation log for major operations

2022-11-21 Thread Dmitry Koval

Hi, hackers!

It is important for customer support to know what system operations 
(pg_resetwal, pg_rewind, pg_upgrade, ...) have been executed on the 
database.  A possible implementation of a log for system operations (an
operation log) is attached to this email.


Introduction.
-
The operation log is designed to store information about critical system
events (like pg_upgrade, pg_resetwal, etc.).
This information is not interesting to the ordinary user, but it is very
important for the vendor's technical support.
An example: a client complains to technical support about DB damage
(damage that was caused by pg_resetwal, which was "silently" executed by one
of the administrators).


Concepts.
-
* the operation log is placed in the file 'global/pg_control', starting at
position 4097 (log size is 4kB);
* the user cannot modify the operation log; the log can be changed only by
function call (from postgres or from postgres utilities);
* the operation log is a ring buffer (with CRC-32 protection); entries are
deleted from the operation log only when the buffer overflows;

* an SQL function is used to read the operation log data.

Example of operation log data.
--

>select * from pg_operation_log();
    event    | edition | version |    lsn    |          last          | count
-------------+---------+---------+-----------+------------------------+-------
 startup     | vanilla | 10.22.0 | 0/828     | 2022-11-18 23:06:27+03 |     1
 pg_upgrade  | vanilla | 15.0.0  | 0/828     | 2022-11-18 23:06:27+03 |     1
 startup     | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:11:53+03 |     3
 pg_resetwal | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:09:53+03 |     2
(4 rows)

Some details about inserting data to operation log.
---
There are two modes of inserting information about events in the 
operation log.


* MERGE mode (events "startup", "pg_resetwal", "pg_rewind").
We search the ring buffer of the operation log for an event of the same type
("startup", for example) and the same version number.
If the event is found, we increment its event counter by 1 and update the
event's date/time (the "last" field) to the current value.
If the event is not found, we add this event to the ring buffer (see
INSERT mode).

* INSERT mode (events "bootstrap", "pg_upgrade", "promoted").
We add the event to the ring buffer (without searching).


P.S. The file 'global/pg_control' was chosen as the operation log storage
because the data in this file cannot be removed or modified in a simple way,
and no extensions or utilities need to be changed to support this file.


I attached the patch (v1-0001-Operation-log.patch) and extended 
description of operation log (Operation-log.txt).



With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Operation log structures.
-
The operation log is placed in the pg_control file (file size 
PG_CONTROL_FILE_SIZE=8192).
Operation log size is PG_OPERATION_LOG_FULL_SIZE=4096.
Operation log is a ring buffer of 170 elements (24 bytes each) with header (16 
bytes).

a) Operation log element:

typedef struct OperationLogData
{
    uint8       ol_type;        /* operation type */
    uint8       ol_edition;     /* postgres edition */
    uint16      ol_count;       /* number of operations */
    uint32      ol_version;     /* postgres version */
    pg_time_t   ol_timestamp;   /* = int64, operation date/time */
    XLogRecPtr  ol_lsn;         /* = uint64, last checkpoint record ptr */
} OperationLogData;

Description of fields:

* ol_type - event type. The types are contained in an enum:

typedef enum
{
OLT_BOOTSTRAP = 1,  /* bootstrap */
OLT_STARTUP,/* server startup */
OLT_RESETWAL,   /* pg_resetwal */
OLT_REWIND, /* pg_rewind */
OLT_UPGRADE,/* pg_upgrade */
OLT_PROMOTED,   /* promoted */
OLT_NumberOfTypes   /* should be last */
}   ol_type_enum;

* ol_edition - PostgreSQL edition (this field is important for custom 
PostgreSQL editions).
The editions are contained in an enum:

typedef enum
{
ED_PG_ORIGINAL = 0
/* Here can be custom editions */
}   PgNumEdition;

* ol_count - the number of fired events in case events of this type can be
merged (otherwise 1). The maximum is 65535;
* ol_version - version formed using PG_VERSION_NUM (PG_VERSION_NUM*100;
for example: 13000800 for v13.8). The last two digits can be used for a patch
version;
* ol_timestamp - date/time of the last fired event in case events of this type
can be merged (otherwise the date/time of the event);
* ol_lsn - "Latest checkpoint location" v

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread Masahiko Sawada
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
 wrote:
>
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 
> > 256 for the largest node kind. That's not an issue in practice, since we 
> > never need to grow it, and we only compare that value with the count in an 
> > Assert(), so I just set it to zero. That does break an invariant, so it's 
> > not great. We could use 2 bytes to be strictly correct in all cases, but 
> > that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro 
> for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the 
> two smallest node kinds. In the hackish attached patch, I modified the fixed 
> height search benchmark to search a small (within L1 cache) tree thousands of 
> times. For the first set I modified node4's maximum fanout and filled it up. 
> For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 
> (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16520 |          0 |            3
>
> NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16456 |          0 |           17
>
> NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16456 |          0 |           89
>
> NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            16488 |          0 |          327
>
>
> node32:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16488 |          0 |            5
> (1 row)
>
> NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16520 |          0 |           28
>
> NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16408 |          0 |           79
>
> NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            24616 |          0 |          199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the 
> cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not 
> part of the common set, so only part of variable-sized nodes), 3 has a nice 
> property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.

Also, even if we make node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.

node3 :  5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-20, sirisha chamarthi wrote:

> Hi Hackers,
> 
> forking this thread from the discussion [1] as suggested by Amit.
> 
> Catalog_xmin is not advanced when a logical slot is invalidated (lost)
> until the invalidated slot is dropped. This patch ignores invalidated slots
> while computing the oldest xmin. Attached a small patch to address this and
> the output after the patch is as shown below.

Oh wow, that's bad :-(  I'll get it patched immediately.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com




Re: Polyphase merge is obsolete

2022-11-21 Thread Peter Eisentraut

On 21.11.22 00:57, Heikki Linnakangas wrote:

On 19/11/2022 13:00, Peter Eisentraut wrote:

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?


That looks better, thanks.

I'm not 100% sure of the "preallocate" argument. If I understand 
correctly, you should pass true if you are writing multiple tapes at the 
same time, and false otherwise. HashAgg passed true, tuplesort passes 
false. However, it's not clear to me why we couldn't just always do the 
preallocation. It seems pretty harmless even if it's not helpful. Or do 
it when there are multiple writer tapes, and not otherwise. The 
parameter was added in commit 0758964963 so presumably there was a 
reason, but at a quick glance at the thread that led to that commit, I 
couldn't see what it was.


Right, these are the kinds of questions such a comment ought to answer.

Let's see if anyone chimes in here, otherwise let's complain in the 
original thread, since it had nothing to do with this one.






Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread John Naylor
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada 
wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
>  wrote:

> > Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3 + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.

> Also, even if we have the node3 a variable-sized node, size class 1
> for node3 could be a good choice since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3 :  5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite
compelling.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Getting rid of SQLValueFunction

2022-11-21 Thread Michael Paquier
On Sun, Nov 20, 2022 at 03:15:34PM -0800, Ted Yu wrote:
> + * timestamp.  These require a specific handling with their typmod is given
> + * by the function caller through their SQL keyword.
> 
> typo: typmod is given -> typmod given
> 
> Other than the above, code looks good to me.

Thanks for double-checking.  I intended a different wording, actually,
so fixed this one.  And applied after an extra round of reviews.
--
Michael




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Pavel Borisov
> I have a very serious concern about the current patch set, as someone who has
> faced transaction id wraparound in the past.
>
> I can start by saying I think it would be helpful (if the other issues are
> approached reasonably) to have 64-bit xids, but there is an important piece
> of context in preventing xid wraparounds that seems missing from this patch
> unless I missed something.
>
> XID wraparound is a symptom, not an underlying problem.  It usually occurs 
> when autovacuum or other vacuum strategies have unexpected stalls and 
> therefore fail to work as expected.  Shifting to 64-bit XIDs dramatically 
> changes the sorts of problems that these stalls are likely to pose to 
> operational teams.  -- you can find you are running out of storage rather 
> than facing an imminent database shutdown.  Worse, this patch delays the 
> problem until some (possibly far later!) time, when vacuum will take far 
> longer to finish, and options for resolving the problem are diminished.  As a 
> result I am concerned that merely changing xids from 32-bit to 64-bit will 
> lead to a smaller number of far more serious outages.
>
> What would make a big difference from my perspective would be to combine this 
> with an inverse system for warning that there is a problem, allowing the 
> administrator to throw warnings about xids since last vacuum, with a 
> configurable threshold.  We could have this at two billion by default as that 
> would pose operational warnings not much later than we have now.
>
> Otherwise I can imagine cases where instead of 30 hours to vacuum a table, it 
> takes 300 hours on a database that is short on space.  And I would not want 
> to be facing such a situation.

Hi, Chris!
I had a similar stance when I started working on this patch. Of
course, it seemed horrible just to postpone the consequences of
inadequate monitoring, too-long-running transactions that prevent
aggressive autovacuum, etc. So I can understand your point.

Over time I've come to a somewhat different view of this feature, i.e.

1. It's important to correctly set up monitoring, the cut-off of long
transactions, etc. anyway (a concrete example query follows this list).
It's not the responsibility of the pre-wraparound vacuum to report
inadequate monitoring etc. Furthermore, in real life it is already too
late when the protection against 32-bit wraparound kicks in, and it
causes much downtime at an unexpected moment. (The rough analogy for that
is a machine running at 120mph that turns every control off and applies
full brakes just because the cooling liquid is low (of course there might
be a warning previously, but anyway).)

2. The checks and handlers for an event that is never expected in the
cluster's lifetime (~200 years at a constant rate of 1e6 TPS) can simply
be dropped. Of course we still need to do automatic routine maintenance
like cutting SLRU buffers (but with a much bigger interval if we have
much disk space, e.g.). But I consider that we need not care what happens
to a cluster after >200 years (it will be migrated many times before
that, for many reasons not related to Postgres, even for the most
conservative owners). So the radical proposal is to drop 64-bit
wraparound handling at all. The most moderate one is just not taking very
much care that after 200 years we have more hassle than next month if we
haven't set up everything correctly. Next month's pain will be more
significant even if it teaches the DBA something.
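
(As a concrete example of the monitoring I mean in point 1, the usual kind
of wraparound-headroom check is something like:

    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

with an alert threshold set well below autovacuum_freeze_max_age and the
hard limits.)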

Big thanks for your view on the general implementation of this feature, anyway.

Kind regards,
Pavel Borisov.
Supabase




Re: New docs chapter on Transaction Management and related changes

2022-11-21 Thread Laurenz Albe
On Fri, 2022-11-18 at 14:28 -0500, Bruce Momjian wrote:
> New patch attached.

Thanks.

> --- a/doc/src/sgml/ref/release_savepoint.sgml
> +++ b/doc/src/sgml/ref/release_savepoint.sgml

> +   RELEASE SAVEPOINT releases the named savepoint and
> +   all active savepoints that were created after the named savepoint,
> +   and frees their resources.  All changes made since the creation of the
> +   savepoint, excluding rolled back savepoints changes, are merged into
> +   the transaction or savepoint that was active when the named savepoint
> +   was created.  Changes made after RELEASE SAVEPOINT
> +   will also be part of this active transaction or savepoint.

I am not sure if "rolled back savepoints changes" is clear enough.
I understand that you are trying to avoid the term "subtransaction".
How about:

  All changes made since the creation of the savepoint that didn't already
  get rolled back are merged ...
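
A short example might also help readers; something along these lines
(assuming a table t exists):

  BEGIN;
  INSERT INTO t VALUES (1);
  SAVEPOINT a;
  INSERT INTO t VALUES (2);
  SAVEPOINT b;
  INSERT INTO t VALUES (3);
  ROLLBACK TO SAVEPOINT b;  -- the insert of 3 is rolled back
  RELEASE SAVEPOINT a;      -- releases a and b; the insert of 2 is kept
  COMMIT;                   -- t ends up containing 1 and 2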

> --- a/doc/src/sgml/ref/rollback.sgml
> +++ b/doc/src/sgml/ref/rollback.sgml
>
> +  If AND CHAIN is specified, a new (unaborted)

*Sigh* I'll make one last plea for "not aborted".

> --- /dev/null
> +++ b/doc/src/sgml/xact.sgml

> +  
> +   Transactions can be created explicitly using BEGIN
> +   and COMMIT, which creates a transaction block.
> +   An SQL statement outside of a transaction block automatically uses
> +   a single-statement transaction.
> +  

Sorry, I should have noted that the first time around.

Transactions are not created using COMMIT.

Perhaps we should also avoid the term "transaction block".  Even without 
speaking
of a "block", way too many people confuse PL/pgSQL's BEGIN ... END blocks
with transactions.  On the other hand, we use "transaction block" everywhere
else in the documentation...

How about:

  
   Multi-statement transactions can be created explicitly using
   BEGIN or START TRANSACTION and
   are ended using COMMIT or ROLLBACK.
   An SQL statement outside of a transaction block automatically uses
   a single-statement transaction.
  

> + 
> +
> +  Transactions and Locking
> +
> +  
> +   The transaction IDs of currently executing transactions are shown in
> +   pg_locks
> +   in columns virtualxid and
> +   transactionid.  Read-only transactions
> +   will have virtualxids but NULL
> +   transactionids, while read-write transactions
> +   will have both as non-NULL.
> +  

Perhaps the following will be prettier than "have both as non-NULL":

  ..., while both columns will be set in read-write transactions.
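
For illustration (not necessarily something for the docs), the difference is
easy to see from within a transaction, assuming some table t exists:

  BEGIN;
  SELECT locktype, virtualxid, transactionid
    FROM pg_locks
    WHERE pid = pg_backend_pid()
      AND locktype IN ('virtualxid', 'transactionid');
  -- read-only so far: only a virtualxid row is shown

  INSERT INTO t VALUES (1);     -- the first write assigns a transaction ID
  SELECT locktype, virtualxid, transactionid
    FROM pg_locks
    WHERE pid = pg_backend_pid()
      AND locktype IN ('virtualxid', 'transactionid');
  -- now a transactionid row appears as well
  ROLLBACK;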

Yours,
Laurenz Albe




Re: New docs chapter on Transaction Management and related changes

2022-11-21 Thread Alvaro Herrera
Agreed on not using "unaborted", per previous discussion.

On 2022-Nov-21, Laurenz Albe wrote:

> Perhaps we should also avoid the term "transaction block".  Even without 
> speaking
> of a "block", way too many people confuse PL/pgSQL's BEGIN ... END blocks
> with transactions.  On the other hand, we use "transaction block" everywhere
> else in the documentation...

Yeah, I don't understand why we need this "transaction block" term at
all.  It adds nothing.  We could just use the term "transaction", and
little meaning would be lost.  When necessary, we could just say
"explicit transaction" or something to that effect.  In this particular
case, we could modify your proposed wording,

>   
>Multi-statement transactions can be created explicitly using
>BEGIN or START TRANSACTION and
>are ended using COMMIT or ROLLBACK.
>An SQL statement outside of a transaction block automatically uses
>a single-statement transaction.
>   

by removing the word "block":

>Any SQL statement outside of a transaction automatically uses
>a single-statement transaction.

and perhaps add "explicit", but I don't think it's necessary:

>Any SQL statement outside of an explicit transaction automatically
>uses a single-statement transaction.


(I also changed "An" to "Any" because it seems more natural, but I
suppose it's a stylistic choice.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Operation log for major operations

2022-11-21 Thread Justin Pryzby
See also prior discussions:
https://www.postgresql.org/message-id/flat/62750df5b126e1d8ee039a79ef3cc64ac3d47cd5.camel%40j-davis.com
https://www.postgresql.org/message-id/flat/20180228214311.jclah37cwh572t2c%40alap3.anarazel.de

-- 
Justin




Cleanup: Duplicated, misplaced comment in HeapScanDescData

2022-11-21 Thread Matthias van de Meent
Hi,

I noticed that the comment on/beneath rs_numblocks in HeapScanDescData
is duplicated above rs_strategy. I don't know if there should have
been a different comment above rs_strategy, but the current one is
definitely out of place, so I propose to remove it as per attached.

The comment was duplicated in c2fe139c20 with the update to the table
scan APIs, which was first seen in PostgreSQL 11.

Kind regards,

Matthias van de Meent


v1-0001-Remove-duplicate-of-comment-under-rs_numblocks.patch
Description: Binary data


Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Aleksander Alekseev
Hi hackers,

> > I have a very serious concern about the current patch set, as someone who
> > has faced transaction id wraparound in the past.
>
> [...]
>
> I had a similar stance when I started working on this patch. Of
> course, it seemed horrible just to postpone the consequences of
> inadequate monitoring, too-long-running transactions that prevent
> aggressive autovacuum, etc. So I can understand your point.
>
> Over time I've come to a somewhat different view of this feature, i.e.
>
> 1. It's important to correctly set up monitoring, the cut-off of long
> transactions, etc. anyway. It's not the responsibility of the
> pre-wraparound vacuum to report inadequate monitoring etc. Furthermore,
> in real life it is already too late when the protection against 32-bit
> wraparound kicks in, and it causes much downtime at an unexpected moment.
> (The rough analogy for that is a machine running at 120mph that turns
> every control off and applies full brakes just because the cooling
> liquid is low (of course there might be a warning previously, but
> anyway).)
>
> 2. The checks and handlers for an event that is never expected in the
> cluster's lifetime (~200 years at a constant rate of 1e6 TPS) can simply
> be dropped. Of course we still need to do automatic routine maintenance
> like cutting SLRU buffers (but with a much bigger interval if we have
> much disk space, e.g.). But I consider that we need not care what happens
> to a cluster after >200 years (it will be migrated many times before
> that, for many reasons not related to Postgres, even for the most
> conservative owners). So the radical proposal is to drop 64-bit
> wraparound handling at all. The most moderate one is just not taking very
> much care that after 200 years we have more hassle than next month if we
> haven't set up everything correctly. Next month's pain will be more
> significant even if it teaches the DBA something.
>
> Big thanks for your view on the general implementation of this feature, 
> anyway.

I'm inclined to agree with Pavel on this one. Keeping 32-bit XIDs in
order to intentionally trigger XID wraparound as an indicator of dwindling
disk space and/or a misconfigured system (by which time it's usually
too late anyway) is a somewhat questionable approach. It would be great
to notify the user about potential issues with the configuration
and/or the fact that VACUUM doesn't catch up, but that doesn't mean we
should keep 32-bit XIDs in order to achieve this.

-- 
Best regards,
Aleksander Alekseev




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Ashutosh Bapat
Hi Sirisha,

Thanks for identifying the bug and the solution. Some review comments inlined.

On Mon, Nov 21, 2022 at 2:49 PM Alvaro Herrera  wrote:
>
> On 2022-Nov-20, sirisha chamarthi wrote:
>
> > Hi Hackers,
> >
> > forking this thread from the discussion [1] as suggested by Amit.
> >
> > Catalog_xmin is not advanced when a logical slot is invalidated (lost)
> > until the invalidated slot is dropped. This patch ignores invalidated slots
> > while computing the oldest xmin. Attached a small patch to address this and
> > the output after the patch is as shown below.
>
> Oh wow, that's bad :-(  I'll get it patched immediately.

+/* ignore invalid slots while computing the oldest xmin */
+if (TransactionIdIsValid(invalidated_at_lsn))
+continue;

I think the condition should be

if (!XLogRecPtrIsInvalid(invalidated_at_lsn))

since LSN and XID are different data types.

and to be inline with pg_get_replication_slots()
361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
363 walstate = WALAVAIL_REMOVED;

we should also check restart_lsn.

I would write this as

bool invalidated_slot = false;

then under spinlock
invalidated_slot = XLogRecPtrIsInvalid(s->data.restart_lsn) &&
!XLogRecPtrIsInvalid(s->data.invalidated_at);

if (invalidated_slot)
continue.

-- 
Best Wishes,
Ashutosh Bapat




Re: Cleanup: Duplicated, misplaced comment in HeapScanDescData

2022-11-21 Thread Matthias van de Meent
On Mon, 21 Nov 2022 at 12:12, Matthias van de Meent
 wrote:
>
> Hi,
>
> I noticed that the comment on/beneath rs_numblocks in HeapScanDescData
> is duplicated above rs_strategy. I don't know if there should have
> been a different comment above rs_strategy, but the current one is
> definitely out of place, so I propose to remove it as per attached.
>
> The comment was duplicated in c2fe139c20 with the update to the table
> scan APIs, which was first seen in PostgreSQL 11.

I made a mistake in determining this version number; it was PostgreSQL
12 where this commit was first included. Attached is the same patch
with the description updated accordingly.

Kind regards,

Matthias van de Meent


v2-0001-Remove-duplicate-of-comment-under-rs_numblocks.patch
Description: Binary data


Re: Logical Replication Custom Column Expression

2022-11-21 Thread Ashutosh Bapat
On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
 wrote:
>
> Hi all,
>
> Working with PostgreSQL Logical Replication is just great! It helps a lot
> in doing real-time replication for analytical purposes without using any
> other 3rd-party service. Although, over all these years working as a product
> architect for reporting, I have noted a few requirements which are always a
> challenge and could help enhance logical replication even further.
>
> To the point:
> PostgreSQL 14 Logical Replication allows replication of a table to another
> table that exists in another database or even on another host. It also allows
> multiple upstream tables using the same structure to be replicated into a
> single downstream table.
> CREATE PUBLICATION pb_test FOR TABLE test
>
> PostgreSQL15 Logical Replication allows even better replication options, like 
> selecting subsets of the columns from publisher tables. It also supports 
> plenty of options like disable_on_error etc.
> CREATE PUBLICATION pb_test FOR TABLE test ("id", "name")
>
> What it does not support is the option of defining custom column expressions,
> as keys or values, in the upstream (publication). This would give more
> flexibility for making replication from multiple upstreams into fewer
> downstreams while adding more logic. For instance, in a project for analytical
> purposes there is the need to consolidate data from multiple databases into
> one and at the same time keep the origin of each replicated row identified
> by a tenant_id column. In this case we also need the ability to define the
> new column as an additional key which will participate in the key of the
> destination table.
>
> Tenant 1 table
> id serial pk
> description varchar
>
> Tenant 2 table
> id integer pk
> description varchar
>
> Group table
> tenant integer pk
> id integer pk
> description varchar
>
> Possible syntax to achieve that
> CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} 
> ,"id", "name")
>
> Example
> CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", 
> "name")

I think that's a valid use case.

This looks more like a subscription option to me. In multi-subscriber,
multi-publisher scenarios, on one subscriber a given upstream may be
tenant 1, but on another it could be tenant 2. But I don't think we allow
specifying subscription options for a single table. AFAIU, the origin
ids are available as part of the commit record which contained this
change; that's how conflict resolution is supposed to know it. So
somehow the subscriber will need to fetch those from there and set the
tenant.
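
To make that concrete, a purely hypothetical subscriber-side spelling could
look something like the following (the tenant_column/tenant_value options do
not exist; this is only to illustrate where such a setting would live):

  CREATE SUBSCRIPTION sub_tenant1
      CONNECTION 'host=tenant1 dbname=app'
      PUBLICATION pb_test
      WITH (tenant_column = 'tenant', tenant_value = '1');  -- hypothetical options

so that each subscriber stamps the rows it applies with its own tenant value.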

-- 
Best Wishes,
Ashutosh Bapat




Re: Reducing power consumption on idle servers

2022-11-21 Thread Simon Riggs
On Mon, 21 Nov 2022 at 08:40, Laurenz Albe  wrote:
>
> On Mon, 2022-11-21 at 07:36 +, Simon Riggs wrote:
> > On Mon, 21 Nov 2022 at 05:07, Laurenz Albe  wrote:
> > >
> > > On Mon, 2022-11-21 at 10:13 +1300, Thomas Munro wrote:
> > > > I'll wait 24 hours before committing, to
> > > > provide a last chance for anyone who wants to complain about dropping
> > > > promote_trigger_file.
> > >
> > > Remove "promote_trigger_file"?  Now I have never seen anybody use that
> > > parameter, but I don't think that it is a good idea to deviate from our
> > > usual standard of deprecating a feature for about five years before
> > > actually removing it.
> >
> > We aren't removing the ability to promote, just enforcing a change to
> > a better mechanism, hence I don't see a reason for a long(er)
> > deprecation period than we have already had.
>
> We have had a deprecation period?  I looked at the documentation, but found
> no mention of a deprecation.  How hard can it be to leave the GUC and only
> poll for the existence of the file if it is set?
>
> I personally don't need the GUC, and I know nobody who does,

Nobody else does either.

> I disagree.  With the same argument, you could rip out "serial", since we
> have supported identity columns since v11.

...and this is not a user-facing change; only HA systems interface with this.

> but I think
> we should not be cavalier about introducing unnecessary compatibility breaks.

I agree we should not be cavalier, nor has anyone been so. The first
version of the patch was cautious on this very point, but since many
people think we should remove it, it is not a user-facing feature, and
nobody on this thread knows of anybody or anything that uses it, I have
changed my mind and now think we should remove it.

We have two versions to choose from now
* v6 deprecates this option
* v10 removes this option
so it is no effort at all to choose either option.

The main issue is about reducing power usage, so please let's move to
commit something soon, one way or another.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Jakub Wartak
Hi hackers,

the case of the planner's
src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint()
spending literally seconds seems to be a well-known fact among hackers
(in extreme cases it can take over an hour on VLDBs). For those
unfamiliar with it: it is a planner estimation step that reads the real
table index (including dead rows) to find the min/max value. It is a
black-box mechanism that works without any warning, is often hugely
affected by the number of dead tuples in indexes, and has no on/off
switch or built-in limit on how far it can go. It was discussed on the
pgsql mailing lists several times [1]-[5]. It almost seems to work
fine in 99.9% of cases, until it doesn't and blows up badly on larger
systems, and from there the operator doesn't have many choices (a lot
of time having already been wasted identifying the planner as the
root cause):
1) one can properly VACUUM (which everybody seems to agree is the
proper way to go, but it is often problematic due to various other
circumstances, especially on big tables without serious partitioning
strategies) - again, this can be very time consuming
2) one cannot trade the CPU/IO burned on planning (actually fetching
index pages of multi-TB tables) for less accurate plans, and
realistically speaking rewriting queries is often impossible
3) the application might not support enabling prepared statements, and
even if it does, simple queries/reports are also affected
4) there is no visibility into how much activity is spent in btree
index get_actual_variable_endpoint() alone, so one cannot estimate the
system-wide impact

I would like to trigger a discussion on how to give the end user at least
partial control over what the optimizer does here. I was thinking about
several things, and each of them has pros and cons:

a) the attached quick patch (by Simon Riggs) that puts a maximum allowed
cost constraint on the index-lookup machinery as a safeguard (that
#define is debatable; in my testing it reduced the hit from ~850ms to
0.6ms +/- 0.3ms at the current value of 20).
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar); see the query sketch after this
list. The obvious drawback is the naming/categorization, as wait events
are ... well ... as the name "WAIT" implies, while those btree lookups
could easily be ON-CPU activity.
c) Any other idea, e.g. see [3] or [5] (a cache was being proposed).
d) For completeness: a new GUC/knob to completely disable the
functionality (debug_optimizer_minmax_est), but that's really just a
trimmed-down version of the patch.
e) I was also wondering about some DEBUG/NOTICE elog() when taking
more than let's say arbitrary 10s, but that could easily spam the log
file
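
To illustrate (b): with such instrumentation the condition would show up in
an ordinary monitoring query like the one below (the wait class and event
names are the ones proposed above, not something that exists today):

  SELECT pid, state, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Planner';   -- hypothetical wait class from (b)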

A reproducer on a very small dataset follows. Please note the reproducer
here shows the effect on the first-run EXPLAIN; however, in the real
observed situation (multi-TB unpartitioned table) each consecutive planner
operation (just EXPLAIN) on that index was affected (I don't know why
LP_DEAD/hint-bit cleaning was not kicking in; maybe it was, but given
the scale of the problem it was not helping much).

-Jakub Wartak.

[1] - 
https://www.postgresql.org/message-id/flat/54446AE2.6080909%40BlueTreble.com#f436bb41cf044b30eeec29472a13631e
[2] - 
https://www.postgresql.org/message-id/flat/db7111f2-05ef-0ceb-c013-c34adf4f4121%40gmail.com
[3] - 
https://www.postgresql.org/message-id/flat/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru
(SnapshotAny vs SnapshotDirty discussions between Tom and Robert)
[4] - 
https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[5] - https://postgrespro.com/list/thread-id/2436130 (cache)

s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,1000); -- ~310MB table, ~210MB index

s2/start the long running transaction:
=# begin;
=# select min(id) from test;

s1:
=# delete from test where id>100;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force
BitmapHeapScans, which according to backend/access/nbtree/README
won't set LP_DEAD, but my bt_page_items() tests indicate that it does
(??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared hit=9155 read=55276 dirtied=55271
written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test; => index scan needed: 39824 pages from table
(90.00% of total) had 900 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms

with patch, the results are:
p=# explain (buffers, verbose) select * from test where id > 1100;
=> Planning: / Buffers: shared hit=17 read=6 dirtied=3 written=5 =>
Time: 0.253 ms
p=# explain (buffers, verbose) select * 

Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, Ashutosh Bapat wrote:

> I think the condition should be
> 
> if (!XLogRecPtrIsInvalid(invalidated_at_lsn))
>
> since LSN and XID are different data types.

Yeah, this bit is wrong.  I agree with your suggestion to just keep a
boolean flag, as we don't need more than that.

> and to be inline with pg_get_replication_slots()
> 361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
> 362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
> 363 walstate = WALAVAIL_REMOVED;
> 
> we should also check restart_lsn.

Hmm, I'm not sure about this one.  I'm not sure why we check both in
pg_get_replication_slots.  I suppose we didn't want to ignore a slot
only if it had a non-zero invalidated_at in case it was accidental (say,
we initialize a slot as valid, but forget to zero-out the invalidated_at
value); but I think that's pretty much useless.  This is only changed
with the spinlock held, so it's not like you can see partially-set
state.

In fact, as I recall we could replace invalidated_at in
ReplicationSlotPersistentData with a boolean "invalidated" flag, and
leave restart_lsn alone when invalidated.  IIRC the only reason we
didn't do it that way was that we feared some code might observe some
valid value in restart_lsn without noticing that it belonged to an
invalidated slot.  (Which is exactly what happened now, except with a
different field.)

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, Jakub Wartak wrote:

> b) I was wondering about creating a new wait class "Planner" with the
> event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
> naming/categorization as wait events are ... well.. as the name "WAIT"
> implies, while those btree lookups could easily be ON-CPU activity.

I think we should definitely do this, regardless of any other fixes we
add, so that this condition can be identified more easily.  I wonder if
we can backpatch it safely.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."(Julien PUYDT)




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Ashutosh Bapat
On Mon, Nov 21, 2022 at 5:39 PM Alvaro Herrera  wrote:
>
> On 2022-Nov-21, Ashutosh Bapat wrote:
>
> > I think the condition should be
> >
> > if (!XLogRecPtrIsInvalid(invalidated_at_lsn))
> >
> > since LSN and XID are different data types.
>
> Yeah, this bit is wrong.  I agree with your suggestion to just keep a
> boolean flag, as we don't need more than that.
>
> > and to be inline with pg_get_replication_slots()
> > 361 if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
> > 362 !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
> > 363 walstate = WALAVAIL_REMOVED;
> >
> > we should also check restart_lsn.
>
> Hmm, I'm not sure about this one.  I'm not sure why we check both in
> pg_get_replication_slots.  I suppose we didn't want to ignore a slot
> only if it had a non-zero invalidated_at in case it was accidental (say,
> we initialize a slot as valid, but forget to zero-out the invalidated_at
> value); but I think that's pretty much useless.  This is only changed
> with the spinlock held, so it's not like you can see partially-set
> state.
>
> In fact, as I recall we could replace invalidated_at in
> ReplicationSlotPersistentData with a boolean "invalidated" flag, and
> leave restart_lsn alone when invalidated.  IIRC the only reason we
> didn't do it that way was that we feared some code might observe some
> valid value in restart_lsn without noticing that it belonged to an
> invalidate slot.  (Which is exactly what happened now, except with a
> different field.)
>

Maybe. In that case pg_get_replication_slots() should be changed. We
should use the same criteria to decide whether a slot is invalidated
or not in all places.
I am a fan of stricter, all-assumption-covering conditions. In case we
don't want to check restart_lsn, an Assert might be useful to validate
our assumption.

-- 
Best Wishes,
Ashutosh Bapat




Re: Patch: Global Unique Index

2022-11-21 Thread Simon Riggs
On Thu, 17 Nov 2022 at 22:01, Cary Huang  wrote:
>
> Patch: Global Unique Index

Let me start by expressing severe doubt on the usefulness of such a
feature, but also salute your efforts to contribute.

> In other words, a global unique index and a regular partitioned index are 
> essentially the same in terms of their storage structure except that one can 
> do cross-partition uniqueness check, the other cannot.

This is the only workable architecture, since it allows DETACH to be
feasible, which is essential.

You don't seem to mention that this would require a uniqueness check
on each partition. Is that correct? This would result in O(N) cost of
uniqueness checks, severely limiting load speed. I notice you don't
offer any benchmarks on load speed or the overhead associated with
this, which is not good if you want to be taken seriously, but at
least it is recoverable.

(It might be necessary to specify some partitions as READ ONLY, to
allow us to record their min/max values for the indexed cols, allowing
us to do this more quickly.)

> - Supported Features -
> 1. Global unique index is supported only on btree index type

Why? Surely any index type that supports uniqueness is good.

> - Not-supported Features -
> 1. Global uniqueness check with sub-partition tables is not yet supported, as
> we do not have an immediate use case and it may involve major changes in the
> current implementation

Hmm, sounds like a problem. Arranging the calls recursively should work.

> - Create a global unique index -
> To create a regular unique index on a partitioned table, Postgres has to 
> perform heap scan and sorting on every child partition. Uniqueness check 
> happens during the sorting phase and will raise an error if multiple tuples 
> with the same index key are sorted together. To achieve global uniqueness 
> check, we make Postgres perform the sorting after all of the child partitions 
> have been scanned instead of on the "sort per partition" fashion. In 
> otherwords, the sorting only happens once at the very end and it sorts the 
> tuples coming from all the partitions and therefore can ensure global 
> uniqueness.

My feeling is that performance on this will suck so badly that we must
warn people away from it, and tell people if they want this, create
the index at the start and let it load.

Hopefully CREATE INDEX CONCURRENTLY still works.

Let's see some benchmarks on this also please.

You'll need to think about progress reporting early because correctly
reporting the progress and expected run times are likely critical for
usability.

> Example:
>
> > CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);
> > CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);
> > CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);
> > CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL;
> > INSERT INTO gidx_part values(5, 5, 'test');
> > INSERT INTO gidx_part values(15, 5, 'test');
> ERROR:  duplicate key value violates unique constraint "gidx_part1_b_idx"
> DETAIL:  Key (b)=(5) already exists.

Well done.

> - DETACH -
> Since we retain the same partitioned structure, detaching a partition with 
> global unique index is straightforward. Upon DETACH, Postgres will change its 
> relkind from RELKIND_GLOBAL_INDEX to RELKIND_INDEX and remove their 
> inheritance relationship as usual.

It's the only way that works.

> - Optimizer, query planning and vacuum -
> Since no major modification is done on global unique index's structure and 
> storage, it works in the same way as a regular partitioned index. No major 
> change is required to be done on optimizer, planner and vacuum process as 
> they should work in the same way as regular index.

Agreed


Making a prototype is a great first step.

The next step is to understand the good and the bad aspects of it, so
you can see what else needs to be done. You need to be honest and real
about the fact that this may not actually be desirable in practice, or
in a restricted use case.

That means performance analysis of create, load, attach, detach,
INSERT, SELECT, UPD/DEL and anything else that might be affected,
together with algorithmic analysis of what happens for larger N and
larger tables.

Expect many versions; take provisions for many days.

Best of luck

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Allow pageinspect's bt_page_stats function to return a set of rows instead of a single row

2022-11-21 Thread Naeem Akhter
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I've tested and verified the documentation.

RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-21 Thread houzj.f...@fujitsu.com
On Friday, November 18, 2022 8:36 AM Masahiko Sawada  
wrote:
> 
> Here are review comments on v47-0001 and v47-0002 patches:

Thanks for the comments!

> When the parallel apply worker exited, I got the following server log.
> I think this log is not appropriate since the worker was not terminated by
> administrator command but exited by itself. Also, probably it should exit with
> exit code 0?
> 
> FATAL:  terminating logical replication worker due to administrator command
> LOG:  background worker "logical replication parallel worker" (PID
> 3594918) exited with exit code 1

Changed it to report a LOG and exit with code 0.

> ---
> /*
>  * Stop the worker if there are enough workers in the pool or the leader
>  * apply worker serialized part of the transaction data to a file due to
>  * send timeout.
>  */
> if (winfo->serialize_changes ||
> napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
> 
> Why do we need to stop the worker if the leader serializes changes?

Because there might be a partially sent message left in the memory queue if
the send times out, we need to either re-send the same message until it
succeeds or detach from the memory queue. To keep the logic simple, the patch
directly stops the worker in this case.


> ---
> +/*
> + * Release all session level locks that could be held in parallel 
> apply
> + * mode.
> + */
> +LockReleaseAll(DEFAULT_LOCKMETHOD, true);
> +
> 
> I think we call LockReleaseAll() at the process exit (in ProcKill()), but do 
> we
> really need to do LockReleaseAll() here too?

If we don't release locks before ProcKill, we might break an Assert check at
the beginning of ProcKill which is used to ensure all the locks are released.
And it seems ProcKill doesn't release session-level locks after the assert
check. So I think we'd better release them here.

> ---
> 
> +elog(ERROR, "could not find replication state slot
> for replication"
> + "origin with OID %u which was acquired by
> %d", node, acquired_by);
> 
> Let's not break the error log message in the middle so that the user can 
> search
> the message by grep easily.

Changed.

> ---
> +{
> +{"max_parallel_apply_workers_per_subscription",
> +PGC_SIGHUP,
> +REPLICATION_SUBSCRIBERS,
> +gettext_noop("Maximum number of parallel
> apply workers per subscription."),
> +NULL,
> +},
> +&max_parallel_apply_workers_per_subscription,
> +2, 0, MAX_BACKENDS,
> +NULL, NULL, NULL
> +},
> +
> 
> I think we should use MAX_PARALLEL_WORKER_LIMIT as the max value instead.
> MAX_BACKENDS is too high.

Changed.

> ---
> +/*
> + * Indicates whether there are pending messages in the queue.
> The parallel
> + * apply worker will check it before starting to wait.
> + */
> +pg_atomic_uint32   pending_message_count;
> 
> The "pending messages" sounds like individual logical replication messages
> such as LOGICAL_REP_MSG_INSERT. But IIUC what this value actually shows is
> how many streamed chunks are pending to process, right?

Yes, renamed this.

> ---
> When the parallel apply worker raises an error, I got the same error twice 
> from
> the leader worker and parallel worker as follows. Can we suppress either one?
> 
> 2022-11-17 17:30:23.490 JST [3814552] LOG:  logical replication parallel apply
> worker for subscription "test_sub1" has started
> 2022-11-17 17:30:23.490 JST [3814552] ERROR:  duplicate key value violates
> unique constraint "test1_c_idx"
> 2022-11-17 17:30:23.490 JST [3814552] DETAIL:  Key (c)=(1) already exists.
> 2022-11-17 17:30:23.490 JST [3814552] CONTEXT:  processing remote data for
> replication origin "pg_16390" during message type "INSERT" for replication
> target relatio n "public.test1" in transaction 731
> 2022-11-17 17:30:23.490 JST [3814550] ERROR:  duplicate key value violates
> unique constraint "test1_c_idx"
> 2022-11-17 17:30:23.490 JST [3814550] DETAIL:  Key (c)=(1) already exists.
> 2022-11-17 17:30:23.490 JST [3814550] CONTEXT:  processing remote data for
> replication origin "pg_16390" during message type "INSERT" for replication
> target relatio n "public.test1" in transaction 731
> parallel apply worker

It seems similar to the behavior of parallel query which will report the same
error twice. But I agree it might be better for the leader to report something
different. So, I changed the error message reported by the leader in the new
version of the patch.

Best regards,
Hou zj


RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-21 Thread houzj.f...@fujitsu.com
On  Monday, November 21, 2022 2:26 PM Peter Smith  wrote:
> On Fri, Nov 18, 2022 at 6:03 PM Peter Smith 
> wrote:
> >
> > Here are some review comments for v47-0001
> >
> > (This review is a WIP - I will post more comments for this patch next
> > week)
> >
> 
> Here are the rest of my comments for v47-0001

Thanks for the comments!

> ==
> 
> doc/src/sgml/monitoring.
> 
> 1.
> 
> @@ -1851,6 +1851,11 @@ postgres   27093  0.0  0.0  30096  2752 ?
>  Ss   11:34   0:00 postgres: ser
>Waiting to acquire an advisory user lock.
>   
>   
> +  applytransaction
> +  Waiting to acquire acquire a lock on a remote transaction being
> +  applied on the subscriber side.
> + 
> + 
> 
> 1a.
> Typo "acquire acquire"

Fixed.

> ~
> 
> 1b.
> Maybe "on the subscriber side" does not mean much without any context.
> Maybe better to word it as below.
> 
> SUGGESTION
> Waiting to acquire a lock on a remote transaction being applied by a logical
> replication subscriber.

Changed.

> ==
> 
> doc/src/sgml/system-views.sgml
> 
> 2.
> 
> @@ -1361,8 +1361,9 @@
> virtualxid,
> spectoken,
> object,
> -   userlock, or
> -   advisory.
> +   userlock,
> +   advisory or
> +   applytransaction.
> 
> This change removed the Oxford comma that was there before. I assume it was
> unintended.

Changed.

> ==
> 
> .../replication/logical/applyparallelworker.c
> 
> 3. globals
> 
> The parallel_apply_XXX functions were all shortened to pa_XXX.
> 
> I wondered if the same simplification should be done also to the global
> statics...
> 
> e.g.
> ParallelApplyWorkersHash -> PAWorkerHash ParallelApplyWorkersList ->
> PAWorkerList ParallelApplyMessagePending -> PAMessagePending etc...

I personally feel these names look fine as they are.

> ~~~
> 
> 4. pa_get_free_worker
> 
> + foreach(lc, active_workers)
> + {
> + ParallelApplyWorkerInfo *winfo = NULL;
> +
> + winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
> 
> No need to assign NULL because the next line just overwrites that anyhow.

Changed.

> ~
> 
> 5.
> 
> + /*
> + * Try to free the worker first, because we don't wait for the rollback
> + * command to finish so the worker may not be freed at the end of the
> + * transaction.
> + */
> + if (pa_free_worker(winfo, winfo->shared->xid)) continue;
> +
> + if (!winfo->in_use)
> + return winfo;
> 
> Shouldn't the (!winfo->in_use) check be done first as well -- e.g. why are we
> trying to free a worker which is maybe not even in_use?
> 
> SUGGESTION (this will need some comment to explain what it is doing) if
> (!winfo->in_use || !pa_free_worker(winfo, winfo->shared->xid) &&
> !winfo->in_use)
> return winfo;

Since pa_free_worker checks the in_use flag as well, and the current style
looks clean to me, I didn't change this.

But it seems we need to first call pa_free_worker for every worker and then
choose a free one; otherwise a stopped worker's info (shared memory, etc.)
might be left around for a long time. I will think about this and try to fix
it in the next version.

> ~~~
> 
> 6. pa_free_worker
> 
> +/*
> + * Remove the parallel apply worker entry from the hash table. Stop the
> +work if
> + * there are enough workers in the pool.
> + *
> 
> Typo? "work" -> "worker"
> 

Fixed.

> 
> 7.
> 
> + /* Are there enough workers in the pool? */ if (napplyworkers >
> + (max_parallel_apply_workers_per_subscription / 2)) {
> 
> IMO that comment should be something more like "Don't detach/stop the
> worker unless..."
> 

Improved.

> 
> 8. pa_send_data
> 
> + /*
> + * Retry after 1s to reduce the cost of getting the system time and
> + * calculating the time difference.
> + */
> + (void) WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 1000L,
> + WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
> 
> 8a.
> I am not sure you need to explain the reason in the comment. Just saying "Wait
> before retrying." seems sufficient to me.

Changed.

> ~
> 
> 8b.
> Instead of the hardwired "1s" in the comment, and 1000L in the code, maybe
> better to just have another constant.
> 
> SUGGESTION
> #define SHM_SEND_RETRY_INTERVAL_MS 1000
> #define SHM_SEND_TIMEOUT_MS 1

Changed.
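
For reference, a minimal sketch of the retry pattern with the suggested
constants (the timeout value and the surrounding function are illustrative
assumptions here, not the actual patch):

#define SHM_SEND_RETRY_INTERVAL_MS 1000
#define SHM_SEND_TIMEOUT_MS 10000	/* value chosen only for illustration */

	TimestampTz start_time = 0;

	for (;;)
	{
		/* ... try the shm_mq send here; return true on success ... */

		if (start_time == 0)
			start_time = GetCurrentTimestamp();
		else if (TimestampDifferenceExceeds(start_time, GetCurrentTimestamp(),
											SHM_SEND_TIMEOUT_MS))
			return false;		/* give up after the overall timeout */

		/* Wait before retrying. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 SHM_SEND_RETRY_INTERVAL_MS,
						 WAIT_EVENT_LOGICAL_PARALLEL_APPLY_STATE_CHANGE);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}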

> ~
> 
> 9.
> 
> + if (startTime == 0)
> + startTime = GetCurrentTimestamp();
> + else if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
> 
> IMO the initial startTime should be at top of the function otherwise the 
> timeout
> calculation seems wrong.

Setting startTime at the beginning would add unnecessary cost if we don't need
to retry, and starting the count from the first failure looks fine to me.

> ==
> 
> src/backend/replication/logical/worker.c
> 
> 10. handle_streamed_transaction
> 
> + * In streaming case (receiving a block of streamed transaction), for
> + * SUBSTREAM_ON mode, simply redirect it to a file for the proper
> + toplevel
> + * transaction, and for SUBSTREAM_PARALLEL mode, send the changes to
> + parallel
> + * apply workers (LOGICAL_REP_MSG_RELATION or LOG

Re: ExecRTCheckPerms() and many prunable partitions

2022-11-21 Thread Amit Langote
On Mon, Nov 21, 2022 at 9:03 PM Amit Langote  wrote:
> On Thu, Nov 10, 2022 at 8:58 PM Alvaro Herrera  
> wrote:
> > Why do callers of add_rte_to_flat_rtable() have to modify the rte's
> > perminfoindex themselves, instead of having the function do it for them?
> > That looks strange.  But also it's odd that flatten_unplanned_rtes
> > concatenates the two lists after all the perminfoindexes have been
> > modified, rather than doing both things (adding each RTEs perminfo to
> > the global list and offsetting the index) as we walk the list, in
> > flatten_rtes_walker.  It looks like these two actions are disconnected
> > from one another, but unless I misunderstand, in reality the opposite is
> > true.
>
> OK, I have moved the step of updating perminfoindex into
> add_rte_to_flat_rtable(), where it looks up the RTEPermissionInfo for
> an RTE being added using GetRTEPermissionInfo() and lappend()'s it to
> finalrtepermlist before updating the index.  For flatten_rtes_walker()
> then to rely on that facility, I needed to make some changes to its
> arguments to pass the correct Query node to pick the rtepermlist from.
> Overall, setrefs.c changes now hopefully look saner than in the last
> version.
>
> As soon as I made that change, I noticed a bunch of ERRORs in
> regression tests due to the checks in GetRTEPermissionInfo(), though
> none that looked like live bugs.

I figured the no-live-bugs part warrants some clarification.  The
plan-time errors that I saw were caused in many cases by an RTE not
pointing into the correct list or having incorrect perminfoindex, most
or all of those cases involving views.  Passing a wrong perminfoindex
to the executor, though obviously not good and fixed in the latest
version, wasn't a problem in those cases, because none of those tests
would cause the executor to use the perminfoindex, such as by calling
GetRTEPermissionInfo().  I thought about that being problematic in
terms of our coverage of perminfoindex related code in the executor,
but don't really see how we could improve that situation as far as
views are concerned, because the executor is only concerned about
checking permissions for views and perminfoindex is irrelevant in that
path, because the RTEPermissionInfos are accessed directly from
es_rtepermlist.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




[BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-21 Thread Maxim Orlov
Hi!


PREAMBLE

For the last couple of months, I have stumbled into a problem while running
tests on ARM (Debian, aarch64) and some other, more unusual platforms
for my 64-bit XIDs patch set. The contrib/test_decoding test
(catalog_change_snapshot) rarely failed with the following message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) &&
TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I have plenty of failures on ARM, a couple on x86 and none (if memory serves)
on x86-64.

At first, my thought was to blame my 64-bit XID patch for this, but that is
not the case. This error persists from PG15 down to PG10
without any patch applied, though it is hard to reproduce.


PROBLEM

After some investigation, I think the problem is in snapbuild.c
(commit 272248a0c1b1, see [0]). We allocate the InitialRunningXacts
array in builder->context, but by the time we call
SnapBuildPurgeOlderTxn this context may already have been freed. Thus, the
InitialRunningXacts array becomes an array of 2139062143 (i.e. 0x7F7F7F7F)
values. This does not cause the buildfarm to fail because of this code:

if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
 builder->xmin))
return;

Since the cluster is initialised with an XID way less than 0x7F7F7F7F, we
return here, but the problem still exists.
I've attached a patch based on branch REL_15_STABLE to reproduce the
problem on x86-64.
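
To make the failure mode concrete, here is a minimal illustration of reading
through a dangling pointer into a deleted memory context (illustrative only,
not the snapbuild.c code; in assert-enabled builds CLOBBER_FREED_MEMORY
overwrites freed memory with 0x7F bytes, which is where 2139062143 comes from):

	MemoryContext ctx = AllocSetContextCreate(CurrentMemoryContext,
											  "short-lived",
											  ALLOCSET_DEFAULT_SIZES);
	TransactionId *xids = (TransactionId *)
		MemoryContextAlloc(ctx, 4 * sizeof(TransactionId));

	xids[0] = FirstNormalTransactionId;
	MemoryContextDelete(ctx);		/* xids now dangles */
	/* Reading xids[0] here yields 0x7F7F7F7F in assert-enabled builds. */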

With my 64-bit XIDs patch set this problem manifested itself because I
initialize the cluster with an XID far beyond the 32-bit bound.

Alternatively, I tried to use my patch [1] to initialize the cluster with the
first transaction at 2139062143 (i.e. 0x7F7F7F7F),
and then put a pg_usleep call in, just like in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
if (NInitialRunningXacts == 0)
return;

+   pg_usleep(100L * 2L);
+
/* bound check if there is at least one transaction to remove */
if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],

 builder->xmin))

I ran installcheck-force many times for test_decoding's
catalog_change_snapshot and got a segfault.


CONCLUSION

In snapbuild.c, the context-allocated array InitialRunningXacts may be freed,
which causes an assertion failure (at best) or
may lead to more serious problems.


P.S.

Simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr
lsn, xl_running_xacts *runn
 * changes. See SnapBuildXidSetCatalogChanges.
 */
NInitialRunningXacts = nxacts;
-   InitialRunningXacts = MemoryContextAlloc(builder->context,
sz);
+   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext,
sz);
memcpy(InitialRunningXacts, running->xids, sz);
qsort(InitialRunningXacts, nxacts, sizeof(TransactionId),
xidComparator);

seems to solve the described problem, but I don't have the full context of [0]
or of why the array is allocated in builder->context.


[0] https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
[1]
https://www.postgresql.org/message-id/flat/CACG=ezaa4vqYjJ16yoxgrpa-=gxnf0vv3ey9bjgrrrfn2yy...@mail.gmail.com

-- 
Best regards,
Maxim Orlov.
From d09a031f1f807cdfe1e02000b2bf4fd3eaaedd8f Mon Sep 17 00:00:00 2001
From: Maxim Orlov 
Date: Mon, 21 Nov 2022 14:50:02 +0300
Subject: [PATCH] catalog_change_snapshot-fail

---
 contrib/test_decoding/Makefile  | 1007 ++-
 src/backend/replication/logical/snapbuild.c |   12 +
 2 files changed, 1012 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..aaf7a63411 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -3,12 +3,1006 @@
 MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
-REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-   decoding_into_rel binary prepared replorigin time messages \
-   spill slot truncate stream stats twophase twophase_stream
-ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-   oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-   twophase_snapshot slot_creation_error catalog_change_snapshot
+ISOLATION = catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+ 

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-11-21 Thread Maxim Orlov
> Pushed.
>
> --
> With Regards,
> Amit Kapila.
>
>
Hi!

While working on the 64-bit XIDs patch set, I stumbled into problems with
the contrib/test_decoding catalog_change_snapshot test [0].

AFAICS, the problem is not related to the 64-bit XIDs patch set; the
problem is in the InitialRunningXacts array, allocated in
builder->context. Do we really need it to be allocated that way?


[0]
https://www.postgresql.org/message-id/CACG%3DezZoz_KG%2BRyh9MrU_g5e0HiVoHocEvqFF%3DNRrhrwKmEQJQ%40mail.gmail.com


-- 
Best regards,
Maxim Orlov.


Re: Split index and table statistics into different types of stats

2022-11-21 Thread Drouvot, Bertrand

Hi,

On 11/21/22 12:19 AM, Andres Freund wrote:

Hi,

On 2022-11-18 12:18:38 +0100, Drouvot, Bertrand wrote:

On 11/16/22 9:12 PM, Andres Freund wrote:

This still leaves a fair bit of boilerplate. ISTM that the function body
really should just be a single line.

Might even be worth defining the whole function via a macro. Perhaps something 
like

PGSTAT_DEFINE_REL_FIELD_ACCESSOR(PGSTAT_KIND_INDEX, pg_stat_get_index, 
numscans);


Thanks for the feedback!

Right, what about something like the following?

"
#define PGSTAT_FETCH_STAT_ENTRY(pgstat_entry_kind, pgstat_fetch_stat_function, 
relid, stat_name) \
do { \
pgstat_entry_kind *entry = pgstat_fetch_stat_function(relid); \
PG_RETURN_INT64(entry == NULL ? 0 : (int64) 
(entry->stat_name)); \
} while (0)

Datum
pg_stat_get_index_numscans(PG_FUNCTION_ARGS)
{
PGSTAT_FETCH_STAT_ENTRY(PgStat_StatIndEntry, 
pgstat_fetch_stat_indentry, PG_GETARG_OID(0), numscans);
}
"


That's better, but still seems like quite a bit of repetition, given the
number of accessors. I think I like my idea of a macro defining the whole
function a bit better.



Got it, what about creating another preparatory commit to first introduce 
something like:

"
#define PGSTAT_DEFINE_REL_FIELD_ACCESSOR(function_name_prefix, stat_name) \
Datum \
function_name_prefix##_##stat_name(PG_FUNCTION_ARGS) \
{ \
Oid relid = PG_GETARG_OID(0); \
int64   result; \
PgStat_StatTabEntry *tabentry; \
if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL) \
result = 0; \
else \
result = (int64) (tabentry->stat_name); \
PG_RETURN_INT64(result); \
} \

PGSTAT_DEFINE_REL_FIELD_ACCESSOR(pg_stat_get, numscans);

PGSTAT_DEFINE_REL_FIELD_ACCESSOR(pg_stat_get, tuples_returned);
.
.
.
"

If that makes sense to you, I'll submit this preparatory patch.


Now merged.


Thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Allow specification of custom slot for custom nodes

2022-11-21 Thread Pavel Borisov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I've looked at this patch and don't see any problems with it. It is minimally 
invasive and doesn't affect functionality unless someone (e.g. an extension) sets 
its own slotOps in CustomScanState.
Furthermore, the current patch very slightly modifies commit 0b03e5951bf0 with 
the intention of introducing extensibility. So I think adding more 
extensibility regarding different tuple formats is an excellent thing to do.

I'm going to mark it as RfC if there are no objections.

Kind regards,
Pavel Borisov, 
Supabase

The new status of this patch is: Ready for Committer


Re: Allow single table VACUUM in transaction block

2022-11-21 Thread Simon Riggs
On Fri, 18 Nov 2022 at 18:26, Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Fri, Nov 18, 2022 at 7:04 AM Simon Riggs
> >  wrote:
> >> So if consistency is also a strong requirement, then maybe we should
> >> make that new command the default, i.e. make VACUUM always just a
> >> request to vacuum in background. That way it will be consistent.
>
> > Since one fairly common reason for running vacuum in the foreground is
> > needing to vacuum a table when all autovacuum workers are busy, or
> > when they are vacuuming it with a cost limit and it needs to get done
> > sooner, I think this would surprise a lot of users in a negative way.
>
> It would also break a bunch of our regression tests, which expect a
> VACUUM to complete immediately.
>
> >> Can we at least have a vacuum_runs_in_background = on | off, to allow
> >> users to take advantage of this WITHOUT needing to rewrite all of
> >> their scripts?
>
> > I'm not entirely convinced that's a good idea, but happy to hear what
> > others think.
>
> I think the answer to that one is flat no.  We learned long ago that GUCs
> with significant semantic impact on queries are a bad idea.  For example,
> if a user issues VACUUM expecting behavior A and she gets behavior B
> because somebody changed the postgresql.conf entry, she won't be happy.
>
> Basically, I am not buying Simon's requirement that this be transparent.
> I think the downsides would completely outweigh whatever upside there
> may be (and given the shortage of prior complaints, I don't think the
> upside is very large).

Just to say I'm happy with that decision and will switch to the
request for a background vacuum.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: TAP output format in pg_regress

2022-11-21 Thread Dagfinn Ilmari Mannsåker
Andres Freund  writes:

> But either way, it seems nicer to output the # inside a helper function?

Note that the helper function should inject '# ' at the start of every
line in the message, not just the first line.  It might also be worth
having two separate functions: one that prints to stdout, so is only
shown when the harness is in verbose mode, and another which prints to
stderr, so is always shown.  Perl's Test::More calls these note() and
diag(), respectively.
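
For illustration, a minimal, self-contained sketch of such a pair of helpers
(names and details are assumptions, not the actual pg_regress code):

#include <stdio.h>
#include <string.h>

/* Prefix every line of msg with "# ", as TAP requires. */
static void
emit_tap_lines(FILE *out, const char *msg)
{
	const char *line = msg;

	while (*line != '\0')
	{
		const char *eol = strchr(line, '\n');
		int			len = eol ? (int) (eol - line) : (int) strlen(line);

		fprintf(out, "# %.*s\n", len, line);
		line += len;
		if (eol)
			line++;				/* skip the newline itself */
	}
}

/* Shown only when the harness runs in verbose mode (cf. Test::More note()). */
static void
tap_note(const char *msg)
{
	emit_tap_lines(stdout, msg);
}

/* Always shown by the harness (cf. Test::More diag()). */
static void
tap_diag(const char *msg)
{
	emit_tap_lines(stderr, msg);
}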

>> +/*
>> + * In order for this information (or any error information) to be shown
>> + * when running pg_regress test suites under prove it needs to be 
>> emitted
>> + * stderr instead of stdout.
>> + */
>>  if (file_size(difffilename) > 0)
>>  {
>> -printf(_("The differences that caused some tests to fail can be 
>> viewed in the\n"
>> - "file \"%s\".  A copy of the test summary that 
>> you see\n"
>> - "above is saved in the file \"%s\".\n\n"),
>> +status(_("\n# The differences that caused some tests to fail 
>> can be viewed in the\n"
>> +  "# file \"%s\".  A copy of the test summary 
>> that you see\n"
>> +  "# above is saved in the file \"%s\".\n\n"),
>> difffilename, logfilename);
>
> The comment about needing to print to stderr is correct - but the patch
> doesn't do so (anymore)?
>
> The count of failed tests also should go to stderr (but not the report of all
> tests having passed).
>
> bail() probably also ought to print the error to stderr, so the most important
> detail is immediately visible when looking at the failed test result.

Agreed on all three points.

> Greetings,
>
> Andres Freund

- ilmari




Re: Operation log for major operations

2022-11-21 Thread Dmitry Koval

Thanks for references, Justin!

Couple comments about these references.

1) "Make unlogged table resets detectable".
https://www.postgresql.org/message-id/flat/62750df5b126e1d8ee039a79ef3cc64ac3d47cd5.camel%40j-davis.com

This conversation is about a specific problem (unlogged table repair). 
The operation log has another use - it is primarily a helper for diagnostics.


2) "RFC: Add 'taint' field to pg_control."
https://www.postgresql.org/message-id/flat/20180228214311.jclah37cwh572t2c%40alap3.anarazel.de

This is almost the same problem that we want to solve with the operation 
log. Differences between the operation log and what is discussed in that 
thread:
* that thread suggests storing the operation log in the pg_control file - but 
separate from the main pg_control data (and writing it separately too);
* operation log data can be represented in relational form (not as flags), 
which is more usable for an RDBMS;
* the number of registered event types can be increased easily (without 
storage changes).


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: createuser doesn't tell default settings for some options

2022-11-21 Thread Daniel Gustafsson
> On 10 Aug 2022, at 10:28, Daniel Gustafsson  wrote:
> 
>> On 10 Aug 2022, at 08:12, Kyotaro Horiguchi  wrote:
>> 
>> (I suppose this is a pg15 issue)
>> 
>> createuser --help shows the following help text.
>> 
>>> --bypassrls   role can bypass row-level security (RLS) policy
>>> --no-bypassrlsrole cannot bypass row-level security (RLS) policy
>>> --replication role can initiate replication
>>> --no-replication  role cannot initiate replication
>> 
>> For other options the text tells which one is the default, which I
>> think the two options also should have the same.
> 
> Agreed.  For --no-replication the docs in createuser.sgml should fixed to
> include a "This is the default" sentence like the others have as well.

>> The ternary options are checked against decimal 0, but it should use
>> TRI_DEFAULT instead. (attached third)
> 
> Agreed, nice catch.

Attached is my proposal for this, combining your 0001 and 0003 patches with
some docs and test fixups to match.

--
Daniel Gustafsson   https://vmware.com/



v2-0001-Fix-handling-of-default-option-values-in-createus.patch
Description: Binary data


Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, Ashutosh Bapat wrote:

> Maybe. In that case pg_get_replication_slots() should be changed. We
> should use the same criteria to decide whether a slot is invalidated
> or not at all the places.

Right.

> I am a fan of stricter, all-assumption-covering conditions. In case we
> don't want to check restart_lsn, an Assert might be useful to validate
> our assumption.

Agreed.  I'll throw in an assert.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-11-21 Thread Alexander Pyhalov

Justin Pryzby wrote on 2022-11-21 06:00:

I finally found time to digest and integrate your changes into my local
branch.  This fixes the three issues you reported: FORCE_RELEASE, issue
with INVALID partitions issue (for which I adapted your patch into an
earlier patch in my series), and progress reporting.  And rebased.


Hi.

Thank you for the effort.
I've looked through and tested the new patch a bit. Overall it looks good to 
me.
The question I have is whether we should update 
pg_stat_progress_create_index in reindex_invalid_child_indexes() when 
we skip valid indexes.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 7:22 AM Alvaro Herrera  wrote:
> On 2022-Nov-21, Jakub Wartak wrote:
> > b) I was wondering about creating a new wait class "Planner" with the
> > event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
> > naming/categorization as wait events are ... well.. as the name "WAIT"
> > implies, while those btree lookups could easily be ON-CPU activity.
>
> I think we should definitely do this, regardless of any other fixes we
> add, so that this condition can be identified more easily.  I wonder if
> we can backpatch it safely.

I don't think this is safe at all. Wait events can only bracket
individual operations, like the reads of the individual index blocks,
not report on which phase of a larger operation is in progress. If we
try to make them do the latter, we will have a hot mess on our hands.
It might not be a bad capability to have, but it's a different system.

But that doesn't mean we can't do anything about this problem, and I
think we should do something about this problem. It's completely crazy
that after this many years of people getting hosed by this, we haven't
taken more than half measures to fix the problem. I think commit
3ca930fc39ccf987c1c22fd04a1e7463b5dd0dfd was the last time we poked at
this, and before that there was commit
fccebe421d0c410e6378fb281419442c84759213, but neither of those
prevented us from scanning an unbounded number of index tuples before
finding one that we're willing to use, as the commit messages pretty
much admit.

What we need is a solution that avoids reading an unbounded number of
tuples under any circumstances. I previously suggested using
SnapshotAny here, but Tom didn't like that. I'm not sure if there are
safety issues there or if Tom was just concerned about the results
being misleading. Either way, maybe there's some variant on that theme
that could work. For instance, could we teach the index scan to stop
if the first 100 tuples that it finds are all invisible? Or to reach
at most 1 page, or at most 10 pages, or something? If we don't find a
match, we could either try to use a dead tuple, or we could just
return false which, I think, would end up using the value from
pg_statistic rather than any updated value. That is of course not a
great outcome, but it is WAY WAY BETTER than spending OVER AN HOUR
looking for a more suitable tuple, as Jakub describes having seen on a
production system.

I really can't understand why this is even mildly controversial. What
exactly to do here may be debatable, but the idea that it's OK to
spend an unbounded amount of resources during any phase of planning is
clearly wrong. We can say that at the time we wrote the code we didn't
know it was going to actually ever happen, and that is fine and true.
But there have been multiple reports of this over the years and we
know *for sure* that spending totally unreasonable amounts of time
inside this function is a real-world problem that actually brings down
production systems. Unless and until we find a way of putting a tight
bound on the amount of effort that can be expended here, that's going
to keep happening.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Tom Lane
Robert Haas  writes:
> I don't think this is safe at all. Wait events can only bracket
> individual operations, like the reads of the individual index blocks,
> not report on which phase of a larger operation is in progress. If we
> try to make them do the latter, we will have a hot mess on our hands.

Agreed.

> What we need is a solution that avoids reading an unbounded number of
> tuples under any circumstances. I previously suggested using
> SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> safety issues there or if Tom was just concerned about the results
> being misleading. Either way, maybe there's some variant on that theme
> that could work. For instance, could we teach the index scan to stop
> if the first 100 tuples that it finds are all invisible? Or to reach
> at most 1 page, or at most 10 pages, or something?

A hard limit on the number of index pages examined seems like it
might be a good idea.

> If we don't find a
> match, we could either try to use a dead tuple, or we could just
> return false which, I think, would end up using the value from
> pg_statistic rather than any updated value.

Yeah, the latter seems like the best bet.  Returning a definitely-dead
value could be highly misleading.  In the end this is meant to be
an incremental improvement on what we could get from pg_statistic,
so it's reasonable to limit how hard we'll work on it.

If we do install such a thing, should we undo any of the previous
changes that backed off the reliability of the result?

regards, tom lane




Re: Allow single table VACUUM in transaction block

2022-11-21 Thread Simon Riggs
On Fri, 18 Nov 2022 at 11:54, Simon Riggs  wrote:
>
> On Thu, 17 Nov 2022 at 20:00, Justin Pryzby  wrote:
> >
> > On Wed, Nov 16, 2022 at 05:14:07PM -0500, Greg Stark wrote:
> > > I think this requesting autovacuum worker should be a distinct
> > > command. Or at least an explicit option to vacuum.
> >
> > +1.  I was going to suggest VACUUM (NOWAIT) ..
>
> Yes, I have no problem with an explicit command.
>
> At the moment the patch runs VACUUM in the background in an autovacuum
> process, but the call is asynchronous, since we do not wait for the
> command to finish (or even start).
>
> So the command names I was thinking of would be one of these:
>
> VACUUM (BACKGROUND) or VACUUM (AUTOVACUUM) - which might be clearer
> or
> VACUUM (ASYNC) - which is more descriptive of the behavior
>
> or we could go for both
> VACUUM (BACKGROUND, ASYNC) - since this allows us to have a
> BACKGROUND, SYNC version in the future


Attached patch implements VACUUM (BACKGROUND).

There are quite a few small details to consider; please read the docs
and comments.

There is a noticeable delay before the background vacuum starts.

-- 
Simon Riggshttp://www.EnterpriseDB.com/


background_vacuum.v3.patch
Description: Binary data


Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Simon Riggs
On Mon, 21 Nov 2022 at 15:01, Tom Lane  wrote:
>
> > What we need is a solution that avoids reading an unbounded number of
> > tuples under any circumstances. I previously suggested using
> > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > safety issues there or if Tom was just concerned about the results
> > being misleading. Either way, maybe there's some variant on that theme
> > that could work. For instance, could we teach the index scan to stop
> > if the first 100 tuples that it finds are all invisible? Or to reach
> > at most 1 page, or at most 10 pages, or something?
>
> A hard limit on the number of index pages examined seems like it
> might be a good idea.

Good, that is what the patch does.

> > If we don't find a
> > match, we could either try to use a dead tuple, or we could just
> > return false which, I think, would end up using the value from
> > pg_statistic rather than any updated value.
>
> Yeah, the latter seems like the best bet.

Yes, just breaking out of the loop is enough to use the default value.

> If we do install such a thing, should we undo any of the previous
> changes that backed off the reliability of the result?

Not sure.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
Thanks Alvaro,  Ashutosh  for your comments.

On Mon, Nov 21, 2022 at 6:20 AM Alvaro Herrera 
wrote:

> On 2022-Nov-21, Ashutosh Bapat wrote:
>
> > Maybe. In that case pg_get_replication_slots() should be changed. We
> > should use the same criteria to decide whether a slot is invalidated
> > or not at all the places.
>
> Right.
>

Agreed.


>
> > I am a fan of stricter, all-assumption-covering conditions. In case we
> > don't want to check restart_lsn, an Assert might be useful to validate
> > our assumption.
>
> Agreed.  I'll throw in an assert.
>

Changed this in the patch to throw an assert.


> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
>


0002-Ignore-invalidated-slots-while-computing-the-oldest-.patch
Description: Binary data


postgres_fdw binary protocol support

2022-11-21 Thread Ilya Gladyshev
Hi everyone,

I have made a patch that introduces support for the libpq binary protocol
in postgres_fdw. The idea is simple: when a user knows that the foreign
server is binary compatible with the local one and their workload could
benefit from using the binary protocol, it can be switched on for a
particular server or even a particular table.

The patch adds a new foreign server and table option 'binary_format'
(off by default) and implements serialization/deserialization of query
results and parameters for the binary protocol. I have tested the patch by
switching the foreign servers in the postgres_fdw.sql tests to binary
format; the only diff was in the text of the error for parsing an invalid
integer value, so it worked as expected for the test. There are a few minor
issues I don't like in the code, and I have yet to write the tests and
docs for it. It would be great to get some feedback and understand
whether this is a welcome feature before proceeding with all of the
above.

Thanks,
Ilya Gladyshev
From 2cb72df03ed94d55cf51531a2d21a7d3369ae27b Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Sat, 19 Nov 2022 17:47:49 +0400
Subject: [PATCH] postgres_fdw libpq binary proto support

---
 contrib/postgres_fdw/option.c   |   6 +-
 contrib/postgres_fdw/postgres_fdw.c | 389 
 contrib/postgres_fdw/postgres_fdw.h |   1 +
 3 files changed, 338 insertions(+), 58 deletions(-)

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index fa80ee2a55..f96cb79b42 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -125,7 +125,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 			strcmp(def->defname, "truncatable") == 0 ||
 			strcmp(def->defname, "async_capable") == 0 ||
 			strcmp(def->defname, "parallel_commit") == 0 ||
-			strcmp(def->defname, "keep_connections") == 0)
+			strcmp(def->defname, "keep_connections") == 0 ||
+			strcmp(def->defname, "binary_format") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -253,6 +254,9 @@ InitPgFdwOptions(void)
 		/* async_capable is available on both server and table */
 		{"async_capable", ForeignServerRelationId, false},
 		{"async_capable", ForeignTableRelationId, false},
+		/* binary_format is available on both server and table */
+		{"binary_format", ForeignServerRelationId, false},
+		{"binary_format", ForeignTableRelationId, false},
 		{"parallel_commit", ForeignServerRelationId, false},
 		{"keep_connections", ForeignServerRelationId, false},
 		{"password_required", UserMappingRelationId, false},
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 8d7500abfb..9344b6f5fc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -76,6 +76,8 @@ enum FdwScanPrivateIndex
 	FdwScanPrivateRetrievedAttrs,
 	/* Integer representing the desired fetch_size */
 	FdwScanPrivateFetchSize,
+	/* Boolean flag showing whether to use binary or text libpq protocol */
+	FdwScanPrivateBinaryFormat,
 
 	/*
 	 * String describing join i.e. names of relations being joined and types
@@ -128,7 +130,8 @@ enum FdwDirectModifyPrivateIndex
 	/* Integer list of attribute numbers retrieved by RETURNING */
 	FdwDirectModifyPrivateRetrievedAttrs,
 	/* set-processed flag (as a Boolean node) */
-	FdwDirectModifyPrivateSetProcessed
+	FdwDirectModifyPrivateSetProcessed,
+	FdwDirectModifyPrivateBinaryFormat
 };
 
 /*
@@ -154,6 +157,7 @@ typedef struct PgFdwScanState
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
 	const char **param_values;	/* textual values of query parameters */
+	int *param_lengths;
 
 	/* for storing result tuples */
 	HeapTuple  *tuples;			/* array of currently-retrieved tuples */
@@ -172,6 +176,7 @@ typedef struct PgFdwScanState
 	MemoryContext temp_cxt;		/* context for per-tuple temporary data */
 
 	int			fetch_size;		/* number of tuples per fetch */
+	bool binary_format; /* whether to use libpq binary or text format */
 } PgFdwScanState;
 
 /*
@@ -195,6 +200,7 @@ typedef struct PgFdwModifyState
 	int			batch_size;		/* value of FDW option "batch_size" */
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
+	bool binary_format;
 
 	/* info about parameters for prepared statement */
 	AttrNumber	ctidAttno;		/* attnum of input resjunk ctid column */
@@ -225,7 +231,7 @@ typedef struct PgFdwDirectModifyState
 	bool		has_returning;	/* is there a RETURNING clause? */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by RETURNING */
 	bool		set_processed;	/* do we set the command es_processed? */
-
+	bool binary_format;
 	/* for remote query execution */
 	PGconn	   *conn;			/* connection for the update */
 	PgFdwConnState *conn_state; /* extra per-connection state */
@@ -233,6 +239,7 @@ typedef struct PgFdwDirectModifyState
 	FmgrInf

Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
 wrote:
> > > What we need is a solution that avoids reading an unbounded number of
> > > tuples under any circumstances. I previously suggested using
> > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > > safety issues there or if Tom was just concerned about the results
> > > being misleading. Either way, maybe there's some variant on that theme
> > > that could work. For instance, could we teach the index scan to stop
> > > if the first 100 tuples that it finds are all invisible? Or to reach
> > > at most 1 page, or at most 10 pages, or something?
> >
> > A hard limit on the number of index pages examined seems like it
> > might be a good idea.
>
> Good, that is what the patch does.



Oh, that's surprisingly simple. Nice!

Is there any reason to tie this into page costs? I'd be more inclined
to just make it a hard limit on the number of pages. I think that
would be more predictable and less prone to surprising (bad) behavior.
And to be honest I would be inclined to make it quite a small number.
Perhaps 5 or 10. Is there a good argument for going any higher?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Simon Riggs
On Mon, 21 Nov 2022 at 15:23, Robert Haas  wrote:
>
> On Mon, Nov 21, 2022 at 10:14 AM Simon Riggs
>  wrote:
> > > > What we need is a solution that avoids reading an unbounded number of
> > > > tuples under any circumstances. I previously suggested using
> > > > SnapshotAny here, but Tom didn't like that. I'm not sure if there are
> > > > safety issues there or if Tom was just concerned about the results
> > > > being misleading. Either way, maybe there's some variant on that theme
> > > > that could work. For instance, could we teach the index scan to stop
> > > > if the first 100 tuples that it finds are all invisible? Or to reach
> > > > at most 1 page, or at most 10 pages, or something?
> > >
> > > A hard limit on the number of index pages examined seems like it
> > > might be a good idea.
> >
> > Good, that is what the patch does.
>
> 
>
> Oh, that's surprisingly simple. Nice!
>
> Is there any reason to tie this into page costs? I'd be more inclined
> to just make it a hard limit on the number of pages. I think that
> would be more predictable and less prone to surprising (bad) behavior.
> And to be honest I would be inclined to make it quite a small number.
> Perhaps 5 or 10. Is there a good argument for going any higher?

+1, that makes the patch smaller and the behavior more predictable.

(Just didn't want to do anything too simple, in case it looked like a kluge.)

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: heavily contended lwlocks with long wait queues scale badly

2022-11-21 Thread Jonathan S. Katz

On 11/20/22 2:56 PM, Andres Freund wrote:

Hi,

On 2022-11-09 17:03:13 -0800, Andres Freund wrote:

On 2022-11-09 09:38:08 -0800, Andres Freund wrote:

I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.


I didn't get to it in time, so I'll leave it for when I'm back.


Took a few days longer, partially because I encountered an independent issue
(see 8c954168cff) while testing.

I pushed it to HEAD now.


Thanks!


I still think it might be worth to backpatch in a bit, but so far the votes on
that weren't clear enough on that to feel comfortable.


My general feeling is "yes" on backpatching, particularly if this is a 
bug and it's fixable without ABI breaks.


My comments were around performing additional workload benchmarking just 
to ensure people feel comfortable that we're not introducing any 
performance regressions, and to consider the Feb 2023 release as the 
time to introduce this (vs. Nov 2022). That gives us ample time to 
determine if there are any performance regressions introduced.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Tom Lane
Robert Haas  writes:
> Is there any reason to tie this into page costs? I'd be more inclined
> to just make it a hard limit on the number of pages. I think that
> would be more predictable and less prone to surprising (bad) behavior.

Agreed, a simple limit of N pages fetched seems appropriate.

> And to be honest I would be inclined to make it quite a small number.
> Perhaps 5 or 10. Is there a good argument for going any higher?

Sure: people are not complaining until it gets into the thousands.
And you have to remember that the entire mechanism exists only
because of user complaints about inaccurate estimates.  We shouldn't
be too eager to resurrect that problem.

I'd be happy with a limit of 100 pages.

regards, tom lane




Re: perform_spin_delay() vs wait events

2022-11-21 Thread Robert Haas
On Sun, Nov 20, 2022 at 6:10 PM Andres Freund  wrote:
> I was wondering about that too - but decided against it because it would only
> show a single wait event. And wouldn't really describe spinlocks as a whole,
> just the "extreme" delays. If we wanted to report the spin waits more
> granular, we'd presumably have to fit the wait events into the lwlock, buffers
> and some new category where we name individual spinlocks.
>
> But I guess a single spinlock wait event type with an ExponentialBackoff wait
> event or such wouldn't be too bad.

Oh, hmm. I guess it is actually bracketing a timed wait, now that I
look closer at what you did. So perhaps your first idea was best after
all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 10:32 AM Tom Lane  wrote:
> Robert Haas  writes:
> > Is there any reason to tie this into page costs? I'd be more inclined
> > to just make it a hard limit on the number of pages. I think that
> > would be more predictable and less prone to surprising (bad) behavior.
>
> Agreed, a simple limit of N pages fetched seems appropriate.
>
> > And to be honest I would be inclined to make it quite a small number.
> > Perhaps 5 or 10. Is there a good argument for going any higher?
>
> Sure: people are not complaining until it gets into the thousands.
> And you have to remember that the entire mechanism exists only
> because of user complaints about inaccurate estimates.  We shouldn't
> be too eager to resurrect that problem.
>
> I'd be happy with a limit of 100 pages.

OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: allowing for control over SET ROLE

2022-11-21 Thread Robert Haas
On Sat, Nov 19, 2022 at 1:00 AM Michael Paquier  wrote:
> On Fri, Nov 18, 2022 at 04:19:15PM -0500, Robert Haas wrote:
> > Fixed that, and the other mistake Álvaro spotted, and also bumped
> > catversion because I forgot that earlier.
>
> I was looking at this code yesterday, to see today that psql's
> completion should be completed with this new clause, similary to ADMIN
> and INHERIT.

Seems like a good idea but I'm not sure about this hunk:

  TailMatches("GRANT|REVOKE", "ALTER", "SYSTEM") ||
- TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "ALTER", "SYSTEM"))
+ TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "ALTER", "SYSTEM") ||
+ TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "SET"))

That might be a correct change for other reasons, but it doesn't seem
related to this patch. The rest looks good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, sirisha chamarthi wrote:

> > > I am a fan of stricter, all-assumption-covering conditions. In case we
> > > don't want to check restart_lsn, an Assert might be useful to validate
> > > our assumption.
> >
> > Agreed.  I'll throw in an assert.
> 
> Changed this in the patch to throw an assert.

Thank you.  I had pushed mine for CirrusCI to test, and it failed the
assert I added in slot.c:
https://cirrus-ci.com/build/4786354503548928
Not yet sure why, looking into it.

You didn't add any asserts to the slot.c code.

In slotfuncs.c, I'm not sure I want to assert anything about restart_lsn
in any cases other than when invalidated_at is set.  In other words, I
prefer this coding in pg_get_replication_slots:

if (!XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
{

Assert(XLogRecPtrIsInvalid(slot_contents.data.restart_lsn));
walstate = WALAVAIL_REMOVED;
}
else
walstate = 
GetWALAvailability(slot_contents.data.restart_lsn);


Your proposal is doing this:

switch (walstate)
{
[...]
case WALAVAIL_REMOVED:

if 
(!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
{
[...]
if (pid != 0)
[...] break;
}

Assert(XLogRecPtrIsInvalid(slot_contents.data.restart_lsn));

which sounds like it could be hit if the replica is connected to the
slot.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Jakub Wartak
Hi,

A draft version of the patch is attached (it is based on Simon's).
I would be happier if we could make that #define into a GUC (just in
case), although I do understand the effort to reduce the number of
various knobs (as their high count causes its own complexity).

-Jakub Wartak.

On Mon, Nov 21, 2022 at 4:35 PM Robert Haas  wrote:
>
> On Mon, Nov 21, 2022 at 10:32 AM Tom Lane  wrote:
> > Robert Haas  writes:
> > > Is there any reason to tie this into page costs? I'd be more inclined
> > > to just make it a hard limit on the number of pages. I think that
> > > would be more predictable and less prone to surprising (bad) behavior.
> >
> > Agreed, a simple limit of N pages fetched seems appropriate.
> >
> > > And to be honest I would be inclined to make it quite a small number.
> > > Perhaps 5 or 10. Is there a good argument for going any higher?
> >
> > Sure: people are not complaining until it gets into the thousands.
> > And you have to remember that the entire mechanism exists only
> > because of user complaints about inaccurate estimates.  We shouldn't
> > be too eager to resurrect that problem.
> >
> > I'd be happy with a limit of 100 pages.
>
> OK.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


0001-Damage-control-for-planner-s-get_actual_variable_end.patch
Description: Binary data


Re: Reducing power consumption on idle servers

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 3:40 AM Laurenz Albe  wrote:
> We have had a deprecation period?  I looked at the documentation, but found
> no mention of a deprecation.  How hard can it be to leave the GUC and only
> poll for the existence of the file if it is set?
>
> I personally don't need the GUC, and I know nobody who does, but I think
> we should not be cavalier about introducing unnecessary compatibility breaks.

As a project, we have a problem with deprecation warnings, simply
because community consensus changes over time as people change their
minds and as participants come and go. Slapping a deprecation notice
on things saying that they will be removed in five years does not mean
they will actually get removed in five years, nor does the failure to
slap a deprecation notice on them mean that they won't be removed in
five years.  We have some instances of following through on such
promises, but plenty where what we said we were going to do and what
we actually did were quite different. Furthermore, even if we were
100% on follow-through on such things, I have no faith in the idea
that a mandatory five-year deprecation period in all cases would be
good for the project.

I think it's totally reasonable to phase out a little-used feature
that no one really cares about faster than a feature that is
heavily-relied upon and for which there are no reasonable
alternatives. Something like standard_conforming_strings or the switch
from md5 to SCRAM authentication affects not only database users but
also drivers and applications; time needs to be allowed for the effect
to seep through the whole ecosystem. Backward compatibility is really
important for the transition period. In this case, though, the
alternatives have already existed for a long time and we have reasons
to think most people have already been using them for a long time.
Moreover, if they haven't been, the transition shouldn't be too hard.
For there to be a problem here, we have to imagine a user who can't
run pg_promote() or pg_ctl promote, who can only create a file,
and who furthermore needs that file to be located someplace other than
the data directory. I'm having a hard time envisioning that.

The reason that I pushed back -- not as successfully as I would have
liked -- on the changes to pg_stop_backup / pg_start_backup is that I
know there are people using the old method successfully, and it's not
just a 1:1 substitution. Here I don't, and it is. I'm totally open to
the feedback that such people exist and to hearing why adopting one of
the newer methods would be a problem for them, if that's the case. But
if there's no evidence that such people exist or that changing is a
problem for them, I don't think waiting 5 years on principle is good
for the project.

P.S. One notable example of how bad we are at these things is that
contrib/xml2 has been deprecated since PostgreSQL 8.3. The deprecation
notice says that it isn't needed any more because all of the
functionality is otherwise available, but some people think that's not
true so we haven't removed the module. An obvious alternative would be
to remove the deprecation notice but FWICT other people think that
getting rid of it is the right idea so we haven't done that either.
Whee.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera 
wrote:

> On 2022-Nov-21, sirisha chamarthi wrote:
>
> > > > I am a fan of stricter, all-assumption-covering conditions. In case
> we
> > > > don't want to check restart_lsn, an Assert might be useful to
> validate
> > > > our assumption.
> > >
> > > Agreed.  I'll throw in an assert.
> >
> > Changed this in the patch to throw an assert.
>
> Thank you.  I had pushed mine for CirrusCI to test, and it failed the
> assert I added in slot.c:
> https://cirrus-ci.com/build/4786354503548928
> Not yet sure why, looking into it.
>

Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
physical slots? My repro is as follows:

select pg_create_physical_replication_slot('s5');
// Load some data to invalidate slot
postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s5 -D .
pg_receivewal: error: unexpected termination of replication stream: ERROR:
 requested WAL segment 000100EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
pg_receivewal: error: unexpected termination of replication stream: ERROR:
 requested WAL segment 000100EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
 slot_name |plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
---+---+---++--+---+++--+--+-+-++---+---
 s3| test_decoding | logical   |  5 | postgres | f | f
 ||  |  769 | | 0/A992E7D0
 | lost   |   | f
 s5|   | physical  ||  | f | f
 ||  |  | 0/EB00  |
| lost   |   | f



>
>
> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
>


Re: Sending SIGABRT to child processes (was Re: Strange failure on mamba)

2022-11-21 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> I suspect that having a GUC would be a good idea. I needed something similar
>> recently, debugging an occasional hang in the AIO patchset. I first tried
>> something like your #define approach and it did cause a problematic flood of
>> core files.

> Yeah, the main downside of such a thing is the risk of lots of core files
> accumulating over repeated crashes.  Nonetheless, I think it'll be a
> useful debugging aid.  Here's a proposed patch.  (I took the opportunity
> to kill off the long-since-unimplemented Reinit switch, too.)

Hearing no complaints, I've pushed this and reconfigured mamba to use
send_abort_for_kill.  Once I've got a core file or two to look at,
I'll try to figure out what's going on there.

> One thing I'm not too clear on is if we want to send SIGABRT to the child
> groups (ie, SIGABRT grandchild processes too).  I made signal_child do
> so here, but perhaps it's overkill.

After further thought, we do have to SIGABRT the grandchildren too,
or they won't shut down promptly.  I think there might be a small
risk of some programs trapping SIGABRT and doing something other than
what we want; but since this is only a debug aid that's probably
tolerable.
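
For anyone following along, signalling the grandchildren comes down to
signalling the process group; a tiny illustration (not the actual
signal_child() code):

#include <signal.h>
#include <sys/types.h>

static void
sigabrt_process_group(pid_t child)
{
	/* A negative pid targets every member of that process group. */
	if (kill(-child, SIGABRT) < 0)
		(void) kill(child, SIGABRT);	/* fall back to the direct child */
}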

regards, tom lane




Re: when the startup process doesn't (logging startup delays)

2022-11-21 Thread Robert Haas
On Sun, Nov 20, 2022 at 9:20 PM Kyotaro Horiguchi
 wrote:
> I prefer Robert's approach as it is more robust for future changes and
> simple. I prefer to avoid this kind of piggy-backing and it doesn't
> seem to be needed in this case. XLogShutdownWalRcv() looks like a
> similar case to me and honestly I don't like it in the sense of
> robustness but it is simpler than checking walreceiver status at every
> site that refers to the flag.

I don't understand what you want changed. Can you be more specific
about what you mean by "Robert's approach"?

I don't agree with Bharath's logic for preferring an if-test to an
Assert. There are some cases where we think we've written the code
correctly but also recognize that the logic is complicated enough that
something might slip through the cracks. Then, a runtime check makes
sense, because otherwise something real bad might happen on a
production instance. But here, I don't think that's the main risk. I
think the main risk is that some future patch tries to add code that
should print startup log messages later on. That would probably be a
coding mistake, and Assert would alert the patch author about that,
whereas amending the if-test would just make the code do something
differently then the author intended.

But I don't feel super-strongly about this, which is why I mentioned
both options in my previous email. I'm not clear on whether you are
expressing an opinion on this point in particular or something more
general.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing power consumption on idle servers

2022-11-21 Thread Tom Lane
Robert Haas  writes:
> The reason that I pushed back -- not as successfully as I would have
> liked -- on the changes to pg_stop_backup / pg_start_backup is that I
> know there are people using the old method successfully, and it's not
> just a 1:1 substitution. Here I don't, and it is. I'm totally open to
> the feedback that such people exist and to hearing why adopting one of
> the newer methods would be a problem for them, if that's the case. But
> if there's no evidence that such people exist or that changing is a
> problem for them, I don't think waiting 5 years on principle is good
> for the project.

We make incompatible changes in every release; see the release notes.
Unless somebody can give a plausible use-case where this'd be a
difficult change to deal with, I concur that we don't need to
deprecate it ahead of time.

regards, tom lane




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, sirisha chamarthi wrote:

> On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera 
> wrote:

> > Thank you.  I had pushed mine for CirrusCI to test, and it failed the
> > assert I added in slot.c:
> > https://cirrus-ci.com/build/4786354503548928
> > Not yet sure why, looking into it.
> 
> Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
> physical slots?

Hmm, that makes no sense.  Is that yet another bug?  Looking.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 17:06:16 +0100, Jakub Wartak wrote:
> @@ -6213,14 +6216,26 @@ get_actual_variable_endpoint(Relation heapRel,
>   /* Fetch first/next tuple in specified direction */
>   while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
>   {
> + BlockNumber block = ItemPointerGetBlockNumber(tid);
>   if (!VM_ALL_VISIBLE(heapRel,
> - 
> ItemPointerGetBlockNumber(tid),
> + block,
>   &vmbuffer))
>   {
>   /* Rats, we have to visit the heap to check visibility 
> */
>   if (!index_fetch_heap(index_scan, tableslot))
>   continue;   /* no visible tuple, 
> try next index entry */
>  
> + {
> + CHECK_FOR_INTERRUPTS();
> + if (block != last_block)
> + visited_pages++;
> +#define VISITED_PAGES_LIMIT 100
> + if (visited_pages > VISITED_PAGES_LIMIT)
> + break;
> + else
> + continue; /* no visible tuple, try next 
> index entry */
> + }
> +
>   /* We don't actually need the heap tuple for anything */
>   ExecClearTuple(tableslot);
>  
> -- 
> 2.30.2

This can't quite be right - isn't this only applying the limit if we found a
visible tuple?

Greetings,

Andres Freund




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Justin Pryzby
This patch version runs "continue" unconditionally (rather than
conditionally, like the previous version).

if (!index_fetch_heap(index_scan, tableslot))
continue;   /* no visible tuple, 
try next index entry */
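
For clarity, here is one way the counting could be arranged so that the limit
applies whether or not a visible tuple is found (a sketch only, reusing the
names from the quoted patch; not a reviewed fix):

	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
	{
		BlockNumber block = ItemPointerGetBlockNumber(tid);

		if (!VM_ALL_VISIBLE(heapRel, block, &vmbuffer))
		{
			/* Count each distinct heap page we are forced to visit. */
			if (block != last_block)
			{
				last_block = block;
				if (++visited_pages > VISITED_PAGES_LIMIT)
					break;		/* give up; fall back to pg_statistic */
			}

			CHECK_FOR_INTERRUPTS();

			/* Rats, we have to visit the heap to check visibility */
			if (!index_fetch_heap(index_scan, tableslot))
				continue;		/* no visible tuple, try next index entry */

			/* We don't actually need the heap tuple for anything */
			ExecClearTuple(tableslot);
		}

		/* ... rest of the loop unchanged: a usable endpoint was found ... */
	}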





Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> Looks like a chunk of the buildfarm doesn't like this - presumably because
>> they use force_parallel_mode = regress. Seems ok to just force that to off in
>> this test?

> Ugh ... didn't occur to me to try that.  I'll take a look.

Hmm, so the problem is:

SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
ERROR:  cannot access temporary tables during a parallel operation

Why in the world is get_raw_page() marked as parallel safe?
It clearly isn't, given this restriction.

regards, tom lane




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
On Mon, Nov 21, 2022 at 9:12 AM Alvaro Herrera 
wrote:

> On 2022-Nov-21, sirisha chamarthi wrote:
>
> > On Mon, Nov 21, 2022 at 8:05 AM Alvaro Herrera 
> > wrote:
>
> > > Thank you.  I had pushed mine for CirrusCI to test, and it failed the
> > > assert I added in slot.c:
> > > https://cirrus-ci.com/build/4786354503548928
> > > Not yet sure why, looking into it.
> >
> > Can this be because restart_lsn is not set to InvalidXLogRecPtr for the
> > physical slots?
>
> Hmm, that makes no sense.  Is that yet another bug?  Looking.
>

It appears to be. wal_sender is setting restart_lsn to a valid LSN even
when the slot is invalidated.

postgres=# select pg_Create_physical_replication_slot('s1');
 pg_create_physical_replication_slot
-
 (s1,)
(1 row)

postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
---++---++--+---+++--+--+-+-++---+---
 s1|| physical  ||  | f | f  |
   |  |  | | |
   |   -8254390272 | f
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
---++---++--+---+++--+--+-+-++---+---
 s1|| physical  ||  | f | f  |
   |  |  | | |
   |   -8374095064 | f
(1 row)

postgres=# \q
postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
pg_receivewal: error: unexpected termination of replication stream: ERROR:
 requested WAL segment 000100EB has already been removed
pg_receivewal: disconnected; waiting 5 seconds to try again
^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
psql (16devel)
Type "help" for help.

postgres=# select * from pg_replication_slots;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> ^C
!?>


In the log:
2022-11-21 17:31:48.159 UTC [3953664] STATEMENT:  START_REPLICATION SLOT
"s1" 0/EB00 TIMELINE 1
TRAP: failed Assert("XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)"),
File: "slotfuncs.c", Line: 371, PID: 3953707


>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/
> "No es bueno caminar con un hombre muerto"
>


Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 12:30 PM Andres Freund  wrote:
> This can't quite be right - isn't this only applying the limit if we found a
> visible tuple?

It doesn't look that way to me, but perhaps I'm just too dense to see
the problem?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Tom Lane
Andres Freund  writes:
> This can't quite be right - isn't this only applying the limit if we found a
> visible tuple?

What it's restricting is the number of heap page fetches, which
might be good enough.  We don't have a lot of visibility here
into how many index pages were scanned before returning the next
not-dead index entry, so I'm not sure how hard it'd be to do better.

regards, tom lane




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 12:38 PM Tom Lane  wrote:
> Andres Freund  writes:
> > This can't quite be right - isn't this only applying the limit if we found a
> > visible tuple?
>
> What it's restricting is the number of heap page fetches, which
> might be good enough.  We don't have a lot of visibility here
> into how many index pages were scanned before returning the next
> not-dead index entry, so I'm not sure how hard it'd be to do better.

Oh. That's really sad. Because I think the whole problem here is that
the number of dead index entries can be huge.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 21, 2022 at 12:38 PM Tom Lane  wrote:
>> What it's restricting is the number of heap page fetches, which
>> might be good enough.  We don't have a lot of visibility here
>> into how many index pages were scanned before returning the next
>> not-dead index entry, so I'm not sure how hard it'd be to do better.

> Oh. That's really sad. Because I think the whole problem here is that
> the number of dead index entries can be huge.

If they're *actually* dead, we have enough mitigations in place I think,
as explained by the long comment in get_actual_variable_endpoint.
The problem here is with still-visible-to-somebody tuples.  At least,
Jakub's test case sets it up that way.
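
Roughly the kind of setup being discussed (a sketch, not Jakub's actual
test; table name and sizes are made up):

CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 1000000);
CREATE INDEX ON t (a);
ANALYZE t;

-- in a second session, leave a transaction open so the deleted rows stay
-- visible-to-somebody and the pages stay not-all-visible:
--   BEGIN; SELECT txid_current();

DELETE FROM t WHERE a > 1000;

-- planning this makes get_actual_variable_endpoint() walk the many
-- not-yet-removable entries at the top of the index, fetching heap pages:
EXPLAIN SELECT * FROM t WHERE a > 999000;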

regards, tom lane




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 12:35 PM Tom Lane  wrote:
> I wrote:
> > Andres Freund  writes:
> >> Looks like a chunk of the buildfarm doesn't like this - presumably because
> >> they use force_parallel_mode = regress. Seems ok to just force that to off 
> >> in
> >> this test?
>
> > Ugh ... didn't occur to me to try that.  I'll take a look.
>
> Hmm, so the problem is:
>
> SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
> ERROR:  cannot access temporary tables during a parallel operation
>
> Why in the world is get_raw_page() marked as parallel safe?
> It clearly isn't, given this restriction.

I suspect that restriction was overlooked when evaluating the marking
of this function.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 21, 2022 at 12:35 PM Tom Lane  wrote:
>> Hmm, so the problem is:
>> 
>> SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
>> ERROR:  cannot access temporary tables during a parallel operation
>> 
>> Why in the world is get_raw_page() marked as parallel safe?
>> It clearly isn't, given this restriction.

> I suspect that restriction was overlooked when evaluating the marking
> of this function.

So it would seem.  PARALLEL RESTRICTED should work, though.

I'll check to see if any sibling functions have the same issue,
and push a patch to adjust them.
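
For anyone wanting to check an existing installation for the same
problem, something along these lines should list the pageinspect
functions that are still marked parallel safe:

SELECT p.proname, p.proparallel
FROM pg_proc p
JOIN pg_depend d ON d.classid = 'pg_proc'::regclass
                AND d.objid = p.oid
                AND d.deptype = 'e'
JOIN pg_extension e ON d.refclassid = 'pg_extension'::regclass
                   AND d.refobjid = e.oid
WHERE e.extname = 'pageinspect'
  AND p.proparallel = 's';   -- 's' = marked parallel safe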

Presumably the parallel labeling has to be fixed as far back as
it's marked that way (didn't look).  Maybe we should push the
test change further back too, just to exercise this.

regards, tom lane




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, sirisha chamarthi wrote:

> It appears to be. wal_sender is setting restart_lsn to a valid LSN even
> when the slot is invalidated.

> postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
> pg_receivewal: error: unexpected termination of replication stream: ERROR:
>  requested WAL segment 000100EB has already been removed
> pg_receivewal: disconnected; waiting 5 seconds to try again
> ^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
> psql (16devel)
> Type "help" for help.
> 
> postgres=# select * from pg_replication_slots;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Whoa, I cannot reproduce this :-(

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming"  (A. Stepanov)




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Andres Freund
Hi,

On November 21, 2022 9:37:34 AM PST, Robert Haas  wrote:
>On Mon, Nov 21, 2022 at 12:30 PM Andres Freund  wrote:
>> This can't quite be right - isn't this only applying the limit if we found a
>> visible tuple?
>
>It doesn't look that way to me, but perhaps I'm just too dense to see
>the problem?

The earlier version didn't have the issue, but the latest seems to only limit 
after a visible tuple has been found. Note the continue; when fetching a heap 
tuple fails.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
On Mon, Nov 21, 2022 at 10:11 AM Alvaro Herrera 
wrote:

> On 2022-Nov-21, sirisha chamarthi wrote:
>
> > It appears to be. wal_sender is setting restart_lsn to a valid LSN even
> > when the slot is invalidated.
>
> > postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
> > pg_receivewal: error: unexpected termination of replication stream:
> ERROR:
> >  requested WAL segment 000100EB has already been removed
> > pg_receivewal: disconnected; waiting 5 seconds to try again
> > ^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
> > psql (16devel)
> > Type "help" for help.
> >
> > postgres=# select * from pg_replication_slots;
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
>
> Whoa, I cannot reproduce this :-(
>

I have an old .partial file in the data directory to reproduce this.

postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active |
active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
wal_status | safe_wal_size | two_phase
---++---++--+---+++--+--+-+-++---+---
 s2|| physical  ||  | f | f  |
   |  |  | 2/DC00  | | lost
  |   | f
(1 row)

postgres=# \q
postgres@pgvm:~$ ls
0001000200D8  0001000200D9
 0001000200DA  0001000200DB
 0001000200DC.partial


>
> --
> Álvaro Herrera        Breisgau, Deutschland  —
> https://www.EnterpriseDB.com/
> "Java is clearly an example of money oriented programming"  (A. Stepanov)
>


Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Simon Riggs
On Mon, 21 Nov 2022 at 18:17, Andres Freund  wrote:
>
> Hi,
>
> On November 21, 2022 9:37:34 AM PST, Robert Haas  
> wrote:
> >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund  wrote:
> >> This can't quite be right - isn't this only applying the limit if we found 
> >> a
> >> visible tuple?
> >
> >It doesn't look that way to me, but perhaps I'm just too dense to see
> >the problem?
>
> The earlier version didn't have the issue, but the latest seems to only limit 
> after a visible tuple has been found. Note the continue; when fetching a heap 
> tuple fails.

Agreed, resolved in this version.


Robert, something like this perhaps? limit on both the index and the heap.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


0001-Damage-control-for-planner-s-get_actual_variable_end.v2.patch
Description: Binary data


Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
On Mon, Nov 21, 2022 at 10:40 AM sirisha chamarthi <
sirichamarth...@gmail.com> wrote:

>
>
> On Mon, Nov 21, 2022 at 10:11 AM Alvaro Herrera 
> wrote:
>
>> On 2022-Nov-21, sirisha chamarthi wrote:
>>
>> > It appears to be. wal_sender is setting restart_lsn to a valid LSN even
>> > when the slot is invalidated.
>>
>> > postgres@pgvm:~$ /usr/local/pgsql/bin/pg_receivewal -S s1 -D .
>> > pg_receivewal: error: unexpected termination of replication stream:
>> ERROR:
>> >  requested WAL segment 000100EB has already been removed
>> > pg_receivewal: disconnected; waiting 5 seconds to try again
>> > ^Cpostgres@pgvm:~$ /usr/local/pgsql/bin/psql
>> > psql (16devel)
>> > Type "help" for help.
>> >
>> > postgres=# select * from pg_replication_slots;
>> > server closed the connection unexpectedly
>> > This probably means the server terminated abnormally
>> > before or while processing the request.
>>
>> Whoa, I cannot reproduce this :-(
>>
>
> I have an old .partial file in the data directory to reproduce this.
>
> postgres=# select * from pg_replication_slots;
>  slot_name | plugin | slot_type | datoid | database | temporary | active |
> active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn |
> wal_status | safe_wal_size | two_phase
>
> ---++---++--+---+++--+--+-+-++---+---
>  s2|| physical  ||  | f | f  |
>|  |  | 2/DC00  | | lost
>   |   | f
> (1 row)
>
> postgres=# \q
> postgres@pgvm:~$ ls
> 0001000200D8  0001000200D9
>  0001000200DA  0001000200DB
>  0001000200DC.partial
>

Just to be clear, it was hitting the assert I added in slotfuncs.c, not
the code you mentioned. Apologies for the confusion. Also, in the case
above the slot is not actually invalidated yet, because the checkpointer
has not run, even though the state says it is lost.
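
For the record, since the invalidation itself is done by the
checkpointer, forcing a checkpoint and re-reading the view shows
whether it has actually happened, e.g.:

CHECKPOINT;
SELECT slot_name, restart_lsn, wal_status
FROM pg_replication_slots WHERE slot_name = 's2';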



>
>
>>
>> --
>> Álvaro Herrera        Breisgau, Deutschland  —
>> https://www.EnterpriseDB.com/
>> "Java is clearly an example of money oriented programming"  (A. Stepanov)
>>
>


Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread Alvaro Herrera
On 2022-Nov-21, sirisha chamarthi wrote:

> I have a old .partial file in the data directory to reproduce this.

I don't think the .partial file is in itself important.  But I think
this whole thing is a distraction.  I managed to reproduce it
eventually, by messing with the slot and WAL at random, and my
conclusion is that we shouldn't mess with this at all for this bugfix.
Instead I'm going to do what Ashutosh mentioned at the start, which is
to verify both the restart_lsn and the invalidated_at, when deciding
whether to ignore the slot.

It seems to me that there is a bigger mess here, considering that we use
the effective_xmin in some places and the other xmin (the one that's
saved to disk) in others.  I have no patience for trying to disentangle
that at this point, though.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)




Re: Catalog_xmin is not advanced when a logical slot is lost

2022-11-21 Thread sirisha chamarthi
On Mon, Nov 21, 2022 at 10:56 AM Alvaro Herrera 
wrote:

> On 2022-Nov-21, sirisha chamarthi wrote:
>
> > I have a old .partial file in the data directory to reproduce this.
>
> I don't think the .partial file is in itself important.  But I think
> this whole thing is a distraction.

Yes, sorry for the confusion.


> I managed to reproduce it
> eventually, by messing with the slot and WAL at random, and my
> conclusion is that we shouldn't mess with this at all for this bugfix.
>

Agreed.


> Instead I'm going to do what Ashutosh mentioned at the start, which is
> to verify both the restart_lsn and the invalidated_at, when deciding
> whether to ignore the slot.
>

Sounds good to me. Thanks!


>
> It seems to me that there is a bigger mess here, considering that we use
> the effective_xmin in some places and the other xmin (the one that's
> saved to disk) in others.  I have no patience for trying to disentangle
> that at this point, though.


> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
> "Having your biases confirmed independently is how scientific progress is
> made, and hence made our great society what it is today" (Mary Gardiner)
>


Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Peter Eisentraut

Question about

"""
Subject: [PATCH v50 5/8] Add initdb option to initialize cluster with
 non-standard xid/mxid/mxoff.

To date testing database cluster wraparound was not easy as initdb has always
inited it with default xid/mxid/mxoff. The option to specify any valid
xid/mxid/mxoff at cluster startup will make these things easier.
"""

Doesn't pg_resetwal already provide that functionality, or at least some 
of it?






Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Andres Freund
Hi, 

On November 21, 2022 10:44:17 AM PST, Simon Riggs 
 wrote:
>Robert, something like this perhaps? limit on both the index and the heap.

I don't think we should add additional code / struct members into very common
code paths for these limits.

I don't really understand the point of limiting in the index - where would the 
large number of pages accessed come from?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Tom Lane
I wrote:
> I'll check to see if any sibling functions have the same issue,
> and push a patch to adjust them.
> Presumably the parallel labeling has to be fixed as far back as
> it's marked that way (didn't look).  Maybe we should push the
> test change further back too, just to exercise this.

Hmm, so this is easy enough to fix in HEAD and v15, as attached.
However, there's a problem in older branches: their newest
versions of pageinspect are older than 1.10, so this doesn't
work as-is.

We could imagine inventing versions like 1.9.1, and providing
a script pageinspect--1.9--1.9.1.sql to do what's done here
as well as (in later branches) pageinspect--1.9.1--1.10.sql that
duplicates pageinspect--1.9--1.10.sql, and then the same again for
1.8 and 1.7 for the older in-support branches.  That seems like
an awful lot of trouble for something that there have been no
field complaints about.

I'm inclined to apply this in HEAD and v15, and revert the
test change in v14, and call it good.

regards, tom lane

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index 5c0736564a..ad5a3ac511 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -13,7 +13,8 @@ OBJS = \
 	rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.9--1.10.sql pageinspect--1.8--1.9.sql \
+DATA =  pageinspect--1.10--1.11.sql \
+	pageinspect--1.9--1.10.sql pageinspect--1.8--1.9.sql \
 	pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
diff --git a/contrib/pageinspect/meson.build b/contrib/pageinspect/meson.build
index 3ec50b9445..25fa7dc20c 100644
--- a/contrib/pageinspect/meson.build
+++ b/contrib/pageinspect/meson.build
@@ -33,6 +33,7 @@ install_data(
   'pageinspect--1.7--1.8.sql',
   'pageinspect--1.8--1.9.sql',
   'pageinspect--1.9--1.10.sql',
+  'pageinspect--1.10--1.11.sql',
   'pageinspect.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/pageinspect/pageinspect--1.10--1.11.sql b/contrib/pageinspect/pageinspect--1.10--1.11.sql
new file mode 100644
index 00..8fa5e105bc
--- /dev/null
+++ b/contrib/pageinspect/pageinspect--1.10--1.11.sql
@@ -0,0 +1,28 @@
+/* contrib/pageinspect/pageinspect--1.10--1.11.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.11'" to load this file. \quit
+
+--
+-- Functions that fetch relation pages must be PARALLEL RESTRICTED,
+-- not PARALLEL SAFE, otherwise they will fail when run on a
+-- temporary table in a parallel worker process.
+--
+
+ALTER FUNCTION get_raw_page(text, int8) PARALLEL RESTRICTED;
+ALTER FUNCTION get_raw_page(text, text, int8) PARALLEL RESTRICTED;
+-- tuple_data_split must be restricted because it may fetch TOAST data.
+ALTER FUNCTION tuple_data_split(oid, bytea, integer, integer, text) PARALLEL RESTRICTED;
+ALTER FUNCTION tuple_data_split(oid, bytea, integer, integer, text, bool) PARALLEL RESTRICTED;
+-- heap_page_item_attrs must be restricted because it calls tuple_data_split.
+ALTER FUNCTION heap_page_item_attrs(bytea, regclass, bool) PARALLEL RESTRICTED;
+ALTER FUNCTION heap_page_item_attrs(bytea, regclass) PARALLEL RESTRICTED;
+ALTER FUNCTION bt_metap(text) PARALLEL RESTRICTED;
+ALTER FUNCTION bt_page_stats(text, int8) PARALLEL RESTRICTED;
+ALTER FUNCTION bt_page_items(text, int8) PARALLEL RESTRICTED;
+ALTER FUNCTION hash_bitmap_info(regclass, int8) PARALLEL RESTRICTED;
+-- brin_page_items might be parallel safe, because it seems to touch
+-- only index metadata, but I don't think there's a point in risking it.
+-- Likewise for gist_page_items.
+ALTER FUNCTION brin_page_items(bytea, regclass) PARALLEL RESTRICTED;
+ALTER FUNCTION gist_page_items(bytea, regclass) PARALLEL RESTRICTED;
diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control
index 7cdf37913d..f277413dd8 100644
--- a/contrib/pageinspect/pageinspect.control
+++ b/contrib/pageinspect/pageinspect.control
@@ -1,5 +1,5 @@
 # pageinspect extension
 comment = 'inspect the contents of database pages at a low level'
-default_version = '1.10'
+default_version = '1.11'
 module_pathname = '$libdir/pageinspect'
 relocatable = true


Re: Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Tom Lane
Andres Freund  writes:
> On November 21, 2022 10:44:17 AM PST, Simon Riggs 
>  wrote:
>> Robert, something like this perhaps? limit on both the index and the heap.

> I don't think we should add additional code / struct members into very common 
> good paths for these limits. 

Yeah, I don't like that either: for one thing, it seems completely
unsafe to back-patch.

regards, tom lane




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Tom Lane
Peter Eisentraut  writes:
>> To date testing database cluster wraparound was not easy as initdb has always
>> inited it with default xid/mxid/mxoff. The option to specify any valid
>> xid/mxid/mxoff at cluster startup will make these things easier.

> Doesn't pg_resetwal already provide that functionality, or at least some 
> of it?

pg_resetwal does seem like a better, more useful home for this; it'd
allow you to adjust these numbers after initial creation which might be
useful.  I'm not sure how flexible it is right now in terms of where
you can set the new values, but that can always be improved.

regards, tom lane




$1 IS NULL with unknown type

2022-11-21 Thread Daniele Varrazzo
Hello,

The operator `IS NULL` doesn't work if the argument has unknown type.
In psycopg 3:

>>> conn.execute("select %s is null", ['foo']).fetchone()
IndeterminateDatatype: could not determine data type of parameter $1

This can get in the way of using the unknown type for strings (but
specifying the text oid for strings is worse, because there is no
implicit cast from string to most types).

It doesn't seem necessary to specify a type for an argument if it only
has to be compared with null: nullness is independent from the type
and is even specified, in the query parameters, in a separate array
from the parameter values.

Maybe this behaviour can be relaxed?

Cheers

-- Daniele




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 12:52:01 -0500, Robert Haas wrote:
> On Mon, Nov 21, 2022 at 12:35 PM Tom Lane  wrote:
> > I wrote:
> > > Andres Freund  writes:
> > >> Looks like a chunk of the buildfarm doesn't like this - presumably 
> > >> because
> > >> they use force_parallel_mode = regress. Seems ok to just force that to 
> > >> off in
> > >> this test?
> >
> > > Ugh ... didn't occur to me to try that.  I'll take a look.
> >
> > Hmm, so the problem is:
> >
> > SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
> > ERROR:  cannot access temporary tables during a parallel operation
> >
> > Why in the world is get_raw_page() marked as parallel safe?
> > It clearly isn't, given this restriction.
> 
> I suspect that restriction was overlooked when evaluating the marking
> of this function.

It's somewhat sad to add this restriction - I've used get_raw_page() (+
other functions) to scan a whole database for a bug. IIRC that actually
did end up using parallelism, albeit likely not very efficiently.

Don't really have a better idea though.

It may be worth inventing a framework where a function could analyze its
arguments (presumably via prosupport) to determine the degree of
parallel safety, but this doesn't seem sufficient reason...

Greetings

Andres Freund




Re: $1 IS NULL with unknown type

2022-11-21 Thread Tom Lane
Daniele Varrazzo  writes:
> The operator `IS NULL` doesn't work if the argument has unknown type.
> conn.execute("select %s is null", ['foo']).fetchone()
> IndeterminateDatatype: could not determine data type of parameter $1

Yeah.

> It doesn't seem necessary to specify a type for an argument if it only
> has to be compared with null: nullness is independent from the type
> and is even specified, in the query parameters, in a separate array
> from the parameter values.

True, the IS NULL operator itself doesn't care about the data type,
but that doesn't mean that noplace else in the system does.

As an example, if we silently resolved the type as "text" as you seem
to wish, and then the client sends a non-null integer in binary format,
we'll likely end up throwing a bad-encoding error.

Maybe that's not a big problem, I dunno.  But we can't just not
resolve the type, we have to pick something.
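
One way around it today is to force the resolution yourself with an
explicit cast, along these lines:

PREPARE is_null_check(text) AS SELECT $1 IS NULL;
EXECUTE is_null_check(NULL);      -- t
EXECUTE is_null_check('foo');     -- f

-- or put the cast into the query text, so the type can be inferred:
PREPARE is_null_check2 AS SELECT $1::text IS NULL;

On the client side the same trick is writing the cast into the query,
e.g. "select %s::text is null".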

regards, tom lane




Re: Cleanup: Duplicated, misplaced comment in HeapScanDescData

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 12:34:12 +0100, Matthias van de Meent wrote:
> On Mon, 21 Nov 2022 at 12:12, Matthias van de Meent
>  wrote:
> >
> > Hi,
> >
> > I noticed that the comment on/beneath rs_numblocks in HeapScanDescData
> > is duplicated above rs_strategy. I don't know if there should have
> > been a different comment above rs_strategy, but the current one is
> > definitely out of place, so I propose to remove it as per attached.
> >
> > The comment was duplicated in c2fe139c20 with the update to the table
> > scan APIs, which was first seen in PostgreSQL 11.
> 
> I made a mistake in determining this version number; it was PostgreSQL
> 12 where this commit was first included. Attached is the same patch
> with the description updated accordingly.

I guess that happened because of the odd placement of the comment from
before the change:

 boolrs_temp_snap;   /* unregister snapshot at scan end? */
-
-/* state set up at initscan time */
-BlockNumber rs_nblocks; /* total number of blocks in rel */
-BlockNumber rs_startblock;  /* block # to start at */
-BlockNumber rs_numblocks;   /* max number of blocks to scan */
-/* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
-BufferAccessStrategy rs_strategy;   /* access strategy for reads */
 boolrs_syncscan;/* report location to syncscan logic? */

We rarely put comments documenting a struct member after it.

I'm inclined to additionally move the "legitimate" copy of the comment
to before rs_numblocks, rather than after it.

Greetings,

Andres Freund




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-21 12:52:01 -0500, Robert Haas wrote:
>> On Mon, Nov 21, 2022 at 12:35 PM Tom Lane  wrote:
>>> Why in the world is get_raw_page() marked as parallel safe?
>>> It clearly isn't, given this restriction.

> It's somewhat sad to add this restriction - I've used get_raw_page() (+
> other functions) to scan a whole database for a bug. IIRC that actually
> did end up using parallelism, albeit likely not very efficiently.
> Don't really have a better idea though.

Me either.

> It may be worth inventing a framework where a function could analyze its
> arguments (presumably via prosupport) to determine the degree of
> parallel safety, but this doesn't seem sufficient reason...

Maybe, but in this example you could only decide you were parallel
safe if the argument is an OID constant, which'd be pretty limiting.

If I were trying to find a better fix I'd be looking for ways for
parallel workers to be able to read the parent's temp tables.
(Perhaps that could tie in with the blue-sky discussion we had
the other day about allowing autovacuum on temp tables??)

regards, tom lane




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 14:21:35 -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> >> To date testing database cluster wraparound was not easy as initdb has
> >> always
> >> inited it with default xid/mxid/mxoff. The option to specify any valid
> >> xid/mxid/mxoff at cluster startup will make these things easier.
>
> > Doesn't pg_resetwal already provide that functionality, or at least some
> > of it?
>
> pg_resetwal does seem like a better, more useful home for this; it'd
> allow you to adjust these numbers after initial creation which might be
> useful.  I'm not sure how flexible it is right now in terms of where
> you can set the new values, but that can always be improved.

IIRC the respective pg_resetwal parameters are really hard to use for
something like this, because they don't actually create the respective
SLRU segments. We of course could fix that.

Greetings,

Andres Freund




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread David G. Johnston
On Mon, Nov 21, 2022 at 1:12 PM Tom Lane  wrote:

> Andres Freund  writes:
> > On 2022-11-21 12:52:01 -0500, Robert Haas wrote:
> >> On Mon, Nov 21, 2022 at 12:35 PM Tom Lane  wrote:
> >>> Why in the world is get_raw_page() marked as parallel safe?
> >>> It clearly isn't, given this restriction.
>
> > It's somewhat sad to add this restriction - I've used get_raw_page() (+
> > other functions) to scan a whole database for a bug. IIRC that actually
> > did end up using parallelism, albeit likely not very efficiently.
> > Don't really have a better idea though.
>
> Me either.
>
> > It may be worth inventing a framework where a function could analyze its
> > arguments (presumably via prosupport) to determine the degree of
> > parallel safety, but this doesn't seem sufficient reason...
>
> Maybe, but in this example you could only decide you were parallel
> safe if the argument is an OID constant, which'd be pretty limiting.
>
> If I were trying to find a better fix I'd be looking for ways for
> parallel workers to be able to read the parent's temp tables.
> (Perhaps that could tie in with the blue-sky discussion we had
> the other day about allowing autovacuum on temp tables??)
>
>
>
I don't suppose we want to just document the fact that these power-user
non-core functions are unable to process temporary tables safely without
first disabling parallelism for the session.
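
For completeness, "disabling parallelism for the session" amounts to
something like this before calling the functions on a temp table:

-- keep queries out of parallel workers in this session
SET max_parallel_workers_per_gather = 0;
-- and, where force_parallel_mode is in use (as on the failing animals):
SET force_parallel_mode = off;

SELECT octet_length(get_raw_page('test1', 'main', 0));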

David J.


Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-21 14:21:35 -0500, Tom Lane wrote:
>> pg_resetwal does seem like a better, more useful home for this; it'd
>> allow you to adjust these numbers after initial creation which might be
>> useful.  I'm not sure how flexible it is right now in terms of where
>> you can set the new values, but that can always be improved.

> IIRC the respective pg_resetwal parameters are really hard to use for
> something like this, because they don't actually create the respective
> SLRU segments. We of course could fix that.

Is that still true?  We should fix it, for sure.

regards, tom lane




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 15:12:15 -0500, Tom Lane wrote:
> If I were trying to find a better fix I'd be looking for ways for
> parallel workers to be able to read the parent's temp tables.
> (Perhaps that could tie in with the blue-sky discussion we had
> the other day about allowing autovacuum on temp tables??)

That'd be a nontrivial change, because we explicitly don't use any
locking for anything relating to localbuf.c. One possible benefit could
be that we could substantially reduce the code duplication between
"normal" bufmgr.c and localbuf.c.

Greetings,

Andres Freund




Re: Add 64-bit XIDs into PostgreSQL 15

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 15:16:46 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-21 14:21:35 -0500, Tom Lane wrote:
> >> pg_resetwal does seem like a better, more useful home for this; it'd
> >> allow you to adjust these numbers after initial creation which might be
> >> useful.  I'm not sure how flexible it is right now in terms of where
> >> you can set the new values, but that can always be improved.
> 
> > IIRC the respective pg_resetwal parameters are really hard to use for
> > something like this, because they don't actually create the respective
> > SLRU segments. We of course could fix that.
> 
> Is that still true?  We should fix it, for sure.

Sure looks that way to me. I think it might mostly work if you manage to
find nextXid, nextMulti, nextMultiOffset values that each point to the
start of a segment that'd then be created whenever those values are
used.

Greetings,

Andres Freund




Re: pgsql: Prevent instability in contrib/pageinspect's regression test.

2022-11-21 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-21 15:12:15 -0500, Tom Lane wrote:
>> If I were trying to find a better fix I'd be looking for ways for
>> parallel workers to be able to read the parent's temp tables.
>> (Perhaps that could tie in with the blue-sky discussion we had
>> the other day about allowing autovacuum on temp tables??)

> That'd be a nontrivial change, because we explicitly don't use any
> locking for anything relating to localbuf.c. One possible benefit could
> be that we could substantially reduce the code duplication between
> "normal" bufmgr.c and localbuf.c.

I didn't say this was easy ;-).  Aside from locking, the local buffers
are inside the process's address space and not accessible from outside.
Maybe they could be mapped into a shared memory region instead?
And there are optimizations like commit a7212be8b that depend on the
assumption that nothing else is accessing our process's temp tables.
That'd need a lot of thought, if we don't want to give up all the
performance benefits of temp tables.

regards, tom lane




fixing CREATEROLE

2022-11-21 Thread Robert Haas
The CREATEROLE permission is in a very bad spot right now. The biggest
problem that I know about is that it allows you to trivially access
the OS user account under which PostgreSQL is running, which is
expected behavior for a superuser but simply wrong behavior for any
other user. This is because CREATEROLE conveys powerful capabilities
not only to create roles but also to manipulate them in various ways,
including granting any non-superuser role in the system to any new or
existing user, including themselves. Since v11, the roles that can be
granted include pg_execute_server_program and pg_write_server_files
which are trivially exploitable. Perhaps this should have been treated
as an urgent security issue and a fix back-patched, although it is not
clear to me exactly what such a fix would look like. Since we haven't
done that, I went looking for a way to improve things in a principled
way going forward, taking advantage also of recent master-only work to
improve various aspects of the role grant system.

Here, I feel it important to point out that I think the current system
would be broken even if we didn't have predefined roles that are
trivially exploitable to obtain OS user access. We would still lack
any way to restrict the scope of the CREATEROLE privilege. Sure, the
privilege doesn't extend to superusers, but that's not really good
enough. Consider:

rhaas=# create role alice createrole;
CREATE ROLE
rhaas=# create role bob password 'known_only_to_bob';
CREATE ROLE
rhaas=# set session authorization alice;
SET
rhaas=> alter role bob password 'known_to_alice';
ALTER ROLE

Assuming that some form of password authentication is supported, alice
is basically empowered to break into any non-superuser account on the
system and assume all of its privileges. That's really not cool: it's
OK, I think, to give a non-superuser the right to change somebody
else's passwords, but it should be possible to limit it in some way,
e.g. to the users that alice creates. Also, while the ability to make
this sort of change seems to be the clear intention of the code, it's
not documented on the CREATE ROLE page. The problems with
pg_execute_server_program et. al. are not documented either; all it
says is that you should "regard roles that have the CREATEROLE
privilege as almost-superuser-roles," which seems to me to be
understating the extent of the problem.

I have drafted a few patches to try to improve the situation. It seems
to me that the root of any fix in this area must be to change the rule
that CREATEROLE can administer any role whatsoever. Instead, I propose
to change things so that you can only administer roles for which you
have ADMIN OPTION. Administering a role here includes changing the
password for a role, renaming a role, dropping a role, setting the
comment or security label on a role, or granting membership in that
role to another role, whether at role creation time or later. All of
these options are treated in essentially two places in the code, so it
makes sense to handle them all in a symmetric way. One problem with
this proposal is that, if we did exactly this much, then a CREATEROLE
user wouldn't be able to administer the roles which they themselves
had just created. That seems like it would be restricting the
privileges of CREATEROLE users too much.
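
To make that concrete with the roles from the example above (a sketch of
the intended behavior, not output from the patch):

SET SESSION AUTHORIZATION alice;
ALTER ROLE bob PASSWORD 'known_to_alice';  -- would now fail: no ADMIN OPTION on bob
RESET SESSION AUTHORIZATION;
GRANT bob TO alice WITH ADMIN OPTION;      -- a superuser could still hand this out
SET SESSION AUTHORIZATION alice;
ALTER ROLE bob PASSWORD 'known_to_alice';  -- ...after which this is allowed again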

To fix that, I propose when a non-superuser creates a role, the role
be implicitly granted back to the creator WITH ADMIN OPTION. This
arguably doesn't add any fundamentally new capability because the
CREATEROLE user could do something like "CREATE ROLE some_new_role
ADMIN myself" anyway, but that's awkward to remember and doing it
automatically seems a lot more convenient. However, there's a little
bit of trickiness here, too. Granting the new role back to the creator
might, depending on whether the INHERIT or SET flags are true or false
for the new grant, allow the CREATEROLE user to inherit the privileges
of, or set role to, the target role, which under current rules would
not be allowed. We can minimize behavior changes from the status quo
by setting up the new, implicit grant with SET FALSE, INHERIT FALSE.
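
Concretely, after the change a role created by a CREATEROLE user would
end up roughly as if the explicit form mentioned above had been used
(a sketch; "carol" is a made-up name):

-- what alice types:
CREATE ROLE carol;
-- roughly what the proposal records, equivalent to what she could
-- already spell out today:
--   CREATE ROLE carol ADMIN alice;
-- i.e. carol is granted to alice WITH ADMIN OPTION, but with the
-- grant's INHERIT and SET flags defaulting to FALSE, so alice can
-- administer carol without automatically inheriting or switching to
-- carol's privileges.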

However, that might not be what everyone wants. It's definitely not
what *I* want. I want a way for non-superusers to create new roles and
automatically inherit the privileges of those roles just as a
superuser automatically inherits everyone's privileges. I just don't
want the users who can do this to also be able to break out to the OS
as if they were superusers when they're not actually supposed to be.
However, it's clear from previous discussion that other people do NOT
want that, so I propose to make it configurable. Thus, this patch
series also proposes to add INHERITCREATEDROLES and SETCREATEDROLES
properties to roles. These have no meaning if the role is not marked
CREATEROLE. If it is, then they control the properties of the implicit
grant that happens when a CREATEROLE user who is not a superuser
creates a role. If INHERITCREATEDROLES is

Re: Understanding WAL - large amount of activity from removing data

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-20 19:02:12 -0700, David G. Johnston wrote:
> Both of these are written to the WAL, and a record is always written
> to the WAL as a self-contained unit, so the old record is full sized
> in the newly written WAL.

That's not really true. Normally the update record just logs the xmax,
offset, infomask for the old tuple. However, full_page_writes can lead
to the old tuple's whole page to be logged.

We do log the old tuple contents if the replica identity of the table is
set to 'FULL' - if you're using that, we'll indeed log the whole old
version of the tuple to the WAL.
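
That is quick to check ('tbl' standing in for your table name):

SELECT relname, relreplident FROM pg_class WHERE relname = 'tbl';

Here relreplident is 'f' for REPLICA IDENTITY FULL and 'd' for the default.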

I think the more likely explanation in this case is that deleting the
toast values holding the PDF data - which is what you're doing by
updating the value to 'redacted' - has to actually mark all those toast
tuples as deleted. That then likely causes a lot of full page writes.

In a case like this you might have better luck forcing the table to be
rewritten with something like

ALTER TABLE tbl ALTER COLUMN data TYPE text USING ('redacted');

which should just drop the old toast table, without going through it
one-by-one.

Greetings,

Andres Freund




Re: Proposal: Allow user with pg_monitor role to call pg_stat_reset* functions

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 00:16:20 -0800, sirisha chamarthi wrote:
> At present, calling pg_stat_reset* functions requires super user access
> unless explicitly grant execute permission on those. In this thread, I am
> proposing to grant execute on them to users with pg_monitor role
> permissions. This comes handy to the monitoring users (part of pg_monitor
> role) to capture the stats fresh and analyze. Do you see any concerns with
> this approach?

I think the common assumption is that a monitoring role cannot modify
the system, but this would change that. Normally a monitoring tool
should be able to figure out what changed in stats by comparing values
across time, rather than resetting stats.
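
And where a reset really is wanted, the existing per-function GRANT
already covers it for a specific role without widening pg_monitor,
e.g. (role name made up):

GRANT EXECUTE ON FUNCTION pg_stat_reset() TO metrics_agent;
GRANT EXECUTE ON FUNCTION pg_stat_reset_shared(text) TO metrics_agent;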

Greetings,

Andres Freund




Re: Proposal: Allow user with pg_monitor role to call pg_stat_reset* functions

2022-11-21 Thread Robert Haas
On Mon, Nov 21, 2022 at 3:45 PM Andres Freund  wrote:
> On 2022-11-21 00:16:20 -0800, sirisha chamarthi wrote:
> > At present, calling pg_stat_reset* functions requires super user access
> > unless explicitly grant execute permission on those. In this thread, I am
> > proposing to grant execute on them to users with pg_monitor role
> > permissions. This comes handy to the monitoring users (part of pg_monitor
> > role) to capture the stats fresh and analyze. Do you see any concerns with
> > this approach?
>
> I think the common assumption is that a monitoring role cannot modify
> the system, but this would change that. Normally a monitoring tool
> should be able to figure out what changed in stats by comparing values
> across time, rather than resetting stats.

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing power consumption on idle servers

2022-11-21 Thread Laurenz Albe
On Mon, 2022-11-21 at 12:11 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > The reason that I pushed back -- not as successfully as I would have
> > liked -- on the changes to pg_stop_backup / pg_start_backup is that I
> > know there are people using the old method successfully, and it's not
> > just a 1:1 substitution. Here I don't, and it is. I'm totally open to
> > the feedback that such people exist and to hearing why adopting one of
> > the newer methods would be a problem for them, if that's the case. But
> > if there's no evidence that such people exist or that changing is a
> > problem for them, I don't think waiting 5 years on principle is good
> > for the project.
> 
> We make incompatible changes in every release; see the release notes.
> Unless somebody can give a plausible use-case where this'd be a
> difficult change to deal with, I concur that we don't need to
> deprecate it ahead of time.

Since I am the only one that seems to worry, I'll shut up.  You are probably
right that the feature won't be missed by many users.

Yours,
Laurenz Albe




Re: [PATCH] Allow specification of custom slot for custom nodes

2022-11-21 Thread Alexander Korotkov
On Mon, Nov 21, 2022 at 4:34 PM Pavel Borisov  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested
>
> I've looked at this patch and don't see any problems with it. It is minimally
> invasive and doesn't affect functionality unless someone (e.g. an extension)
> sets its own slotOps in CustomScanState.
> Furthermore, the current patch very slightly modifies patch 0b03e5951bf0 with 
> the intention of introducing extensibility. So I think adding more 
> extensibility regarding different tuple formats is an excellent thing to do.
>
> I'm going to mark it as RfC if there are no objections.

Thank you for your feedback.  I also don't see how this patch could
affect anybody.
I'm going to push this if there are no objections.

--
Regards,
Alexander Korotkov




Re: More efficient build farm animal wakeup?

2022-11-21 Thread Andrew Dunstan


On 2022-11-20 Su 17:32, Thomas Munro wrote:
> On Sun, Nov 20, 2022 at 2:44 AM Andrew Dunstan  wrote:
>> It might not suit your use case, but one of the things I do to reduce
>> fetch load is to run a local mirror which runs
>>
>>git fetch -q --prune
>>
>> every 5 minutes. It also runs a git daemon, and several of my animals
>> point at that.
> Thanks.  I understand now that my configuration without a local mirror
> is super inefficient (it spends the first ~25s of each minute running
> git commands).  Still, even though that can be improved by me setting
> up more stuff, I'd like something event-driven rather than short
> polling-based for lower latency.
>
>> If there's a better git API I'll be happy to try to use it.
> Cool.  Seems like we just have to invent something first...
>
> FWIW I'm also trying to chase the short polling out of cfbot.  It
> regularly harasses the git servers at one end (could be fixed with
> this approach), and wastes a percentage of our allotted CPU slots on
> the other end by scheduling periodically (could be fixed with webhooks
> from Cirrus).



I think I have solved most of the actual issues without getting too complex.

Here's how:

The buildfarm server now creates a companion to branches_of_interest.txt
called branches_of_interest.json which looks like this:

[
   {
  "REL_11_STABLE" : "140c803723"
   },
   {
  "REL_12_STABLE" : "4cbcb7ed85"
   },
   {
  "REL_13_STABLE" : "c13667b518"
   },
   {
  "REL_14_STABLE" : "5cda142bb9"
   },
   {
  "REL_15_STABLE" : "ff9d27ee2b"
   },
   {
  "HEAD" : "51b5834cd5"
   }
]

It updates this every time it does a git fetch, currently every 5 minutes.

run_branches.pl fetches this file instead of the plain list of branches,
and before running run_build.pl checks if the given commit was the
latest one tested, and if so and a build isn't being forced, skips the
branch. Thus, in the case where all the branches are up to date there
will be no git calls whatsoever.

You can try it out by getting run_branches.pl from



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote:
> After some investigation, I think the problem is in snapbuild.c
> (commit 272248a0c1b1, see [0]). We allocate the InitialRunningXacts
> array in builder->context, but by the time we call
> SnapBuildPurgeOlderTxn that context may already have been freed. Thus,
> the InitialRunningXacts array becomes an array of 2139062143
> (i.e. 0x7F7F7F7F) values. This has not caused the buildfarm to fail,
> due to that code:

Amit, that does indeed seem to be a problem...

Greetings,

Andres Freund



