Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-31 Thread Bharath Rupireddy
On Mon, Feb 1, 2021 at 12:43 PM Fujii Masao  wrote:
> On 2021/01/27 10:06, Bharath Rupireddy wrote:
> > On Tue, Jan 26, 2021 at 8:38 AM Bharath Rupireddy
> >  wrote:
> >> I will post "keep_connections" GUC and "keep_connection" server level
> >> option patches later.
> >
> > Attaching v19 patch set for "keep_connections" GUC and
> > "keep_connection" server level option. Please review them further.
>
> Are these options still necessary now that we support idle_session_timeout?
> If we want to disconnect foreign server connections that sit idle, to prevent
> them from eating up the connection capacity of the foreign servers, we can
> just set idle_session_timeout on those foreign servers. If we want to avoid a
> cluster-wide setting of idle_session_timeout, we can set it per role. One
> issue with this approach is that the connection entry remains even after
> idle_session_timeout fires, so postgres_fdw_get_connections() returns that
> connection even though it has actually been closed by the timeout. That is
> confusing, but it doesn't cause any actual problem, right? When the foreign
> table is accessed the next time, that connection entry is dropped, an error
> is detected, and then a new connection is made.

First of all, idle_session_timeout is disabled by default (0), so there is a
chance that users will not use it and will not want to set it just to avoid
caching foreign server connections. A simple use case where the server-level
option is useful: users access foreign tables (perhaps not that frequently,
once in a while) from a long-running local session and don't want the local
session to cache those connections. Setting the server-level option
keep_connections to false makes their life easier, without having to depend
on setting idle_session_timeout on the remote server.

And just using idle_session_timeout on a remote server may not help us
completely, because the remote session may go away while we are still using
that cached connection in an explicit transaction on the local session. Our
connection retry will also not work because we are in the middle of a
transaction, so the local explicit transaction gets aborted.

So, IMO, we can still have both the server-level option and the postgres_fdw
contrib-level GUC (to tell the local session "I don't want to keep any
foreign connections active" instead of setting the keep_connection
server-level option for each foreign server).

> Sorry, I've not read the past long discussion about this feature. If there is
> a consensus that these options are still necessary and useful even when we
> have idle_session_timeout, please correct me.
>
> ISTM that it's intuitive (at least for me) to add this kind of option to the
> foreign server. But I'm not sure if it's a good idea to expose the option as
> a GUC. Also, if there is a consensus about this, please correct me.

See here [1].

[1] - 
https://www.postgresql.org/message-id/f58d1df4ae58f6cf3bfa560f923462e0%40postgrespro.ru

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Peter Smith
On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila  wrote:

> > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > slots. The DropSubscription might be running at exactly the same time
> > > > the apply worker has just dropped the very same tablesync slot.
> > > >
> > >
> > > We stopped the workers before getting the list of NotReady relations, and
> > > then we try to drop the corresponding slots. So, how can such a race
> > > condition happen? Note that because we have a lock on pg_subscription,
> > > there is no chance that the workers can restart till the transaction
> > > ends.
> >
> > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > into a loop waiting for the worker process to die. So even if the
> > tablesync worker does simultaneously drop its own slot, I think it
> > will certainly at least be in SYNCDONE state before DropSubscription
> > does anything else with that worker.
> >
>
> How is that ensured? We don't have anything like HOLD_INTERRUPTS
> between the time we dropped the slot and the time we updated the rel state
> to SYNCDONE. So, isn't it possible that after we dropped the slot and before
> we update the state, the SIGTERM signal arrives and leads to a worker exit?
>

The worker's SIGTERM handler is "die". IIUC the "die" function
doesn't normally do anything except set some flags to say "please die
at the next convenient opportunity". My understanding is that the
worker process will not actually exit until it next executes
CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
and *really* die. So even if the SIGTERM signal arrives immediately
after the slot is dropped, the tablesync will still reach SYNCDONE state.
Is this understanding wrong?

But your scenario could still be possible if "die" exited immediately
(e.g. only in single user mode?).
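For illustration, here is a minimal standalone sketch of that "set a flag in
the handler, act on it at the next safe point" pattern. This is not the
actual die()/CHECK_FOR_INTERRUPTS() implementation; the names below are
illustrative only.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t die_pending = 0;

/* Like "die": only record the request; do no real work in the handler. */
static void
handle_sigterm(int signo)
{
    (void) signo;               /* unused */
    die_pending = 1;
}

/* Like CHECK_FOR_INTERRUPTS(): the only place we actually act on it. */
static void
check_for_interrupts(void)
{
    if (die_pending)
    {
        fprintf(stderr, "terminating at a safe point\n");
        exit(1);
    }
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigterm;
    sigaction(SIGTERM, &sa, NULL);

    for (;;)
    {
        /* ... one unit of work (e.g. drop the slot, then update the state) ... */
        sleep(1);

        /* Even if SIGTERM arrived mid-work, we only notice it here. */
        check_for_interrupts();
    }
    return 0;
}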


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-31 Thread Bharath Rupireddy
On Mon, Feb 1, 2021 at 12:29 PM Fujii Masao  wrote:
> On 2021/01/30 9:28, Bharath Rupireddy wrote:
> > On Sat, Jan 30, 2021 at 12:14 AM Fujii Masao
> >  wrote:
> >> +   /*
> >> +    * It doesn't make sense to show this entry in the output with a
> >> +    * NULL server_name as it will be closed at the xact end.
> >> +    */
> >> +   continue;
> >>
> >> -1 with this change because I still think that it's more useful to list
> >> all the open connections.
> >
> > If postgres_fdw_get_connections doesn't have a "valid" column, then I
> > thought it's better not to show a NULL server_name in the output.
>
> Or, if we don't have a strong reason to remove the "valid" column,
> is the current design enough?

My only worry was the statement from [1]: "A cache flush should
not cause user-visible state changes." But the newly added function
postgres_fdw_get_connections() is VOLATILE, which means the results it
returns are volatile as well. Isn't this enough, so that users will not be
surprised to see different results when invalidations occur within the
server by the time they run the query again?

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/2724627.1611886184%40sss.pgh.pa.us


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-31 Thread Fujii Masao




On 2021/01/27 10:06, Bharath Rupireddy wrote:

On Tue, Jan 26, 2021 at 8:38 AM Bharath Rupireddy
 wrote:

I will post "keep_connections" GUC and "keep_connection" server level
option patches later.


Attaching v19 patch set for "keep_connections" GUC and
"keep_connection" server level option. Please review them further.


Are these options still necessary now that we support idle_session_timeout?
If we want to disconnect foreign server connections that sit idle, to prevent
them from eating up the connection capacity of the foreign servers, we can
just set idle_session_timeout on those foreign servers. If we want to avoid a
cluster-wide setting of idle_session_timeout, we can set it per role. One
issue with this approach is that the connection entry remains even after
idle_session_timeout fires, so postgres_fdw_get_connections() returns that
connection even though it has actually been closed by the timeout. That is
confusing, but it doesn't cause any actual problem, right? When the foreign
table is accessed the next time, that connection entry is dropped, an error
is detected, and then a new connection is made.

Sorry, I've not read the past long discussion about this feature. If there is
a consensus that these options are still necessary and useful even when we
have idle_session_timeout, please correct me.

ISTM that it's intuitive (at least for me) to add this kind of option to the
foreign server. But I'm not sure if it's a good idea to expose the option as
a GUC. Also, if there is a consensus about this, please correct me.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-31 Thread Fujii Masao




On 2021/01/30 9:28, Bharath Rupireddy wrote:

On Sat, Jan 30, 2021 at 12:14 AM Fujii Masao
 wrote:

+   /*
+    * It doesn't make sense to show this entry in the output with a
+    * NULL server_name as it will be closed at the xact end.
+    */
+   continue;

-1 with this change because I still think that it's more useful to list
all the open connections.


If postgres_fdw_get_connections doesn't have a "valid" column, then I
thought it's better not to show a NULL server_name in the output.


Or, if we don't have a strong reason to remove the "valid" column,
is the current design enough?



Do you
think that we need to output some fixed placeholder string for such
connections? I'm not sure whether we are allowed to have fixed strings
as column output.


This makes me think that more discussion would be necessary before
changing the interface of postgres_fdw_get_connections(). On the other
hand, we should address the issue ASAP to make the buildfarm member happy.
So at first I'd like to push only the regression test change.
Patch attached. I tested it both with CLOBBER_CACHE_ALWAYS set and unset,
and the results were stable.


Thanks, the postgres_fdw.patch looks good to me.


Thanks for checking the patch! I pushed that.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Is Recovery actually paused?

2021-01-31 Thread Kyotaro Horiguchi
At Sun, 31 Jan 2021 11:24:30 +0530, Dilip Kumar  wrote 
in 
> On Fri, Jan 29, 2021 at 4:33 PM Dilip Kumar  wrote:
> >
> > On Fri, Jan 29, 2021 at 3:25 PM Yugo NAGATA  wrote:
> > >
> > > On Thu, 28 Jan 2021 09:55:42 +0530
> > > Dilip Kumar  wrote:
> > >
> > > > On Wed, Jan 27, 2021 at 2:28 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, 27 Jan 2021 13:29:23 +0530
> > > > > > Dilip Kumar  wrote:
> > > > > >
> > > > > > > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> > > > > > > > >  wrote:
> > > > > > > > > > +1 to just show the recovery pause state in the output of
> > > > > > > > > > pg_is_wal_replay_paused. But, should the function name
> > > > > > > > > > "pg_is_wal_replay_paused" be something like
> > > > > > > > > > "pg_get_wal_replay_pause_state" or some other? To me, when 
> > > > > > > > > > "is" exists
> > > > > > > > > > in a function, I expect a boolean output. Others may have 
> > > > > > > > > > better
> > > > > > > > > > thoughts.
> > > > > > > > >
> > > > > > > > > Maybe we should leave the existing function 
> > > > > > > > > pg_is_wal_replay_paused()
> > > > > > > > > alone and add a new one with the name you suggest that 
> > > > > > > > > returns text.
> > > > > > > > > That would create less burden for tool authors.
> > > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > >
> > > > > > > Yeah, we can do that, I will send an updated patch soon.
> > > > > >
> > > > > > This means pg_is_wal_replay_paused is left without any change and 
> > > > > > this
> > > > > > returns whether pause is requested or not? If so, it seems good to 
> > > > > > modify
> > > > > > the documentation of this function in order to note that this could 
> > > > > > not
> > > > > > return the actual pause state.
> > > > >
> > > > > Yes, we can say that it will return true if the replay pause is
> > > > > requested.  I am changing that in my new patch.
> > > >
> > > > I have modified the patch, changes
> > > >
> > > > - I have added a new interface pg_get_wal_replay_pause_state to get
> > > > the pause request state
> > > > - Now that we are not waiting for the recovery to actually get paused, I
> > > > think it doesn't make sense to put a lot of check points for the pause
> > > > request, so I have removed that check from recoveryApplyDelay. But I
> > > > think it is better to still keep that check in WaitForWalToBecomeAvailable
> > > > because it can wait forever before the next WAL becomes available.
> > >
> > > I think basically the check in WaitForWalToBecomeAvailable is independent
> > > of the feature of pg_get_wal_replay_pause_state, that is, reporting the
> > > actual pause state.  This function could just return 'pause requested'
> > > if a pause is requested during waiting for WAL.
> > >
> > > However, I agree with the change to allow recovery to transition to
> > > 'paused' while waiting for WAL, because 'paused' gives users more useful
> > > information than 'pause requested'.  Returning 'paused' lets users know
> > > clearly that no more WAL is applied until recovery is resumed.  On the
> > > other hand, when 'pause requested' is returned, users can't tell from this
> > > information whether the next WAL will be applied or not.
> > >
> > > For the same reason, I think it is also useful to call recoveryPausesHere
> > > in recoveryApplyDelay.
> >
> > IMHO WaitForWalToBecomeAvailable can wait until the next WAL becomes
> > available, which cannot be controlled by the user, so it is good to put a
> > check for the recovery pause there; however, recoveryApplyDelay waits for
> > the apply delay, which is configured by the user and is a predictable
> > value.  I don't have much objection to putting that check in
> > recoveryApplyDelay as well, but I feel it is not necessary.  Any other
> > thoughts on this?
> >
> > > In addition, in RecoveryRequiresIntParameter, recovery should get paused
> > > if a parameter value has a problem.  However, 
> > > pg_get_wal_replay_pause_state
> > > will return 'pause requested' in this case. So, I think, we should pass
> > > RECOVERY_PAUSED to SetRecoveryPause() instead of RECOVERY_PAUSE_REQUESTED,
> > > or call CheckAndSetRecoveryPause() in the loop like recoveryPausesHere().
> >
> > Yeah, absolutely right, it must pass RECOVERY_PAUSED.  I will change
> > this, thanks for noticing this.
> 
> I have changed this in the new patch.

It seems to work well. The checks seem to be placed properly.

+SetRecoveryPause(RecoveryPauseState state)
 {
+   Assert(state >= RECOVERY_NOT_PAUSED && state <= RECOVERY_PAUSED);

I'm not sure that state is worth a FATAL.  Isn't it enough to just ERROR
out like XLogFileRead?

CheckAndSetRecovery() has only one caller.  

Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread David Rowley
On Mon, 1 Feb 2021 at 18:57, Tom Lane  wrote:
>
> I wrote:
> > David Rowley  writes:
> >> Parent RT indexes are guaranteed to be lower than their children RT
> >> indexes,
>
> > I was intentionally avoiding that assumption ;-).  Maybe it buys enough
> > to be worth the loss of generality, but ...
>
> Oh, it's too late at night.  I now remember that the real problem
> I had with that representation was that it cannot work for joinrels.
> Currently we only apply this logic to partitioned baserels, but
> don't you think it might someday be called on to optimize
> partitionwise joins?

I've not looked in detail, but I think the code would need a pretty
big overhaul before that could happen. For example, ever since we
allowed ATTACH PARTITION to work without taking an AEL we now have a
PartitionedRelPruneInfo.relid_map field that stores Oids for the
executor to look at to see if it can figure out if a partition has
been added since the plan was generated.  Not sure how that can work
with non-base rels as we have no Oid for join rels.  Perhaps I'm just
not thinking hard enough, but either way, it does seem like it would
take a pretty big hit with a hammer to make it work. My current
thinking is that being unable to represent join rels in a set of
Relids is fairly insignificant compared to what would be required to
get the feature to work correctly.

David




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Amit Kapila
On Mon, Feb 1, 2021 at 11:23 AM Peter Smith  wrote:
>
> On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila  wrote:
> >
> > On Mon, Feb 1, 2021 at 9:38 AM Peter Smith  wrote:
> > >
> > > > I think this is true only when the user specifically requested it by
> > > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > > Otherwise, we give an error on a broken connection. Also, if that is
> > > > true then is there a reason to pass missing_ok as true while dropping
> > > > tablesync slots?
> > > >
> > >
> > > AFAIK there is always a potential race with DropSubscription dropping
> > > slots. The DropSubscription might be running at exactly the same time
> > > the apply worker has just dropped the very same tablesync slot.
> > >
> >
> > We stopped the workers before getting the list of NotReady relations, and
> > then we try to drop the corresponding slots. So, how can such a race
> > condition happen? Note that because we have a lock on pg_subscription,
> > there is no chance that the workers can restart till the transaction
> > ends.
>
> OK. I think I was forgetting the logicalrep_worker_stop would also go
> into a loop waiting for the worker process to die. So even if the
> tablesync worker does simultaneously drop its own slot, I think it
> will certainly at least be in SYNCDONE state before DropSubscription
> does anything else with that worker.
>

How is that ensured? We don't have anything like HOLD_INTERRUPTS
between the time we dropped the slot and the time we updated the rel state
to SYNCDONE. So, isn't it possible that after we dropped the slot and before
we update the state, the SIGTERM signal arrives and leads to a worker exit?

-- 
With Regards,
Amit Kapila.




Re: Faulty HEAP_XMAX_LOCK_ONLY & HEAP_KEYS_UPDATED hintbit combination

2021-01-31 Thread Dilip Kumar
On Sun, Jan 24, 2021 at 9:31 PM Alvaro Herrera  wrote:
>
> On 2021-Jan-24, Julien Rouhaud wrote:
>
> > + /*
> > +  * Do not allow tuples with invalid combinations of hint bits to be 
> > placed
> > +  * on a page.  These combinations are detected as corruption by the
> > +  * contrib/amcheck logic, so if you disable one or both of these
> > +  * assertions, make corresponding changes there.
> > +  */
> > + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> > +  (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
> >
> >
> > I attach a simple self contained script to reproduce the problem, the last
> > UPDATE triggering the Assert.
> >
> > I'm not really familiar with this part of the code, so it's not exactly 
> > clear
> > to me if some logic is missing in compute_new_xmax_infomask() /
> > heap_prepare_insert(), or if this should actually be an allowed combination 
> > of
> > hint bit.
>
> Hmm, it's probably a bug in compute_new_xmax_infomask.  I don't think
> the combination is sensible.
>

If we look at the logic of GetMultiXactIdHintBits, it appears that we
can get this combination in the multixact case.

switch (members[i].status)
{
    ...
    case MultiXactStatusForUpdate:
        bits2 |= HEAP_KEYS_UPDATED;
        break;
}

if (!has_update)
    bits |= HEAP_XMAX_LOCK_ONLY;

Basically, if it is "select for update" then we will mark infomask2 as
HEAP_KEYS_UPDATED and the informask as HEAP_XMAX_LOCK_ONLY.
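To make that combination concrete, here is a minimal standalone sketch (not
the PostgreSQL source; the reduced status enum and the bit values are
illustrative only) modelling how a multixact whose only member is a FOR
UPDATE locker ends up with both HEAP_KEYS_UPDATED and HEAP_XMAX_LOCK_ONLY:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define HEAP_XMAX_LOCK_ONLY  0x0080     /* illustrative value */
#define HEAP_KEYS_UPDATED    0x2000     /* illustrative value */

typedef enum
{
    MEMBER_FOR_KEY_SHARE,   /* SELECT FOR KEY SHARE */
    MEMBER_FOR_UPDATE,      /* SELECT FOR UPDATE (a lock, not an update) */
    MEMBER_UPDATE           /* a real UPDATE */
} MemberStatus;

static void
compute_hint_bits(const MemberStatus *members, int n,
                  uint16_t *infomask, uint16_t *infomask2)
{
    bool    has_update = false;

    *infomask = 0;
    *infomask2 = 0;
    for (int i = 0; i < n; i++)
    {
        switch (members[i])
        {
            case MEMBER_FOR_UPDATE:
                *infomask2 |= HEAP_KEYS_UPDATED;    /* set even for a locker */
                break;
            case MEMBER_UPDATE:
                *infomask2 |= HEAP_KEYS_UPDATED;
                has_update = true;
                break;
            default:
                break;
        }
    }
    if (!has_update)
        *infomask |= HEAP_XMAX_LOCK_ONLY;           /* no member updated the tuple */
}

int
main(void)
{
    /* a multixact whose only member is a SELECT FOR UPDATE locker */
    MemberStatus members[] = {MEMBER_FOR_UPDATE};
    uint16_t    infomask;
    uint16_t    infomask2;

    compute_hint_bits(members, 1, &infomask, &infomask2);
    printf("LOCK_ONLY=%d KEYS_UPDATED=%d\n",
           (infomask & HEAP_XMAX_LOCK_ONLY) != 0,
           (infomask2 & HEAP_KEYS_UPDATED) != 0);   /* prints 1 1 */
    return 0;
}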

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> Parent RT indexes are guaranteed to be lower than their children RT
>> indexes,

> I was intentionally avoiding that assumption ;-).  Maybe it buys enough
> to be worth the loss of generality, but ...

Oh, it's too late at night.  I now remember that the real problem
I had with that representation was that it cannot work for joinrels.
Currently we only apply this logic to partitioned baserels, but
don't you think it might someday be called on to optimize
partitionwise joins?

regards, tom lane




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Peter Smith
On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila  wrote:
>
> On Mon, Feb 1, 2021 at 9:38 AM Peter Smith  wrote:
> >
> > On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila  wrote:
> > >
> > > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith  wrote:
> > > >
> > > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > I have made the below changes in the patch. Let me know what you think
> > > > > about these?
> > > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > > where we drop the slots at the end after finishing all the other
> > > > > cleanup.
> > > >
> > > > There was a reason why the v22 logic was different from HEAD.
> > > >
> > > > The broken connection leaves dangling slots which is unavoidable.
> > > >
> > >
> > > I think this is true only when the user specifically requested it by
> > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > Otherwise, we give an error on a broken connection. Also, if that is
> > > true then is there a reason to pass missing_ok as true while dropping
> > > tablesync slots?
> > >
> >
> > AFAIK there is always a potential race with DropSubscription dropping
> > slots. The DropSubscription might be running at exactly the same time
> > the apply worker has just dropped the very same tablesync slot.
> >
>
> We stopped the workers before getting the list of NotReady relations, and
> then we try to drop the corresponding slots. So, how can such a race
> condition happen? Note that because we have a lock on pg_subscription,
> there is no chance that the workers can restart till the transaction
> ends.

OK. I think I was forgetting the logicalrep_worker_stop would also go
into a loop waiting for the worker process to die. So even if the
tablesync worker does simultaneously drop its own slot, I think it
will certainly at least be in SYNCDONE state before DropSubscription
does anything else with that worker.

>
> > By
> > saying missing_ok = true it means DropSubscription would not give
> > ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> > with an unexpected error.
> >
> > >
> > > > But,
> > > > whereas the user knows the name of the Subscription slot (they named
> > > > it), there is no easy way for them to know the names of the remaining
> > > > tablesync slots unless we log them.
> > > >
> > > > That is why the v22 code was written to process the tablesync slots
> > > > even for wrconn == NULL so their name could be logged:
> > > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > > syncslotname);
> > > >
> > > > The v23 patch removed this dangling slot name information, so it makes
> > > > it difficult for the user to know what tablesync slots to cleanup.
> > > >
> > >
> > > Okay, then can we think of combining with the existing error of the
> > > replication slot? I think that might produce a very long message, so
> > > another idea could be to LOG a separate WARNING for each such slot
> > > just before giving the error.
> >
> > There may be many subscribed tables, so I agree that combining them into one
> > message might be too long. Yes, we can add another loop to output the
> > necessary information. But isn’t logging a WARNING for each tablesync slot
> > before the subscription slot ERROR exactly the behaviour which already
> > existed in v22? IIUC the DropSubscription refactoring in v23 was not
> > done for any functional change, but was intended only to make the code
> > simpler, but how is that goal achieved if v23 ends up needing 3 loops
> > where v22 only needed 1 loop to do the same thing?
> >
>
> No, there is a functionality change as well. The way we have the code in
> v22 can easily lead to a problem when we have dropped the slots but then
> get an error while removing origins or an entry from the subscription rel.
> In such cases, we won't be able to roll back the drop of the slots, but the
> other database operations will be rolled back. This is the reason we
> have to drop the slots at the end. We need to ensure the same thing
> for AlterSubscription_refresh. Does this make sense now?
>

OK.


Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread Tom Lane
David Rowley  writes:
> What I can't understand is why you changed to a List-of-Lists rather
> than a List-of-Relids.

Yeah, I spent no effort on micro-optimizing the data structure.  I figured
that since we were not including leaf partitions, there would never be
enough rels involved to worry about.  Perhaps that's wrong though.

> Parent RT indexes are guaranteed to be lower than their children RT
> indexes,

I was intentionally avoiding that assumption ;-).  Maybe it buys enough
to be worth the loss of generality, but ...

regards, tom lane




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Amit Kapila
On Mon, Feb 1, 2021 at 10:14 AM Amit Kapila  wrote:
>
> On Mon, Feb 1, 2021 at 9:38 AM Peter Smith  wrote:
> >
> > On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila  wrote:
> > >
> > > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith  wrote:
> > > >
> > > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > I have made the below changes in the patch. Let me know what you think
> > > > > about these?
> > > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > > where we drop the slots at the end after finishing all the other
> > > > > cleanup.
> > > >
> > > > There was a reason why the v22 logic was different from HEAD.
> > > >
> > > > The broken connection leaves dangling slots which is unavoidable.
> > > >
> > >
> > > I think this is true only when the user specifically requested it by
> > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > Otherwise, we give an error on a broken connection. Also, if that is
> > > true then is there a reason to pass missing_ok as true while dropping
> > > tablesync slots?
> > >
> >
> > AFAIK there is always a potential race with DropSubscription dropping
> > slots. The DropSubscription might be running at exactly the same time
> > the apply worker has just dropped the very same tablesync slot.
> >
>
> We stopped the workers before getting the list of NotReady relations, and
> then we try to drop the corresponding slots. So, how can such a race
> condition happen?
>

I think it is possible that the state is still not SYNCDONE but the
slot is already dropped, so here we should be prepared for the missing
slot.


-- 
With Regards,
Amit Kapila.




Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread David Rowley
On Sun, 31 Jan 2021 at 11:42, Tom Lane  wrote:
>
> For simplicity of review I divided the patch into two parts.
> 0001 revises make_partition_pruneinfo() and children to identify
> the relevant parent partitions for themselves, which is not too
> hard to do by chasing up the child-to-parent AppendRelInfo links.
> Not formerly documented, AFAICT, is that we want to collect just
> the parent partitions that are between the Append path's own rel
> (possibly equal to it) and the subpaths' rels.  I'd first tried
> to code this by using the top_parent_relids and part_rels[] links
> in the RelOptInfos, but that turns out not to work.  We might
> ascend to a top parent that's above the Append's rel (if considering
> an appendpath for a sub-partition, which happens in partitionwise
> join).  We could also descend to a child at or below some subpath
> level, since there are also cases where subpaths correspond to
> unflattened non-leaf partitions.  Either of those things result
> in failures.  But once you wrap your head around handling those
> restrictions, it's quite simple.

I had a look at this one and it all makes sense and the logic for
obtaining the lineage of parent partitioned tables seems fine.

What I can't understand is why you changed to a List-of-Lists rather
than a List-of-Relids. This makes the patch both bigger than it needs
to be and the processing quite a bit less efficient.

For example, in make_partition_pruneinfo() when you're walking up to
the top-level target partitioned table you must lcons() each new list
member to ensure the children always come after the parents. These
chains are likely to be short so the horrible overheads of the
memmove() in lcons() won't cost that much, but there's more to it.
The other seemingly needlessly slow part is the
list_concat_unique_ptr() call. This needs to be done for every subpath
in the Append/MergeAppend. It would be good to get rid of that.
Granted, the lists are most likely going to be small, but that's still a
quadratic function, so it seems like a good idea to avoid using
it if there is another way to do it.

The memory allocations could also be more efficient for Relids rather
than Lists.  Since we're working up from the child to the parent in
the lineage calculation code in make_partition_pruneinfo(), we'll
always allocate the Bitmapset to the correct size right away, rather
than possibly having to increase the size if the next RT index were
not to fit in the current number of words. Of course, with a
List-of-Lists, not every lcons() would require a new allocation, but
there's an above zero chance that it might.

I've attached a version of your 0001 patch which just maintains using
a List-of-Relids. This shrinks the diff down about 3kb.

Parent RT indexes are guaranteed to be lower than their children RT
indexes, so it's pretty simple to figure out the target RT index by
just looking at the lowest set bit.  Doing it this way also simplifies
things as add_part_rel_list() no longer must insist that the sublists
are in parent-to-child order.
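As an aside, here is a minimal standalone sketch of that "lowest set bit"
idea (this is not the real Bitmapset code; the word layout and names are
illustrative only):

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Return the index of the lowest set bit across the words, or -1 if none. */
static int
lowest_member(const uint64_t *words, int nwords)
{
    for (int w = 0; w < nwords; w++)
    {
        uint64_t    word = words[w];

        if (word != 0)
        {
            int         b = 0;

            while ((word & 1) == 0)
            {
                word >>= 1;
                b++;
            }
            return w * BITS_PER_WORD + b;
        }
    }
    return -1;
}

int
main(void)
{
    /* RT indexes 3 (the parent) plus 7 and 9 (children) set in one word */
    uint64_t    relids[1] = {(1ULL << 3) | (1ULL << 7) | (1ULL << 9)};

    /* Since parents have lower RT indexes, the lowest bit is the target. */
    printf("target parent RT index: %d\n", lowest_member(relids, 1));
    return 0;
}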

David
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 42c7c5f554..447bc2f78e 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -138,6 +138,7 @@ typedef struct PruneStepResult
 } PruneStepResult;
 
 
+static List *add_part_rel_list(List *partrellists, Relids partrels);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
                                            RelOptInfo *parentrel,
                                            int *relid_subplan_map,
@@ -213,59 +214,94 @@ static void partkey_datum_from_expr(PartitionPruneContext *context,
  *
  * 'parentrel' is the RelOptInfo for an appendrel, and 'subpaths' is the list
  * of scan paths for its child rels.
- *
- * 'partitioned_rels' is a List containing Lists of relids of partitioned
- * tables (a/k/a non-leaf partitions) that are parents of some of the child
- * rels.  Here we attempt to populate the PartitionPruneInfo by adding a
- * 'prune_infos' item for each sublist in the 'partitioned_rels' list.
- * However, some of the sets of partitioned relations may not require any
- * run-time pruning.  In these cases we'll simply not include a 'prune_infos'
- * item for that set and instead we'll add all the subplans which belong to
- * that set into the PartitionPruneInfo's 'other_subplans' field.  Callers
- * will likely never want to prune subplans which are mentioned in this field.
- *
- * 'prunequal' is a list of potential pruning quals.
+ * 'prunequal' is a list of potential pruning quals (i.e., restriction
+ * clauses that are applicable to the appendrel).
  */
 PartitionPruneInfo *
 make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
-                         List *subpaths, List *partitioned_rels,
+ 

Re: Printing backtrace of postgres processes

2021-01-31 Thread Bharath Rupireddy
On Mon, Feb 1, 2021 at 6:14 AM Bharath Rupireddy
 wrote:
> On Fri, Jan 29, 2021 at 7:10 PM vignesh C  wrote:
> > > 4) How about following
> > > + errmsg("must be a superuser to print backtrace
> > > of backend process")));
> > > instead of
> > > + errmsg("must be a superuser to print backtrace
> > > of superuser query process")));
> > >
> >
> > Here the message should include superuser; we cannot remove it. A
> > non-superuser can log a non-superuser's backtrace, provided the user has
> > permissions for it.
> >
> > > 5) How about following
> > >  errmsg("must be a member of the role whose backed
> > > process's backtrace is being printed or member of
> > > pg_signal_backend")));
> > > instead of
> > > + errmsg("must be a member of the role whose
> > > backtrace is being logged or member of pg_signal_backend")));
> > >
> >
> > Modified it.
>
> Maybe I'm confused here about the difference between the
> SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and
> the corresponding error messages. Some clarification or a use case showing
> in which scenarios we hit those error messages might help me. Did we try
> to add test cases that exercise these error messages for
> pg_print_backtrace? If not, can we add them?

Are these superuser and permission checks enough from a security
standpoint to ensure we don't expose sensitive information to the
user? I'm not sure: say, from the printed backtrace or by attaching
GDB, can users see passwords or other sensitive
information from the system that they aren't supposed to see?

I'm sure this point would have been discussed upthread.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Add extra statistics to explain for Nested Loop

2021-01-31 Thread Julien Rouhaud
On Thu, Jan 28, 2021 at 8:38 PM Yugo NAGATA  wrote:
>
> postgres=# explain (analyze, verbose) select * from a,b where a.i=b.j;
>   
>   QUERY PLAN
> ---
>  Nested Loop  (cost=0.00..2752.00 rows=991 width=8) (actual 
> time=0.021..17.651 rows=991 loops=1)
>Output: a.i, b.j
>Join Filter: (a.i = b.j)
>Rows Removed by Join Filter: 99009
>->  Seq Scan on public.b  (cost=0.00..2.00 rows=100 width=4) (actual 
> time=0.009..0.023 rows=100 loops=1)
>  Output: b.j
>->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> time=0.005..0.091 min_time=0.065 max_time=0.163 min_rows=1000 rows=1000 
> max_rows=1000 loops=100)
>  Output: a.i
>  Planning Time: 0.066 ms
>  Execution Time: 17.719 ms
> (10 rows)
>
> I don't like this format where the extra statistics appear on the same
> line as the existing information, because the output format differs
> depending on whether the plan node's loops > 1 or not, and it makes the
> line too long. Also, other information reported by VERBOSE doesn't change
> the existing row format and just adds extra rows for the new information.
>
> Instead, it seems better to me to add extra rows for the new statistics
> without changing the existing row format, like other VERBOSE information,
> as below.
>
>->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> time=0.005..0.091 rows=1000  loops=100)
>  Output: a.i
>  Min Time: 0.065 ms
>  Max Time: 0.163 ms
>  Min Rows: 1000
>  Max Rows: 1000
>
> or, like Buffers,
>
>->  Seq Scan on public.a  (cost=0.00..15.00 rows=1000 width=4) (actual 
> time=0.005..0.091 rows=1000  loops=100)
>  Output: a.i
>  Loops: min_time=0.065 max_time=0.163 min_rows=1000 max_rows=1000
>
> and so  on. What do you think about it?

It's true that the current output is a bit long, which isn't really
convenient to read.  Using one of those alternative formats would also
have the advantage of not breaking compatibility with tools that
process those entries.  I personally prefer the 2nd option with the
extra "Loops:" line.  For non-text formats, should we keep the current
format?

> 2)
> In parallel scan, the extra statistics are not reported correctly.
>
> postgres=# explain (analyze, verbose) select * from a,b where a.i=b.j;
>   
>  QUERY PLAN
> -
>  Gather  (cost=1000.00..2463.52 rows=991 width=8) (actual time=0.823..25.651 
> rows=991 loops=1)
>Output: a.i, b.j
>Workers Planned: 2
>Workers Launched: 2
>->  Nested Loop  (cost=0.00..1364.42 rows=413 width=8) (actual 
> time=9.426..16.723 min_time=0.000 max_time=22.017 min_rows=0 rows=330 
> max_rows=991 loops=3)
>  Output: a.i, b.j
>  Join Filter: (a.i = b.j)
>  Rows Removed by Join Filter: 33003
>  Worker 0:  actual time=14.689..14.692 rows=0 loops=1
>  Worker 1:  actual time=13.458..13.460 rows=0 loops=1
>  ->  Parallel Seq Scan on public.a  (cost=0.00..9.17 rows=417 
> width=4) (actual time=0.049..0.152 min_time=0.000 max_time=0.202 min_rows=0 
> rows=333 max_rows=452 loops=3)
>Output: a.i
>Worker 0:  actual time=0.040..0.130 rows=322 loops=1
>Worker 1:  actual time=0.039..0.125 rows=226 loops=1
>  ->  Seq Scan on public.b  (cost=0.00..2.00 rows=100 width=4) (actual 
> time=0.006..0.026 min_time=0.012 max_time=0.066 min_rows=100 rows=100 
> max_rows=100 loops=1000)
>Output: b.j
>Worker 0:  actual time=0.006..0.024 min_time=0.000  
> max_time=0.000 min_rows=0 rows=100 max_rows=0 loops=322
>Worker 1:  actual time=0.008..0.030 min_time=0.000  
> max_time=0.000 min_rows=0 rows=100 max_rows=0 loops=226
>  Planning Time: 0.186 ms
>  Execution Time: 25.838 ms
> (20 rows)
>
> This reports the max/min rows or time of the inner scan as 0 in parallel
> workers, and as a result only the leader process's values are accounted
> for. To fix this, we would change InstrAggNode as below.
>
> @@ -167,6 +196,10 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
> dst->nloops += add->nloops;
> dst->nfiltered1 += add->nfiltered1;
> dst->nfiltered2 += add->nfiltered2;
> +   dst->min_t = Min(dst->min_t, add->min_t);
> +   dst->max_t = Max(dst->max_t, add->max_t);
> +   dst->min_tuples = Min(dst->min_tuples, add->min_tuples);
> +   dst->max_tuples = Max(dst->max_tuples, add->max_tuples);

Agreed.

> 3)
> There are garbage 

RE: Determine parallel-safety of partition relations for Inserts

2021-01-31 Thread Hou, Zhijie
Hi Greg,

Thanks for the review !

> Personally, I think a table's "parallel_dml" option should be ON by default.
> It's annoying to have to separately enable it for each and every table being
> used, when I think the need to turn it selectively OFF should be fairly
> rare.

Yes, I agreed.
Changed.

> And I'm not sure that "parallel_dml" is the best name for the table option
> - because it sort of implies parallel dml WILL be used - but that isn't
> true, it depends on other factors too.
> So I think (to be consistent with other table option naming) it would have
> to be something like "parallel_dml_enabled".

Agreed.
Changed to parallel_dml_enabled.

Attaching the v2 patch set, which addresses the comments above.

Some further refactoring:

Introduce a new function is_parallel_possible_for_modify() which decides
whether to do the safety check.

IMO, it seems more readable to extract all the checks that we can do before
the safety check and put them in the new function.

Please consider it for further review.

Best regards,
houzj




v2_0003-reloption-parallel_dml-src.patch
Description: v2_0003-reloption-parallel_dml-src.patch


v2_0004-reloption-parallel_dml-test-and-doc.patch
Description: v2_0004-reloption-parallel_dml-test-and-doc.patch


v2_0001-guc-option-enable_parallel_dml-src.patch
Description: v2_0001-guc-option-enable_parallel_dml-src.patch


v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch


Re: Single transaction in the tablesync worker?

2021-01-31 Thread Amit Kapila
On Mon, Feb 1, 2021 at 9:38 AM Peter Smith  wrote:
>
> On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila  wrote:
> >
> > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith  wrote:
> > >
> > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  
> > > wrote:
> > > >
> > > > I have made the below changes in the patch. Let me know what you think
> > > > about these?
> > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > where we drop the slots at the end after finishing all the other
> > > > cleanup.
> > >
> > > There was a reason why the v22 logic was different from HEAD.
> > >
> > > The broken connection leaves dangling slots which is unavoidable.
> > >
> >
> > I think this is true only when the user specifically requested it by
> > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > Otherwise, we give an error on a broken connection. Also, if that is
> > true then is there a reason to pass missing_ok as true while dropping
> > tablesync slots?
> >
>
> AFAIK there is always a potential race with DropSubscription dropping
> slots. The DropSubscription might be running at exactly the same time
> the apply worker has just dropped the very same tablesync slot.
>

We stopped the workers before getting the list of NotReady relations, and
then we try to drop the corresponding slots. So, how can such a race
condition happen? Note that because we have a lock on pg_subscription,
there is no chance that the workers can restart till the transaction
ends.

> By
> saying missing_ok = true it means DropSubscription would not give
> ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> with an unexpected error.
>
> >
> > > But,
> > > whereas the user knows the name of the Subscription slot (they named
> > > it), there is no easy way for them to know the names of the remaining
> > > tablesync slots unless we log them.
> > >
> > > That is why the v22 code was written to process the tablesync slots
> > > even for wrconn == NULL so their name could be logged:
> > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > syncslotname);
> > >
> > > The v23 patch removed this dangling slot name information, so it makes
> > > it difficult for the user to know what tablesync slots to cleanup.
> > >
> >
> > Okay, then can we think of combining with the existing error of the
> > replication slot? I think that might produce a very long message, so
> > another idea could be to LOG a separate WARNING for each such slot
> > just before giving the error.
>
> There may be many subscribed tables, so I agree that combining them into one
> message might be too long. Yes, we can add another loop to output the
> necessary information. But isn’t logging a WARNING for each tablesync slot
> before the subscription slot ERROR exactly the behaviour which already
> existed in v22? IIUC the DropSubscription refactoring in v23 was not
> done for any functional change, but was intended only to make the code
> simpler, but how is that goal achieved if v23 ends up needing 3 loops
> where v22 only needed 1 loop to do the same thing?
>

No, there is a functionality change as well. The way we have the code in
v22 can easily lead to a problem when we have dropped the slots but then
get an error while removing origins or an entry from the subscription rel.
In such cases, we won't be able to roll back the drop of the slots, but the
other database operations will be rolled back. This is the reason we
have to drop the slots at the end. We need to ensure the same thing
for AlterSubscription_refresh. Does this make sense now?

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Peter Smith
On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila  wrote:
>
> On Mon, Feb 1, 2021 at 6:48 AM Peter Smith  wrote:
> >
> > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  
> > wrote:
> > >
> > > I have made the below changes in the patch. Let me know what you think
> > > about these?
> > > 1. It was a bit difficult to understand the code in DropSubscription
> > > so I have rearranged the code to match the way we are doing in HEAD
> > > where we drop the slots at the end after finishing all the other
> > > cleanup.
> >
> > There was a reason why the v22 logic was different from HEAD.
> >
> > The broken connection leaves dangling slots which is unavoidable.
> >
>
> I think this is true only when the user specifically requested it by
> the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> Otherwise, we give an error on a broken connection. Also, if that is
> true then is there a reason to pass missing_ok as true while dropping
> tablesync slots?
>

AFAIK there is always a potential race with DropSubscription dropping
slots. The DropSubscription might be running at exactly the same time
the apply worker has just dropped the very same tablesync slot. By
saying missing_ok = true it means DropSubscription would not give
ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
with an unexpected error.

>
> > But,
> > whereas the user knows the name of the Subscription slot (they named
> > it), there is no easy way for them to know the names of the remaining
> > tablesync slots unless we log them.
> >
> > That is why the v22 code was written to process the tablesync slots
> > even for wrconn == NULL so their name could be logged:
> > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > syncslotname);
> >
> > The v23 patch removed this dangling slot name information, so it makes
> > it difficult for the user to know what tablesync slots to cleanup.
> >
>
> Okay, then can we think of combining with the existing error of the
> replication slot? I think that might produce a very long message, so
> another idea could be to LOG a separate WARNING for each such slot
> just before giving the error.

There may be many subscribed tables, so I agree that combining them into one
message might be too long. Yes, we can add another loop to output the
necessary information. But isn’t logging a WARNING for each tablesync slot
before the subscription slot ERROR exactly the behaviour which already
existed in v22? IIUC the DropSubscription refactoring in v23 was not
done for any functional change, but was intended only to make the code
simpler, but how is that goal achieved if v23 ends up needing 3 loops
where v22 only needed 1 loop to do the same thing?


Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2021-01-31 Thread Masahiko Sawada
Hi,

On Fri, Nov 13, 2020 at 5:14 AM Tom Lane  wrote:
>
> I started looking through this patch.  I really quite dislike solving
> this via a kluge in indxpath.c.  There are multiple disadvantages
> to that:
>
> * It only helps for the very specific problem of redundant bitmap
> index scans, whereas the problem of applying redundant qual checks
> in partition scans seems pretty general.
>
> * It's not unlikely that this will end up trying to make the same
> proof multiple times (and the lack of any way to turn that off,
> through constraint_exclusion or some other knob, isn't too cool).
>
> * It does nothing to fix rowcount estimates in the light of the
> knowledge that some of the restriction clauses are no-ops.  Now,
> if we have up-to-date stats we'll probably manage to come out with
> an appropriate 0 or 1 selectivity anyway, but we might not have those.
> In any case, spending significant effort to estimate a selectivity
> when some other part of the code has taken the trouble to *prove* the
> clause true or false seems very undesirable.
>
> * I'm not even convinced that the logic is correct, specifically that
> it's okay to just "continue" if we refute the OR clause.  That seems
> likely to break generate_bitmap_or_paths' surrounding loop logic about
> "We must be able to match at least one index to each of the arms of
> the OR".  At least, if that still works it requires more than zero
> commentary about why.
>
>
> So I like much better the idea of Konstantin's old patch, that we modify
> the rel's baserestrictinfo list by removing quals that we can prove
> true.  We could extend that to solve the bitmapscan problem by removing
> OR arms that we can prove false.  So I started to review that patch more
> carefully, and after awhile realized that it has a really fundamental
> problem: it is trying to use CHECK predicates to prove WHERE clauses.
> But we don't know that CHECK predicates are true, only that they are
> not-false, and there is no proof mode in predtest.c that will allow
> proving some clauses true based only on other ones being not-false.
>
> We can salvage something by restricting the input quals to be only
> partition quals, since those are built to be guaranteed-true-or-false;
> we can assume they don't yield NULL.  There's a hole in that for
> hashing, as I noted elsewhere, but we'll fail to prove anything anyway
> from a satisfies_hash_partition() qual.  (In principle we could also use
> attnotnull quals, which also have that property.  But I'm dubious that
> that will help often enough to be worth the extra cycles for predtest.c
> to process them.)
>
> So after a bit of coding I had the attached.  This follows Konstantin's
> original patch in letting relation_excluded_by_constraints() change
> the baserestrictinfo list.  I read the comments in the older thread
> about people not liking that, and I can see the point.  But I'm not
> convinced that the later iterations of the patch were an improvement,
> because (a) the call locations for
> remove_restrictions_implied_by_constraints() seemed pretty random, and
> (b) it seems necessary to have relation_excluded_by_constraints() and
> remove_restrictions_implied_by_constraints() pretty much in bed with
> each other if we don't want to duplicate constraint-fetching work.
> Note the comment on get_relation_constraints() that it's called at
> most once per relation; that's not something I particularly desire
> to give up, because a relcache open isn't terribly cheap.  Also
> (c) I think it's important that there be a way to suppress this
> overhead when it's not useful.  In the patch as attached, turning off
> constraint_exclusion does that since relation_excluded_by_constraints()
> falls out before getting to the new code.  If we make
> remove_restrictions_implied_by_constraints() independent then it
> will need some possibly-quite-duplicative logic to check
> constraint_exclusion.  (Of course, if we'd rather condition this
> on some other GUC then that argument falls down.  But I think we
> need something.)  So, I'm not dead set on this code structure, but
> I haven't seen one I like better.
>
> Anyway, this seems to work, and if the regression test changes are
> any guide then it may fire often enough in the real world to be useful.
> Nonetheless, I'm concerned about performance, because predtest.c is a
> pretty expensive thing and there will be a lot of cases where the work
> is useless.  I did a quick check using pgbench's option to partition
> the tables, and observed that the -S (select only) test case seemed to
> get about 2.5% slower with the patch than without.  That's not far
> outside the noise floor, so maybe it's not real, but if it is real then
> it seems pretty disastrous.  Perhaps we could avoid that problem by
> removing the "predicate_implied_by" cases and only trying the
> "predicate_refuted_by" case, so that no significant time is added
> unless you've got an OR restriction clause on a partitioned table.
> That seems 

Re: Asynchronous Append on postgres_fdw nodes.

2021-01-31 Thread Etsuro Fujita
On Tue, Nov 17, 2020 at 6:56 PM Etsuro Fujita  wrote:
> * I haven't yet done anything about the issue on postgres_fdw's
> handling of concurrent data fetches by multiple ForeignScan nodes
> (below *different* Append nodes in the query) using the same
> connection discussed in [2].  I modified the patch to just disable
> applying this feature to problematic test cases in the postgres_fdw
> regression tests, by a new GUC enable_async_append.

A solution for the issue would be a scheduler designed to handle such
data fetches more efficiently, but I don’t think it’s easy to create
such a scheduler.  Rather than doing so, as a first cut I'd like to propose
allowing FDWs to disable async execution of their ForeignScans themselves in
problematic cases during executor startup.  What I have in mind for that is:

1) For an FDW that has async-capable ForeignScan(s), we allow the FDW
to record, for each of the async-capable and non-async-capable
ForeignScan(s), the information on a connection to be used for the
ForeignScan into EState during BeginForeignScan().

2) After doing ExecProcNode() to each SubPlan and the main query tree
in InitPlan(), we give the FDW a chance to a) reconsider, for each of
the async-capable ForeignScan(s), whether the ForeignScan can be
executed asynchronously as planned, based on the information stored
into EState in #1, and then b) disable async execution of the
ForeignScan if not.

#1 and #2 would be done after initial partition pruning, so more
async-capable ForeignScans would be executed asynchronously, if other
async-capable ForeignScans conflicting with them are removed by that
pruning.
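
To illustrate #1/#2, here is a standalone toy model (this is not postgres_fdw
code; the struct and all names below are made up for the sketch) of keeping
async execution only for scans that do not end up sharing a connection once
initial pruning has removed the conflicting ones:

#include <stdbool.h>
#include <stdio.h>

typedef struct ScanInfo
{
    int     scan_id;
    int     connection_id;      /* which remote connection the scan uses */
    bool    pruned;             /* removed by initial partition pruning */
    bool    async_capable;      /* planner marked it async-capable */
    bool    run_async;          /* final decision */
} ScanInfo;

static void
decide_async(ScanInfo *scans, int nscans)
{
    for (int i = 0; i < nscans; i++)
    {
        bool    shared = false;

        if (scans[i].pruned || !scans[i].async_capable)
            continue;

        /* Does any other surviving scan use the same connection? */
        for (int j = 0; j < nscans; j++)
        {
            if (j != i && !scans[j].pruned &&
                scans[j].connection_id == scans[i].connection_id)
            {
                shared = true;
                break;
            }
        }
        scans[i].run_async = !shared;
    }
}

int
main(void)
{
    ScanInfo    scans[3] = {
        {1, 100, false, true, false},   /* alone on connection 100 */
        {2, 200, false, true, false},   /* shares connection 200 ... */
        {3, 200, true, true, false},    /* ... but this one was pruned */
    };

    decide_async(scans, 3);
    for (int i = 0; i < 3; i++)
        printf("scan %d: async=%d\n", scans[i].scan_id, scans[i].run_async);
    return 0;
}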

This wouldn’t prevent us from adding a feature like what was proposed
by Horiguchi-san later.

BTW: while considering this, I noticed some bugs with
ExecAppendAsyncBegin() in the previous patch.  Attached is a new
version of the patch fixing them.  I also tweaked some comments a
little bit.

Best regards,
Etsuro Fujita


async-wip-2021-02-01.patch
Description: Binary data


Re: Single transaction in the tablesync worker?

2021-01-31 Thread Amit Kapila
On Mon, Feb 1, 2021 at 6:48 AM Peter Smith  wrote:
>
> On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  wrote:
> >
> > I have made the below changes in the patch. Let me know what you think
> > about these?
> > 1. It was a bit difficult to understand the code in DropSubscription
> > so I have rearranged the code to match the way we are doing in HEAD
> > where we drop the slots at the end after finishing all the other
> > cleanup.
>
> There was a reason why the v22 logic was different from HEAD.
>
> The broken connection leaves dangling slots which is unavoidable.
>

I think this is true only when the user specifically requested it by
the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
Otherwise, we give an error on a broken connection. Also, if that is
true then is there a reason to pass missing_ok as true while dropping
tablesync slots?


> But,
> whereas the user knows the name of the Subscription slot (they named
> it), there is no easy way for them to know the names of the remaining
> tablesync slots unless we log them.
>
> That is why the v22 code was written to process the tablesync slots
> even for wrconn == NULL so their name could be logged:
> elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> syncslotname);
>
> The v23 patch removed this dangling slot name information, so it makes
> it difficult for the user to know what tablesync slots to cleanup.
>

Okay, then can we think of combining with the existing error of the
replication slot? I think that might produce a very long message, so
another idea could be to LOG a separate WARNING for each such slot
just before giving the error.

> > 2. In AlterSubscription_refresh(), we can't allow workers to be
> > stopped at commit time as we have already dropped the slots because
> > the worker can access the dropped slot. We need to stop the workers
> > before dropping slots. This makes all the code related to
> > logicalrep_worker_stop_at_commit redundant.
>
> OK.
>
> > 3. In AlterSubscription_refresh(), we need to acquire the lock on
> > pg_subscription_rel only when we try to remove any subscription rel.
>
> + if (!sub_rel_locked)
> + {
> + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> + sub_rel_locked = true;
> + }
>
> OK. But the sub_rel_locked bool is not really necessary. Why not just
> check for rel == NULL? e.g.
> if (!rel)
> rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>

Okay, that seems to be better, will change accordingly.

> > 4. Added/Changed quite a few comments.
> >
>
> @@ -1042,6 +1115,31 @@ DropSubscription(DropSubscriptionStmt *stmt,
> bool isTopLevel)
>   }
>   list_free(subworkers);
>
> + /*
> + * Tablesync resource cleanup (slots and origins).
>
> The comment is misleading; this code is only dropping origins.
>

Okay, will change to something like: "Cleanup of tablesync replication origins."

> @@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
>   Oid relid;
>  } LogicalRepWorkerId;
>
> -typedef struct StopWorkersData
> -{
> - int nestDepth; /* Sub-transaction nest level */
> - List*workers; /* List of LogicalRepWorkerId */
> - struct StopWorkersData *parent; /* This need not be an immediate
> - * subtransaction parent */
> -} StopWorkersData;
>
> Since v23 removes that typedef from the code, don't you also have to
> remove it from src/tools/pgindent/typedefs.list?
>

Yeah.

-- 
With Regards,
Amit Kapila.




Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread Amit Langote
On Mon, Feb 1, 2021 at 8:50 AM David Rowley  wrote:
> On Sun, 31 Jan 2021 at 11:42, Tom Lane  wrote:
> > This fixes the cases reported by Andreas and Jaime, leaving me
> > more confident that there's nothing wrong with David's Assert.
>
> It could be fixed by modifying get_singleton_append_subpath() to
> modify the partitioned_rels when it collapses these single-subpath
> Append/MergeAppend path, but I'm very open to looking at just getting
> rid of the partitioned_rels field. Prior to a929e17e5 that field was
> very loosely set and in several cases could contain RT indexes of
> partitioned tables that didn't even have any subpaths in the given
> Append/MergeAppend. a929e17e5 attempted to tighten all that up but
> looks like I missed the case above.
>
> > I wonder whether we should consider back-patching this.  Another
> > thing that seems unclear is whether there is any serious consequence
> > to omitting some intermediate partitions from the set considered
> > by make_partition_pruneinfo.  Presumably it could lead to missing
> > some potential run-time-pruning opportunities, but is there any
> > worse issue?  If there isn't, this is a bigger change than I want
> > to put in the back braches.
>
> It shouldn't be backpatched. a929e17e5 only exists in master. Prior to
> that AppendPath/MergeAppendPath's partitioned_rels field could only
> contain additional partitioned table RT index. There are no known
> cases of missing ones prior to a929e17e5, so this bug shouldn't exist
> in PG13.
>
> a929e17e5 introduced run-time partition pruning for
> sub-Append/MergeAppend paths.  The commit message of that explains
> that there are plenty of legitimate cases where we can't flatten these
> sub-Append/MergeAppend paths down into the top-level
> Append/MergeAppend.   Originally I had thought we should only bother
> doing run-time pruning on the top-level Append/MergeAppend because I
> thought these cases were very uncommon.  I've now changed my mind.
>
> For a929e17e5, it was not just a matter of removing the lines in [1]
> to allow run-time pruning on nested Append/MergeAppends. I also needed
> to clean up the sloppy setting of partitioned_rels. The remainder of
> the patch attempted that.
>
> FWIW, I hacked together a patch which fixes the bug by passing a
> Bitmapset ** pointer to get_singleton_append_subpath(), which set the
> bits for the Append / MergeAppend path's partitioned_rels that we get
> rid of when it only has a single subpath. This stops the Assert
> failure mentioned here.  However, I'd much rather explore getting rid
> of partitioned_rels completely. I'll now have a look at the patch
> you're proposing for that.

I've read Tom's patch (0001) and would definitely vote for that over
having partitioned_rels in Paths anymore.

> Thanks for investigating this and writing the patch.

+1.  My apologies as well for failing to notice this thread sooner.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Fix typo about generate_gather_paths

2021-01-31 Thread Masahiko Sawada
Hi,

On Wed, Dec 23, 2020 at 3:24 AM Tomas Vondra
 wrote:
>
> On 12/9/20 3:21 AM, Hou, Zhijie wrote:
> > Hi
> >
> > Since ba3e76c,
> > the optimizer calls generate_useful_gather_paths() instead of 
> > generate_gather_paths() from the outside.
> >
> > But I noticed that some comments still talk about generate_gather_paths, 
> > not generate_useful_gather_paths.
> > I think we should fix these comments, so I tried to replace 
> > "generate_gather_paths" with "generate_useful_gather_paths".
> >
>
> Thanks. I started looking at this a bit more closely, and I think most
> of the changes are fine - the code was changed to call a different
> function, but the comments still reference generate_gather_paths().
>
> The one exception seems to be create_ordered_paths(), because that
> comment also makes statements about what generate_gather_paths is
> doing. And some of it does not apply to generate_useful_gather_paths.
> For example it says it generates order-preserving Gather Merge paths,
> but generate_useful_gather_paths also generates paths with sorts (which
> are clearly not order-preserving).
>
> So I think this comment will need a bit more work to update ...

Status update for a commitfest entry.

This patch has been "Waiting on Author" without seeing any activity
since Tomas sent review comments. I'm planning to set it to "Returned
with Feedback", barring objections.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Is it worth accepting multiple CRLs?

2021-01-31 Thread Kyotaro Horiguchi
At Sat, 30 Jan 2021 22:20:19 +0100, Peter Eisentraut 
 wrote in 
> On 2021-01-19 09:32, Kyotaro Horiguchi wrote:
> > At Tue, 19 Jan 2021 09:17:34 +0900 (JST), Kyotaro Horiguchi
> >  wrote in
> >> By the way we can do the same thing on CA file/dir, but I personally
> >> think that the benefit from the specify-by-directory for CA files is
> >> far less than CRL files. So I'm not going to do this for CA files for
> >> now.
> > This is it. A new guc ssl_crl_dir and connection option crldir are
> > added.
> 
> This looks pretty good to me overall.

Thanks!

> You need to update the expected result of the postgres_fdw test.

Oops. Fixed.

> Also check your patch for whitespace errors with git diff --check or
> similar.

Sorry for forgetting that. I found an extra newline in
be-secure-openssl.c and removed it.

> > One problem raised upthread is the footprint for test is quite large
> > because all certificate and key files are replaced by this patch. I
> > think we can shrink the footprint by generating that files on-demand
> > but that needs openssl frontend to be installed on the development
> > environment.
> 
> I don't understand why you need to recreate all these files.  All your
> patch should contain are the new *.r0 files that are computed from the
> existing *.crl files.  Nothing else should change, AIUI.

Ah. If I run make with this patch, it complains that ssl/root_ca-certindex
is missing, so I ran "make clean" to avoid the complaint.  Instead, I
created the additional CRL directories by manually executing the recipes
of the additional rules.

v3:  41 files changed, 496 insertions(+), 255 deletions(-)
v4:  21 files changed, 258 insertions(+), 18 deletions(-)

I checked that 001_ssltests.pl succeeds both with the preexisting ssl/
files and with the files created by "make sslfiles" after "make
sslfiles-clean".

> Some of the makefile rules for generating the CRL files need some
> refinement.  In
> 
> +ssl/root+server-crldir: ssl/server.crl
> +   mkdir ssl/root+server-crldir
> + cp ssl/server.crl ssl/root+server-crldir/`openssl crl -hash -noout
> -in ssl/server.crl`.r0
> + cp ssl/root.crl ssl/root+server-crldir/`openssl crl -hash -noout -in
> ssl/root.crl`.r0
> +ssl/root+client-crldir: ssl/client.crl
> +   mkdir ssl/root+client-crldir
> + cp ssl/client.crl ssl/root+client-crldir/`openssl crl -hash -noout
> -in ssl/client.crl`.r0
> + cp ssl/root.crl ssl/root+client-crldir/`openssl crl -hash -noout -in
> ssl/root.crl`.r0
> 
> the rules should also have a dependency on ssl/root.crl in addition to
> ssl/server.crl.

Right. Added.

> By the way:
> 
> -   print $sslconf "ssl_crl_file='root+client.crl'\n";
> +   print $sslconf "ssl_crl_file='$crlfile'\n" if (defined $crlfile);
> +   print $sslconf "ssl_crl_dir='$crldir'\n" if (defined $crldir);
> 
> Trailing "if" doesn't need parentheses.

I know. However, I preferred to have them at the time; I don't have a
strong opinion about how it should be. I've removed them.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9168e29fa6bd957171cfa53553ae406eb44bb936 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 1 Feb 2021 11:16:41 +0900
Subject: [PATCH v4] Allow to specify CRL directory

We have the ssl_crl_file GUC variable and the sslcrl connection option
to specify a CRL file.  X509_STORE_load_locations also accepts a
directory, which enables an on-demand loading method where only the
relevant CRLs are loaded.  Allow server and client to use the hashed
directory method.  We could use the existing variable and option to
specify the directory name, but allowing both methods at the same time
gives operational flexibility to users.
---
 .../postgres_fdw/expected/postgres_fdw.out|  2 +-
 doc/src/sgml/config.sgml  | 21 +++-
 doc/src/sgml/libpq.sgml   | 20 +--
 doc/src/sgml/runtime.sgml | 33 +++
 src/backend/libpq/be-secure-openssl.c | 26 +--
 src/backend/libpq/be-secure.c |  1 +
 src/backend/utils/misc/guc.c  | 10 ++
 src/include/libpq/libpq.h |  1 +
 src/interfaces/libpq/fe-connect.c |  6 
 src/interfaces/libpq/fe-secure-openssl.c  | 24 ++
 src/interfaces/libpq/libpq-int.h  |  1 +
 src/test/ssl/Makefile | 20 ++-
 src/test/ssl/ssl/client-crldir/9bb9e3c3.r0| 11 +++
 .../ssl/ssl/root+client-crldir/9bb9e3c3.r0| 11 +++
 .../ssl/ssl/root+client-crldir/a3d11bff.r0| 11 +++
 .../ssl/ssl/root+server-crldir/a3d11bff.r0| 11 +++
 .../ssl/ssl/root+server-crldir/a836cc2d.r0| 11 +++
 src/test/ssl/ssl/server-crldir/a836cc2d.r0| 11 +++
 src/test/ssl/t/001_ssltests.pl| 31 +++--
 src/test/ssl/t/SSLServer.pm   | 14 +++-
 20 files changed, 258 insertions(+), 18 deletions(-)
 create mode 100644 

Re: Determine parallel-safety of partition relations for Inserts

2021-01-31 Thread Greg Nancarrow
On Fri, Jan 29, 2021 at 5:44 PM Hou, Zhijie  wrote:
>
>
> Attaching v1 patches, introducing options which let users manually control 
> whether to use parallel dml.
>
> About the patch:
> 1. add a new guc option: enable_parallel_dml (boolean)
> 2. add a new tableoption: parallel_dml (boolean)
>
>The default of both is off(false).
>
> Users can set enable_parallel_dml in postgresql.conf or in a session to enable 
> parallel dml.
> If users want specific tables to use parallel insert, they can set 
> the table's parallel_dml option to on.
>
> Some attention:
> (1)
> Currently, if the guc option enable_parallel_dml is set to on but the table's 
> parallel_dml is off,
> the planner still does not consider parallel for dml.
>
> In this way, if users want to use parallel dml, they have to set 
> enable_parallel_dml=on and set parallel_dml = on.
> If someone dislikes this, I think we can also make the table option parallel_dml's 
> default value on; with this, users only need
> to set enable_parallel_dml=on
>
> (2)
> For the parallel_dml option:
> If the target table is partitioned and its parallel_dml is set to on, the planner 
> will ignore
> the option value of its children.
> This is because we cannot divide the dml plan per table, so we just 
> check the target table itself.
>
>
> Thoughts and comments are welcome.
>

Personally, I think a table's "parallel_dml" option should be ON by default.
It's annoying to have to separately enable it for each and every table
being used, when I think the need to turn it selectively OFF should be
fairly rare.
And I'm not sure that "parallel_dml" is the best name for the table
option - because it sort of implies parallel dml WILL be used - but
that isn't true, it depends on other factors too.
So I think (to be consistent with other table option naming) it would
have to be something like "parallel_dml_enabled".
I'm still looking at the patches, but have so far noticed that there
are some issues in the documentation updates (grammatical issues and
consistency issues with current documentation), and also some of the
explanations are not clear. I guess I can address these separately.
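
For what it's worth, my reading of how a user would drive this under the v1
patches is roughly the following sketch (using the option names as they
currently stand, not committed syntax):

SET enable_parallel_dml = on;                     -- proposed GUC
ALTER TABLE target_tbl SET (parallel_dml = on);   -- proposed table option
EXPLAIN (COSTS OFF)
INSERT INTO target_tbl SELECT * FROM source_tbl;  -- may now consider a parallel plan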

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [BUG] orphaned function

2021-01-31 Thread Masahiko Sawada
Hi Bertrand,

On Fri, Dec 18, 2020 at 6:52 PM Gilles Darold  wrote:
>
> Le 18/12/2020 à 00:26, Tom Lane a écrit :
> > Gilles Darold  writes:
> >> The same problem applies if the returned type or the procedural language
> >> is dropped. I have tried to fix that in ProcedureCreate() by using an
> >> AccessSharedLock for the data types and languages involved in the
> >> function declaration. With this patch the race condition with parameters
> >> types, return types and PL languages are fixed. Only data types and
> >> procedural languages with Oid greater than FirstBootstrapObjectId will
> >> be locked locked. But this is probably more complex than this fix so it
> >> is proposed as a separate patch
> >> (v1-003-orphaned_function_types_language.patch) to not interfere with
> >> the applying of Bertran's patch.
> > Indeed.  This points up one of the things that is standing in the way
> > of any serious consideration of applying this patch.  To my mind there
> > are two fundamental, somewhat interrelated considerations that haven't
> > been meaningfully addressed:
> >
> > 1. What set of objects is it worth trying to remove this race condition
> > for, and with respect to what dependencies?  Bertrand gave no
> > justification for worrying about function-to-namespace dependencies
> > specifically, and you've likewise given none for expanding the patch
> > to consider function-to-datatype dependencies.  There are dozens more
> > cases that could be considered; but I sure don't want to be processing
> > another ad-hoc fix every time someone thinks they're worried about
> > another specific case.
> >
> > Taking a practical viewpoint, I'm far more worried about dependencies
> > of tables than those of any other kind of object.  A messed-up function
> > definition doesn't cost you much: you can just drop and recreate the
> > function.  A table full of data is way more trouble to recreate, and
> > indeed the data might be irreplaceable.  So it seems pretty silly to
> > propose that we try to remove race conditions for functions' dependencies
> > on datatypes without removing the same race condition for tables'
> > dependencies on their column datatypes.
> >
> > But any of these options lead to the same question: why stop there?
> > An approach that would actually be defensible, perhaps, is to incorporate
> > this functionality into the dependency mechanism: any time we're about to
> > create a pg_depend or pg_shdepend entry, take out an AccessShareLock on
> > the referenced object, and then check to make sure it still exists.
> > This would improve the dependency mechanism so it prevents creation-time
> > race conditions, not just attempts to drop dependencies of
> > already-committed objects.  It would mean that the one patch would be
> > the end of the problem, rather than looking forward to a steady drip of
> > new fixes.
> >
> > 2. How much are we willing to pay for this added protection?  The fact
> > that we've gotten along fine without it for years suggests that the
> > answer needs to be "not much".  But I'm not sure that that actually
> > is the answer, especially if we don't have a policy that says "we'll
> > protect against these race conditions but no other ones".  I think
> > there are possibly-serious costs in three different areas:
> >
> > * Speed.  How much do all these additional lock acquisitions slow
> > down a typical DDL command?
> >
> > * Number of locks taken per transaction.  This'd be particularly an
> > issue for pg_restore runs using single-transaction mode: they'd take
> > not only locks on the objects they create, but also a bunch of
> > reference-protection locks.  It's not very hard to believe that this'd
> > make a difference in whether restoring a large database is possible
> > without increasing max_locks_per_transaction.
> >
> > * Risk of deadlock.  The reference locks themselves should be sharable,
> > so maybe there isn't much of a problem, but I want to see this question
> > seriously analyzed not just ignored.
> >
> > Obviously, the magnitude of these costs depends a lot on how many
> > dependencies we want to remove the race condition for.  But that's
> > exactly the reason why I don't want a piecemeal approach of fixing
> > some problems now and some other problems later.  That's way too
> > much like the old recipe for boiling a frog: we could gradually get
> > into serious performance problems without anyone ever having stopped
> > to consider the issue.
> >
> > In short, I think we should either go for a 100% solution if careful
> > analysis shows we can afford it, or else have a reasoned policy
> > why we are going to close these specific race conditions and no others
> > (implying that we'll reject future patches in the same area).  We
> > haven't got either thing in this thread as it stands, so I do not
> > think it's anywhere near being ready to commit.
> >
> >   regards, tom lane
>
>
> Thanks for the review,
>
>
> Honestly I have never met these 

Re: Key management with tests

2021-01-31 Thread Moon, Insung
Dear All.

Thank you for all opinions and discussions regarding the KMS/TDE function.

First of all, to get to the point of this email:
I want to participate in any way I can (review or development)
when TDE-related development is in progress.
If there is a meeting about it, my limited English may keep me from
saying much, but I would like to attend even if only to listen.

I didn't fully understand the KMS work and didn't take part in the
development directly, so I haven't commented on anything so far. Still,
when TDE development starts, I want to join the discussions and meetings
if there is anything I can do, even though I will probably say very
little in voice or video meetings.

Also, if the wiki page and other mail threads related to TDE start,
I'll join in discussions if there is anything I can do.

Best regards.
Moon.

On Sat, Jan 30, 2021 at 10:23 PM Tom Kincaid  wrote:
>
>
>
>
>
> Thanks Stephen, Bruce and Masahiko,
>
>>
>> > discussions so far and the point behind the design so that everyone
>> > can understand why this feature is designed in that way. To do that,
>> > it might be a good start to sort the wiki page since it has data
>> > encryption part, KMS, and ToDo mixed.
>>
>> I hope it's pretty clear that I'm also very much in support of both this
>> effort with the KMS and of TDE in general- TDE is specifically,
>> repeatedly, called out as a capability whose lack is blocking PG from
>> being able to be used for certain use-cases that it would otherwise be
>> well suited for, and that's really unfortunate.
>
>
> It is clear you are supportive.
>
> As you know, I share your point of view that PG adoption is suffering for 
> certain use cases because it does not have TDE.
>
>> I appreciate the recent discussion and reviews of the KMS in particular,
>> and of the patches which have been sent enabling TDE based on the KMS
>> patches.  Having them be relatively independent seems to be an ongoing
>> concern and perhaps we should figure out a way to more clearly put them
>> together.  That is- the KMS patches have been posted on one thread, and
>> TDE PoC patches which use the KMS patches have been on another thread,
>> leading some to not realize that there's already been TDE PoC work done
>> based on the KMS patches.  Seems like it might make sense to get one
>> patch set which goes all the way from the KMS and includes the TDE PoC,
>> even if they don't all go in at once.
>
>
> Sounds good, thanks Masahiko, let's see if we can get consensus on the 
> approach for moving this forward see below.
>
>>
>>
>> together, as a few on this thread have voiced, but there's no doubt that
>> this is a large project and it's hard to see how we could possibly
>> commit all of it at once.
>
>
> I propose that we meet to discuss what approach we want to use to move TDE 
> forward.  We then start a new thread with a proposal on the approach and 
> finalize it via community consensus. I will invite Bruce, Stephen and 
> Masahiko to this meeting. If anybody else would like to participate in this 
> discussion and subsequently in the effort to get TDE in PG1x, please let me 
> know. Assuming Bruce, Stephen and Masahiko are down for this, I (or a 
> volunteer from this meeting) will post the proposal for how we move this 
> patch forward in another thread. Hopefully, we can get consensus on that and 
> subsequently restart the execution of delivering this feature.
>
>
>
>
>>
>> Thanks!
>>
>> Stephen
>
>
>
> --
> Thomas John Kincaid
>




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Peter Smith
On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  wrote:

> 2. In AlterSubscription_refresh(), we can't allow workers to be
> stopped at commit time as we have already dropped the slots because
> the worker can access the dropped slot. We need to stop the workers
> before dropping slots. This makes all the code related to
> logicalrep_worker_stop_at_commit redundant.

@@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
  Oid relid;
 } LogicalRepWorkerId;

-typedef struct StopWorkersData
-{
- int nestDepth; /* Sub-transaction nest level */
- List*workers; /* List of LogicalRepWorkerId */
- struct StopWorkersData *parent; /* This need not be an immediate
- * subtransaction parent */
-} StopWorkersData;

Since v23 removes that typedef from the code, don't you also have to
remove it from src/tools/pgindent/typedefs.list?


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Single transaction in the tablesync worker?

2021-01-31 Thread Peter Smith
On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila  wrote:
>
> I have made the below changes in the patch. Let me know what you think
> about these?
> 1. It was a bit difficult to understand the code in DropSubscription
> so I have rearranged the code to match the way we are doing in HEAD
> where we drop the slots at the end after finishing all the other
> cleanup.

There was a reason why the v22 logic was different from HEAD.

The broken connection leaves dangling slots which is unavoidable. But,
whereas the user knows the name of the Subscription slot (they named
it), there is no easy way for them to know the names of the remaining
tablesync slots unless we log them.

That is why the v22 code was written to process the tablesync slots
even for wrconn == NULL so their name could be logged:
elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
syncslotname);

The v23 patch removed this dangling slot name information, so it makes
it difficult for the user to know what tablesync slots to cleanup.

> 2. In AlterSubscription_refresh(), we can't allow workers to be
> stopped at commit time as we have already dropped the slots because
> the worker can access the dropped slot. We need to stop the workers
> before dropping slots. This makes all the code related to
> logicalrep_worker_stop_at_commit redundant.

OK.

> 3. In AlterSubscription_refresh(), we need to acquire the lock on
> pg_subscription_rel only when we try to remove any subscription rel.

+ if (!sub_rel_locked)
+ {
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+ sub_rel_locked = true;
+ }

OK. But the sub_rel_locked bool is not really necessary. Why not just
check for rel == NULL? e.g.
if (!rel)
rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

> 4. Added/Changed quite a few comments.
>

@@ -1042,6 +1115,31 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
  }
  list_free(subworkers);

+ /*
+ * Tablesync resource cleanup (slots and origins).

The comment is misleading; this code is only dropping origins.


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Printing backtrace of postgres processes

2021-01-31 Thread Bharath Rupireddy
On Fri, Jan 29, 2021 at 7:10 PM vignesh C  wrote:
> > 4) How about following
> > + errmsg("must be a superuser to print backtrace
> > of backend process")));
> > instead of
> > + errmsg("must be a superuser to print backtrace
> > of superuser query process")));
> >
>
> Here the message should include superuser, we cannot remove it. Non
> super user can log non super user provided if user has permissions for
> it.
>
> > 5) How about following
> >  errmsg("must be a member of the role whose backed
> > process's backtrace is being printed or member of
> > pg_signal_backend")));
> > instead of
> > + errmsg("must be a member of the role whose
> > backtrace is being logged or member of pg_signal_backend")));
> >
>
> Modified it.

Maybe I'm confused about the difference between the
SIGNAL_BACKEND_NOSUPERUSER and SIGNAL_BACKEND_NOPERMISSION macros and
the corresponding error messages. Some clarification, or a use case showing
in which scenarios we hit those error messages, might help me. Did we try
to add test cases that produce these error messages for
pg_print_backtrace? If not, can we add some?

> Attached v5 patch has the fixes for the same.
> Thoughts?

Thanks. Here are some comments on v5 patch:

1) typo - it's "process" not "porcess" +a backend porcess. For example:

2) select * from pg_print_backtrace(NULL);
postgres=# select proname, proisstrict from pg_proc where proname =
'pg_print_backtrace';
      proname       | proisstrict
--------------------+-------------
 pg_print_backtrace | t

 See the documentation:
 "proisstrict bool

Function returns null if any call argument is null. In that case the
function won't actually be called at all. Functions that are not
“strict” must be prepared to handle null inputs."
So the below PG_ARGISNULL check is not needed; you can remove it, because
pg_print_backtrace will not get called with null input.
int		bt_pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
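
To illustrate: because the function is strict, a NULL argument never reaches
the C code at all and the result is simply NULL (sketch assuming the function
as defined in the v5 patch):

SELECT pg_print_backtrace(NULL);  -- returns NULL; the function body is never entered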

3) Can we just set result = true instead of PG_RETURN_BOOL(true); so
that it will be returned from PG_RETURN_BOOL(result); just for
consistency?
if (!SendProcSignal(bt_pid, PROCSIG_BACKTRACE_PRINT,
InvalidBackendId))
PG_RETURN_BOOL(true);
else
ereport(WARNING,
(errmsg("failed to send signal to postmaster: %m")));
}

PG_RETURN_BOOL(result);

4) Below is what happens if I request for a backtrace of the
postmaster process? 1388210 is pid of postmaster.
postgres=# select * from pg_print_backtrace(1388210);
WARNING:  PID 1388210 is not a PostgreSQL server process
 pg_print_backtrace
--------------------
 f

Does it make sense to have a postmaster's backtrace? If yes, can we
have something like below in sigusr1_handler()?
if (CheckPostmasterSignal(PMSIGNAL_BACKTRACE_EMIT))
{
LogBackTrace();
}

5) Can we have PROCSIG_PRINT_BACKTRACE instead of
PROCSIG_BACKTRACE_PRINT, just for readability and consistency with
function name?

6) I think it's not the postmaster that prints backtrace of the
requested backend and we don't send SIGUSR1 with
PMSIGNAL_BACKTRACE_EMIT to postmaster, I think it's the backends, upon
receiving SIGUSR1 with PMSIGNAL_BACKTRACE_EMIT signal, log their own
backtrace. Am I missing anything here? If I'm correct, we need to
change the below description and other places wherever we refer to
this description.

The idea here is to implement & expose pg_print_backtrace function, internally
what this function does is, the connected backend will send SIGUSR1 signal by
setting PMSIGNAL_BACKTRACE_EMIT to the postmaster process. Once the process
receives this signal it will log the backtrace of the process.

7) Can we get the parallel worker's backtrace? IIUC it's possible.

8) What happens if a user runs pg_print_backtrace() on Windows or
MacOS or some other platform? Do you want to say something about other
platforms where gdb/addr2line doesn't exist?
+
+You can get the file name and line number by using gdb/addr2line in
+linux platforms, as a prerequisite users must ensure gdb/addr2line is
+already installed:
+

9) Can't we reuse set_backtrace with just adding a flag to
set_backtrace(ErrorData *edata, int num_skip, bool
is_print_backtrace_function), making it a non-static function and call
set_backtrace(NULL, 0, true)?

void
set_backtrace(ErrorData *edata, int num_skip, bool is_print_backtrace_function)
{
StringInfoData errtrace;
---
---
if (is_print_backtrace_function)
elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace.data);
else
edata->backtrace = errtrace.data;
}

I think it will be good if we do this, because we can avoid duplicate
code in set_backtrace and LogBackTrace.

10) I think it's "pg_signal_backend" instead of "pg_print_backtrace"?
+backtrace is being printed or the calling role has been granted
+pg_print_backtrace, however only 

Re: row filtering for logical replication

2021-01-31 Thread Euler Taveira
On Mon, Mar 16, 2020, at 10:58 AM, David Steele wrote:
> Please submit to a future CF when a new patch is available.
Hi,

This is another version of the row filter patch. Patch summary:

0001: refactor to remove dead code
0002: grammar refactor for row filter
0003: core code, documentation, and tests
0004: psql code
0005: pg_dump support
0006: debug messages (only for test purposes)
0007: measure row filter overhead (only for test purposes)

From the previous version I incorporated Amit's suggestions [1] and improved 
documentation and tests. I refactored the code to make it simpler to read (broke 
the row filter code into functions). This new version covers the new parameter 
publish_via_partition_root that was introduced (cf 83fd4532a7).

Regarding function prohibition, I wouldn't like to open a can of worms (see 
previous discussions in this thread). Simple expressions cover most of the use 
cases that I worked with until now. This prohibition can be removed in another 
patch after some careful analysis.
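
For illustration, a simple row filter under this patch set looks roughly like 
the following (table, column, and publication names are only examples):

CREATE PUBLICATION pub_active_rows FOR TABLE orders WHERE (status = 'active');
ALTER PUBLICATION pub_active_rows SET TABLE orders WHERE (amount > 100);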

I did some limited tests and didn't observe any excessive CPU usage while 
testing this patch, though I agree with Andres that retaining some expression 
context in a cache would certainly speed up this piece of code. I measured 
the row filter overhead on my i7 (see 0007) and got:

mean:   92.49 us
stddev: 32.63 us
median: 83.45 us
min-max:[11.13 .. 2731.55] us
percentile(95): 117.76 us

[1] 
https://www.postgresql.org/message-id/CA%2BHiwqG3Jz-cRS%3D4gqXmZDjDAi%3D%3D19GvrFCCqAawwHcOCEn4fQ%40mail.gmail.com


--
Euler Taveira
EnterpriseDB:  https://www.enterprisedb.com/
From 78aa13f958d883f52ef0a9796536adf06cd58273 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 18 Jan 2021 11:13:23 -0300
Subject: [PATCH 1/7] Remove unused column from initial table synchronization

Column atttypmod was added in the commit 7c4f52409a, but it is not used.
The removal is safe because COPY from publisher does not need such
information.
---
 src/backend/replication/logical/tablesync.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 863d196fd7..a18f847ade 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -640,7 +640,7 @@ fetch_remote_table_info(char *nspname, char *relname,
 	StringInfoData cmd;
 	TupleTableSlot *slot;
 	Oid			tableRow[] = {OIDOID, CHAROID, CHAROID};
-	Oid			attrRow[] = {TEXTOID, OIDOID, INT4OID, BOOLOID};
+	Oid			attrRow[] = {TEXTOID, OIDOID, BOOLOID};
 	bool		isnull;
 	int			natt;
 
@@ -685,7 +685,6 @@ fetch_remote_table_info(char *nspname, char *relname,
 	appendStringInfo(&cmd,
 	 "SELECT a.attname,"
 	 "   a.atttypid,"
-	 "   a.atttypmod,"
 	 "   a.attnum = ANY(i.indkey)"
 	 "  FROM pg_catalog.pg_attribute a"
 	 "  LEFT JOIN pg_catalog.pg_index i"
@@ -718,7 +717,7 @@ fetch_remote_table_info(char *nspname, char *relname,
 		Assert(!isnull);
 		lrel->atttyps[natt] = DatumGetObjectId(slot_getattr(slot, 2, &isnull));
 		Assert(!isnull);
-		if (DatumGetBool(slot_getattr(slot, 4, &isnull)))
+		if (DatumGetBool(slot_getattr(slot, 3, &isnull)))
 			lrel->attkeys = bms_add_member(lrel->attkeys, natt);
 
 		/* Should never happen. */
-- 
2.20.1

From 13917594bd781cd614875498eacec48ce1d64537 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 18 Jan 2021 11:53:34 -0300
Subject: [PATCH 2/7] Rename a WHERE node

A WHERE clause will be used for row filtering in logical replication. We
already have a similar node: 'WHERE (condition here)'. Let's rename the
node to a generic name and use it for row filtering too.
---
 src/backend/parser/gram.y | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b2f447bf9a..793aac5377 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -485,7 +485,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	def_arg columnElem where_clause where_or_current_clause
 				a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
 				columnref in_expr having_clause func_table xmltable array_expr
-				ExclusionWhereClause operator_def_arg
+				OptWhereClause operator_def_arg
 %type <list>	rowsfrom_item rowsfrom_list opt_col_def_list
 %type <boolean> opt_ordinality
 %type <list>	ExclusionConstraintList ExclusionConstraintElem
@@ -3805,7 +3805,7 @@ ConstraintElem:
 	$$ = (Node *)n;
 }
 			| EXCLUDE access_method_clause '(' ExclusionConstraintList ')'
-opt_c_include opt_definition OptConsTableSpace  ExclusionWhereClause
+opt_c_include opt_definition OptConsTableSpace  OptWhereClause
 ConstraintAttributeSpec
 {
 	Constraint *n = makeNode(Constraint);
@@ -3907,7 +3907,7 @@ ExclusionConstraintElem: index_elem WITH any_operator
 			}
 		;
 
-ExclusionWhereClause:
+OptWhereClause:
 			WHERE '(' a_expr ')'	{ $$ = $3; }
 

Re: [sqlsmith] Failed assertion during partition pruning

2021-01-31 Thread David Rowley
On Sun, 31 Jan 2021 at 11:42, Tom Lane  wrote:
> This fixes the cases reported by Andreas and Jaime, leaving me
> more confident that there's nothing wrong with David's Assert.

I agree that there is nothing wrong with the Assert.

The commit message of a929e17e5 mentions:

> Here we tighten that up so that partitioned_rels only ever contains the RT
> index for partitioned tables which actually have subpaths in the given
> Append/MergeAppend.  We can now Assert that every PartitionedRelPruneInfo
> has a non-empty present_parts.  That should allow us to catch any weird
> corner cases that have been missed.

It seems in this case the Assert *did* find a case that I missed.
Unfortunately, I just missed the email that reported the problem,
until yesterday.

> I did not try to add a regression test case, mainly because it's
> not quite clear to me where the original bug is.  (I'm a little
> suspicious that the blame might lie with the "match_partition_order"
> cases in generate_orderedappend_paths, which try to bypass
> accumulate_append_subpath without updating the partitioned_rels
> data.  But I'm not excited about trying to prove that.)

Yeah, that suspicion is correct. More specifically when
get_singleton_append_subpath() finds a single subpath Append /
MergeAppend, the single subpath is returned.

So what's happening is that we first build the Append paths for the
sub-partitioned table which, in the example case, will have a single
subpath Append path with partitioned_rels containing just the parent
of that partition (i.e trigger_parted_p1).  When it comes to doing
generate_orderedappend_paths() for the base relation (trigger_parted),
we find match_partition_order to be true and allow
get_singleton_append_subpath() to pullup the single-subplan Append
path. Unfortunately we just completely ignore that Append path's
partitioned_rels. Back in generate_orderedappend_paths() the
startup_partitioned_rels and total_partitioned_rels variables are
used, both of which only mention the trigger_parted table. The mention
of trigger_parted_p1 is lost.

It could be fixed by modifying get_singleton_append_subpath() to
modify the partitioned_rels when it collapses these single-subpath
Append/MergeAppend path, but I'm very open to looking at just getting
rid of the partitioned_rels field. Prior to a929e17e5 that field was
very loosely set and in several cases could contain RT indexes of
partitioned tables that didn't even have any subpaths in the given
Append/MergeAppend. a929e17e5 attempted to tighten all that up but
looks like I missed the case above.

> I wonder whether we should consider back-patching this.  Another
> thing that seems unclear is whether there is any serious consequence
> to omitting some intermediate partitions from the set considered
> by make_partition_pruneinfo.  Presumably it could lead to missing
> some potential run-time-pruning opportunities, but is there any
> worse issue?  If there isn't, this is a bigger change than I want
> to put in the back braches.

It shouldn't be backpatched. a929e17e5 only exists in master. Prior to
that AppendPath/MergeAppendPath's partitioned_rels field could only
contain additional partitioned table RT index. There are no known
cases of missing ones prior to a929e17e5, so this bug shouldn't exist
in PG13.

a929e17e5 introduced run-time partition pruning for
sub-Append/MergeAppend paths.  The commit message of that explains
that there are plenty of legitimate cases where we can't flatten these
sub-Append/MergeAppend paths down into the top-level
Append/MergeAppend.   Originally I had thought we should only bother
doing run-time pruning on the top-level Append/MergeAppend because I
thought these cases were very uncommon.  I've now changed my mind.

For a929e17e5, it was not just a matter of removing the lines in [1]
to allow run-time pruning on nested Append/MergeAppends. I also needed
to clean up the sloppy setting of partitioned_rels. The remainder of
the patch attempted that.

FWIW, I hacked together a patch which fixes the bug by passing a
Bitmapset ** pointer to get_singleton_append_subpath(), which set the
bits for the Append / MergeAppend path's partitioned_rels that we get
rid of when it only has a single subpath. This stops the Assert
failure mentioned here.  However, I'd much rather explore getting rid
of partitioned_rels completely. I'll now have a look at the patch
you're proposing for that.

Thanks for investigating this and writing the patch. Apologies for
this email missing my attention closer to the time to when it was
initially reported.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/optimizer/plan/createplan.c;h=40abe6f9f623ed2922ccc4e1991e97e90322f47d;hp=94280a730c4d9abb2143416fca8e74e76dada042;hb=a929e17e5;hpb=dfc797730fc7a07c0e6bd636ad1a564aecab3161




Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-31 Thread Tom Lane
Alexander Korotkov  writes:
> Pushed with minor cleanup.

thorntail seems unhappy:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-01-31%2020%3A58%3A12

==-=-== stack trace: pgsql.build/src/test/regress/tmp_check/data/core 
==-=-==
[New LWP 2266507]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: nm regression [local] SELECT   
 '.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x0175c410 in jsonb_subscript_check_subscripts (state=, op=0x1d852b0, econtext=) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/utils/adt/jsonbsubs.c:198
198 for (int i = 0; i < sbsrefstate->numupper; i++)
#0  0x0175c410 in jsonb_subscript_check_subscripts (state=, op=0x1d852b0, econtext=) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/utils/adt/jsonbsubs.c:198
#1  0x013e55c0 in ExecInterpExpr (state=0x1d85068, 
econtext=0x1d85660, isnull=0x7feffa2fbbc) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/executor/execExprInterp.c:1402
#2  0x013de4bc in ExecInterpExprStillValid (state=0x1d85068, 
econtext=0x1d85660, isNull=0x7feffa2fbbc) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/executor/execExprInterp.c:1765
#3  0x0154fbd4 in ExecEvalExprSwitchContext (isNull=0x7feffa2fbbc, 
econtext=, state=0x1d85068) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/include/executor/executor.h:315
#4  evaluate_expr (expr=, result_type=, 
result_typmod=, result_collation=) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/util/clauses.c:4533
#5  0x015513b8 in eval_const_expressions_mutator (node=0x1dce218, 
context=0x7feffa30108) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/util/clauses.c:2883
#6  0x014b4968 in expression_tree_mutator (node=0x1cc10e8, 
mutator=0x154fca4 , context=0x7feffa30108) 
at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/nodes/nodeFuncs.c:2762
#7  0x0154fd0c in eval_const_expressions_mutator (node=0x1cc10e8, 
context=0x7feffa30108) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/util/clauses.c:3312
#8  0x014b52d0 in expression_tree_mutator (node=0x1cc1140, 
mutator=0x154fca4 , context=0x7feffa30108) 
at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/nodes/nodeFuncs.c:3050
#9  0x0154fd0c in eval_const_expressions_mutator (node=0x1cc1140, 
context=0x7feffa30108) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/util/clauses.c:3312
#10 0x0155284c in eval_const_expressions (root=0x1dcdca0, 
node=0x1cc1140) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/util/clauses.c:2034
#11 0x01523134 in preprocess_expression (root=0x1dcdca0, 
expr=0x1cc1140, kind=) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/plan/planner.c:1088
#12 0x0152ed3c in subquery_planner (glob=, 
parse=0x1cc0350, parent_root=, hasRecursion=, 
tuple_fraction=0) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/plan/planner.c:765
#13 0x01531afc in standard_planner (parse=0x1cc0350, 
query_string=, cursorOptions=, boundParams=0x0) 
at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/optimizer/plan/planner.c:402
#14 0x01696d6c in pg_plan_query (querytree=0x1cc0350, 
query_string=0x1cbf340 "select ('123'::jsonb)['a'];", 
cursorOptions=, boundParams=0x0) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c:876
#15 0x01696f14 in pg_plan_queries (querytrees=0x1dcdbb0, 
query_string=0x1cbf340 "select ('123'::jsonb)['a'];", 
cursorOptions=, boundParams=0x0) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c:967
#16 0x016976e4 in exec_simple_query (query_string=0x1cbf340 "select 
('123'::jsonb)['a'];") at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c:1159
#17 0x0169a0e0 in PostgresMain (argc=, argv=, dbname=, username=) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/tcop/postgres.c:4394
#18 0x015a94ec in BackendRun (port=0x1ce4000) at 
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4484
#19 BackendStartup 

Re: Why does creating logical replication subscriptions require superuser?

2021-01-31 Thread Noah Misch
On Fri, Jan 22, 2021 at 02:08:02PM -0800, Paul Martinez wrote:
> Some of the original justifications for requiring superuser to create
> subscriptions include:
> - Replication inherently adds significant network traffic and extra background
>   process, and we wouldn't want unprivileged users to be able to add such
>   drastic load to then database.

I think you're referring to these messages:

  
https://postgr.es/m/CA+TgmoahEoM2zZO71yv4883HFarXcBcOs3if6fEdRcRs8Fs=z...@mail.gmail.com
  
https://postgr.es/m/ca+tgmobqxge_7dcx1_dv8+kaf3neoe5sy4nxgb9ayfm5ydj...@mail.gmail.com

A permanent background process bypasses authentication, so mechanisms like
pg_hba.conf and expiration of the auth SSL certificate don't stop access.
Like that thread discussed, this justifies some privilege enforcement.
(Autovacuum also bypasses authentication, but those are less predictable.)

Since we already let users drive the database to out-of-memory, I wouldn't
worry about load.  In other words, the quantity of network traffic and number
of background processes don't matter, just the act of allowing them at all.

> - Running the apply process as a superuser drastically simplifies the number
>   of possible errors that might arise due to not having sufficient permissions
>   to write to target tables, and may have simplified the initial
>   implementation.

I think you're referring to this:

https://postgr.es/m/ca+tgmoye1x21zlycqovl-kdd9djsvz4v8nnmfdscjrwv9qf...@mail.gmail.com
 wrote:
> It seems more likely that there is a person whose job it is to set up
> replication but who doesn't normally interact with the table data
> itself.  In that kind of case, you just want to give the person
> permission to create subscriptions, without needing to also give them
> lots of privileges on individual tables (and maybe having whatever
> they are trying to do fail if you miss a table someplace).

Exposure to permission checks is a chief benefit of doing anything as a
non-superuser, so I disagree with this.  (I've bcc'd the author of that
message, in case he wants to comment.)  We could add a pg_write_any_table
special role.  DBAs should be more cautious granting pg_write_any_table than
granting subscription privilege.  (For this use case, grant both.)
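
If we went that way, the administrator-facing side might look something like
the sketch below; note that both pg_write_any_table and a grantable
subscription privilege are hypothetical here, not existing features:

GRANT pg_write_any_table TO subscription_admin;  -- hypothetical predefined role
-- plus whatever mechanism ends up granting the subscription privilege itself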

> - Subscriptions store plaintext passwords, which are sensitive, and we
>   wouldn't want unprivileged users to see these. Only allowing superusers
>   to create subscriptions and view the subconninfo column is a way to resolve
>   this.

pg_user_mapping.umoptions has the same security considerations; one should be
able to protect it and subconninfo roughly the same way.

> Are there any other major reasons that we require superuser?

As another prerequisite for non-superuser-owned subscriptions, the connection
to the publisher must enforce the equivalent of dblink_security_check().

> Notably one may
> wonder why we didn't check for the REPLICATION attribute, but even replication
> users could run into table ownership and access issues.

REPLICATION represents the authority to read all bytes of the data directory.
Compared to the implications of starting a subscriber, REPLICATION carries a
lot of power.  I would not reuse REPLICATION here.

> One message in the thread mentioned somehow tricking Postgres into replicating
> system catalog tables:
> 
> https://www.postgresql.org/message-id/109201553163096%40myt5-68ad52a76c91.qloud-c.yandex.net
> 
> I'm unsure whether this is allowed by default, but it doesn't seem like too
> much trouble to run a modified publisher node that does allow it. Then
> something like 'UPDATE pg_authid SET rolsuper = TRUE' could result in 
> privilege
> escalation on the subscriber side. But, again, if the apply process isn't
> running as superuser, then presumably applying the change in the subscriber
> would fail, stopping replication, and neutralizing the attack.

This is a special case of the need for ordinary ACL checks in the subscriber.
Treating system catalogs differently would be insufficient and unnecessary.

Thanks,
nm




Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-31 Thread Peter Geoghegan
On Sat, Jan 30, 2021 at 5:39 PM Peter Geoghegan  wrote:
> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

I should point out that hint bits in heap pages are really not like
LP_DEAD bits in indexes -- if they're similar at all then the
similarity is only superficial/mechanistic. In fact, the term "hint
bits in indexes" does not seem useful at all to me, for this reason.

Heap hint bits indicate whether or not the xmin or xmax in a heap
tuple header committed or aborted. We cache the commit or abort status
of one particular XID in the heap tuple header. Notably, this
information alone tells us nothing about whether or not the tuple
should be visible to any given MVCC snapshot. Except perhaps in cases
involving aborted transactions -- but that "exception" is just a
limited special case (and less than 1% of transactions are aborted in
almost all environments anyway).

In contrast, a set LP_DEAD bit in an index is all the information we
need to know that the tuple is dead, and can be ignored completely
(except during hot standby, where at least today we assume nothing
about the status of the tuple, since that would be unsafe). Generally
speaking, the index page LP_DEAD bit is "direct" visibility
information about the tuple, not information about XIDs that are
stored in the tuple header. So a set LP_DEAD bit in an index is
actually like an LP_DEAD-set line pointer in the heap (that's the
closest equivalent in the heap, by far). It's also like a frozen heap
tuple (except it's dead-to-all, not live-to-all).

The difference may be subtle, but it's important here -- it justifies
inventing a whole new type of LP_DEAD-style status bit that gets set
only on standbys. Even today, heap tuples can have hint bits
"independently" set on standbys, subject to certain limitations needed
to avoid breaking things like data page checksums. Hint bits are
ultimately just a thing that remembers the status of transactions that
are known committed or aborted, and so can be set immediately after
the relevant xact commits/aborts (at least on the primary, during
original execution). A long-held MVCC snapshot is never a reason to
not set a hint bit in a heap tuple header (during original execution
or during hot standby/recovery). Of course, a long-held MVCC snapshot
*is* often a reason why we cannot set an LP_DEAD bit in an index.

Conclusion: The whole minRecoveryPoint question that you're trying to
answer to improve things for your patch is just the wrong question.
Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
would be useful to set "true hint bits" on standbys earlier, and maybe
thinking about minRecoveryPoint would help with that problem, but that
doesn't help your index-related patch -- because indexes simply don't
have true hint bits.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-31 Thread Alexander Korotkov
On Fri, Jan 29, 2021 at 7:01 PM Alexander Korotkov  wrote:
> On Thu, Jan 21, 2021 at 11:14 PM Pavel Stehule  
> wrote:
> >> Looks good, I've applied it, thanks.
> >
> > I tested last set of patches
> >
> > 1. There is no problem with patching and compilation
> > 2. make check-world passed
> > 3. build doc without problems
> > 4. I have not any objections against implemented functionality, 
> > implementation and tests
> >
> > I'll mark this patch as ready for committers
> >
> > Thank you for your work. It will be nice feature
>
> I've skimmed through the thread, it seems that consensus over
> functionality is reached.  Patchset themself looks good for me.  I'm
> going to push this if no objections.

Pushed with minor cleanup.
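
For anyone catching up on the thread, the committed feature allows roughly the
following (a quick sketch, not an exhaustive description; table and column
names are illustrative):

SELECT ('{"a": {"b": 42}}'::jsonb)['a']['b'];               -- returns 42
UPDATE docs SET payload['tags'][0] = '"new"' WHERE id = 1;  -- subscript assignment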

--
Regards,
Alexander Korotkov




Re: Added schema level support for publication.

2021-01-31 Thread vignesh C
On Fri, Jan 22, 2021 at 10:01 AM vignesh C  wrote:
>
> Thanks Rahila for your comments. Please find my thoughts below:
>
> On Wed, Jan 20, 2021 at 6:27 PM Rahila Syed  wrote:
> >
> > Hi Vignesh,
> >
> >>
> >> I have handled the above scenario(drop schema should automatically
> >> remove the schema entry from publication schema relation) & addition
> >> of tests in the new v2 patch attached.
> >> Thoughts?
> >
> >
> > Please see some initial comments:
> >
> > 1. I think there should be more tests to show that the schema data is 
> > actually replicated
> > to the subscriber.  Currently, I am not seeing the data being replicated 
> > when I use FOR SCHEMA.
> >
> I will fix this issue and include more tests in my next version of the patch.

Modified to handle this and also added a few more tests.

> > 2. How does replication behave when a table is added or removed from a 
> > subscribed schema
> > using ALTER TABLE SET SCHEMA?
> >
> I would like to keep the behavior similar to the table behavior. I
> will post more details for this along with my next version of the
> patch.
>

If a table is moved to a different schema, after the schema change the
table data will not be sent to the subscriber.
When a new table is added to the published schema, the publisher will send
the table data, but the subscriber will not apply the changes. If the
change needs to be reflected, the subscriber's publication should be
refreshed using "alter subscription mysub1 refresh publication"; the
relation will then be reflected on the subscriber.
If a table is dropped, there is no impact on the subscriber; the relation
will be present in pg_subscription_rel after refreshing the subscriber's
publication.
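
To make the above concrete, a rough example of the flow with this patch
(object names are only illustrative; the syntax is what the patch proposes,
not committed grammar):

-- On the publisher:
CREATE PUBLICATION sch_pub FOR SCHEMA myschema;
CREATE TABLE myschema.new_tbl (id int PRIMARY KEY);  -- added later

-- On the subscriber, to pick up the newly added table:
ALTER SUBSCRIPTION mysub1 REFRESH PUBLICATION;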

> > 3. Can we have a default schema like a public or current schema that gets 
> > replicated in case the user didn't
> > specify one, this can be handy to replicate current schema tables.
> >
> It looks like a good use case, I will check on the feasibility of this
> and try to implement this.

This can be done, I will handle this later.

> > 4. +   The fourth, fifth and sixth variants change which schemas are part 
> > of the
> > +   publication.  The SET TABLE clause will replace the 
> > list
> > +   of schemas in the publication with the specified one.  The ADD
> >
> > There is a typo above s/SET TABLE/SET SCHEMA
> I will fix this in the next version of the patch.

Modified it.
I have separated the tests and documentation into a separate patch to
make review easier. Attached v3 patch with the fixes.
Thoughts?

Regards,
Vignesh
From 72255fcb52cfba2d284b8402b64c118160b35b9f Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 31 Jan 2021 22:47:38 +0530
Subject: [PATCH v3 1/2] Added schema level support for publication.

This patch adds schema level support for publication.  Users can specify multiple
schemas with the schema option. When a user specifies the schema option, the tables
present in the specified schemas will be selected by the publisher for sending the
data to the subscriber.
---
 src/backend/catalog/Makefile|   4 +-
 src/backend/catalog/aclchk.c|   2 +
 src/backend/catalog/dependency.c|   9 +
 src/backend/catalog/objectaddress.c | 138 +++
 src/backend/catalog/pg_publication.c| 134 +-
 src/backend/commands/alter.c|   1 +
 src/backend/commands/event_trigger.c|   4 +
 src/backend/commands/publicationcmds.c  | 266 +++-
 src/backend/commands/seclabel.c |   1 +
 src/backend/commands/tablecmds.c|   1 +
 src/backend/parser/gram.y   |  83 ++---
 src/backend/replication/pgoutput/pgoutput.c |  12 ++
 src/backend/utils/cache/syscache.c  |  23 +++
 src/bin/pg_dump/common.c|   3 +
 src/bin/pg_dump/pg_backup_archiver.c|   3 +-
 src/bin/pg_dump/pg_dump.c   | 155 +++-
 src/bin/pg_dump/pg_dump.h   |  17 ++
 src/bin/pg_dump/pg_dump_sort.c  |   7 +
 src/bin/psql/describe.c | 110 +++-
 src/include/catalog/dependency.h|   1 +
 src/include/catalog/pg_publication.h|  16 +-
 src/include/catalog/pg_publication_schema.h |  49 +
 src/include/commands/publicationcmds.h  |   1 +
 src/include/nodes/parsenodes.h  |   3 +
 src/include/utils/syscache.h|   2 +
 src/test/regress/expected/publication.out   | 100 +--
 src/test/regress/expected/sanity_check.out  |   1 +
 27 files changed, 1055 insertions(+), 91 deletions(-)
 create mode 100644 src/include/catalog/pg_publication_schema.h

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index c85f0ca..dc8a9eb 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -67,8 +67,8 @@ CATALOG_HEADERS := \
 	pg_foreign_table.h pg_policy.h pg_replication_origin.h \
 	pg_default_acl.h pg_init_privs.h 

Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Tom Lane
Greg Stark  writes:
> I wonder if there isn't room to handle this the other way around: to
> configure Postgres to not need a CREATE ROLE for every role, but to
> delegate user management to the external authentication service.

> So Postgres would consider the actual role to be the one Kerberos said
> it was, even if that role didn't exist in pg_role. Presumably you would
> want to delegate to a corresponding authorization system as well, so if
> the role was absent from pg_role (or, more likely, fit some pattern),
> Postgres would ignore pg_role and consult whatever authorization system
> is configured, like AD or whatever people use with Kerberos these days.

This doesn't sound particularly workable: how would you manage
inside-the-database permissions?  Kerberos isn't going to know
what "view foo" is, let alone know whether you should be allowed
to read or write it.  So ISTM there has to be a role to hold
those permissions.  Certainly, you could allow multiple external
identities to share a role ... but that works today.
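
As a rough illustration of that last point (role, table, realm, and map name
below are all made up): one database role holds the in-database permissions,
and pg_ident.conf maps several external identities onto it.

  -- one role carries the privileges inside the database
  CREATE ROLE reporting LOGIN;
  GRANT SELECT ON foo TO reporting;

  -- pg_ident.conf then maps multiple Kerberos principals to that role:
  --   # MAPNAME   SYSTEM-USERNAME       PG-USERNAME
  --   krbmap      alice@EXAMPLE.COM     reporting
  --   krbmap      bob@EXAMPLE.COM       reporting
  -- together with a matching map=krbmap option on the gss line in pg_hba.conf.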

regards, tom lane




Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Tom Lane
Magnus Hagander  writes:
> On Sat, Jan 30, 2021 at 12:40 AM Tom Lane  wrote:
>> I remain concerned about the cost and inconvenience of exposing
>> it via log_line_prefix, but at least that shouldn't be visible
>> to anyone who's not entitled to know who's logged in ...

> What if we logged it as part of log_connection=on, but only there and
> only once? It could still be traced through the rest of that sessions
> logging using the fields identifying the session, and we'd only end up
> logging it once.

I'm certainly fine with including this info in the log_connection output.
Perhaps it'd also be good to have a superuser-only column in
pg_stat_activity, or some other restricted way to get the info from an
existing session.  I doubt we really want a log_line_prefix option.

regards, tom lane




Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Greg Stark
On Fri, 29 Jan 2021 at 18:41, Tom Lane  wrote:
>
> Ah.  So basically, this comes into play when you consider that some
> outside-the-database entity is your "real" authenticated identity.
> That seems reasonable when using Kerberos or the like, though it's
> not real meaningful for traditional password-type authentication.
> I'd misunderstood your point before.

I wonder if there isn't room to handle this the other way around: to
configure Postgres to not need a CREATE ROLE for every role, but to
delegate user management to the external authentication service.

So Postgres would consider the actual role to be the one Kerberos said
it was, even if that role didn't exist in pg_role. Presumably you would
want to delegate to a corresponding authorization system as well, so if
the role was absent from pg_role (or, more likely, fit some pattern),
Postgres would ignore pg_role and consult whatever authorization system
is configured, like AD or whatever people use with Kerberos these days.


-- 
greg




[PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)

2021-01-31 Thread Joel Jacobson
Please ignore the previous email; the attached file was 0 bytes.
Here comes the patch again, now including the data.

--

Doc: improve documentation of oid columns that can be zero.

pg_attribute.atttypid
Zero if column is dropped.

pg_class.relam
Can be zero, e.g. for views.

pg_depend.classid
Zero for pinned objects.

pg_language.lanplcallfoid
Zero for internal languages.

pg_operator.oprcode
Zero if none.

pg_operator.oprcom
Zero if none.

pg_operator.oprjoin
Zero if none.

pg_operator.oprnegate
Zero if none.

pg_operator.oprrest
Zero if none.

pg_operator.oprresult
Zero if none.

pg_policy.polroles
Array with a zero element if none.

pg_shdepend.classid
Zero for pinned objects (deptype='p'),
meaning there is no dependent object.

pg_shdepend.objid
Zero if none.

pg_trigger.tgconstrindid
Zero if none.

pg_trigger.tgconstrrelid
Zero if none.

document-all-zero-if-none-cases.patch
Description: Binary data


[PATCH] Doc: improve documentation of oid columns that can be zero.

2021-01-31 Thread Joel Jacobson
Doc: improve documentation of oid columns that can be zero.

pg_attribute.atttypid
Zero if column is dropped.

pg_class.relam
Can be zero, e.g. for views.

pg_depend.classid
Zero for pinned objects.

pg_language.lanplcallfoid
Zero for internal languages.

pg_operator.oprcode
Zero if none.

pg_operator.oprcom
Zero if none.

pg_operator.oprjoin
Zero if none.

pg_operator.oprnegate
Zero if none.

pg_operator.oprrest
Zero if none.

pg_operator.oprresult
Zero if none.

pg_policy.polroles
Array with a zero element if none.

pg_shdepend.classid
Zero for pinned objects (deptype='p'),
meaning there is no dependent object.

pg_shdepend.objid
Zero if none.

pg_trigger.tgconstrindid
Zero if none.

pg_trigger.tgconstrrelid
Zero if none.

doc/src/sgml/catalogs.sgml | 34 ++
1 file changed, 18 insertions(+), 16 deletions(-)

document-all-zero-if-none-cases.patch
Description: Binary data


Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Magnus Hagander
On Sat, Jan 30, 2021 at 12:21 AM Jacob Champion  wrote:
>
> On Fri, 2021-01-29 at 17:01 -0500, Stephen Frost wrote:
> > > - for LDAP, the bind DN is discarded entirely;
> >
> > We don't support pg_ident.conf-style entries for LDAP, meaning that the
> > user provided has to match what we check, so I'm not sure what would be
> > improved with this change..?
>
> For simple binds, this gives you almost nothing. For bind+search,
> logging the actual bind DN is still important, in my opinion, since the
> mechanism for determining it is more opaque (and may change over time).

Yeah, that's definitely a piece of information that can be hard to get at today.


> (There's also the fact that I think pg_ident mapping for LDAP would be
> just as useful as it is for GSS or certs. That's for a different
> conversation.)

Specifically for search+bind, I would assume?


> > I'm also just generally not thrilled with
> > putting much effort into LDAP as it's a demonstrably insecure
> > authentication mechanism.
>
> Because Postgres has to proxy the password? Or is there something else?

Stephen is on a bit of a crusade against LDAP :) Mostly for good
reasons, of course. A large number of those who choose LDAP also have a
Kerberos system (because, say, Active Directory), and they pick LDAP
only because they think it's good, not because it is...

But yes, I think the enforced cleartext password proxying is at the
core of the problem. LDAP also encourages the idea of centralized
password-reuse, which is not exactly a great thing for security.

That said, I don't think either of those are reasons not to improve on
LDAP. It can certainly be a reason for somebody not to want to spend
their own time on it, but there's no reason it should prevent
improvements.


> > > I propose that every auth method should store the string it uses to
> > > identify a user -- what I'll call an "authenticated identity" -- into
> > > one central location in Port, after authentication succeeds but before
> > > any pg_ident authorization occurs. This field can then be exposed in
> > > log_line_prefix. (It could additionally be exposed through a catalog
> > > table or SQL function, if that were deemed useful.) This would let a
> > > DBA more easily audit user activity when using more complicated
> > > pg_ident setups.
> >
> > This seems like it would be good to include the CSV format log files
> > also.
>
> Agreed in principle... Is the CSV format configurable? Forcing it into
> CSV logs by default seems like it'd be a hard sell, especially for
> people not using pg_ident.

For CSV, all columns are always included, and that's a feature -- it
makes the format predictable.

To make it optional, there would have to be a configuration parameter that
turns the field into an empty one, but the column itself should still be there.
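
For context, a minimal sketch of how csvlog gets enabled today (existing
settings only; the authenticated-identity field itself is still hypothetical):

  ALTER SYSTEM SET logging_collector = on;      -- needs a server restart
  ALTER SYSTEM SET log_destination = 'csvlog';
  SELECT pg_reload_conf();

Any new field would then show up in every csvlog row, empty or not.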


> > For some auth methods, eg: GSS, we've recently added information into
> > the authentication method which logs what the authenticated identity
> > was.  The advantage with that approach is that it avoids bloating the
> > log by only logging that information once upon connection rather than
> > on every log line...  I wonder if we should be focusing on a similar
> > approach for other pg_ident.conf use-cases instead of having it via
> > log_line_prefix, as the latter means we'd be logging the same value over
> > and over again on every log line.
>
> As long as the identity can be easily logged and reviewed by DBAs, I'm
> happy.

Yeah, per my previous mail, I think this is a better way -- make it
part of log_connections. But it would be good to find a way to log it
the same way for all of them -- rather than in slightly different ways
depending on the authentication method.

With that, I think it would also be useful to have the information
available from within the system -- either as a column in pg_stat_activity
or maybe just as a function like pg_get_authenticated_identity(), since it
might be something that's interesting to a smallish subset of users (but
very interesting to those).
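
Neither interface exists today, but purely as a sketch of what the two
options might look like from SQL (the column name is made up, the function
name is just the suggestion above):

  -- option 1 (hypothetical): a restricted column in pg_stat_activity
  SELECT pid, usename, authenticated_identity FROM pg_stat_activity;

  -- option 2 (hypothetical): a function reporting it for the current session
  SELECT pg_get_authenticated_identity();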

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Proposal: Save user's original authenticated identity for logging

2021-01-31 Thread Magnus Hagander
On Sat, Jan 30, 2021 at 12:40 AM Tom Lane  wrote:
>
> Jacob Champion  writes:
> > On Fri, 2021-01-29 at 17:30 -0500, Tom Lane wrote:
> >> What happens if ALTER USER RENAME is done while the session is still
> >> alive?
>
> > IMO the authenticated identity should be write-once. Especially since
> > one of my goals is to have greater auditability into events as they've
> > actually happened. So ALTER USER RENAME should have no effect.
>
> > This also doesn't really affect third-party auth methods. If I'm bound
> > as pchamp...@example.com and a superuser changes my username to tlane,
> > you _definitely_ don't want to see my authenticated identity change to
> > tl...@example.com. That's not who I am.
>
> Ah.  So basically, this comes into play when you consider that some
> outside-the-database entity is your "real" authenticated identity.
> That seems reasonable when using Kerberos or the like, though it's
> not real meaningful for traditional password-type authentication.

I think the use cases where it's relevant are a relatively close match
to the use cases where we support user mapping in pg_ident.conf. There
is a small exception for LDAP search+bind, since it's a two-step
operation and the interesting part would be in the middle step, but I'm
not sure there is any case beyond those where it adds a lot of value.


> I'd misunderstood your point before.
>
> So, if we store this "real" identity, is there any security issue
> involved in exposing it to other users (via pg_stat_activity or
> whatever)?

I'd say it might. It might for example reveal where in a hierarchical
authentication setup your "real identity" lives. I think it'd at least
have to be limited to superusers.


> I remain concerned about the cost and inconvenience of exposing
> it via log_line_prefix, but at least that shouldn't be visible
> to anyone who's not entitled to know who's logged in ...

What if we logged it as part of log_connection=on, but only there and
only once? It could still be traced through the rest of that sessions
logging using the fields identifying the session, and we'd only end up
logging it once.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/