RE: Build-farm - intermittent error in 031_column_list.pl

2022-05-31 Thread osumi.takami...@fujitsu.com
On Thursday, May 26, 2022 11:37 AM Amit Kapila  wrote:
> On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
>  wrote:
> >
> > On 5/25/22 13:26, Amit Kapila wrote:
> > > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> > >  wrote:
> > >>
> > >> It does "fix" the case of [1].  But AFAIS
> > >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> > >> loading in get_rel_sync_entry, and the function doesn't seem to be
> > >> assumed to return an invalid entry. (Since the flag is not checked
> > >> anywhere else.)
> > >>
> > >> For example, pgoutput_change does not check the flag of the
> > >> entry returned from the function before using it, which does not
> > >> seem safe. (I didn't check further, though.)
> > >>
> > >> Don't we need to explicitly avoid using invalid entries outside the
> > >> function?
> > >>
> > >
> > > We decide that based on pubactions in the callers, so even if the
> > > entry is valid, it won't do anything.  Actually, we don't need to avoid
> > > setting the replicate_valid flag, as some of the publications for the
> > > table may be already present. We can check if the publications_valid
> > > flag is set while trying to validate the entry. Now, even if we
> > > don't find any publications the replicate_valid flag will be set but
> > > none of the actions will be set, so it won't do anything in the
> > > caller. Is this better than the previous approach?
> > >
> >
> > For the record, I'm not convinced this is the right way to fix the
> > issue, as it may easily mask the real problem.
> >
> > We do silently ignore missing objects in various places, but only when
> > either requested or when it's obvious it's expected and safe to ignore.
> > But I'm not sure that applies here, in a clean way.
> >
> > Imagine you have a subscriber using two publications p1 and p2, and
> > someone comes around and drops p1 by mistake. With the proposed patch,
> > the subscription will notice this, but it'll continue sending data
> > ignoring the missing publication. Yes, it will continue working, but
> > it's quite possible this breaks the subscriber, and it'd be better to
> > fail and stop replicating.
> >
> 
> Ideally, shouldn't we disallow dropping a publication when it is part
> of some subscription? I know it will be tricky because some
> subscriptions could be disabled.
> 
> > The other aspect I dislike is that we just stop caching publication
> > info, forcing us to reload it for every replicated change/row. So even
> > if dropping the publication happens not to "break" the subscriber (i.e.
> > the data makes sense), this may easily cause performance issues, lag
> > in the replication, and so on. And the users will have no idea why
> > and/or how to fix it, because we just do this silently.
> >
> 
> Yes, it is true that if there are missing publications, we need to reload
> all the publication info again unless we build a mechanism to change the
> existing cached entry by loading only the required info. The other thing
> we could do here is to LOG the missing publications to make users aware
> of the fact. I think we can also introduce a new option while
> defining/altering a subscription to indicate whether to continue on a
> missing publication or not; that way, by default we will stop replication
> as we do now, but users will have a way to keep replication going.
> 
> BTW, what other options do we have to fix the cases where replication is
> broken (or the user has no clue on how to proceed), such as the case we are
> discussing here or the other case the OP reported on pgsql-bugs [1]?
Hi, 


FYI, I've noticed that after the last report by Peter-san,
we've gotten the same errors on the buildfarm.
We need to keep discussing this to reach a conclusion.


1. Details for system "xenodermus" failure at stage subscriptionCheck,
snapshot taken 2022-05-31 13:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-05-31%2013%3A00%3A04


2. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-05-26 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-05-26%2017%3A30%3A04


Best Regards,
Takamichi Osumi



Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-25 Thread Amit Kapila
On Wed, May 25, 2022 at 6:54 PM Tomas Vondra
 wrote:
>
> On 5/25/22 13:26, Amit Kapila wrote:
> > On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
> >  wrote:
> >>
> >> It does "fix" the case of [1].  But AFAIS
> >> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> >> loading in get_rel_sync_entry, and the function doesn't seem to be
> >> assumed to return an invalid entry. (Since the flag is not checked
> >> anywhere else.)
> >>
> >> For example, pgoutput_change does not check the flag of the entry
> >> returned from the function before using it, which does not seem
> >> safe. (I didn't check further, though.)
> >>
> >> Don't we need to explicitly avoid using invalid entries outside the
> >> function?
> >>
> >
> > We decide that based on pubactions in the callers, so even if the entry
> > is valid, it won't do anything.  Actually, we don't need to avoid setting
> > the replicate_valid flag, as some of the publications for the table may
> > be already present. We can check if the publications_valid flag is set
> > while trying to validate the entry. Now, even if we don't find any
> > publications the replicate_valid flag will be set but none of the
> > actions will be set, so it won't do anything in the caller. Is this
> > better than the previous approach?
> >
>
> For the record, I'm not convinced this is the right way to fix the
> issue, as it may easily mask the real problem.
>
> We do silently ignore missing objects in various places, but only when
> either requested or when it's obvious it's expected and safe to ignore.
> But I'm not sure that applies here, in a clean way.
>
> Imagine you have a subscriber using two publications p1 and p2, and
> someone comes around and drops p1 by mistake. With the proposed patch,
> the subscription will notice this, but it'll continue sending data
> ignoring the missing publication. Yes, it will continue working, but
> it's quite possible this breaks the subscriber, and it'd be better to
> fail and stop replicating.
>

Ideally, shouldn't we disallow dropping a publication when it is part
of some subscription? I know it will be tricky because some
subscriptions could be disabled.

> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row. So even
> if dropping the publication happens not to "break" the subscriber (i.e.
> the data makes sense), this may easily cause performance issues, lag in
> the replication, and so on. And the users will have no idea why and/or
> how to fix it, because we just do this silently.
>

Yes, it is true that if there are missing publications, we need to
reload all the publication info again unless we build a mechanism to
change the existing cached entry by loading only the required info.
The other thing we could do here is to LOG the missing publications
to make users aware of the fact. I think we can also introduce a new
option while defining/altering a subscription to indicate whether to
continue on a missing publication or not; that way, by default we
will stop replication as we do now, but users will have a way to keep
replication going.

BTW, what other options do we have to fix the cases where replication
is broken (or the user has no clue on how to proceed), such as the case
we are discussing here or the other case the OP reported on
pgsql-bugs [1]?

[1] - 
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-25 Thread Tom Lane
Tomas Vondra  writes:
> The other aspect I dislike is that we just stop caching publication
> info, forcing us to reload it for every replicated change/row.

Ouch --- that seems likely to be completely horrid.  At the very least
I'd want to see some benchmark numbers before concluding we could
live with that.

regards, tom lane
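The caching cost raised in the two messages above can be seen in a toy model. This is a minimal Python sketch, not PostgreSQL's C code (the real logic lives in pgoutput.c's get_rel_sync_entry); the function and field names are simplified stand-ins, and the lookup is simulated by a counter:

```python
# Toy model of the RelationSyncEntry cache: if an entry is never marked
# valid (e.g. because a publication is missing), every decoded change
# repeats the expensive publication lookup instead of using the cache.

lookup_count = 0

def get_rel_sync_entry(cache, relid, mark_valid):
    global lookup_count
    entry = cache.get(relid)
    if entry is not None and entry["valid"]:
        return entry  # fast path: cached and valid
    lookup_count += 1  # simulates reloading publication info from catalogs
    entry = {"valid": mark_valid}
    cache[relid] = entry
    return entry

# Missing publication: the entry never becomes valid, so each of the
# 1000 simulated changes reloads the publication info.
cache = {}
for _ in range(1000):
    get_rel_sync_entry(cache, 16384, mark_valid=False)
print(lookup_count)  # 1000

# Publication present: one reload, then the cache serves every change.
lookup_count, cache = 0, {}
for _ in range(1000):
    get_rel_sync_entry(cache, 16384, mark_valid=True)
print(lookup_count)  # 1
```

The 1000x difference in lookups is what motivates the request for benchmark numbers before accepting a fix that keeps entries invalid.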




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-25 Thread Tomas Vondra
On 5/25/22 13:26, Amit Kapila wrote:
> On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
>  wrote:
>>
>> It does "fix" the case of [1].  But AFAIS
>> RelationSyncEntry.replicate_valid is only used to inhibit repeated
>> loading in get_rel_sync_entry, and the function doesn't seem to be
>> assumed to return an invalid entry. (Since the flag is not checked
>> anywhere else.)
>>
>> For example, pgoutput_change does not check the flag of the entry
>> returned from the function before using it, which does not seem
>> safe. (I didn't check further, though.)
>>
>> Don't we need to explicitly avoid using invalid entries outside the
>> function?
>>
> 
> We decide that based on pubactions in the callers, so even if the entry
> is valid, it won't do anything.  Actually, we don't need to avoid setting
> the replicate_valid flag, as some of the publications for the table may
> be already present. We can check if the publications_valid flag is set
> while trying to validate the entry. Now, even if we don't find any
> publications the replicate_valid flag will be set but none of the
> actions will be set, so it won't do anything in the caller. Is this
> better than the previous approach?
> 

For the record, I'm not convinced this is the right way to fix the
issue, as it may easily mask the real problem.

We do silently ignore missing objects in various places, but only when
either requested or when it's obvious it's expected and safe to ignore.
But I'm not sure that applies here, in a clean way.

Imagine you have a subscriber using two publications p1 and p2, and
someone comes around and drops p1 by mistake. With the proposed patch,
the subscription will notice this, but it'll continue sending data
ignoring the missing publication. Yes, it will continue working, but
it's quite possible this breaks the subscriber, and it'd be better to
fail and stop replicating.

The other aspect I dislike is that we just stop caching publication
info, forcing us to reload it for every replicated change/row. So even
if dropping the publication happens not to "break" the subscriber (i.e.
the data makes sense), this may easily cause performance issues, lag in
the replication, and so on. And the users will have no idea why and/or
how to fix it, because we just do this silently.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-25 Thread Amit Kapila
On Wed, May 25, 2022 at 8:16 AM Kyotaro Horiguchi
 wrote:
>
> It does "fix" the case of [1].  But AFAIS
> RelationSyncEntry.replicate_valid is only used to inhibit repeated
> loading in get_rel_sync_entry, and the function doesn't seem to be
> assumed to return an invalid entry. (Since the flag is not checked
> anywhere else.)
>
> For example, pgoutput_change does not check the flag of the entry
> returned from the function before using it, which does not seem
> safe. (I didn't check further, though.)
>
> Don't we need to explicitly avoid using invalid entries outside the
> function?
>

We decide that based on pubactions in the callers, so even if the entry
is valid, it won't do anything.  Actually, we don't need to avoid setting
the replicate_valid flag, as some of the publications for the table may
be already present. We can check if the publications_valid flag is set
while trying to validate the entry. Now, even if we don't find any
publications the replicate_valid flag will be set but none of the
actions will be set, so it won't do anything in the caller. Is this
better than the previous approach?

-- 
With Regards,
Amit Kapila.


skip_missing_pubs.2.patch
Description: Binary data
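The approach Amit describes above (skip missing publications while building the entry, and mark the entry valid only once every publication was found) can be sketched as a minimal Python model. This is purely illustrative and not the attached patch's C code; the names replicate_valid and pubactions follow the thread, while the function signature and the visible_pubs mapping are invented for the sketch:

```python
# Minimal model: missing publications are skipped instead of raising an
# error, and the entry is cached as valid only when nothing was missing,
# so a later call retries the lookup once the snapshot catches up.

from dataclasses import dataclass, field

@dataclass
class RelSyncEntry:
    replicate_valid: bool = False
    pubactions: set = field(default_factory=set)

def get_rel_sync_entry(cache, relid, subscribed_pubs, visible_pubs):
    entry = cache.setdefault(relid, RelSyncEntry())
    if entry.replicate_valid:
        return entry  # cached; no reload needed
    entry.pubactions = set()
    all_found = True
    for pub in subscribed_pubs:
        if pub not in visible_pubs:
            all_found = False  # not visible in the historic snapshot (yet)
            continue
        entry.pubactions |= visible_pubs[pub]
    entry.replicate_valid = all_found
    return entry

cache = {}
# pub9 not yet visible: nothing is published, but no "publication does
# not exist" error is raised, and the entry stays invalid for retry.
e = get_rel_sync_entry(cache, 16384, ["pub9"], {})
assert e.pubactions == set() and not e.replicate_valid

# Once decoding passes CREATE PUBLICATION pub9, the entry is rebuilt
# with the publication's actions and cached as valid.
e = get_rel_sync_entry(cache, 16384, ["pub9"], {"pub9": {"insert", "update"}})
assert e.pubactions == {"insert", "update"} and e.replicate_valid
```

Because pubactions stays empty while publications are missing, the callers (e.g. pgoutput_change) simply publish nothing, which is why the entry need not be rejected as "invalid" outside the function.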


Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-24 Thread Kyotaro Horiguchi
At Tue, 24 May 2022 18:19:45 +0530, Amit Kapila  wrote 
in 
> On Sat, May 21, 2022 at 9:03 AM Amit Kapila  wrote:
> >
> > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> >  wrote:
> >
> > > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > > seems quite expensive.
> > >
> >
> > Yet another option could be that we continue using a historic snapshot
> > but ignore publications that are not found for the purpose of
> > computing RelSyncEntry attributes. We won't mark such an entry as
> > valid till all the publications are loaded without anything missing. I
> > think such cases in practice won't be enough to matter. This means we
> > won't publish operations on tables corresponding to that publication
> > till we find such a publication, and that seems okay.
> >
> 
> Attached, find the patch to show what I have in mind for this. Today,
> we have received a bug report with a similar symptom [1], and that
> should also be fixed with this.
> 
> Thoughts?
> 
> 
> [1] - 
> https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

It does "fix" the case of [1].  But AFAIS
RelationSyncEntry.replicate_valid is only used to inhibit repeated
loading in get_rel_sync_entry, and the function doesn't seem to be
assumed to return an invalid entry. (Since the flag is not checked
anywhere else.)

For example, pgoutput_change does not check the flag of the entry
returned from the function before using it, which does not seem
safe. (I didn't check further, though.)

Don't we need to explicitly avoid using invalid entries outside the
function?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Build-farm - intermittent error in 031_column_list.pl

2022-05-24 Thread osumi.takami...@fujitsu.com
On Tuesday, May 24, 2022 9:50 PM Amit Kapila  wrote:
> On Sat, May 21, 2022 at 9:03 AM Amit Kapila 
> wrote:
> >
> > On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
> >  wrote:
> >
> > > Also, we'd probably have to ignore RelationSyncEntry for a while,
> > > which seems quite expensive.
> > >
> >
> > Yet another option could be that we continue using a historic snapshot
> > but ignore publications that are not found for the purpose of
> > computing RelSyncEntry attributes. We won't mark such an entry as
> > valid till all the publications are loaded without anything missing. I
> > think such cases in practice won't be enough to matter. This means we
> > won't publish operations on tables corresponding to that publication
> > till we find such a publication, and that seems okay.
> >
> 
> Attached, find the patch to show what I have in mind for this. Today, we have
> received a bug report with a similar symptom [1], and that should also be
> fixed with this.
> 
> Thoughts?
Hi,


I agree with this direction.
I think this approach solves the issue fundamentally
and is better than the first approach of adding several
wait_for_catchup calls to the test, since taking the first one
means we would need to take care to avoid the same issue
whenever we write a new (similar) test, even after the modification.


I've used the patch to check the following:
1. The patch can be applied, and make check-world passes without failure.
2. HEAD with the patch applied passes all tests in src/test/subscription
   (including 031_column_list.pl), after commenting out WalSndWaitForWal's
   call of WalSndKeepalive.
3. The case from the new bug report in the 'How is this possible
   "publication does not exist"' thread has been fixed. FYI, after
   executing the script's function, I also performed an additional insert
   on the publisher, and it was correctly replicated on the subscriber.

Best Regards,
Takamichi Osumi



Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-24 Thread Amit Kapila
On Sat, May 21, 2022 at 9:03 AM Amit Kapila  wrote:
>
> On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
>  wrote:
>
> > Also, we'd probably have to ignore RelationSyncEntry for a while, which
> > seems quite expensive.
> >
>
> Yet another option could be that we continue using a historic snapshot
> but ignore publications that are not found for the purpose of
> computing RelSyncEntry attributes. We won't mark such an entry as
> valid till all the publications are loaded without anything missing. I
> think such cases in practice won't be enough to matter. This means we
> won't publish operations on tables corresponding to that publication
> till we find such a publication, and that seems okay.
>

Attached, find the patch to show what I have in mind for this. Today,
we have received a bug report with a similar symptom [1], and that
should also be fixed with this.

Thoughts?


[1] - 
https://www.postgresql.org/message-id/CANWRaJyyD%3D9c1E2HdF-Tqfe7%2BvuCQnAkXd6%2BEFwxC0wM%3D313AA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


skip_missing_pubs.1.patch
Description: Binary data


Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-20 Thread Amit Kapila
On Fri, May 20, 2022 at 4:01 PM Tomas Vondra
 wrote:
>
> On 5/20/22 05:58, Amit Kapila wrote:
>
> Are we really querying the publications (in get_rel_sync_entry) using
> the historical snapshot?
>

Yes.

> I hadn't really realized this, but yeah, that
> might explain the issue.
>
> The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
> often than any other test (there are ~15 calls, 12 of which are in this
> new test). That might be why we haven't seen failures before. Or maybe
> the existing tests simply are not vulnerable to this,
>

Right, I have checked that the other cases are not vulnerable to this;
otherwise, I think we would have seen intermittent failures by now.
They don't seem to do DML before the creation of a publication,
or they create a subscription pointing to the same publication beforehand.

> because they
> either do wait_for_catchup late enough or don't do any DML right before
> executing SET PUBLICATION.
>
> >>  That time travel seems unintuitive, but it's the
> >> (current) way it works.
> >>
> >
> > I have thought about it but couldn't come up with a good way to change
> > the way it currently works. Moreover, I think it is easy to hit this
> > in other ways as well. Say, you first create a subscription with a
> > non-existent publication and then perform an operation on any unrelated
> > table on the publisher before creating the required publication; we will
> > hit exactly this problem of "publication does not exist", so I think we
> > may need to live with this behavior and write tests carefully.
> >
>
> Yeah, I think it pretty much requires ensuring the subscriber is fully
> caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
> replication in an unrecoverable way (actually, you can alter the
> subscription and remove the publication again, right?).
>

Right.

> But this is not just about tests, of course - the same issue applies to
> regular replication. That's a bit unfortunate, so maybe we should think
> about making this less fragile.
>

Agreed, provided we find some reasonable solution.

> We might make sure the subscriber is not lagging (essentially the
> wait_for_catchup) - which the users will have to do anyway (although
> maybe they know the publisher is beyond the LSN where it was created).
>

This won't work for the case mentioned above where we create a
subscription with non-existent publications, then perform DML and then
'CREATE PUBLICATION'.

> The other option would be to detect such a case, somehow - if you don't
> see the publication yet, see if it exists in current snapshot, and then
> maybe ignore this error. But that has other issues (the publication
> might have been created and dropped, in which case you won't see it).
>

True, the dropped case would again be tricky to deal with and I think
we will end up publishing some operations which are performed before
the publication is even created.

> Also, we'd probably have to ignore RelationSyncEntry for a while, which
> seems quite expensive.
>

Yet another option could be that we continue using a historic snapshot
but ignore publications that are not found for the purpose of
computing RelSyncEntry attributes. We won't mark such an entry as
valid till all the publications are loaded without anything missing. I
think such cases in practice won't be enough to matter. This means we
won't publish operations on tables corresponding to that publication
till we find such a publication, and that seems okay.

-- 
With Regards,
Amit Kapila.




RE: Build-farm - intermittent error in 031_column_list.pl

2022-05-20 Thread osumi.takami...@fujitsu.com
On Thursday, May 19, 2022 8:13 PM Amit Kapila  wrote:
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila 
> wrote:
> >
> > On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith
> > >  wrote in
> > > > Hi hackers.
> > > >
> > > > FYI, I saw that there was a recent buildfarm error on the
> > > > "grison" machine [1].
> > > > [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > > >
> > > > The error happened during the "subscriptionCheck" phase in the TAP
> > > > test t/031_column_list.pl. This test file was added by commit [2].
> > > > [2] https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> > >
> > > 2022-04-17 00:16:04.278 CEST [293659][client
> > > backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE
> > > PUBLICATION pub9 FOR TABLE test_part_d (a) WITH
> > > (publish_via_partition_root = true);
> > > 2022-04-17 00:16:04.279 CEST [293659][client
> > > backend][:0][031_column_list.pl] LOG:  disconnection: session time:
> > > 0:00:00.002 user=bf database=postgres host=[local]
> > >
> > > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > > the session has been disconnected. But the following request for the
> > > same publication fails due to the absence of the publication.
> > >
> > > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1]
> > > STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88
> > > (proto_version '3', publication_names '"pub9"')
> > > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:
> > > publication "pub9" does not exist
> > >
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of the
> > apply worker (which we can see in LOGS as well) and after the restart,
> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?
> >
> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing subscription to point to
> > the new publication.
> >
> 
> If the above theory is correct, then I think allowing the publisher to
> catch up with "$node_publisher->wait_for_catchup('sub1');" before ALTER
> SUBSCRIPTION should fix this problem, because if both publisher and
> subscriber are in sync before the ALTER, the new publication should be
> visible to the walsender.
Hi,


I've attached a patch for the fix proposed here.
First of all, thank you so much for helping me offlist, Amit-san.

I reproduced the failure like [1] by commenting out
WalSndWaitForWal's call of WalSndKeepalive and running the test.
Commenting this out suppresses the advance of the confirmed_flush
location after creating a publication.

In short, my understanding of how the bug happened is:
1. we execute 'create publication pubX' and create one publication.
2. 'alter subscription subY set publication pubX' makes the apply worker exit.
3. the relaunched apply worker searches for pubX, but the slot position
   (confirmed_flush) has not been updated and points to a location before
   the CREATE PUBLICATION on the publisher node.

Applying the attached patch has made the test pass.
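The failure mode summarized in the steps above can be sketched as a toy timeline in Python. The LSN values are invented for illustration; the point is that logical decoding looks up publications with a historic snapshot taken at the LSN of the change being decoded, not "now":

```python
# Toy timeline of the race: the walsender restarts at the slot's
# confirmed_flush position and, while decoding, sees only catalog
# state committed at or before the change being decoded.

CREATE_PUB_LSN = 110   # CREATE PUBLICATION pubX commits here
DML_LSN = 100          # a change written before the publication existed
CONFIRMED_FLUSH = 90   # slot position; origin not advanced before restart

def pub_visible(snapshot_lsn):
    """A historic snapshot sees pubX only if CREATE PUBLICATION
    committed at or before the snapshot's LSN."""
    return snapshot_lsn >= CREATE_PUB_LSN

# After the apply worker restarts, decoding resumes from CONFIRMED_FLUSH
# and reaches the old DML first: pubX is invisible at that point in WAL,
# so the walsender fails with "publication does not exist" and keeps
# failing at the same spot on every retry.
assert not pub_visible(DML_LSN)

# With wait_for_catchup before ALTER SUBSCRIPTION, the slot position has
# advanced past CREATE PUBLICATION, so decoding resumes where pubX is
# already visible.
assert pub_visible(120)
```

This also shows why the error repeats forever: the failing change is always re-decoded at the same LSN, so the snapshot never changes.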


[1] the subscriber's log

2022-05-20 08:56:50.773 UTC [5153] 031_column_list.pl LOG: statement: ALTER SUBSCRIPTION sub1 SET PUBLICATION pub6
2022-05-20 08:56:50.801 UTC [5156] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:50.846 UTC [5112] LOG: logical replication apply worker for subscription "sub1" will restart because of a parameter change
2022-05-20 08:56:50.915 UTC [5158] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
...
2022-05-20 08:56:51.257 UTC [5164] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.353 UTC [5166] LOG: logical replication apply worker for subscription "sub1" has started
2022-05-20 08:56:51.366 UTC [5168] LOG: logical replication table synchronization worker for subscription "sub1", table "test_part_a" has started
2022-05-20 08:56:51.370 UTC [5171] 031_column_list.pl LOG: statement: SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');
2022-05-20 08:56:51.373 UTC [5166] 

Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-20 Thread Tomas Vondra



On 5/20/22 05:58, Amit Kapila wrote:
> On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
>  wrote:
>>
>>>> the apply worker will use the existing slot and replication origin
>>>> corresponding to the subscription. Now, it is possible that before
>>>> restart the origin has not been updated and the WAL start location
>>>> points to a location prior to where PUBLICATION pub9 exists which can
>>>> lead to such an error. Once this error occurs, apply worker will never
>>>> be able to proceed and will always return the same error. Does this
>>>> make sense?
>>
>> Wow. I didn't think along that line. That theory explains the silence and
>> makes sense even though I don't see LSN transitions that clearly
>> support it.  I dimly remember a similar kind of problem..
>>
>>>> Unless you or others see a different theory, this seems to be the
>>>> existing problem in logical replication which is manifested by this
>>>> test. If we just want to fix these test failures, we can create a new
>>>> subscription instead of altering the existing subscription to point to
>>>> the new publication.

>>>
>>> If the above theory is correct then I think allowing the publisher to
>>> catch up with "$node_publisher->wait_for_catchup('sub1');" before
>>> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
>>> both publisher and subscriber are in sync then the new publication
>>> should be visible to WALSender.
>>
>> It looks right to me.
>>
> 
> Let's wait for Tomas or others working in this area to share their thoughts.
> 

Are we really querying the publications (in get_rel_sync_entry) using
the historical snapshot? I hadn't really realized this, but yeah, that
might explain the issue.

The new TAP test does ALTER SUBSCRIPTION ... SET PUBLICATION much more
often than any other test (there are ~15 calls, 12 of which are in this
new test). That might be why we haven't seen failures before. Or maybe
the existing tests simply are not vulnerable to this, because they
either do wait_for_catchup late enough or don't do any DML right before
executing SET PUBLICATION.

>>  That time travel seems unintuitive, but it's the
>> (current) way it works.
>>
> 
> I have thought about it but couldn't come up with a good way to change
> the way it currently works. Moreover, I think it is easy to hit this
> in other ways as well. Say, you first create a subscription with a
> non-existent publication and then perform an operation on any unrelated
> table on the publisher before creating the required publication; we will
> hit exactly this problem of "publication does not exist", so I think we
> may need to live with this behavior and write tests carefully.
> 

Yeah, I think it pretty much requires ensuring the subscriber is fully
caught up with the publisher, otherwise ALTER SUBSCRIPTION may break the
replication in an unrecoverable way (actually, you can alter the
subscription and remove the publication again, right?).

But this is not just about tests, of course - the same issue applies to
regular replication. That's a bit unfortunate, so maybe we should think
about making this less fragile.

We might make sure the subscriber is not lagging (essentially the
wait_for_catchup) - which the users will have to do anyway (although
maybe they know the publisher is beyond the LSN where it was created).

The other option would be to detect such a case, somehow - if you don't
see the publication yet, see if it exists in current snapshot, and then
maybe ignore this error. But that has other issues (the publication
might have been created and dropped, in which case you won't see it).
Also, we'd probably have to ignore RelationSyncEntry for a while, which
seems quite expensive.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
 wrote:
>
> > > the apply worker will use the existing slot and replication origin
> > > corresponding to the subscription. Now, it is possible that before
> > > restart the origin has not been updated and the WAL start location
> > > points to a location prior to where PUBLICATION pub9 exists which can
> > > lead to such an error. Once this error occurs, apply worker will never
> > > be able to proceed and will always return the same error. Does this
> > > make sense?
>
> Wow. I didn't think along that line. That theory explains the silence and
> makes sense even though I don't see LSN transitions that clearly
> support it.  I dimly remember a similar kind of problem..
>
> > > Unless you or others see a different theory, this seems to be the
> > > existing problem in logical replication which is manifested by this
> > > test. If we just want to fix these test failures, we can create a new
> > > subscription instead of altering the existing subscription to point to
> > > the new publication.
> > >
> >
> > If the above theory is correct then I think allowing the publisher to
> > catch up with "$node_publisher->wait_for_catchup('sub1');" before
> > ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> > both publisher and subscriber are in sync then the new publication
> > should be visible to WALSender.
>
> It looks right to me.
>

Let's wait for Tomas or others working in this area to share their thoughts.

>  That time travel seems unintuitive, but it's the
> (current) way it works.
>

I have thought about it but couldn't come up with a good way to change
the way it currently works. Moreover, I think it is easy to hit this
in other ways as well. Say, you first create a subscription with a
non-existent publication and then perform an operation on any unrelated
table on the publisher before creating the required publication; we will
hit exactly this problem of "publication does not exist", so I think we
may need to live with this behavior and write tests carefully.

-- 
With Regards,
Amit Kapila.




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 16:42:31 +0530, Amit Kapila  wrote 
in 
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila  wrote:
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,

Yes.

> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?

Wow. I hadn't thought along that line. That theory explains the silence and
makes sense even though I don't see LSN transitions that clearly
support it.  I dimly remember a similar kind of problem...

> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
> 
> If the above theory is correct then I think allowing the publisher to
> catch up with "$node_publisher->wait_for_catchup('sub1');" before
> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> both publisher and subscriber are in sync then the new publication
> should be visible to WALSender.

It looks right to me.  That time travel seems unintuitive, but it's the
(current) way it works.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Thu, May 19, 2022 at 3:16 PM Amit Kapila  wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  
> > wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" 
> > > machine [1]
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] 
> > > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening in all of them is that a publication created by
> > CREATE PUBLICATION, with no failure reported, is missing for a
> > walsender that came later. It seems that CREATE PUBLICATION can
> > silently fail to create a publication, or that the walsender somehow
> > failed to find the existing one.
> >
>
> Do you see anything in the logs which indicates that CREATE PUBLICATION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client 
> > backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION 
> > pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client 
> > backend][:0][031_column_list.pl] LOG:  disconnection: session time: 
> > 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 in process 293659,
> > then the session is disconnected. But the subsequent request for the
> > same publication fails due to the absence of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
> > START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
> > publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
> > publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be the
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing publication to point to
> the new publication.
>

If the above theory is correct, then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if, before ALTER,
both publisher and subscriber are in sync, then the new publication
should be visible to the walsender.
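As a sketch, the proposed change to the test would look roughly like this; the surrounding statement is illustrative (the exact node variable names come from the existing TAP tests, but this is not the verbatim 031_column_list.pl code):

```perl
# Illustrative sketch of the proposed fix, not the exact test code.
# Ensure the subscriber has replayed everything published so far, so the
# walsender's restart LSN is past the CREATE PUBLICATION pub9 record.
$node_publisher->wait_for_catchup('sub1');

# Only then switch the subscription to the newly created publication.
$node_subscriber->safe_psql('postgres',
	'ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9');
```

With the origin already advanced past the publication's creation, the restarted apply worker cannot request decoding from a point where pub9 did not yet exist.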

-- 
With Regards,
Amit Kapila.




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  wrote 
> in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine 
> > [1]
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] 
> > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> What is happening in all of them is that a publication created by
> CREATE PUBLICATION, with no failure reported, is missing for a
> walsender that came later. It seems that CREATE PUBLICATION can
> silently fail to create a publication, or that the walsender somehow
> failed to find the existing one.
>

Do you see anything in the logs which indicates that CREATE PUBLICATION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client 
> backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION 
> pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] 
> LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres 
> host=[local]
>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 in process 293659,
> then the session is disconnected. But the subsequent request for the
> same publication fails due to the absence of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
> START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
> publication_names '"pub9"')
> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
> publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to restarting of
apply worker (which we can see in LOGS as well) and after the restart,
the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists which can
lead to such an error. Once this error occurs, apply worker will never
be able to proceed and will always return the same error. Does this
make sense?

Unless you or others see a different theory, this seems to be the
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing publication to point to
the new publication.
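The alternative fix mentioned above, creating a fresh subscription rather than altering the existing one, would look roughly like this. The names and connection string here are illustrative, assuming the usual TAP-test setup:

```perl
# Illustrative alternative: a fresh subscription gets a new slot and a new
# replication origin, so its start LSN cannot predate the publication's
# CREATE PUBLICATION record.
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
$node_subscriber->safe_psql('postgres',
	"CREATE SUBSCRIPTION sub9 CONNECTION '$publisher_connstr' PUBLICATION pub9");
$node_publisher->wait_for_catchup('sub9');
```

This sidesteps the stale-origin problem entirely, at the cost of an extra slot and subscription per test step.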

Note: Added Tomas to know his views as he has committed this test.

-- 
With Regards,
Amit Kapila.




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  wrote 
in 
> Hi hackers.
> 
> FYI, I saw that there was a recent Build-farm error on the "grison" machine 
> [1]
> [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> 
> The error happened during "subscriptionCheck" phase in the TAP test
> t/031_column_list.pl
> This test file was added by this [2] commit.
> [2] 
> https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What is happening in all of them is that a publication created by
CREATE PUBLICATION, with no failure reported, is missing for a
walsender that came later. It seems that CREATE PUBLICATION can
silently fail to create a publication, or that the walsender somehow
failed to find the existing one.


> ~~
> 
> I checked the history of fails for that TAP test t/031_column_list.pl
> and found that this same error seems to have been happening
> intermittently for at least the last 50 days.
> 
> Details of similar previous errors from the BF are listed below.
> 
> ~~~
> 
> 1. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-05-18 18:11:45
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45
> 
> [22:02:08] t/029_on_error.pl .. ok25475 ms ( 0.01
> usr  0.00 sys + 15.39 cusr  5.59 csys = 20.99 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #  FROM pg_catalog.pg_stat_replication
> #  WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 22.
> [22:09:25] t/031_column_list.pl ...
> ...
> [22:02:47.887](1.829s) ok 22 - partitions with different replica
> identities not replicated correctly Waiting for replication conn
> sub1's replay_lsn to pass 0/1530588 on publisher
> [22:09:25.395](397.508s) # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #  FROM pg_catalog.pg_stat_replication
> #  WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> timed out waiting for catchup at t/031_column_list.pl line 728.
> ### Stopping node "publisher" using mode immediate

2022-04-17 00:16:04.278 CEST [293659][client 
backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION pub9 
FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] 
LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres 
host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 in process 293659,
then the session is disconnected. But the subsequent request for the
same publication fails due to the absence of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
publication "pub9" does not exist


> ~~~
> 
> 2. Details for system "xenodermus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-16 21:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

The same. pub9 is missing after creation.

> ~~~
> 
> 3. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

The same happens for pub7.

> 4. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

Same. pub7 is missing.

> 5. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-04-03 18:11:39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-04-03%2018%3A11%3A39

Same. pub7 is missing.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Build-farm - intermittent error in 031_column_list.pl

2022-05-18 Thread Peter Smith
Hi hackers.

FYI, I saw that there was a recent Build-farm error on the "grison" machine [1]
[1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD

The error happened during "subscriptionCheck" phase in the TAP test
t/031_column_list.pl
This test file was added by this [2] commit.
[2] 
https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

~~

I checked the history of fails for that TAP test t/031_column_list.pl
and found that this same error seems to have been happening
intermittently for at least the last 50 days.

Details of similar previous errors from the BF are listed below.

~~~

1. Details for system "grison" failure at stage subscriptionCheck,
snapshot taken 2022-05-18 18:11:45
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2022-05-18%2018%3A11%3A45

[22:02:08] t/029_on_error.pl .. ok25475 ms ( 0.01
usr  0.00 sys + 15.39 cusr  5.59 csys = 20.99 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[22:09:25] t/031_column_list.pl ...
...
[22:02:47.887](1.829s) ok 22 - partitions with different replica
identities not replicated correctly Waiting for replication conn
sub1's replay_lsn to pass 0/1530588 on publisher
[22:09:25.395](397.508s) # poll_query_until timed out executing this query:
# SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 728.
### Stopping node "publisher" using mode immediate

~~~

2. Details for system "xenodermus" failure at stage subscriptionCheck,
snapshot taken 2022-04-16 21:00:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-04-16%2021%3A00%3A04

[00:15:32] t/029_on_error.pl .. ok 8278 ms ( 0.00
usr  0.00 sys +  1.33 cusr  0.55 csys =  1.88 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 25.
[00:22:30] t/031_column_list.pl ...
...
[00:16:04.100](0.901s) ok 25 - partitions with different replica
identities not replicated correctly Waiting for replication conn
sub1's replay_lsn to pass 0/1543648 on publisher
[00:22:29.923](385.823s) # poll_query_until timed out executing this query:
# SELECT '0/1543648' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 818.

~~~

3. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

# poll_query_until timed out executing this query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 22.
[20:50:25] t/031_column_list.pl ...
...
ok 22 - partitions with different replica identities not replicated
correctly Waiting for replication conn sub1's replay_lsn to pass
0/1528640 on publisher # poll_query_until timed out executing this
query:
# SELECT '0/1528640' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('sub1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
timed out waiting for catchup at t/031_column_list.pl line 667.

~~~

4. Details for system "phycodurus" failure at stage subscriptionCheck,
snapshot taken 2022-04-05 17:30:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2022-04-05%2017%3A30%3A04

[20:43:04] t/030_sequences.pl . ok11108 ms ( 0.00
usr  0.00 sys +  1.49 cusr  0.40 csys =  1.89 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/1528640' <=