On Wed, 13 Jan 2021 at 13:26, Amit Kapila wrote:
> On Tue, Jan 12, 2021 at 4:59 PM Bharath Rupireddy
> <[email protected]> wrote:
>>
>> On Tue, Jan 12, 2021 at 12:06 PM Amit Kapila <[email protected]> wrote:
>> > > Here's my analysis:
>> > > 1) in the publisher, alter publication drop table successfully
>> > > removes (PublicationDropTables) the table from the catalogue
>> > > pg_publication_rel
>> > > 2) in the subscriber, alter subscription refresh publication
>> > > successfully removes the table from the catalogue pg_subscription_rel
>> > > (AlterSubscription_refresh->RemoveSubscriptionRel)
>> > > so far so good
>> > >
>> >
>> > Here, it should register the worker to stop on commit, and then on
>> > commit it should call AtEOXact_ApplyLauncher to stop the apply worker.
>> > Once the apply worker is stopped, the corresponding WALSender will
>> > also be stopped. Something here is not happening as per expected
>> > behavior.
>>
>> On the subscriber, an entry for worker stop is created in
>> AlterSubscription_refresh --> logicalrep_worker_stop_at_commit. At the end
>> of txn, in AtEOXact_ApplyLauncher, we try to stop that worker, but it cannot
>> be stopped because logicalrep_worker_find returns null
>> (AtEOXact_ApplyLauncher --> logicalrep_worker_stop -->
>> logicalrep_worker_find). The worker entry for that subscriber has
>> relid 0 [1], so the following if condition is never hit, and the
>> apply worker for the subscription on which refresh publication was run
>> is not stopped. It looks like relid 0 is valid, because relid is only
>> applicable during the table sync phase; the comment in the
>> LogicalRepWorker structure says so.
>>
>> Also, I think expecting the apply worker to be stopped this way doesn't
>> make sense, because there is one apply worker per subscription, and the
>> subscription can have other tables too.
>>
>
> Okay, that makes sense. As responded to Li Japin, let's focus on
> figuring out why we are sending the changes from the publisher node in
> some cases and not in other cases.
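To make the lookup failure in the quoted analysis concrete, here is a simplified, hypothetical C model of the matching that logicalrep_worker_find() performs. WorkerSlot and find_worker() are illustrative names, not PostgreSQL's API, and only the fields relevant to the lookup are modelled. The point it shows: a slot matches only when both subid and relid are equal, so a stop request registered for a table's relid never matches the apply worker, whose relid is 0 (InvalidOid).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for a LogicalRepWorker slot. */
typedef struct
{
    bool     in_use;
    unsigned subid;   /* subscription OID */
    unsigned relid;   /* 0 (InvalidOid) for the apply worker,
                       * the table's OID for a tablesync worker */
} WorkerSlot;

/*
 * Simplified model of the lookup: return the first in-use slot whose
 * subid AND relid both match, or NULL if none does.
 */
static WorkerSlot *
find_worker(WorkerSlot *slots, size_t nslots, unsigned subid, unsigned relid)
{
    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        WorkerSlot *w = &slots[i];

        if (w->in_use && w->subid == subid && w->relid == relid)
            return w;
    }
    return NULL;    /* nothing matched, so there is nothing to stop */
}
```

For example, with a single apply-worker slot {in_use = true, subid = 16400, relid = 0} (16400 is a made-up subscription OID), looking up relid 16412 (the localreloid from the gdb output below) returns NULL, while looking up relid 0 returns the apply worker.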
After some analysis, I found that changes to the dropped tables are always
sent to the subscriber.
The difference is that if we drop the table from the publication and then
refresh the publication (on the subscriber), the LogicalRepRelMapEntry checked
in should_apply_changes_for_rel() has its state set to SUBREL_STATE_UNKNOWN.
(gdb) p *rel
$2 = {remoterel = {remoteid = 16410, nspname = 0x5564fb0177c0 "public",
  relname = 0x5564fb0177a0 "t1", natts = 1, attnames = 0x5564fb0177e0,
  atttyps = 0x5564fb017780, replident = 100 'd', relkind = 0 '\000',
  attkeys = 0x0}, localrelvalid = true, localreloid = 16412,
  localrel = 0x7f78705da1b8, attrmap = 0x5564fb017800, updatable = false,
  *state = 0 '\000'*, statelsn = 0}
If we insert data between dropping the table from the publication and
refreshing the publication, the LogicalRepRelMapEntry state is always
SUBREL_STATE_READY.
(gdb) p *rel
$2 = {remoterel = {remoteid = 16410, nspname = 0x5564fb0177c0 "public",
  relname = 0x5564fb0177a0 "t1", natts = 1, attnames = 0x5564fb0177e0,
  atttyps = 0x5564fb017780, replident = 100 'd', relkind = 0 '\000',
  attkeys = 0x0}, localrelvalid = true, localreloid = 16412,
  localrel = 0x7f78705d9d38, attrmap = 0x5564fb017800, updatable = false,
  *state = 114 'r'*, statelsn = 23545672}
I will dig into why the state of the LogicalRepRelMapEntry doesn't change in
the second case.
Any suggestions are welcome!
--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.