On Wed, Feb 3, 2021 at 1:34 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 6:38 AM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2...@gmail.com> wrote:
> > > >
> > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > > the tablesync worker. It seems that doing this can cause the sync
> > > > worker's MyLogicalRepWorker->relid to become invalid.
> > > >
> > >
> > > I think this should be fixed by latest patch because I have disallowed
> > > drop of a table when its synchronization is in progress. You can check
> > > once and let me know if the issue still exists?
> > >
> >
> > FYI - I confirmed that the problem scenario that I reported yesterday
> > is no longer possible because now the V25 patch is disallowing the
> > DROP TABLE while the tablesync is still running.
> >
>
> Thanks for the confirmation. BTW, can you please check if we can
> reproduce that problem without this patch? If so, we might want to
> apply this fix irrespective of this patch. If not, why not?
>

Yes, this was an existing postgres bug. It is independent of the patch.

I can reproduce exactly the same stacktrace using the HEAD src pulled @ 3/Feb.

PSA my test logs showing the details.

----
Kind Regards,
Peter Smith.
Fujitsu Australia
Test Scenario:

1. INSERT data so tablesync should copy something
2. While paused in LogicalRepSyncTableStart do a DROP TABLE to rip the rug out 
from under the sync worker!
3. Continue the sync worker.... see what happens

All code is from PG HEAD (3/Feb) except the "!!>>" added to allow me to attech 
to debugger.

==============

##
## Insert data
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1

##
## SUBSCRIBE and continue to breakpoint at top of tablesync function 
LogicalRepSyncTableStart
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub;"
2021-02-03 13:42:20.031 AEDT [9555] LOG:  logical decoding found consistent 
point at 0/16605E8
2021-02-03 13:42:20.031 AEDT [9555] DETAIL:  There are no running transactions.
2021-02-03 13:42:20.031 AEDT [9555] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2021-02-03 13:42:20.043 AEDT [9556] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-02-03 13:42:20.043 AEDT [9556] LOG:  !!>> The apply worker process has PID 
= 9556
[postgres@CentOS7-x64 ~]$ 2021-02-03 13:42:20.052 AEDT [9562] LOG:  starting 
logical decoding for slot "tap_sub"
2021-02-03 13:42:20.052 AEDT [9562] DETAIL:  Streaming transactions committing 
after 0/1660620, reading WAL from 0/16605E8.
2021-02-03 13:42:20.052 AEDT [9562] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 13:42:20.052 AEDT [9562] LOG:  logical decoding found consistent 
point at 0/16605E8
2021-02-03 13:42:20.052 AEDT [9562] DETAIL:  There are no running transactions.
2021-02-03 13:42:20.052 AEDT [9562] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 13:42:20.056 AEDT [9564] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-02-03 13:42:20.056 AEDT [9564] LOG:  !!>> The tablesync worker process has 
PID = 9564
2021-02-03 13:42:20.056 AEDT [9564] LOG:  !!>>
        
        
        Sleeping 30 secs. For debugging, attach to process 9564 now!
        
## PID 9564 is sync worker
## PID 9556 is apply worker
        
##
## While paused in debugger for the tablesync worker do DROP TABLE on 
subscriber.
## Note we have not done any ALTER SUBSCRIPTION.
##
        
psql -d test_sub -p 54321 -c "DROP TABLE test_tab;"
DROP TABLE

## Following stacktrace LOG happens

[postgres@CentOS7-x64 ~]$ TRAP: FailedAssertion("strvalue != NULL", File: 
"snprintf.c", Line: 442, PID: 9564)
postgres: logical replication worker for subscription 16407 sync 16385 
(ExceptionalCondition+0xb9)[0xad8f4b]
postgres: logical replication worker for subscription 16407 sync 16385 
[0xb4c6ee]
postgres: logical replication worker for subscription 16407 sync 16385 
(pg_vsnprintf+0x7c)[0xb4c072]
postgres: logical replication worker for subscription 16407 sync 16385 
(pvsnprintf+0x30)[0xb3f26f]
postgres: logical replication worker for subscription 16407 sync 16385 
(appendStringInfoVA+0x80)[0xb40cb6]
postgres: logical replication worker for subscription 16407 sync 16385 
(errmsg+0x183)[0xad9d33]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x8c52f3]
postgres: logical replication worker for subscription 16407 sync 16385 
(LogicalRepSyncTableStart+0xb4)[0x8c6714]
postgres: logical replication worker for subscription 16407 sync 16385 
(ApplyWorkerMain+0x3f3)[0x8cbe30]
postgres: logical replication worker for subscription 16407 sync 16385 
(StartBackgroundWorker+0x21f)[0x87e957]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x892fc7]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x893375]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x8923b0]
/lib64/libpthread.so.0(+0xf5d0)[0x7f8cab2d15d0]
/lib64/libc.so.6(__select+0x13)[0x7f8caa8dc0d3]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x88e02e]
postgres: logical replication worker for subscription 16407 sync 16385 
(PostmasterMain+0x114f)[0x88da1d]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x7911b7]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8caa809495]
postgres: logical replication worker for subscription 16407 sync 16385 
[0x480ec9]

##
## Because we are still attached to the tablesync worker we can see some more 
details of
## backtrace information which show it crashed tried to display a message using 
a bad relid.
##

Program received signal SIGABRT, Aborted.
0x00007f8caa81d2c7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f8caa81d2c7 in raise () from /lib64/libc.so.6
#1  0x00007f8caa81e9b8 in abort () from /lib64/libc.so.6
#2  0x0000000000ad8f78 in ExceptionalCondition (conditionName=0xd5a44b 
"strvalue != NULL", 
    errorType=0xd5a43b "FailedAssertion", fileName=0xd5a430 "snprintf.c", 
lineNumber=442) at assert.c:69
#3  0x0000000000b4c6ee in dopr (target=0x7fff4409bdb0, format=0xc8f419 "\" has 
finished", 
    args=0x7fff4409be88) at snprintf.c:442
#4  0x0000000000b4c072 in pg_vsnprintf (
    str=0x1c9c690 "logical replication table synchronization worker for 
subscription \"tap_sub\", table \"", '\177' <repeats 116 times>..., count=1024, 
    fmt=0xc8f3c8 "logical replication table synchronization worker for 
subscription \"%s\", table \"%s\" has finished", args=0x7fff4409be88) at 
snprintf.c:195
#5  0x0000000000b3f26f in pvsnprintf (
    buf=0x1c9c690 "logical replication table synchronization worker for 
subscription \"tap_sub\", table \"", '\177' <repeats 116 times>..., len=1024, 
    fmt=0xc8f3c8 "logical replication table synchronization worker for 
subscription \"%s\", table \"%s\" has finished", args=0x7fff4409be88) at 
psprintf.c:110
#6  0x0000000000b40cb6 in appendStringInfoVA (str=0x7fff4409bea0, 
    fmt=0xc8f3c8 "logical replication table synchronization worker for 
subscription \"%s\", table \"%s\" has finished", args=0x7fff4409be88) at 
stringinfo.c:149
#7  0x0000000000ad9d33 in errmsg (
    fmt=0xc8f3c8 "logical replication table synchronization worker for 
subscription \"%s\", table \"%s\" has finished") at elog.c:925
#8  0x00000000008c52f3 in finish_sync_worker () at tablesync.c:136
#9  0x00000000008c6714 in LogicalRepSyncTableStart 
(origin_startpos=0x7fff4409c048) at tablesync.c:848
#10 0x00000000008cbe30 in ApplyWorkerMain (main_arg=1) at worker.c:3040
#11 0x000000000087e957 in StartBackgroundWorker () at bgworker.c:879
#12 0x0000000000892fc7 in do_start_bgworker (rw=0x1cc1c00) at postmaster.c:5833
#13 0x0000000000893375 in maybe_start_bgworkers () at postmaster.c:6058
#14 0x00000000008923b0 in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:5215
#15 <signal handler called>
#16 0x00007f8caa8dc0d3 in __select_nocancel () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit---
#17 0x000000000088e02e in ServerLoop () at postmaster.c:1694
#18 0x000000000088da1d in PostmasterMain (argc=3, argv=0x1c9a4c0) at 
postmaster.c:1402
#19 0x00000000007911b7 in main (argc=3, argv=0x1c9a4c0) at main.c:209
(


##
## Since it crashed above, the remainder of this log is not really important...
##

psql -d test_sub -p 54321 -c "DROP SUBSCRIPTION tap_sub;"
2021-02-03 13:43:47.323 AEDT [9556] FATAL:  terminating logical replication 
worker due to administrator command
2021-02-03 13:43:47.323 AEDT [5230] LOG:  background worker "logical 
replication worker" (PID 9556) exited with exit code 1

2021-02-03 13:43:57.857 AEDT [5230] LOG:  background worker "logical 
replication worker" (PID 9564) was terminated by signal 6: Aborted
2021-02-03 13:43:57.857 AEDT [5230] LOG:  terminating any other active server 
processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited abnormally 
and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat 
your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
2021-02-03 13:43:57.859 AEDT [5230] LOG:  all server processes terminated; 
reinitializing
[postgres@CentOS7-x64 ~]$ 
[postgres@CentOS7-x64 ~]$ 2021-02-03 13:43:57.892 AEDT [12298] LOG:  database 
system was interrupted; last known up at 2021-02-03 13:39:47 AEDT
2021-02-03 13:43:58.587 AEDT [12298] LOG:  database system was not properly 
shut down; automatic recovery in progress
2021-02-03 13:43:58.590 AEDT [12298] LOG:  redo starts at 0/16574C8
2021-02-03 13:43:58.593 AEDT [12298] LOG:  invalid record length at 0/1666D70: 
wanted 24, got 0
2021-02-03 13:43:58.593 AEDT [12298] LOG:  redo done at 0/1666D38 system usage: 
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-02-03 13:43:58.630 AEDT [5230] LOG:  database system is ready to accept 
connections
2021-02-03 13:43:58.650 AEDT [12338] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-02-03 13:43:58.650 AEDT [12338] LOG:  !!>> The apply worker process has 
PID = 12338
2021-02-03 13:43:58.658 AEDT [12341] LOG:  starting logical decoding for slot 
"tap_sub"
2021-02-03 13:43:58.658 AEDT [12341] DETAIL:  Streaming transactions committing 
after 0/1660620, reading WAL from 0/16605E8.
2021-02-03 13:43:58.658 AEDT [12341] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-02-03 13:43:58.658 AEDT [12341] LOG:  logical decoding found consistent 
point at 0/16605E8
2021-02-03 13:43:58.658 AEDT [12341] DETAIL:  There are no running transactions.
2021-02-03 13:43:58.658 AEDT [12341] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')

Reply via email to