Dear Hou,

Thank you for updating the patch! While testing it, I found that the leader apply worker (LA) crashes in the following case. I will investigate the failure further, but I am reporting it here for the record.
1. Change the macros to force changes to be written to a temporary file.

```
-#define CHANGES_THRESHOLD 1000
-#define SHM_SEND_TIMEOUT_MS 10000
+#define CHANGES_THRESHOLD 10
+#define SHM_SEND_TIMEOUT_MS 100
```

2. Set logical_decoding_work_mem to 64kB on the publisher.

3. Insert a large amount of data on the publisher.

```
publisher=# \d tbl
                Table "public.tbl"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 c      | integer |           |          |
Publications:
    "pub"

publisher=# BEGIN;
BEGIN
publisher=*# INSERT INTO tbl SELECT i FROM generate_series(1, 5000000) s(i);
INSERT 0 5000000
publisher=*# COMMIT;
```

-> LA crashes on the subscriber! The backtrace is as follows.

```
(gdb) bt
#0  0x00007f2663ae4387 in raise () from /lib64/libc.so.6
#1  0x00007f2663ae5a78 in abort () from /lib64/libc.so.6
#2  0x0000000000ad0a95 in ExceptionalCondition (conditionName=0xcabdd0 "mqh->mqh_partial_bytes <= nbytes", fileName=0xcabc30 "../src/backend/storage/ipc/shm_mq.c", lineNumber=420) at ../src/backend/utils/error/assert.c:66
#3  0x00000000008eaeb7 in shm_mq_sendv (mqh=0x271ebd8, iov=0x7ffc664a2690, iovcnt=1, nowait=false, force_flush=true) at ../src/backend/storage/ipc/shm_mq.c:420
#4  0x00000000008eac5a in shm_mq_send (mqh=0x271ebd8, nbytes=1, data=0x271f3c0, nowait=false, force_flush=true) at ../src/backend/storage/ipc/shm_mq.c:338
#5  0x0000000000880e18 in parallel_apply_free_worker (winfo=0x271f270, xid=735, stop_worker=true) at ../src/backend/replication/logical/applyparallelworker.c:368
#6  0x00000000008a3638 in apply_handle_stream_commit (s=0x7ffc664a2790) at ../src/backend/replication/logical/worker.c:2081
#7  0x00000000008a54da in apply_dispatch (s=0x7ffc664a2790) at ../src/backend/replication/logical/worker.c:3195
#8  0x00000000008a5a76 in LogicalRepApplyLoop (last_received=378674872) at ../src/backend/replication/logical/worker.c:3431
#9  0x00000000008a72ac in start_apply (origin_startpos=0) at ../src/backend/replication/logical/worker.c:4245
#10 0x00000000008a7d77 in ApplyWorkerMain (main_arg=0) at ../src/backend/replication/logical/worker.c:4555
#11 0x000000000084983c in StartBackgroundWorker () at ../src/backend/postmaster/bgworker.c:861
#12 0x0000000000854192 in do_start_bgworker (rw=0x26c0d20) at ../src/backend/postmaster/postmaster.c:5801
#13 0x000000000085457c in maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:6025
#14 0x000000000085350b in sigusr1_handler (postgres_signal_arg=10) at ../src/backend/postmaster/postmaster.c:5182
#15 <signal handler called>
#16 0x00007f2663ba3b23 in __select_nocancel () from /lib64/libc.so.6
#17 0x000000000084edbc in ServerLoop () at ../src/backend/postmaster/postmaster.c:1768
#18 0x000000000084e737 in PostmasterMain (argc=3, argv=0x2690f60) at ../src/backend/postmaster/postmaster.c:1476
#19 0x000000000074adfb in main (argc=3, argv=0x2690f60) at ../src/backend/main/main.c:197
```

Please see the attached script, which reproduces the failure in my environment.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
repro.sh