Re: subscription/015_stream sometimes breaks
On Tue, Aug 29, 2023 at 10:35 PM Alvaro Herrera wrote:
>
> On 2023-Aug-29, Amit Kapila wrote:
>
> > On Mon, Aug 28, 2023 at 5:35 AM Peter Smith wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila wrote:
> > >
> > > IMO there are inconsistencies in the second patch that was pushed.
> >
> > I find your suggestions reasonable. Alvaro, do you have any comments?
>
> Well, my main comment is that at this point I'm not sure these
> isFooWorker() macros are worth their salt.  It looks like we could
> replace their uses with direct type comparisons in their callsites and
> remove them, with no loss of readability.  The am_sth_worker() are
> probably a bit more useful, though some of the callsites could end up
> better if replaced with straight type comparison.
>
> All in all, I don't disagree with Peter's suggestions, but this is
> pretty much in "meh" territory for me.

I had written a small non-functional patch [1] to address some macro
inconsistencies introduced by a prior patch of this thread. It received
initial feedback from Amit ("I find your suggestions reasonable") and from
Alvaro ("I don't disagree with Peter's suggestions"), but then nothing
further happened.

I also created a CF entry https://commitfest.postgresql.org/46/4570/ for it.

AFAIK my patch is still valid, but after 4 months of no activity it seems
there is no interest in pushing it, so I am withdrawing the CF entry.

==
[1] https://www.postgresql.org/message-id/CAHut%2BPuwaF4Sb41pWQk69d2WO_ZJQpj-_2JkQvP%3D1jwozUpcCQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: subscription/015_stream sometimes breaks
On 2023-Aug-29, Amit Kapila wrote:

> On Mon, Aug 28, 2023 at 5:35 AM Peter Smith wrote:
> >
> > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila wrote:
> >
> > IMO there are inconsistencies in the second patch that was pushed.
>
> I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt.  It looks like we could
replace their uses with direct type comparisons in their callsites and
remove them, with no loss of readability.  The am_sth_worker() are
probably a bit more useful, though some of the callsites could end up
better if replaced with straight type comparison.

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: subscription/015_stream sometimes breaks
On Mon, Aug 28, 2023 at 5:35 AM Peter Smith wrote:
>
> On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila wrote:
> >
> > IMO there are inconsistencies in the second patch that was pushed.
>
> 1. In the am_xxx functions, why is there Assert 'in_use' only for the
> APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?
>
> 2. In the am_xxx functions there is now Assert 'in_use', so why are we
> still using macros to check again what we already asserted is not
> possible? (Or, if the checking overkill was a deliberate choice then
> why is there no isLeaderApplyWorker macro?)
>
> ~
>
> PSA a small patch to address these.

I find your suggestions reasonable. Alvaro, do you have any comments?

--
With Regards,
Amit Kapila.
Re: subscription/015_stream sometimes breaks
On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila wrote:
>
> On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
> > > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> >
> > Agreed.
>
> Pushed both the patches.

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there Assert 'in_use' only for the
APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now Assert 'in_use', so why are we
still using macros to check again what we already asserted is not
possible? (Or, if the checking overkill was a deliberate choice then
why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

v1-0001-Fix-am_xxx-function-Asserts.patch
Description: Binary data
Re: subscription/015_stream sometimes breaks
On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
> > >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
>
> Agreed.

Pushed both the patches.

--
With Regards,
Amit Kapila.
Re: subscription/015_stream sometimes breaks
On Fri, Aug 25, 2023 at 9:09 AM Peter Smith wrote:
>
> On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
> > > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> >
> > Agreed.
>
> Both of these patches (Hou-san's expedient resetting of the worker
> type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
> type macros) appear to be blending the concept of "type" with whether
> the worker is "alive" or not, which I am not sure is a good thing. IMO
> the type is the type forever, so I felt type should get assigned only
> once when the worker is "born". For example, a dead horse is still a
> horse.

I think it is important to have an alive check before accessing the
worker type, as we do for some of the other fields. For example, see the
usage of the in_use flag in the function logicalrep_worker_find(). The
current handling of parallel apply workers doesn't consult the in_use
flag, whereas the other worker types are checked for in_use first.

--
With Regards,
Amit Kapila.
Re: subscription/015_stream sometimes breaks
On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
> > >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
>
> Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing. IMO
the type is the type forever, so I felt type should get assigned only
once when the worker is "born". For example, a dead horse is still a
horse.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: subscription/015_stream sometimes breaks
On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera wrote:
>
> On 2023-Aug-24, Amit Kapila wrote:
>
> > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
> >
> > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > fix.
> >
> > I think that way we may need to add the check for in_use before
> > accessing each of the LogicalRepWorker struct fields or form some rule
> > about which fields (or places) are okay to access without checking
> > in_use field.
>
> As far as I realize, we have that rule already.  It's only a few
> relatively new places that have broken it.  I understand that the in_use
> concept comes from the one of the same name in ReplicationSlot, except
> that it is not at all documented in worker_internal.h.
>
> So I propose we do both: apply Zhijie's patch and my 0001 now; and
> somebody gets to document the locking design for LogicalRepWorker.

Agreed.

--
With Regards,
Amit Kapila.
Re: subscription/015_stream sometimes breaks
On 2023-Aug-24, Amit Kapila wrote:

> On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
>
> > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > of the struct at all, so I propose to add the attached 0001 as a minimal
> > fix.
>
> I think that way we may need to add the check for in_use before
> accessing each of the LogicalRepWorker struct fields or form some rule
> about which fields (or places) are okay to access without checking
> in_use field.

As far as I realize, we have that rule already.  It's only a few
relatively new places that have broken it.  I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

> > In fact, I'd go further and propose that if we do take that stance, then
> > we don't need to clear out the contents of this struct at all, so let's
> > not.  That's 0002.
>
> Personally, I think we should consider these changes (0002 and 0003) separately.

I agree.  I'd maybe even retract them.

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: subscription/015_stream sometimes breaks
On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera wrote:
>
> On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:
>
> > [1]--
> > 	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> >
> > 	workers = logicalrep_workers_find(MyLogicalRepWorker->subid,
> > 									  true);
> > 	foreach(lc, workers)
> > 	{
> > 		LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> >
> > **		if (isParallelApplyWorker(w))
> > 			logicalrep_worker_stop_internal(w, SIGTERM);
> > 	}
>
> Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> of the struct at all, so I propose to add the attached 0001 as a minimal
> fix.

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

> In fact, I'd go further and propose that if we do take that stance, then
> we don't need to clear out the contents of this struct at all, so let's
> not.  That's 0002.
>
> And the reason 0002 does not remove the zeroing of ->proc is that the
> test gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.

Personally, I think we should consider these changes (0002 and 0003) separately.

--
With Regards,
Amit Kapila.
Re: subscription/015_stream sometimes breaks
On 2023-Aug-23, Alvaro Herrera wrote:

> And the reason 0002 does not remove the zeroing of ->proc is that the
> test gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.

Hmm, actually the test got stuck when I ran it repeatedly with this
0003.  I guess there might be other places that depend on ->proc being
set to NULL on exit.

--
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as
entropy sources, so let's move on."   (Nathaniel Smith)
Re: subscription/015_stream sometimes breaks
On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

> [1]--
> 	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
>
> 	workers = logicalrep_workers_find(MyLogicalRepWorker->subid,
> 									  true);
> 	foreach(lc, workers)
> 	{
> 		LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
>
> **		if (isParallelApplyWorker(w))
> 			logicalrep_worker_stop_internal(w, SIGTERM);
> 	}

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not.  That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
test gets stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Thoughts?

--
Álvaro Herrera        PostgreSQL Developer  —  https://www.EnterpriseDB.com/

>From a28775fc5900cd740556a864e1826ceda268b794 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 23 Aug 2023 09:25:35 +0200
Subject: [PATCH 1/3] Don't use LogicalRepWorker until ->in_use is verified

---
 src/backend/replication/logical/launcher.c | 4 ++--
 src/include/replication/worker_internal.h  | 8 ++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 72e44d5a02..ec5156aa41 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -862,7 +862,7 @@ logicalrep_sync_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isTablesyncWorker(w))
+		if (isTablesyncWorker(w) && w->subid == subid)
 			res++;
 	}
 
@@ -889,7 +889,7 @@ logicalrep_pa_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isParallelApplyWorker(w))
+		if (isParallelApplyWorker(w) && w->subid == subid)
 			res++;
 	}
 
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a428663859..8f4bed0958 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -327,8 +327,10 @@ extern void pa_decr_and_wait_stream_block(void);
 extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
 						   XLogRecPtr remote_lsn);
 
-#define isParallelApplyWorker(worker) ((worker)->type == WORKERTYPE_PARALLEL_APPLY)
-#define isTablesyncWorker(worker) ((worker)->type == WORKERTYPE_TABLESYNC)
+#define isParallelApplyWorker(worker) ((worker)->in_use && \
+									   (worker)->type == WORKERTYPE_PARALLEL_APPLY)
+#define isTablesyncWorker(worker) ((worker)->in_use && \
+								   (worker)->type == WORKERTYPE_TABLESYNC)
 
 static inline bool
 am_tablesync_worker(void)
@@ -339,12 +341,14 @@ am_tablesync_worker(void)
 static inline bool
 am_leader_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return (MyLogicalRepWorker->type == WORKERTYPE_APPLY);
 }
 
 static inline bool
 am_parallel_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return isParallelApplyWorker(MyLogicalRepWorker);
 }
-- 
2.30.2

>From 36336e6d8ef2913def59a53b6bc083a40181a85a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 23 Aug 2023 09:53:51 +0200
Subject: [PATCH 2/3] don't clean up unnecessarily

---
 src/backend/replication/logical/launcher.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ec5156aa41..f9d6ebf2d8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -795,12 +795,6 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
 
 	worker->in_use = false;
 	worker->proc = NULL;
-	worker->dbid = InvalidOid;
-	worker->userid = InvalidOid;
-	worker->subid = InvalidOid;
-	worker->relid = InvalidOid;
-	worker->leader_pid = InvalidPid;
-	worker->parallel_apply = false;
 }
 
 /*
-- 
2.30.2

>From 9dced5b0898be22004d442cb2893848663f967ba Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 23 Aug 2023 09:55:13 +0200
Subject: [PATCH 3/3] Don't rely on proc being NULL either

---
 src/backend/replication/logical/launcher.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f9d6ebf2d8..8bfceb5c27 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -201,11 +201,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 
-		/* Worker either died or has started. Return fals
RE: subscription/015_stream sometimes breaks
> -----Original Message-----
> From: Zhijie Hou (Fujitsu)
> Sent: Wednesday, August 23, 2023 10:27 AM
> To: Thomas Munro
> Cc: Amit Kapila ; pgsql-hackers
> Subject: RE: subscription/015_stream sometimes breaks
>
> On Wednesday, August 23, 2023 4:55 AM Thomas Munro wrote:
> >
> > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro wrote:
> > > I didn't study it closely but it looks like there might be a second
> > > deadlock, after the one that is expected by the test?  Examples from
> > > the past couple of weeks:
> >
> > I should add, it's not correlated with the patches that cfbot is
> > testing, and it's the most frequent failure for which that is the case.
> >
> >     suite     |    name    | distinct_patches | errors
> > --------------+------------+------------------+--------
> >  subscription | 015_stream |               47 |     61
>
> Thanks for reporting !
> I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.

From the crash log, I can see the apply worker crashed when accessing
worker->proc, so I think it's because the worker->proc has been released:

  577:		/* Now terminate the worker ... */
> 578:		kill(worker->proc->pid, signo);
  579:
  580:		/* ... and wait for it to die. */

Normally, this should not happen, because we take a lock on
LogicalRepWorkerLock when shutting down all the parallel workers [1],
which prevents a concurrent worker from freeing the worker info.  But in
logicalrep_worker_stop_internal(), when stopping parallel worker #1, we
briefly release the lock, and at that point it is possible that another
parallel worker #2, which reported an ERROR, shuts down by itself and
frees its worker->proc.  So when we try to stop that parallel worker #2
in the next round, we don't realize it has already exited, and we access
invalid memory (worker->proc).

[1]--
	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

	workers = logicalrep_workers_find(MyLogicalRepWorker->subid,
									  true);
	foreach(lc, workers)
	{
		LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

**		if (isParallelApplyWorker(w))
			logicalrep_worker_stop_internal(w, SIGTERM);
	}
--

The bug appears after commit 2a8b40e, where isParallelApplyWorker()
started to use worker->type for the check, but we forgot to reset the
worker type at worker exit time.  So, even after worker #2 has shut
down, the worker type still looks valid and we try to stop it again.
Previously, isParallelApplyWorker() used worker->leader_pid, which is
reset when the worker exits, so the "if (isParallelApplyWorker(w))"
check would not pass in this case and we would not try to stop worker #2.

To fix it, I think we need to reset the worker type at exit as well.
Attached is a patch which does that.  I am also testing it locally to
see if there are other issues here.

Best Regards,
Hou Zhijie

0001-Reset-worker-type-at-exit-time.patch
Description: 0001-Reset-worker-type-at-exit-time.patch
Re: subscription/015_stream sometimes breaks
On Wed, 23 Aug 2023 at 02:25, Thomas Munro wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is
> testing, and it's the most frequent failure for which that is the
> case.
>
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61

I had noticed that it is failing because of a segmentation fault:

2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel worker][4/44:767] FATAL:  terminating logical replication worker due to administrator command
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel worker][4/44:767] CONTEXT:  processing remote data for replication origin "pg_16397" during message type "STREAM STOP" in transaction 748
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:  unregistering background worker "logical replication parallel apply worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background worker "logical replication parallel worker" (PID 3823455) exited with exit code 1
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:  unregistering background worker "logical replication parallel apply worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background worker "logical replication parallel worker" (PID 3823023) exited with exit code 1
2023-08-22 19:07:22.419 UTC [3819892][postmaster][:0] LOG:  background worker "logical replication apply worker" (PID 3822876) was terminated by signal 11: Segmentation fault

The stack trace for the same, generated at [1], is:

Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16397 '.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Section `.reg-xstate/3822876' in core file too small.
#0  0x007b461e in logicalrep_worker_stop_internal (worker=, signo=) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
583			kill(worker->proc->pid, signo);
#0  0x007b461e in logicalrep_worker_stop_internal (worker=, signo=) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
#1  0x007b565a in logicalrep_worker_detach () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:774
#2  0x007b49ff in logicalrep_worker_onexit (code=, arg=) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:829
#3  0x008034c5 in shmem_exit (code=) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:239
#4  0x008033dc in proc_exit_prepare (code=1) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:194
#5  0x0080333d in proc_exit (code=1) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:107
#6  0x00797068 in StartBackgroundWorker () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:827
#7  0x0079f257 in do_start_bgworker (rw=0x284e750) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5734
#8  0x0079b541 in maybe_start_bgworkers () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5958
#9  0x0079cb51 in process_pm_pmsignal () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5121
#10 0x0079b6bb in ServerLoop () at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1769
#11 0x0079aaa5 in PostmasterMain (argc=4, argv=) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1462
#12 0x006d82a0 in main (argc=4, argv=0x27e3fd0) at /home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:198

$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {64, 0 }, _kill = {si_pid = 64, si_uid = 0}, _timer = {si_tid = 64, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 64, si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 64, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x40, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band = 64, si_fd = 0}, _sigsys = {_call_addr = 0x40, _syscall = 0, _arch = 0}}}

[1] - https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dragonet&dt=2023-08-22%2018%3A56%3A04&stg=subscription-check

Regards,
Vignesh
RE: subscription/015_stream sometimes breaks
On Wednesday, August 23, 2023 4:55 AM Thomas Munro wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is testing,
> and it's the most frequent failure for which that is the case.
>
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61

Thanks for reporting!
I am researching the failure and will share my analysis.

Best Regards,
Hou zj
Re: subscription/015_stream sometimes breaks
On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro wrote:
> I didn't study it closely but it looks like there might be a second
> deadlock, after the one that is expected by the test?  Examples from
> the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the case.

    suite     |    name    | distinct_patches | errors
--------------+------------+------------------+--------
 subscription | 015_stream |               47 |     61