Re: subscription/015_stream sometimes breaks

2024-01-09 Thread Peter Smith
On Tue, Aug 29, 2023 at 10:35 PM Alvaro Herrera  wrote:
>
> On 2023-Aug-29, Amit Kapila wrote:
>
> > On Mon, Aug 28, 2023 at 5:35 AM Peter Smith  wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila  
> > > wrote:
> > >
> > > IMO there are inconsistencies in the second patch that was pushed.
>
> > I find your suggestions reasonable. Alvaro, do you have any comments?
>
> Well, my main comment is that at this point I'm not sure these
> isFooWorker() macros are worth their salt.  It looks like we could
> replace their uses with direct type comparisons in their callsites and
> remove them, with no loss of readability.  The am_sth_worker() are
> probably a bit more useful, though some of the callsites could end up
> better if replaced with straight type comparison.
>
> All in all, I don't disagree with Peter's suggestions, but this is
> pretty much in "meh" territory for me.

I had written a small non-functional patch [1] to address some macro
inconsistencies introduced by a prior patch of this thread.

It received initial feedback from Amit ("I find your suggestions
reasonable") and from Alvaro ("I don't disagree with Peter's
suggestions") but then nothing further happened. I also created a CF
entry https://commitfest.postgresql.org/46/4570/ for it.

AFAIK my patch is still valid, but after 4 months of no activity it
seems there is no interest in pushing it, so I am withdrawing the CF
entry.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuwaF4Sb41pWQk69d2WO_ZJQpj-_2JkQvP%3D1jwozUpcCQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: subscription/015_stream sometimes breaks

2023-08-29 Thread Alvaro Herrera
On 2023-Aug-29, Amit Kapila wrote:

> On Mon, Aug 28, 2023 at 5:35 AM Peter Smith  wrote:
> >
> > On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila  wrote:
> >
> > IMO there are inconsistencies in the second patch that was pushed.

> I find your suggestions reasonable. Alvaro, do you have any comments?

Well, my main comment is that at this point I'm not sure these
isFooWorker() macros are worth their salt.  It looks like we could
replace their uses with direct type comparisons in their callsites and
remove them, with no loss of readability.  The am_sth_worker() are
probably a bit more useful, though some of the callsites could end up
better if replaced with straight type comparison.

All in all, I don't disagree with Peter's suggestions, but this is
pretty much in "meh" territory for me.
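
For illustration, a rough sketch of what such a direct comparison could look
like at the call site quoted earlier in this thread (not taken from any patch
posted here; the real call sites may of course differ):

	/* with the macro */
	if (isParallelApplyWorker(w))
		logicalrep_worker_stop_internal(w, SIGTERM);

	/* with a straight type comparison instead (the in_use check added by
	 * 0001 would then live at the call site too) */
	if (w->in_use && w->type == WORKERTYPE_PARALLEL_APPLY)
		logicalrep_worker_stop_internal(w, SIGTERM);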

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: subscription/015_stream sometimes breaks

2023-08-28 Thread Amit Kapila
On Mon, Aug 28, 2023 at 5:35 AM Peter Smith  wrote:
>
> On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila  wrote:
>
> IMO there are inconsistencies in the second patch that was pushed.
>
> 1. In the am_xxx functions, why is there Assert 'in_use' only for the
> APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?
>
> 2. In the am_xxx functions there is now Assert 'in_use', so why are we
> still using macros to check again what we already asserted is not
> possible? (Or, if the checking overkill was a deliberate choice then
> why is there no isLeaderApplyWorker macro?)
>
> ~
>
> PSA a small patch to address these.
>

I find your suggestions reasonable. Alvaro, do you have any comments?

-- 
With Regards,
Amit Kapila.




Re: subscription/015_stream sometimes breaks

2023-08-27 Thread Peter Smith
On Fri, Aug 25, 2023 at 8:15 PM Amit Kapila  wrote:
>
> On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila  wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera 
> > > >  wrote:
> > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a 
> > > > > minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
> >
>
> Pushed both the patches.
>

IMO there are inconsistencies in the second patch that was pushed.

1. In the am_xxx functions, why is there Assert 'in_use' only for the
APPLY / PARALLEL_APPLY workers but not for TABLESYNC workers?

2. In the am_xxx functions there is now Assert 'in_use', so why are we
still using macros to check again what we already asserted is not
possible? (Or, if the checking overkill was a deliberate choice then
why is there no isLeaderApplyWorker macro?)

~

PSA a small patch to address these.
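
For reference, a rough sketch of the kind of consistency meant by the two
points above (the attached v1 patch is the authoritative version and may
differ in its details):

static inline bool
am_tablesync_worker(void)
{
	Assert(MyLogicalRepWorker->in_use);	/* same Assert as the other am_xxx functions */
	return (MyLogicalRepWorker->type == WORKERTYPE_TABLESYNC);
}

static inline bool
am_parallel_apply_worker(void)
{
	Assert(MyLogicalRepWorker->in_use);
	/* compare the type directly rather than re-checking in_use via the macro */
	return (MyLogicalRepWorker->type == WORKERTYPE_PARALLEL_APPLY);
}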

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-am_xxx-function-Asserts.patch
Description: Binary data


Re: subscription/015_stream sometimes breaks

2023-08-25 Thread Amit Kapila
On Thu, Aug 24, 2023 at 3:48 PM Amit Kapila  wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera  
> > > wrote:
> >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.
>

Pushed both the patches.

-- 
With Regards,
Amit Kapila.




Re: subscription/015_stream sometimes breaks

2023-08-24 Thread Amit Kapila
On Fri, Aug 25, 2023 at 9:09 AM Peter Smith  wrote:
>
> On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila  wrote:
> >
> > On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2023-Aug-24, Amit Kapila wrote:
> > >
> > > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera 
> > > >  wrote:
> > >
> > > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > > of the struct at all, so I propose to add the attached 0001 as a 
> > > > > minimal
> > > > > fix.
> > > >
> > > > I think that way we may need to add the check for in_use before
> > > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > > about which fields (or places) are okay to access without checking
> > > > in_use field.
> > >
> > > As far as I realize, we have that rule already.  It's only a few
> > > relatively new places that have broken it.  I understand that the in_use
> > > concept comes from the one of the same name in ReplicationSlot, except
> > > that it is not at all documented in worker_internal.h.
> > >
> > > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > > somebody gets to document the locking design for LogicalRepWorker.
> > >
> >
> > Agreed.
>
> Both of these patches (Hou-san's expedient resetting of the worker
> type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
> type macros) appear to be blending the concept of "type" with whether
> the worker is "alive" or not, which I am not sure is a good thing. IMO
> the type is the type forever, so I felt type should get assigned only
> once when the worker is "born". For example, a dead horse is still a
> horse.
>

I think it is important to have an alive check before accessing the
worker type, as we do for some of the other fields. For example, see
the usage of the in_use flag in the function logicalrep_worker_find().
The handling of parallel apply workers doesn't consider the in_use
flag, whereas the other worker types first check the in_use flag.
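
For reference, the in_use-first pattern in logicalrep_worker_find() looks
roughly like this (simplified excerpt from launcher.c; the exact code may
differ between branches):

	for (i = 0; i < max_logical_replication_workers; i++)
	{
		LogicalRepWorker *w = &LogicalRepCtx->workers[i];

		/* in_use is checked before any other field of the slot is consulted */
		if (w->in_use && w->subid == subid && w->relid == relid &&
			(!only_running || w->proc))
		{
			res = w;
			break;
		}
	}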

-- 
With Regards,
Amit Kapila.




Re: subscription/015_stream sometimes breaks

2023-08-24 Thread Peter Smith
On Thu, Aug 24, 2023 at 8:18 PM Amit Kapila  wrote:
>
> On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Aug-24, Amit Kapila wrote:
> >
> > > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera  
> > > wrote:
> >
> > > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > > fix.
> > >
> > > I think that way we may need to add the check for in_use before
> > > accessing each of the LogicalRepWorker struct fields or form some rule
> > > about which fields (or places) are okay to access without checking
> > > in_use field.
> >
> > As far as I realize, we have that rule already.  It's only a few
> > relatively new places that have broken it.  I understand that the in_use
> > concept comes from the one of the same name in ReplicationSlot, except
> > that it is not at all documented in worker_internal.h.
> >
> > So I propose we do both: apply Zhijie's patch and my 0001 now; and
> > somebody gets to document the locking design for LogicalRepWorker.
> >
>
> Agreed.

Both of these patches (Hou-san's expedient resetting of the worker
type, Alvaro's 0001 putting the 'in_use' check within the isXXXWorker
type macros) appear to be blending the concept of "type" with whether
the worker is "alive" or not, which I am not sure is a good thing. IMO
the type is the type forever, so I felt type should get assigned only
once when the worker is "born". For example, a dead horse is still a
horse.
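
To sketch that idea (not an actual patch; the names used here are only
stand-ins for whatever the launcher uses to distinguish the worker kinds):

	/* in logicalrep_worker_launch(), when the worker slot is claimed ("born") */
	worker->type = OidIsValid(relid) ? WORKERTYPE_TABLESYNC :
		(is_parallel_apply_worker ? WORKERTYPE_PARALLEL_APPLY : WORKERTYPE_APPLY);
	worker->in_use = true;

	/* at exit, only liveness changes; the type stays whatever it was at birth */
	worker->in_use = false;
	worker->proc = NULL;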

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: subscription/015_stream sometimes breaks

2023-08-24 Thread Amit Kapila
On Thu, Aug 24, 2023 at 1:20 PM Alvaro Herrera  wrote:
>
> On 2023-Aug-24, Amit Kapila wrote:
>
> > On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera  
> > wrote:
>
> > > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > > of the struct at all, so I propose to add the attached 0001 as a minimal
> > > fix.
> >
> > I think that way we may need to add the check for in_use before
> > accessing each of the LogicalRepWorker struct fields or form some rule
> > about which fields (or places) are okay to access without checking
> > in_use field.
>
> As far as I realize, we have that rule already.  It's only a few
> relatively new places that have broken it.  I understand that the in_use
> concept comes from the one of the same name in ReplicationSlot, except
> that it is not at all documented in worker_internal.h.
>
> So I propose we do both: apply Zhijie's patch and my 0001 now; and
> somebody gets to document the locking design for LogicalRepWorker.
>

Agreed.

-- 
With Regards,
Amit Kapila.




Re: subscription/015_stream sometimes breaks

2023-08-24 Thread Alvaro Herrera
On 2023-Aug-24, Amit Kapila wrote:

> On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera  
> wrote:

> > Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> > of the struct at all, so I propose to add the attached 0001 as a minimal
> > fix.
> 
> I think that way we may need to add the check for in_use before
> accessing each of the LogicalRepWorker struct fields or form some rule
> about which fields (or places) are okay to access without checking
> in_use field.

As far as I realize, we have that rule already.  It's only a few
relatively new places that have broken it.  I understand that the in_use
concept comes from the one of the same name in ReplicationSlot, except
that it is not at all documented in worker_internal.h.

So I propose we do both: apply Zhijie's patch and my 0001 now; and
somebody gets to document the locking design for LogicalRepWorker.

> > In fact, I'd go further and propose that if we do take that stance, then
> > we don't need to clear out the contents of this struct at all, so let's
> > not.  That's 0002.
> 
> Personally, I think we should consider this change (0002 and 0003) separately.

I agree.  I'd maybe even retract them.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: subscription/015_stream sometimes breaks

2023-08-23 Thread Amit Kapila
On Wed, Aug 23, 2023 at 1:31 PM Alvaro Herrera  wrote:
>
> On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:
>
> > [1]--
> >   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> >
> >   workers = logicalrep_workers_find(MyLogicalRepWorker->subid, 
> > true);
> >   foreach(lc, workers)
> >   {
> >   LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> >
> > **if (isParallelApplyWorker(w))
> >   logicalrep_worker_stop_internal(w, SIGTERM);
> >   }
>
> Hmm, I think if worker->in_use is false, we shouldn't consult the rest
> of the struct at all, so I propose to add the attached 0001 as a minimal
> fix.
>

I think that way we may need to add the check for in_use before
accessing each of the LogicalRepWorker struct fields or form some rule
about which fields (or places) are okay to access without checking
in_use field.

> In fact, I'd go further and propose that if we do take that stance, then
> we don't need to clear out the contents of this struct at all, so let's
> not.  That's 0002.
>
> And the reason 0002 does not remove the zeroing of ->proc is that the
> test gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.
>

Personally, I think we should consider this change (0002 and 0003) separately.

-- 
With Regards,
Amit Kapila.




Re: subscription/015_stream sometimes breaks

2023-08-23 Thread Alvaro Herrera
On 2023-Aug-23, Alvaro Herrera wrote:

> And the reason 0002 does not remove the zeroing of ->proc is that the
> tests gets stuck when I do that, and the reason for that looks to be
> some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
> change that too, as in 0003.

Hmm, actually the test got stuck when I ran it repeatedly with this
0003.  I guess there might be other places that depend on ->proc being
set to NULL on exit.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)




Re: subscription/015_stream sometimes breaks

2023-08-23 Thread Alvaro Herrera
On 2023-Aug-23, Zhijie Hou (Fujitsu) wrote:

> [1]--
>   LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 
>   workers = logicalrep_workers_find(MyLogicalRepWorker->subid, 
> true);
>   foreach(lc, workers)
>   {
>   LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
> 
> **if (isParallelApplyWorker(w))
>   logicalrep_worker_stop_internal(w, SIGTERM);
>   }

Hmm, I think if worker->in_use is false, we shouldn't consult the rest
of the struct at all, so I propose to add the attached 0001 as a minimal
fix.

In fact, I'd go further and propose that if we do take that stance, then
we don't need to clear out the contents of this struct at all, so let's
not.  That's 0002.

And the reason 0002 does not remove the zeroing of ->proc is that the
test gets stuck when I do that, and the reason for that looks to be
some shoddy coding in WaitForReplicationWorkerAttach, so I propose we
change that too, as in 0003.

Thoughts?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From a28775fc5900cd740556a864e1826ceda268b794 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 23 Aug 2023 09:25:35 +0200
Subject: [PATCH 1/3] Don't use LogicalRepWorker until ->in_use is verified

---
 src/backend/replication/logical/launcher.c | 4 ++--
 src/include/replication/worker_internal.h  | 8 ++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 72e44d5a02..ec5156aa41 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -862,7 +862,7 @@ logicalrep_sync_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isTablesyncWorker(w))
+		if (isTablesyncWorker(w) && w->subid == subid)
 			res++;
 	}
 
@@ -889,7 +889,7 @@ logicalrep_pa_worker_count(Oid subid)
 	{
 		LogicalRepWorker *w = &LogicalRepCtx->workers[i];
 
-		if (w->subid == subid && isParallelApplyWorker(w))
+		if (isParallelApplyWorker(w) && w->subid == subid)
 			res++;
 	}
 
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a428663859..8f4bed0958 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -327,8 +327,10 @@ extern void pa_decr_and_wait_stream_block(void);
 extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
 		   XLogRecPtr remote_lsn);
 
-#define isParallelApplyWorker(worker) ((worker)->type == WORKERTYPE_PARALLEL_APPLY)
-#define isTablesyncWorker(worker) ((worker)->type == WORKERTYPE_TABLESYNC)
+#define isParallelApplyWorker(worker) ((worker)->in_use && \
+	   (worker)->type == WORKERTYPE_PARALLEL_APPLY)
+#define isTablesyncWorker(worker) ((worker)->in_use && \
+   (worker)->type == WORKERTYPE_TABLESYNC)
 
 static inline bool
 am_tablesync_worker(void)
@@ -339,12 +341,14 @@ am_tablesync_worker(void)
 static inline bool
 am_leader_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return (MyLogicalRepWorker->type == WORKERTYPE_APPLY);
 }
 
 static inline bool
 am_parallel_apply_worker(void)
 {
+	Assert(MyLogicalRepWorker->in_use);
 	return isParallelApplyWorker(MyLogicalRepWorker);
 }
 
-- 
2.30.2

>From 36336e6d8ef2913def59a53b6bc083a40181a85a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 23 Aug 2023 09:53:51 +0200
Subject: [PATCH 2/3] don't clean up unnecessarily

---
 src/backend/replication/logical/launcher.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ec5156aa41..f9d6ebf2d8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -795,12 +795,6 @@ logicalrep_worker_cleanup(LogicalRepWorker *worker)
 
 	worker->in_use = false;
 	worker->proc = NULL;
-	worker->dbid = InvalidOid;
-	worker->userid = InvalidOid;
-	worker->subid = InvalidOid;
-	worker->relid = InvalidOid;
-	worker->leader_pid = InvalidPid;
-	worker->parallel_apply = false;
 }
 
 /*
-- 
2.30.2

>From 9dced5b0898be22004d442cb2893848663f967ba Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 23 Aug 2023 09:55:13 +0200
Subject: [PATCH 3/3] Don't rely on proc being NULL either

---
 src/backend/replication/logical/launcher.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index f9d6ebf2d8..8bfceb5c27 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -201,11 +201,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 
-		/* Worker either died or has started. Return fals

RE: subscription/015_stream sometimes breaks

2023-08-22 Thread Zhijie Hou (Fujitsu)


> -Original Message-
> From: Zhijie Hou (Fujitsu) 
> Sent: Wednesday, August 23, 2023 10:27 AM
> To: Thomas Munro 
> Cc: Amit Kapila ; pgsql-hackers
> 
> Subject: RE: subscription/015_stream sometimes breaks
> 
> On Wednesday, August 23, 2023 4:55 AM Thomas Munro
>  wrote:
> >
> > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
> 
> > wrote:
> > > I didn't study it closely but it looks like there might be a second
> > > deadlock, after the one that is expected by the test?  Examples from
> > > the past couple of weeks:
> >
> > I should add, it's not correlated with the patches that cfbot is
> > testing, and it's the most frequent failure for which that is the case.
> >
> >     suite     |    name    | distinct_patches | errors
> > --------------+------------+------------------+--------
> >  subscription | 015_stream |               47 |     61
> >
> 
> Thanks for reporting!
> I am researching the failure and will share my analysis.

Hi,

After an off-list discussion with Amit, we figured out the reason.
From the crash log, I can see the apply worker crashed when accessing the
worker->proc, so I think it's because the worker->proc has been released.

   577: /* Now terminate the worker ... */
>  578: kill(worker->proc->pid, signo);
   579: 
   580: /* ... and wait for it to die. */

Normally, this should not happen because we take a lock on LogicalRepWorkerLock
when shutting down all the parallel workers [1], which prevents a concurrent
worker from freeing the worker info. But in logicalrep_worker_stop_internal(),
when stopping parallel worker #1, we briefly release the lock, and at that
point it's possible that another parallel worker #2, which reported an ERROR,
shuts down by itself and frees worker->proc. So when we try to stop that
parallel worker #2 in the next round of the loop, we don't realize it has
already exited, and thus access invalid memory (worker->proc).

[1]--
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

workers = logicalrep_workers_find(MyLogicalRepWorker->subid, 
true);
foreach(lc, workers)
{
LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);

**  if (isParallelApplyWorker(w))
logicalrep_worker_stop_internal(w, SIGTERM);
}
--

The bug appears after commit 2a8b40e, where isParallelApplyWorker() started to
use worker->type for the check, but we forgot to reset the worker type at
worker exit time. So, even after worker #2 has shut down, the worker type is
still valid and we try to stop it again.

Previously, isParallelApplyWorker() used worker->leader_pid, which is reset
when the worker exits, so the "if (isParallelApplyWorker(w))" check did not
pass in this case and we did not try to stop worker #2.
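
For context, the two macro definitions being compared are roughly the
following (quoted from memory, before and after 2a8b40e; see worker_internal.h
for the exact form):

/* before 2a8b40e: implicitly an "alive" check, since leader_pid is reset at exit */
#define isParallelApplyWorker(worker)	((worker)->leader_pid != InvalidPid)

/* after 2a8b40e: purely a type check, and the type is currently never reset */
#define isParallelApplyWorker(worker)	((worker)->type == WORKERTYPE_PARALLEL_APPLY)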

To fix it, I think we need to reset the worker type at exit as well.
Attached is a patch which does the same. I am also testing it locally
to see if there are other issues here.
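
In logicalrep_worker_cleanup() that would mean roughly the following (the
attached patch is the authoritative version; WORKERTYPE_UNKNOWN is the
LogicalRepWorkerType member assumed here for the "no worker" state):

	/* also reset the type, so a dead slot no longer satisfies
	 * isParallelApplyWorker() */
	worker->type = WORKERTYPE_UNKNOWN;
	worker->in_use = false;
	worker->proc = NULL;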

Best Regards,
Hou Zhijie



0001-Reset-worker-type-at-exit-time.patch
Description: 0001-Reset-worker-type-at-exit-time.patch


Re: subscription/015_stream sometimes breaks

2023-08-22 Thread vignesh C
On Wed, 23 Aug 2023 at 02:25, Thomas Munro  wrote:
>
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro  wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
>
> I should add, it's not correlated with the patches that cfbot is
> testing, and it's the most frequent failure for which that is the
> case.
>
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61

I had noticed that it is failing because of a segmentation fault:
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] FATAL:  terminating logical replication worker due
to administrator command
2023-08-22 19:07:22.403 UTC [3823023][logical replication parallel
worker][4/44:767] CONTEXT:  processing remote data for replication
origin "pg_16397" during message type "STREAM STOP" in transaction 748
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication parallel worker" (PID 3823455) exited with
exit code 1
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] DEBUG:
unregistering background worker "logical replication parallel apply
worker for subscription 16397"
2023-08-22 19:07:22.404 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication parallel worker" (PID 3823023) exited with
exit code 1
2023-08-22 19:07:22.419 UTC [3819892][postmaster][:0] LOG:  background
worker "logical replication apply worker" (PID 3822876) was terminated
by signal 11: Segmentation fault

The stack trace for the same generated at [1] is:
Core was generated by `postgres: subscriber: logical replication apply
worker for subscription 16397 '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/3822876' in core file too small.
#0  0x007b461e in logicalrep_worker_stop_internal
(worker=, signo=) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
583 kill(worker->proc->pid, signo);
#0  0x007b461e in logicalrep_worker_stop_internal
(worker=, signo=) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:583
#1  0x007b565a in logicalrep_worker_detach () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:774
#2  0x007b49ff in logicalrep_worker_onexit (code=, arg=) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/launcher.c:829
#3  0x008034c5 in shmem_exit (code=) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:239
#4  0x008033dc in proc_exit_prepare (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:194
#5  0x0080333d in proc_exit (code=1) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/ipc.c:107
#6  0x00797068 in StartBackgroundWorker () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:827
#7  0x0079f257 in do_start_bgworker (rw=0x284e750) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5734
#8  0x0079b541 in maybe_start_bgworkers () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5958
#9  0x0079cb51 in process_pm_pmsignal () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:5121
#10 0x0079b6bb in ServerLoop () at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1769
#11 0x0079aaa5 in PostmasterMain (argc=4, argv=) at 
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:1462
#12 0x006d82a0 in main (argc=4, argv=0x27e3fd0) at
/home/bf/bf-build/dragonet/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:198
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad =
{64, 0 }, _kill = {si_pid = 64, si_uid = 0}, _timer
= {si_tid = 64, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr
= 0x0}}, _rt = {si_pid = 64, si_uid = 0, si_sigval = {sival_int = 0,
sival_ptr = 0x0}}, _sigchld = {si_pid = 64, si_uid = 0, si_status = 0,
si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x40, _addr_lsb =
0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}, _sigpoll = {si_band =
64, si_fd = 0}, _sigsys = {_call_addr = 0x40, _syscall = 0, _arch =
0}}}

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dragonet&dt=2023-08-22%2018%3A56%3A04&stg=subscription-check

Regards,
Vignesh




RE: subscription/015_stream sometimes breaks

2023-08-22 Thread Zhijie Hou (Fujitsu)
On Wednesday, August 23, 2023 4:55 AM Thomas Munro  
wrote:
> 
> On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro
>  wrote:
> > I didn't study it closely but it looks like there might be a second
> > deadlock, after the one that is expected by the test?  Examples from
> > the past couple of weeks:
> 
> I should add, it's not correlated with the patches that cfbot is testing,
> and it's the most frequent failure for which that is the case.
> 
>     suite     |    name    | distinct_patches | errors
> --------------+------------+------------------+--------
>  subscription | 015_stream |               47 |     61
> 

Thanks for reporting!
I am researching the failure and will share my analysis.

Best Regards,
Hou zj


Re: subscription/015_stream sometimes breaks

2023-08-22 Thread Thomas Munro
On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro  wrote:
> I didn't study it closely but it looks like there might be a second
> deadlock, after the one that is expected by the test?  Examples from
> the past couple of weeks:

I should add, it's not correlated with the patches that cfbot is
testing, and it's the most frequent failure for which that is the
case.

    suite     |    name    | distinct_patches | errors
--------------+------------+------------------+--------
 subscription | 015_stream |               47 |     61