On 2023-09-06 11:17, James Coleman wrote:

> I've never been able to reproduce it (haven't tested the new version,
> but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> (first buster and now bullseye).
>
> I'm attaching several stacktraces in the hope that they provide some
> clues. These all match the ps output I sent earlier, though note in
> that output there is both the regress instance and my test instance
> (pid 3213249) running (different ports, of course, and they are from
> the exact same compilation run). I've attached ps output for the
> postgres processes under the make check process to simplify cross
> referencing.
>
> A few interesting things:
> - There's definitely a lock on a relation that seems to be what's
> blocking the processes.
> - When I try to connect with psql the process forks but then hangs
> (see the ps output with task names stuck in "authentication"). I've
> also included a trace from one of these.

Thanks for sharing them!

Many processes are waiting to acquire the LW lock, including the process
trying to output the plan(select1.trace).

I suspect that this is due to a lock that was acquired prior to being
interrupted by ProcessLogQueryPlanInterrupt(), but have not been able to
reproduce the same situation..


I don't have time immediately to dig into this, but perhaps loading up
the core dumps would allow us to see what query is running in each
backend process (if it hasn't already been discarded by that point)
and thereby determine what point in each test process led to the error
condition?

Thanks for the suggestion.
I am concerned that core dumps may not be readable on different operating systems or other environments. (Unfortunately, I do not have Debian on hand)

It seems that we can know what queries were running from the stack traces you shared. As described above, I suspect a lock which was acquired prior to ProcessLogQueryPlanInterrupt() caused the issue. I will try a little more to see if I can devise a way to create the same situation.

Alternatively we might be able to apply the same trick to the test
client instead...

BTW, for my own easy reference in this thread: relid 1259 is pg_class
if I'm not mistaken.

Yeah, and I think it's strange that the lock to 1259 appears twice and must be avoided.

#10 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117
  ..
#49 0x0000559d61b4989d in ProcessLogQueryPlanInterrupt () at explain.c:5158
  ..
#53 0x0000559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation


Reply via email to