On 1/29/23 19:08, Tomas Vondra wrote:
>
>
> On 1/29/23 18:53, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
>>> Will do, but I'll wait for another lockup to see how frequent it
>>> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
>>> So hitting it after 15 runs might have been a bit of a luck.
>>
>> Was there a difference in how much load there was on the machine between
>> "reproduced in 15 runs" and "not reproed in 90"? If indeed lack of barriers
>> is related to the issue, an increase in context switches could substantially
>> change the behaviour (in both directions). More intra-process context
>> switches can amount to "probabilistic barriers" because that'll be a
>> barrier. At the same time it can make it more likely that the relatively
>> narrow window in WaitEventSetWait() is hit, or lead to larger delays
>> processing signals.
>>
>
> No. The only thing the machine is doing is
>
>   while /usr/bin/true; do
>     make check
>   done
>
> I can't reduce the workload further, because the "join" test is in a
> separate parallel group (I cut down parallel_schedule). I could make the
> machine busier, of course.
>
> However, the other lockup I saw was when using serial_schedule, so I
> guess lower concurrency makes it more likely.
>

FWIW the machine is now on run ~2700 without any further lockups :-/

Seems it was quite lucky we hit it twice in a handful of attempts.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company