Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-11 Thread Thomas Munro
On Sat, Sep 9, 2023 at 9:00 PM Alexander Lakhin wrote: > Yes, I think we deal with something like that. I can try to deduce a minimum > change that affects reproducing the issue, but maybe it's not that important. > Perhaps we now should think of escalating the problem to FreeBSD developers? > I

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-09 Thread Alexander Lakhin
Hi Thomas, 08.09.2023 22:39, Thomas Munro wrote: With debugging logging added I see (on 7389aad63~1) that one process really sends SIGURG to another, and the latter reaches poll(), but it just got no signal, its signal handler is not called and poll() just waits... Thanks for working so hard on

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-08 Thread Thomas Munro
On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin wrote: > It takes less than 10 minutes on average for me. I checked > REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined > forcefully) — they all are affected. > I could not reproduce the lockup on my Ubuntu box (with

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-08 Thread Alexander Lakhin
Hello, 03.09.2023 00:00, Alexander Lakhin wrote: I'll try to test this guess on the target machine... I got access to dikkop thanks to Tomas Vondra, and started reproducing the issue. It was rather difficult to catch the lockup as Tomas and Tom noticed before. I tried to use stress-ng to

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-02 Thread Thomas Munro
I agree that the code lacks barriers. I haven't been able to figure out how any reordering could cause this hang, though, because in these old branches procsignal_sigusr1_handler is used for latch wakeups, and it also calls SetLatch(MyLatch) itself, right at the end. That is, SetLatch() gets

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-02 Thread Alexander Lakhin
Hello Robert, 01.09.2023 23:21, Robert Haas wrote: On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin wrote: (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the issue for us.) Maybe it'd be worth trying something stronger, like pg_memory_barrier(). A compiler barrier

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-01 Thread Robert Haas
On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin wrote: > (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the > issue for us.) Maybe it'd be worth trying something stronger, like pg_memory_barrier(). A compiler barrier doesn't prevent the CPU from reordering loads and stores
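
The distinction drawn above between a compiler barrier and a memory barrier can be sketched in portable C11. This is only an illustration under stated assumptions, not PostgreSQL's latch code and not a diagnosis of the hang: the names latch_is_set, waiter_prepare_to_sleep and setter_set_latch are hypothetical, and C11 fences stand in for pg_compiler_barrier() and pg_memory_barrier(). Each side stores its own flag and then loads the other side's flag; only a full fence keeps the CPU from reordering that store and load.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a latch flag and the waiter's "waiting" flag. */
static atomic_bool latch_is_set;
static atomic_bool waiting;

/*
 * Waiter side: announce the intent to sleep, then re-check the latch.
 * Returns true if the latch is already set, so the caller can skip sleeping.
 */
static bool waiter_prepare_to_sleep(void)
{
    atomic_store_explicit(&waiting, true, memory_order_relaxed);

    /*
     * Full memory barrier: orders the store above before the load below,
     * for both the compiler and the CPU. atomic_signal_fence() at this
     * point would constrain only the compiler.
     */
    atomic_thread_fence(memory_order_seq_cst);

    return atomic_load_explicit(&latch_is_set, memory_order_relaxed);
}

/*
 * Setter side: set the latch, then check whether anyone is waiting.
 * Returns true if the caller must deliver a wakeup (for example a signal).
 */
static bool setter_set_latch(void)
{
    atomic_store_explicit(&latch_is_set, true, memory_order_relaxed);

    /* Same store-then-load ordering requirement as on the waiter side. */
    atomic_thread_fence(memory_order_seq_cst);

    return atomic_load_explicit(&waiting, memory_order_relaxed);
}
```

With full fences on both sides, at least one party observes the other's store: either the waiter sees the latch set and does not sleep, or the setter sees waiting and sends the wakeup. Whether this ordering is actually involved in the lockup discussed in this thread is left open by the messages themselves.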

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-01 Thread Alexander Lakhin
Hello Tomas, 01.09.2023 16:00, Tomas Vondra wrote: Hmmm, I'm not very good at reading the binary code, but here's what objdump produced for WaitEventSetWait. Maybe someone will see what the issue is. At first glance, I can't see anything suspicious in the disassembly. IIUC, waiting = true

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-01 Thread Tomas Vondra
On 9/1/23 10:00, Alexander Lakhin wrote: > Hello Thomas, > > 31.08.2023 14:15, Thomas Munro wrote: > >> We have a signal that is pending and not blocked, so I don't >> immediately know why poll() hasn't returned control. > > When I worked at the Postgres Pro company, we observed a similar

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-09-01 Thread Alexander Lakhin
Hello Thomas, 31.08.2023 14:15, Thomas Munro wrote: We have a signal that is pending and not blocked, so I don't immediately know why poll() hasn't returned control. When I worked at the Postgres Pro company, we observed a similar lockup under rather specific conditions (we used Elbrus CPU

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-08-31 Thread Thomas Munro
On Thu, Aug 31, 2023 at 2:32 PM Thomas Munro wrote: > On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra > wrote: > > I have another case of this on dikkop (on 11 again). Is there anything > > else we'd want to try? Or maybe someone would want access to the machine > > and do some investigation

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-08-30 Thread Thomas Munro
On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra wrote: > I have another case of this on dikkop (on 11 again). Is there anything > else we'd want to try? Or maybe someone would want access to the machine > and do some investigation directly? Sounds interesting -- I'll ping you off-list.

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-08-30 Thread Tomas Vondra
Hi, I have another case of this on dikkop (on 11 again). Is there anything else we'd want to try? Or maybe someone would want access to the machine and do some investigation directly?

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-06-17 Thread Tomas Vondra
On 2/7/23 01:09, Thomas Munro wrote: > On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra > wrote: >> On 2/7/23 00:48, Thomas Munro wrote: >>> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra >>> wrote: No, I left the workload as it was for the first lockup, so `make check` runs everything as

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Thomas Munro
On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra wrote: > On 2/7/23 00:48, Thomas Munro wrote: > > On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra > > wrote: > >> No, I left the workload as it was for the first lockup, so `make check` > >> runs everything as is up until the "join" test suite. > > > >

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Tomas Vondra
On 2/7/23 00:48, Thomas Munro wrote: > On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra > wrote: >> No, I left the workload as it was for the first lockup, so `make check` >> runs everything as is up until the "join" test suite. > > Wait, shouldn't that be join_hash? No, because join_hash does not

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Thomas Munro
On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra wrote: > No, I left the workload as it was for the first lockup, so `make check` > runs everything as is up until the "join" test suite. Wait, shouldn't that be join_hash?

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Tomas Vondra
On 2/6/23 20:20, Andres Freund wrote: > Hi, > > On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote: >>> No. The only thing the machine is doing is >>> >>> while /usr/bin/true; do >>> make check >>> done >>> >>> I can't reduce the workload further, because the "join" test is in a >>>

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Andres Freund
Hi, On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote: > > No. The only thing the machine is doing is > > > > while /usr/bin/true; do > > make check > > done > > > > I can't reduce the workload further, because the "join" test is in a > > separate parallel group (I cut down

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-02-06 Thread Tomas Vondra
On 1/29/23 19:08, Tomas Vondra wrote: > > > On 1/29/23 18:53, Andres Freund wrote: >> Hi, >> >> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote: >>> Will do, but I'll wait for another lockup to see how frequent it >>> actually is. I'm now at ~90 runs total, and it didn't happen again yet. >>>

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-30 Thread Thomas Munro
On Mon, Jan 30, 2023 at 6:36 PM Andres Freund wrote: > On 2023-01-30 15:22:34 +1300, Thomas Munro wrote: > > On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote: > > > out-of-order hazard > > > > I've been trying to understand how that could happen, but my CPU-fu is > > weak. Let me try to write

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Andres Freund
Hi, On 2023-01-30 15:22:34 +1300, Thomas Munro wrote: > On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote: > > out-of-order hazard > > I've been trying to understand how that could happen, but my CPU-fu is > weak. Let me try to write an argument for why it can't happen, so > that later I can

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Thomas Munro
On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote: > out-of-order hazard I've been trying to understand how that could happen, but my CPU-fu is weak. Let me try to write an argument for why it can't happen, so that later I can look back at how stupid and naive I was. We have A B, and if the

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Thomas Munro
On Mon, Jan 30, 2023 at 7:08 AM Tomas Vondra wrote: > However, the other lockup I saw was when using serial_schedule, so I > guess lower concurrency makes it more likely. FWIW "psql db -f src/test/regress/sql/join_hash.sql | cat" also works (I mean, it's self-contained and doesn't need anything

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Tomas Vondra
On 1/29/23 18:53, Andres Freund wrote: > Hi, > > On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote: >> Will do, but I'll wait for another lockup to see how frequent it >> actually is. I'm now at ~90 runs total, and it didn't happen again yet. >> So hitting it after 15 runs might have been a bit

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Andres Freund
Hi, On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote: > Will do, but I'll wait for another lockup to see how frequent it > actually is. I'm now at ~90 runs total, and it didn't happen again yet. > So hitting it after 15 runs might have been a bit of luck. Was there a difference in how much

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Andres Freund
Hi, On 2023-01-30 06:26:02 +1300, Thomas Munro wrote: > On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra > wrote: > > So I did that - same configure options as the buildfarm client, and a > > 'make check' (with only tests up to the 'join' suite, because that's > > where it got stuck before). And it

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Tomas Vondra
On 1/29/23 18:26, Thomas Munro wrote: > On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra > wrote: >> So I did that - same configure options as the buildfarm client, and a >> 'make check' (with only tests up to the 'join' suite, because that's >> where it got stuck before). And it took only ~15

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Thomas Munro
On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra wrote: > So I did that - same configure options as the buildfarm client, and a > 'make check' (with only tests up to the 'join' suite, because that's > where it got stuck before). And it took only ~15 runs (~1h) to hit this > again on dikkop. That's

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-29 Thread Tomas Vondra
On 1/28/23 13:05, Tomas Vondra wrote: > > FWIW I'll wait for dikkop to finish the current buildfarm run (it's > currently chewing on HEAD) and then will try to do runs of the 'joins' > test in a loop. That's where dikkop got stuck before. > So I did that - same configure options as the buildfarm

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-28 Thread Tomas Vondra
On 1/28/23 05:53, Andres Freund wrote: > Hi, > > On 2023-01-27 23:18:39 -0500, Tom Lane wrote: >> I also saw it on florican, which is/was an i386 machine using clang and >> pretty standard build options other than >> 'CFLAGS' => '-msse2 -O2', >> so I think this isn't too much about

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Thomas Munro
On Sat, Jan 28, 2023 at 4:42 PM Andres Freund wrote: > Did you use the same compiler / compilation flags as when elver hit it? > Clearly Tomas' case was with at least some optimizations enabled. I did use the same compiler version and optimisation level, clang llvmorg-13.0.0-0-gd7b669b3a303 at

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Andres Freund
Hi, On 2023-01-27 23:18:39 -0500, Tom Lane wrote: > I also saw it on florican, which is/was an i386 machine using clang and > pretty standard build options other than > 'CFLAGS' => '-msse2 -O2', > so I think this isn't too much about machine architecture or compiler > flags. Ah. Florican

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Tom Lane
Andres Freund writes: > Except that you're saying that you hit this on elver (amd64), I think it'd be > interesting that we see the failure on an arm host, which has a less strict > memory order model than x86. I also saw it on florican, which is/was an i386 machine using clang and pretty

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Andres Freund
Hi, On 2023-01-27 22:23:58 +1300, Thomas Munro wrote: > After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq > check loops, on the same FBSD 13.1 machine as elver which has failed > like this once before, I haven't been able to reproduce this on > REL_12_STABLE. Did you use

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Thomas Munro
After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq check loops, on the same FBSD 13.1 machine as elver which has failed like this once before, I haven't been able to reproduce this on REL_12_STABLE. Not really sure how to chase this, but if you see this situation again,

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro wrote: > On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote: > > Tomas Vondra writes: > > > I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14) > > > did not report any results for a couple days, and it seems it got into > > > an

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Thomas Munro
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote: > Tomas Vondra writes: > > I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14) > > did not report any results for a couple days, and it seems it got into > > an infinite loop in REL_11_STABLE when building hash table in a

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tom Lane
Tomas Vondra writes: > I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14) > did not report any results for a couple days, and it seems it got into > an infinite loop in REL_11_STABLE when building hash table in a parallel > hashjoin, or something like that. > It seems to be

lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-26 Thread Tomas Vondra
Hi, I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14) did not report any results for a couple days, and it seems it got into an infinite loop in REL_11_STABLE when building hash table in a parallel hashjoin, or something like that. It seems to be progressing now, probably