On Sat, Sep 9, 2023 at 9:00 PM Alexander Lakhin wrote:
> Yes, I think we're dealing with something like that. I can try to narrow down a
> minimal change that affects reproducing the issue, but maybe it's not that important.
> Perhaps we should now think of escalating the problem to the FreeBSD developers?
> I
Hi Thomas,
08.09.2023 22:39, Thomas Munro wrote:
With debug logging added, I see (on 7389aad63~1) that one process
really sends SIGURG to another, and the latter reaches poll(), but it
just gets no signal: its signal handler is not called and poll() just waits...
Thanks for working so hard on
On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin wrote:
> It takes less than 10 minutes on average for me. I checked
> REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined
> forcefully) — they all are affected.
> I could not reproduce the lockup on my Ubuntu box (with
Hello,
03.09.2023 00:00, Alexander Lakhin wrote:
I'll try to test this guess on the target machine...
I got access to dikkop thanks to Tomas Vondra, and started reproducing the
issue. It was rather difficult to catch the lockup as Tomas and Tom
noticed before. I tried to use stress-ng to
I agree that the code lacks barriers. I haven't been able to figure
out how any reordering could cause this hang, though, because in these
old branches procsignal_sigusr1_handler is used for latch wakeups, and
it also calls SetLatch(MyLatch) itself, right at the end. That is,
SetLatch() gets
Hello Robert,
01.09.2023 23:21, Robert Haas wrote:
On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin wrote:
(Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
issue for us.)
Maybe it'd be worth trying something stronger, like
pg_memory_barrier(). A compiler barrier
On Fri, Sep 1, 2023 at 6:13 AM Alexander Lakhin wrote:
> (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
> issue for us.)
Maybe it'd be worth trying something stronger, like
pg_memory_barrier(). A compiler barrier doesn't prevent the CPU from
reordering loads and stores
Hello Tomas,
01.09.2023 16:00, Tomas Vondra wrote:
Hmmm, I'm not very good at reading the binary code, but here's what
objdump produced for WaitEventSetWait. Maybe someone will see what the
issue is.
At first glance, I can't see anything suspicious in the disassembly.
IIUC, waiting = true
On 9/1/23 10:00, Alexander Lakhin wrote:
> Hello Thomas,
>
> 31.08.2023 14:15, Thomas Munro wrote:
>
>> We have a signal that is pending and not blocked, so I don't
>> immediately know why poll() hasn't returned control.
>
> When I worked at the Postgres Pro company, we observed a similar
Hello Thomas,
31.08.2023 14:15, Thomas Munro wrote:
We have a signal that is pending and not blocked, so I don't
immediately know why poll() hasn't returned control.
When I worked at the Postgres Pro company, we observed a similar lockup
under rather specific conditions (we used Elbrus CPU
On Thu, Aug 31, 2023 at 2:32 PM Thomas Munro wrote:
> On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra
> wrote:
> > I have another case of this on dikkop (on 11 again). Is there anything
> > else we'd want to try? Or maybe someone would want access to the machine
> > and do some investigation
On Thu, Aug 31, 2023 at 12:16 AM Tomas Vondra
wrote:
> I have another case of this on dikkop (on 11 again). Is there anything
> else we'd want to try? Or maybe someone would want access to the machine
> and do some investigation directly?
Sounds interesting -- I'll ping you off-list.
Hi,
I have another case of this on dikkop (on 11 again). Is there anything
else we'd want to try? Or maybe someone would want access to the machine
and do some investigation directly?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/7/23 01:09, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra
> wrote:
>> On 2/7/23 00:48, Thomas Munro wrote:
>>> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
>>> wrote:
No, I left the workload as it was for the first lockup, so `make check`
runs everything as
On Tue, Feb 7, 2023 at 1:06 PM Tomas Vondra
wrote:
> On 2/7/23 00:48, Thomas Munro wrote:
> > On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
> > wrote:
> >> No, I left the workload as it was for the first lockup, so `make check`
> >> runs everything as is up until the "join" test suite.
> >
> >
On 2/7/23 00:48, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
> wrote:
>> No, I left the workload as it was for the first lockup, so `make check`
>> runs everything as is up until the "join" test suite.
>
> Wait, shouldn't that be join_hash?
No, because join_hash does not
On Tue, Feb 7, 2023 at 12:46 PM Tomas Vondra
wrote:
> No, I left the workload as it was for the first lockup, so `make check`
> runs everything as is up until the "join" test suite.
Wait, shouldn't that be join_hash?
On 2/6/23 20:20, Andres Freund wrote:
> Hi,
>
> On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote:
>>> No. The only thing the machine is doing is
>>>
>>> while /usr/bin/true; do
>>> make check
>>> done
>>>
>>> I can't reduce the workload further, because the "join" test is in a
>>>
Hi,
On 2023-02-06 19:51:19 +0100, Tomas Vondra wrote:
> > No. The only thing the machine is doing is
> >
> > while /usr/bin/true; do
> > make check
> > done
> >
> > I can't reduce the workload further, because the "join" test is in a
> > separate parallel group (I cut down
On 1/29/23 19:08, Tomas Vondra wrote:
>
>
> On 1/29/23 18:53, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
>>> Will do, but I'll wait for another lockup to see how frequent it
>>> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
>>>
On Mon, Jan 30, 2023 at 6:36 PM Andres Freund wrote:
> On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:
> > On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote:
> > > out-of-order hazard
> >
> > I've been trying to understand how that could happen, but my CPU-fu is
> > weak. Let me try to write
Hi,
On 2023-01-30 15:22:34 +1300, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote:
> > out-of-order hazard
>
> I've been trying to understand how that could happen, but my CPU-fu is
> weak. Let me try to write an argument for why it can't happen, so
> that later I can
On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro wrote:
> out-of-order hazard
I've been trying to understand how that could happen, but my CPU-fu is
weak. Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was. We have A
B, and if the
On Mon, Jan 30, 2023 at 7:08 AM Tomas Vondra
wrote:
> However, the other lockup I saw was when using serial_schedule, so I
> guess lower concurrency makes it more likely.
FWIW "psql db -f src/test/regress/sql/join_hash.sql | cat" also works
(I mean, it's self-contained and doesn't need anything
On 1/29/23 18:53, Andres Freund wrote:
> Hi,
>
> On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
>> Will do, but I'll wait for another lockup to see how frequent it
>> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
>> So hitting it after 15 runs might have been a bit
Hi,
On 2023-01-29 18:39:05 +0100, Tomas Vondra wrote:
> Will do, but I'll wait for another lockup to see how frequent it
> actually is. I'm now at ~90 runs total, and it didn't happen again yet.
> So hitting it after 15 runs might have been a bit of a luck.
Was there a difference in how much
Hi,
On 2023-01-30 06:26:02 +1300, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> wrote:
> > So I did that - same configure options as the buildfarm client, and a
> > 'make check' (with only tests up to the 'join' suite, because that's
> > where it got stuck before). And it
On 1/29/23 18:26, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> wrote:
>> So I did that - same configure options as the buildfarm client, and a
>> 'make check' (with only tests up to the 'join' suite, because that's
>> where it got stuck before). And it took only ~15
On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
wrote:
> So I did that - same configure options as the buildfarm client, and a
> 'make check' (with only tests up to the 'join' suite, because that's
> where it got stuck before). And it took only ~15 runs (~1h) to hit this
> again on dikkop.
That's
On 1/28/23 13:05, Tomas Vondra wrote:
>
> FWIW I'll wait for dikkop to finish the current buildfarm run (it's
> currently chewing on HEAD) and then will try to do runs of the 'joins'
> test in a loop. That's where dikkop got stuck before.
>
So I did that - same configure options as the buildfarm
On 1/28/23 05:53, Andres Freund wrote:
> Hi,
>
> On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
>> I also saw it on florican, which is/was an i386 machine using clang and
>> pretty standard build options other than
>> 'CFLAGS' => '-msse2 -O2',
>> so I think this isn't too much about
On Sat, Jan 28, 2023 at 4:42 PM Andres Freund wrote:
> Did you use the same compiler / compilation flags as when elver hit it?
> Clearly Tomas' case was with at least some optimizations enabled.
I did use the same compiler version and optimisation level, clang
llvmorg-13.0.0-0-gd7b669b3a303 at
Hi,
On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
> I also saw it on florican, which is/was an i386 machine using clang and
> pretty standard build options other than
> 'CFLAGS' => '-msse2 -O2',
> so I think this isn't too much about machine architecture or compiler
> flags.
Ah. Florican
Andres Freund writes:
> Except that you're saying that you hit this on elver (amd64), I think it'd be
> interesting that we see the failure on an arm host, which has a less strict
> memory order model than x86.
I also saw it on florican, which is/was an i386 machine using clang and
pretty
Hi,
On 2023-01-27 22:23:58 +1300, Thomas Munro wrote:
> After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
> check loops, on the same FBSD 13.1 machine as elver which has failed
> like this once before, I haven't been able to reproduce this on
> REL_12_STABLE.
Did you use
After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
check loops, on the same FBSD 13.1 machine as elver which has failed
like this once before, I haven't been able to reproduce this on
REL_12_STABLE. Not really sure how to chase this, but if you see this
situation again,
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro wrote:
> On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote:
> > Tomas Vondra writes:
> > > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > > did not report any results for a couple days, and it seems it got into
> > > an
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote:
> Tomas Vondra writes:
> > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > did not report any results for a couple days, and it seems it got into
> > an infinite loop in REL_11_STABLE when building hash table in a
Tomas Vondra writes:
> I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> did not report any results for a couple days, and it seems it got into
> an infinite loop in REL_11_STABLE when building hash table in a parallel
> hashjoin, or something like that.
> It seems to be
Hi,
I received an alert that dikkop (my rpi4 buildfarm animal running FreeBSD 14)
had not reported any results for a couple of days, and it seems it got into
an infinite loop in REL_11_STABLE when building a hash table in a parallel
hash join, or something like that.
It seems to be progressing now, probably