Re: Non-reproducible AIO failure

2025-09-03 Thread Andres Freund
Hi, On 2025-09-03 21:50:42 +0300, Konstantin Knizhnik wrote: > On 03/09/2025 8:37 PM, Dmitry Mityugov wrote: > Size of PgAioHandle is 144 bytes. I wonder how critical for us is to save 9 > bytes for it (3 bytes vs 3 integers)? Not that it makes that huge a difference, but due to alignment consider

Re: Non-reproducible AIO failure

2025-09-03 Thread Konstantin Knizhnik
On 03/09/2025 8:37 PM, Dmitry Mityugov wrote: Quite inspiring discussion. The patch is brilliantly good but it adds a bunch of explicit type casts, and it's not always easy to remember what cast to use in a particular case, and that may eventually lead to errors in the future. Just wanted to

Re: Non-reproducible AIO failure

2025-09-03 Thread Dmitry Mityugov
Quite inspiring discussion. The patch is brilliantly good but it adds a bunch of explicit type casts, and it's not always easy to remember what cast to use in a particular case, and that may eventually lead to errors in the future. Just wanted to add that when 64-bit code is generated, uint8s are p

Re: Non-reproducible AIO failure

2025-08-28 Thread Konstantin Knizhnik
On 28/08/2025 3:08 AM, Thomas Munro wrote: On Thu, Aug 28, 2025 at 11:08 AM Andres Freund wrote: On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote: Still it is not quite clear to me how bitfields can cause this issue. Same. Here's what I speculated after reading the generated asm[1]:

Re: Non-reproducible AIO failure

2025-08-27 Thread Thomas Munro
On Thu, Aug 28, 2025 at 11:08 AM Andres Freund wrote: > On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote: > > Still it is not quite clear to me how bitfields can cause this issue. > > Same. Here's what I speculated after reading the generated asm[1]: "Could it be that the store buffer was

Re: Non-reproducible AIO failure

2025-08-27 Thread Andres Freund
On 2025-08-27 19:08:20 -0400, Andres Freund wrote: > I'll push the patch to remove the bitfields after adjusting the commit message > somewhat. And done.

Re: Non-reproducible AIO failure

2025-08-27 Thread Andres Freund
Hi, On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote: > On 26/08/2025 3:37 AM, Andres Freund wrote: > > Hi, > > > > I'm a bit confused by this focus on bitfields - both Alexander and > > Konstantin > > stated they could reproduce the issue without the bitfields. > > > Sorry if I am not

Re: Non-reproducible AIO failure

2025-08-26 Thread Ken Marshall
On Tue, Aug 26, 2025 at 04:59:54PM +0300, Konstantin Knizhnik wrote: > > > But we have observed the generated code being pretty grotty and it's caused > > more than enough confusion - so let's just replace them with plain uint8's > > and > > cast in switches. > > +1 > > May be I am wrong, but i

Re: Non-reproducible AIO failure

2025-08-26 Thread Konstantin Knizhnik
On 26/08/2025 3:37 AM, Andres Freund wrote: Hi, I'm a bit confused by this focus on bitfields - both Alexander and Konstantin stated they could reproduce the issue without the bitfields. Sorry if I am not correct, but it seems that the problem was never reproduced with replaced bitfields. I

Re: Non-reproducible AIO failure

2025-08-26 Thread Andres Freund
Hi, On 2025-08-26 15:21:34 +1200, Thomas Munro wrote: > On Tue, Aug 26, 2025 at 12:45 PM Andres Freund wrote: > > On 2025-08-25 10:43:21 +1200, Thomas Munro wrote: > > > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik > > > wrote: > > > > In theory even replacing bitfield with in should not

Re: Non-reproducible AIO failure

2025-08-25 Thread Thomas Munro
On Tue, Aug 26, 2025 at 12:37 PM Andres Freund wrote: > I'm a bit confused by this focus on bitfields - both Alexander and Konstantin > stated they could reproduce the issue without the bitfields. Konstantin's messages all seem to say it *did* fix it? But I do apologise for working through the sa

Re: Non-reproducible AIO failure

2025-08-25 Thread Thomas Munro
On Tue, Aug 26, 2025 at 12:45 PM Andres Freund wrote: > On 2025-08-25 10:43:21 +1200, Thomas Munro wrote: > > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik > > wrote: > > > In theory even replacing bitfield with in should not > > > avoid race condition, because they are still shared the sa

Re: Non-reproducible AIO failure

2025-08-25 Thread Andres Freund
Hi, On 2025-08-25 10:43:21 +1200, Thomas Munro wrote: > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik > wrote: > > In theory even replacing bitfield with in should not > > avoid race condition, because they are still shared the same cache line. > > I'm no expert in this stuff, but that's

Re: Non-reproducible AIO failure

2025-08-25 Thread Andres Freund
Hi, I'm a bit confused by this focus on bitfields - both Alexander and Konstantin stated they could reproduce the issue without the bitfields. But we have observed the generated code being pretty grotty and it's caused more than enough confusion - so let's just replace them with plain uint8's and
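
For readers skimming the thread, here is a minimal sketch of what "plain uint8's and cast in switches" can look like. It is not the committed patch: `DemoState`, `DemoHandle`, `demo_set_state` and `demo_state_name` are made-up names standing in for the real PgAioHandle fields, and the claim that the byte is written with a single byte store is what one would typically expect from the compiler, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an enum such as PgAioHandleState. */
typedef enum DemoState
{
    DEMO_IDLE = 0,
    DEMO_DEFINED,
    DEMO_COMPLETED
} DemoState;

/* Hypothetical handle: the enum values live in plain bytes, not bit-fields. */
typedef struct DemoHandle
{
    uint8_t state;  /* holds a DemoState */
    uint8_t op;
} DemoHandle;

/* Store the enum into its byte; the assert guards against truncation. */
static void
demo_set_state(DemoHandle *h, DemoState s)
{
    assert((unsigned) s <= UINT8_MAX);
    h->state = (uint8_t) s;
}

/* Cast back to the enum so the compiler can still warn about missing cases. */
static const char *
demo_state_name(const DemoHandle *h)
{
    switch ((DemoState) h->state)
    {
        case DEMO_IDLE:
            return "idle";
        case DEMO_DEFINED:
            return "defined";
        case DEMO_COMPLETED:
            return "completed";
    }
    return "unknown";
}
```

The cost raised later in the thread is visible here: every switch needs the explicit cast, and nothing stops a caller from writing an out-of-range byte directly.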

Re: Non-reproducible AIO failure

2025-08-24 Thread Thomas Munro
On Mon, Aug 25, 2025 at 2:41 PM Thomas Munro wrote: > On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro wrote: > > > struct { PgAioHandleState v:8; } state; > > > > This preserves type safety and compiles to strb, two properties we > > want, but it seems to waste space (look at the offsets for

Re: Non-reproducible AIO failure

2025-08-24 Thread Thomas Munro
On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro wrote: > > struct { PgAioHandleState v:8; } state; > > This preserves type safety and compiles to strb, two properties we > want, but it seems to waste space (look at the offsets for the > stores): > > a.out[0x105f8] <+140>: ldr x8, [sp, #
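
The space concern can be checked with `sizeof`/`offsetof`. A small sketch, assuming a compiler that accepts enum-typed bit-fields (that is implementation-defined; gcc and clang do). `DemoState`, `PlainBytes` and `WrappedBits` are illustrative names, not PostgreSQL types, and the exact size of the wrapped variant depends on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum DemoState { DEMO_IDLE, DEMO_DEFINED } DemoState;

/* Three adjacent plain bytes. */
typedef struct PlainBytes
{
    uint8_t state;
    uint8_t target;
    uint8_t op;
} PlainBytes;

/* Each field wrapped in its own struct holding an 8-bit enum bit-field. */
typedef struct WrappedBits
{
    struct { DemoState v:8; } state;
    struct { DemoState v:8; } target;
    struct { DemoState v:8; } op;
} WrappedBits;

static_assert(sizeof(PlainBytes) == 3, "three adjacent bytes");

/* Offsets of the last field in each layout, for comparison. */
static size_t plain_op_offset(void)   { return offsetof(PlainBytes, op); }
static size_t wrapped_op_offset(void) { return offsetof(WrappedBits, op); }
```

On common ABIs each wrapper struct is padded to the alignment of the enum's underlying type, so `wrapped_op_offset()` comes out larger than `plain_op_offset()`, which is consistent with the wider store offsets in the quoted disassembly.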

Re: Non-reproducible AIO failure

2025-08-24 Thread Thomas Munro
On Mon, Aug 25, 2025 at 11:42 AM Nico Williams wrote: > I think the issue is that if the compiler decides to coalesce what we > think of as distinct (but neighboring) bitfields, then when you update > one of the bitfields you could be updating the other with stale data > from an earlier read where

Re: Non-reproducible AIO failure

2025-08-24 Thread Nico Williams
On Mon, Aug 25, 2025 at 10:43:21AM +1200, Thomas Munro wrote: > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik > wrote: > > In theory even replacing bitfield with in should not > > avoid race condition, because they are still shared the same cache line. > > I'm no expert in this stuff, but

Re: Non-reproducible AIO failure

2025-08-24 Thread Thomas Munro
On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik wrote: > In theory even replacing bitfield with in should not > avoid race condition, because they are still shared the same cache line. I'm no expert in this stuff, but that's not my understanding of how it works. Plain stores to normal memory

Re: Non-reproducible AIO failure

2025-08-24 Thread Konstantin Knizhnik
On 24/08/2025 3:38 PM, Thomas Munro wrote: That's also how open source clang 17 compiles it if you rip out the bitfield. I guess if you do that, you won't be able to reproduce this, Alexander? Something like: I think that we have made this experiment at the very beginning and as far as I re

Re: Non-reproducible AIO failure

2025-08-24 Thread Thomas Munro
On Sun, Aug 24, 2025 at 5:32 AM Konstantin Knizhnik wrote: > On 20/08/2025 9:00 PM, Alexander Lakhin wrote: > > for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i: > > $np"; time parallel -j40 --linebuffer --tag /tmp/repro-AIO-Assert.sh > > {} ::: `seq $np` || break; sleep $(($RAN

Re: Non-reproducible AIO failure

2025-08-23 Thread Konstantin Knizhnik
On 20/08/2025 9:00 PM, Alexander Lakhin wrote: for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i: $np"; time parallel -j40 --linebuffer --tag  /tmp/repro-AIO-Assert.sh {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007" Unfortunately I was not able to r

Re: Non-reproducible AIO failure

2025-08-20 Thread Alexander Lakhin
Hello Andres, 22.07.2025 02:19, Andres Freund wrote: Hi, On 2025-06-19 10:16:12 -0500, Nico Williams wrote: On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote: I also dug out an archeologically old MacBook Pro running macOS High Sierra 10.13.6 with an i5 using Apple LLVM versio

Re: Non-reproducible AIO failure

2025-08-12 Thread Nathan Bossart
On Mon, Jul 21, 2025 at 07:19:54PM -0400, Andres Freund wrote: > RMT, note that there were two issues in this thread, the original report by > Tom has been addressed (in e9a3615a522). I guess the best thing would be to > split the open items entry into two? I went ahead and marked the open item a

Re: Non-reproducible AIO failure

2025-07-21 Thread Andres Freund
Hi, On 2025-06-19 10:16:12 -0500, Nico Williams wrote: > On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote: > > I also dug out an archeologically old MacBook Pro running macOS High Sierra > > 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), > > and it > > t

Re: Non-reproducible AIO failure

2025-06-19 Thread Nico Williams
On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote: > I also dug out an archeologically old MacBook Pro running macOS High Sierra > 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and > it > too fails to reproduce any issue. It's not going to be reproducibl

Re: Non-reproducible AIO failure

2025-06-19 Thread Daniel Gustafsson
> On 19 Jun 2025, at 16:36, Andres Freund wrote: > So for some reason this apparently can only be reproduced on older macos - we > know it's not the older compiler, because I couldn't reproduce it on the same > compile version as alexander, on an m1 that was running sequoia. That's really > reall

Re: Non-reproducible AIO failure

2025-06-19 Thread Andres Freund
Hi, On 2025-06-19 17:02:18 +0300, Konstantin Knizhnik wrote: > On 18/06/2025 7:08 pm, Andres Freund wrote: > > Hi, > > > > On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: > > > On 17/06/2025 6:08 pm, Andres Freund wrote: > > > > I don't think it can - this must be an independent bug from

Re: Non-reproducible AIO failure

2025-06-19 Thread Daniel Gustafsson
> On 19 Jun 2025, at 16:02, Konstantin Knizhnik wrote: > By the way - still not been able to reproduce assertion failure at most > recent MacPro (Apple M4 Pro) with Sequoia 15.5. I tried to reproduce this on an older quad core i7 MacBook Pro running Sonoma 14.7.5 using Apple clang version 15.0.

Re: Non-reproducible AIO failure

2025-06-19 Thread Konstantin Knizhnik
On 18/06/2025 7:08 pm, Andres Freund wrote: Hi, On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: On 17/06/2025 6:08 pm, Andres Freund wrote: I don't think it can - this must be an independent bug from the one that Tom and I were encountering. I see... It's a pity. Indeed. Konstant

Re: Non-reproducible AIO failure

2025-06-18 Thread Thomas Munro
On Thu, Jun 19, 2025 at 4:08 AM Andres Freund wrote: > Konstantin, Alexander, can you share what commit you're testing and what > precise changes have been applied to the source? I've now tested this on a > significant number of apple machines for many many days without being able to > reproduce

Re: Non-reproducible AIO failure

2025-06-18 Thread Andres Freund
Hi, On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote: > On 17/06/2025 6:08 pm, Andres Freund wrote: > > > > I don't think it can - this must be an independent bug from the one that Tom > > and I were encountering. > I see... It's a pity. Indeed. Konstantin, Alexander, can you share what

Re: Non-reproducible AIO failure

2025-06-18 Thread Konstantin Knizhnik
On 17/06/2025 6:08 pm, Andres Freund wrote: I don't think it can - this must be an independent bug from the one that Tom and I were encountering. I see... It's a pity. By the way, I have a question concerning using interrupts in AIO. The comments say: pgaio_io_release(PgAioHandle *ioh)

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
On 2025-06-17 18:08:30 +0300, Konstantin Knizhnik wrote: > > On 17/06/2025 4:47 pm, Andres Freund wrote: > > > I and Alexandr are using completely different devices with different > > > hardware, OS and clang version. > > Both of you are running Ventura, right? > > > No, Alexandr is using darwin2

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote: > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > The problem appears to be in that switch between "when submitted, by the IO > > worker" and "then again by the backend". It's not concurrent access in the > > sense of two processes writin

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 4:47 pm, Andres Freund wrote: I and Alexandr are using completely different devices with different hardware, OS and clang version. Both of you are running Ventura, right? No, Alexandr is using darwin23.5 Alexandr also noticed that he can reproduce the problem only with --with-l

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that switch between "when submitted, by the IO worker" and "then again by the backend". It's not concurrent access in the sense of two processes writing to the same value, it's that when switching from the worker updating

Re: Non-reproducible AIO failure

2025-06-17 Thread Tom Lane
Andres Freund writes: > Both of you are running Ventura, right? FTR, the machines I'm trying this on are all running current Sequoia: [tgl@minim4 ~]$ uname -a Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T604

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
Hi, On 2025-06-17 16:43:05 +0300, Konstantin Knizhnik wrote: > On 17/06/2025 4:35 pm, Andres Freund wrote: > > Konstantin, Alexander - are you using the same device to reproduce this or > > different ones? I wonder if this somehow depends on some MDM / corporate > > enforcement tooling running or

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 4:35 pm, Andres Freund wrote: Konstantin, Alexander - are you using the same device to reproduce this or different ones? I wonder if this somehow depends on some MDM / corporate enforcement tooling running or such. What does: - profiles status -type enrollment - kextstat -l show?

Re: Non-reproducible AIO failure

2025-06-17 Thread Andres Freund
Hi, On 2025-06-16 20:22:00 -0400, Tom Lane wrote: > Konstantin Knizhnik writes: > > On 16/06/2025 6:11 pm, Andres Freund wrote: > >> I unfortunately can't repro this issue so far. > > > But unfortunately it means that the problem is not fixed. > > FWIW, I get similar results to Andres' on a Mac M

Re: Non-reproducible AIO failure

2025-06-17 Thread Konstantin Knizhnik
On 17/06/2025 3:22 am, Tom Lane wrote: Konstantin Knizhnik writes: On 16/06/2025 6:11 pm, Andres Freund wrote: I unfortunately can't repro this issue so far. But unfortunately it means that the problem is not fixed. FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using MacPorts'

Re: Non-reproducible AIO failure

2025-06-16 Thread Tom Lane
Konstantin Knizhnik writes: > On 16/06/2025 6:11 pm, Andres Freund wrote: >> I unfortunately can't repro this issue so far. > But unfortunately it means that the problem is not fixed. FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using MacPorts' current compiler release (clang vers

Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik
On 16/06/2025 6:11 pm, Andres Freund wrote: Hi, On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote: One more update: with the proposed patch (memory barrier before `ConditionVariableBroadcast` in `pgaio_io_process_completion` I don't see how that barrier could be required for correctness

Re: Non-reproducible AIO failure

2025-06-16 Thread Andres Freund
Hi, On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote: > One more update: with the proposed patch (memory barrier before > `ConditionVariableBroadcast` in `pgaio_io_process_completion` I don't see how that barrier could be required for correctness - ConditionVariableBroadcast() is a barrier

Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik
One more update: with the proposed patch (memory barrier before `ConditionVariableBroadcast` in `pgaio_io_process_completion` and replacing bit fields with `uint8`) the problem is not reproduced at my system during 5 seconds.

Re: Non-reproducible AIO failure

2025-06-15 Thread Konstantin Knizhnik
With these two additional changes: diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c index 6c6c0a908e2..6dd2816bea9 100644 --- a/src/backend/storage/aio/aio.c +++ b/src/backend/storage/aio/aio.c @@ -538,6 +538,9 @@ pgaio_io_process_completion(PgAioHandle *ioh, int result)

Re: Non-reproducible AIO failure

2025-06-15 Thread Konstantin Knizhnik
On 13/06/2025 11:20 pm, Andres Freund wrote: Attached is a patch that fixes the problem for me. Alexander, Konstantin, could you verify that it also fixes the problem for you? Given that it does address the problem for me, I'm inclined to push this fairly soon, the barrier is pretty obviously r

Re: Non-reproducible AIO failure

2025-06-14 Thread Konstantin Knizhnik
On 13/06/2025 11:20 pm, Andres Freund wrote: Hi, On 2025-06-12 12:23:13 -0400, Andres Freund wrote: On 2025-06-12 11:52:31 -0400, Andres Freund wrote: On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that swit

Re: Non-reproducible AIO failure

2025-06-13 Thread Andres Freund
Hi, On 2025-06-12 12:23:13 -0400, Andres Freund wrote: > On 2025-06-12 11:52:31 -0400, Andres Freund wrote: > > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > > > The problem appears to be in that switch between "when submitted, by >

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 11:52:31 -0400, Andres Freund wrote: > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > > The problem appears to be in that switch between "when submitted, by the > > > IO > > > worker" and "then again by the backend".

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > On 12/06/2025 4:57 pm, Andres Freund wrote: > > The problem appears to be in that switch between "when submitted, by the IO > > worker" and "then again by the backend". It's not concurrent access in the > > sense of two processes writ

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
On 12/06/2025 4:57 pm, Andres Freund wrote: The problem appears to be in that switch between "when submitted, by the IO worker" and "then again by the backend". It's not concurrent access in the sense of two processes writing to the same value, it's that when switching from the worker updating

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote: > On 12/06/2025 4:13 pm, Andres Freund wrote: > > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: > > I'm reasonably certain I found the issue, I think it's a missing memory > > barrier on the read side. The CPU is reordering th
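
As background for the "missing memory barrier on the read side" diagnosis, here is a generic sketch of the pattern using C11 atomics. It is an illustration only, not the fix that was pushed: PostgreSQL has its own primitives such as `pg_read_barrier()` and `pg_write_barrier()`, and `DemoSlot`, `demo_complete` and `demo_try_consume` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot: a payload plus a completion flag. */
typedef struct DemoSlot
{
    int32_t     result;  /* payload, written before the flag */
    atomic_bool done;    /* completion flag */
} DemoSlot;

/* Writer: fill in the payload first, then publish with release ordering. */
static void
demo_complete(DemoSlot *s, int32_t result)
{
    assert(s != NULL);
    s->result = result;
    atomic_store_explicit(&s->done, true, memory_order_release);
}

/*
 * Reader: the acquire load keeps the payload read from being reordered
 * ahead of the flag check. With a relaxed load and no barrier, a weakly
 * ordered CPU may observe done == true and still return a stale payload.
 */
static bool
demo_try_consume(DemoSlot *s, int32_t *out)
{
    assert(s != NULL && out != NULL);
    if (!atomic_load_explicit(&s->done, memory_order_acquire))
        return false;
    *out = s->result;
    return true;
}
```

The write-side release alone is not enough; the read side needs its own ordering, which is the half that tends to be forgotten.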

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
On 12/06/2025 4:13 pm, Andres Freund wrote: Hi, On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: Reproduced it once again with write-protected io handle. But once again - no access violation, just assert failure. Previously "op" field was overwritten somewhere between `pgaio_io_

Re: Non-reproducible AIO failure

2025-06-12 Thread Andres Freund
Hi, On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: > Reproduced it once again with write-protected io handle. > But once again - no access violation, just assert failure. > > Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` > and `AsyncReadBuffers`: > > !

Re: Non-reproducible AIO failure

2025-06-12 Thread Konstantin Knizhnik
Reproduced it once again with write-protected io handle. But once again - no access violation, just assert failure. Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` and `AsyncReadBuffers`: !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generatio

Re: Non-reproducible AIO failure

2025-06-11 Thread Konstantin Knizhnik
I tried to catch moment when memory is changed using mprotect. I have aligned PgAioHandle on page boundary (16kb at MacOS), and disable writes in `pgaio_io_reclaim`: ``` static void pgaio_io_reclaim(PgAioHandle *ioh) {    RESUME_INTERRUPTS();     rc = mprotect(ioh, sizeof(*ioh), PROT_READ);    
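
The write-protection trick can be tried in isolation. A minimal sketch, assuming a POSIX system and a private anonymous mapping rather than PostgreSQL shared memory; `demo_alloc_page`, `demo_freeze` and `demo_thaw` are invented helper names.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD/macOS spelling */
#endif

/* Map one page-aligned, page-sized region; returns NULL on failure. */
static void *
demo_alloc_page(size_t *len_out)
{
    long pagesz = sysconf(_SC_PAGESIZE);

    assert(pagesz > 0);
    void *p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t) pagesz;
    return p;
}

/* Make the region read-only: any later store to it should fault. */
static int
demo_freeze(void *p, size_t len)
{
    return mprotect(p, len, PROT_READ);
}

/* Re-enable writes before the next legitimate modification. */
static int
demo_thaw(void *p, size_t len)
{
    return mprotect(p, len, PROT_READ | PROT_WRITE);
}
```

A stray store between `demo_freeze()` and `demo_thaw()` would normally raise SIGSEGV or SIGBUS at the offending instruction. If the assertion still fires with no fault, as reported in the follow-up messages, that points away from an unexpected writer and toward the reader seeing stale data.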

Re: Non-reproducible AIO failure

2025-06-10 Thread Andres Freund
Hi, On 2025-06-10 21:09:18 +0300, Konstantin Knizhnik wrote: > > On 10/06/2025 8:41 pm, Andres Freund wrote: > > I was able to reproduce it with gcc, too. > > I've reproduced it without that bitfield, unfortunately :(. > But also only at MacOS? Correct. > I wonder if it is possible to set har

Re: Non-reproducible AIO failure

2025-06-10 Thread Konstantin Knizhnik
On 10/06/2025 8:41 pm, Andres Freund wrote: I was able to reproduce it with gcc, too. I've reproduced it without that bitfield, unfortunately :(. But also only at MacOS? I wonder if it is possible to set hardware watchpoint from the program itself (not using gdb)? I.e. using ptrace? Looks lik

Re: Non-reproducible AIO failure

2025-06-10 Thread Andres Freund
Hi, On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote: > On 09/06/2025 2:05 am, Thomas Munro wrote: > > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: > > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > > > There is really essential difference in code generated by clang

Re: Non-reproducible AIO failure

2025-06-10 Thread Konstantin Knizhnik
On 09/06/2025 2:05 am, Thomas Munro wrote: On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: There is really essential difference in code generated by clang 15 (working) and 16 (not working). There also are code gen differences betw

Re: Non-reproducible AIO failure

2025-06-08 Thread Thomas Munro
On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote: > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > There is really essential difference in code generated by clang 15 (working) > > and 16 (not working). > > There also are code gen differences between upstream clang 17 and apple's >

Re: Non-reproducible AIO failure

2025-06-08 Thread Tom Lane
Andres Freund writes: > The symptoms I can reproduce are slightly different than Alexander's - it's > the assertion failure reported upthread by Tom. > > FWIW, I can continue to repro the assertion after removing the use of the > bitfield in PgAioHandle. So the problem indeed seems to be be indepe

Re: Non-reproducible AIO failure

2025-06-08 Thread Andres Freund
Hi, On 2025-06-06 15:37:45 -0400, Andres Freund wrote: > There shouldn't be any concurrent accesses here, so I don't really see how the > above would explain the problem (the IO can only ever be modified by one > backend, initially the "owning backend", then, when submitted, by the IO > worker, an

Re: Non-reproducible AIO failure

2025-06-07 Thread Konstantin Knizhnik
On 06/06/2025 10:21 pm, Tom Lane wrote: Konstantin Knizhnik writes: There is really essential difference in code generated by clang 15 (working) and 16 (not working). It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations

Re: Non-reproducible AIO failure

2025-06-06 Thread Konstantin Knizhnik
On 06/06/2025 9:47 pm, Andres Freund wrote: Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: There is really essential difference in code generated by clang 15 (working) and 16 (not working). There also are code gen differences between upstream clang 17 and apple's clang, which is

Re: Non-reproducible AIO failure

2025-06-06 Thread Nico Williams
On Fri, Jun 06, 2025 at 03:37:45PM -0400, Andres Freund wrote: > On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > > So it's our code that is busted. No doubt, what is happening is > > that process A is fetching two fields, modifying one of them, > > and storing the word back (with the observed valu

Re: Non-reproducible AIO failure

2025-06-06 Thread Alexander Lakhin
Hello Andres and Tom, 06.06.2025 22:37, Andres Freund wrote: On 2025-06-06 15:21:13 -0400, Tom Lane wrote: It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Such accesses may fetch or

Re: Non-reproducible AIO failure

2025-06-06 Thread Andres Freund
Hi, On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > Konstantin Knizhnik writes: > > There is really essential difference in code generated by clang 15 > > (working) and 16 (not working). > > It's a mistake to think that this is a compiler bug. The C standard > explicitly allows compilers to use

Re: Non-reproducible AIO failure

2025-06-06 Thread Tom Lane
Konstantin Knizhnik writes: > There is really essential difference in code generated by clang 15 > (working) and 16 (not working). It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Suc
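
To make the word-wide access hazard concrete, here is a single-threaded replay of one bad interleaving. It assumes two 8-bit fields packed into one 32-bit storage unit (the layout and all names are invented for the illustration), and it only shows the mechanism. Whether this interleaving can actually occur in the AIO code was still being debated elsewhere in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0..7 hold "state", bits 8..15 hold "op". */
#define STATE_MASK 0x000000FFu
#define OP_MASK    0x0000FF00u

static uint8_t get_state(uint32_t w) { return (uint8_t) (w & STATE_MASK); }
static uint8_t get_op(uint32_t w)    { return (uint8_t) ((w & OP_MASK) >> 8); }

/* What a word-wide bit-field store does: merge the new field into a copy. */
static uint32_t
with_state(uint32_t snapshot, uint8_t state)
{
    return (snapshot & ~STATE_MASK) | state;
}

static void
set_op(uint32_t *w, uint8_t op)
{
    *w = (*w & ~OP_MASK) | ((uint32_t) op << 8);
}

/*
 * Replay: A loads the word, B updates op, then A stores its stale copy
 * back with a new state. Returns the final word.
 */
static uint32_t
demo_lost_update(void)
{
    uint32_t word = 0;
    uint32_t a_snapshot = word;        /* A: load the whole word */

    set_op(&word, 1);                  /* B: op = 1 */
    assert(get_op(word) == 1);
    word = with_state(a_snapshot, 3);  /* A: store the whole word */
    return word;
}
```

After the replay, state is 3 but op is back to 0: B's update is lost even though A never named `op` in its source code. With separate byte-sized fields written by byte stores, the same interleaving leaves both updates intact.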

Re: Non-reproducible AIO failure

2025-06-06 Thread Andres Freund
Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > There is really essential difference in code generated by clang 15 (working) > and 16 (not working). There also are code gen differences between upstream clang 17 and apple's clang, which is based on llvm 17 as well (I've updated the

Re: Non-reproducible AIO failure

2025-06-06 Thread Konstantin Knizhnik
There is really essential difference in code generated by clang 15 (working) and 16 (not working). ``` pgaio_io_stage(PgAioHandle *ioh, PgAioOp op) { ... HOLD_INTERRUPTS();     ioh->op = op;     ioh->result = 0;     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);     ... } ``` c

Re: Non-reproducible AIO failure

2025-06-05 Thread Konstantin Knizhnik
On 06/06/2025 2:31 am, Tom Lane wrote: Matthias van de Meent writes: I have a very wild guess that's probably wrong in a weird way, but here goes anyway: Did anyone test if interleaving the enum-typed bitfield fields of PgAioHandle with the uint8 fields might solve the issue? Ugh. I think y

Re: Non-reproducible AIO failure

2025-06-05 Thread Alexander Lakhin
Hello, 05.06.2025 22:00, Alexander Lakhin wrote: Thank you for your attention to this and for the tip! Today I tried the following: --- a/src/include/storage/aio.h +++ b/src/include/storage/aio.h @@ -89,8 +89,8 @@ typedef enum PgAioOp     /* intentionally the zero value, to help catch zeroed

Re: Non-reproducible AIO failure

2025-06-05 Thread Tom Lane
Matthias van de Meent writes: > I have a very wild guess that's probably wrong in a weird way, but > here goes anyway: > Did anyone test if interleaving the enum-typed bitfield fields of > PgAioHandle with the uint8 fields might solve the issue? Ugh. I think you probably nailed it. IMO all thos

Re: Non-reproducible AIO failure

2025-06-05 Thread Matthias van de Meent
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin wrote: > > Hello Thomas and Andres, > > 04.06.2025 23:32, Thomas Munro wrote: > > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: > >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_

Re: Non-reproducible AIO failure

2025-06-05 Thread Alexander Lakhin
Hello Thomas and Andres, 04.06.2025 23:32, Thomas Munro wrote: On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->resul

Re: Non-reproducible AIO failure

2025-06-04 Thread Thomas Munro
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote: > On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: > > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: > > 2, ioh->generation: 21694 >

Re: Non-reproducible AIO failure

2025-06-04 Thread Andres Freund
Hi, Thanks for working on investigating this. On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > 02.06.2025 09:00, Alexander Lakhin wrote: > > With additional logging (the patch is attached), I can see the following: > > ... > > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh

Re: Non-reproducible AIO failure

2025-06-02 Thread Alexander Lakhin
Hello, 02.06.2025 09:00, Alexander Lakhin wrote: With additional logging (the patch is attached), I can see the following: ... !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0

Re: Non-reproducible AIO failure

2025-06-01 Thread Alexander Lakhin
31.05.2025 06:00, Alexander Lakhin wrote: Hello Thomas, It looks like I managed to restore all the conditions needed to reproduce that Assert more or less reliably (within a couple of hours), so I can continue experiments. I've added the following debugging: ... With additional logging (the p

Re: Non-reproducible AIO failure

2025-05-30 Thread Alexander Lakhin
Hello Thomas, 25.05.2025 05:45, Thomas Munro wrote: TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 Can you get a core and print *ioh in the debugger? It looks like I managed to restore all the conditions needed to reproduce that Assert more or less

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > I'll see if being graphically logged in somehow indeed increased the repro > rate, and if so I'll expand the debugging somewhat, or if this was just an > absurd coincidence. Hmm. Now that you mention it, the one repro on the M1 came just as I was about to give up and manu

Re: Non-reproducible AIO failure

2025-05-27 Thread Robert Haas
On Sun, May 25, 2025 at 8:25 PM Tom Lane wrote: > The fact that I can trace through this Assert failure but not the > AIO one strongly suggests some system-level problem in the latter. > There is something rotten in the state of Denmark. I have been quite frustrated with lldb on macOS for a while

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-27 14:43:14 -0400, Tom Lane wrote: > Andres Freund writes: > > I just meant that it seems that I can't reproduce it for some as of yet > > unknown reason. I've now been through 3k+ runs of 027_stream_regress, > > without > > a single failure, so there has to be *something* differe

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > I just meant that it seems that I can't reproduce it for some as of yet > unknown reason. I've now been through 3k+ runs of 027_stream_regress, without > a single failure, so there has to be *something* different about my > environment than yours. > Darwin m4-dev 24.1.0 Da

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-27 10:12:28 -0400, Tom Lane wrote: > Andres Freund writes: > > This is on a m4 mac mini. I'm wondering if there's some hardware specific > > memory ordering issue or disk speed based timing issue that I'm just not > > hitting. > > I dunno, I've seen it on three different physical

Re: Non-reproducible AIO failure

2025-05-27 Thread Alexander Lakhin
Hello hackers, 27.05.2025 16:35, Andres Freund wrote: On 2025-05-25 20:05:49 -0400, Tom Lane wrote: Thomas Munro writes: Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure p

Re: Non-reproducible AIO failure

2025-05-27 Thread Alexander Lakhin
Hello Tomas, 27.05.2025 16:26, Tomas Vondra wrote: I'm interested in how you run these tests in parallel. Can you share the patch/script? Yeah, sure. I'm running the test as follows: rm -rf src/test/recovery_*; for i in `seq 40`; do cp -r src/test/recovery/ src/test/recovery_$i/; sed -i .bak

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Andres Freund writes: > This is on a m4 mac mini. I'm wondering if there's some hardware specific > memory ordering issue or disk speed based timing issue that I'm just not > hitting. I dunno, I've seen it on three different physical machines now (one M1, two M4 Pros). But it is darn hard to re

Re: Non-reproducible AIO failure

2025-05-27 Thread Tom Lane
Thomas Munro writes: > Could you please share your configure options? The failures on indri and sifaka were during ordinary buildfarm runs, you can check the animals' details on the website. (Note those are same host machine, the difference is that indri uses some MacPorts packages while sifaka i

Re: Non-reproducible AIO failure

2025-05-27 Thread Tomas Vondra
On 5/24/25 23:00, Alexander Lakhin wrote: > ... > > I'm yet to see the Assert triggered on the buildfarm, but this one looks > interesting too. > > (I can share the complete patch + script for such testing, if it can be > helpful.) > I'm interested in how you run these tests in parallel. Can

Re: Non-reproducible AIO failure

2025-05-27 Thread Andres Freund
Hi, On 2025-05-25 20:05:49 -0400, Tom Lane wrote: > Thomas Munro writes: > > Could you guys please share your exact repro steps? > > I've just been running 027_stream_regress.pl over and over. > It's not a recommendable answer though because the failure > probability is tiny, under 1%. It sound

Re: Non-reproducible AIO failure

2025-05-27 Thread Thomas Munro
On Mon, May 26, 2025 at 12:05 PM Tom Lane wrote: > Thomas Munro writes: > > Could you guys please share your exact repro steps? > > I've just been running 027_stream_regress.pl over and over. > It's not a recommendable answer though because the failure > probability is tiny, under 1%. It sounded

Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro writes: > On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: >> So far, I've failed to get anything useful out of core files >> from this failure. The trace goes back no further than >> (lldb) bt >> * thread #1 >> * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8

Re: Non-reproducible AIO failure

2025-05-25 Thread Tom Lane
Thomas Munro writes: > Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure probability is tiny, under 1%. It sounded like Alexander had a better way. re

Re: Non-reproducible AIO failure

2025-05-25 Thread Thomas Munro
On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote: > Thomas Munro writes: > > Can you get a core and print *ioh in the debugger? > > So far, I've failed to get anything useful out of core files > from this failure. The trace goes back no further than > > (lldb) bt > * thread #1 > * frame #0: 0x00

Re: Non-reproducible AIO failure

2025-05-24 Thread Tom Lane
Thomas Munro writes: > Can you get a core and print *ioh in the debugger? So far, I've failed to get anything useful out of core files from this failure. The trace goes back no further than (lldb) bt * thread #1 * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8 That's

Re: Non-reproducible AIO failure

2025-05-24 Thread Thomas Munro
On Sun, May 25, 2025 at 9:00 AM Alexander Lakhin wrote: > Hello Thomas, > 24.05.2025 14:42, Thomas Munro wrote: > > On Sat, May 24, 2025 at 3:17 PM Tom Lane wrote: > >> So it seems that "very low-probability issue in our Mac AIO code" is > >> the most probable description. > > There isn't any mac
