Hi,
On 2025-09-03 21:50:42 +0300, Konstantin Knizhnik wrote:
> On 03/09/2025 8:37 PM, Dmitry Mityugov wrote:
> Size of PgAioHandle is 144 bytes. I wonder how critical it is for us to save 9
> bytes for it (3 bytes vs 3 integers)?
Not that it makes that huge a difference, but due to alignment consider
On 03/09/2025 8:37 PM, Dmitry Mityugov wrote:
Quite inspiring discussion. The patch is brilliantly good but it adds
a bunch of explicit type casts, and it's not always easy to remember
what cast to use in a particular case, and that may eventually lead to
errors in the future. Just wanted to
Quite inspiring discussion. The patch is brilliantly good but it adds a
bunch of explicit type casts, and it's not always easy to remember what
cast to use in a particular case, and that may eventually lead to errors in
the future. Just wanted to add that when 64-bit code is generated, uint8s
are p
On 28/08/2025 3:08 AM, Thomas Munro wrote:
On Thu, Aug 28, 2025 at 11:08 AM Andres Freund wrote:
On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
Still it is not quite clear to me how bitfields can cause this issue.
Same.
Here's what I speculated after reading the generated asm[1]:
On Thu, Aug 28, 2025 at 11:08 AM Andres Freund wrote:
> On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
> > Still it is not quite clear to me how bitfields can cause this issue.
>
> Same.
Here's what I speculated after reading the generated asm[1]:
"Could it be that the store buffer was
On 2025-08-27 19:08:20 -0400, Andres Freund wrote:
> I'll push the patch to remove the bitfields after adjusting the commit message
> somewhat.
And done.
Hi,
On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
> On 26/08/2025 3:37 AM, Andres Freund wrote:
> > Hi,
> >
> > I'm a bit confused by this focus on bitfields - both Alexander and
> > Konstantin
> > stated they could reproduce the issue without the bitfields.
>
>
> Sorry if I am not
On Tue, Aug 26, 2025 at 04:59:54PM +0300, Konstantin Knizhnik wrote:
>
> > But we have observed the generated code being pretty grotty and it's caused
> > more than enough confusion - so let's just replace them with plain uint8's
> > and
> > cast in switches.
>
> +1
>
> May be I am wrong, but i
On 26/08/2025 3:37 AM, Andres Freund wrote:
Hi,
I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
stated they could reproduce the issue without the bitfields.
Sorry if I am not correct, but it seems that the problem was never
reproduced with replaced bitfields.
I
Hi,
On 2025-08-26 15:21:34 +1200, Thomas Munro wrote:
> On Tue, Aug 26, 2025 at 12:45 PM Andres Freund wrote:
> > On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> > > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik
> > > wrote:
> > > > In theory even replacing bitfield with in should not
On Tue, Aug 26, 2025 at 12:37 PM Andres Freund wrote:
> I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
> stated they could reproduce the issue without the bitfields.
Konstantin's messages all seem to say it *did* fix it?
But I do apologise for working through the sa
On Tue, Aug 26, 2025 at 12:45 PM Andres Freund wrote:
> On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik
> > wrote:
> > > In theory even replacing bitfield with in should not
> > > > avoid race condition, because they still share the sa
Hi,
On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik
> wrote:
> > In theory even replacing bitfield with in should not
> > avoid race condition, because they still share the same cache line.
>
> I'm no expert in this stuff, but that's
Hi,
I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
stated they could reproduce the issue without the bitfields.
But we have observed the generated code being pretty grotty and it's caused
more than enough confusion - so let's just replace them with plain uint8's and
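A minimal sketch of the "plain uint8 plus cast in switches" idea, for illustration only. `DemoState`, `DemoHandle`, `demo_set_state` and `demo_state_name` are hypothetical names invented here, not PostgreSQL's actual definitions; the point is just that the field is stored as one byte and cast back to the enum type where a switch wants it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical enum standing in for a handle-state type. */
typedef enum DemoState
{
    DEMO_IDLE = 0,
    DEMO_SUBMITTED,
    DEMO_COMPLETED
} DemoState;

/* Hypothetical handle: each field is a plain byte, not a bitfield. */
typedef struct DemoHandle
{
    uint8_t     state;      /* holds a DemoState value */
    uint8_t     op;
} DemoHandle;

/* Store the enum value into the one-byte field with an explicit cast. */
static void
demo_set_state(DemoHandle *h, DemoState s)
{
    assert((int) s >= 0 && (int) s <= UINT8_MAX);
    h->state = (uint8_t) s;
}

/* Cast back to the enum type in the switch. */
static const char *
demo_state_name(const DemoHandle *h)
{
    switch ((DemoState) h->state)
    {
        case DEMO_IDLE:
            return "idle";
        case DEMO_SUBMITTED:
            return "submitted";
        case DEMO_COMPLETED:
            return "completed";
    }
    return "unknown";
}
```

The cost is the one mentioned elsewhere in the thread: the casts are explicit, so they have to be written (and remembered) at every use.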
On Mon, Aug 25, 2025 at 2:41 PM Thomas Munro wrote:
> On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro wrote:
> > > struct { PgAioHandleState v:8; } state;
> >
> > This preserves type safety and compiles to strb, two properties we
> > want, but it seems to waste space (look at the offsets for
On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro wrote:
> > struct { PgAioHandleState v:8; } state;
>
> This preserves type safety and compiles to strb, two properties we
> want, but it seems to waste space (look at the offsets for the
> stores):
>
> a.out[0x105f8] <+140>: ldr x8, [sp, #
On Mon, Aug 25, 2025 at 11:42 AM Nico Williams wrote:
> I think the issue is that if the compiler decides to coalesce what we
> think of as distinct (but neighboring) bitfields, then when you update
> one of the bitfields you could be updating the other with stale data
> from an earlier read where
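The lost-update scenario described above can be sketched deterministically in C. This is a single-threaded simulation under stated assumptions, not the code from the thread: the two 4-bit fields, the masks and all function names are hypothetical, and the "two writers" are simply interleaved by hand to show what a word-wide read-modify-write does with a stale copy.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: two 4-bit fields packed into one shared byte. */
#define FIELD_A_MASK 0x0Fu
#define FIELD_B_MASK 0xF0u

/* Rewrite field A on top of a previously loaded copy of the byte. */
static uint8_t
store_field_a(uint8_t loaded, uint8_t new_a)
{
    assert(new_a <= 0x0F);
    return (uint8_t) ((loaded & FIELD_B_MASK) | new_a);
}

/* Rewrite field B on top of a previously loaded copy of the byte. */
static uint8_t
store_field_b(uint8_t loaded, uint8_t new_b)
{
    assert(new_b <= 0x0F);
    return (uint8_t) ((loaded & FIELD_A_MASK) | (uint8_t) (new_b << 4));
}

/*
 * Writer 1 loads the byte, writer 2 updates field B, then writer 1
 * stores field A using its stale copy.  Returns the final value of B.
 */
static uint8_t
lost_update_demo(void)
{
    uint8_t     shared = 0x00;
    uint8_t     stale = shared;             /* writer 1: load */

    shared = store_field_b(shared, 0x5);    /* writer 2: B = 5 */
    shared = store_field_a(stale, 0x3);     /* writer 1: A = 3, from stale copy */

    return (uint8_t) (shared >> 4);         /* B is back to 0 */
}
```

With separate `uint8` members instead, each member is its own memory location in C11 terms, so a store to one does not rewrite the other.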
On Mon, Aug 25, 2025 at 10:43:21AM +1200, Thomas Munro wrote:
> On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik
> wrote:
> > In theory even replacing bitfield with in should not
> > avoid race condition, because they still share the same cache line.
>
> I'm no expert in this stuff, but
On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik wrote:
> In theory even replacing bitfield with in should not
> avoid race condition, because they still share the same cache line.
I'm no expert in this stuff, but that's not my understanding of how it
works. Plain stores to normal memory
On 24/08/2025 3:38 PM, Thomas Munro wrote:
That's also how open source clang 17 compiles it if you rip out the
bitfield. I guess if you do that, you won't be able to reproduce
this, Alexander? Something like:
I think that we have made this experiment at the very beginning and as
far as I re
On Sun, Aug 24, 2025 at 5:32 AM Konstantin Knizhnik wrote:
> On 20/08/2025 9:00 PM, Alexander Lakhin wrote:
> > for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i:
> > $np"; time parallel -j40 --linebuffer --tag /tmp/repro-AIO-Assert.sh
> > {} ::: `seq $np` || break; sleep $(($RAN
On 20/08/2025 9:00 PM, Alexander Lakhin wrote:
for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i:
$np"; time parallel -j40 --linebuffer --tag /tmp/repro-AIO-Assert.sh
{} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"
Unfortunately I was not able to r
Hello Andres,
22.07.2025 02:19, Andres Freund wrote:
Hi,
On 2025-06-19 10:16:12 -0500, Nico Williams wrote:
On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
I also dug out an archeologically old MacBook Pro running macOS High Sierra
10.13.6 with an i5 using Apple LLVM versio
On Mon, Jul 21, 2025 at 07:19:54PM -0400, Andres Freund wrote:
> RMT, note that there were two issues in this thread, the original report by
> Tom has been addressed (in e9a3615a522). I guess the best thing would be to
> split the open items entry into two?
I went ahead and marked the open item a
Hi,
On 2025-06-19 10:16:12 -0500, Nico Williams wrote:
> On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
> > I also dug out an archeologically old MacBook Pro running macOS High Sierra
> > 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4),
> > and it
> > t
On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
> I also dug out an archeologically old MacBook Pro running macOS High Sierra
> 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and
> it
> too fails to reproduce any issue.
It's not going to be reproducibl
> On 19 Jun 2025, at 16:36, Andres Freund wrote:
> So for some reason this apparently can only be reproduced on older macos - we
> know it's not the older compiler, because I couldn't reproduce it on the same
> compiler version as Alexander, on an M1 that was running Sequoia. That's really
> reall
Hi,
On 2025-06-19 17:02:18 +0300, Konstantin Knizhnik wrote:
> On 18/06/2025 7:08 pm, Andres Freund wrote:
> > Hi,
> >
> > On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
> > > On 17/06/2025 6:08 pm, Andres Freund wrote:
> > > > I don't think it can - this must be an independent bug from
> On 19 Jun 2025, at 16:02, Konstantin Knizhnik wrote:
> By the way - I have still not been able to reproduce the assertion failure
> on the most recent MacPro (Apple M4 Pro) with Sequoia 15.5.
I tried to reproduce this on an older quad core i7 MacBook Pro running Sonoma
14.7.5 using Apple clang version 15.0.
On 18/06/2025 7:08 pm, Andres Freund wrote:
Hi,
On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
On 17/06/2025 6:08 pm, Andres Freund wrote:
I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.
I see... It's a pity.
Indeed.
Konstant
On Thu, Jun 19, 2025 at 4:08 AM Andres Freund wrote:
> Konstantin, Alexander, can you share what commit you're testing and what
> precise changes have been applied to the source? I've now tested this on a
> significant number of apple machines for many many days without being able to
> reproduce
Hi,
On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
> On 17/06/2025 6:08 pm, Andres Freund wrote:
> >
> > I don't think it can - this must be an independent bug from the one that Tom
> > and I were encountering.
> I see... It's a pity.
Indeed.
Konstantin, Alexander, can you share what
On 17/06/2025 6:08 pm, Andres Freund wrote:
I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.
I see... It's a pity.
By the way, I have a question concerning the use of interrupts in AIO.
The comments say:
pgaio_io_release(PgAioHandle *ioh)
On 2025-06-17 18:08:30 +0300, Konstantin Knizhnik wrote:
>
> On 17/06/2025 4:47 pm, Andres Freund wrote:
> > > I and Alexandr are using completely different devices with different
> > > hardware, OS and clang version.
> > Both of you are running Ventura, right?
> >
> No, Alexandr is using darwin2
On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote:
>
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writin
On 17/06/2025 4:47 pm, Andres Freund wrote:
I and Alexandr are using completely different devices with different
hardware, OS and clang version.
Both of you are running Ventura, right?
No, Alexandr is using darwin23.5
Alexandr also noticed that he can reproduce the problem only with
--with-l
On 12/06/2025 4:57 pm, Andres Freund wrote:
The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend". It's not concurrent access in the
sense of two processes writing to the same value, it's that when switching
from the worker updating
Andres Freund writes:
> Both of you are running Ventura, right?
FTR, the machines I'm trying this on are all running current Sequoia:
[tgl@minim4 ~]$ uname -a
Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22
19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T604
Hi,
On 2025-06-17 16:43:05 +0300, Konstantin Knizhnik wrote:
> On 17/06/2025 4:35 pm, Andres Freund wrote:
> > Konstantin, Alexander - are you using the same device to reproduce this or
> > different ones? I wonder if this somehow depends on some MDM / corporate
> > enforcement tooling running or
On 17/06/2025 4:35 pm, Andres Freund wrote:
Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.
What does:
- profiles status -type enrollment
- kextstat -l
show?
Hi,
On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac M
On 17/06/2025 3:22 am, Tom Lane wrote:
Konstantin Knizhnik writes:
On 16/06/2025 6:11 pm, Andres Freund wrote:
I unfortunately can't repro this issue so far.
But unfortunately it means that the problem is not fixed.
FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
using MacPorts'
Konstantin Knizhnik writes:
> On 16/06/2025 6:11 pm, Andres Freund wrote:
>> I unfortunately can't repro this issue so far.
> But unfortunately it means that the problem is not fixed.
FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
using MacPorts' current compiler release (clang vers
On 16/06/2025 6:11 pm, Andres Freund wrote:
Hi,
On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:
One more update: with the proposed patch (memory barrier before
`ConditionVariableBroadcast` in `pgaio_io_process_completion`
I don't see how that barrier could be required for correctness
Hi,
On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:
> One more update: with the proposed patch (memory barrier before
> `ConditionVariableBroadcast` in `pgaio_io_process_completion`
I don't see how that barrier could be required for correctness -
ConditionVariableBroadcast() is a barrier
One more update: with the proposed patch (memory barrier before
`ConditionVariableBroadcast` in `pgaio_io_process_completion` and
replacing bit fields with `uint8`) the problem is not reproduced on my
system during 5 seconds.
With these two additional changes:
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 6c6c0a908e2..6dd2816bea9 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -538,6 +538,9 @@ pgaio_io_process_completion(PgAioHandle *ioh, int
result)
On 13/06/2025 11:20 pm, Andres Freund wrote:
Attached is a patch that fixes the problem for me. Alexander, Konstantin,
could you verify that it also fixes the problem for you?
Given that it does address the problem for me, I'm inclined to push this
fairly soon, the barrier is pretty obviously r
On 13/06/2025 11:20 pm, Andres Freund wrote:
Hi,
On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
On 12/06/2025 4:57 pm, Andres Freund wrote:
The problem appears to be in that swit
Hi,
On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > > The problem appears to be in that switch between "when submitted, by
>
Hi,
On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > The problem appears to be in that switch between "when submitted, by the
> > > IO
> > > worker" and "then again by the backend".
Hi,
On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writ
On 12/06/2025 4:57 pm, Andres Freund wrote:
The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend". It's not concurrent access in the
sense of two processes writing to the same value, it's that when switching
from the worker updating
Hi,
On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:13 pm, Andres Freund wrote:
> > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> > I'm reasonably certain I found the issue, I think it's a missing memory
> > barrier on the read side. The CPU is reordering th
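For readers unfamiliar with read-side barriers, the general pattern looks roughly like this. It is only a sketch using C11 `<stdatomic.h>` acquire/release operations rather than PostgreSQL's own barrier primitives, and `DemoSlot`, `demo_complete` and `demo_try_read` are hypothetical names, not anything from the AIO code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot: a payload plus a flag that publishes it. */
typedef struct DemoSlot
{
    int32_t     result;     /* written before the flag is set */
    atomic_bool done;       /* publication flag */
} DemoSlot;

/* Writer side: fill in the payload, then publish with release ordering. */
static void
demo_complete(DemoSlot *slot, int32_t result)
{
    slot->result = result;
    atomic_store_explicit(&slot->done, true, memory_order_release);
}

/*
 * Reader side: the acquire load keeps the payload read below from being
 * ordered before the flag check.  Without it, the CPU may read a stale
 * payload even though the flag was seen as set.
 */
static bool
demo_try_read(DemoSlot *slot, int32_t *result)
{
    if (!atomic_load_explicit(&slot->done, memory_order_acquire))
        return false;
    *result = slot->result;
    return true;
}
```

The write-side release alone is not enough; the ordering only holds if the reader pairs it with an acquire (or an explicit read barrier) between checking the flag and reading the payload.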
On 12/06/2025 4:13 pm, Andres Freund wrote:
Hi,
On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
Reproduced it once again with write-protected io handle.
But once again - no access violation, just assert failure.
Previously "op" field was overwritten somewhere between `pgaio_io_
Hi,
On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> Reproduced it once again with write-protected io handle.
> But once again - no access violation, just assert failure.
>
> Previously "op" field was overwritten somewhere between `pgaio_io_reclaim`
> and `AsyncReadBuffers`:
>
> !
Reproduced it once again with write-protected io handle.
But once again - no access violation, just assert failure.
Previously "op" field was overwritten somewhere between
`pgaio_io_reclaim` and `AsyncReadBuffers`:
!!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0,
ioh->generatio
I tried to catch the moment when the memory is changed, using mprotect.
I have aligned PgAioHandle on a page boundary (16 kB on macOS) and disabled
writes in `pgaio_io_reclaim`:
```
static void
pgaio_io_reclaim(PgAioHandle *ioh)
{
RESUME_INTERRUPTS();
rc = mprotect(ioh, sizeof(*ioh), PROT_READ);
Hi,
On 2025-06-10 21:09:18 +0300, Konstantin Knizhnik wrote:
>
> On 10/06/2025 8:41 pm, Andres Freund wrote:
> > I was able to reproduce it with gcc, too.
> > I've reproduced it without that bitfield, unfortunately :(.
> But also only on macOS?
Correct.
> I wonder if it is possible to set har
On 10/06/2025 8:41 pm, Andres Freund wrote:
I was able to reproduce it with gcc, too.
I've reproduced it without that bitfield, unfortunately :(.
But also only on macOS?
I wonder if it is possible to set a hardware watchpoint from the program
itself (not using gdb)? I.e. using ptrace?
Looks lik
Hi,
On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote:
> On 09/06/2025 2:05 am, Thomas Munro wrote:
> > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote:
> > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> > > > There is a really essential difference in the code generated by clang
On 09/06/2025 2:05 am, Thomas Munro wrote:
On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote:
On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
There is a really essential difference in the code generated by clang 15 (working)
and 16 (not working).
There also are code gen differences betw
On Sat, Jun 7, 2025 at 6:47 AM Andres Freund wrote:
> On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> > There is a really essential difference in the code generated by clang 15 (working)
> > and 16 (not working).
>
> There also are code gen differences between upstream clang 17 and apple's
>
Andres Freund writes:
> The symptoms I can reproduce are slightly different than Alexander's - it's
> the assertion failure reported upthread by Tom.
>
> FWIW, I can continue to repro the assertion after removing the use of the
> bitfield in PgAioHandle. So the problem indeed seems to be indepe
Hi,
On 2025-06-06 15:37:45 -0400, Andres Freund wrote:
> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, an
On 06/06/2025 10:21 pm, Tom Lane wrote:
Konstantin Knizhnik writes:
There is a really essential difference in the code generated by clang 15
(working) and 16 (not working).
It's a mistake to think that this is a compiler bug. The C standard
explicitly allows compilers to use word-wide operations
On 06/06/2025 9:47 pm, Andres Freund wrote:
Hi,
On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
There is a really essential difference in the code generated by clang 15 (working)
and 16 (not working).
There also are code gen differences between upstream clang 17 and apple's
clang, which is
On Fri, Jun 06, 2025 at 03:37:45PM -0400, Andres Freund wrote:
> On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
> > So it's our code that is busted. No doubt, what is happening is
> > that process A is fetching two fields, modifying one of them,
> > and storing the word back (with the observed valu
Hello Andres and Tom,
06.06.2025 22:37, Andres Freund wrote:
On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
It's a mistake to think that this is a compiler bug. The C standard
explicitly allows compilers to use word-wide operations to access
bit-field struct members. Such accesses may fetch or
Hi,
On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
> Konstantin Knizhnik writes:
> > There is a really essential difference in the code generated by clang 15
> > (working) and 16 (not working).
>
> It's a mistake to think that this is a compiler bug. The C standard
> explicitly allows compilers to use
Konstantin Knizhnik writes:
> There is a really essential difference in the code generated by clang 15
> (working) and 16 (not working).
It's a mistake to think that this is a compiler bug. The C standard
explicitly allows compilers to use word-wide operations to access
bit-field struct members. Suc
Hi,
On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> There is a really essential difference in the code generated by clang 15 (working)
> and 16 (not working).
There also are code gen differences between upstream clang 17 and apple's
clang, which is based on llvm 17 as well (I've updated the
There is a really essential difference in the code generated by clang 15
(working) and 16 (not working).
```
pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
{
...
HOLD_INTERRUPTS();
ioh->op = op;
ioh->result = 0;
pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
...
}
```
c
On 06/06/2025 2:31 am, Tom Lane wrote:
Matthias van de Meent writes:
I have a very wild guess that's probably wrong in a weird way, but
here goes anyway:
Did anyone test if interleaving the enum-typed bitfield fields of
PgAioHandle with the uint8 fields might solve the issue?
Ugh. I think y
Hello,
05.06.2025 22:00, Alexander Lakhin wrote:
Thank you for your attention to this and for the tip! Today I tried the
following:
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -89,8 +89,8 @@ typedef enum PgAioOp
/* intentionally the zero value, to help catch zeroed
Matthias van de Meent writes:
> I have a very wild guess that's probably wrong in a weird way, but
> here goes anyway:
> Did anyone test if interleaving the enum-typed bitfield fields of
> PgAioHandle with the uint8 fields might solve the issue?
Ugh. I think you probably nailed it.
IMO all thos
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin wrote:
>
> Hello Thomas and Andres,
>
> 04.06.2025 23:32, Thomas Munro wrote:
> > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote:
> >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_
Hello Thomas and Andres,
04.06.2025 23:32, Thomas Munro wrote:
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote:
On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->resul
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund wrote:
> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> > 2, ioh->generation: 21694
>
Hi,
Thanks for working on investigating this.
On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> 02.06.2025 09:00, Alexander Lakhin wrote:
> > With additional logging (the patch is attached), I can see the following:
> > ...
> > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh
Hello,
02.06.2025 09:00, Alexander Lakhin wrote:
With additional logging (the patch is attached), I can see the following:
...
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6,
ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0
31.05.2025 06:00, Alexander Lakhin wrote:
Hello Thomas,
It looks like I managed to restore all the conditions needed to reproduce
that Assert more or less reliably (within a couple of hours), so I can
continue experiments.
I've added the following debugging:
...
With additional logging (the p
Hello Thomas,
25.05.2025 05:45, Thomas Munro wrote:
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line:
161, PID: 32355
Can you get a core and print *ioh in the debugger?
It looks like I managed to restore all the conditions needed to reproduce
that Assert more or less
Andres Freund writes:
> I'll see if being graphically logged in somehow indeed increased the repro
> rate, and if so I'll expand the debugging somewhat, or if this was just an
> absurd coincidence.
Hmm. Now that you mention it, the one repro on the M1 came just as
I was about to give up and manu
On Sun, May 25, 2025 at 8:25 PM Tom Lane wrote:
> The fact that I can trace through this Assert failure but not the
> AIO one strongly suggests some system-level problem in the latter.
> There is something rotten in the state of Denmark.
I have been quite frustrated with lldb on macOS for a while
Hi,
On 2025-05-27 14:43:14 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I just meant that it seems that I can't reproduce it for some as of yet
> > unknown reason. I've now been through 3k+ runs of 027_stream_regress,
> > without
> > a single failure, so there has to be *something* differe
Andres Freund writes:
> I just meant that it seems that I can't reproduce it for some as of yet
> unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
> a single failure, so there has to be *something* different about my
> environment than yours.
> Darwin m4-dev 24.1.0 Da
Hi,
On 2025-05-27 10:12:28 -0400, Tom Lane wrote:
> Andres Freund writes:
> > This is on a m4 mac mini. I'm wondering if there's some hardware specific
> > memory ordering issue or disk speed based timing issue that I'm just not
> > hitting.
>
> I dunno, I've seen it on three different physical
Hello hackers,
27.05.2025 16:35, Andres Freund wrote:
On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
Thomas Munro writes:
Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
p
Hello Tomas,
27.05.2025 16:26, Tomas Vondra wrote:
I'm interested in how you run these tests in parallel. Can you share the
patch/script?
Yeah, sure. I'm running the test as follows:
rm -rf src/test/recovery_*; for i in `seq 40`; do cp -r src/test/recovery/ src/test/recovery_$i/; sed -i .bak
Andres Freund writes:
> This is on a m4 mac mini. I'm wondering if there's some hardware specific
> memory ordering issue or disk speed based timing issue that I'm just not
> hitting.
I dunno, I've seen it on three different physical machines now
(one M1, two M4 Pros). But it is darn hard to re
Thomas Munro writes:
> Could you please share your configure options?
The failures on indri and sifaka were during ordinary buildfarm
runs, you can check the animals' details on the website.
(Note those are same host machine, the difference is that
indri uses some MacPorts packages while sifaka i
On 5/24/25 23:00, Alexander Lakhin wrote:
> ...
>
> I'm yet to see the Assert triggered on the buildfarm, but this one looks
> interesting too.
>
> (I can share the complete patch + script for such testing, if it can be
> helpful.)
>
I'm interested in how you run these tests in parallel. Can
Hi,
On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
> Thomas Munro writes:
> > Could you guys please share your exact repro steps?
>
> I've just been running 027_stream_regress.pl over and over.
> It's not a recommendable answer though because the failure
> probability is tiny, under 1%. It sound
On Mon, May 26, 2025 at 12:05 PM Tom Lane wrote:
> Thomas Munro writes:
> > Could you guys please share your exact repro steps?
>
> I've just been running 027_stream_regress.pl over and over.
> It's not a recommendable answer though because the failure
> probability is tiny, under 1%. It sounded
Thomas Munro writes:
> On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote:
>> So far, I've failed to get anything useful out of core files
>> from this failure. The trace goes back no further than
>> (lldb) bt
>> * thread #1
>> * frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8
Thomas Munro writes:
> Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%. It sounded like Alexander
had a better way.
re
On Sun, May 25, 2025 at 3:22 PM Tom Lane wrote:
> Thomas Munro writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure. The trace goes back no further than
>
> (lldb) bt
> * thread #1
> * frame #0: 0x00
Thomas Munro writes:
> Can you get a core and print *ioh in the debugger?
So far, I've failed to get anything useful out of core files
from this failure. The trace goes back no further than
(lldb) bt
* thread #1
* frame #0: 0x00018de39388 libsystem_kernel.dylib`__pthread_kill + 8
That's
On Sun, May 25, 2025 at 9:00 AM Alexander Lakhin wrote:
> Hello Thomas,
> 24.05.2025 14:42, Thomas Munro wrote:
> > On Sat, May 24, 2025 at 3:17 PM Tom Lane wrote:
> >> So it seems that "very low-probability issue in our Mac AIO code" is
> >> the most probable description.
> > There isn't any mac