Hi Kirill, thanks for looking into this!
> On 20 Aug 2025, at 12:19, Kirill Reshke wrote:
>
> + /*
> + * We might have filled this offset previously.
> + * Cross-check for correctness.
> + */
> + Assert((*offptr == 0) || (*offptr == offset));
>
> Should we exit here with errcode(ERRCODE_DATA_CO
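A minimal sketch of what such a check could look like if the assertion were
promoted to a hard error (assuming the offptr and offset variables from the
hunk above; this is an illustration, not the actual patch):

    if (*offptr != 0 && *offptr != offset)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("unexpected multixact offset %u, expected 0 or %u",
                        *offptr, offset)));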
On Thu, 31 Jul 2025 at 11:29, Andrey Borodin wrote:
>
>
>
> > On 29 Jul 2025, at 23:15, Andrey Borodin wrote:
> >
> > I do not understand it yet.
>
> OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical
> file and could race with real extension by ExtendMultiXactOffset()
On 31.07.2025 09:29, Andrey Borodin wrote:
Here are two updated patches, one for Postgres 17 and one for
master (with a test).
I ran tests on PG17 with patch v9.
I tried to reproduce it for three cases: the first when we explicitly
use FOR KEY SHARE, the second through subtransactions,
and the t
> On 29 Jul 2025, at 23:15, Andrey Borodin wrote:
>
> I do not understand it yet.
OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical
file and could race with real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not
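A rough sketch of that workaround (assuming "actual" stands for the multixact
whose offset is being recorded):

    /*
     * Extend the offsets SLRU through the next multixact as well, so the
     * page a reader will look for always physically exists and there is no
     * race between SimpleLruDoesPhysicalPageExist() and a concurrent
     * ExtendMultiXactOffset().
     */
    ExtendMultiXactOffset(actual + 1);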
> On 29 Jul 2025, at 12:17, Dmitry wrote:
>
> But on the master, some of the requests then fail with an error, apparently
> invalid multixacts remain in the pages.
Thanks!
That's a bug in my patch. I do not understand it yet. I've reproduced it with
your original workload.
Most of errors
17.07.2025 21:34, Andrey Borodin wrote:
>> On 30 Jun 2025, at 15:58, Andrey Borodin wrote:
>> page_collect_tuples() holds a lock on the buffer while examining tuples
>> visibility, having InterruptHoldoffCount > 0. Tuple visibility check might
>> need WAL to go on, we have to wait until some nex
I'll duplicate the message, the previous one turned out to have poor
formatting, sorry.
On 28.07.2025 15:49, Andrey Borodin wrote:
I also attach a version for PG17, maybe Dmitry could try to reproduce
the problem with this patch.
Andrey, thank you very much for your work, and also thanks to Álv
On 28.07.2025 15:49, Andrey Borodin wrote:
I also attach a version for PG17, maybe Dmitry could try to reproduce the
problem with this patch.
Andrey, thank you very much for your work, and also thanks to Álvaro for
joining the discussion on the problem. I ran tests on PG17 with patch
v8, the
> On 27 Jul 2025, at 16:53, Andrey Borodin wrote:
>
> we have to do this "next offset" dance on Primary too.
PFA draft of this.
I also attach a version for PG17, maybe Dmitry could try to reproduce the
problem with this patch. I think the problem should be fixed by the patch.
Thanks!
Best
> On 26 Jul 2025, at 22:44, Álvaro Herrera wrote:
>
> On 2025-Jul-25, Andrey Borodin wrote:
>
>> Also I've discovered one more serious problem.
>> If a backend crashes just before WAL-logging multi, any heap tuple
>> that uses this multi will become unreadable. Any attempt to read it
>> will
On 2025-Jul-25, Andrey Borodin wrote:
> Also I've discovered one more serious problem.
> If a backend crashes just before WAL-logging multi, any heap tuple
> that uses this multi will become unreadable. Any attempt to read it
> will hang forever.
>
> I've reproduced the problem and now I'm workin
> On 21 Jul 2025, at 19:58, Andrey Borodin wrote:
>
> I'm planning to prepare tests and fixes for all supported branches
This is a status update message. I've reproduced the problem on REL_13_STABLE and
verified that the proposed fix works there.
Also I've discovered one more serious problem.
If a
> On 18 Jul 2025, at 18:53, Andrey Borodin wrote:
>
> Please find attached dirty test and a sketch of the fix. It is done against
> PG 16, I wanted to ensure that problem is reproducible before 17.
Here's v7 with improved comments and a cross-check for correctness.
Also, MultiXact wraparound is
> On 18 Jul 2025, at 16:53, Álvaro Herrera wrote:
>
> Hello,
>
> Andrey and I discussed this on IM, and after some back and forth, he
> came up with a brilliant idea: modify the WAL record for multixact
> creation, so that the offset of the next multixact is transmitted and
> can be replayed.
Hello,
Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of me
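A minimal sketch of the idea, using the fields already carried by
xl_multixact_create (mid, moff, nmembers); the helper mirrors what
multixact_redo() does for the next multixact, and the exact shape in the
actual patch may differ:

    /*
     * During redo of a multixact-creation record, the offset at which the
     * *next* multixact will start is fully determined by this record alone.
     */
    MultiXactOffset nextOffset = xlrec->moff + xlrec->nmembers;

    /* advance the standby's notion of the next multixact and its offset */
    MultiXactAdvanceNextMXact(xlrec->mid + 1, nextOffset);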
On 2025-Jul-17, Andrey Borodin wrote:
> Thinking more about the problem I see 3 ways to deal with this deadlock:
> 1. We check for recovery conflict even in presence of
> InterruptHoldoffCount. That's what patch v4 does.
> 2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
> witho
> On 30 Jun 2025, at 15:58, Andrey Borodin wrote:
>
> page_collect_tuples() holds a lock on the buffer while examining tuples
> visibility, having InterruptHoldoffCount > 0. Tuple visibility check might
> need WAL to go on, we have to wait until some next MX be filled in.
> Which might need
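For context, a simplified sketch of the path being described (not the literal
heapam.c code): the buffer content lock is an LWLock, and LWLockAcquire()
holds off interrupts until the lock is released, so a wait inside the
visibility check cannot be cut short by recovery-conflict handling:

    LockBuffer(buffer, BUFFER_LOCK_SHARE); /* InterruptHoldoffCount > 0 from here */

    /* page_collect_tuples() examines each tuple under that lock ... */
    valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
    /*
     * ... and for a multixact-locked tuple this can reach
     * GetMultiXactIdMembers(), which may sleep until the next multixact is
     * filled in by WAL replay; if replay is itself blocked on a recovery
     * conflict with this backend, which cannot be interrupted here, neither
     * side makes progress.
     */

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK); /* interrupts possible again */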
> On 28 Jun 2025, at 21:24, Andrey Borodin wrote:
>
> This seems to be fixing issue for me.
ISTM I was wrong: there is a possible recovery conflict with snapshot.
REDO:
frame #2: 0x00010179a0c8 postgres`pg_usleep(microsec=100) at
pgsleep.c:50:10
frame #3: 0x00010144c108
post
> On 28 Jun 2025, at 00:37, Andrey Borodin wrote:
>
> Indeed.
After some experiments I could get unstable repro on my machine.
I've added some logging and that's what I've found:
2025-06-28 23:03:40.598 +05 [40887] 006_MultiXact_standby.pl WARNING: Timed
out: nextMXact 415832 tmpMXact 41582
> On 27 Jun 2025, at 11:41, Dmitry wrote:
>
> It seems that the hypothesis has not been confirmed.
Indeed.
For some reason your reproduction does not work for me.
I tried to create a test from your workload description. PFA patch with a very
dirty prototype.
To run the test, you can run:
cd con
On 26.06.2025 19:24, Andrey Borodin wrote:
If my hypothesis is correct nextMXact will precede tmpMXact.
It seems that the hypothesis has not been confirmed.
Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING: Timed out: nextMXact
24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [2
> On 26 Jun 2025, at 17:59, Andrey Borodin wrote:
>
> hypothesis
Dmitry, can you please retry your reproduction with attached patch?
It must print nextMXact and tmpMXact. If my hypothesis is correct nextMXact
will precede tmpMXact.
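The diagnostic is essentially of this shape (a sketch, assuming the nextMXact
and tmpMXact variables visible at that point in GetMultiXactIdMembers(); the
attached patch is authoritative):

    /* report both counters when we time out, so their ordering can be seen */
    elog(WARNING, "Timed out: nextMXact %u tmpMXact %u",
         nextMXact, tmpMXact);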
Best regards, Andrey Borodin.
v2-0001-Make-next-multixac
> On 26 Jun 2025, at 14:33, Dmitry wrote:
>
> On 25.06.2025 16:44, Dmitry wrote:
>> I will definitely try to reproduce the problem with your patch.
> Hi Andrey!
>
> I checked with the patch, unfortunately the problem is also reproducible.
> Client processes wake up after a second and try to g
On 25.06.2025 16:44, Dmitry wrote:
I will definitely try to reproduce the problem with your patch.
Hi Andrey!
I checked with the patch, unfortunately the problem is also reproducible.
Client processes wake up after a second and try to get information about the
members of the multixact again,
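For reference, the wait being described has roughly this shape inside
GetMultiXactIdMembers() (simplified sketch, SLRU locking elided):

    retry:
        /* ... look up this multixact's offset and the next one's ... */
        if (nextOffset == 0)
        {
            /*
             * The next multixact's offset is not recorded yet, so the length
             * of this one is unknown: sleep one second and try again.
             */
            CHECK_FOR_INTERRUPTS();
            pg_usleep(1000 * 1000L);
            goto retry;
        }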
On 25.06.2025 12:34, Andrey Borodin wrote:
On 25 Jun 2025, at 11:11, Dmitry wrote:
#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0,
from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
> On 25 Jun 2025, at 11:11, Dmitry wrote:
>
> #6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0,
> from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
> at
> /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
Hi Dmitry!
This looks to be rela
Hi, hackers
The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary server consists of multiple requests
generating multixact IDs, while the hot-standby replica handles read
requests.
After some time, all re