> On 18 Jul 2025, at 16:53, Álvaro Herrera <alvhe...@kurilemu.de> wrote:
>
> Hello,
>
> Andrey and I discussed this on IM, and after some back and forth, he
> came up with a brilliant idea: modify the WAL record for multixact
> creation, so that the offset of the next multixact is transmitted and
> can be replayed. (We know it when we create each multixact, because the
> number of members is known). So the replica can store the offset of the
> next multixact right away, even though it doesn't know the members for
> that multixact. On replay of the next multixact we can cross-check that
> the offset matches what we had written previously. This allows reading
> the first multixact, without having to wait for the replay of creation
> of the second multixact.
>
> One concern is: if we write the offset for the second mxact, but haven't
> written its members, what happens if another process looks up the
> members for that multixact? We'll have to make it wait (retry) somehow.
> Given what was described upthread, it's possible for the multixact
> beyond that one to be written already, so we won't have the zero offset
> that would make us wait.

We always redo multixact creation before the multixact is visible anywhere
on the heap. The problem was that to read a multi we might need the offset
of the next multi, and that next multi might not have been WAL-logged yet.
However, I think we never need to read a multi before it is redone.

>
> Anyway, he's going to try and implement this.
>
> Andrey, please let me know if I misunderstood the idea.

Please find attached a dirty test and a sketch of the fix. It is done
against PG 16; I wanted to ensure that the problem is reproducible before 17.

Best regards, Andrey Borodin.
v6-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch
Description: Binary data
v6-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch
Description: Binary data
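
For readers without the patches at hand, the idea discussed above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in the attached patches and not PostgreSQL's actual redo routine: the class `ReplicaMultiXactState` and its methods `redo_create` and `get_members` are hypothetical names, offsets are kept in plain dictionaries instead of SLRU pages, and wraparound is ignored. It shows the replica recording the next multixact's offset while replaying a creation record, cross-checking it when the next record arrives, and refusing to read a multixact whose own creation has not been replayed yet.

```python
class ReplicaMultiXactState:
    """Toy model of replica-side multixact offsets (hypothetical, no wraparound)."""

    def __init__(self):
        self.offsets = {}      # multixact id -> start offset in the members space
        self.members = {}      # member offset -> member xid
        self.replayed = set()  # multixact ids whose creation record was replayed

    def _set_offset(self, multi, offset):
        # Cross-check against an offset stored earlier, if there is one.
        known = self.offsets.get(multi)
        if known is not None and known != offset:
            raise ValueError(
                f"offset mismatch for multixact {multi}: {known} != {offset}")
        self.offsets[multi] = offset

    def redo_create(self, multi, offset, member_xids):
        """Replay one creation record: (multixact id, start offset, members)."""
        self._set_offset(multi, offset)
        for i, xid in enumerate(member_xids):
            self.members[offset + i] = xid
        # The next multixact must start right after our members, so its
        # offset can be stored now, before its own record is replayed.
        self._set_offset(multi + 1, offset + len(member_xids))
        self.replayed.add(multi)

    def get_members(self, multi):
        """Return the member xids of a multixact that has been replayed."""
        if multi not in self.replayed:
            # The offset may already be known, but the members are not.
            raise LookupError(f"multixact {multi} has not been replayed yet")
        start = self.offsets[multi]
        end = self.offsets[multi + 1]  # always present after redo_create
        return [self.members[o] for o in range(start, end)]
```

In this model, replaying multixact 10 at offset 100 with two members makes it readable immediately, because the offset of multixact 11 (102) is stored at the same time. A lookup of 11 itself still fails until its creation record is replayed, which corresponds to the point made in the reply that a multi never needs to be read before it is redone.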