Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-27 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 5:58 PM Maxim Orlov  wrote:
>
>
>
> On Fri, 25 Nov 2022 at 09:40, Amit Kapila  wrote:
>>
>> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila  wrote:
>> >
>> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada  
>> > wrote:
>> > >
>> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  
>> > > wrote:
>> > >
>> > > Agreed not to have a test case for this.
>> > >
>> > > I've attached an updated patch. I've confirmed this patch works for
>> > > all supported branches.
>> > >
>> >
>> > I have slightly changed the checks used in the patch, otherwise looks
>> > good to me. I am planning to push (v11-v15) the attached tomorrow
>> > unless there are more comments.
>> >
>>
>> Pushed.
>
> A big thanks to you! Could you also, close the commitfest entry 
> https://commitfest.postgresql.org/41/4024/, please?

Closed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-25 Thread Maxim Orlov
On Fri, 25 Nov 2022 at 09:40, Amit Kapila  wrote:

> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila 
> wrote:
> >
> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada 
> wrote:
> > >
> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila 
> wrote:
> > >
> > > Agreed not to have a test case for this.
> > >
> > > I've attached an updated patch. I've confirmed this patch works for
> > > all supported branches.
> > >
> >
> > I have slightly changed the checks used in the patch, otherwise looks
> > good to me. I am planning to push (v11-v15) the attached tomorrow
> > unless there are more comments.
> >
>
> Pushed.
>
A big thanks to you! Could you also, close the commitfest entry
https://commitfest.postgresql.org/41/4024/, please?

-- 
Best regards,
Maxim Orlov.


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Amit Kapila
On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila  wrote:
>
> On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada  wrote:
> >
> > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  
> > wrote:
> >
> > Agreed not to have a test case for this.
> >
> > I've attached an updated patch. I've confirmed this patch works for
> > all supported branches.
> >
>
> I have slightly changed the checks used in the patch, otherwise looks
> good to me. I am planning to push (v11-v15) the attached tomorrow
> unless there are more comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Amit Kapila
On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada  wrote:
>
> On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  wrote:
>
> Agreed not to have a test case for this.
>
> I've attached an updated patch. I've confirmed this patch works for
> all supported branches.
>

I have slightly changed the checks used in the patch, otherwise looks
good to me. I am planning to push (v11-v15) the attached tomorrow
unless there are more comments.

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-uninitialized-access-to-InitialRunningXacts-d.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Maxim Orlov
>
> Agreed not to have a test case for this.
>
> I've attached an updated patch. I've confirmed this patch works for
> all supported branches.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>
It works for me as well. Thanks!

I've created a commitfest entry for this patch, see
https://commitfest.postgresql.org/41/4024/
Hope, it will be committed soon.

-- 
Best regards,
Maxim Orlov.


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Masahiko Sawada
On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  wrote:
>
> On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov  wrote:
> >>
> >>
> >> Regarding the tests, the patch includes a new scenario to
> >> reproduce this issue. However, since the issue can be reproduced also
> >> by the existing scenario (with low probability, though), I'm not sure
> >> it's worth adding the new scenario.
> >
> > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it 
> > does not worth it. But, I do not have a strong opinion here.
> > Let's add tests in a separate commit and let the actual committer to decide 
> > what to do, should we?
> >
>
> +1 to not have a test for this as the scenario can already be tested
> by the existing set of tests.

Agreed not to have a test case for this.

I've attached an updated patch. I've confirmed this patch works for
all supported branches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-Reset-InitialRunningXacts-array-when-freeing-SnapBui.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Amit Kapila
On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov  wrote:
>>
>>
>> Regarding the tests, the patch includes a new scenario to
>> reproduce this issue. However, since the issue can be reproduced also
>> by the existing scenario (with low probability, though), I'm not sure
>> it's worth adding the new scenario.
>
> AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it does 
> not worth it. But, I do not have a strong opinion here.
> Let's add tests in a separate commit and let the actual committer to decide 
> what to do, should we?
>

+1 to not have a test for this as the scenario can already be tested
by the existing set of tests.

-- 
With Regards,
Amit Kapila.




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Maxim Orlov
> I've attached a draft patch. To fix it, I think we can reset
> InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
> and add an assertion in AllocateSnapshotBuilder() to make sure both
> are reset.
>
Thanks for the patch. It works fine. I've tested this patch for 15 and 11
versions on x86_64 and ARM
and see no fails. But the function pg_current_xact_id added by 4c04be9b05ad
doesn't exist in PG11.


> Regarding the tests, the patch includes a new scenario to
> reproduce this issue. However, since the issue can be reproduced also
> by the existing scenario (with low probability, though), I'm not sure
> it's worth adding the new scenario.
>
AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it
does not worth it. But, I do not have a strong opinion here.
Let's add tests in a separate commit and let the actual committer to decide
what to do, should we?

-- 
Best regards,
Maxim Orlov.


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Masahiko Sawada
Hi,

On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila  wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov  wrote:
> >
> > PROBLEM
> >
> > After some investigation, I think, the problem is in the snapbuild.c 
> > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> > array in the context of builder->context, but for the time when we call 
> > SnapBuildPurgeOlderTxn this context may be already free'd.
> >
>
> I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
> finish and restart decoding in the same session. After finishing the
> first decoding, it frees the decoding context but we forgot to reset
> NInitialRunningXacts and InitialRunningXacts array. So, next time when
> we start decoding in the same session where we don't restore any
> serialized snapshot, it can lead to the problem you are seeing because
> NInitialRunningXacts (and InitialRunningXacts array) won't have sane
> values.
>
> This can happen in the catalog_change_snapshot test as we have
> multiple permutations and those use the same session across a restart
> of decoding.

I have the same analysis. In order to restart the decoding from the
LSN where we don't restore any serialized snapshot, we somehow need to
advance the slot's restart_lsn. In this case, I think it happened
since the same session drops at the end of the first scenario and
creates the replication slot with the same name at the beginning of
the second scenario in catalog_change_snapshot.spec.

>
> >
> > Simple fix like:
> > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr 
> > lsn, xl_running_xacts *runn
> >  * changes. See SnapBuildXidSetCatalogChanges.
> >  */
> > NInitialRunningXacts = nxacts;
> > -   InitialRunningXacts = MemoryContextAlloc(builder->context, 
> > sz);
> > +   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, 
> > sz);
> > memcpy(InitialRunningXacts, running->xids, sz);
> > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), 
> > xidComparator);
> >
> > seems to solve the described problem, but I'm not in the context of [0] and 
> > why array is allocated in builder->context.
> >
>
> It will leak the memory for InitialRunningXacts. We need to reset
> NInitialRunningXacts (and InitialRunningXacts) as mentioned above.
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to know his views as he was the primary author of this work.

Thanks!

I've attached a draft patch. To fix it, I think we can reset
InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
and add an assertion in AllocateSnapshotBuilder() to make sure both
are reset. Regarding the tests, the patch includes a new scenario to
reproduce this issue. However, since the issue can be reproduced also
by the existing scenario (with low probability, though), I'm not sure
it's worth adding the new scenario.

I've not checked if the patch works for version 14 or older yet.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


reset_initial_running_xacts.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Maxim Orlov
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to know his views as he was the primary author of this work.
>
> --
> With Regards,
> Amit Kapila.
>

OK, thanks a lot. I hope, we'll fix this soon.

-- 
Best regards,
Maxim Orlov.


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Amit Kapila
On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov  wrote:
>
> PROBLEM
>
> After some investigation, I think, the problem is in the snapbuild.c (commit 
> 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> array in the context of builder->context, but for the time when we call 
> SnapBuildPurgeOlderTxn this context may be already free'd.
>

I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
finish and restart decoding in the same session. After finishing the
first decoding, it frees the decoding context but we forgot to reset
NInitialRunningXacts and InitialRunningXacts array. So, next time when
we start decoding in the same session where we don't restore any
serialized snapshot, it can lead to the problem you are seeing because
NInitialRunningXacts (and InitialRunningXacts array) won't have sane
values.

This can happen in the catalog_change_snapshot test as we have
multiple permutations and those use the same session across a restart
of decoding.

>
> Simple fix like:
> @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr 
> lsn, xl_running_xacts *runn
>  * changes. See SnapBuildXidSetCatalogChanges.
>  */
> NInitialRunningXacts = nxacts;
> -   InitialRunningXacts = MemoryContextAlloc(builder->context, 
> sz);
> +   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, 
> sz);
> memcpy(InitialRunningXacts, running->xids, sz);
> qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), 
> xidComparator);
>
> seems to solve the described problem, but I'm not in the context of [0] and 
> why array is allocated in builder->context.
>

It will leak the memory for InitialRunningXacts. We need to reset
NInitialRunningXacts (and InitialRunningXacts) as mentioned above.

Thank you for the report and initial analysis. I have added Sawada-San
to know his views as he was the primary author of this work.

-- 
With Regards,
Amit Kapila.




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-21 Thread Amit Kapila
On Tue, Nov 22, 2022 at 2:22 AM Andres Freund  wrote:
>
> On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote:
> > After some investigation, I think, the problem is in the snapbuild.c
> > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> > array in the context of builder->context, but for the time when we call
> > SnapBuildPurgeOlderTxn this context may be already free'd. Thus,
> > InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F)
> > values. This is not caused buildfarm to fail due to that code:
>
> Amit, that does indeed seem to be a problem...
>

I'll look into it.

-- 
With Regards,
Amit Kapila.




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-21 Thread Andres Freund
Hi,

On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote:
> After some investigation, I think, the problem is in the snapbuild.c
> (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> array in the context of builder->context, but for the time when we call
> SnapBuildPurgeOlderTxn this context may be already free'd. Thus,
> InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F)
> values. This is not caused buildfarm to fail due to that code:

Amit, that does indeed seem to be a problem...

Greetings,

Andres Freund




[BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-21 Thread Maxim Orlov
Hi!


PREAMBLE

For a last couple of months, I stumbled into a problem while running tests
on ARM (Debain, aarch64) and some more wired platforms
for my 64–bit XIDs patch set. Test contrib/test_decoding
(catalog_change_snapshot) rarely failed with the next message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) &&
TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I have plenty of failing on ARM, couple on x86 and none (if memory serves)
on x86–64.

At first, my thought was to blame my 64–bit XID patch for what, but this is
not the case. This error persist from PG15 to PG10
without any patch applied. Though hard to reproduce.


PROBLEM

After some investigation, I think, the problem is in the snapbuild.c
(commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
array in the context of builder->context, but for the time when we call
SnapBuildPurgeOlderTxn this context may be already free'd. Thus,
InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F)
values. This is not caused buildfarm to fail due to that code:

if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
 builder->xmin))
return;

Since the cluster is initialised with XID way less than 0x7F7F7F7F, we get
to return here, but the problem is still existing.
I've attached the patch based on branch REL_15_STABLE to reproduce the
problem on x86-64.

On my patch set of 64–bit XID's this problem manifested since I do init
cluster with XID far beyond 32–bit bound.

Alternatively, I did try to use my patch [1] to init cluster with first
transaction 2139062143 (i.e. 0x7F7F7F7F).
Then put pg_sleep call just like in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
if (NInitialRunningXacts == 0)
return;

+   pg_usleep(100L * 2L);
+
/* bound check if there is at least one transaction to remove */
if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],

 builder->xmin))

Run installcheck-force for many times for a test_decoding/
catalog_change_snapshot's and got a segfault.


CONCLUSION

In snapbuild.c, context allocated array InitialRunningXacts may be free'd,
this caused assertion failed (at best) or
may lead to the more serious problems.


P.S.

Simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr
lsn, xl_running_xacts *runn
 * changes. See SnapBuildXidSetCatalogChanges.
 */
NInitialRunningXacts = nxacts;
-   InitialRunningXacts = MemoryContextAlloc(builder->context,
sz);
+   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext,
sz);
memcpy(InitialRunningXacts, running->xids, sz);
qsort(InitialRunningXacts, nxacts, sizeof(TransactionId),
xidComparator);

seems to solve the described problem, but I'm not in the context of [0] and
why array is allocated in builder->context.


[0] https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
[1]
https://www.postgresql.org/message-id/flat/CACG=ezaa4vqYjJ16yoxgrpa-=gxnf0vv3ey9bjgrrrfn2yy...@mail.gmail.com

-- 
Best regards,
Maxim Orlov.
From d09a031f1f807cdfe1e02000b2bf4fd3eaaedd8f Mon Sep 17 00:00:00 2001
From: Maxim Orlov 
Date: Mon, 21 Nov 2022 14:50:02 +0300
Subject: [PATCH] catalog_change_snapshot-fail

---
 contrib/test_decoding/Makefile  | 1007 ++-
 src/backend/replication/logical/snapbuild.c |   12 +
 2 files changed, 1012 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..aaf7a63411 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -3,12 +3,1006 @@
 MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
-REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-   decoding_into_rel binary prepared replorigin time messages \
-   spill slot truncate stream stats twophase twophase_stream
-ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-   oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-   twophase_snapshot slot_creation_error catalog_change_snapshot
+ISOLATION = catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+   catalog_change_snapshot \
+