Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Fri, Nov 25, 2022 at 5:58 PM Maxim Orlov wrote: > > > > On Fri, 25 Nov 2022 at 09:40, Amit Kapila wrote: >> >> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila wrote: >> > >> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada >> > wrote: >> > > >> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila >> > > wrote: >> > > >> > > Agreed not to have a test case for this. >> > > >> > > I've attached an updated patch. I've confirmed this patch works for >> > > all supported branches. >> > > >> > >> > I have slightly changed the checks used in the patch, otherwise looks >> > good to me. I am planning to push (v11-v15) the attached tomorrow >> > unless there are more comments. >> > >> >> Pushed. > > A big thanks to you! Could you also, close the commitfest entry > https://commitfest.postgresql.org/41/4024/, please? Closed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Fri, 25 Nov 2022 at 09:40, Amit Kapila wrote: > On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila > wrote: > > > > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada > wrote: > > > > > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila > wrote: > > > > > > Agreed not to have a test case for this. > > > > > > I've attached an updated patch. I've confirmed this patch works for > > > all supported branches. > > > > > > > I have slightly changed the checks used in the patch, otherwise looks > > good to me. I am planning to push (v11-v15) the attached tomorrow > > unless there are more comments. > > > > Pushed. > A big thanks to you! Could you also, close the commitfest entry https://commitfest.postgresql.org/41/4024/, please? -- Best regards, Maxim Orlov.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila wrote: > > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada wrote: > > > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila > > wrote: > > > > Agreed not to have a test case for this. > > > > I've attached an updated patch. I've confirmed this patch works for > > all supported branches. > > > > I have slightly changed the checks used in the patch, otherwise looks > good to me. I am planning to push (v11-v15) the attached tomorrow > unless there are more comments. > Pushed. -- With Regards, Amit Kapila.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada wrote: > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila wrote: > > Agreed not to have a test case for this. > > I've attached an updated patch. I've confirmed this patch works for > all supported branches. > I have slightly changed the checks used in the patch, otherwise looks good to me. I am planning to push (v11-v15) the attached tomorrow unless there are more comments. -- With Regards, Amit Kapila. v2-0001-Fix-uninitialized-access-to-InitialRunningXacts-d.patch Description: Binary data
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
> > Agreed not to have a test case for this. > > I've attached an updated patch. I've confirmed this patch works for > all supported branches. > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com > It works for me as well. Thanks! I've created a commitfest entry for this patch, see https://commitfest.postgresql.org/41/4024/ Hope, it will be committed soon. -- Best regards, Maxim Orlov.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila wrote: > > On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov wrote: > >> > >> > >> Regarding the tests, the patch includes a new scenario to > >> reproduce this issue. However, since the issue can be reproduced also > >> by the existing scenario (with low probability, though), I'm not sure > >> it's worth adding the new scenario. > > > > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it > > does not worth it. But, I do not have a strong opinion here. > > Let's add tests in a separate commit and let the actual committer to decide > > what to do, should we? > > > > +1 to not have a test for this as the scenario can already be tested > by the existing set of tests. Agreed not to have a test case for this. I've attached an updated patch. I've confirmed this patch works for all supported branches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com 0001-Reset-InitialRunningXacts-array-when-freeing-SnapBui.patch Description: Binary data
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov wrote: >> >> >> Regarding the tests, the patch includes a new scenario to >> reproduce this issue. However, since the issue can be reproduced also >> by the existing scenario (with low probability, though), I'm not sure >> it's worth adding the new scenario. > > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it does > not worth it. But, I do not have a strong opinion here. > Let's add tests in a separate commit and let the actual committer to decide > what to do, should we? > +1 to not have a test for this as the scenario can already be tested by the existing set of tests. -- With Regards, Amit Kapila.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
> I've attached a draft patch. To fix it, I think we can reset > InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder() > and add an assertion in AllocateSnapshotBuilder() to make sure both > are reset. > Thanks for the patch. It works fine. I've tested this patch for 15 and 11 versions on x86_64 and ARM and see no fails. But the function pg_current_xact_id added by 4c04be9b05ad doesn't exist in PG11. > Regarding the tests, the patch includes a new scenario to > reproduce this issue. However, since the issue can be reproduced also > by the existing scenario (with low probability, though), I'm not sure > it's worth adding the new scenario. > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it does not worth it. But, I do not have a strong opinion here. Let's add tests in a separate commit and let the actual committer to decide what to do, should we? -- Best regards, Maxim Orlov.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Hi, On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila wrote: > > On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov wrote: > > > > PROBLEM > > > > After some investigation, I think, the problem is in the snapbuild.c > > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts > > array in the context of builder->context, but for the time when we call > > SnapBuildPurgeOlderTxn this context may be already free'd. > > > > I think you are seeing it freed in SnapBuildPurgeOlderTxn when we > finish and restart decoding in the same session. After finishing the > first decoding, it frees the decoding context but we forgot to reset > NInitialRunningXacts and InitialRunningXacts array. So, next time when > we start decoding in the same session where we don't restore any > serialized snapshot, it can lead to the problem you are seeing because > NInitialRunningXacts (and InitialRunningXacts array) won't have sane > values. > > This can happen in the catalog_change_snapshot test as we have > multiple permutations and those use the same session across a restart > of decoding. I have the same analysis. In order to restart the decoding from the LSN where we don't restore any serialized snapshot, we somehow need to advance the slot's restart_lsn. In this case, I think it happened since the same session drops at the end of the first scenario and creates the replication slot with the same name at the beginning of the second scenario in catalog_change_snapshot.spec. > > > > > Simple fix like: > > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr > > lsn, xl_running_xacts *runn > > * changes. See SnapBuildXidSetCatalogChanges. > > */ > > NInitialRunningXacts = nxacts; > > - InitialRunningXacts = MemoryContextAlloc(builder->context, > > sz); > > + InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, > > sz); > > memcpy(InitialRunningXacts, running->xids, sz); > > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), > > xidComparator); > > > > seems to solve the described problem, but I'm not in the context of [0] and > > why array is allocated in builder->context. > > > > It will leak the memory for InitialRunningXacts. We need to reset > NInitialRunningXacts (and InitialRunningXacts) as mentioned above. > > Thank you for the report and initial analysis. I have added Sawada-San > to know his views as he was the primary author of this work. Thanks! I've attached a draft patch. To fix it, I think we can reset InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder() and add an assertion in AllocateSnapshotBuilder() to make sure both are reset. Regarding the tests, the patch includes a new scenario to reproduce this issue. However, since the issue can be reproduced also by the existing scenario (with low probability, though), I'm not sure it's worth adding the new scenario. I've not checked if the patch works for version 14 or older yet. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com reset_initial_running_xacts.patch Description: Binary data
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
> > Thank you for the report and initial analysis. I have added Sawada-San > to know his views as he was the primary author of this work. > > -- > With Regards, > Amit Kapila. > OK, thanks a lot. I hope, we'll fix this soon. -- Best regards, Maxim Orlov.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov wrote: > > PROBLEM > > After some investigation, I think, the problem is in the snapbuild.c (commit > 272248a0c1b1, see [0]). We do allocate InitialRunningXacts > array in the context of builder->context, but for the time when we call > SnapBuildPurgeOlderTxn this context may be already free'd. > I think you are seeing it freed in SnapBuildPurgeOlderTxn when we finish and restart decoding in the same session. After finishing the first decoding, it frees the decoding context but we forgot to reset NInitialRunningXacts and InitialRunningXacts array. So, next time when we start decoding in the same session where we don't restore any serialized snapshot, it can lead to the problem you are seeing because NInitialRunningXacts (and InitialRunningXacts array) won't have sane values. This can happen in the catalog_change_snapshot test as we have multiple permutations and those use the same session across a restart of decoding. > > Simple fix like: > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr > lsn, xl_running_xacts *runn > * changes. See SnapBuildXidSetCatalogChanges. > */ > NInitialRunningXacts = nxacts; > - InitialRunningXacts = MemoryContextAlloc(builder->context, > sz); > + InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, > sz); > memcpy(InitialRunningXacts, running->xids, sz); > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), > xidComparator); > > seems to solve the described problem, but I'm not in the context of [0] and > why array is allocated in builder->context. > It will leak the memory for InitialRunningXacts. We need to reset NInitialRunningXacts (and InitialRunningXacts) as mentioned above. Thank you for the report and initial analysis. I have added Sawada-San to know his views as he was the primary author of this work. -- With Regards, Amit Kapila.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
On Tue, Nov 22, 2022 at 2:22 AM Andres Freund wrote: > > On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote: > > After some investigation, I think, the problem is in the snapbuild.c > > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts > > array in the context of builder->context, but for the time when we call > > SnapBuildPurgeOlderTxn this context may be already free'd. Thus, > > InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F) > > values. This is not caused buildfarm to fail due to that code: > > Amit, that does indeed seem to be a problem... > I'll look into it. -- With Regards, Amit Kapila.
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Hi, On 2022-11-21 15:47:12 +0300, Maxim Orlov wrote: > After some investigation, I think, the problem is in the snapbuild.c > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts > array in the context of builder->context, but for the time when we call > SnapBuildPurgeOlderTxn this context may be already free'd. Thus, > InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F) > values. This is not caused buildfarm to fail due to that code: Amit, that does indeed seem to be a problem... Greetings, Andres Freund
[BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Hi! PREAMBLE For a last couple of months, I stumbled into a problem while running tests on ARM (Debain, aarch64) and some more wired platforms for my 64–bit XIDs patch set. Test contrib/test_decoding (catalog_change_snapshot) rarely failed with the next message: TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) && TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c" I have plenty of failing on ARM, couple on x86 and none (if memory serves) on x86–64. At first, my thought was to blame my 64–bit XID patch for what, but this is not the case. This error persist from PG15 to PG10 without any patch applied. Though hard to reproduce. PROBLEM After some investigation, I think, the problem is in the snapbuild.c (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts array in the context of builder->context, but for the time when we call SnapBuildPurgeOlderTxn this context may be already free'd. Thus, InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F) values. This is not caused buildfarm to fail due to that code: if (!NormalTransactionIdPrecedes(InitialRunningXacts[0], builder->xmin)) return; Since the cluster is initialised with XID way less than 0x7F7F7F7F, we get to return here, but the problem is still existing. I've attached the patch based on branch REL_15_STABLE to reproduce the problem on x86-64. On my patch set of 64–bit XID's this problem manifested since I do init cluster with XID far beyond 32–bit bound. Alternatively, I did try to use my patch [1] to init cluster with first transaction 2139062143 (i.e. 0x7F7F7F7F). Then put pg_sleep call just like in the attached patch: --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder) if (NInitialRunningXacts == 0) return; + pg_usleep(100L * 2L); + /* bound check if there is at least one transaction to remove */ if (!NormalTransactionIdPrecedes(InitialRunningXacts[0], builder->xmin)) Run installcheck-force for many times for a test_decoding/ catalog_change_snapshot's and got a segfault. CONCLUSION In snapbuild.c, context allocated array InitialRunningXacts may be free'd, this caused assertion failed (at best) or may lead to the more serious problems. P.S. Simple fix like: @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn * changes. See SnapBuildXidSetCatalogChanges. */ NInitialRunningXacts = nxacts; - InitialRunningXacts = MemoryContextAlloc(builder->context, sz); + InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz); memcpy(InitialRunningXacts, running->xids, sz); qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator); seems to solve the described problem, but I'm not in the context of [0] and why array is allocated in builder->context. [0] https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com [1] https://www.postgresql.org/message-id/flat/CACG=ezaa4vqYjJ16yoxgrpa-=gxnf0vv3ey9bjgrrrfn2yy...@mail.gmail.com -- Best regards, Maxim Orlov. From d09a031f1f807cdfe1e02000b2bf4fd3eaaedd8f Mon Sep 17 00:00:00 2001 From: Maxim Orlov Date: Mon, 21 Nov 2022 14:50:02 +0300 Subject: [PATCH] catalog_change_snapshot-fail --- contrib/test_decoding/Makefile | 1007 ++- src/backend/replication/logical/snapbuild.c | 12 + 2 files changed, 1012 insertions(+), 7 deletions(-) diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index c7ce603706..aaf7a63411 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -3,12 +3,1006 @@ MODULES = test_decoding PGFILEDESC = "test_decoding - example of a logical decoding output plugin" -REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ - decoding_into_rel binary prepared replorigin time messages \ - spill slot truncate stream stats twophase twophase_stream -ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ - oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \ - twophase_snapshot slot_creation_error catalog_change_snapshot +ISOLATION = catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ + catalog_change_snapshot \ +