On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.m...@gmail.com> > wrote: > > > > Summarizing this issue, the assertion check in AssertTXNLsnOrder() > > fails as reported because the current logical decoding cannot properly > > handle the case where the decoding restarts from NEW_CID. Since we > > don't make the association between top-level transaction and its > > subtransaction while decoding NEW_CID (ie, in > > SnapBuildProcessNewCid()), two transactions are created in > > ReorderBuffer as top-txn and have the same LSN. This failure happens > > on all supported versions. > > > > To fix the problem, one idea is that we make the association between > > top-txn and sub-txn during that by calling ReorderBufferAssignChild(), > > as Tomas proposed. On the other hand, since we don't guarantee to make > > the association between the top-level transaction and its > > sub-transactions until we try to decode the actual contents of the > > transaction, it makes sense to me that instead of trying to solve by > > making association, we need to change the code which are assuming that > > it is associated. > > > > I've attached the patch for this idea. With the patch, we skip the > > assertion checks in AssertTXNLsnOrder() until we reach the LSN at > > which we start decoding the contents of transaction, ie. > > start_decoding_at in SnapBuild. The minor concern is other way that > > the assertion check could miss some faulty cases where two unrelated > > top-transactions could have same LSN. With this patch, it will pass > > for such a case. Therefore, for transactions that we skipped checking, > > we do the check when we reach the LSN. > > > > > > --- a/src/backend/replication/logical/decode.c > +++ b/src/backend/replication/logical/decode.c > @@ -113,6 +113,15 @@ > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, > XLogReaderState *recor > buf.origptr); > } > > +#ifdef USE_ASSERT_CHECKING > + /* > + * Check the order of transaction LSNs when we reached the start decoding > + * LSN. See the comments in AssertTXNLsnOrder() for details. > + */ > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr) > + AssertTXNLsnOrder(ctx->reorder); > +#endif > + > rmgr = GetRmgr(XLogRecGetRmid(record)); > > > > I am not able to think how/when this check will be useful. Because we > skipped assert checking only for records that are prior to > start_decoding_at point, I think for those records ordering should > have been checked before the restart. start_decoding_at point will be > either (a) confirmed_flush location, or (b) lsn sent by client, and > any record prior to that must have been processed before restart.
Good point. I was considering the case where the client sets far ahead LSN but it's not worth considering this case in this context. I've updated the patch accoringly. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data