On 9/5/22 12:12, Amit Kapila wrote:
> On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>>
>> On 9/5/22 06:32, Amit Kapila wrote:
>>> On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
>>> <tomas.von...@enterprisedb.com> wrote:
>>>>
>>>> On 9/4/22 14:24, Tomas Vondra wrote:
>>>>>
>>>>>> As per
>>>>>> my understanding, the problem I reported in the email [1] is the same
>>>>>> and we have seen this in BF failures as well. I posted a way to
>>>>>> reproduce it in that email. It seems this is possible if the decoding
>>>>>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
>>>>>> subtransaction) after XLOG_RUNNING_XACTS.
>>>>>>
>>>>>
>>>>> Interesting. That's certainly true for WAL in the crashing case:
>>>>>
>>>>> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
>>>>> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
>>>>> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
>>>>> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
>>>>> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
>>>>> 58/38; cmin: 1, cmax: 14, combo: 6
>>>>>
>>>>
>>>> I investigated using the pgdata from the crashed run (can provide, if
>>>> you have rpi4 or some other aarch64 machine), and the reason is pretty
>>>> simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
>>>> after the subxact assignment, so we add both xids as toplevel.
>>>>
>>>> That seems broken - if we skip the assignment like this, doesn't that
>>>> break spill-to-disk and/or streaming? IIRC that's exactly why we had to
>>>> start logging assignments immediately with wal_level=logical.
>>>>
>>>
>>> We had started logging assignments immediately in commit 0bead9af48
>>> for streaming transactions in PG14. This issue exists prior to that. I
>>> have tried and reproduced it in PG13 but I think it will be there even
>>> before that. So, I am not sure if the spilling behavior is broken due
>>> to this. I think if we don't get assignment recording before
>>> processing changes during decoding commit then we could miss sending
>>> the changes which won't be the case here. Do you see any other
>>> problem?
>>>
>>
>> I can't, but that's hardly a proof of anything. You're right spilling to
>> disk may not be broken by this, though. I forgot it precedes assignments
>> being logged immediately, so it does not rely on that.
>>
>>>> Or maybe we're not dealing with the restart_lsn properly, and we should
>>>> have ignored those records. Both xacts started long before the restart
>>>> LSN, so we're not seeing the whole xact anyway.
>>>>
>>>
>>> Right, but is that problematic? The restart LSN will be used as a
>>> point to start reading the WAL and that helps in building a consistent
>>> snapshot. However, for decoding to send the commit, we use
>>> start_decoding_at point which will ensure that we send complete
>>> transactions.
>>>
>>
>> Which part would not be problematic? There's some sort of a bug, that's
>> for sure.
>>
> 
> It is possible that there is some other problem here that I am
> missing. But at this stage, I don't see anything wrong other than the
> assertion you have reported.
> 

I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.

>> I think it's mostly clear we won't output this transaction, because the
>> restart LSN is half-way through. We can either ignore it at commit time,
>> and then we have to make everything work in case we miss assignments (or
>> any other part of the transaction).
>>
> 
> Note, traditionally, we only form these assignments at commit time
> after deciding whether to skip such commits. So, ideally, there
> shouldn't be any fundamental problem with not making these
> associations before deciding whether we need to replay (send
> downstream) any particular transaction.
> 

Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time). AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?

>> Or we can ignore stuff early, and not even process some of the changes.
>> For example in this case do we need to process the NEW_CID contents for
>> transaction 848? If we can skip that bit, the problem will disappear.
>>
>> But maybe this is futile and there are other similar issues ...
>>
>>>> However, when processing the NEW_CID record:
>>>>
>>>> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
>>>> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
>>>>
>>>> we ultimately do this in SnapBuildProcessNewCid:
>>>>
>>>> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
>>>> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
>>>> combocid=6) at reorderbuffer.c:3218
>>>> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
>>>> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
>>>>
>>>> so in fact we *know* 849 is a subxact of 848, but we don't call
>>>> ReorderBufferAssignChild in this case. In fact we can't even do the
>>>> assignment easily in this case, because we create the subxact first, so
>>>> that the crash happens right when we attempt to create the toplevel one,
>>>> and we never even get a chance to do the assignment:
>>>>
>>>> 1) process the NEW_CID record, logged for 849 (subxact)
>>>> 2) process CIDs in the WAL record, which has topleve_xid 848
>>>>
>>>>
>>>> So IMHO we need to figure out what to do for WAL records that create
>>>> both the toplevel and subxact - either we need to skip them, or rethink
>>>> how we create the ReorderBufferTXN structs.
>>>>
>>>
>>> As per my understanding, we can't skip them as they are used to build
>>> the snapshot.
>>>
>>
>> Don't we know 848 (the top-level xact) won't be decoded? In that case we
>> won't need the snapshot, so why build it?
>>
> 
> But this transaction id can be part of committed.xip array if it has
> made any catalog changes. We add the transaction/subtransaction to
> this array before deciding whether to skip decoding/replay of its
> commit.
> 

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to