Andres Freund <[email protected]> wrote:

> > While troubleshooting REPACK issue [1], I realized that
> > HeapTupleSatisfiesMVCCBatch() can also be called during logical decoding - 
> > in
> > that case we need to use a historic MVCC snapshot.
> 
> Huh. Indeed. That's unintentional - the path should never have been reached,
> we are checking that an MVCC snapshot is used. Unfortunately, somebody
> (i.e. probably me) at some point defined the relevant macro as
> 
> /* This macro encodes the knowledge of which snapshots are MVCC-safe */
> #define IsMVCCSnapshot(snapshot)  \
>       ((snapshot)->snapshot_type == SNAPSHOT_MVCC || \
>        (snapshot)->snapshot_type == SNAPSHOT_HISTORIC_MVCC)
> 
> Which makes sense for some places, but not for plenty others.
> 
> The reason this didn't cause more widespread issues is that during logical
> decoding we mostly don't use sequential scans etc that are affected by the
> these paths.

> > My proposal to fix the problem is attached.
> 
> That's imo not at all the right fix - it'd make visibility during seqscans
> checking noticeably slower.

ok

> I think we ought to instead restrict the page-at-a-time scans to only happen
> with "real" mvcc snapshots. I.e. this:
> 
>       /*
>        * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
>        */
>       if (!(snapshot && IsMVCCSnapshot(snapshot)))
>               scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
>
> should trigger for historic snapshots as well.

I suppose you mean changing it to

        if (!(snapshot && IsMVCCSnapshot(snapshot) &&
                  !IsHistoricMVCCSnapshot(snapshot)))
                scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;

> Does that fix the issue for you?

Yes, with this change, I don't hit the problem anymore.

> What's your reproducer?

Check out this branch

https://github.com/michail-nikolaev/postgres/tree/repack_concurrently_repro_22

and run t/008_repack_concurrently.pl in contrib/amcheck. The error we saw in
most cases was "ERROR: cache lookup failed for relation". I noticed that the
related pg_class entries had hint bits set incorrectly, so I added the
following to see when exactly it happens (actually I used lower elevel first,
to find out that the decoding worker is responsible for the problem):

diff --git a/src/backend/access/heap/heapam_visibility.c 
b/src/backend/access/heap/heapam_visibility.c
index 75ae268d753..ebf38460873 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -73,6 +73,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "commands/cluster.h"
 #include "storage/bufmgr.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -938,6 +939,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
                                                
HeapTupleHeaderGetRawXmin(tuple));
                else
                {
+                       if (am_decoding_for_repack())
+                               elog(PANIC, "HEAP_XMIN_INVALID set");
                        /* it must have aborted or crashed */
                        SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                                                InvalidTransactionId);

The backtrace looked like:

    #3  0x0000000000c367ad errfinish (postgres + 0x8367ad)
    #4  0x0000000000517ee1 HeapTupleSatisfiesMVCC (postgres + 0x117ee1)
    #5  0x0000000000518e14 HeapTupleSatisfiesMVCCBatch (postgres + 0x118e14)
    #6  0x000000000050483e page_collect_tuples (postgres + 0x10483e)
    #7  0x0000000000504a8e heap_prepare_pagescan (postgres + 0x104a8e)
    #8  0x0000000000505344 heapgettup_pagemode (postgres + 0x105344)
    #9  0x0000000000505cff heap_getnextslot (postgres + 0x105cff)
    #10 0x000000000052ca08 table_scan_getnextslot (postgres + 0x12ca08)
    #11 0x000000000052d4ed systable_getnext (postgres + 0x12d4ed)
    #12 0x0000000000c2098e ScanPgRelation (postgres + 0x82098e)
    #13 0x0000000000c23aa2 RelationReloadIndexInfo (postgres + 0x823aa2)
    #14 0x0000000000c24376 RelationRebuildRelation (postgres + 0x824376)
    #15 0x0000000000c23784 RelationIdGetRelation (postgres + 0x823784)
    #16 0x00000000004b558f relation_open (postgres + 0xb558f)
    #17 0x000000000052e073 index_open (postgres + 0x12e073)
    #18 0x000000000052d0c5 systable_beginscan (postgres + 0x12d0c5)
    #19 0x0000000000c2097e ScanPgRelation (postgres + 0x82097e)
    #20 0x0000000000c23dde RelationReloadNailed (postgres + 0x823dde)
    #21 0x0000000000c24399 RelationRebuildRelation (postgres + 0x824399)
    #22 0x0000000000c23784 RelationIdGetRelation (postgres + 0x823784)
    #23 0x00000000004b558f relation_open (postgres + 0xb558f)
    #24 0x0000000000573ab1 table_open (postgres + 0x173ab1)
    #25 0x0000000000c20919 ScanPgRelation (postgres + 0x820919)
    #26 0x0000000000c21c98 RelationBuildDesc (postgres + 0x821c98)
    #27 0x0000000000c237d6 RelationIdGetRelation (postgres + 0x8237d6)
    #28 0x00000000009a028a ReorderBufferProcessTXN (postgres + 0x5a028a)
    #29 0x00000000009a10ed ReorderBufferReplay (postgres + 0x5a10ed)
    #30 0x00000000009a116b ReorderBufferCommit (postgres + 0x5a116b)
    #31 0x000000000098c7fe DecodeCommit (postgres + 0x58c7fe)
    #32 0x000000000098ba67 xact_decode (postgres + 0x58ba67)
    #33 0x000000000098b685 LogicalDecodingProcessRecord (postgres + 0x58b685)
    #34 0x00000000006a3e4b decode_concurrent_changes (postgres + 0x2a3e4b)
    #35 0x00000000006a5ea3 repack_worker_internal (postgres + 0x2a5ea3)
    #36 0x00000000006a5d5e RepackWorkerMain (postgres + 0x2a5d5e)
    #37 0x0000000000961b2b BackgroundWorkerMain (postgres + 0x561b2b)
    #38 0x0000000000964a61 postmaster_child_launch (postgres + 0x564a61)
    #39 0x000000000096b58d StartBackgroundWorker (postgres + 0x56b58d)
    #40 0x000000000096b80a maybe_start_bgworkers (postgres + 0x56b80a)
    #41 0x000000000096a5d9 LaunchMissingBackgroundProcesses (postgres + 
0x56a5d9)
    #42 0x00000000009683fc ServerLoop (postgres + 0x5683fc)
    #43 0x0000000000967d37 PostmasterMain (postgres + 0x567d37)
    #44 0x0000000000813421 main (postgres + 0x413421)

Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


Reply via email to