Re: Flaky vacuum truncate test in reloptions.sql
On Fri, Apr 2, 2021 at 9:46 AM Michael Paquier wrote:
> Okay, applied and back-patched down to 12 then.

Thank you both. Unfortunately and surprisingly, the test still fails (perhaps even more rarely, once in several hundred runs) under multimaster. After scratching my head for some more time, it seems to me the following happens: not only does vacuum encounter a locked page, but there also exists a concurrent backend (as the parallel schedule is run) that holds back oldestxmin, keeping it less than the xid of the transaction which did the insertion

  INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);

FreezeLimit can't be higher than oldestxmin, so lazy_check_needs_freeze decides there is nothing to freeze on the page. multimaster commits are quite heavy, which apparently shifts the timings, making the issue more likely.

Currently we are testing the rather funny attached patch which forces all such old-snapshot-holders to finish. It is crutchy, but I doubt we want to change vacuum logic (e.g. checking tuple liveness in lazy_check_needs_freeze) due to this issue. (It is especially crutchy in the xid::bigint casts, but wraparound is hardly expected in regression test runs.)

--
cheers, arseny

diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index bb7bd6e1e7e..78bbf4a5255 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -128,6 +128,20 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 95f7ab4189e..96fb59d16ad 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -72,6 +72,20 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
Re: Flaky vacuum truncate test in reloptions.sql
Michael Paquier writes:
> On Thu, Apr 01, 2021 at 12:52:21PM +0900, Masahiko Sawada wrote:
>> Just to be clear the context, I'm mentioning the following test case:

Sorry, I misremembered the test and assumed the table is non-empty there, while it is empty but vacuum_truncate is disabled. Still, this doesn't change my conclusion that freezing is not a big deal there due to the small chance of a locked page. Anyway, let's finish with this.

> What you are writing here makes sense to me.  Looking at the test, it
> is designed to test vacuum_truncate, aka that the behavior we want to
> stress (your former case here) gets stressed all the time, so adding
> the options to avoid the latter case all the time is an improvement.
> And this, even if the latter case does not actually cause a diff and
> it has a small chance to happen in practice.
>
> It would be good to add a comment explaining why the options are
> added (aka just don't skip any pages).

How about the attached?

--
cheers, arseny

diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..d6f5bf98114 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,7 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column?
 ----------
@@ -127,7 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..61638c3bcae 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,7 +61,8 @@ CREATE TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 SELECT reloptions FROM pg_class WHERE oid =
@@ -71,7 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 -- Test toast.* options
Re: Flaky vacuum truncate test in reloptions.sql
Arseny Sher writes:
> as currently the chance of its failure is close to 1.

A typo: close to 0, of course.
Re: Flaky vacuum truncate test in reloptions.sql
Masahiko Sawada writes:
>> I don't think this matters much, as it tests the contrary and the
>> probability of successful test passing (in case of a theoretical bug
>> making vacuum truncate a non-empty relation) becomes stunningly small.
>> But adding it wouldn't hurt either.
>
> I was concerned a bit that without FREEZE in the first VACUUM we could
> not test it properly because the table could not be truncated because
> either vacuum_truncate is off

FREEZE won't help us there.

> or the page is skipped.

You mean that at the same time there is a potential bug in vacuum which would force the truncation of a non-empty relation if the page weren't locked? That would mean the chance of the test passing even a single time is close to 0, as currently the chance of its failure is close to 1.

--
cheers, arseny
Re: Flaky vacuum truncate test in reloptions.sql
On 3/31/21 4:17 PM, Masahiko Sawada wrote:
> Is it better to add FREEZE to the first "VACUUM reloptions_test;" as well?

I don't think this matters much, as it tests the contrary, and the probability of the test successfully passing (in case of a theoretical bug making vacuum truncate a non-empty relation) becomes stunningly small. But adding it wouldn't hurt either.

--
cheers, arseny
Re: Flaky vacuum truncate test in reloptions.sql
On 3/30/21 10:12 AM, Michael Paquier wrote:
> Yep, this is the same problem as the one discussed for c2dc1a7, where
> a concurrent checkpoint may cause a page to be skipped, breaking the
> test.

Indeed, Alexander Lakhin pointed me to that commit after I wrote the message.

> Why not just using DISABLE_PAGE_SKIPPING instead here?

I think this is not enough. DISABLE_PAGE_SKIPPING disables vm consulting (sets aggressive=true in the routine); however, if the page is locked and lazy_check_needs_freeze says there is nothing to freeze on it, we again don't look at its contents closely.

--
cheers, arseny
Flaky vacuum truncate test in reloptions.sql
Hi,

I rarely observe a failure of the vacuum-with-truncation test in reloptions.sql, i.e. the truncation doesn't happen:

--- ../../src/test/regress/expected/reloptions.out	2020-04-16 12:37:17.749547401 +0300
+++ ../../src/test/regress/results/reloptions.out	2020-04-17 00:14:58.999211750 +0300
@@ -131,7 +131,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

Close reading of lazy_scan_heap says that the failure indeed might happen: if ConditionalLockBufferForCleanup couldn't lock the buffer and either the buffer doesn't need freezing or vacuum is not aggressive, we don't insist on close inspection of the page contents and count it as nonempty according to lazy_check_needs_freeze. It means the page is regarded as such even if it contains only garbage (but occupied) ItemIds, which is the case in the test. And of course this allegedly nonempty page prevents the truncation. Obvious competitors for the page are bgwriter/checkpointer; the chances of a simultaneous attack are small, but they exist.

A simple fix is to perform an aggressive VACUUM FREEZE, as attached.

I'm a bit puzzled that I've only ever seen this when running regression tests under our multimaster. While multimaster contains a fair amount of C code, I don't see how any of it can interfere with the vacuuming business here. I can't say I did my best to create the reproduction though -- the explanation above seems to be enough.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..fa1c223c87a 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -127,7 +127,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..a84aae5093b 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -71,7 +71,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 -- Test toast.* options
Enlarge IOS vm cache
 /*
  * Rats, we have to visit the heap to check visibility.
@@ -377,11 +382,14 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
+	/* Release VM buffer pins, if any. */
+	for (int i = 0; i < VMBUF_SIZE; i++)
 	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
+		if (node->ioss_VMBuffer[i] != InvalidBuffer)
+		{
+			ReleaseBuffer(node->ioss_VMBuffer[i]);
+			node->ioss_VMBuffer[i] = InvalidBuffer;
+		}
 	}
 
 	/*
@@ -680,7 +688,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
 									 node->ioss_NumOrderByKeys,
 									 piscan);
 	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	for (int i = 0; i < VMBUF_SIZE; i++)
+		node->ioss_VMBuffer[i] = InvalidBuffer;
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 57362c36876..0c9bda8de17 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -28,6 +28,24 @@
 #define VISIBILITYMAP_VALID_BITS	0x03	/* OR of all valid visibilitymap
 											 * flags bits */
 
+/*
+ * Size of the bitmap on each visibility map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/* Number of heap blocks we can represent in one byte */
+#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
+
+/* Number of heap blocks we can represent in one visibility map page. */
+#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the visibility map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
 /* Macros for visibilitymap test */
 #define VM_ALL_VISIBLE(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e6..15290be11c6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1439,6 +1439,8 @@ typedef struct IndexScanState
 	Size		iss_PscanLen;
 } IndexScanState;
 
+#define VMBUF_SIZE 64
+
 /*
  * IndexOnlyScanState information
  *
@@ -1473,7 +1475,7 @@ typedef struct IndexOnlyScanState
 	Relation	ioss_RelationDesc;
 	struct IndexScanDescData *ioss_ScanDesc;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
+	Buffer		ioss_VMBuffer[VMBUF_SIZE];
 	Size		ioss_PscanLen;
 } IndexOnlyScanState;

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Use-after-free in 12- EventTriggerAlterTableEnd
Hi,

Valgrind on our internal buildfarm complained about use-after-free during currentEventTriggerState->commandList manipulations, e.g. lappend in EventTriggerCollectSimpleCommand. I've discovered that the source of the problem is EventTriggerAlterTableEnd not bothering to switch into its own context before appending to the list. ced138e8cba fixed this in master and 13 but wasn't back-patched further, so I see the problem in 12-.

The particular reproducing scenario is somewhat involved; it combines the pg_pathman [1] extension and the SQL interface to it created in our fork of Postgres. Namely, we allow creating partitions in a CREATE TABLE PARTITIONED BY statement (similar to what [2] proposes). Each partition is created via a separate SPI call which in turn ends up making an AlterTableStmt go through ProcessUtility as PROCESS_UTILITY_SUBCOMMAND; so EventTriggerAlterTableStart/End is done, but an additional EventTriggerQueryState is not allocated and commands are collected in the toplevel EventTriggerQueryState. Of course SPI frees its memory between the calls, which makes valgrind scream.

Admittedly our case is tricky and I'm not sure it is possible to make up something like that in pure core code, but I do believe other extension writers might run into this, so I propose to back-patch the (attached) three-line healing to all supported branches. Technically, all you (an extension author) have to do to encounter this is:

1) Have a toplevel EventTriggerQueryState, i.e. catch a utility statement.
2) Inside it, run AlterTable as PROCESS_UTILITY_SUBCOMMAND in some short-living context.
[1] https://github.com/postgrespro/pg_pathman
[2] https://www.postgresql.org/message-id/flat/CALT9ZEFBv05OhLMKO1Lbo_Zg9a0v%2BU9q9twe%3Dt-dixfR45RmVQ%40mail.gmail.com#f86f0fcfa62d5108fb81556a43f42457

diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index f7ee9838f7..d1972e2d7f 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1807,9 +1807,15 @@ EventTriggerAlterTableEnd(void)
 	/* If no subcommands, don't collect */
 	if (list_length(currentEventTriggerState->currentCommand->d.alterTable.subcmds) != 0)
 	{
+		MemoryContext oldcxt;
+
+		oldcxt = MemoryContextSwitchTo(currentEventTriggerState->cxt);
+
 		currentEventTriggerState->commandList =
 			lappend(currentEventTriggerState->commandList,
 					currentEventTriggerState->currentCommand);
+
+		MemoryContextSwitchTo(oldcxt);
 	}
 	else
 		pfree(currentEventTriggerState->currentCommand);

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Parallel query hangs after a smart shutdown is issued
Tom Lane writes:
> Thomas Munro writes:
>> On Fri, Aug 14, 2020 at 4:45 AM Tom Lane wrote:
>>> After some more rethinking and testing, here's a v5 that feels
>>> fairly final to me.  I realized that the logic in canAcceptConnections
>>> was kind of backwards: it's better to check the main pmState restrictions
>>> first and then the smart-shutdown restrictions afterwards.
>
>> LGTM.  I tested this a bit today and it did what I expected for
>> parallel queries and vacuum, on primary and standby.
>
> Thanks for reviewing!  I'll do the back-patching and push this today.

FWIW, I've also looked through the patch and it's fine. Moderate testing also found no issues: check-world works, and bgws are started during smart shutdown as expected. And surely this is better than the initial short hack of allowing only parallel workers.
Re: logical copy_replication_slot issues
Masahiko Sawada writes:
>  /*
> - * Create logical decoding context, to build the initial snapshot.
> + * Create logical decoding context to find start point or, if we don't
> + * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
>  */
>
> Do we need to numbering that despite not referring them?

No, it just seemed clearer to me this way. I don't mind removing the numbers if you feel this is better.

> 	ctx = CreateInitDecodingContext(plugin, NIL,
> -									false,	/* do not build snapshot */
> +									false,	/* do not build data snapshot */
> 									restart_lsn,
> 									logical_read_local_xlog_page, NULL, NULL,
> 									NULL);
>
> I'm not sure this change makes the comment better. Could you elaborate
> on the motivation of this change?

Well, DecodingContextFindStartpoint always builds a snapshot allowing historical *catalog* lookups. This bool controls whether the snapshot should additionally be suitable for looking at the actual data; this is e.g. used by initial data sync in the native logical replication.

--
cheers, arseny
Re: logical copy_replication_slot issues
I wrote:
> It looks good to me now.

After sitting in my head for some time, it reminded me that CreateInitDecodingContext not only pegs the LSN but also xmin, so the attached makes a minor comment correction.

While taking a look at the nearby code, it seemed weird to me that GetOldestSafeDecodingTransactionId checks PGXACT->xid, not xmin. I don't want to investigate this at the moment though, and not for this thread.

Also not for this thread, but I've noticed pg_copy_logical_replication_slot doesn't allow changing the plugin name, which is an omission in my view. It would be useful and trivial to do.

--
cheers, arseny

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de6d9..da634bef0e 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -121,7 +121,8 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  */
 static void
 create_logical_replication_slot(char *name, char *plugin,
-								bool temporary, XLogRecPtr restart_lsn)
+								bool temporary, XLogRecPtr restart_lsn,
+								bool find_startpoint)
 {
 	LogicalDecodingContext *ctx = NULL;
@@ -139,16 +140,18 @@ create_logical_replication_slot(char *name, char *plugin,
 						  temporary ? RS_TEMPORARY : RS_EPHEMERAL);
 
 	/*
-	 * Create logical decoding context, to build the initial snapshot.
+	 * Create logical decoding context to find start point or, if we don't
+	 * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
 	 */
 	ctx = CreateInitDecodingContext(plugin, NIL,
-									false,	/* do not build snapshot */
+									false,	/* do not build data snapshot */
 									restart_lsn,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
 	/* build initial snapshot, might take a while */
-	DecodingContextFindStartpoint(ctx);
+	if (find_startpoint)
+		DecodingContextFindStartpoint(ctx);
 
 	/* don't need the decoding context anymore */
 	FreeDecodingContext(ctx);
@@ -179,7 +182,8 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	create_logical_replication_slot(NameStr(*name),
 									NameStr(*plugin),
 									temporary,
-									InvalidXLogRecPtr);
+									InvalidXLogRecPtr,
+									true);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	values[1] = LSNGetDatum(MyReplicationSlot->data.confirmed_flush);
@@ -683,10 +687,19 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 	/* Create new slot and acquire it */
 	if (logical_slot)
+	{
+		/*
+		 * WAL required for building snapshot could be removed as we haven't
+		 * reserved WAL yet. So we create a new logical replication slot
+		 * without building an initial snapshot. A reasonable start point for
+		 * decoding will be provided by the source slot.
+		 */
 		create_logical_replication_slot(NameStr(*dst_name),
 										plugin,
 										temporary,
-										src_restart_lsn);
+										src_restart_lsn,
+										false);
+	}
 	else
 		create_physical_replication_slot(NameStr(*dst_name),
 										 true,
@@ -703,6 +716,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		TransactionId copy_xmin;
 		TransactionId copy_catalog_xmin;
 		XLogRecPtr	copy_restart_lsn;
+		XLogRecPtr	copy_confirmed_flush;
 		bool		copy_islogical;
 		char	   *copy_name;
@@ -714,6 +728,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		copy_xmin = src->data.xmin;
 		copy_catalog_xmin = src->data.catalog_xmin;
 		copy_restart_lsn = src->data.restart_lsn;
+		copy_confirmed_flush = src->data.confirmed_flush;
 
 		/* for existence check */
 		copy_name = pstrdup(NameStr(src->data.name));
@@ -738,6 +753,13 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 							NameStr(*src_name)),
 					 errdetail("The source replication slot was modified incompatibly during the copy operation.")));
 
+		/* The source slot must have a consistent snapshot */
+		if (src_islogical && XLogRecPtrIsInvalid(copy_confirmed_flush))
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("cannot copy a logical replication slot that doesn't have confirmed_flush_lsn"),
+					 errhint("Retry when the source replication slot creation is finished.")));
+
 		/* Install copied values again */
 		SpinLockAcquire(&MyReplicationSlot->mutex);
 		MyReplicationSlot->effective_xmin = copy_effective_xmin;
@@ -746,6 +768,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		MyReplicationSlot->data.xmin = copy_xmin;
 		MyReplicationSlot->data.catalog_xmin = copy_catalog_xmin;
 		MyReplicationSlot->data.restart_lsn = copy_restart_lsn;
+		MyReplicationSlot->data.confirmed_flush = copy_confirmed_flush;
 		SpinLockRelease(&MyReplicationSlot->mutex);
 
 		ReplicationSlotMarkDirty();
Re: logical copy_replication_slot issues
Masahiko Sawada writes:
> I've attached the updated version patch that incorporated your
> comments. I believe we're going in the right direction for fixing this
> bug. I'll register this item to the next commit fest so as not to
> forget.

I've moved the confirmed_flush check to the second lookup out of paranoid considerations (e.g. the slot could have been recreated with its creation not finished yet) and made some minor stylistic adjustments. It looks good to me now.

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de6d9..4a3c7aa0ce 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -121,7 +121,8 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  */
 static void
 create_logical_replication_slot(char *name, char *plugin,
-								bool temporary, XLogRecPtr restart_lsn)
+								bool temporary, XLogRecPtr restart_lsn,
+								bool find_startpoint)
 {
 	LogicalDecodingContext *ctx = NULL;
@@ -139,16 +140,18 @@ create_logical_replication_slot(char *name, char *plugin,
 						  temporary ? RS_TEMPORARY : RS_EPHEMERAL);
 
 	/*
-	 * Create logical decoding context, to build the initial snapshot.
+	 * Create logical decoding context to find start point or, if we don't
+	 * need it, to 1) bump slot's restart_lsn 2) check plugin sanity.
 	 */
 	ctx = CreateInitDecodingContext(plugin, NIL,
-									false,	/* do not build snapshot */
+									false,	/* do not build data snapshot */
 									restart_lsn,
 									logical_read_local_xlog_page, NULL, NULL,
 									NULL);
 
 	/* build initial snapshot, might take a while */
-	DecodingContextFindStartpoint(ctx);
+	if (find_startpoint)
+		DecodingContextFindStartpoint(ctx);
 
 	/* don't need the decoding context anymore */
 	FreeDecodingContext(ctx);
@@ -179,7 +182,8 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	create_logical_replication_slot(NameStr(*name),
 									NameStr(*plugin),
 									temporary,
-									InvalidXLogRecPtr);
+									InvalidXLogRecPtr,
+									true);
 
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	values[1] = LSNGetDatum(MyReplicationSlot->data.confirmed_flush);
@@ -683,10 +687,19 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 	/* Create new slot and acquire it */
 	if (logical_slot)
+	{
+		/*
+		 * WAL required for building snapshot could be removed as we haven't
+		 * reserved WAL yet. So we create a new logical replication slot
+		 * without building an initial snapshot. A reasonable start point for
+		 * decoding will be provided by the source slot.
+		 */
 		create_logical_replication_slot(NameStr(*dst_name),
 										plugin,
 										temporary,
-										src_restart_lsn);
+										src_restart_lsn,
+										false);
+	}
 	else
 		create_physical_replication_slot(NameStr(*dst_name),
 										 true,
@@ -703,6 +716,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		TransactionId copy_xmin;
 		TransactionId copy_catalog_xmin;
 		XLogRecPtr	copy_restart_lsn;
+		XLogRecPtr	copy_confirmed_flush;
 		bool		copy_islogical;
 		char	   *copy_name;
@@ -714,6 +728,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		copy_xmin = src->data.xmin;
 		copy_catalog_xmin = src->data.catalog_xmin;
 		copy_restart_lsn = src->data.restart_lsn;
+		copy_confirmed_flush = src->data.confirmed_flush;
 
 		/* for existence check */
 		copy_name = pstrdup(NameStr(src->data.name));
@@ -738,6 +753,13 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 							NameStr(*src_name)),
 					 errdetail("The source replication slot was modified incompatibly during the copy operation.")));
 
+		/* The source slot must have a consistent snapshot */
+		if (src_islogical && XLogRecPtrIsInvalid(copy_confirmed_flush))
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("cannot copy a logical replication slot that doesn't have confirmed_flush_lsn"),
+					 errhint("Retry when the source replication slot creation is finished.")));
+
 		/* Install copied values again */
 		SpinLockAcquire(&MyReplicationSlot->mutex);
 		MyReplicationSlot->effective_xmin = copy_effective_xmin;
@@ -746,6 +768,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		MyReplicationSlot->data.xmin = copy_xmin;
 		MyReplicationSlot->data.catalog_xmin = copy_catalog_xmin;
 		MyReplicationSlot->data.restart_lsn = copy_restart_lsn;
+		MyReplicationSlot->data.confirmed_flush = copy_confirmed_flush;
 		SpinLockRelease(&MyReplicationSlot->mutex);
 
 		ReplicationSlotMarkDirty();

--
cheers, arseny
Re: logical copy_replication_slot issues
Masahiko Sawada writes:
> I've attached the draft patch fixing this issue but I'll continue
> investigating it more deeply.

There should also be a check that the source slot itself has a consistent snapshot (valid confirmed_flush) -- otherwise it might be possible to create an uninitialized slot, which is probably not an error, but weird and somewhat meaningless. Out of paranoia, this ought to be checked in both src slot lookups.

With this patch it seems like the only thing create_logical_replication_slot does is ReplicationSlotCreate, which questions the usefulness of this function. On second look, CreateInitDecodingContext checks plugin sanity (ensures it exists), so probably it's fine.

--
cheers, arseny
logical copy_replication_slot issues
Hi,

While jumping around partially decoded xacts questions [1], I've read through the copy replication slots code (9f06d79ef) and found a couple of issues.

1) It seems quite reckless to me to dive into DecodingContextFindStartpoint without actual WAL reservation (the donor slot's restart_lsn is used, but it is not acquired). Why risk erroring out with a WAL removal error if the function advances the new slot's position to the updated donor's one in the end anyway?

2) In the end, restart_lsn of the new slot is set to the updated donor's one. However, the confirmed_flush field is not updated. This is just wrong -- we could start decoding too early and stream a partially decoded transaction.

I'd probably avoid doing DecodingContextFindStartpoint at all. Its only purpose is to assemble a consistent snapshot (and establish the corresponding pair), but the donor slot must have already done that and we could use it as well. Was this considered?

[1] https://www.postgresql.org/message-id/flat/AB5978B2-1772-4FEE-A245-74C91704ECB0%40amazon.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Too rigorous assert in reorderbuffer.c
Arseny Sher writes:
> I'm sorry to bother you with this again, but due to the new test our
> internal buildfarm revealed that the adjacent assert on cmin is also a
> lie. You see, we can't assume cmin is stable, because the same key
> (relnode, tid) might refer to completely different tuples if a tuple was
> inserted by an aborted subxact, immediately reclaimed, and then the space
> was occupied by another one. A fix is attached.
>
> Technically this might mean a user-facing bug, because we only pick the
> first cmin, which means we might get visibility wrong, allowing some
> version to be seen too early (i.e. the real cmin of the tuple is y, but
> decoding thinks it is x, and x < y). However, I couldn't quickly make up
> an example where this would actually lead to bad consequences. I tried
> to create such an extra visible row in pg_attribute, but that's ok
> because the loop in RelationBuildTupleDesc spins exactly natts times and
> ignores what is left unscanned. It is also ok with pg_class, because
> apparently ScanPgRelation also fishes out the (right) first tuple and
> doesn't check for duplicates appearing later in the scan. Maybe I just
> haven't tried hard enough though.

This issue still exists; it would be nice to fix it...

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: (Re)building index using itself or another index of the same table
Tomas Vondra writes: > On Thu, Sep 12, 2019 at 11:08:28AM -0400, Tom Lane wrote: >>I have exactly no faith that this fixes things enough to make such >>cases supportable. And I have no interest in opening that can of >>worms anyway. I'd rather put in some code to reject database >>accesses in immutable functions. >> > > Same here. My hunch is a non-trivaial fraction of applications using > this "trick" is silently broken in various subtle ways. Ok, I see the point. However, "could not read block" error might seem quite scary to the users; it looks like data corruption. How about ERRORing out then in get_relation_info instead of skipping reindexing indexes, like in attached? Even if this doesn't cover all cases, at least one scenario observed in the field would have better error message. Rejecting database access completely in immutable functions would be unfortunate for our particular case, because this GIN index on expression joining the very indexed table multiple times (and using thus btree index) is, well, useful. Here is a brief description of the case. Indexed table stores postal addresses, which are of hierarchical nature (e.g. country-region-city-street-house). Single row is one element of any depth (e.g. region or house); each row stores link to its parent in parent_guid column, establishing thus the hierarchy (e.g. house has link to the street). The task it to get the full address by typing random parts of it (imagine typing hints in Google Maps). For that, FTS is used. GIN index is built on full addresses, and to get the full address table is climbed up about six times (hierarchy depth) by following parent_guid chain. We could materialize full addresses in the table and eliminate the need to form them in the index expression, but that would seriously increase amount of required storage -- GIN doesn't store indexed columns fully, and thus it is cheaper to 'materialize' full addresses inside it only. Surely this is a hack which cheats the system. 
We might imagine creating some functionality (a kind of index referring to multiple rows of the table -- or even rows of different tables) making it unnecessary, but such functionality doesn't exist today, and the hack is useful, if you understand the risk. >>> One might argue that function selecting from table can hardly be called >>> immutable, and immutability is required for index expressions. However, >>> if user is sure table contents doesn't change, why not? >> >>If the table contents never change, why are you doing VACUUM FULL on it? >> > > It's possible the columns referenced by the index expression are not > changing, but some additional columns are updated. Yeah. Also, the table can be CLUSTERed without VACUUM FULL. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From f5b9c433bf387a9ddbe318dfea2b96c02c4a945e Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Thu, 12 Sep 2019 17:35:16 +0300 Subject: [PATCH] ERROR out early on attempt to touch user indexes while they are being (re)built. The existing ReindexIsProcessingIndex check is consulted only in genam.c and thus enforced only for system catalogs. Check it also in the planner, so that indexes which are currently being rebuilt are never tried. Also arm SetReindexProcessing in index_create to defend against an index being used during its own creation. Without this, VACUUM FULL or just CREATE INDEX might fail with something like ERROR: could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes if there are indexes whose usage can be considered while those very indexes are being (re)built, i.e. when an index expression scans the indexed table. Such an error might seem scary, so catch this earlier.
--- src/backend/catalog/index.c| 22 -- src/backend/optimizer/util/plancat.c | 9 + src/test/regress/expected/create_index.out | 15 +++ src/test/regress/sql/create_index.sql | 13 + 4 files changed, 57 insertions(+), 2 deletions(-) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 3e1d40662d..5bc764ce46 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1174,7 +1174,22 @@ index_create(Relation heapRelation, } else { - index_build(heapRelation, indexRelation, indexInfo, false, true); + /* ensure SetReindexProcessing state isn't leaked */ + PG_TRY(); + { + /* Suppress use of the target index while building it */ + SetReindexProcessing(heapRelationId, indexRelationId); + + index_build(heapRelation, indexRelation, indexInfo, false, true); + } + PG_CATCH(); + { + /* Make sure flag gets cleared on error exit */ + ResetReindexProcessing(); + PG_RE_THROW(); + } + PG_END_TRY(); + ResetReindexProcessing(); } /* @@ -1379,7 +1394,10 @@ index_concurrentl
(Re)building index using itself or another index of the same table
Hi, Our customer encountered a curious scenario. They have a table with a GIN index on an expression which performs multiple joins with the table itself. These joins employ another btree index for efficiency. VACUUM FULL on this table fails with an error like ERROR: could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes It happens because the order in which indexes are rebuilt is not specified. The GIN index is being rebuilt while the btree index is not reconstructed yet; an attempt to use the old index against the rewritten heap fails. A problem of similar nature can be reproduced with the following stripped-down scenario:

CREATE TABLE pears(f1 int primary key, f2 int);
INSERT INTO pears SELECT i, i+1 FROM generate_series(1, 100) i;
CREATE OR REPLACE FUNCTION pears_f(i int) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
  SELECT f1 FROM pears WHERE pears.f2 = 42
$$;
CREATE INDEX ON pears ((pears_f(f1)));

Here, using the not-yet-created index on pears_f(f1) for its own construction is pointless; however, the planner in principle considers it in get_relation_info, tries to get the btree height (_bt_getrootheight) -- and fails. There is already a mechanism which prevents usage of indexes during reindex -- ReindexIsProcessingIndex et al. However, contrary to what the comment at index.c:3664 says, these protect only indexes on system catalogs, not user tables: the only real caller is genam.c. The attached patch extends it: the same check is added to get_relation_info. Also, SetReindexProcessing is armed in index_create to defend against an index being used during its own creation, as in the stripped-down example above. There are some other still-unprotected callers of index_build; concurrent index creation doesn't need it because the index is not indisvalid during the build, and in RelationTruncateIndexes the table is empty, so it looks like they can be omitted. One might argue that a function selecting from a table can hardly be called immutable, and immutability is required for index expressions.
However, if the user is sure the table contents don't change, why not? Also, the possibility of triggering the "could not read block" error with plain SQL is definitely not nice. >From 5942a3a5b2c90056119b9873c81f30dfa9e003af Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Thu, 12 Sep 2019 17:35:16 +0300 Subject: [PATCH] Avoid touching user indexes while they are being (re)built. The existing ReindexIsProcessingIndex check is consulted only in genam.c and thus enforced only for system catalogs. Check it also in the planner, so that indexes which are currently being rebuilt are never used. Also arm SetReindexProcessing in index_create to defend against an index being used during its own creation. Without this, VACUUM FULL or just CREATE INDEX might fail with something like ERROR: could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes if there are indexes whose usage can be considered while those very indexes are being (re)built, i.e. when an index expression scans the indexed table. --- src/backend/catalog/index.c| 22 -- src/backend/optimizer/util/plancat.c | 5 + src/test/regress/expected/create_index.out | 12 src/test/regress/sql/create_index.sql | 13 + 4 files changed, 50 insertions(+), 2 deletions(-) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 3e1d40662d..5bc764ce46 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1174,7 +1174,22 @@ index_create(Relation heapRelation, } else { - index_build(heapRelation, indexRelation, indexInfo, false, true); + /* ensure SetReindexProcessing state isn't leaked */ + PG_TRY(); + { + /* Suppress use of the target index while building it */ + SetReindexProcessing(heapRelationId, indexRelationId); + + index_build(heapRelation, indexRelation, indexInfo, false, true); + } + PG_CATCH(); + { + /* Make sure flag gets cleared on error exit */ + ResetReindexProcessing(); + PG_RE_THROW(); + } + PG_END_TRY(); + ResetReindexProcessing(); } /* @@ -1379,7 +1394,10 @@
index_concurrently_build(Oid heapRelationId, indexInfo->ii_Concurrent = true; indexInfo->ii_BrokenHotChain = false; - /* Now build the index */ + /* + * Now build the index + * SetReindexProcessing is not required since indisvalid is false anyway + */ index_build(heapRel, indexRelation, indexInfo, false, true); /* Close both the relations, but keep the locks */ diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index cf1761401d..9d58cd2574 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -27,6 +27,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/dependency.h" +#include "catalog/index.h" #include "catalog/heap.h" #include "catalog/pg_am.h" #include "catalog/p
Re: Parallel query vs smart shutdown and Postmaster death
Thomas Munro writes: > Just a thought: instead of the new hand-coded loop you added in > pmdie(), do you think it would make sense to have a new argument > "exclude_class_mask" for SignalSomeChildren()? If we did that, I > would consider renaming the existing parameter "target" to > "include_type_mask" to make it super clear what's going on. I thought this was a bit too complicated for a single use case, but if you like it better, here is an updated version. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Sun, 17 Mar 2019 07:42:18 +0300 Subject: [PATCH] Allow parallel workers while backends are alive in 'smart' shutdown. Since the postmaster doesn't advertise that it is never going to start parallel workers after a shutdown was issued, the parallel leader gets stuck waiting for them indefinitely. --- src/backend/postmaster/postmaster.c | 72 - 1 file changed, 56 insertions(+), 16 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index fe599632d3..2ad215b12d 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); static void signal_child(pid_t pid, int signal); -static bool SignalSomeChildren(int signal, int targets); +static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask); static void TerminateChildren(int signal); -#define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL) +#define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0) static int CountChildren(int target); static bool assign_backendlist_entry(RegisteredBgWorker *rw); @@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS) if (pmState == PM_RUN || pmState == PM_RECOVERY
|| pmState == PM_HOT_STANDBY || pmState == PM_STARTUP) { -/* autovac workers are told to shut down immediately */ -/* and bgworkers too; does this need tweaking? */ -SignalSomeChildren(SIGTERM, - BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER); +/* + * Shut down all bgws except parallel workers: graceful + * termination of existing sessions needs them. + * Autovac workers are also told to shut down immediately. + */ +SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); @@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS) signal_child(WalReceiverPID, SIGTERM); if (pmState == PM_STARTUP || pmState == PM_RECOVERY) { -SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER); +SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0); /* * Only startup, bgwriter, walreceiver, possibly bgworkers, @@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS) /* shut down all backends and workers */ SignalSomeChildren(SIGTERM, BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC | - BACKEND_TYPE_BGWORKER); + BACKEND_TYPE_BGWORKER, 0); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); @@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal) /* * Send a signal to the targeted children (but NOT special children; - * dead_end children are never signaled, either). + * dead_end children are never signaled, either). If BACKEND_TYPE_BGWORKER + * is in include_type_mask, bgworkers are additionally filtered out by + * exclude_bgw_mask. */ static bool -SignalSomeChildren(int signal, int target) +SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask) { dlist_iter iter; + slist_iter siter; bool signaled = false; + /* + * Handle bgws by walking over BackgroundWorkerList because we have + * to check the bgw class.
+ */ + if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0) + { + slist_foreach(siter, ) + { + RegisteredBgWorker *rw; + + rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur); + if (rw->rw_pid > 0 && +((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0)) + { +ereport(DEBUG4, + (errmsg_internal("sending signal %d to process %d", + signal, (int) rw->rw_pid))); +signal_child(rw->rw_pid, signal); +signaled = true; + } + } + } + dlist_foreach(iter, ) { Backend*bp = dlist_container(Backend, elem, iter.cur); - if (bp->dead_end) + if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER) continue; /* * Since target == BACKEND_TYPE_ALL is the most common case, we test * it first and avoid touching
Re: Parallel query vs smart shutdown and Postmaster death
Hi, Thomas Munro writes: > 1. Why does pmdie()'s SIGTERM case terminate parallel workers > immediately? That aborts running parallel queries, so they > don't get to end their work normally. > 2. Why are new parallel workers not allowed to be started while in > this state? That hangs future parallel queries forever, so they don't > get to end their work normally. > 3. Suppose we fix the above cases; should we do it for parallel > workers only (somehow), or for all bgworkers? It's hard to say since > I don't know what all bgworkers do. The attached patch fixes 1 and 2. As for 3, the only other internal bgworkers currently are the logical replication launcher and workers, which arguably should be killed immediately. > Here's an initial sketch of a > patch like that: it gives you "ERROR: parallel worker failed to > initialize" and "HINT: More details may be available in the server > log." if you try to run a parallel query. Reporting bgworkers that the postmaster is never going to start is probably worthwhile anyway, to prevent further potential deadlocks like this. However, I think there is a problem in your patch: we might be in post-PM_RUN states due to FatalError, not because of a shutdown. In that case, we shouldn't refuse to run bgws in the future. I would also merge the check into bgworker_should_start_now. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Sun, 17 Mar 2019 07:42:18 +0300 Subject: [PATCH] Allow parallel workers while backends are alive in 'smart' shutdown. Since the postmaster doesn't advertise that it is never going to start parallel workers after a shutdown was issued, the parallel leader gets stuck waiting for them indefinitely.
--- src/backend/postmaster/postmaster.c | 44 - 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index fe599632d3..d60969fdb8 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS) if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY || pmState == PM_STARTUP) { +slist_iter siter; + +/* + * Shut down all bgws except parallel workers: graceful + * termination of existing sessions needs them. + */ +slist_foreach(siter, ) +{ + RegisteredBgWorker *rw; + + rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur); + if (rw->rw_pid > 0 && + ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0)) + { + ereport(DEBUG4, +(errmsg_internal("sending signal %d to process %d", + SIGTERM, (int) rw->rw_pid))); + signal_child(rw->rw_pid, SIGTERM); + + } +} + /* autovac workers are told to shut down immediately */ -/* and bgworkers too; does this need tweaking? */ -SignalSomeChildren(SIGTERM, - BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER); +SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); @@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw) * specified start_time? */ static bool -bgworker_should_start_now(BgWorkerStartTime start_time) +bgworker_should_start_now(RegisteredBgWorker *rw) { + BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time; + switch (pmState) { case PM_NO_CHILDREN: case PM_WAIT_DEAD_END: case PM_SHUTDOWN_2: case PM_SHUTDOWN: + return false; + case PM_WAIT_BACKENDS: case PM_WAIT_READONLY: case PM_WAIT_BACKUP: - break; + /* + * Allow new parallel workers in SmartShutdown while backends + * are still there. 
+ */ + if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) || +Shutdown != SmartShutdown || +FatalError) +return false; + /* fall through */ case PM_RUN: if (start_time == BgWorkerStart_RecoveryFinished) @@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void) } } - if (bgworker_should_start_now(rw->rw_worker.bgw_start_time)) + if (bgworker_should_start_now(rw)) { /* reset crash time before trying to start worker */ rw->rw_crashed_at = 0; -- 2.11.0
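To summarize the patched behavior, here is a toy Python model (not the actual postmaster code; the pmState set is simplified and bgw_start_time handling is omitted) of the decision bgworker_should_start_now makes after this change:

```python
# Toy model of the patched bgworker_should_start_now() decision:
# during a smart shutdown, while backends are still alive
# (PM_WAIT_BACKENDS), only parallel workers may still be started,
# and only if no fatal error has occurred.
PM_RUN, PM_WAIT_BACKENDS, PM_SHUTDOWN, PM_NO_CHILDREN = range(4)
SMART_SHUTDOWN, NO_SHUTDOWN = "smart", "none"

def should_start(pm_state, shutdown, fatal_error, is_parallel):
    if pm_state in (PM_SHUTDOWN, PM_NO_CHILDREN):
        return False  # shutdown too far along for any worker
    if pm_state == PM_WAIT_BACKENDS:
        # Backends still alive: allow only parallel workers, and only
        # during a clean smart shutdown (not after a FatalError).
        return is_parallel and shutdown == SMART_SHUTDOWN and not fatal_error
    return pm_state == PM_RUN  # normal operation

# Parallel workers can still be started while sessions finish up...
assert should_start(PM_WAIT_BACKENDS, SMART_SHUTDOWN, False, True)
# ...but ordinary bgworkers, or anything after a crash, cannot.
assert not should_start(PM_WAIT_BACKENDS, SMART_SHUTDOWN, False, False)
assert not should_start(PM_WAIT_BACKENDS, SMART_SHUTDOWN, True, True)
```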
Re: Too rigorous assert in reorderbuffer.c
Alvaro Herrera writes: > Thanks for checking! I also ran it on all branches, everything passes. > Pushed now. I'm sorry to bother you with this again, but due to the new test our internal buildfarm revealed that the adjacent assert on cmin is also wrong. You see, we can't assume cmin is stable because the same key (relnode, tid) might refer to completely different tuples if a tuple was inserted by an aborted subxact, immediately reclaimed, and the space then occupied by another one. The fix is attached. Technically this might mean a user-facing bug, because we only pick the first cmin, which means we might get visibility wrong, allowing some version to be seen too early (i.e. the real cmin of the tuple is y, but decoding thinks it is x, and x < y). However, I couldn't quickly make up an example where this would actually lead to bad consequences. I tried to create such an extra visible row in pg_attribute, but that's ok because the loop in RelationBuildTupleDesc spins exactly natts times and ignores what is left unscanned. It is also ok with pg_class, because apparently ScanPgRelation also fishes out the (right) first tuple and doesn't check for duplicates appearing later in the scan. Maybe I just haven't tried hard enough, though. The attached 'aborted_subxact_test.patch' is an illustration of such wrong cmin visibility on pg_attribute. It triggers the assertion failure but is otherwise ok (no user-facing issues), as I said earlier, so I am disinclined to include it in the fix.
-- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 2b486b5e9f..505c7ff134 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1318,29 +1318,36 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) (void *) , HASH_ENTER | HASH_FIND, ); - if (!found) - { - ent->cmin = change->data.tuplecid.cmin; - ent->cmax = change->data.tuplecid.cmax; - ent->combocid = change->data.tuplecid.combocid; - } - else + if (found) { /* - * Maybe we already saw this tuple before in this transaction, - * but if so it must have the same cmin. + * Tuple with such key (relnode, tid) might be inserted in aborted + * subxact, space reclaimed and then filled with another tuple + * from our xact or another; then we might update it again. It + * means cmin can become valid/invalid at any moment, but valid + * values are always monotonic. */ - Assert(ent->cmin == change->data.tuplecid.cmin); + Assert((ent->cmin == InvalidCommandId) || + (change->data.tuplecid.cmin == InvalidCommandId) || + (change->data.tuplecid.cmin >= ent->cmin)); /* - * cmax may be initially invalid, but once set it can only grow, - * and never become invalid again. + * Practically the same for cmax; to see invalid cmax after + * valid, we need to insert new tuple in place of one reclaimed + * from our aborted subxact. */ Assert((ent->cmax == InvalidCommandId) || - ((change->data.tuplecid.cmax != InvalidCommandId) && - (change->data.tuplecid.cmax > ent->cmax))); - ent->cmax = change->data.tuplecid.cmax; + (change->data.tuplecid.cmax == InvalidCommandId) || + (change->data.tuplecid.cmax > ent->cmax)); } + + /* + * The 'right' (potentially belonging to committed xact) cmin and cmax + * are always the latest. 
+ */ + ent->cmin = change->data.tuplecid.cmin; + ent->cmax = change->data.tuplecid.cmax; + ent->combocid = change->data.tuplecid.combocid; } } diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 4afb1d963e..32c2650d1a 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -3,11 +3,11 @@ MODULES = test_decoding PGFILEDESC = "test_decoding - example of a logical decoding output plugin" -REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ +# REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ decoding_into_rel binary prepared replorigin time messages \ spill slot truncate ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ - oldest_xmin snapshot_transfer + oldest_xmin snapshot_transfer aborted_subxact REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf diff --git a/contrib/test_decoding/specs/aborted_subxact.spec b/contrib/test_decoding/specs/aborted_subxact.spec new file mode 100644 index 00..cb066fd5d2 --- /dev/null +++ b/contrib/test_decoding/specs/aborted_subxact.spec @@ -0,0 +1,35 @@ +# Test DDL in aborted subxact + +setup +{ +SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'tes
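The relaxed invariant asserted in the patch above can be modeled with a small Python sketch (illustrative only; None stands in for InvalidCommandId):

```python
# Toy model of the relaxed assertion: an entry's cmin/cmax may flip
# between valid and invalid when an aborted subxact's tuple slot is
# reused, but valid values may only move forward.
INVALID_COMMAND_ID = None

def update_entry(ent, new_cmin, new_cmax):
    """Apply a new tuplecid change, asserting the relaxed invariant."""
    assert (ent["cmin"] is INVALID_COMMAND_ID
            or new_cmin is INVALID_COMMAND_ID
            or new_cmin >= ent["cmin"])   # cmin: non-decreasing while valid
    assert (ent["cmax"] is INVALID_COMMAND_ID
            or new_cmax is INVALID_COMMAND_ID
            or new_cmax > ent["cmax"])    # cmax: strictly growing while valid
    # The latest values always win: they may belong to the committed xact.
    ent["cmin"], ent["cmax"] = new_cmin, new_cmax

# A sequence the old strict assert rejected: cmin valid, then invalid
# (the slot was reused by another tuple), then valid again.
ent = {"cmin": 5, "cmax": INVALID_COMMAND_ID}
update_entry(ent, INVALID_COMMAND_ID, INVALID_COMMAND_ID)
update_entry(ent, 7, 9)
print(ent)  # {'cmin': 7, 'cmax': 9}
```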
Re: Too rigorous assert in reorderbuffer.c
Alvaro Herrera writes: >> I thought the blanket removal of the assert() was excessive, and we can >> relax it instead; what do you think of the attached? > > More precisely, my question was: with this change, does the code still > work correctly in your non-toy case? Yes, it works. I thought for a moment that some obscure cases where cmax on a single tuple is not strictly monotonic might exist, but it looks like they don't. So your change is ok with me; reshaping the assert is better than removing it. make check is also good on all supported branches. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Too rigorous assert in reorderbuffer.c
Alvaro Herrera writes: > On 2019-Feb-06, Arseny Sher wrote: > >> >> Alvaro Herrera writes: >> >> > note the additional pg_temp_XYZ row in the middle. This is caused by >> > the rewrite in ALTER TABLE. Peter E fixed that in Pg11 in commit >> > 325f2ec55; I don't think there's much to do in the backbranches other >> > than hide the pesky record to avoid it breaking the test. >> >> Oh, I see. Let's just remove the first insertion then, as in attached. >> I've tested it on master and on 9.4. > > Ah, okay. Does the test still fail when run without the code fix? Yes. The problem here is the overriding of cmax of catalog tuples (pg_attribute in the test), so it fails without any data at all. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Too rigorous assert in reorderbuffer.c
Alvaro Herrera writes: > note the additional pg_temp_XYZ row in the middle. This is caused by > the rewrite in ALTER TABLE. Peter E fixed that in Pg11 in commit > 325f2ec55; I don't think there's much to do in the backbranches other > than hide the pesky record to avoid it breaking the test. Oh, I see. Let's just remove the first insertion then, as in attached. I've tested it on master and on 9.4. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From b34726d9b7565df73319a44664d9cd04de5f514f Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Wed, 30 Jan 2019 23:31:47 +0300 Subject: [PATCH] Remove assertion in reorderbuffer that cmax is stable. Since it can be rewritten arbitrary number of times if subxact(s) who tried to delete some tuple version abort later. Add small test involving DDL in aborted subxact as this case is interesting on its own and wasn't covered before. --- contrib/test_decoding/expected/ddl.out | 18 ++ contrib/test_decoding/sql/ddl.sql | 13 + src/backend/replication/logical/reorderbuffer.c | 10 ++ 3 files changed, 37 insertions(+), 4 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..2bd28e6d15 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -409,6 +409,24 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc COMMIT (6 rows) +-- check that DDL in aborted subtransactions handled correctly +CREATE TABLE tr_sub_ddl(data int); +BEGIN; +SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text; +INSERT INTO tr_sub_ddl VALUES ('blah-blah'); +ROLLBACK TO SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint; +INSERT INTO tr_sub_ddl VALUES(43); +COMMIT; +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); + data +-- + BEGIN + table public.tr_sub_ddl: INSERT: 
data[bigint]:43 + COMMIT +(3 rows) + /* * Check whether treating a table as a catalog table works somewhat */ diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index c4b10a4cf9..a55086443c 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -236,6 +236,19 @@ COMMIT; SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); +-- check that DDL in aborted subtransactions handled correctly +CREATE TABLE tr_sub_ddl(data int); +BEGIN; +SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text; +INSERT INTO tr_sub_ddl VALUES ('blah-blah'); +ROLLBACK TO SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint; +INSERT INTO tr_sub_ddl VALUES(43); +COMMIT; + +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); + /* * Check whether treating a table as a catalog table works somewhat diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index a49e226967..0ace0cfb87 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1327,13 +1327,15 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) else { Assert(ent->cmin == change->data.tuplecid.cmin); - Assert(ent->cmax == InvalidCommandId || - ent->cmax == change->data.tuplecid.cmax); /* * if the tuple got valid in this transaction and now got deleted - * we already have a valid cmin stored. The cmax will be - * InvalidCommandId though. + * we already have a valid cmin stored. The cmax is still + * InvalidCommandId though, so stamp it. Moreover, cmax might + * be rewritten arbitrary number of times if subxacts who tried to + * delete this version abort later. Pick up the latest since it is + * (belonging to potentially committed subxact) the only one which + * matters. 
*/ ent->cmax = change->data.tuplecid.cmax; } -- 2.11.0
Too rigorous assert in reorderbuffer.c
Hi, My colleague Alexander Lakhin has noticed an assertion failure at reorderbuffer.c:1330. Here is a simple snippet reproducing it:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
create table t(k int);
begin;
savepoint a;
alter table t alter column k type text;
rollback to savepoint a;
alter table t alter column k type bigint;
commit;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

The assert is indeed too opinionated, since cmax of a tuple is not stable; it can be rewritten if the subxact which tried to delete it later aborts (the analogy also holds for xmax). The attached patch removes it. While here, I also considered it worthwhile to add a test involving DDL in an aborted subxact, as this case is interesting anyway and wasn't covered before. >From c5cd30f9e23c96390cafad82f832c918dfd3397f Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Wed, 30 Jan 2019 23:31:47 +0300 Subject: [PATCH] Remove assertion in reorderbuffer that cmax is stable. Since it can be rewritten an arbitrary number of times if subxact(s) which tried to delete some tuple version abort later. Add a small test involving DDL in an aborted subxact as this case is interesting on its own and wasn't covered before.
--- contrib/test_decoding/expected/ddl.out | 20 contrib/test_decoding/sql/ddl.sql | 14 ++ src/backend/replication/logical/reorderbuffer.c | 10 ++ 3 files changed, 40 insertions(+), 4 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..69dd74676c 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -409,6 +409,26 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc COMMIT (6 rows) +-- check that DDL in aborted subtransactions handled correctly +CREATE TABLE tr_sub_ddl(data int); +BEGIN; +INSERT INTO tr_sub_ddl VALUES (42); +SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text; +INSERT INTO tr_sub_ddl VALUES ('blah-blah'); +ROLLBACK TO SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint; +INSERT INTO tr_sub_ddl VALUES(43); +COMMIT; +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); + data +--- + BEGIN + table public.tr_sub_ddl: INSERT: data[integer]:42 + table public.tr_sub_ddl: INSERT: data[bigint]:43 + COMMIT +(4 rows) + /* * Check whether treating a table as a catalog table works somewhat */ diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index c4b10a4cf9..f47b4ad716 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -236,6 +236,20 @@ COMMIT; SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); +-- check that DDL in aborted subtransactions handled correctly +CREATE TABLE tr_sub_ddl(data int); +BEGIN; +INSERT INTO tr_sub_ddl VALUES (42); +SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text; +INSERT INTO tr_sub_ddl VALUES ('blah-blah'); +ROLLBACK TO SAVEPOINT a; +ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint; +INSERT INTO tr_sub_ddl VALUES(43); +COMMIT; + 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); + /* * Check whether treating a table as a catalog table works somewhat diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index a49e226967..0ace0cfb87 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1327,13 +1327,15 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn) else { Assert(ent->cmin == change->data.tuplecid.cmin); - Assert(ent->cmax == InvalidCommandId || - ent->cmax == change->data.tuplecid.cmax); /* * if the tuple got valid in this transaction and now got deleted - * we already have a valid cmin stored. The cmax will be - * InvalidCommandId though. + * we already have a valid cmin stored. The cmax is still + * InvalidCommandId though, so stamp it. Moreover, cmax might + * be rewritten arbitrary number of times if subxacts who tried to + * delete this version abort later. Pick up the latest since it is + * (belonging to potentially committed subxact) the only one which + * matters. */ ent->cmax = change->data.tuplecid.cmax; } -- 2.11.0 -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
e one we are aborting, clear its reorder buffer as well. +*/ + if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid) + ReorderBufferAbort(ctx->reorder, xid, origin_lsn); What is the aim of this? How can the xl_xid of a commit prepared record be normal? + /* +* The transaction may or may not exist (during restarts for example). +* Anyways, 2PC transactions do not contain any reorderbuffers. So allow +* it to be created below. +*/ The code around looks sane, but I think that restarts are irrelevant to rbtxn existence at this moment: if we are going to COMMIT/ABORT PREPARED it, it must have been replayed and the rbtxn purged immediately after. The only reason the rbtxn can exist here is the invalidation addition (ReorderBufferAddInvalidations) happening a couple of calls earlier. Also, instead of the vague '2PC transactions do not contain any reorderbuffers' I would say something like 'create a dummy ReorderBufferTXN to pass to the callback'. - filter_prepare_cb callback existence is checked in both decode.c and in filter_prepare_cb_wrapper. Third patch: +/* + * An xid value pointing to a possibly ongoing or a prepared transaction. + * Currently used in logical decoding. It's possible that such transactions + * can get aborted while the decoding is ongoing. + */ I would explain here that this xid is checked for abort after each catalog scan, and refer to SetupHistoricSnapshot for the details. Nitpicking: First patch: I still don't think that these flags need a bitmask. Second patch: - I still think the ReorderBufferCommitInternal name is confusing and should be renamed to something like ReorderBufferReplay. /* Do we know this is a subxact? Xid of top-level txn if so */ TransactionId toplevel_xid; + /* In case of 2PC we need to pass GID to output plugin */ + char *gid; Better to add a newline here, as between the other fields. + txn->txn_flags |= RBTXN_PREPARE; + txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */ + strcpy(txn->gid, gid); pstrdup?
- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the same thing and should be merged, with comments explaining that the answer must be stable.

+     The optional commit_prepared_cb callback is called whenever
+     a commit prepared transaction has been decoded. The gid field,

A commit prepared transaction *record* has been decoded?

Fourth patch:

Applying: Teach test_decoding plugin to work with 2PC
.git/rebase-apply/patch:347: trailing whitespace.
-- test savepoints
.git/rebase-apply/patch:424: trailing whitespace.
# get XID of the above two-phase transaction
warning: 2 lines add whitespace errors.

[1] https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
Tomas Vondra writes:

> Hi Nikhil,
>
> Thanks for the updated patch - I've started working on a review, with
> the hope of getting it committed sometime in 2019-01. But the patch
> bit-rotted again a bit (probably due to d3c09b9b), which broke the last
> part. Can you post a fixed version?

Please also note that at some point the thread was torn and continued in another place:

https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

And now we have two branches =( I haven't checked whether my concerns were addressed in the latest version, though.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Global snapshots
o, I could not understand some notes from Arseny:

>> On 25 July 2018, at 16:35, Arseny Sher wrote:
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>>   supported. For READ COMMITTED, we must export every new snapshot
>>   generated on coordinator to all nodes, which is fairly easy to
>>   do. SERIALIZABLE will definitely require chattering between nodes,
>>   but that's much less demanded isolevel (e.g. we still don't support
>>   it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what anomalies does
> it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?

Yes, your explanation is concise and correct. To put it another way, ensuring SERIALIZABLE in MVCC requires tracking reads, and there is no infrastructure for doing that globally. Classical write skew is possible: you have node A holding x and node B holding y; initially x = y = 30, and there is a constraint x + y > 0. Two concurrent xacts start:

T1: x = x - 42;
T2: y = y - 42;

They don't see each other, so both commit successfully and the constraint is violated. We need to transfer info about reads between the nodes to know when we need to abort someone.

>> * Another somewhat serious issue is that there is a risk of recency
>>   guarantee violation. If client starts transaction at node with
>>   lagging clocks, its snapshot might not include some recently
>>   committed transactions; if client works with different nodes, she
>>   might not even see her own changes. CockroachDB describes at [1] how
>>   they and Google Spanner overcome this problem. In short, both set
>>   hard limit on maximum allowed clock skew. Spanner uses atomic
>>   clocks, so this skew is small and they just wait it at the end of
>>   each transaction before acknowledging the client.
>> In CockroachDB, if
>> tuple is not visible but we are unsure whether it is truly invisible
>> or it's just the skew (the difference between snapshot and tuple's
>> csn is less than the skew), transaction is restarted with advanced
>> snapshot. This process is not infinite because the upper border
>> (initial snapshot + max skew) stays the same; this is correct as we
>> just want to ensure that our xact sees all the committed ones before
>> it started. We can implement the same thing.
>
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.

The InDoubt state protects us from seeing an xact which is not yet committed everywhere, but it doesn't protect us from starting an xact on a node with lagging clocks, obtaining a plainly old snapshot. We won't see any half-committed data with it (InDoubt covers us here), but some recently committed xacts might not get into our old snapshot entirely.

>> * 003_bank_shared.pl test is removed. In current shape (loading one
>>   node) it is useless, and if we bombard both nodes, deadlock surely
>>   appears. In general, global snaphots are not needed for such
>>   multimaster-like setup -- either there are no conflicts and we are
>>   fine, or there is a conflict, in which case we get a deadlock.
>
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...

Sure, this is just a generalization of the basic deadlock problem to a distributed system. To deal with it, someone must periodically collect the locks across the nodes, build the waits-for graph, check it for loops, and punish (abort) one of the loop's members if a loop exists. InDoubt (and global snapshots in general) is unrelated to this; we hung on a usual row-level lock in this test.
BTW, our pg_shardman extension has a primitive deadlock detector [2]; I suppose Citus [3] also has one.

> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?

Oh, the subject of automatic resolution of 2PC xacts is a matter for another (probably many-miles-long) thread, largely unrelated to global snapshots/visibility. In general, the problem of distributed transaction commit which doesn't block while a majority of nodes is alive requires implementing a distributed consensus algorithm like Paxos or Raft. You might also find thread [4] interesting.

Thank you for your interest in the topic!

[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broa
Re: [HACKERS] logical decoding of two-phase transactions
snapshot to all RBTXNs, if it is important.

* Set the base snap of our xact if it did DDL, to execute invalidations during replay.

I see that we don't need to do the first two bullets during DecodePrepare: the xact's effects are still invisible to everyone but itself after PREPARE. As for xacts seeing their own changes, that is implemented via logging cmin/cmax, and we don't need to mark the xact as committed for that (cf. ReorderBufferCopySnap).

Regarding the third point... I think in 2PC decoding we might need to execute invalidations twice:

1) After replaying the xact on PREPARE, to forget about the catalog changes the xact did -- it is not yet committed and must be invisible to other xacts until COMMIT PREPARED. In the latest patchset invalidations are executed only if there is at least one change in the xact (it has a base snap). It looks fine: we can't spoil the catalogs if there was nothing to decode. Better to explain that somewhere.

2) After decoding COMMIT PREPARED, to make the changes visible. In the current patchset this is always done. Actually, *this* is the reason the RBTXN might already exist when we enter ReorderBufferFinishPrepared, not "(during restarts for example)" as the comment there says: if there were inval messages, the RBTXN will be created in DecodeCommit during their addition.

BTW, "that we might need to execute invalidations, add snapshot" in SnapBuildCommitTxn looks like a kludge to me: I suppose it is better to do that in ReorderBufferXidSetCatalogChanges.

Now, another issue is registering the xact as committed in SnapBuildCommitTxn during COMMIT PREPARED processing. Since the RBTXN is always purged after the xact is replayed on PREPARE, the only medium we have for noticing catalog changes during COMMIT PREPARED is the invalidation messages attached to the CP record. This raises the following question.

* If there is a guarantee that whenever an xact makes catalog changes it generates invalidation messages, then this code is fine.
However, currently ReorderBufferXidSetCatalogChanges is also called on XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which is useless if such a guarantee exists.

* If, on the other hand, there is no such guarantee, this code is broken.

>> - I am one of those people upthread who don't think that converting
>> flags to bitmask is beneficial -- especially given that many of them
>> are mutually exclusive, e.g. xact can't be committed and aborted at
>> the same time. Apparently you have left this to the committer though.
>
> Hmm, there seems to be divided opinion on this. I am willing to go
> back to using the booleans if there's opposition and if the committer
> so wishes. Note that this patch will end up adding 4/5 more booleans
> in that case (we add new ones for prepare, commit prepare, abort,
> rollback prepare etc).

Well, you can unite the mutually exclusive fields into one enum or char, with macros defining the possible values. A transaction can't be committed and aborted at the same time, etc.

>> + row. The change_cb callback may access system or
>> + user catalog tables to aid in the process of outputting the row
>> + modification details. In case of decoding a prepared (but yet
>> + uncommitted) transaction or decoding of an uncommitted transaction, this
>> + change callback is ensured sane access to catalog tables regardless of
>> + simultaneous rollback by another backend of this very same transaction.
>>
>> I don't think we should explain this, at least in such words. As
>> mentioned upthread, we should warn about allowed systable_* accesses
>> instead. Same for message_cb.
>
> Looks like you are looking at an earlier patchset. The latest patchset
> has removed the above.

I see, sorry.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
Andres Freund writes:

>> - On decoding of aborted xacts. The idea to throw an error once we
>> detect the abort is appealing, however I think you will have problems
>> with subxacts in the current implementation. What if subxact issues
>> DDL and then aborted, but main transaction successfully committed?
>
> I don't see a fundamental issue here. I've not reviewed the current
> patchset meaningfully, however. Do you see a fundamental issue here?

Hmm, yes, this is not an issue for this patch, because after reading the PREPARE record we know all the aborted subxacts and won't try to decode their changes. However, this will come up again once we decide to decode in-progress transactions. Checking all subxids is expensive; moreover, WAL doesn't provide all of them until commit... it might be easier to prevent vacuuming of aborted stuff while decoding needs it. A matter for another patch, anyway.

>> - Currently we will call abort_prepared cb even if we failed to actually
>> prepare xact due to concurrent abort. I think it is confusing for
>> users. We should either handle this by remembering not to invoke
>> abort_prepared in these cases or at least document this behaviour,
>> leaving this problem to the receiver side.
>
> What precisely do you mean by "concurrent abort"?

With the current patch, the following is possible:

* We start decoding some prepared xact;
* The xact aborts (ABORT PREPARED) for any reason;
* The decoding process notices this on a catalog scan and calls the abort() callback;
* Later, the decoding process reads the abort record and calls the abort_prepared callback.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
ht be worthwhile to put the check

+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&

which appears 3 times now into a separate function.

+	 * two-phase transactions - we either have to have all of them or none.
+	 * The filter_prepare callback is optional, but can only be defined when

Kind of controversial (all of them or none, but optional); this might be formulated more accurately.

+	/*
+	 * Capabilities of the output plugin.
+	 */
+	bool		enable_twophase;

I would rename this to 'supports_twophase', since this is not an option but a description of the plugin's capabilities.

+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepared callback.")));

I don't think we need to check that...

+	 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+	 * (for regular ones).
+	 */
+	if (rbtxn_rollback(txn))
+		rb->abort(rb, txn, commit_lsn);

This is dead code, since we don't have decoding of in-progress xacts yet.

Third patch:

+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each catalog scan, and point the reader to SetupHistoricSnapshot for the details.

+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));

Probably centralize these checks in one function?
As well as the 'We don't expect direct calls to heap_fetch...' ones.

P.S. It looks like you have torn the thread chain: the In-Reply-To header of mail [1] is missing. Please don't do that.

[1] https://www.postgresql.org/message-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Global snapshots
Hello,

I have looked through the patches and found them pretty accurate. I have fixed a lot of small issues here and there; the updated patchset is attached. But first, some high-level notes:

* I agree that it would be cool to implement functionality like the current "snapshot too old": that is, abort a transaction with an old global snapshot only if it really attempted to touch modified data.

* I also agree with Stas that any attempt to trade oldestxmin in gossip between the nodes would drastically complicate this patch and make it discussion-prone; it would be nice first to get some feedback on the general approach, especially from people trying to distribute Postgres.

* One drawback of these patches is that only REPEATABLE READ is supported. For READ COMMITTED, we must export every new snapshot generated on the coordinator to all nodes, which is fairly easy to do. SERIALIZABLE will definitely require chattering between nodes, but that's a much less demanded isolation level (e.g. we still don't support it on replicas).

* Another somewhat serious issue is that there is a risk of recency guarantee violation. If a client starts a transaction at a node with lagging clocks, its snapshot might not include some recently committed transactions; if the client works with different nodes, she might not even see her own changes. CockroachDB describes at [1] how they and Google Spanner overcome this problem. In short, both set a hard limit on the maximum allowed clock skew. Spanner uses atomic clocks, so this skew is small and they just wait it out at the end of each transaction before acknowledging the client. In CockroachDB, if a tuple is not visible but we are unsure whether it is truly invisible or it's just the skew (the difference between the snapshot and the tuple's csn is less than the skew), the transaction is restarted with an advanced snapshot.
This process is not infinite because the upper border (initial snapshot + max skew) stays the same; this is correct, as we just want to ensure that our xact sees all the transactions committed before it started. We can implement the same thing.

Now, the description of my mostly cosmetic changes:

* Don't ERROR in BroadcastStmt, to allow us to handle failure manually;
* Check global_snapshot_defer_time in ImportGlobalSnapshot instead of failing on an assert;
* (Arguably) improved comments around locking at circular buffer maintenance; also, don't lock procarray during the global_snapshot_xmin bump.
* s/snaphot/snapshot, other typos.
* Don't track_global_snapshots by default -- while handy for testing, it doesn't look generally good.
* Set track_global_snapshots = true in tests everywhere.
* GUC renamed from postgres_fdw.use_tsdtm to postgres_fdw.use_global_snapshots for consistency.
* 003_bank_shared.pl test is removed. In its current shape (loading one node) it is useless, and if we bombard both nodes, a deadlock surely appears. In general, global snapshots are not needed for such a multimaster-like setup -- either there are no conflicts and we are fine, or there is a conflict, in which case we get a deadlock.
* Fix initdb failure with non-zero global_snapshot_defer_time.
* Enforce REPEATABLE READ, since currently we export the snap only once per xact.
* Remove the assertion that circular buffer entries are monotonic, as GetOldestXmin *can* go backwards.
[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 21687e75366df03b92e48c6125bb2e90d01bb70a Mon Sep 17 00:00:00 2001
From: Stas Kelvich
Date: Wed, 25 Apr 2018 16:05:46 +0300
Subject: [PATCH 1/3] GlobalCSNLog SLRU

---
 src/backend/access/transam/Makefile         |   3 +-
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c       |   1 +
 src/backend/access/transam/varsup.c         |   2 +
 src/backend/access/transam/xlog.c           |  12 +
 src/backend/storage/ipc/ipci.c              |   3 +
 src/backend/storage/ipc/procarray.c         |   3 +
 src/backend/storage/lmgr/lwlocknames.txt    |   1 +
 src/backend/utils/misc/guc.c                |   9 +
 src/backend/utils/probes.d                  |   2 +
 src/bin/initdb/initdb.c                     |   3 +-
 src/include/access/global_csn_log.h         |  30 ++
 src/include/storage/lwlock.h                |   1 +
 src/include/utils/snapshot.h                |   3 +
 14 files changed, 510 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..03aa360ea3 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_
Re: Possible bug in logical replication.
Michael Paquier writes:

> Okay, let's do as you suggest then. Do you find the attached adapted?

Yes, thanks!

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: pgsql: Fix "base" snapshot handling in logical decoding
Arseny Sher writes:

> There is also one thing that puzzles me as I don't know much about
> vacuum internals. If I do plain VACUUM of pg_attribute in the test, it
> shouts "catalog is missing 1 attribute(s) for relid" error (which is
> quite expected), while with 'VACUUM FULL pg_attribute' the tuple is
> silently (and wrongly, with dropped column missing) decoded. Moreover,
> if I perform the test manually, and do 'VACUUM FULL;', sometimes test
> becomes useless -- that is, tuple is successfully decoded with all three
> columns, as though VACUUM was not actually executed. All this is without
> the main patch, of course. I think I will look into this soon.

So I have been jumping around this and learned a few curious things.

1) The test in its current shape sometimes doesn't fulfill its aim indeed -- that is, despite the issued VACUUM, the tuple is still decoded with all three fields. This happens because during the attempt to advance xmin there is a good possibility of encountering an xl_running_xacts record logged before our CHECKPOINT (they are logged every 15 seconds). We do not try to advance xmin twice without client acknowledgment, so in this case xmin will not be advanced far enough to allow vacuum to prune the entry from pg_attribute.

2) This is not easy to notice, because often (but not always) the explicit VACUUM is not needed at all: the tuple is often pruned by microvacuum (heap_page_prune_opt) right in the final decoding session. If we hadn't bumped xmin far enough during the previous get_changes, we do that now, so microvacuum actually purges the entry. But if we were so unfortunate that 1) an extra xl_running_xacts was logged and 2) microvacuum was in a bad mood and didn't come, pg_attribute is not vacuumed and the test becomes useless.

To make this bulletproof, in the attached patch I doubled the first get_changes: now there are two client acks, so our VACUUM always does the job.

3) As a side note, the answer to my question 'why do we get different errors with VACUUM and VACUUM FULL' is the following.
With VACUUM FULL, not only is the old pg_attribute entry pruned, but also the xmin of the new entry with attisdropped=true is reset to the frozen xid. This means that the decoding session (RelationBuildTupleDesc) actually sees 3 attributes, and the fact that one of them is dropped doesn't embarrass this function (apparently relnatts in pg_class is never decremented) -- we just go ahead and decode only the live attributes.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out
index d09342c4be..d1b4f17e3a 100644
--- a/contrib/test_decoding/expected/oldest_xmin.out
+++ b/contrib/test_decoding/expected/oldest_xmin.out
@@ -1,6 +1,6 @@
 Parsed test spec with 2 sessions
 
-starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s1_commit s0_vacuum s0_get_changes
+starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s0_get_changes s1_commit s0_vacuum s0_get_changes
 step s0_begin: BEGIN;
 step s0_getxid: SELECT txid_current() IS NULL;
 ?column?
@@ -14,8 +14,11 @@ step s0_checkpoint: CHECKPOINT;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data
 
+step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data
+
 step s1_commit: COMMIT;
-step s0_vacuum: VACUUM FULL;
+step s0_vacuum: VACUUM pg_attribute;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data
diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec
index 4f8af70aa2..141fe2b145 100644
--- a/contrib/test_decoding/specs/oldest_xmin.spec
+++ b/contrib/test_decoding/specs/oldest_xmin.spec
@@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; }
 step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; }
 step "s0_commit" { COMMIT; }
 step "s0_checkpoint" { CHECKPOINT; }
-step "s0_vacuum" { VACUUM FULL; }
+step "s0_vacuum" { VACUUM pg_attribute; }
 step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
 
 session "s1"
@@ -30,8 +30,11 @@ step "s1_begin" { BEGIN; }
 step "s1_insert" { INSERT INTO harvest VALUES ((1, 2, 3)); }
 step "s1_commit" { COMMIT; }
 
-# Checkpoint with following get_changes forces to advance xmin. ALTER of a
+# Checkpoint with following get_changes forces xmin a
Re: pgsql: Fix "base" snapshot handling in logical decoding
Tom Lane writes:

> Alvaro Herrera writes:
>>> On 2018-Jun-28, Tom Lane wrote:
>>>> According to buildfarm member friarbird, and as confirmed here,
>>>> the contrib/test_decoding/specs/oldest_xmin.spec test added by this
>>>> commit fails under CLOBBER_CACHE_ALWAYS.
>
>> I suppose 60 seconds (isolationtester's default timeout) is just not
>> enough time for those machines. We could increase it to 180 seconds and
>> see if that's enough to make them pass ...
>
> What I want to know is why this test is doing a database-wide VACUUM FULL
> in the first place. If that isn't profligate wastage of testing cycles,
> why not?
>
> 			regards, tom lane

Oh, that's my fault. I think just VACUUM pg_attribute is enough there -- it takes 42 ms on my laptop with CLOBBER_CACHE_ALWAYS ('VACUUM FULL' occupies 1 minute); a patch is attached. The test still steadily fails without the main patch.

There is also one thing that puzzles me, as I don't know much about vacuum internals. If I do a plain VACUUM of pg_attribute in the test, it shouts a "catalog is missing 1 attribute(s) for relid" error (which is quite expected), while with 'VACUUM FULL pg_attribute' the tuple is silently (and wrongly, with the dropped column missing) decoded. Moreover, if I perform the test manually and do 'VACUUM FULL;', sometimes the test becomes useless -- that is, the tuple is successfully decoded with all three columns, as though VACUUM was not actually executed. All this is without the main patch, of course. I think I will look into this soon.
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out
index d09342c4be..a5d7d3c2ca 100644
--- a/contrib/test_decoding/expected/oldest_xmin.out
+++ b/contrib/test_decoding/expected/oldest_xmin.out
@@ -15,7 +15,7 @@ step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slo
 data
 
 step s1_commit: COMMIT;
-step s0_vacuum: VACUUM FULL;
+step s0_vacuum: VACUUM pg_attribute;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data
diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec
index 4f8af70aa2..cd2b69cae2 100644
--- a/contrib/test_decoding/specs/oldest_xmin.spec
+++ b/contrib/test_decoding/specs/oldest_xmin.spec
@@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; }
 step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; }
 step "s0_commit" { COMMIT; }
 step "s0_checkpoint" { CHECKPOINT; }
-step "s0_vacuum" { VACUUM FULL; }
+step "s0_vacuum" { VACUUM pg_attribute; }
 step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
 
 session "s1"
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
Alvaro Herrera writes:

> I'm struggling with this assert. I find that test_decoding passes tests
> with this assertion in branch master, but fails in 9.4 - 10. Maybe I'm
> running the tests wrong (no assertions in master?) but I don't see it.
> It *should* fail ...

Your v3 patch fails for me on the freshest master (4d54543efa) in exactly this assert; it looks like there is something wrong in your setup.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
Alvaro Herrera writes:

> Firstly -- this is top-notch detective work, kudos and thanks for the
> patch and test cases. (I verified that both fail before the code fix.)

Thank you!

> Here's a v3. I applied a lot of makeup in order to try to understand
> what's going on. I *think* I have a grasp on the original code and your
> bugfix, not terribly firm I admit.

Well, when I started digging into it, I found the logical decoding code somewhat underdocumented. If you or any other committer would consider a patch trying to improve the comments, I can prepare one.

> * you don't need to Assert that things are not NULL if you're
> immediately going to dereference them.

Indeed.

> * I think setting snapshot_base to NULL in ReorderBufferCleanupTXN is
> pointless, since the struct is gonna be freed shortly afterwards.

Yeah, but it is the general style of the original code, e.g. see the /* cosmetic... */ comments. It probably makes the picture a bit more aesthetic and consistent, a kind of final chord, though I don't mind removing it.

> * I rewrote many comments (both existing and some of the ones your patch
> adds), and added lots of comments where there were none.

Now the code is nicer, thanks.

> * I'm a bit unsure about the comment atop ReorderBufferSetBaseSnapshot.
> Obviously, the bit within the #if 0/#endif I'm going to remove before
> push. It looks like you've started editing that bit and didn't finish.
> I don't understand why it says "Needs to be called before any
> changes are added with ReorderBufferQueueChange"; but if you edit that
> function and add an assert that the base snapshot is set, it crashes
> pretty quickly in the test_decoding tests. (The supposedly bogus
> comment was there before your patch -- I'm not saying your comment
> addition is faulty.)

That's because REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID changes are queued whenever we have read that the xact has modified the catalog, regardless of base snapshot existence.
Even if there are no changes yet, we will need it for correct visibility of the catalog, so I am inclined to remove the assert and the comment, or to rephrase the latter with 'any *data-carrying* changes'.

> * I also noticed that we're doing subtxn cleanup one by one in both
> ReorderBufferAssignChild and ReorderBufferCommitChild, which means the
> top-level txn is sought in the hash table over and over, which seems a
> bit silly. Not this patch's problem to fix ...

There is already a one-entry cache in ReorderBufferTXNByXid. We could add a 'don't cache me' flag to it and use it with subxacts, or simply pull the top-level reorderbuffer lookup out of the loops.

I'm fine with the rest of your edits. One more little comment:

@@ -543,9 +546,10 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	ReorderBufferTXN *txn;
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+	Assert(txn->base_snapshot != NULL || txn->is_known_as_subxact);
 
 	change->lsn = lsn;
-	Assert(InvalidXLogRecPtr != lsn);
+	Assert(!XLogRecPtrIsInvalid(lsn));
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries++;
 	txn->nentries_mem++;

Since we do that, probably we should replace all lsn validity checks with XLogRecPtrIsInvalid?

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
Hello,

Thank you for your time.

Alvaro Herrera writes:

> On 2018-Jun-11, Antonin Houska wrote:
>
>> One comment about the coding: since you introduce a new transaction list and
>> it's sorted by LSN, I think you should make the function AssertTXNLsnOrder()
>> more generic and use it to check the by_base_snapshot_lsn list too. For
>> example call it after the list insertion and deletion in
>> ReorderBufferPassBaseSnapshot().

Ok, this probably makes sense. An updated patch is attached.

>> Also I think it makes no harm if you split the diff into two, although both
>> fixes use the by_base_snapshot_lsn field.
>
> Please don't.

Yeah, I don't think we should bother with that.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 61b4b0a89c95f8912729c54c013b64033f88fccf Mon Sep 17 00:00:00 2001
From: Arseny Sher
Date: Thu, 21 Jun 2018 10:29:07 +0300
Subject: [PATCH] Fix slot's xmin advancement and subxact's lost snapshots in
 decoding

There are two related bugs here. First, the xmin of logical slots was advanced too early. During xl_running_xacts processing, the slot's xmin was set to the oldest running xid in the record, which is wrong: snapshots which will be used for xacts not yet replayed at this moment might actually consider older xacts as running too. The problem wasn't noticed earlier because DDL which allows deleting a tuple (setting xmax) while some other not-yet-committed xact looks at it is pretty rare, if not unique: e.g. all forms of ALTER TABLE which change the schema acquire an ACCESS EXCLUSIVE lock conflicting with any inserts. The included test case uses ALTER of a composite type, which doesn't have such interlocking.

To deal with this, we must be able to quickly retrieve the oldest xmin (the oldest running xid among all assigned snapshots) from the ReorderBuffer.
The proposed fix adds yet another list of ReorderBufferTXNs to it, where xacts are sorted by LSN of the base snapshot, because * Plain usage of the current ReorderBufferGetOldestTXN (to get the base snapshot's xmin) is wrong, because order by LSN of the first record is *not* the same as order by LSN of the record on which the base snapshot was assigned: many record types don't need the snapshot, e.g. xl_xact_assignment. We could remember the snapbuilder's xmin during processing of the first record, but that's too conservative and we would keep more tuples from vacuuming than needed. * On the other hand, we can't fully replace the existing order by LSN of the first change with LSN of the base snapshot, because if we do that, WAL records which didn't force receipt of a base snap might be recycled before we replay the xact, as ReorderBufferGetOldestTXN determines which LSN still must be kept. The second issue concerns subxacts. Currently SnapBuildDistributeNewCatalogSnapshot never passes a snapshot to a known subxact. It means that if the toplevel doesn't have a base snapshot (never did writes), there was an assignment record (the subxact is known), and the subxact is writing, no new snapshots are being queued. Probably the simplest fix would be to employ the freshly baked by_base_snapshot_lsn list and just distribute the snapshot to all xacts with a base snapshot whether they are subxacts or not. However, in this case we would queue an unnecessary snapshot to all already known subxacts. To avoid this, in this patch the base snapshot of a known subxact is immediately passed to the toplevel as soon as we learn the kinship / the base snap is assigned -- we have to do that anyway in ReorderBufferCommitChild; new snaps are distributed only to top-level xacts (or unknown subxacts) as before. Also, a minor memory leak is fixed: the counter of the toplevel's old base snapshot was not decremented when the snap was passed from a child.
--- contrib/test_decoding/Makefile | 3 +- contrib/test_decoding/expected/oldest_xmin.out | 27 +++ .../test_decoding/expected/subxact_snap_pass.out | 49 + contrib/test_decoding/specs/oldest_xmin.spec | 37 contrib/test_decoding/specs/subxact_snap_pass.spec | 42 + src/backend/replication/logical/reorderbuffer.c| 208 +++-- src/backend/replication/logical/snapbuild.c| 15 +- src/include/replication/reorderbuffer.h| 15 +- 8 files changed, 330 insertions(+), 66 deletions(-) create mode 100644 contrib/test_decoding/expected/oldest_xmin.out create mode 100644 contrib/test_decoding/expected/subxact_snap_pass.out create mode 100644 contrib/test_decoding/specs/oldest_xmin.spec create mode 100644 contrib/test_decoding/specs/subxact_snap_pass.spec diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 1d601d8144..05d7766cd5 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -50,7 +50,8 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install $(pg_regress_installcheck) \ $(REGRESS
Re: Possible bug in logical replication.
Michael Paquier writes: > On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote: >> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote: >> It seems to me that we still want to have the slot forwarding finish in >> this case even if this is interrupted. Petr, isn't that the intention >> here? > > I have been chewing a bit more on the proposed patch, finishing with the > attached to close the loop. Thoughts? Sorry for being pedantic, but it seems to me worthwhile to mention *why* we need decoding machinery at all -- like I wrote: + * While we could just do LogicalConfirmReceivedLocation updating + * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn + * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples). Also, > * The slot's restart_lsn is used as start point for reading records, This is clearly seen from the code, I propose to remove this. > * while confirmed_lsn is used as base point for the decoding context. And as I wrote, this doesn't matter as changes are not produced. > * The LSN position to move to is checked by doing a per-record scan and > * logical decoding which makes sure that confirmed_lsn is updated to a > * LSN which allows the future slot consumer to get consistent logical > - * changes. > + * changes. As decoding is done with fast_forward mode, no changes are > + * actually generated. confirmed_lsn is always updated to `moveto` unless we run out of WAL earlier (and unless we try to move slot backwards, which is obviously forbidden) -- consistent changes are practically irrelevant here. Moreover, we can directly set confirmed_lsn and still have consistent changes further as restart_lsn and xmin of the slot are not touched. What we actually do here is trying to advance *restart_lsn and xmin* as far as we can but up to the point which ensures that decoding can assemble a consistent snapshot allowing to fully decode all COMMITs since updated `confirmed_flush_lsn`. 
All this happens in SnapBuildProcessRunningXacts. > It seems to me that we still want to have the slot forwarding finish in > this case even if this is interrupted. Petr, isn't that the intention > here? Probably, though I am not sure what is the point of this. Ok, I keep this check in the updated (with your comments) patch and CC'ing Petr. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index 61588d626f..76bafca41c 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -335,11 +335,14 @@ CreateInitDecodingContext(char *plugin, * The LSN at which to start decoding. If InvalidXLogRecPtr, restart * from the slot's confirmed_flush; otherwise, start from the specified * location (but move it forwards to confirmed_flush if it's older than - * that, see below). + * that, see below). Doesn't matter in fast_forward mode. * * output_plugin_options * contains options passed to the output plugin. * + * fast_forward + * bypasses the generation of logical changes. + * * read_page, prepare_write, do_write, update_progress * callbacks that have to be filled to perform the use-case dependent, * actual work. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..0a4985ef8c 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -341,12 +341,10 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) /* * Helper function for advancing logical replication slot forward. - * The slot's restart_lsn is used as start point for reading records, - * while confirmed_lsn is used as base point for the decoding context. 
- * The LSN position to move to is checked by doing a per-record scan and - * logical decoding which makes sure that confirmed_lsn is updated to a - * LSN which allows the future slot consumer to get consistent logical - * changes. + * While we could just do LogicalConfirmReceivedLocation updating + * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn + * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples). + * We do it in special fast_forward mode without actual replay. */ static XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto) @@ -358,7 +356,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) PG_TRY(); { - /* restart at slot's confirmed_flush */ ctx = CreateDecodingContext(InvalidXLogRecPtr, NIL, true, @@ -388,10 +385,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) */ startlsn = InvalidXLogRecPtr; - /* - * The {begin_txn,change,commit_txn}_wrapper callbacks above will - * store the description into our tuplestore. - */ + /* Changes are
Re: Possible bug in logical replication.
Alvaro Herrera writes: > Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that > Michaël's commit fixes the reported bug? I confirm that starting reading WAL since restart_lsn as implemented in f731cfa fixes this issue, as well as the second issue tushar mentioned at [1]. I think that the code still can be improved a bit though -- consider the attached patch: * pg_logical_replication_slot_advance comment was not very informative and even a bit misleading: it said that we use confirmed_flush_lsn and restart_lsn, but didn't explain why. * Excessive check in its main loop. * Copy-paste comment fix. [1] https://www.postgresql.org/message-id/5f85bf41-098e-c4e1-7332-9171fef57a0a%40enterprisedb.com -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From d8ed8ae3eec54b716d7dbb35379d0047a96c6c75 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Fri, 15 Jun 2018 18:11:17 +0300 Subject: [PATCH] Cosmetic review for f731cfa. * pg_logical_replication_slot_advance comment was not very informative and even a bit misleading: it said that we use confirmed_flush_lsn and restart_lsn, but didn't explain why. * Excessive check in its main loop. * Copy-paste comment fix. --- src/backend/replication/slotfuncs.c | 25 ++--- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..597e81f4d9 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -341,24 +341,26 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) /* * Helper function for advancing logical replication slot forward. - * The slot's restart_lsn is used as start point for reading records, - * while confirmed_lsn is used as base point for the decoding context. 
- * The LSN position to move to is checked by doing a per-record scan and - * logical decoding which makes sure that confirmed_lsn is updated to a - * LSN which allows the future slot consumer to get consistent logical - * changes. + * While we could just do LogicalConfirmReceivedLocation updating + * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn allowing + * to recycle WAL and old catalog tuples. We do it in special fast_forward + * mode without actual replay. */ static XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto) { LogicalDecodingContext *ctx; ResourceOwner old_resowner = CurrentResourceOwner; + /* + * Start reading WAL at restart_lsn, which certainly points to the valid + * record. + */ XLogRecPtr startlsn = MyReplicationSlot->data.restart_lsn; XLogRecPtr retlsn = MyReplicationSlot->data.confirmed_flush; PG_TRY(); { - /* restart at slot's confirmed_flush */ + /* start_lsn doesn't matter here, we don't replay xacts at all */ ctx = CreateDecodingContext(InvalidXLogRecPtr, NIL, true, @@ -388,17 +390,10 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) */ startlsn = InvalidXLogRecPtr; - /* - * The {begin_txn,change,commit_txn}_wrapper callbacks above will - * store the description into our tuplestore. - */ + /* Changes are not actually produced in fast_forward mode. */ if (record != NULL) LogicalDecodingProcessRecord(ctx, ctx->reader); - /* Stop once the moving point wanted by caller has been reached */ - if (moveto <= ctx->reader->EndRecPtr) -break; - CHECK_FOR_INTERRUPTS(); } -- 2.11.0
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
Michael Paquier <mich...@paquier.xyz> writes: > but for now I would recommend to register this patch to the next > commit fest under the category "Bug Fixes" so as it does not fall into > the cracks: https://commitfest.postgresql.org/18/ > > I have added an entry to the open items in the section for older bugs: > https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Older_Bugs > However this list tends to be... Er... Ignored. > Thank you, Michael. I have created the commitfest entry. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Possible bug in logical replication.
Michael Paquier <mich...@paquier.xyz> writes: > Maybe I am being too naive, but wouldn't it be just enough to update the > confirmed flush LSN to ctx->reader->ReadRecPtr? This way, the slot > advances up to the beginning of the last record where user wants to > advance, and not the beginning of the next record: The same problem would then have to be handled at pg_logical_slot_get_changes_guts and in the apply worker feedback; besides, there is a convention that all commits since confirmed_flush must be decoded, so we would risk decoding such a boundary commit twice. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Possible bug in logical replication.
Hello, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes: > restart_lsn stays at the beginning of a transaction until the > transaction ends so just using restart_lsn allows repeated > decoding of a transaction, in short, rewinding occurs. The > function works only for inactive slot so the current code works > fine on this point. Sorry, I do not follow. restart_lsn is advanced whenever there is a consistent snapshot dumped (in xl_running_xacts) which is old enough to wholly decode all xacts not yet confirmed by the client. Could you please elaborate on what's wrong with that? > Addition to that restart_lsn also can be on a > page bounary. Do you have an example of that? restart_lsn is set initially to the WAL insert position in ReplicationSlotReserveWal, and later it always points to an xl_running_xacts record with a consistent snapshot dumped. > So directly set ctx->reader->EndRecPtr by startlsn fixes the > problem, but I found another problem here. There is a minor issue with the patch. Now slot advancement hangs polling for new WAL in my example from [1]; most probably because we must exit the loop when ctx->reader->EndRecPtr == moveto. > The function accepts any LSN even if it is not at the begiining > of a record. We will see errors or crashs or infinite waiting or > maybe any kind of trouble by such values. The moved LSN must > always be at the "end of a record" (that is, at the start of the > next recored). The attached patch also fixes this. Indeed, but we have these problems only if we are trying to read WAL from confirmed_flush. [1] https://www.postgresql.org/message-id/873720e4hf.fsf%40ars-thinkpad -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Possible bug in logical replication.
Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes: > I think that using restart_lsn instead of confirmed_flush is not right > approach. > If restart_lsn is not available and confirmed_flush is pointing to page > boundary, then in any case we should somehow handle this case and adjust > startlsn to point on the valid record position (by jjust adding page header > size?). Well, restart_lsn is always available on a live slot: it is initially set in ReplicationSlotReserveWal during slot creation. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Indexes on partitioned tables and foreign partitions
Simon Riggs <si...@2ndquadrant.com> writes: > Indexes on foreign tables cause an ERROR, so yes, we already just > don't create them. > > You're suggesting silently skipping the ERROR. I can't see a reason for that. True, I was inaccurate. I mean that index propagation is a nice feature, and making it available for a mix of foreign and local partitions, skipping the foreign ones, is a useful thing for users who understand the consequences (of course there should be a warning in the docs -- my patch was just an illustration of the idea). As I said, FDW-based sharding often involves a mix of foreign and local partitions, even in simple manual forms. Imagine that you have a table which is too big to fit on one server, so you partition it and move some partitions to another machine; you would still like to access the whole table to avoid manual tuple routing. Or, you partition a table by timestamps and keep recent data on a fast primary server, gradually moving old partitions to another box. Again, it would be nice to access the table as a whole to e.g. calculate some analytics. Anyway, given other opinions, I assume that the community has chosen the more conservative approach. Indeed, we can't guarantee uniqueness this way; that's fully the user's responsibility. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Indexes on partitioned tables and foreign partitions
Simon Riggs <si...@2ndquadrant.com> writes: > How much sense is it to have a partitioned table with a mix of local > and foreign tables? Well, as much sense as FDW-based sharding has, for instance. It is arguable, but it exists. > Shouldn't the fix be to allow creation of indexes on foreign tables? > (Maybe they would be virtual or foreign indexes??) Similar ideas were discussed at [1]. There was no wide consensus even on what problems such a feature would solve. Since currently indexes on foreign tables are just forbidden, it seems to me that the best the partitioning code can do today is simply not create them. [1] https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Indexes on partitioned tables and foreign partitions
Hi, 8b08f7d4 added propagation of indexes on partitioned tables to partitions, which is very cool. However, index creation also recurses down to foreign tables. I doubt this is intentional, as such indexes are forbidden as not making much sense; an attempt to create an index on a partitioned table with a foreign partition currently leads to an error. The attached lines fix this. *** a/src/backend/commands/indexcmds.c --- b/src/backend/commands/indexcmds.c *** DefineIndex(Oid relationId, *** 915,920 --- 915,926 int maplen; childrel = heap_open(childRelid, lockmode); + /* Foreign table doesn't need indexes */ + if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + { + heap_close(childrel, NoLock); + continue; + } childidxs = RelationGetIndexList(childrel); attmap = convert_tuples_by_name_map(RelationGetDescr(childrel), *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** AttachPartitionEnsureIndexes(Relation re *** 14352,14357 --- 14352,14361 MemoryContext cxt; MemoryContext oldcxt; + /* Foreign table doesn't need indexes */ + if (attachrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) + return; + cxt = AllocSetContextCreate(CurrentMemoryContext, "AttachPartitionEnsureIndexes", ALLOCSET_DEFAULT_SIZES); -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.
(delicate ping)
Fix slot's xmin advancement and subxact's lost snapshots in decoding.
Hello, I've discovered a couple of bugs in the logical decoding code, both leading to incorrect decoding results in somewhat rare cases. First, xmin of slots is advanced too early. This affects the results only when interlocking allows DDL to be performed concurrently with looking at the schema. In fact, I was not aware of such DDL until I raised this question at https://www.postgresql.org/message-id/flat/87tvu0p0jm.fsf%40ars-thinkpad#87tvu0p0jm.fsf@ars-thinkpad and Andres pointed out ALTER of composite types. Probably there are others, I am not sure; it would be interesting to know them. Another problem is that new snapshots are never queued to known subxacts. It means decoding results can be wrong if the toplevel doesn't write anything while a subxact does. Please see a detailed description of the issues, tests which reproduce them, and fixes in the attached patch. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From 0d03ef172a29ce64a6c5c26f484f0f3186895125 Mon Sep 17 00:00:00 2001 From: Arseny Sher <sher-...@yandex.ru> Date: Sat, 7 Apr 2018 08:22:30 +0300 Subject: [PATCH] Fix slot's xmin advancement and subxact's lost snapshots in decoding. There are two somewhat related bugs here. First, xmin of logical slots was advanced too early. During xl_running_xacts processing, the slot's xmin was set to the oldest running xid in the record, which is wrong: snapshots that will be used for xacts not yet replayed at this moment might consider older xacts as running too. The problem wasn't noticed earlier because DDL that allows a tuple to be deleted (xmax set) while another not-yet-committed xact is looking at it is pretty rare, if not unique: e.g. all forms of ALTER TABLE which change the schema acquire an ACCESS EXCLUSIVE lock conflicting with any inserts. The included test case uses ALTER of a composite type, which doesn't have such interlocking.
To deal with this, we must be able to quickly retrieve the oldest xmin (oldest running xid among all assigned snapshots) from ReorderBuffer. The proposed fix adds yet another list of ReorderBufferTXNs to it, where xacts are sorted by LSN of the base snapshot, because * Plain usage of the current ReorderBufferGetOldestTXN (to get the base snapshot's xmin) is wrong, because order by LSN of the first record is *not* the same as order by LSN of the record on which the base snapshot was assigned: many record types don't need the snapshot, e.g. xl_xact_assignment. We could remember the snapbuilder's xmin during processing of the first record, but that's too conservative and we would keep more tuples from vacuuming than needed. * On the other hand, we can't fully replace the existing order by LSN of the first change with LSN of the base snapshot, because if we do that, WAL records which didn't force receipt of a base snap might be recycled before we replay the xact, as ReorderBufferGetOldestTXN determines which LSN still must be kept. The second issue concerns subxacts. Currently SnapBuildDistributeNewCatalogSnapshot never passes a snapshot to a known subxact. It means that if the toplevel doesn't have a base snapshot (never did writes), there was an assignment record (the subxact is known), and the subxact is writing, no new snapshots are being queued. Probably the simplest fix would be to employ the freshly baked by_base_snapshot_lsn list and just distribute the snapshot to all xacts with a base snapshot whether they are subxacts or not. However, in this case we would queue an unnecessary snapshot to all already known subxacts. To avoid this, in this patch the base snapshot of a known subxact is immediately passed to the toplevel as soon as we learn the kinship / the base snap is assigned -- we have to do that anyway in ReorderBufferCommitChild; new snaps are distributed only to top-level xacts (or unknown subxacts) as before.
Also, minor memory leak is fixed: counter of toplevel's old base snapshot was not decremented when snap has been passed from child. --- contrib/test_decoding/Makefile | 3 +- contrib/test_decoding/expected/oldest_xmin.out | 27 +++ .../test_decoding/expected/subxact_snap_pass.out | 49 ++ contrib/test_decoding/specs/oldest_xmin.spec | 37 + contrib/test_decoding/specs/subxact_snap_pass.spec | 42 + src/backend/replication/logical/reorderbuffer.c| 183 ++--- src/backend/replication/logical/snapbuild.c| 15 +- src/include/replication/reorderbuffer.h| 15 +- 8 files changed, 304 insertions(+), 67 deletions(-) create mode 100644 contrib/test_decoding/expected/oldest_xmin.out create mode 100644 contrib/test_decoding/expected/subxact_snap_pass.out create mode 100644 contrib/test_decoding/specs/oldest_xmin.spec create mode 100644 contrib/test_decoding/specs/subxact_snap_pass.spec diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 6c18189d9d..c696288d67 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -5
Re: Why chain of snapshots is used in ReorderBufferCommit?
Andres Freund <and...@anarazel.de> writes: > I don't think that's right. For one there's plenty types of DDL where no > such locking exists, consider e.g. composite datums where additional > columns can be added even after such a type is already used by another > table. Oh, indeed. Never mind the last paragraph. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: GSoC 2018
Aleksander Alekseev <a.aleks...@postgrespro.ru> writes: > Hi Stephen, > >> synchronous_commit isn't the same as synchronous_standby_names ... > > What about synchronous_standby_names? Logical replication ignores it and > doesn't wait for ack's from replicas or something? It actually works, see [1]: The name of a standby server for this purpose is the application_name setting of the standby, as set in the standby's connection information. In case of a physical replication standby, this should be set in the primary_conninfo setting in recovery.conf; the default is walreceiver. For logical replication, this can be set in the connection information of the subscription, and it defaults to the subscription name. [1] https://www.postgresql.org/docs/current/static/runtime-config-replication.html -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company