Re: Flaky vacuum truncate test in reloptions.sql

2021-04-04 Thread Arseny Sher

On Fri, Apr 2, 2021 at 9:46 AM Michael Paquier  wrote:

> Okay, applied and back-patched down to 12 then.

Thank you both. Unfortunately and surprisingly, the test still fails
(perhaps even more rarely, once in several hundred runs) under
multimaster. After scratching my head for some more time, it seems to me
that the following happens: not only does vacuum encounter a locked
page, but there also exists a concurrent backend (as the parallel
schedule is run) that holds back oldestxmin, keeping it below the xid of
the transaction which did the insertion

INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);

FreezeLimit can't be higher than oldestxmin, so lazy_check_needs_freeze
decides there is nothing to freeze on the page. And multimaster commits
are quite heavy, which apparently shifts the timing enough to make the
issue more likely.
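
For the record, one can see which concurrent backends hold back
oldestxmin with a query of roughly this shape (purely diagnostic, not
part of the patch):

  SELECT pid, backend_xmin, state, query
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL AND pid != pg_backend_pid()
    ORDER BY age(backend_xmin) DESC;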

Currently we are testing the rather funny attached patch, which forces
all such old-snapshot holders to finish. It is crutchy, but I doubt we
want to change the vacuum logic (e.g. checking tuple liveness in
lazy_check_needs_freeze) because of this issue. (It is especially
crutchy in the xid::bigint casts, but wraparound is hardly expected in a
regression test run.)


-- cheers, arseny

diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index bb7bd6e1e7e..78bbf4a5255 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -128,6 +128,20 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 95f7ab4189e..96fb59d16ad 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -72,6 +72,20 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;


Re: Flaky vacuum truncate test in reloptions.sql

2021-04-01 Thread Arseny Sher

Michael Paquier  writes:

> On Thu, Apr 01, 2021 at 12:52:21PM +0900, Masahiko Sawada wrote:
>> Just to be clear the context, I’m mentioning the following test case:

Sorry, I misremembered the test and assumed the table was non-empty
there, while it is actually empty but has vacuum_truncate disabled.
Still, this doesn't change my conclusion that freezing is not a big deal
there, given the small chance of hitting a locked page. Anyway, let's
finish with this.

> What you are writing here makes sense to me.  Looking at the test, it
> is designed to test vacuum_truncate, aka that the behavior we want to
> stress (your former case here) gets stressed all the time, so adding
> the options to avoid the latter case all the time is an improvement.
> And this, even if the latter case does not actually cause a diff and
> it has a small chance to happen in practice.
>
> It would be good to add a comment explaining why the options are
> added (aka just don't skip any pages).

How about the attached?


-- cheers, arseny


diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..d6f5bf98114 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,7 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 --
@@ -127,7 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 --
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..61638c3bcae 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,7 +61,8 @@ CREATE TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -71,7 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options


Re: Flaky vacuum truncate test in reloptions.sql

2021-03-31 Thread Arseny Sher


Arseny Sher  writes:

> as currently the chance of its failure is close to 1.

A typo: that should read "close to 0" too, of course.




Re: Flaky vacuum truncate test in reloptions.sql

2021-03-31 Thread Arseny Sher


Masahiko Sawada  writes:

>> I don't think this matters much, as it tests the contrary and the
>> probability of successful test passing (in case of theoretical bug
>> making vacuum to truncate non-empty relation) becomes stunningly small.
>> But adding it wouldn't hurt either.
>
> I was concerned a bit that without FREEZE in the first VACUUM we could
> not test it properly because the table could not be truncated because
> either vacuum_truncate is off

FREEZE won't help us there.

> or the page is skipped.

You mean that, at the same time, there is a potential bug in vacuum
which would force the truncation of a non-empty relation if the page
weren't locked? That would mean the chance of the test passing even a
single time is close to 0, as currently the chance of its failure is
close to 1.


-- cheers, arseny




Re: Flaky vacuum truncate test in reloptions.sql

2021-03-31 Thread Arseny Sher



On 3/31/21 4:17 PM, Masahiko Sawada wrote:

> Is it better to add FREEZE to the first "VACUUM reloptions_test;" as well?


I don't think this matters much, as it tests the contrary, and the
probability of the test passing successfully (in the case of a
theoretical bug making vacuum truncate a non-empty relation) becomes
vanishingly small. But adding it wouldn't hurt either.

-- cheers, arseny




Re: Flaky vacuum truncate test in reloptions.sql

2021-03-30 Thread Arseny Sher

On 3/30/21 10:12 AM, Michael Paquier wrote:

> Yep, this is the same problem as the one discussed for c2dc1a7, where
> a concurrent checkpoint may cause a page to be skipped, breaking the
> test.

Indeed, Alexander Lakhin pointed me to that commit after I wrote the 
message.


> Why not just using DISABLE_PAGE_SKIPPING instead here?

I think this is not enough. DISABLE_PAGE_SKIPPING disables consulting
the visibility map (it sets aggressive=true in the routine); however, if
the page is locked and lazy_check_needs_freeze says there is nothing to
freeze on it, we still don't look at its contents closely.
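
For reference, the two forms under discussion, run against the test
table:

  VACUUM (DISABLE_PAGE_SKIPPING) reloptions_test; -- ignore the VM; a pinned page may still be skipped
  VACUUM (FREEZE) reloptions_test;                -- aggressive; waits for the pin only if something needs freezing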


-- cheers, arseny




Flaky vacuum truncate test in reloptions.sql

2021-03-29 Thread Arseny Sher
Hi,

I rarely observe failure of vacuum with truncation test in
reloptions.sql, i.e. the truncation doesn't happen:

--- ../../src/test/regress/expected/reloptions.out  2020-04-16 12:37:17.749547401 +0300
+++ ../../src/test/regress/results/reloptions.out   2020-04-17 00:14:58.999211750 +0300
@@ -131,7 +131,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 --
- t
+ f
 (1 row)

A close reading of lazy_scan_heap says that the failure indeed might
happen: if ConditionalLockBufferForCleanup couldn't lock the buffer, and
either the buffer doesn't need freezing or the vacuum is not aggressive,
we don't insist on a close inspection of the page contents and count it
as nonempty according to lazy_check_needs_freeze. It means the page is
regarded as nonempty even if it contains only garbage (but occupied)
ItemIds, which is the case in the test. And of course this allegedly
nonempty page prevents the truncation. The obvious competitors for the
page are bgwriter/checkpointer; the chances of such a collision are
small, but they exist.
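
As an aside, the dead-but-occupied ItemIds the test leaves behind can be
inspected with pageinspect (illustration only, not part of the fix):

  CREATE EXTENSION pageinspect;
  SELECT lp, lp_flags, t_xmin, t_xmax
    FROM heap_page_items(get_raw_page('reloptions_test', 0));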

A simple fix is to perform aggressive VACUUM FREEZE, as attached.

I'm a bit puzzled that I've only ever seen this when running regression
tests under our multimaster. While multimaster contains a fair amount of
C code, I don't see how any of it could interfere with the vacuuming
business here. I can't say I did my best to create a reproduction,
though -- the explanation above seems to be enough.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..fa1c223c87a 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -127,7 +127,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 --
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..a84aae5093b 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -71,7 +71,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options


Enlarge IOS vm cache

2021-03-09 Thread Arseny Sher
		/*
 			 * Rats, we have to visit the heap to check visibility.
@@ -377,11 +382,14 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	indexRelationDesc = node->ioss_RelationDesc;
 	indexScanDesc = node->ioss_ScanDesc;
 
-	/* Release VM buffer pin, if any. */
-	if (node->ioss_VMBuffer != InvalidBuffer)
+	/* Release VM buffer pins, if any. */
+	for (int i = 0; i < VMBUF_SIZE; i++)
 	{
-		ReleaseBuffer(node->ioss_VMBuffer);
-		node->ioss_VMBuffer = InvalidBuffer;
+		if (node->ioss_VMBuffer[i] != InvalidBuffer)
+		{
+			ReleaseBuffer(node->ioss_VMBuffer[i]);
+			node->ioss_VMBuffer[i] = InvalidBuffer;
+		}
 	}
 
 	/*
@@ -680,7 +688,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
  node->ioss_NumOrderByKeys,
  piscan);
 	node->ioss_ScanDesc->xs_want_itup = true;
-	node->ioss_VMBuffer = InvalidBuffer;
+	for (int i = 0; i < VMBUF_SIZE; i++)
+		node->ioss_VMBuffer[i] = InvalidBuffer;
 
 	/*
 	 * If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 57362c36876..0c9bda8de17 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -28,6 +28,24 @@
 #define VISIBILITYMAP_VALID_BITS	0x03	/* OR of all valid visibilitymap
 			 * flags bits */
 
+/*
+ * Size of the bitmap on each visibility map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/* Number of heap blocks we can represent in one byte */
+#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
+
+/* Number of heap blocks we can represent in one visibility map page. */
+#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the visibility map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
 /* Macros for visibilitymap test */
 #define VM_ALL_VISIBLE(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e6..15290be11c6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1439,6 +1439,8 @@ typedef struct IndexScanState
 	Size		iss_PscanLen;
 } IndexScanState;
 
+#define VMBUF_SIZE 64
+
 /* 
  *	 IndexOnlyScanState information
  *
@@ -1473,7 +1475,7 @@ typedef struct IndexOnlyScanState
 	Relation	ioss_RelationDesc;
 	struct IndexScanDescData *ioss_ScanDesc;
 	TupleTableSlot *ioss_TableSlot;
-	Buffer		ioss_VMBuffer;
+	Buffer		ioss_VMBuffer[VMBUF_SIZE];
 	Size		ioss_PscanLen;
 } IndexOnlyScanState;
 


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Use-after-free in 12- EventTriggerAlterTableEnd

2020-10-27 Thread Arseny Sher
Hi,

Valgrind on our internal buildfarm complained about a use-after-free
during currentEventTriggerState->commandList manipulations, e.g. the
lappend in EventTriggerCollectSimpleCommand. I've discovered that the
source of the problem is EventTriggerAlterTableEnd not bothering to
switch into its own context before appending to the list. ced138e8cba
fixed this in master and 13 but wasn't back-patched further, so I see
the problem in 12 and earlier.

The particular reproducing scenario is somewhat involved; it combines
the pg_pathman [1] extension and an SQL interface to it created in our
fork of Postgres. Namely, we allow creating partitions in the CREATE
TABLE ... PARTITION BY statement (similar to what [2] proposes). Each
partition is created via a separate SPI call, which in turn ends up
running an AlterTableStmt through ProcessUtility as
PROCESS_UTILITY_SUBCOMMAND; so EventTriggerAlterTableStart/End is done,
but no additional EventTriggerQueryState is allocated and the commands
are collected in the toplevel EventTriggerQueryState. Of course SPI
frees its memory between the calls, which makes valgrind scream.

Admittedly our case is tricky, and I'm not sure it is possible to
construct something like that with pure core code, but I do believe
other extension writers might run into this, so I propose to back-patch
the attached three-line fix to all supported branches.

Technically, all you (an extension author) have to do to encounter this
is:
 1) Have a toplevel EventTriggerQueryState, i.e. catch a utility statement.
 2) Inside it, run AlterTable as PROCESS_UTILITY_SUBCOMMAND in some
short-lived context.

[1] https://github.com/postgrespro/pg_pathman
[2] https://www.postgresql.org/message-id/flat/CALT9ZEFBv05OhLMKO1Lbo_Zg9a0v%2BU9q9twe%3Dt-dixfR45RmVQ%40mail.gmail.com#f86f0fcfa62d5108fb81556a43f42457

diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index f7ee9838f7..d1972e2d7f 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1807,9 +1807,15 @@ EventTriggerAlterTableEnd(void)
 	/* If no subcommands, don't collect */
 	if (list_length(currentEventTriggerState->currentCommand->d.alterTable.subcmds) != 0)
 	{
+		MemoryContext oldcxt;
+
+		oldcxt = MemoryContextSwitchTo(currentEventTriggerState->cxt);
+
 		currentEventTriggerState->commandList =
 			lappend(currentEventTriggerState->commandList,
 	currentEventTriggerState->currentCommand);
+
+		MemoryContextSwitchTo(oldcxt);
 	}
 	else
 		pfree(currentEventTriggerState->currentCommand);

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Parallel query hangs after a smart shutdown is issued

2020-08-14 Thread Arseny Sher


Tom Lane  writes:

> Thomas Munro  writes:
>> On Fri, Aug 14, 2020 at 4:45 AM Tom Lane  wrote:
>>> After some more rethinking and testing, here's a v5 that feels
>>> fairly final to me.  I realized that the logic in canAcceptConnections
>>> was kind of backwards: it's better to check the main pmState restrictions
>>> first and then the smart-shutdown restrictions afterwards.
>
>> LGTM.  I tested this a bit today and it did what I expected for
>> parallel queries and vacuum, on primary and standby.
>
> Thanks for reviewing!  I'll do the back-patching and push this today.

FWIW, I've also looked through the patch and it's fine. Moderate testing
also found no issues: check-world works, and bgws are started during
smart shutdown as expected. And surely this is better than the initial
short hack of allowing only parallel workers.




Re: logical copy_replication_slot issues

2020-03-09 Thread Arseny Sher


Masahiko Sawada  writes:

> /*
> -* Create logical decoding context, to build the initial snapshot.
> +* Create logical decoding context to find start point or, if we don't
> +* need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
>  */
>
> Do we need to numbering that despite not referring them?

No, it just seemed clearer to me this way. I don't mind removing the
numbers if you feel this is better.

> ctx = CreateInitDecodingContext(plugin, NIL,
> -   false,  /* do not build snapshot */
> +   false,  /* do not build data snapshot */
> restart_lsn,
> logical_read_local_xlog_page, NULL, NULL,
> NULL);
> I'm not sure this change makes the comment better. Could you elaborate
> on the motivation of this change?

Well, DecodingContextFindStartpoint always builds a snapshot allowing
historical *catalog* lookups. This bool controls whether the snapshot
should additionally be suitable for looking at the actual data; this is
e.g. used by the initial data sync in native logical replication.

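For illustration, the place where a data snapshot is actually wanted is
initial table sync: the walsender command it uses passes
need_full_snapshot = true, along the lines of (in a replication=database
connection):

  CREATE_REPLICATION_SLOT sub_slot LOGICAL pgoutput USE_SNAPSHOT;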

-- cheers, arseny




Re: logical copy_replication_slot issues

2020-03-06 Thread Arseny Sher
I wrote:

> It looks good to me now.

After sitting in my head for some time, it reminded me that
CreateInitDecodingContext pegs not only the LSN but also xmin, so the
attached makes a minor comment correction.

While taking a look at the nearby code, it seemed weird to me that
GetOldestSafeDecodingTransactionId checks PGXACT->xid, not xmin. I don't
want to investigate this at the moment though, and not for this thread.

Also not for this thread, but I've noticed
pg_copy_logical_replication_slot doesn't allow changing the plugin name,
which is an omission in my view. It would be useful and trivial to do.


-- cheers, arseny

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de6d9..da634bef0e 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -121,7 +121,8 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  */
 static void
 create_logical_replication_slot(char *name, char *plugin,
-bool temporary, XLogRecPtr restart_lsn)
+bool temporary, XLogRecPtr restart_lsn,
+bool find_startpoint)
 {
 	LogicalDecodingContext *ctx = NULL;
 
@@ -139,16 +140,18 @@ create_logical_replication_slot(char *name, char *plugin,
 		  temporary ? RS_TEMPORARY : RS_EPHEMERAL);
 
 	/*
-	 * Create logical decoding context, to build the initial snapshot.
+	 * Create logical decoding context to find start point or, if we don't
+	 * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
 	 */
 	ctx = CreateInitDecodingContext(plugin, NIL,
-	false,	/* do not build snapshot */
+	false,	/* do not build data snapshot */
 	restart_lsn,
 	logical_read_local_xlog_page, NULL, NULL,
 	NULL);
 
 	/* build initial snapshot, might take a while */
-	DecodingContextFindStartpoint(ctx);
+	if (find_startpoint)
+		DecodingContextFindStartpoint(ctx);
 
 	/* don't need the decoding context anymore */
 	FreeDecodingContext(ctx);
@@ -179,7 +182,8 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	create_logical_replication_slot(NameStr(*name),
 	NameStr(*plugin),
 	temporary,
-	InvalidXLogRecPtr);
+	InvalidXLogRecPtr,
+	true);
 
	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	values[1] = LSNGetDatum(MyReplicationSlot->data.confirmed_flush);
@@ -683,10 +687,19 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 
 	/* Create new slot and acquire it */
 	if (logical_slot)
+	{
+		/*
+		 * WAL required for building snapshot could be removed as we haven't
+		 * reserved WAL yet. So we create a new logical replication slot
+		 * without building an initial snapshot.  A reasonable start point for
+		 * decoding will be provided by the source slot.
+		 */
 		create_logical_replication_slot(NameStr(*dst_name),
 		plugin,
 		temporary,
-		src_restart_lsn);
+		src_restart_lsn,
+		false);
+	}
 	else
 		create_physical_replication_slot(NameStr(*dst_name),
 		 true,
@@ -703,6 +716,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		TransactionId copy_xmin;
 		TransactionId copy_catalog_xmin;
 		XLogRecPtr	copy_restart_lsn;
+		XLogRecPtr	copy_confirmed_flush;
 		bool		copy_islogical;
 		char	   *copy_name;
 
@@ -714,6 +728,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		copy_xmin = src->data.xmin;
 		copy_catalog_xmin = src->data.catalog_xmin;
 		copy_restart_lsn = src->data.restart_lsn;
+		copy_confirmed_flush = src->data.confirmed_flush;
 
 		/* for existence check */
 		copy_name = pstrdup(NameStr(src->data.name));
@@ -738,6 +753,13 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 			NameStr(*src_name)),
 	 errdetail("The source replication slot was modified incompatibly during the copy operation.")));
 
+		/* The source slot must have a consistent snapshot */
+		if (src_islogical && XLogRecPtrIsInvalid(copy_confirmed_flush))
+			ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+	 errmsg("cannot copy a logical replication slot that doesn't have confirmed_flush_lsn"),
+	 errhint("Retry when the source replication slot creation is finished.")));
+
 		/* Install copied values again */
		SpinLockAcquire(&MyReplicationSlot->mutex);
 		MyReplicationSlot->effective_xmin = copy_effective_xmin;
@@ -746,6 +768,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		MyReplicationSlot->data.xmin = copy_xmin;
 		MyReplicationSlot->data.catalog_xmin = copy_catalog_xmin;
 		MyReplicationSlot->data.restart_lsn = copy_restart_lsn;
+		MyReplicationSlot->data.confirmed_flush = copy_confirmed_flush;
		SpinLockRelease(&MyReplicationSlot->mutex);
 
 		ReplicationSlotMarkDirty();


Re: logical copy_replication_slot issues

2020-03-04 Thread Arseny Sher

Masahiko Sawada  writes:

> I've attached the updated version patch that incorporated your
> comments. I believe we're going in the right direction for fixing this
> bug. I'll register this item to the next commit fest so as not to
> forget.

I've moved the confirmed_flush check to the second lookup out of
paranoid considerations (e.g. the slot could have been recreated and its
creation hasn't finished yet) and made some minor stylistic adjustments.
It looks good to me now.

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2c9d5de6d9..4a3c7aa0ce 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -121,7 +121,8 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
  */
 static void
 create_logical_replication_slot(char *name, char *plugin,
-bool temporary, XLogRecPtr restart_lsn)
+bool temporary, XLogRecPtr restart_lsn,
+bool find_startpoint)
 {
 	LogicalDecodingContext *ctx = NULL;
 
@@ -139,16 +140,18 @@ create_logical_replication_slot(char *name, char *plugin,
 		  temporary ? RS_TEMPORARY : RS_EPHEMERAL);
 
 	/*
-	 * Create logical decoding context, to build the initial snapshot.
+	 * Create logical decoding context to find start point or, if we don't
+	 * need it, to 1) bump slot's restart_lsn 2) check plugin sanity.
 	 */
 	ctx = CreateInitDecodingContext(plugin, NIL,
-	false,	/* do not build snapshot */
+	false,	/* do not build data snapshot */
 	restart_lsn,
 	logical_read_local_xlog_page, NULL, NULL,
 	NULL);
 
 	/* build initial snapshot, might take a while */
-	DecodingContextFindStartpoint(ctx);
+	if (find_startpoint)
+		DecodingContextFindStartpoint(ctx);
 
 	/* don't need the decoding context anymore */
 	FreeDecodingContext(ctx);
@@ -179,7 +182,8 @@ pg_create_logical_replication_slot(PG_FUNCTION_ARGS)
 	create_logical_replication_slot(NameStr(*name),
 	NameStr(*plugin),
 	temporary,
-	InvalidXLogRecPtr);
+	InvalidXLogRecPtr,
+	true);
 
	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	values[1] = LSNGetDatum(MyReplicationSlot->data.confirmed_flush);
@@ -683,10 +687,19 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 
 	/* Create new slot and acquire it */
 	if (logical_slot)
+	{
+		/*
+		 * WAL required for building snapshot could be removed as we haven't
+		 * reserved WAL yet. So we create a new logical replication slot
+		 * without building an initial snapshot.  A reasonable start point for
+		 * decoding will be provided by the source slot.
+		 */
 		create_logical_replication_slot(NameStr(*dst_name),
 		plugin,
 		temporary,
-		src_restart_lsn);
+		src_restart_lsn,
+		false);
+	}
 	else
 		create_physical_replication_slot(NameStr(*dst_name),
 		 true,
@@ -703,6 +716,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		TransactionId copy_xmin;
 		TransactionId copy_catalog_xmin;
 		XLogRecPtr	copy_restart_lsn;
+		XLogRecPtr	copy_confirmed_flush;
 		bool		copy_islogical;
 		char	   *copy_name;
 
@@ -714,6 +728,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		copy_xmin = src->data.xmin;
 		copy_catalog_xmin = src->data.catalog_xmin;
 		copy_restart_lsn = src->data.restart_lsn;
+		copy_confirmed_flush = src->data.confirmed_flush;
 
 		/* for existence check */
 		copy_name = pstrdup(NameStr(src->data.name));
@@ -738,6 +753,13 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 			NameStr(*src_name)),
 	 errdetail("The source replication slot was modified incompatibly during the copy operation.")));
 
+		/* The source slot must have a consistent snapshot */
+		if (src_islogical && XLogRecPtrIsInvalid(copy_confirmed_flush))
+			ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+	 errmsg("cannot copy a logical replication slot that doesn't have confirmed_flush_lsn"),
+	 errhint("Retry when the source replication slot creation is finished.")));
+
 		/* Install copied values again */
		SpinLockAcquire(&MyReplicationSlot->mutex);
 		MyReplicationSlot->effective_xmin = copy_effective_xmin;
@@ -746,6 +768,7 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot)
 		MyReplicationSlot->data.xmin = copy_xmin;
 		MyReplicationSlot->data.catalog_xmin = copy_catalog_xmin;
 		MyReplicationSlot->data.restart_lsn = copy_restart_lsn;
+		MyReplicationSlot->data.confirmed_flush = copy_confirmed_flush;
		SpinLockRelease(&MyReplicationSlot->mutex);
 
 		ReplicationSlotMarkDirty();


-- cheers, arseny


Re: logical copy_replication_slot issues

2020-02-10 Thread Arseny Sher


Masahiko Sawada  writes:

> I've attached the draft patch fixing this issue but I'll continue
> investigating it more deeply.

There should also be a check that the source slot itself has a
consistent snapshot (valid confirmed_flush) -- otherwise it might be
possible to create a not-initialized slot, which is probably not an
error, but weird and somewhat meaningless. To be paranoid, this ought to
be checked in both src slot lookups.

With this patch it seems like the only thing
create_logical_replication_slot does is ReplicationSlotCreate, which
questions the usefulness of this function. On second look,
CreateInitDecodingContext checks plugin sanity (ensures it exists), so
probably it's fine.


-- cheers, arseny




logical copy_replication_slot issues

2020-02-09 Thread Arseny Sher
Hi,

While jumping around questions of partially decoded xacts [1], I've read
through the copy replication slots code (9f06d79ef) and found a couple
of issues.

1) It seems quite reckless to me to dive into
DecodingContextFindStartpoint without an actual WAL reservation (the
donor slot's restart_lsn is used, but it is not acquired). Why risk
erroring out with a WAL removal error if the function advances the new
slot's position to the updated donor's one in the end anyway?

2) In the end, the restart_lsn of the new slot is set to the updated
donor's one. However, the confirmed_flush field is not updated. This is
just wrong -- we could start decoding too early and stream a partially
decoded transaction.

I'd probably avoid doing DecodingContextFindStartpoint at all. Its only
purpose is to assemble a consistent snapshot (and establish the
corresponding restart_lsn/xmin pair), but the donor slot must have
already done that, and we could just use its values. Was this
considered?
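
For reference, the SQL-level entry points whose interaction is being
discussed (slot names are arbitrary):

  SELECT pg_create_logical_replication_slot('src_slot', 'test_decoding');
  SELECT pg_copy_logical_replication_slot('src_slot', 'dst_slot');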


[1] https://www.postgresql.org/message-id/flat/AB5978B2-1772-4FEE-A245-74C91704ECB0%40amazon.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Too rigorous assert in reorderbuffer.c

2019-12-19 Thread Arseny Sher


Arseny Sher  writes:

> I'm sorry to bother you with this again, but due to new test our
> internal buildfarm revealed that ajacent assert on cmin is also lie. You
> see, we can't assume cmin is stable because the same key (relnode, tid)
> might refer to completely different tuples if tuple was inserted by
> aborted subxact, immeditaly reclaimed and then space occupied by another
> one. Fix is attached.
>
> Technically this might mean a user-facing bug, because we only pick the
> first cmin which means we might get visibility wrong, allowing to see
> some version too early (i.e real cmin of tuple is y, but decoding thinks
> it is x, and x < y). However, I couldn't quickly make up an example
> where this would actually lead to bad consequences. I tried to create
> such extra visible row in pg_attribute, but that's ok because loop in
> RelationBuildTupleDesc spins exactly natts times and ignores what is
> left unscanned. It is also ok with pg_class, because apparently
> ScanPgRelation also fishes out the (right) first tuple and doesn't check
> for duplicates appearing later in the scan. Maybe I just haven't tried
> hard enough though.

This issue still exists; it would be nice to fix it...

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: (Re)building index using itself or another index of the same table

2019-09-16 Thread Arseny Sher

Tomas Vondra  writes:

> On Thu, Sep 12, 2019 at 11:08:28AM -0400, Tom Lane wrote:

>>I have exactly no faith that this fixes things enough to make such
>>cases supportable.  And I have no interest in opening that can of
>>worms anyway.  I'd rather put in some code to reject database
>>accesses in immutable functions.
>>
>
> Same here. My hunch is a non-trivaial fraction of applications using
> this "trick" is silently broken in various subtle ways.

Ok, I see the point. However, a "could not read block" error might seem
quite scary to users; it looks like data corruption. How about ERRORing
out in get_relation_info then, instead of skipping the indexes being
rebuilt, like in the attached? Even if this doesn't cover all cases, at
least the one scenario observed in the field would get a better error
message.

Rejecting database access completely in immutable functions would be
unfortunate for our particular case, because this GIN index on an
expression joining the very table being indexed multiple times (and thus
using a btree index) is, well, useful. Here is a brief description of
the case. The indexed table stores postal addresses, which are
hierarchical in nature (e.g. country-region-city-street-house). A single
row is one element of any depth (e.g. a region or a house); each row
stores a link to its parent in the parent_guid column, thus establishing
the hierarchy (e.g. a house has a link to its street).

The task is to get the full address by typing random parts of it
(imagine the typing hints in Google Maps). For that, FTS is used. The
GIN index is built on full addresses, and to get the full address the
table is climbed up about six times (the hierarchy depth) by following
the parent_guid chain.

We could materialize full addresses in the table and eliminate the need
to form them in the index expression, but that would seriously increase
the amount of required storage -- GIN doesn't store the indexed columns
fully, so it is cheaper to 'materialize' full addresses only inside it.
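
A sketch of the schema, with made-up names and a simplified query shape
(the real expression joins the table a fixed number of times rather than
using a recursive CTE):

  CREATE TABLE addr_objects (
      guid        uuid PRIMARY KEY,
      parent_guid uuid REFERENCES addr_objects (guid),
      name        text NOT NULL
  );

  -- declared IMMUTABLE although it reads the table: that is the hack
  CREATE FUNCTION full_addr(p uuid) RETURNS text LANGUAGE sql IMMUTABLE AS $$
      WITH RECURSIVE chain(guid, parent_guid, name, depth) AS (
          SELECT guid, parent_guid, name, 1 FROM addr_objects WHERE guid = p
          UNION ALL
          SELECT a.guid, a.parent_guid, a.name, c.depth + 1
          FROM addr_objects a JOIN chain c ON a.guid = c.parent_guid
          WHERE c.depth < 6
      )
      SELECT string_agg(name, ', ' ORDER BY depth DESC) FROM chain
  $$;

  CREATE INDEX ON addr_objects USING gin (to_tsvector('simple', full_addr(guid)));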


Surely this is a hack which cheats the system. We might imagine creating
some functionality (a kind of index referring to multiple rows of the
table -- or even rows of different tables) making it unnecessary, but
such functionality doesn't exist today, and the hack is useful if you
understand the risk.


>>> One might argue that function selecting from table can hardly be called
>>> immutable, and immutability is required for index expressions. However,
>>> if user is sure table contents doesn't change, why not?
>>
>>If the table contents never change, why are you doing VACUUM FULL on it?
>>
>
> It's possible the columns referenced by the index expression are not
> changing, but some additional columns are updated.

Yeah. Also, the table can be CLUSTERed without VACUUM FULL.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From f5b9c433bf387a9ddbe318dfea2b96c02c4a945e Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Thu, 12 Sep 2019 17:35:16 +0300
Subject: [PATCH] ERROR out early on attempt to touch user indexes while they
 are being (re)built.

Existing ReindexIsProcessingIndex check is consulted only in genam.c and thus
enforced only for system catalogs. Check it also in the planner, so that indexes
which are currently being rebuilt are never tried. Also cock SetReindexProcessing
in index_create to defend from index self usage during its creation.

Without this, VACUUM FULL or just CREATE INDEX might fail with something like

ERROR:  could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

if there are indexes which usage can be considered during these very
indexes (re)building, i.e. index expression scans indexed table.

Such error might seem scary, so catch this earlier.
---
 src/backend/catalog/index.c| 22 --
 src/backend/optimizer/util/plancat.c   |  9 +
 src/test/regress/expected/create_index.out | 15 +++
 src/test/regress/sql/create_index.sql  | 13 +
 4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3e1d40662d..5bc764ce46 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1174,7 +1174,22 @@ index_create(Relation heapRelation,
 	}
 	else
 	{
-		index_build(heapRelation, indexRelation, indexInfo, false, true);
+		/* ensure SetReindexProcessing state isn't leaked */
+		PG_TRY();
+		{
+			/* Suppress use of the target index while building it */
+			SetReindexProcessing(heapRelationId, indexRelationId);
+
+			index_build(heapRelation, indexRelation, indexInfo, false, true);
+		}
+		PG_CATCH();
+		{
+			/* Make sure flag gets cleared on error exit */
+			ResetReindexProcessing();
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		ResetReindexProcessing();
 	}
 
 	/*
@@ -1379,7 +1394,10 @@ index_concurrentl

(Re)building index using itself or another index of the same table

2019-09-12 Thread Arseny Sher
Hi,

Our customer encountered a curious scenario. They have a table with a
GIN index on an expression which performs multiple joins with this very
table. These joins employ another, btree index for efficiency. VACUUM
FULL on this table fails with an error like

ERROR:  could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

It happens because the order in which indexes are rebuilt is not
specified. The GIN index is being rebuilt while the btree index has not
been reconstructed yet; an attempt to use the old index with the
rewritten heap crashes.

A problem of similar nature can be reproduced with the following
stripped-down scenario:

CREATE TABLE pears(f1 int primary key, f2 int);
INSERT INTO pears SELECT i, i+1 FROM generate_series(1, 100) i;
CREATE OR REPLACE FUNCTION pears_f(i int) RETURNS int LANGUAGE SQL IMMUTABLE AS 
$$
  SELECT f1 FROM pears WHERE pears.f2 = 42
$$;
CREATE index ON pears ((pears_f(f1)));

Here the usage of the not-yet-created index on pears_f(f1) for its own
construction is pointless; however, the planner in principle considers
it in get_relation_info, tries to get the btree height
(_bt_getrootheight) -- and fails.


There is already a mechanism which prevents usage of indexes during
reindex -- ReindexIsProcessingIndex et al. However, contrary to what the
comment at index.c:3664 says, these protect only indexes on system
catalogs, not user tables: the only real caller is genam.c.

The attached patch extends it: the same check is added to
get_relation_info. Also, SetReindexProcessing is cocked in index_create
to defend from index self-usage during creation, as in the stripped-down
example above. There are some other still unprotected callers of
index_build; concurrent index creation doesn't need it because the index
is not indisvalid during the build, and in RelationTruncateIndexes the
table is empty, so it looks like it can be omitted.


One might argue that a function selecting from a table can hardly be
called immutable, and immutability is required for index expressions.
However, if the user is sure the table contents don't change, why not?
Also, the possibility of triggering a "could not read block" error with
plain SQL is definitely not nice.

From 5942a3a5b2c90056119b9873c81f30dfa9e003af Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Thu, 12 Sep 2019 17:35:16 +0300
Subject: [PATCH] Avoid touching user indexes while they are being (re)built.

Existing ReindexIsProcessingIndex check is consulted only in genam.c and thus
enforced only for system catalogs. Check it also in the planner, so that indexes
which are currently being rebuilt are never used. Also cock SetReindexProcessing
in index_create to defend from index self usage during its creation.

Without this, VACUUM FULL or just CREATE INDEX might fail with something like

ERROR:  could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

if there are indexes which usage can be considered during these very
indexes (re)building, i.e. index expression scans indexed table.
---
 src/backend/catalog/index.c| 22 --
 src/backend/optimizer/util/plancat.c   |  5 +
 src/test/regress/expected/create_index.out | 12 
 src/test/regress/sql/create_index.sql  | 13 +
 4 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3e1d40662d..5bc764ce46 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1174,7 +1174,22 @@ index_create(Relation heapRelation,
 	}
 	else
 	{
-		index_build(heapRelation, indexRelation, indexInfo, false, true);
+		/* ensure SetReindexProcessing state isn't leaked */
+		PG_TRY();
+		{
+			/* Suppress use of the target index while building it */
+			SetReindexProcessing(heapRelationId, indexRelationId);
+
+			index_build(heapRelation, indexRelation, indexInfo, false, true);
+		}
+		PG_CATCH();
+		{
+			/* Make sure flag gets cleared on error exit */
+			ResetReindexProcessing();
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		ResetReindexProcessing();
 	}
 
 	/*
@@ -1379,7 +1394,10 @@ index_concurrently_build(Oid heapRelationId,
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
 
-	/* Now build the index */
+	/*
+	 * Now build the index
+	 * SetReindexProcessing is not required since indisvalid is false anyway
+	 */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
 	/* Close both the relations, but keep the locks */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cf1761401d..9d58cd2574 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -27,6 +27,7 @@
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
+#include "catalog/index.h"
 #include "catalog/heap.h"
 #include "catalog/pg_am.h"
 #include "catalog/p

Re: Parallel query vs smart shutdown and Postmaster death

2019-03-18 Thread Arseny Sher

Thomas Munro  writes:

> Just a thought: instead of the new hand-coded loop you added in
> pmdie(), do you think it would make sense to have a new argument
> "exclude_class_mask" for SignalSomeChildren()?  If we did that, I
> would consider renaming the existing parameter "target" to
> "include_type_mask" to make it super clear what's going on.

I thought that this was a bit too complicated for a single use case, but
if you like it better, here is an updated version.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since postmaster doesn't advertise that he is never going to start parallel
workers after shutdown was issued, parallel leader stucks waiting for them
indefinitely.
---
 src/backend/postmaster/postmaster.c | 72 -
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..2ad215b12d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(void);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
-static bool SignalSomeChildren(int signal, int targets);
+static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask);
 static void TerminateChildren(int signal);
 
-#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+#define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0)
 
 static int	CountChildren(int target);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
@@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
-/* autovac workers are told to shut down immediately */
-/* and bgworkers too; does this need tweaking? */
-SignalSomeChildren(SIGTERM,
-   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+/*
+ * Shut down all bgws except parallel workers: graceful
+ * termination of existing sessions needs them.
+ * Autovac workers are also told to shut down immediately.
+ */
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS)
 signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
-SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0);
 
 /*
  * Only startup, bgwriter, walreceiver, possibly bgworkers,
@@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS)
 /* shut down all backends and workers */
 SignalSomeChildren(SIGTERM,
    BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
-   BACKEND_TYPE_BGWORKER);
+   BACKEND_TYPE_BGWORKER, 0);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal)
 
 /*
  * Send a signal to the targeted children (but NOT special children;
- * dead_end children are never signaled, either).
+ * dead_end children are never signaled, either). If BACKEND_TYPE_BGWORKER
+ * is in include_type_mask, bgworkers are additionaly filtered out by
+ * exclude_bgw_mask.
  */
 static bool
-SignalSomeChildren(int signal, int target)
+SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask)
 {
 	dlist_iter	iter;
+	slist_iter	siter;
 	bool		signaled = false;
 
+	/*
+	 * Handle bgws by walking over BackgroundWorkerList because we have
+	 * to check out bgw class.
+	 */
+	if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0)
+	{
+		slist_foreach(siter, &BackgroundWorkerList)
+		{
+			RegisteredBgWorker *rw;
+
+			rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+			if (rw->rw_pid > 0 &&
+((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0))
+			{
+ereport(DEBUG4,
+		(errmsg_internal("sending signal %d to process %d",
+		 signal, (int) rw->rw_pid)));
+signal_child(rw->rw_pid, signal);
+signaled = true;
+			}
+		}
+	}
+
	dlist_foreach(iter, &BackendList)
 	{
 		Backend*bp = dlist_container(Backend, elem, iter.cur);
 
-		if (bp->dead_end)
+		if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER)
 			continue;
 
 		/*
 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
 		 * it first and avoid touching

Re: Parallel query vs smart shutdown and Postmaster death

2019-03-16 Thread Arseny Sher
Hi,

Thomas Munro  writes:

> 1.  Why does pmdie()'s SIGTERM case terminate parallel workers
> immediately?  That breaks aborts running parallel queries, so they
> don't get to end their work normally.
> 2.  Why are new parallel workers not allowed to be started while in
> this state?  That hangs future parallel queries forever, so they don't
> get to end their work normally.
> 3.  Suppose we fix the above cases; should we do it for parallel
> workers only (somehow), or for all bgworkers?  It's hard to say since
> I don't know what all bgworkers do.

The attached patch fixes 1 and 2. As for 3, the only other internal
bgworkers currently are the logical replication launcher and workers,
which arguably should be killed immediately.
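
For reference, a rough way to observe issue 2 before the fix (table and
settings are just an illustration):

  -- in a session that stays connected:
  CREATE TABLE big AS SELECT g FROM generate_series(1, 10000000) g;
  -- from a shell, request a smart shutdown: pg_ctl stop -m smart
  -- back in the still-connected session:
  SET max_parallel_workers_per_gather = 2;
  SELECT count(*) FROM big; -- leader waits forever for workers that are never launched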

> Here's an initial sketch of a
> patch like that: it gives you "ERROR:  parallel worker failed to
> initialize" and "HINT:  More details may be available in the server
> log." if you try to run a parallel query.

Reporting bgworkers that postmaster is never going to start is probably
worthwhile anyway, to prevent potential further deadlocks like this.
However, I think there is a problem in your patch: we might be in
post-PM_RUN states due to FatalError, not because of a shutdown. In that
case we shouldn't refuse to run bgws in the future. I would also merge
the check into bgworker_should_start_now.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since postmaster doesn't advertise that he is never going to start parallel
workers after shutdown was issued, parallel leader stucks waiting for them
indefinitely.
---
 src/backend/postmaster/postmaster.c | 44 -
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d60969fdb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
+slist_iter siter;
+
+/*
+ * Shut down all bgws except parallel workers: graceful
+ * termination of existing sessions needs them.
+ */
+slist_foreach(siter, &BackgroundWorkerList)
+{
+	RegisteredBgWorker *rw;
+
+	rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+	if (rw->rw_pid > 0 &&
+		((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0))
+	{
+		ereport(DEBUG4,
+(errmsg_internal("sending signal %d to process %d",
+ SIGTERM, (int) rw->rw_pid)));
+		signal_child(rw->rw_pid, SIGTERM);
+
+	}
+}
+
 /* autovac workers are told to shut down immediately */
-/* and bgworkers too; does this need tweaking? */
-SignalSomeChildren(SIGTERM,
-   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+	BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
 		case PM_SHUTDOWN_2:
 		case PM_SHUTDOWN:
+			return false;
+
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
-			break;
+			/*
+			 * Allow new parallel workers in SmartShutdown while backends
+			 * are still there.
+			 */
+			if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+Shutdown != SmartShutdown ||
+FatalError)
+return false;
+			/* fall through */
 
 		case PM_RUN:
 			if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void)
 			}
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		if (bgworker_should_start_now(rw))
 		{
 			/* reset crash time before trying to start worker */
 			rw->rw_crashed_at = 0;
-- 
2.11.0



Re: Too rigorous assert in reorderbuffer.c

2019-02-15 Thread Arseny Sher

Alvaro Herrera  writes:

> Thanks for checking!  I also run it on all branches, everything passes.
> Pushed now.

I'm sorry to bother you with this again, but thanks to the new test, our
internal buildfarm revealed that the adjacent assert on cmin is also a
lie. You see, we can't assume cmin is stable, because the same key
(relnode, tid) might refer to completely different tuples if a tuple was
inserted by an aborted subxact, immediately reclaimed, and the space
then occupied by another one. A fix is attached.

Technically this might mean a user-facing bug, because we only pick the
first cmin, which means we might get visibility wrong, allowing one to
see some version too early (i.e. the real cmin of the tuple is y, but
decoding thinks it is x, and x < y). However, I couldn't quickly make up
an example where this would actually lead to bad consequences. I tried
to create such an extra visible row in pg_attribute, but that's ok
because the loop in RelationBuildTupleDesc spins exactly natts times and
ignores what is left unscanned. It is also ok with pg_class, because
apparently ScanPgRelation also fishes out the (right) first tuple and
doesn't check for duplicates appearing later in the scan. Maybe I just
haven't tried hard enough though.

The attached 'aborted_subxact_test.patch' is an illustration of such
wrong cmin visibility on pg_attribute. It triggers an assertion failure
but is otherwise ok (no user-facing issues), as I said earlier, so I am
disinclined to include it in the fix.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2b486b5e9f..505c7ff134 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1318,29 +1318,36 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
		(void *) &key,
 		HASH_ENTER | HASH_FIND,
		&found);
-		if (!found)
-		{
-			ent->cmin = change->data.tuplecid.cmin;
-			ent->cmax = change->data.tuplecid.cmax;
-			ent->combocid = change->data.tuplecid.combocid;
-		}
-		else
+		if (found)
 		{
 			/*
-			 * Maybe we already saw this tuple before in this transaction,
-			 * but if so it must have the same cmin.
+			 * Tuple with such key (relnode, tid) might be inserted in aborted
+			 * subxact, space reclaimed and then filled with another tuple
+			 * from our xact or another; then we might update it again. It
+			 * means cmin can become valid/invalid at any moment, but valid
+			 * values are always monotonic.
 			 */
-			Assert(ent->cmin == change->data.tuplecid.cmin);
+			Assert((ent->cmin == InvalidCommandId) ||
+   (change->data.tuplecid.cmin == InvalidCommandId) ||
+   (change->data.tuplecid.cmin >= ent->cmin));
 
 			/*
-			 * cmax may be initially invalid, but once set it can only grow,
-			 * and never become invalid again.
+			 * Practically the same for cmax; to see invalid cmax after
+			 * valid, we need to insert new tuple in place of one reclaimed
+			 * from our aborted subxact.
 			 */
 			Assert((ent->cmax == InvalidCommandId) ||
-   ((change->data.tuplecid.cmax != InvalidCommandId) &&
-	(change->data.tuplecid.cmax > ent->cmax)));
-			ent->cmax = change->data.tuplecid.cmax;
+   (change->data.tuplecid.cmax == InvalidCommandId) ||
+   (change->data.tuplecid.cmax > ent->cmax));
 		}
+
+		/*
+		 * The 'right' (potentially belonging to committed xact) cmin and cmax
+		 * are always the latest.
+		 */
+		ent->cmin = change->data.tuplecid.cmin;
+		ent->cmax = change->data.tuplecid.cmax;
+		ent->combocid = change->data.tuplecid.combocid;
 	}
 }
 
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 4afb1d963e..32c2650d1a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -3,11 +3,11 @@
 MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
-REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
+# REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer
+	oldest_xmin snapshot_transfer aborted_subxact
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/specs/aborted_subxact.spec b/contrib/test_decoding/specs/aborted_subxact.spec
new file mode 100644
index 00..cb066fd5d2
--- /dev/null
+++ b/contrib/test_decoding/specs/aborted_subxact.spec
@@ -0,0 +1,35 @@
+# Test DDL in aborted subxact
+
+setup
+{
+SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'tes

Re: Too rigorous assert in reorderbuffer.c

2019-02-12 Thread Arseny Sher


Alvaro Herrera  writes:

>> I thought the blanket removal of the assert() was excessive, and we can
>> relax it instead; what do you think of the attached?
>
> More precisely, my question was: with this change, does the code still
> work correctly in your non-toy case?

Yes, it works. I thought for a moment that some obscure cases where cmax
on a single tuple is not strictly monotonic might exist, but it looks
like they don't. So your change is ok with me; reshaping the assert is
better than removing it. make check is also good on all supported
branches.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Too rigorous assert in reorderbuffer.c

2019-02-07 Thread Arseny Sher


Alvaro Herrera  writes:

> On 2019-Feb-06, Arseny Sher wrote:
>
>>
>> Alvaro Herrera  writes:
>>
>> > note the additional pg_temp_XYZ row in the middle.  This is caused by
>> > the rewrite in ALTER TABLE.  Peter E fixed that in Pg11 in commit
>> > 325f2ec55; I don't think there's much to do in the backbranches other
>> > than hide the pesky record to avoid it breaking the test.
>>
>> Oh, I see. Let's just remove the first insertion then, as in attached.
>> I've tested it on master and on 9.4.
>
> Ah, okay.  Does the test still fail when run without the code fix?

Yes. The problem here is the rewriting of cmax on catalog tuples
(pg_attribute in the test), so it fails without any user data at all.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Too rigorous assert in reorderbuffer.c

2019-02-06 Thread Arseny Sher

Alvaro Herrera  writes:

> note the additional pg_temp_XYZ row in the middle.  This is caused by
> the rewrite in ALTER TABLE.  Peter E fixed that in Pg11 in commit
> 325f2ec55; I don't think there's much to do in the backbranches other
> than hide the pesky record to avoid it breaking the test.

Oh, I see. Let's just remove the first insertion then, as in attached.
I've tested it on master and on 9.4.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


>From b34726d9b7565df73319a44664d9cd04de5f514f Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Wed, 30 Jan 2019 23:31:47 +0300
Subject: [PATCH] Remove assertion in reorderbuffer that cmax is stable.

Since it can be rewritten arbitrary number of times if subxact(s) who tried to
delete some tuple version abort later. Add small test involving DDL in aborted
subxact as this case is interesting on its own and wasn't covered before.
---
 contrib/test_decoding/expected/ddl.out  | 18 ++
 contrib/test_decoding/sql/ddl.sql   | 13 +
 src/backend/replication/logical/reorderbuffer.c | 10 ++
 3 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index b7c76469fc..2bd28e6d15 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -409,6 +409,24 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (6 rows)
 
+-- check that DDL in aborted subtransactions handled correctly
+CREATE TABLE tr_sub_ddl(data int);
+BEGIN;
+SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text;
+INSERT INTO tr_sub_ddl VALUES ('blah-blah');
+ROLLBACK TO SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint;
+INSERT INTO tr_sub_ddl VALUES(43);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+   data   
+--
+ BEGIN
+ table public.tr_sub_ddl: INSERT: data[bigint]:43
+ COMMIT
+(3 rows)
+
 /*
  * Check whether treating a table as a catalog table works somewhat
  */
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index c4b10a4cf9..a55086443c 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -236,6 +236,19 @@ COMMIT;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
+-- check that DDL in aborted subtransactions handled correctly
+CREATE TABLE tr_sub_ddl(data int);
+BEGIN;
+SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text;
+INSERT INTO tr_sub_ddl VALUES ('blah-blah');
+ROLLBACK TO SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint;
+INSERT INTO tr_sub_ddl VALUES(43);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
 
 /*
  * Check whether treating a table as a catalog table works somewhat
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a49e226967..0ace0cfb87 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,13 +1327,15 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		else
 		{
 			Assert(ent->cmin == change->data.tuplecid.cmin);
-			Assert(ent->cmax == InvalidCommandId ||
-   ent->cmax == change->data.tuplecid.cmax);
 
 			/*
 			 * if the tuple got valid in this transaction and now got deleted
-			 * we already have a valid cmin stored. The cmax will be
-			 * InvalidCommandId though.
+			 * we already have a valid cmin stored. The cmax is still
+			 * InvalidCommandId though, so stamp it. Moreover, cmax might
+			 * be rewritten arbitrary number of times if subxacts who tried to
+			 * delete this version abort later. Pick up the latest since it is
+			 * (belonging to potentially committed subxact) the only one which
+			 * matters.
 			 */
 			ent->cmax = change->data.tuplecid.cmax;
 		}
-- 
2.11.0



Too rigorous assert in reorderbuffer.c

2019-01-30 Thread Arseny Sher
Hi,

My colleague Alexander Lakhin has noticed an assertion failure in
reorderbuffer.c:1330. Here is a simple snippet reproducing it:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 
'test_decoding');

create table t(k int);
begin;
savepoint a;
alter table t alter column k type text;
rollback to savepoint a;
alter table t alter column k type bigint;
commit;

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1');

It is indeed too opinionated since cmax of a tuple is not stable; it can
be rewritten if subxact who tried to delete it later aborts (analogy
also holds for xmax). Attached patch removes it. While here, I had also
considered worthwhile to add a test involving DDL in aborted subxact as
it is interesting anyway and wasn't covered before.

>From c5cd30f9e23c96390cafad82f832c918dfd3397f Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Wed, 30 Jan 2019 23:31:47 +0300
Subject: [PATCH] Remove assertion in reorderbuffer that cmax is stable.

Since it can be rewritten arbitrary number of times if subxact(s) who tried to
delete some tuple version abort later. Add small test involving DDL in aborted
subxact as this case is interesting on its own and wasn't covered before.
---
 contrib/test_decoding/expected/ddl.out  | 20 
 contrib/test_decoding/sql/ddl.sql   | 14 ++
 src/backend/replication/logical/reorderbuffer.c | 10 ++
 3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index b7c76469fc..69dd74676c 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -409,6 +409,26 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
  COMMIT
 (6 rows)
 
+-- check that DDL in aborted subtransactions handled correctly
+CREATE TABLE tr_sub_ddl(data int);
+BEGIN;
+INSERT INTO tr_sub_ddl VALUES (42);
+SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text;
+INSERT INTO tr_sub_ddl VALUES ('blah-blah');
+ROLLBACK TO SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint;
+INSERT INTO tr_sub_ddl VALUES(43);
+COMMIT;
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+   data
+---
+ BEGIN
+ table public.tr_sub_ddl: INSERT: data[integer]:42
+ table public.tr_sub_ddl: INSERT: data[bigint]:43
+ COMMIT
+(4 rows)
+
 /*
  * Check whether treating a table as a catalog table works somewhat
  */
diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql
index c4b10a4cf9..f47b4ad716 100644
--- a/contrib/test_decoding/sql/ddl.sql
+++ b/contrib/test_decoding/sql/ddl.sql
@@ -236,6 +236,20 @@ COMMIT;
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
+-- check that DDL in aborted subtransactions handled correctly
+CREATE TABLE tr_sub_ddl(data int);
+BEGIN;
+INSERT INTO tr_sub_ddl VALUES (42);
+SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE text;
+INSERT INTO tr_sub_ddl VALUES ('blah-blah');
+ROLLBACK TO SAVEPOINT a;
+ALTER TABLE tr_sub_ddl ALTER COLUMN data TYPE bigint;
+INSERT INTO tr_sub_ddl VALUES(43);
+COMMIT;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
 
 /*
  * Check whether treating a table as a catalog table works somewhat
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a49e226967..0ace0cfb87 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,13 +1327,15 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		else
 		{
 			Assert(ent->cmin == change->data.tuplecid.cmin);
-			Assert(ent->cmax == InvalidCommandId ||
-   ent->cmax == change->data.tuplecid.cmax);
 
 			/*
 			 * if the tuple got valid in this transaction and now got deleted
-			 * we already have a valid cmin stored. The cmax will be
-			 * InvalidCommandId though.
+			 * we already have a valid cmin stored. The cmax is still
+			 * InvalidCommandId though, so stamp it. Moreover, cmax might
+			 * be rewritten arbitrary number of times if subxacts who tried to
+			 * delete this version abort later. Pick up the latest since it is
+			 * (belonging to potentially committed subxact) the only one which
+			 * matters.
 			 */
 			ent->cmax = change->data.tuplecid.cmax;
 		}
-- 
2.11.0


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] logical decoding of two-phase transactions

2019-01-14 Thread Arseny Sher
e one we are aborting, clear its reorder buffer as 
well.
+*/
+   if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+   ReorderBufferAbort(ctx->reorder, xid, origin_lsn);

What is the aim of this? How can the xl_xid xid of a commit prepared
record be normal?


+   /*
+* The transaction may or may not exist (during restarts for example).
+* Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+* it to be created below.
+*/

The code around this looks sane, but I think that restarts are
irrelevant to rbtxn existence at this point: if we are going to
COMMIT/ABORT PREPARED it, it must have been replayed and the rbtxn
purged immediately after. The only reason an rbtxn can exist here is the
invalidation addition (ReorderBufferAddInvalidations) happening a couple
of calls earlier. Also, instead of the vague '2PC transactions do not
contain any reorderbuffers' I would say something like 'create a dummy
ReorderBufferTXN to pass to the callback'.


- filter_prepare_cb callback existence is checked in both decode.c and
  in filter_prepare_cb_wrapper.


Third patch:

+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each
catalog scan, and point to SetupHistoricSnapshot for the details.


Nitpicking:

First patch: I still don't think that these flags need a bitmask.

Second patch:

- I still think ReorderBufferCommitInternal name is confusing and should
  be renamed to something like ReorderBufferReplay.


/* Do we know this is a subxact?  Xid of top-level txn if so */
TransactionId toplevel_xid;
+   /* In case of 2PC we need to pass GID to output plugin */
+   char *gid;

Better add here newline as between other fields.


+   txn->txn_flags |= RBTXN_PREPARE;
+   txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+   strcpy(txn->gid, gid);

pstrdup?


- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
  same thing and should be merged, with comments explaining that the
  answer must be stable.

+  The optional commit_prepared_cb callback is called whenever
+  a commit prepared transaction has been decoded. The gid field,

a commit prepared transaction *record* has been decoded?


Fourth patch:

Applying: Teach test_decoding plugin to work with 2PC
.git/rebase-apply/patch:347: trailing whitespace.
-- test savepoints
.git/rebase-apply/patch:424: trailing whitespace.
# get XID of the above two-phase transaction
warning: 2 lines add whitespace errors.


[1] https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] logical decoding of two-phase transactions

2018-12-18 Thread Arseny Sher


Tomas Vondra  writes:

> Hi Nikhil,
>
> Thanks for the updated patch - I've started working on a review, with
> the hope of getting it committed sometime in 2019-01. But the patch
> bit-rotted again a bit (probably due to d3c09b9b), which broke the last
> part. Can you post a fixed version?

Please also note that at some time the thread was torn and continued in
another place:
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

And now we have two branches =(

I hadn't checked whether my concerns where addressed in the latest
version though.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global snapshots

2018-09-26 Thread Arseny Sher
o, I could not understand some notes from Arseny:
>
>> On 25 July 2018, at 16:35, Arseny Sher  wrote:
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>>   supported. For READ COMMITTED, we must export every new snapshot
>>   generated on coordinator to all nodes, which is fairly easy to
>>   do. SERIALIZABLE will definitely require chattering between nodes,
>>   but that's much less demanded isolevel (e.g. we still don't support
>>   it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what anomalies does 
> it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?

Yes, your explanation is concise and correct.
To put it another way, ensuring SERIALIZABLE in MVCC requires tracking
reads, and there is no infrastructure for doing that globally. Classical
write skew is possible: you have node A holding x and node B holding y,
initially x = y = 30, and there is a constraint x + y > 0.
Two concurrent xacts start:
T1: x = x - 42;
T2: y = y - 42;
They don't see each other's writes, so both commit successfully and the
constraint is violated. We need to transfer information about reads
between nodes to know when we need to abort someone.
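
For concreteness, a minimal SQL sketch of that scenario (a hedged
illustration only -- the table names are made up, and it assumes node A
stores x_tab and node B stores y_tab, each visible from the other node
as a foreign table):

-- initially x = y = 30 and the application invariant is x + y > 0
-- T1, connected to node A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT v FROM y_tab;           -- remote read, sees 30
UPDATE x_tab SET v = v - 42;   -- local write; 30 - 42 + 30 > 0 "holds"
COMMIT;
-- T2, running concurrently on node B:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT v FROM x_tab;           -- remote read, also sees 30
UPDATE y_tab SET v = v - 42;
COMMIT;
-- Neither node sees the other's write set, both commit, and afterwards
-- x + y = -54: the invariant is violated without any error.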

>>
>> * Another somewhat serious issue is that there is a risk of recency
>>   guarantee violation. If client starts transaction at node with
>>   lagging clocks, its snapshot might not include some recently
>>   committed transactions; if client works with different nodes, she
>>   might not even see her own changes. CockroachDB describes at [1] how
>>   they and Google Spanner overcome this problem. In short, both set
>>   hard limit on maximum allowed clock skew.  Spanner uses atomic
>>   clocks, so this skew is small and they just wait it at the end of
>>   each transaction before acknowledging the client. In CockroachDB, if
>>   tuple is not visible but we are unsure whether it is truly invisible
>>   or it's just the skew (the difference between snapshot and tuple's
>>   csn is less than the skew), transaction is restarted with advanced
>>   snapshot. This process is not infinite because the upper border
>>   (initial snapshot + max skew) stays the same; this is correct as we
>>   just want to ensure that our xact sees all the committed ones before
>>   it started. We can implement the same thing.
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.

The InDoubt state protects us from seeing a xact which is not yet
committed everywhere, but it doesn't protect against starting a xact on
a node with lagging clocks and obtaining a plainly stale snapshot. We
won't see any half-committed data with it (InDoubt covers us here), but
some recently committed xacts might not make it into our old snapshot at
all.

>> * 003_bank_shared.pl test is removed. In current shape (loading one
>>   node) it is useless, and if we bombard both nodes, deadlock surely
>>   appears. In general, global snaphots are not needed for such
>>   multimaster-like setup -- either there are no conflicts and we are
>>   fine, or there is a conflict, in which case we get a deadlock.
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...

Sure, this is just a generalization of the basic deadlock problem to a
distributed system. To deal with it, someone must periodically collect
locks across the nodes, build the waits-for graph, check it for cycles,
and punish (abort) one of the transactions forming a cycle, if one
exists. InDoubt (and global snapshots in general) is unrelated to this;
we hung on a usual row-level lock in this test. BTW, our pg_shardman
extension has a primitive deadlock detector [2]; I suppose Citus [3]
also has one.

> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?

Oh, the subject of automatic 2PC xact resolution is a matter for another
(probably many-mile-long) thread, largely unrelated to global
snapshots/visibility. In general, the problem of a distributed
transaction commit which doesn't block while a majority of nodes is
alive requires implementing a distributed consensus algorithm like Paxos
or Raft. You might also find thread [4] interesting.

Thank you for your interest in the topic!

[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broa

Re: [HACKERS] logical decoding of two-phase transactions

2018-08-12 Thread Arseny Sher
 snapshot to all RBTXNs, if it is important.
 * Set base snap of our xact if it did DDL, to execute invalidations
   during replay.

I see that we don't need to do the first two bullets during
DecodePrepare: the xact's effects are still invisible to everyone but
itself after PREPARE. As for a xact seeing its own changes, that is
implemented via logging cmin/cmax, and we don't need to mark the xact as
committed for that (cf. ReorderBufferCopySnap).

Regarding the third point... I think in 2PC decoding we might need to
execute invalidations twice:
1) After replaying the xact on PREPARE, to forget about the catalog
   changes the xact did -- it is not yet committed and must be invisible
   to other xacts until CP. In the latest patchset invalidations are
   executed only if there is at least one change in the xact (it has a
   base snap). That looks fine: we can't spoil the catalogs if there was
   nothing to decode. Better to explain that somewhere.
2) After decoding COMMIT PREPARED, to make the changes visible. In the
   current patchset it is always done. Actually, *this* is the reason an
   RBTXN might already exist when we enter ReorderBufferFinishPrepared,
   not "(during restarts for example)" as the comment there says:
   if there were inval messages, the RBTXN will be created
   in DecodeCommit during their addition.

BTW, "that we might need to execute invalidations, add snapshot" in
SnapBuildCommitTxn looks like a kludge to me: I suppose it is better to
do that in ReorderBufferXidSetCatalogChanges.

Now, another issue is registering the xact as committed in
SnapBuildCommitTxn during COMMIT PREPARED processing. Since the RBTXN is
always purged after the xact is replayed on PREPARE, the only medium we
have for noticing catalog changes during COMMIT PREPARED is the
invalidation messages attached to the CP record. This raises the
following question.
 * If there is a guarantee that whenever a xact makes catalog changes it
   generates invalidation messages, then this code is fine. However,
   currently ReorderBufferXidSetCatalogChanges is also called on
   XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which
   is useless if such a guarantee exists.
 * If, on the other hand, there is no such guarantee, this code is
   broken.


>> - I am one of those people upthread who don't think that converting
>>   flags to bitmask is beneficial -- especially given that many of them
>>   are mutually exclusive, e.g. xact can't be committed and aborted at
>>   the same time. Apparently you have left this to the committer though.
>>
>
> Hmm, there seems to be divided opinion on this. I am willing to go
> back to using the booleans if there's opposition and if the committer
> so wishes. Note that this patch will end up adding 4/5 more booleans
> in that case (we add new ones for prepare, commit prepare, abort,
> rollback prepare etc).

Well, you can unite the mutually exclusive fields into one enum or char
with macros defining the possible values. A transaction can't be
committed and aborted at the same time, etc.


>> +  row. The change_cb callback may access system or
>> +  user catalog tables to aid in the process of outputting the row
>> +  modification details. In case of decoding a prepared (but yet
>> +  uncommitted) transaction or decoding of an uncommitted transaction, 
>> this
>> +  change callback is ensured sane access to catalog tables regardless of
>> +  simultaneous rollback by another backend of this very same 
>> transaction.
>>
>> I don't think we should explain this, at least in such words. As
>> mentioned upthread, we should warn about allowed systable_* accesses
>> instead. Same for message_cb.
>>
>
> Looks like you are looking at an earlier patchset. The latest patchset
> has removed the above.

I see, sorry.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] logical decoding of two-phase transactions

2018-08-07 Thread Arseny Sher


Andres Freund  writes:

>> - On decoding of aborted xacts. The idea to throw an error once we
>>   detect the abort is appealing, however I think you will have problems
>>   with subxacts in the current implementation. What if subxact issues
>>   DDL and then aborted, but main transaction successfully committed?
>
> I don't see a fundamental issue here. I've not reviewed the current
> patchset meaningfully, however. Do you see a fundamental issue here?

Hmm, yes, this is not an issue for this patch because after reading the
PREPARE record we know all the aborted subxacts and won't try to decode
their changes. However, this will come up again once we decide to decode
in-progress transactions. Checking for all subxids is expensive;
moreover, WAL doesn't provide all of them until commit... it might be
easier to prevent vacuuming of aborted stuff while decoding needs it. A
matter for another patch, anyway.
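
For reference, the shape of transaction in question is roughly the
following (the table name and gid are made up; it assumes
max_prepared_transactions > 0):

CREATE TABLE sub_ddl_demo(data int);
BEGIN;
SAVEPOINT s;
ALTER TABLE sub_ddl_demo ALTER COLUMN data TYPE text;  -- DDL in subxact
INSERT INTO sub_ddl_demo VALUES ('oops');
ROLLBACK TO SAVEPOINT s;         -- the subxact with the DDL aborts...
INSERT INTO sub_ddl_demo VALUES (1);
PREPARE TRANSACTION 'demo_gid';  -- ...while the top level gets prepared
COMMIT PREPARED 'demo_gid';      -- and eventually commits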

>> - Currently we will call abort_prepared cb even if we failed to actually
>>   prepare xact due to concurrent abort. I think it is confusing for
>>   users. We should either handle this by remembering not to invoke
>>   abort_prepared in these cases or at least document this behaviour,
>>   leaving this problem to the receiver side.
>
> What precisely do you mean by "concurrent abort"?

With the current patch, the following is possible:
* We start decoding some prepared xact;
* The xact aborts (ABORT PREPARED) for any reason;
* The decoding process notices this on a catalog scan and calls the
  abort() callback;
* Later the decoding process reads the abort record and calls the
  abort_prepared callback.
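
A hedged SQL-level illustration of that race (the table, slot and gid
names are made up; it assumes max_prepared_transactions > 0 and the
patchset applied with a two-phase-aware output plugin):

-- session 1
BEGIN;
INSERT INTO t VALUES (1);
PREPARE TRANSACTION 'gid1';
-- session 2 starts decoding the prepared xact, e.g. via
-- pg_logical_slot_get_changes('slot1', NULL, NULL), and is still busy...
-- session 1, meanwhile:
ROLLBACK PREPARED 'gid1';
-- session 2 first notices the abort on a catalog scan (abort callback),
-- and later also reads the abort record (abort_prepared callback).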

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] logical decoding of two-phase transactions

2018-08-06 Thread Arseny Sher
ht be worthwhile to put the check
+   !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+   parsed->dbId == ctx->slot->data.database &&
+   !FilterByOrigin(ctx, origin_id) &&

which now appears 3 times, into a separate function.


+* two-phase transactions - we either have to have all of them or none.
+* The filter_prepare callback is optional, but can only be defined when

This reads as somewhat contradictory (all of them or none, yet the
callback is optional); it might be formulated more accurately.


+   /*
+* Capabilities of the output plugin.
+*/
+   boolenable_twophase;

I would rename this to 'supports_twophase' since this is not an option
but a description of the plugin capabilities.


+   /* filter_prepare is optional, but requires two-phase decoding */
+   if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+   ereport(ERROR,
+   (errmsg("Output plugin does not support two-phase decoding, but "
+   "registered filter_prepared callback.")));

Don't think we need to check that...


+* Otherwise call either PREPARE (for twophase transactions) or COMMIT
+* (for regular ones).
+*/
+   if (rbtxn_rollback(txn))
+   rb->abort(rb, txn, commit_lsn);

This is dead code, since we don't have decoding of in-progress xacts
yet.


Third patch:
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each
catalog scan, and point to SetupHistoricSnapshot for the details.


+   /*
+* If CheckXidAlive is valid, then we check if it aborted. If it did, we
+* error out
+*/
+   if (TransactionIdIsValid(CheckXidAlive) &&
+   !TransactionIdIsInProgress(CheckXidAlive) &&
+   !TransactionIdDidCommit(CheckXidAlive))
+   ereport(ERROR,
+   (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+errmsg("transaction aborted during system catalog scan")));

Probably centralize these checks in one function? The same goes for the
'We don't expect direct calls to heap_fetch...' ones.


P.S. It looks like you have torn the thread chain: the In-Reply-To
header of mail [1] is missing. Please don't do that.

[1] 
https://www.postgresql.org/message-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Global snapshots

2018-07-25 Thread Arseny Sher
Hello,

I have looked through the patches and found them pretty accurate. I've
fixed a lot of small issues here and there; an updated patchset is
attached. But first, some high-level notes:

 * I agree that it would be cool to implement functionality like the
   current "snapshot too old": that is, abort a transaction with an old
   global snapshot only if it actually attempted to touch modified data.

 * I also agree with Stas that any attempt to trade oldestxmin in
   gossip between the nodes would drastically complicate this patch and
   make it discussion-prone; it would be nice to first get some feedback
   on the general approach, especially from people trying to distribute
   Postgres.

 * One drawback of these patches is that only REPEATABLE READ is
   supported. For READ COMMITTED, we must export every new snapshot
   generated on coordinator to all nodes, which is fairly easy to
   do. SERIALIZABLE will definitely require chattering between nodes,
   but that's much less demanded isolevel (e.g. we still don't support
   it on replicas).

 * Another somewhat serious issue is that there is a risk of recency
   guarantee violation. If client starts transaction at node with
   lagging clocks, its snapshot might not include some recently
   committed transactions; if client works with different nodes, she
   might not even see her own changes. CockroachDB describes at [1] how
   they and Google Spanner overcome this problem. In short, both set
   hard limit on maximum allowed clock skew.  Spanner uses atomic
   clocks, so this skew is small and they just wait it at the end of
   each transaction before acknowledging the client. In CockroachDB, if
   tuple is not visible but we are unsure whether it is truly invisible
   or it's just the skew (the difference between snapshot and tuple's
   csn is less than the skew), transaction is restarted with advanced
   snapshot. This process is not infinite because the upper border
   (initial snapshot + max skew) stays the same; this is correct as we
   just want to ensure that our xact sees all the committed ones before
   it started. We can implement the same thing.

Now, the description of my mostly cosmetical changes:

 * Don't ERROR in BroadcastStmt to allow us to handle failure manually;
 * Check global_snapshot_defer_time in ImportGlobalSnapshot instead of
   falling on assert;
 * (Arguably) improved comments around locking at circular buffer
   maintenance; also, don't lock procarray during global_snapshot_xmin
   bump.
 * s/snaphot/snapshot, other typos.
 * Don't enable track_global_snapshots by default -- while handy for
   testing, it doesn't look like a good general default.
 * Set track_global_snapshots = true in tests everywhere.
 * GUC renamed from postgres_fdw.use_tsdtm to
   postgres_fdw.use_global_snapshots for consistency.
 * 003_bank_shared.pl test is removed. In current shape (loading one
   node) it is useless, and if we bombard both nodes, deadlock surely
   appears. In general, global snaphots are not needed for such
   multimaster-like setup -- either there are no conflicts and we are
   fine, or there is a conflict, in which case we get a deadlock.
 * Fix initdb failure with non-zero global_snapshot_defer_time.
 * Enforce REPEATABLE READ, since currently we export the snap only once
   per xact.
 * Remove assertion that circular buffer entries are monotonic, as
   GetOldestXmin *can* go backwards.


[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 21687e75366df03b92e48c6125bb2e90d01bb70a Mon Sep 17 00:00:00 2001
From: Stas Kelvich 
Date: Wed, 25 Apr 2018 16:05:46 +0300
Subject: [PATCH 1/3] GlobalCSNLog SLRU

---
 src/backend/access/transam/Makefile |   3 +-
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 14 files changed, 510 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..03aa360ea3 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_

Re: Possible bug in logical replication.

2018-07-09 Thread Arseny Sher


Michael Paquier  writes:

> Okay, let's do as you suggest then.  Do you find the attached adapted?

Yes, thanks!

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: pgsql: Fix "base" snapshot handling in logical decoding

2018-07-02 Thread Arseny Sher

Arseny Sher  writes:

> There is also one thing that puzzles me as I don't know much about
> vacuum internals. If I do plain VACUUM of pg_attribute in the test, it
> shouts "catalog is missing 1 attribute(s) for relid" error (which is
> quite expected), while with 'VACUUM FULL pg_attribute' the tuple is
> silently (and wrongly, with dropped column missing) decoded. Moreover,
> if I perform the test manually, and do 'VACUUM FULL;', sometimes test
> becomes useless -- that is, tuple is successfully decoded with all three
> columns, as though VACUUM was not actually executed. All this is without
> the main patch, of course. I think I will look into this soon.

So I have been jumping around this and learned a few curious things.

1) The test in its current shape indeed sometimes doesn't fulfill its
  aim -- that is, despite the issued VACUUM the tuple is still decoded
  with all three fields. This happens because during an attempt to
  advance xmin there is a good chance of encountering an
  xl_running_xacts record logged before our CHECKPOINT (they are logged
  every 15 seconds). We do not try to advance xmin twice without client
  acknowledgment, so in this case xmin will not be advanced far enough
  to allow vacuum to prune the entry from pg_attribute.

2) This is not easy to notice because often (but not always) the
   explicit VACUUM is not needed at all: the tuple is often pruned by
   microvacuum (heap_page_prune_opt) right in the final decoding
   session. If we hadn't bumped xmin far enough during the previous
   get_changes, we do that now, so microvacuum actually purges the
   entry. But if we were so unfortunate that 1) an extra
   xl_running_xacts was logged and 2) microvacuum was in a bad mood and
   didn't come, pg_attribute is not vacuumed and the test becomes
   useless. To make this bulletproof, in the attached patch I doubled
   the first get_changes: now there are two client acks, so our VACUUM
   always does the job.

3) As a side note, the answer to my question 'why do we get different
   errors with VACUUM and VACUUM FULL' is the following. With VACUUM
   FULL, not only is the old pg_attribute entry pruned, but the xmin of
   the new entry with attisdropped=true is also reset to the frozen xid.
   This means that the decoding session (RelationBuildTupleDesc)
   actually sees 3 attributes, and the fact that one of them is dropped
   doesn't bother this function (apparently relnatts in pg_class is
   never decremented) -- we just go ahead and decode only the live
   attributes.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out
index d09342c4be..d1b4f17e3a 100644
--- a/contrib/test_decoding/expected/oldest_xmin.out
+++ b/contrib/test_decoding/expected/oldest_xmin.out
@@ -1,6 +1,6 @@
 Parsed test spec with 2 sessions
 
-starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s1_commit s0_vacuum s0_get_changes
+starting permutation: s0_begin s0_getxid s1_begin s1_insert s0_alter s0_commit s0_checkpoint s0_get_changes s0_get_changes s1_commit s0_vacuum s0_get_changes
 step s0_begin: BEGIN;
 step s0_getxid: SELECT txid_current() IS NULL;
 ?column?   
@@ -14,8 +14,11 @@ step s0_checkpoint: CHECKPOINT;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data   
 
+step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data   
+
 step s1_commit: COMMIT;
-step s0_vacuum: VACUUM FULL;
+step s0_vacuum: VACUUM pg_attribute;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data   
 
diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec
index 4f8af70aa2..141fe2b145 100644
--- a/contrib/test_decoding/specs/oldest_xmin.spec
+++ b/contrib/test_decoding/specs/oldest_xmin.spec
@@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; }
 step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; }
 step "s0_commit" { COMMIT; }
 step "s0_checkpoint" { CHECKPOINT; }
-step "s0_vacuum" { VACUUM FULL; }
+step "s0_vacuum" { VACUUM pg_attribute; }
 step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
 
 session "s1"
@@ -30,8 +30,11 @@ step "s1_begin" { BEGIN; }
 step "s1_insert" { INSERT INTO harvest VALUES ((1, 2, 3)); }
 step "s1_commit" { COMMIT; }
 
-# Checkpoint with following get_changes forces to advance xmin. ALTER of a
+# Checkpoint with following get_changes forces xmin a

Re: pgsql: Fix "base" snapshot handling in logical decoding

2018-06-30 Thread Arseny Sher

Tom Lane  writes:

> Alvaro Herrera  writes:
>>> On 2018-Jun-28, Tom Lane wrote:
>>>> According to buildfarm member friarbird, and as confirmed here,
>>>> the contrib/test_decoding/specs/oldest_xmin.spec test added by this
>>>> commit fails under CLOBBER_CACHE_ALWAYS.
>
>> I suppose 60 seconds (isolationtester's default timeout) is just not
>> enough time for those machines.  We could increase it to 180 seconds and
>> see if that's enough to make them pass ...
>
> What I want to know is why this test is doing a database-wide VACUUM FULL
> in the first place.  If that isn't profligate wastage of testing cycles,
> why not?
>
>   regards, tom lane

Oh, that's my fault. I think just VACUUM pg_attribute is enough there --
it takes 42 ms on my laptop with CLOBBER_CACHE_ALWAYS ('VACUUM FULL'
takes about 1 minute); a patch is attached. The test still fails
steadily without the main patch.

There is also one thing that puzzles me as I don't know much about
vacuum internals. If I do plain VACUUM of pg_attribute in the test, it
shouts "catalog is missing 1 attribute(s) for relid" error (which is
quite expected), while with 'VACUUM FULL pg_attribute' the tuple is
silently (and wrongly, with dropped column missing) decoded. Moreover,
if I perform the test manually, and do 'VACUUM FULL;', sometimes test
becomes useless -- that is, tuple is successfully decoded with all three
columns, as though VACUUM was not actually executed. All this is without
the main patch, of course. I think I will look into this soon.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/test_decoding/expected/oldest_xmin.out b/contrib/test_decoding/expected/oldest_xmin.out
index d09342c4be..a5d7d3c2ca 100644
--- a/contrib/test_decoding/expected/oldest_xmin.out
+++ b/contrib/test_decoding/expected/oldest_xmin.out
@@ -15,7 +15,7 @@ step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slo
 data   
 
 step s1_commit: COMMIT;
-step s0_vacuum: VACUUM FULL;
+step s0_vacuum: VACUUM pg_attribute;
 step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 data   
 
diff --git a/contrib/test_decoding/specs/oldest_xmin.spec b/contrib/test_decoding/specs/oldest_xmin.spec
index 4f8af70aa2..cd2b69cae2 100644
--- a/contrib/test_decoding/specs/oldest_xmin.spec
+++ b/contrib/test_decoding/specs/oldest_xmin.spec
@@ -22,7 +22,7 @@ step "s0_getxid" { SELECT txid_current() IS NULL; }
 step "s0_alter" { ALTER TYPE basket DROP ATTRIBUTE mangos; }
 step "s0_commit" { COMMIT; }
 step "s0_checkpoint" { CHECKPOINT; }
-step "s0_vacuum" { VACUUM FULL; }
+step "s0_vacuum" { VACUUM pg_attribute; }
 step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
 
 session "s1"


Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-26 Thread Arseny Sher


Alvaro Herrera  writes:

> I'm struggling with this assert.  I find that test_decoding passes tests
> with this assertion in branch master, but fails in 9.4 - 10.  Maybe I'm
> running the tests wrong (no assertions in master?) but I don't see it.
> It *should* fail ...

Your v3 patch fails for me on the freshest master (4d54543efa) at
exactly this assert; it looks like there is something wrong in your
setup.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-26 Thread Arseny Sher


Alvaro Herrera  writes:

> Firstly -- this is top-notch detective work, kudos and thanks for the
> patch and test cases.  (I verified that both fail before the code fix.)

Thank you!

> Here's a v3.  I applied a lot of makeup in order to try to understand
> what's going on.  I *think* I have a grasp on the original code and your
> bugfix, not terribly firm I admit.

Well, when I started digging into it, I found the logical decoding code
somewhat underdocumented. If you or any other committer would consider a
patch trying to improve the comments, I can prepare one.

> * you don't need to Assert that things are not NULL if you're
>   immediately going to dereference them.

Indeed.

> * I think setting snapshot_base to NULL in ReorderBufferCleanupTXN is
>   pointless, since the struct is gonna be freed shortly afterwards.

Yeah, but it is the general style of the original code, e.g. see the
/* cosmetic... */
comments. It probably makes the picture a bit more aesthetic and
consistent, a kind of final chord, though I don't mind removing it.

> * I rewrote many comments (both existing and some of the ones your patch
>   adds), and added lots of comments where there were none.

Now the code is nicer, thanks.

> * I'm a bit unsure about the comment atop ReorderBufferSetBaseSnapshot.
>   Obviously, the bit within the #if 0/#endif I'm going to remove before
>   push.

It looks like you've started editing that bit and didn't finish.

>   I don't understand why it says "Needs to be called before any
>   changes are added with ReorderBufferQueueChange"; but if you edit that
>   function and add an assert that the base snapshot is set, it crashes
>   pretty quickly in the test_decoding tests.  (The supposedly bogus
>   comment was there before your patch -- I'm not saying your comment
>   addition is faulty.)

That's because REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID changes are
queued whenever we have read that the xact has modified the catalog,
regardless of base snapshot existence. Even if there are no changes yet,
we will need it for correct catalog visibility, so I am inclined to
remove the assert and the comment, or rephrase the latter with 'any
*data-carrying* changes'.

> * I also noticed that we're doing subtxn cleanup one by one in both
>   ReorderBufferAssignChild and ReorderBufferCommitChild, which means the
>   top-level txn is sought in the hash table over and over, which seems a
>   bit silly.  Not this patch's problem to fix ...

There is already a one-entry cache in ReorderBufferTXNByXid. We could
add a 'don't cache me' flag to it and use it for subxacts, or simply
pull the top-level ReorderBufferTXN lookup out of the loops.

I'm fine with the rest of your edits. One more little comment:

@@ -543,9 +546,10 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
ReorderBufferTXN *txn;

txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+   Assert(txn->base_snapshot != NULL || txn->is_known_as_subxact);

change->lsn = lsn;
-   Assert(InvalidXLogRecPtr != lsn);
+   Assert(!XLogRecPtrIsInvalid(lsn));
	dlist_push_tail(&txn->changes, &change->node);
txn->nentries++;
txn->nentries_mem++;

Since we do that, probably we should replace all LSN validity checks
with XLogRecPtrIsInvalid?

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-21 Thread Arseny Sher
Hello,

Thank you for your time.

Alvaro Herrera  writes:

> On 2018-Jun-11, Antonin Houska wrote:
>
>>
>> One comment about the coding: since you introduce a new transaction list and
>> it's sorted by LSN, I think you should make the function AssertTXNLsnOrder()
>> more generic and use it to check the by_base_snapshot_lsn list too. For
>> example call it after the list insertion and deletion in
>> ReorderBufferPassBaseSnapshot().

Ok, probably this makes sense. Updated patch is attached.

>> Also I think it makes no harm if you split the diff into two, although both
>> fixes use the by_base_snapshot_lsn field.
>
> Please don't.

Yeah, I don't think we should bother with that.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 61b4b0a89c95f8912729c54c013b64033f88fccf Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Thu, 21 Jun 2018 10:29:07 +0300
Subject: [PATCH] Fix slot's xmin advancement and subxact's lost snapshots in
 decoding

There are two in some way related bugs here. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of slot was set to
oldest running xid in the record, which is wrong: actually snapshots which will
be used for not-yet-replayed at this moment xacts might consider older xacts as
running too. The problem wasn't noticed earlier because DDL which allows to
delete tuple (set xmax) while some another not yet committed xact looks at it
is pretty rare, if not unique: e.g. all forms of ALTER TABLE which change schema
acquire ACCESS EXCLUSIVE lock conflicting with any inserts. Included test
case uses ALTER of a composite type, which doesn't have such interlocking.

To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. Proposed
fix adds yet another list of ReorderBufferTXNs to it, where xacts are sorted by
LSN of the base snapshot, because
 * Plain usage of current ReorderBufferGetOldestTXN (to get base snapshot's
   xmin) is wrong, because order by LSN of the first record is *not* the same as
   order by LSN of record on which base snapshot was assigned: many record types
   don't need the snapshot, e.g. xl_xact_assignment. We could remember
   snapbuilder's xmin during processing of the first record, but that's too
   conservative and we would keep more tuples from vacuuming than needed.
 * On the other hand, we can't fully replace existing order by LSN of the first
   change with LSN of the base snapshot, because if we do that, WAL records
   which didn't forced receival of base snap might be recycled before we
   replay the xact, as ReorderBufferGetOldestTXN determines which LSN still must
   be kept.

Second issue concerns subxacts. Currently SnapBuildDistributeNewCatalogSnapshot
never spares a snapshot to known subxact. It means that if toplevel doesn't have
base snapshot (never did writes), there was an assignment record (subxact is
known) and subxact is writing, no new snapshots are being queued. Probably the
simplest fix would be to employ freshly baked by_base_snapshot_lsn list and just
distribute snapshot to all xacts with base snapshot whether they are subxacts or
not. However, in this case we would queue unneccessary snapshot to all already
known subxacts. To avoid this, in this patch base snapshot of known subxact
is immediately passed to the toplevel as soon as we learn the kinship / base
snap assigned -- we have to do that anyway in ReorderBufferCommitChild; new
snaps are distributed only to top-level xacts (or unknown subxacts) as before.

Also, minor memory leak is fixed: counter of toplevel's old base
snapshot was not decremented when snap has been passed from child.
---
 contrib/test_decoding/Makefile |   3 +-
 contrib/test_decoding/expected/oldest_xmin.out |  27 +++
 .../test_decoding/expected/subxact_snap_pass.out   |  49 +
 contrib/test_decoding/specs/oldest_xmin.spec   |  37 
 contrib/test_decoding/specs/subxact_snap_pass.spec |  42 +
 src/backend/replication/logical/reorderbuffer.c| 208 +++--
 src/backend/replication/logical/snapbuild.c|  15 +-
 src/include/replication/reorderbuffer.h|  15 +-
 8 files changed, 330 insertions(+), 66 deletions(-)
 create mode 100644 contrib/test_decoding/expected/oldest_xmin.out
 create mode 100644 contrib/test_decoding/expected/subxact_snap_pass.out
 create mode 100644 contrib/test_decoding/specs/oldest_xmin.spec
 create mode 100644 contrib/test_decoding/specs/subxact_snap_pass.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 1d601d8144..05d7766cd5 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -50,7 +50,8 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	$(pg_regress_installcheck) \
 	$(REGRESS

Re: Possible bug in logical replication.

2018-06-20 Thread Arseny Sher

Michael Paquier  writes:

> On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
>> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
>> It seems to me that we still want to have the slot forwarding finish in
>> this case even if this is interrupted.  Petr, isn't that the intention
>> here?
>
> I have been chewing a bit more on the proposed patch, finishing with the
> attached to close the loop.  Thoughts?

Sorry for being pedantic, but it seems to me worthwhile to mention *why*
we need decoding machinery at all -- like I wrote:

+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).

Also,

>   * The slot's restart_lsn is used as start point for reading records,

This is clear from the code; I propose to remove it.

>   * while confirmed_lsn is used as base point for the decoding context.

And as I wrote, this doesn't matter, since changes are not produced.

>   * The LSN position to move to is checked by doing a per-record scan and
>   * logical decoding which makes sure that confirmed_lsn is updated to a
>   * LSN which allows the future slot consumer to get consistent logical
> - * changes.
> + * changes.  As decoding is done with fast_forward mode, no changes are
> + * actually generated.

confirmed_lsn is always updated to `moveto` unless we run out of WAL
earlier (and unless we try to move the slot backwards, which is
obviously forbidden) -- consistent changes are practically irrelevant
here. Moreover, we can directly set confirmed_lsn and still have
consistent changes afterwards, since the restart_lsn and xmin of the
slot are not touched. What we actually do here is try to advance
*restart_lsn and xmin* as far as we can, but only up to a point which
ensures that decoding can assemble a consistent snapshot allowing all
COMMITs since the updated `confirmed_flush_lsn` to be fully decoded. All
this happens in SnapBuildProcessRunningXacts.
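
For context, the SQL-level wrapper whose helper is being discussed is
invoked roughly like this (the slot name and target LSN are just an
example):

SELECT pg_replication_slot_advance('my_slot', pg_current_wal_lsn());

Whatever the helper manages to do with restart_lsn and xmin is what such
a call delivers to the user.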

> It seems to me that we still want to have the slot forwarding finish in
> this case even if this is interrupted.  Petr, isn't that the intention
> here?

Probably, though I am not sure what the point of this is. OK, I keep
this check in the updated patch (with your comments addressed) and am
CC'ing Petr.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61588d626f..76bafca41c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -335,11 +335,14 @@ CreateInitDecodingContext(char *plugin,
  *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
  *		from the slot's confirmed_flush; otherwise, start from the specified
  *		location (but move it forwards to confirmed_flush if it's older than
- *		that, see below).
+ *		that, see below). Doesn't matter in fast_forward mode.
  *
  * output_plugin_options
  *		contains options passed to the output plugin.
  *
+ * fast_forward
+ *		bypasses the generation of logical changes.
+ *
  * read_page, prepare_write, do_write, update_progress
  *		callbacks that have to be filled to perform the use-case dependent,
  *		actual work.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..0a4985ef8c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -341,12 +341,10 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 
 /*
  * Helper function for advancing logical replication slot forward.
- * The slot's restart_lsn is used as start point for reading records,
- * while confirmed_lsn is used as base point for the decoding context.
- * The LSN position to move to is checked by doing a per-record scan and
- * logical decoding which makes sure that confirmed_lsn is updated to a
- * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).
+ * We do it in special fast_forward mode without actual replay.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -358,7 +356,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 
 	PG_TRY();
 	{
-		/* restart at slot's confirmed_flush */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 	NIL,
 	true,
@@ -388,10 +385,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 */
 			startlsn = InvalidXLogRecPtr;
 
-			/*
-			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-			 * store the description into our tuplestore.
-			 */
+			/* Changes are 

Re: Possible bug in logical replication.

2018-06-15 Thread Arseny Sher

Alvaro Herrera  writes:

> Can somebody (Arseny, Konstantin, horiguti, Sawada) please confirm that
> Michaël's commit fixes the reported bug?

I confirm that starting to read WAL from restart_lsn, as implemented in
f731cfa, fixes this issue, as well as the second issue tushar mentioned
at [1]. I think the code can still be improved a bit though --
consider the attached patch:
* pg_logical_replication_slot_advance comment was not very informative
  and even a bit misleading: it said that we use confirmed_flush_lsn and
  restart_lsn, but didn't explain why.
* Excessive check in its main loop.
* Copy-paste comment fix.

[1] 
https://www.postgresql.org/message-id/5f85bf41-098e-c4e1-7332-9171fef57a0a%40enterprisedb.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From d8ed8ae3eec54b716d7dbb35379d0047a96c6c75 Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Fri, 15 Jun 2018 18:11:17 +0300
Subject: [PATCH] Cosmetic review for f731cfa.

* pg_logical_replication_slot_advance comment was not very informative and even
  a bit misleading: it said that we use confirmed_flush_lsn and restart_lsn, but
  didn't explain why.
* Excessive check in its main loop.
* Copy-paste comment fix.
---
 src/backend/replication/slotfuncs.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..597e81f4d9 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -341,24 +341,26 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 
 /*
  * Helper function for advancing logical replication slot forward.
- * The slot's restart_lsn is used as start point for reading records,
- * while confirmed_lsn is used as base point for the decoding context.
- * The LSN position to move to is checked by doing a per-record scan and
- * logical decoding which makes sure that confirmed_lsn is updated to a
- * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn allowing
+ * to recycle WAL and old catalog tuples. We do it in special fast_forward
+ * mode without actual replay.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
 {
 	LogicalDecodingContext *ctx;
 	ResourceOwner old_resowner = CurrentResourceOwner;
+	/*
+	 * Start reading WAL at restart_lsn, which certainly points to the valid
+	 * record.
+	 */
 	XLogRecPtr	startlsn = MyReplicationSlot->data.restart_lsn;
 	XLogRecPtr	retlsn = MyReplicationSlot->data.confirmed_flush;
 
 	PG_TRY();
 	{
-		/* restart at slot's confirmed_flush */
+		/* start_lsn doesn't matter here, we don't replay xacts at all */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 	NIL,
 	true,
@@ -388,17 +390,10 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 */
 			startlsn = InvalidXLogRecPtr;
 
-			/*
-			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-			 * store the description into our tuplestore.
-			 */
+			/* Changes are not actually produced in fast_forward mode. */
 			if (record != NULL)
 LogicalDecodingProcessRecord(ctx, ctx->reader);
 
-			/* Stop once the moving point wanted by caller has been reached */
-			if (moveto <= ctx->reader->EndRecPtr)
-break;
-
 			CHECK_FOR_INTERRUPTS();
 		}
 
-- 
2.11.0



Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-05-25 Thread Arseny Sher

Michael Paquier <mich...@paquier.xyz> writes:

> but for now I would recommend to register this patch to the next
> commit fest under the category "Bug Fixes" so as it does not fall into
> the cracks: https://commitfest.postgresql.org/18/
>
> I have added an entry to the open items in the section for older bugs:
> https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Older_Bugs
> However this list tends to be... Er... Ignored.
>

Thank you, Michael. I have created the commitfest entry.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Possible bug in logical replication.

2018-05-25 Thread Arseny Sher

Michael Paquier <mich...@paquier.xyz> writes:

> Maybe I am being too naive, but wouldn't it be just enough to update the
> confirmed flush LSN to ctx->reader->ReadRecPtr?  This way, the slot
> advances up to the beginning of the last record where user wants to
> advance, and not the beginning of the next record:

The same problem would have to be handled in pg_logical_slot_get_changes_guts
and in apply worker feedback; besides, there is a convention that all
commits since confirmed_flush must be decoded, so we would risk decoding
such a boundary commit twice.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Possible bug in logical replication.

2018-05-25 Thread Arseny Sher
Hello,

Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:

> restart_lsn stays at the beginning of a transaction until the
> transaction ends so just using restart_lsn allows repeated
> decoding of a transaction, in short, rewinding occurs. The
> function works only for inactive slot so the current code works
> fine on this point.

Sorry, I do not follow. restart_lsn is advanced whenever a consistent
snapshot is dumped (in xl_running_xacts) which is old enough to wholly
decode all xacts not yet confirmed by the client. Could you please
elaborate on what's wrong with that?

> Addition to that restart_lsn also can be on a
> page boundary.

Do you have an example of that? restart_lsn is initially set to the WAL
insert position in ReplicationSlotReserveWal, and later it always points
to an xl_running_xacts record with a consistent snapshot dumped.
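
To watch this from SQL I just consume the slot and look at the catalog;
the slot name below is only an illustration:

    SELECT count(*) FROM pg_logical_slot_get_changes('myslot', NULL, NULL);
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
      FROM pg_replication_slots WHERE slot_name = 'myslot';

confirmed_flush_lsn moves right away, while restart_lsn catches up only
once a suitable xl_running_xacts record has been decoded.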

> So directly set ctx->reader->EndRecPtr by startlsn fixes the
> problem, but I found another problem here.

There is a minor issue with the patch. With it, slot advancement hangs
polling for new WAL on my example from [1]; most probably because we
must exit the loop once ctx->reader->EndRecPtr == moveto.

> The function accepts any LSN even if it is not at the beginning
> of a record. We will see errors or crashes or infinite waiting or
> maybe any kind of trouble by such values. The moved LSN must
> always be at the "end of a record" (that is, at the start of the
> next record). The attached patch also fixes this.

Indeed, but we have these problems only if we try to read WAL starting
from confirmed_flush.

[1] https://www.postgresql.org/message-id/873720e4hf.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Possible bug in logical replication.

2018-05-17 Thread Arseny Sher

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

> I think that using restart_lsn instead of confirmed_flush is not right 
> approach.
> If restart_lsn is not available and confirmed_flush is pointing to page
> boundary, then in any case we should somehow handle this case and adjust
> startlsn to point on the valid record position (by just adding page header
> size?).

Well, restart_lsn is always available on a live slot: it is initially set
in ReplicationSlotReserveWal during slot creation.
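
E.g. (the slot name is arbitrary):

    SELECT pg_create_logical_replication_slot('s_example', 'test_decoding');
    -- true immediately, before anything is decoded or confirmed
    SELECT restart_lsn IS NOT NULL AS have_restart_lsn
      FROM pg_replication_slots WHERE slot_name = 's_example';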


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher

Simon Riggs <si...@2ndquadrant.com> writes:

> Indexes on foreign tables cause an ERROR, so yes, we already just
> don't create them.
>
> You're suggesting silently skipping the ERROR. I can't see a reason for that.

True, I was inaccurate. I mean that index propagation is a nice feature,
and making it available for a mix of foreign and local partitions by
skipping the foreign ones is useful for users who understand the
consequences (of course there should be a warning in the docs -- my
patch was just an illustration of the idea). As I said, FDW-based
sharding often involves a mix of foreign and local partitions, even in
simple manual forms. Imagine that you have a table which is too big to
fit on one server, so you partition it and move some partitions to
another machine; you still would like to access the whole table to avoid
manual tuple routing. Or you partition a table by timestamps and keep
recent data on a fast primary server, gradually moving old partitions to
another box. Again, it would be nice to access the table as a whole,
e.g. to calculate some analytics.

Anyway, given other opinions, I assume the community has chosen the
more conservative approach. Indeed, we can't guarantee uniqueness this
way; that is fully the user's responsibility.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher

Simon Riggs <si...@2ndquadrant.com> writes:

> How much sense is it to have a partitioned table with a mix of local
> and foreign tables?

Well, as much sense as fdw-based sharding has, for instance. It is
arguable, but it exists.

> Shouldn't the fix be to allow creation of indexes on foreign tables?
> (Maybe they would be virtual or foreign indexes??)

Similar ideas were discussed at [1]. There was no wide consensus even on
what problems such a feature would solve. Since indexes on foreign
tables are currently just forbidden, it seems to me that the best the
partitioning code can do today is simply not create them.

[1] 
https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher
Hi,

8b08f7d4 added propagation of indexes on partitioned tables to their
partitions, which is very cool. However, index creation also recurses
down to foreign tables. I doubt this is intentional, as such indexes are
forbidden as not making much sense; an attempt to create an index on a
partitioned table with a foreign partition now leads to an error, as in
the reproduction below. The attached lines fix this.
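
(postgres_fdw and the server name here are just for illustration.)

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (dbname 'postgres');
    CREATE TABLE p (i int) PARTITION BY RANGE (i);
    CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (10);
    CREATE FOREIGN TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20)
        SERVER loopback;
    CREATE INDEX ON p (i);  -- errors out while recursing into p2
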
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** DefineIndex(Oid relationId,
*** 915,920 
--- 915,926 
  int			maplen;
  
  childrel = heap_open(childRelid, lockmode);
+ /* Foreign table doesn't need indexes */
+ if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ {
+ 	heap_close(childrel, NoLock);
+ 	continue;
+ }
  childidxs = RelationGetIndexList(childrel);
  attmap =
  	convert_tuples_by_name_map(RelationGetDescr(childrel),
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** AttachPartitionEnsureIndexes(Relation re
*** 14352,14357 
--- 14352,14361 
  	MemoryContext cxt;
  	MemoryContext oldcxt;
  
+ 	/* Foreign table doesn't need indexes */
+ 	if (attachrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ 		return;
+ 
  	cxt = AllocSetContextCreate(CurrentMemoryContext,
  "AttachPartitionEnsureIndexes",
  ALLOCSET_DEFAULT_SIZES);

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-04-16 Thread Arseny Sher
(delicate ping)



Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-04-07 Thread Arseny Sher
Hello,

I've discovered a couple of bugs in the logical decoding code, both leading
to incorrect decoding results in somewhat rare cases. First, the xmin of
slots is advanced too early. This affects the results only when the
interlocking allows DDL to be performed concurrently with looking at the
schema. In fact, I was not aware of such DDL until I raised this question at
https://www.postgresql.org/message-id/flat/87tvu0p0jm.fsf%40ars-thinkpad#87tvu0p0jm.fsf@ars-thinkpad
and Andres pointed out ALTER of composite types. Probably there are
others, I am not sure; it would be interesting to know them.
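
A minimal sketch of the kind of DDL I mean (names are made up):

    CREATE TYPE basket AS (apples int, pears int);
    CREATE TABLE harvest (fruits basket);
    -- unlike schema-changing ALTER TABLE, this does not conflict with
    -- concurrent inserts into harvest, yet it updates catalog rows
    -- (setting xmax on the old versions) that decoding of those inserts
    -- may still need
    ALTER TYPE basket DROP ATTRIBUTE pears;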

Another problem is that new snapshots are never queued to known
subxacts. It means decoding results can be wrong if the toplevel doesn't
write anything while a subxact does.

Please see a detailed description of the issues, tests which reproduce
them, and fixes in the attached patch.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 0d03ef172a29ce64a6c5c26f484f0f3186895125 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-...@yandex.ru>
Date: Sat, 7 Apr 2018 08:22:30 +0300
Subject: [PATCH] Fix slot's xmin advancement and subxact's lost snapshots in
 decoding.

There are two somewhat related bugs here. First, the xmin of logical slots was
advanced too early. During xl_running_xacts processing, the slot's xmin was set
to the oldest running xid in the record, which is wrong: snapshots which will
be used for xacts not yet replayed at this moment might consider older xacts as
running too. The problem wasn't noticed earlier because DDL which allows a
tuple to be deleted (xmax set) while another not-yet-committed xact looks at it
is pretty rare, if not unique: e.g. all forms of ALTER TABLE which change the
schema acquire an ACCESS EXCLUSIVE lock conflicting with any inserts. The
included test case uses ALTER of a composite type, which doesn't have such
interlocking.

To deal with this, we must be able to quickly retrieve the oldest xmin
(oldest running xid among all assigned snapshots) from the ReorderBuffer. The
proposed fix adds yet another list of ReorderBufferTXNs to it, where xacts are
sorted by the LSN of the base snapshot, because
 * Plain usage of the current ReorderBufferGetOldestTXN (to get the base
   snapshot's xmin) is wrong, because the order by LSN of the first record is
   *not* the same as the order by LSN of the record on which the base snapshot
   was assigned: many record types don't need the snapshot,
   e.g. xl_xact_assignment. We could remember the snapbuilder's xmin during
   processing of the first record, but that's too conservative and we would
   keep more tuples from vacuuming than needed.
 * On the other hand, we can't fully replace the existing order by LSN of the
   first change with the LSN of the base snapshot, because if we do that, WAL
   records which didn't force receipt of the base snap might be recycled before
   we replay the xact, as ReorderBufferGetOldestTXN determines which LSN still
   must be kept.

The second issue concerns subxacts. Currently SnapBuildDistributeNewCatalogSnapshot
never spares a snapshot to a known subxact. It means that if the toplevel
doesn't have a base snapshot (it never did writes), there was an assignment
record (the subxact is known) and the subxact is writing, no new snapshots are
queued at all. Probably the simplest fix would be to employ the freshly baked
by_base_snapshot_lsn list and just distribute the snapshot to all xacts with a
base snapshot, whether they are subxacts or not. However, in that case we would
queue unnecessary snapshots to all already known subxacts. To avoid this, in
this patch the base snapshot of a known subxact is immediately passed to the
toplevel as soon as we learn the kinship / the base snap is assigned -- we have
to do that anyway in ReorderBufferCommitChild; new snaps are distributed only
to top-level xacts (or unknown subxacts) as before.

Also, a minor memory leak is fixed: the counter of the toplevel's old base
snapshot was not decremented when the snap was passed from the child.
---
 contrib/test_decoding/Makefile |   3 +-
 contrib/test_decoding/expected/oldest_xmin.out |  27 +++
 .../test_decoding/expected/subxact_snap_pass.out   |  49 ++
 contrib/test_decoding/specs/oldest_xmin.spec   |  37 +
 contrib/test_decoding/specs/subxact_snap_pass.spec |  42 +
 src/backend/replication/logical/reorderbuffer.c| 183 ++---
 src/backend/replication/logical/snapbuild.c|  15 +-
 src/include/replication/reorderbuffer.h|  15 +-
 8 files changed, 304 insertions(+), 67 deletions(-)
 create mode 100644 contrib/test_decoding/expected/oldest_xmin.out
 create mode 100644 contrib/test_decoding/expected/subxact_snap_pass.out
 create mode 100644 contrib/test_decoding/specs/oldest_xmin.spec
 create mode 100644 contrib/test_decoding/specs/subxact_snap_pass.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 6c18189d9d..c696288d67 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5

Re: Why chain of snapshots is used in ReorderBufferCommit?

2018-03-01 Thread Arseny Sher
Andres Freund <and...@anarazel.de> writes:

> I don't think that's right. For one there's plenty types of DDL where no
> such locking exists, consider e.g. composite datums where additional
> columns can be added even after such a type is already used by another
> table.

Oh, indeed. Never mind the last paragraph.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: GSoC 2018

2017-12-15 Thread Arseny Sher
Aleksander Alekseev <a.aleks...@postgrespro.ru> writes:

> Hi Stephen,
>
>> synchronous_commit isn't the same as synchronous_standby_names ...
>
> What about synchronous_standby_names? Logical replication ignores it and
> doesn't wait for ack's from replicas or something?

It actually works, see [1]:

The name of a standby server for this purpose is the application_name
setting of the standby, as set in the standby's connection
information. In case of a physical replication standby, this should be
set in the primary_conninfo setting in recovery.conf; the default is
walreceiver. For logical replication, this can be set in the connection
information of the subscription, and it defaults to the subscription
name.

[1] 
https://www.postgresql.org/docs/current/static/runtime-config-replication.html
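
For example, with a subscription created as

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub;

it should be enough to set on the publisher (subscription and publication
names are of course arbitrary):

    ALTER SYSTEM SET synchronous_standby_names = 'mysub';
    SELECT pg_reload_conf();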

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company