Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-17 Thread Amit Kapila
On Thu, Jun 18, 2020 at 8:01 AM Masahiko Sawada
 wrote:
>
> On Wed, 17 Jun 2020 at 20:14, Amit Kapila  wrote:
> >
> >
> > I had written the above in the context of persisting these stats.  I mean
> > that if the process has bounced or the server has restarted, then the
> > previous stats might not make much sense, because we were planning to
> > use the pid [1], so the stats from a process that has exited might not be
> > very meaningful.  Or do you think that is okay?  If we don't want to
> > persist the stats and their lifetime is only while the process is alive,
> > then we are fine.
> >
>
> Sorry for confusing you. My idea above is about having the stats per
> slot. That is, we add spill_txns, spill_count and spill_bytes to
> pg_replication_slots, or to a new view pg_stat_logical_replication_slots
> with columns: slot_name plus these stats columns and stats_reset.
> The idea is that the stats values accumulate until either the slot is
> dropped, the server crashes, the user executes the reset function, or
> logical decoding is performed with a different logical_decoding_work_mem
> value than the previous time. In other words, the stats values are
> reset in any of those cases. That way, I think the stats values always
> correspond to logical decoding using the same slot with the same
> logical_decoding_work_mem value.
>
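As a rough illustration of the quoted proposal, the per-slot interface might
be queried like this (the view, column and function names are only the ones
floated in this thread, not an existing or final interface):

    SELECT slot_name, spill_txns, spill_count, spill_bytes, stats_reset
      FROM pg_stat_logical_replication_slots;

    -- a reset function in the same style (name hypothetical)
    SELECT pg_stat_reset_logical_replication_slot('my_slot');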

What if the decoding has been performed by multiple backends using the
same slot?  In that case, it will be difficult to judge the right value
of logical_decoding_work_mem from the stats.  It would make sense if we
provided a way to set logical_decoding_work_mem per slot, but I am not
sure that is better than what we have now.

What problems do we see in displaying these for each process?  I think
users might want to see the stats for exited processes or after a
server restart, but neither of those is possible today.  The stats are
available only while the corresponding WAL sender process is active.
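For reference, the per-process form being discussed here is the set of
spill_* columns attached to pg_stat_replication in the development branch
this thread refers to, e.g. (a sketch, column set not final):

    SELECT pid, application_name, spill_txns, spill_count, spill_bytes
      FROM pg_stat_replication;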

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Re: Re: Re: how to create index concurrently on partitioned table

2020-06-17 Thread 李杰(慎追)
> Not sure I am following.  In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree.  See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create().  It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Let's look at this example:
A partitioned table has five partitions,
parttable: part1, part2, part3, part4, part5
We simply abstract the following definitions:
 phase 1: index_create(), the index is only registered in the catalogs.
 phase 2: index_concurrently_build(), build the index.
 phase 3: validate_index(), insert any missing index entries and mark the
index as valid.

(scheme 1)
```
StartTransaction one
parttable  phase 1
part 1     phase 1
part 2     phase 1
part 3     phase 1
part 4     phase 1
part 5     phase 1
CommitTransaction

StartTransaction two
parttable  phase 2
part 1     phase 2
part 2     phase 2
part 3     phase 2 (error occurred)
part 4     phase 2
part 5     phase 2
CommitTransaction

StartTransaction three
parttable  phase 3
part 1     phase 3
part 2     phase 3
part 3     phase 3
part 4     phase 3
part 5     phase 3
CommitTransaction
...
```
Now, the remaining steps cannot continue because of the error in transaction
two, so transaction two rolls back and transaction three never starts.
All of our indexes are invalid.  In this way,
we ensure strong consistency of the indexes in the partition tree.
However, we then have to rebuild all of the indexes when we run REINDEX again.

(scheme 2)
```
StartTransaction one
parttable  phase 1
part 1     phase 1
part 2     phase 1
part 3     phase 1
part 4     phase 1
part 5     phase 1
CommitTransaction

StartTransaction two
part 1     phase 2
part 1     phase 3
CommitTransaction

StartTransaction three
part 2     phase 2
part 2     phase 3
CommitTransaction

StartTransaction four
part 3     phase 2 (error occurred)
part 3     phase 3
CommitTransaction

StartTransaction five
part 4     phase 2
part 4     phase 3
CommitTransaction

StartTransaction six
part 5     phase 2
part 5     phase 3
CommitTransaction

StartTransaction seven
parttable  phase 2
parttable  phase 3
CommitTransaction
```

Now, the remaining steps cannot continue because of the error in transaction
four, so transaction four rolls back and the transactions after it never
start.
The indexes of the part1 and part2 partitions are available; the other
indexes are invalid.
On a retried REINDEX, we can skip rebuilding part1 and part2.
This seems better, although the partition tree is temporarily inconsistent.

Do you think scheme 2 is more suitable for CIC?
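For what it's worth, a retried REINDEX could identify which partition indexes
are already valid with a catalog query along these lines (a sketch only;
'parttable_idx' stands for the name of the partitioned index):

    SELECT c.relname, i.indisvalid
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indexrelid
     WHERE i.indexrelid IN (SELECT inhrelid
                              FROM pg_inherits
                             WHERE inhparent = 'parttable_idx'::regclass);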


Thank you very much,
 Regards,  Adger








--
From: Michael Paquier 
Sent: Thursday, June 18, 2020 10:41
To: 李杰(慎追) 
Cc: Justin Pryzby ; pgsql-hackers 
; 曾文旌(义从) ; 
Alvaro Herrera 
Subject: Re: Re: Re: Re: how to create index concurrently on partitioned table

On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions, 
> we may need to handle too many missing index entries in
> validate_index().  Especially for the first partition, a long time may
> have passed and many entries may be missing.  In this case, why
> don't we put the second and third phases together into one
> transaction per partition? 

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael


Re: min_safe_lsn column in pg_replication_slots view

2020-06-17 Thread Kyotaro Horiguchi
Thanks for the comments.

At Wed, 17 Jun 2020 21:37:55 +0900, Fujii Masao  
wrote in 
> On 2020/06/15 16:35, Kyotaro Horiguchi wrote:
> > At Mon, 15 Jun 2020 13:44:31 +0900, Michael Paquier
> >  wrote in
> >> On Mon, Jun 15, 2020 at 12:40:03PM +0900, Fujii Masao wrote:
> >>> BTW, I just wonder why each row in pg_replication_slots needs to have
> >>> min_safe_lsn column? Basically min_safe_lsn should be the same between
> >>> every replication slots.
> >>
> >> Indeed, that's confusing in its current shape.  I would buy putting
> >> this value into pg_replication_slots if there were some consistency of
> >> the data to worry about because of locking issues, but here this data
> >> is controlled within info_lck, which is out of the the repslot LW
> >> lock.  So I think that it is incorrect to put this data in this view
> >> and that we should remove it, and that instead we had better push for
> >> a system view that maps with the contents of XLogCtl.
> 
> Agreed. But as you know it's too late to do that for v13...
> So firstly I'd like to fix the issues in pg_replication_slots view,
> and then we can improve the things later for v14 if necessary.
> 
> 
> > It was once the difference from the safe_lsn to the restart_lsn, which is
> > distinct among slots. Then it was changed to the safe_lsn. I agree with
> > the discussion above, but it is needed somewhere, since without it no one
> > can know the margin until the slot goes to the "lost" state.  (Note
> > that currently even wal_status and min_safe_lsn can be inconsistent
> > within a row.)
> > Just for the sake of table-level and in-row consistency, we
> > could just remember the result of XLogGetLastRemovedSegno() around
> > taking info_lck in the function.  That doesn't make a practical
> > difference but makes the view look consistent.
> 
> Agreed. Thanks for the patch. Here are the review comments:
> 
> 
> Not only the last removed segment but also the current write position
> should be obtained at the beginning of pg_get_replication_slots()
> and given to GetWALAvailability(), for consistency?

You are right.  I had vaguely thought that it wasn't needed since
WriteRecPtr doesn't move in steps as wide as removed_segment, but actually
it does move.

> Even after applying your patch, min_safe_lsn is calculated for
> each slot even though the calculated result is always the same,
> which is a bit of a waste of cycles. Shouldn't we calculate min_safe_lsn
> once at the beginning and use it for each slot?

Agreed. That may still result in one wasteful calculation, but it's better
than repeated wasteful calculations.

>   XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0,
> 
> Isn't it better to use 1 as the second argument of the above,
> in order to address the issue that I reported upthread?
> Otherwise, the WAL file name that pg_walfile_name(min_safe_lsn)
> returns
> would be confusing.

Mmm. pg_walfile_name seems too specialized for
pg_stop_backup(). (pg_walfile_name_offset() returns the wrong result at
segment boundaries.)  I'm not eager to do that only to follow such a
suspicious(?) specification, but it would indeed be better in practice.
Please find the attached first patch.  I also found that
there's no reason to hide min_safe_lsn when wal_status has certain
values, so I changed it to always be shown.
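To illustrate the boundary behavior in question (a sketch, assuming the
default 16MB segment size and timeline 1):

    SELECT pg_walfile_name('0/3000000');
    -- 000000010000000000000002: an LSN exactly on a segment boundary is
    -- mapped to the previous segment file
    SELECT pg_walfile_name('0/3000001');
    -- 000000010000000000000003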

By the way, I noticed that when a replication slot reserves all
existing WAL segments, checkpoint cannot remove any files and
lastRemovedSegment is left at 0. The second attached patch forces
RemoveOldXlogFiles to initialize the variable even when no WAL
segments are removed.  It puts no additional load on the file system
since the directory is scanned anyway.  My old proposal to
unconditionally initialize it separately from checkpoint was rejected,
but I think this is acceptable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6742e3b31ac53ef54458d64e34feeac7d4c710d2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 13:36:23 +0900
Subject: [PATCH v2 1/2] Make pg_stat_replication view consistent

min_safe_lsn is not consistent within the result of a query. Make it
consistent. While at it, min_safe_lsn is changed to show the second
byte of a segment instead of the first byte, so that users can get the
intended segment file name from pg_walfile_name.
---
 src/backend/access/transam/xlog.c   | 15 +++--
 src/backend/replication/slotfuncs.c | 35 -
 src/include/access/xlog.h   |  4 +++-
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..940f5fcb18 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9484,6 +9484,9 @@ CreateRestartPoint(int flags)
  * Report availability of WAL for the given target LSN
  *		(typically a slot's restart_lsn)
  *
+ * currWriteLSN and lastRemovedSeg is the results of XLogGetLastRemovedSegno()
+ * and GetXLo

Re: Fast DSM segments

2020-06-17 Thread Thomas Munro
On Thu, Jun 11, 2020 at 5:37 AM Robert Haas  wrote:
> On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro  wrote:
> > That all makes sense.  Now I'm wondering if I should use exactly that
> > word in the GUC... dynamic_shared_memory_preallocate?
>
> I tend to prefer verb-object rather than object-verb word ordering,
> because that's how English normally works, but I realize this is not a
> unanimous view.

It's pretty much just me and Yoda against all the rest of you, so
let's try preallocate_dynamic_shared_memory.  I guess it could also be
min_dynamic_shared_memory to drop the verb.  Other ideas welcome.

> It's a little strange because the fact of preallocating it makes it
> not dynamic any more. I don't know what to do about that.

Well, it's not dynamic at the operating system level, but it's still
dynamic in the sense that PostgreSQL code can get some and give it
back, and there's no change from the point of view of any DSM client
code.
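For concreteness, under the naming proposed in this patch (not part of any
released version, and only effective after a server restart), enabling the
region might look like:

    ALTER SYSTEM SET preallocate_dynamic_shared_memory = '1GB';
    ALTER SYSTEM SET huge_pages = 'try';
    -- after a restart, DSM requests beyond the preallocated region still
    -- fall back to dynamic_shared_memory_type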

Admittedly, the shared memory architecture is a bit confusing.  We
have main shared memory, DSM memory, DSA memory that is inside main
shared memory with extra DSMs as required, DSA memory that is inside a
DSM and creates extra DSMs as required, and with this patch also DSMs
that are inside main shared memory.  Not to mention palloc and
MemoryContexts and all that.  As you probably remember I once managed
to give an internal presentation at EDB for one hour of solid talking
about all the different kinds of allocators and what they're good for.
It was like a Möbius slide deck already.

Here's a version that adds some documentation.
From 8f222062b60d6674cd9f46e716a56201ef498f84 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 9 Apr 2020 14:12:38 +1200
Subject: [PATCH v2] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by preallocate_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Reviewed-by: Robert Haas 
Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  26 +++
 src/backend/storage/ipc/dsm.c | 184 --
 src/backend/storage/ipc/dsm_impl.c|   3 +
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/utils/misc/guc.c  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/backend/utils/mmgr/dsa.c  |   5 +-
 src/include/storage/dsm.h |   3 +
 src/include/storage/dsm_impl.h|   1 +
 9 files changed, 213 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 783bf7a12b..35d342a694 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1831,6 +1831,32 @@ include_dir 'conf.d'
   
  
 
+ 
+  preallocate_dynamic_shared_memory (integer)
+  
+   preallocate_dynamic_shared_memory configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory that should be allocated at server
+startup time for use by parallel queries.  When this memory region is
+insufficient or exhausted by concurrent parallel queries, new
+parallel queries try to allocate extra shared memory temporarily from
+the operating system using the method configured with
+dynamic_shared_memory_type, which may be slower
+due to memory management overheads.
+Memory that is allocated with
+preallocate_dynamic_shared_memory is affected by the
+huge_pages setting on operating systems where that
+is supported, and may be more likely to benefit from larger pages on
+operating systems where page size is managed automatically.  Larger
+memory pages can improve the performance of parallel hash joins.
+The default value is 0 (none).
+   
+  
+ 
+
  
  
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..4f87ece3b3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
@@ -76,6 +78,8 @@ typedef struct dsm_control_item
 {
 	dsm_handle	

Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/18 14:40, Fujii Masao wrote:



On 2020/06/18 11:44, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in

ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.


OK, so what about the attached patch? I added
ConditionVariablePrepareToSleep()
just before entering the "for" loop in
InvalidateObsoleteReplicationSlots().


Thanks.


Thanks for the review!



ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


Fixed.

Also I added "Only one of slot and name can be specified." into
the comments of ReplicationSlotAcquireInternal().



The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.


Yes, you're right. I merged your 0001.patch into mine.

+    if (behavior != SAB_Inquire)
+    ConditionVariablePrepareToSleep(&s->active_cv);
+    else if (behavior != SAB_Inquire)

Isn't "behavior == SAB_Block" condition better here?
I changed the patch that way.

The attached is the updated version of the patch.
I also merged Alvaro's patch into this.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line are needed for the case where slot is
given, but the check is already done in SearchNamedReplicationSlot if slot is
not given.  I would like something like the following instead, but I
don't insist on it.


Yes, I got rid of the strcmp() check but left the in_use check as it is.
I like that because it's simpler.



 ReplicationSlot *s = NULL;
 ...
 if (!slot)
 s = SearchNamedReplicationSlot(name);
 else if(s->in_use && strcmp(name, NameStr(s->data.name)))
 s = slot;


+    ereport(ERROR,
+    (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.


This doesn't happen after applying Alvaro's patch.

BTW, using "name" here is not valid because it may be NULL.
So I added the following code and used "slot_name" in log messages.

+    slot_name = name ? name : NameStr(slot->data.name);


Sorry, this caused compiler failure. So I fixed that and
attached the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..a7bbcf3499 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,117 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning process otherwise.  If behavior is SAB_Error, then trying to
- * acquire an owned slot is an error.  If SAB_Block, we sleep until the
- * slot is released by the owning process.
+ * it's zero if we successfull

Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/18 11:44, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in

ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.


OK, so what about the attached patch? I added
ConditionVariablePrepareToSleep()
just before entering the "for" loop in
InvalidateObsoleteReplicationSlots().


Thanks.


Thanks for the review!



ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


Fixed.

Also I added "Only one of slot and name can be specified." into
the comments of ReplicationSlotAcquireInternal().



The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.


Yes, you're right. I merged your 0001.patch into mine.

+   if (behavior != SAB_Inquire)
+   ConditionVariablePrepareToSleep(&s->active_cv);
+   else if (behavior != SAB_Inquire)

Isn't "behavior == SAB_Block" condition better here?
I changed the patch that way.

The attached is the updated version of the patch.
I also merged Alvaro's patch into this.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line are needed for the case where slot is
given, but the check is already done in SearchNamedReplicationSlot if slot is
not given.  I would like something like the following instead, but I
don't insist on it.


Yes, I got rid of the strcmp() check but left the in_use check as it is.
I like that because it's simpler.



 ReplicationSlot *s = NULL;
 ...
 if (!slot)
 s = SearchNamedReplicationSlot(name);
 else if(s->in_use && strcmp(name, NameStr(s->data.name)))
 s = slot;


+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.


This doesn't happen after applying Alvaro's patch.

BTW, using "name" here is not valid because it may be NULL.
So I added the following code and used "slot_name" in log messages.

+   slot_name = name ? name : NameStr(slot->data.name);

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..77cf366ef1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,123 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning process otherwise.  If behavior is SAB_Error, then trying to
- * acquire an owned slot is an error.  If SAB_Block, we sleep until the
- * slot is released by the owning process.
+ * it's zero if we successfully acquired the slot, -1 if the slot no longer
+ * exists, or the PID of the owning process otherwise.  If behavior is
+ *

Re: Binary transfer vs Text transfer

2020-06-17 Thread Aleksei Ivanov
Hi,

Thanks for the attached link. I also noticed, using iftop, that with the
text transfer the data starts arriving almost immediately, while with the
binary transfer there is a delay of several seconds before the data starts
arriving. Could you suggest where I can look to resolve this or reduce
that delay?
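For reference, the binary-format case above corresponds to a cursor declared
with the BINARY option, roughly like this (table name and fetch size are
placeholders):

    BEGIN;
    DECLARE c BINARY CURSOR FOR SELECT * FROM t;
    FETCH 1000 FROM c;   -- repeat until no rows are returned
    CLOSE c;
    COMMIT;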

Best regards,


On Wed, Jun 17, 2020 at 12:12 Andres Freund  wrote:

> Hi,
>
>
> On 2020-06-17 11:55:44 -0700, Aleksei Ivanov wrote:
> > I have one question related to my current measurement results.
> >
> > I am fetching integers in text format like:
> >
> > Select * from table limit 1000. It takes 18.5 seconds to finish and the
> > transferred data is 633 MB.
> >
> > When I fetch the same data using a binary cursor, the transferred data
> > is 480 MB, but the transfer time is 21.5 seconds.
> >
> > So, could someone explain what the reason might be for the binary
> > transfer time being higher?
>
> This thread might be interesting:
>
> https://www.postgresql.org/message-id/CAMkU%3D1whbRDUwa4eayD9%2B59K-coxO9senDkPRbTn3cg0pUz4AQ%40mail.gmail.com
>
> Greetings,
>
> Andres Freund
>


RE: I'd like to discuss scaleout at PGCon

2020-06-17 Thread tsunakawa.ta...@fujitsu.com
Hello,

It seems you didn't include pgsql-hackers.


From: Sumanta Mukherjee 
> I saw the presentation and it is great, except that it seems to be unclear,
> for both SD and SN, whether the storage and the compute are being explicitly
> separated. Separation of storage and compute would have some cost advantages
> as per my understanding. The following two works (refs below) have some
> information about the usefulness of this technique for scale-out, so it
> would be an interesting addition to see whether the SN architecture that is
> being proposed could be modified to take care of this and reap the
> gain.

Thanks.  Separation of compute and storage certainly needs to be considered.
Unlike the old days, when shared storage was considered a bottleneck with
slow HDDs and FC-SAN, we can now expect high-speed shared storage thanks to
flash memory, NVMe-oF, and RDMA.

> 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
> Transactional Record Manager for Shared Flash. In CIDR 2011.

This is interesting.  I'll look into it.  Do you know of any product based
on Hyder?  OTOH, Hyder seems to require drastic changes when adapting it for
Postgres -- OCC, a log-structured database, etc.  I'd like to hear how feasible
those are.  However, its scale-out capability without the need for data or
application partitioning appears appealing.


To explore another possibility that would have more affinity with the current 
Postgres, let me introduce our proprietary product called Symfoware.  It's not 
based on Postgres.

It has shared nothing scale-out functionality with full ACID based on 2PC, 
conventional 2PL locking and distributed deadlock resolution.  Despite being 
shared nothing, all the database files and transaction logs are stored on 
shared storage.

The database is divided into "log groups."  Each log group has one transaction 
log and multiple tablespaces (it's called "database space" instead of 
tablespace.)

Each DB instance in the cluster owns multiple log groups, and handles 
reads/writes to the data in its log groups.  When a DB instance fails, 
other surviving DB instances take over the log groups of the failed DB 
instance, recover the data using the transaction log of each log group, and 
accept reads/writes to the data in those log groups.  The DBA configures which DB 
instance initially owns which log groups and which DB instances are candidates 
to take over which log groups.

This way, no server is idle as a standby.  All DB instances work hard to 
process read-write transactions.  This "no idle server for HA" is one of the 
things Oracle RAC users want in terms of cost.

However, unlike Hyder, it still requires data and application partitioning.
Can anyone think of a way to eliminate partitioning?  Data and application 
partitioning is what Oracle RAC users want to avoid or cannot tolerate.

Ref: Introduction of the Symfoware shared nothing scale-out called "load share."
https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf


Regards
Takayuki Tsunakawa


--- below is Sumanta's original mail ---
From: Sumanta Mukherjee 
Sent: Wednesday, June 17, 2020 5:34 PM
To: Tsunakawa, Takayuki/綱川 貴之 
Cc: Bruce Momjian ; Merlin Moncure ; 
Robert Haas ; maumau...@gmail.com
Subject: Re: I'd like to discuss scaleout at PGCon

Hello,

I saw the presentation and it is great, except that it seems to be unclear,
for both SD and SN, whether the storage and the compute are being explicitly
separated. Separation of storage and compute would have some cost advantages as per my
understanding. The following two works (refs below) have some information about
the usefulness of this technique for scale-out, so it would be an
interesting addition to see whether the SN architecture that is being proposed
could be modified to take care of this and reap the gain.

1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
Transactional Record Manager for Shared Flash. In CIDR 2011.

2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb.
blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com





Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-17 Thread Tom Lane
I wrote:
> The minimum thing that we have to do, I'd say, is to change the
> documentation to explain what happens if there's no posixrules file.

Here's a proposed patch to do that.  To explain this, we more or less
have to fully document the POSIX timezone string format (otherwise
nobody's gonna understand what "M3.2.0,M11.1.0" means).  That's something
we've glossed over for many years, and I still feel like it's not
something to explain in-line in section 8.5.3, so I shoved all the gory
details into a new section in Appendix B.  To be clear, nothing here is
new behavior, it was just undocumented before.
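As a quick illustration, the US-rules specification mentioned above can be
given directly (this is existing behavior, just previously undocumented):

    SET TIME ZONE 'EST5EDT,M3.2.0,M11.1.0';
    -- EST (UTC-5) as standard time, EDT as daylight-savings time, in effect
    -- from the second Sunday of March to the first Sunday of November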

regards, tom lane

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 3df189ad85..ca61439501 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2492,25 +2492,10 @@ TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02'

 In addition to the timezone names and abbreviations,
 PostgreSQL will accept POSIX-style time zone
-specifications of the form STDoffset or
-STDoffsetDST, where
-STD is a zone abbreviation, offset is a
-numeric offset in hours west from UTC, and DST is an
-optional daylight-savings zone abbreviation, assumed to stand for one
-hour ahead of the given offset. For example, if EST5EDT
-were not already a recognized zone name, it would be accepted and would
-be functionally equivalent to United States East Coast time.  In this
-syntax, a zone abbreviation can be a string of letters, or an
-arbitrary string surrounded by angle brackets (<>).
-When a daylight-savings zone abbreviation is present,
-it is assumed to be used
-according to the same daylight-savings transition rules used in the
-IANA time zone database's posixrules entry.
-In a standard PostgreSQL installation,
-posixrules is the same as US/Eastern, so
-that POSIX-style time zone specifications follow USA daylight-savings
-rules.  If needed, you can adjust this behavior by replacing the
-posixrules file.
+specifications, as described in
+.  This option is not
+normally preferable to using a named time zone, but it may be
+necessary if no suitable IANA time zone entry is available.

   
  
@@ -2537,19 +2522,6 @@ TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02'
  above, this is not necessarily the same as local civil time on that date.
 
 
-
- One should be wary that the POSIX-style time zone feature can
- lead to silently accepting bogus input, since there is no check on the
- reasonableness of the zone abbreviations.  For example, SET
- TIMEZONE TO FOOBAR0 will work, leaving the system effectively using
- a rather peculiar abbreviation for UTC.
- Another issue to keep in mind is that in POSIX time zone names,
- positive offsets are used for locations west of Greenwich.
- Everywhere else, PostgreSQL follows the
- ISO-8601 convention that positive timezone offsets are east
- of Greenwich.
-
-
 
  In all cases, timezone names and abbreviations are recognized
  case-insensitively.  (This is a change from PostgreSQL
diff --git a/doc/src/sgml/datetime.sgml b/doc/src/sgml/datetime.sgml
index 7cce826e2d..fb210b377b 100644
--- a/doc/src/sgml/datetime.sgml
+++ b/doc/src/sgml/datetime.sgml
@@ -555,6 +555,210 @@
 
   
 
+  
+  POSIX Time Zone Specifications
+
+  
+   time zone
+   POSIX-style specification
+  
+
+  
+   PostgreSQL can accept time zone specifications that
+   are written according to the POSIX standard's rules
+   for the TZ environment
+   variable.  POSIX time zone specifications are
+   inadequate to deal with the complexity of real-world time zone history,
+   but there are sometimes reasons to use them.
+  
+
+  
+   A POSIX time zone specification has the form
+
+STD offset  DST  dstoffset   , rule  
+
+   (For readability, we show spaces between the fields, but spaces should
+   not be used in practice.)  The fields are:
+   
+
+ 
+  STD is the zone abbreviation to be used
+  for standard time.
+ 
+
+
+ 
+  offset is the zone's standard-time offset
+  from UTC.
+ 
+
+
+ 
+  DST is the zone abbreviation to be used
+  for daylight-savings time.  If this field and the following ones are
+  omitted, the zone uses a fixed UTC offset with no daylight-savings
+  rule.
+ 
+
+
+ 
+  dstoffset is the daylight-savings offset
+  from UTC.  This field is typically omitted, since it defaults to one
+  hour less than the standard-time offset,
+  which is usually the right thing.
+ 
+
+
+ 
+  rule defines the rule for when daylight
+  savings is in effect, as described below.
+ 
+
+   
+  
+
+  
+   In this syntax, a zone abbreviation can be a str

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread amul sul
On Wed, Jun 17, 2020 at 8:15 PM Robert Haas  wrote:
>
> On Wed, Jun 17, 2020 at 9:51 AM tushar  wrote:
> > 1) ALTER SYSTEM
> >
> > postgres=# alter system read only;
> > ALTER SYSTEM
> > postgres=# alter  system reset all;
> > ALTER SYSTEM
> > postgres=# create table t1(n int);
> > ERROR:  cannot execute CREATE TABLE in a read-only transaction
> >
> > Initially I thought that after firing 'ALTER SYSTEM RESET ALL' it would
> > go back to normal.
> >
> > can't we have a syntax like - "Alter system set read_only='True' ; "
>
> No, this needs to be separate from the GUC-modification syntax, I
> think. It's a different kind of state change. It doesn't, and can't,
> just edit postgresql.auto.conf.
>
> > 2)When i connected to postgres in a single user mode , i was not able to
> > set the system in read only
> >
> > [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres
> >
> > PostgreSQL stand-alone backend 14devel
> > backend> alter system read only;
> > ERROR:  checkpointer is not running
> >
> > backend>
>
> Hmm, that's an interesting finding. I wonder what happens if you make
> the system read only, shut it down, and then restart it in single-user
> mode. Given what you see here, I bet you can't put it back into a
> read-write state from single user mode either, which seems like a
> problem. Either single-user mode should allow changing between R/O and
> R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
> ONLY and always allow writes anyway.
>
Ok, will try to enable changing between R/O and R/W in the next version.
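For context, the state changes being discussed are driven by the commands
proposed in this patch series (syntax as proposed here, not in any release):

    ALTER SYSTEM READ ONLY;    -- prohibit further WAL writes
    ALTER SYSTEM READ WRITE;   -- proposed counterpart to resume normal operation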

Thanks Tushar for the testing.

Regards,
Amul




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread amul sul
On Wed, Jun 17, 2020 at 8:12 PM Robert Haas  wrote:
>
> On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila  wrote:
> > Do we prohibit the checkpointer to write dirty pages and write a
> > checkpoint record as well?  If so, will the checkpointer process
> > writes the current dirty pages and writes a checkpoint record or we
> > skip that as well?
>
> I think the definition of this feature should be that you can't write
> WAL. So, it's OK to write dirty pages in general, for example to allow
> for buffer replacement so we can continue to run read-only queries.
> But there's no reason for the checkpointer to do it: it shouldn't try
> to checkpoint, and therefore it shouldn't write dirty pages either.
> (I'm not sure if this is how the patch currently works; I'm describing
> how I think it should work.)
>
You are correct -- writing dirty pages is not restricted.

> > > If there are open transactions that have acquired an XID, the sessions 
> > > are killed
> > > before the barrier is absorbed.
> >
> > What about prepared transactions?
>
> They don't matter. The problem with a running transaction that has an
> XID is that somebody might end the session, and then we'd have to
> write either a commit record or an abort record. But a prepared
> transaction doesn't have that problem. You can't COMMIT PREPARED or
> ROLLBACK PREPARED while the system is read-only, as I suppose anybody
> would expect, but their mere existence isn't a problem.
>
> > What if vacuum is on an unlogged relation?  Do we allow writes via
> > vacuum to unlogged relation?
>
> Interesting question. I was thinking that we should probably teach the
> autovacuum launcher to stop launching workers while the system is in a
> READ ONLY state, but what about existing workers? Anything that
> generates invalidation messages, acquires an XID, or writes WAL has to
> be blocked in a read-only state; but I'm not sure to what extent the
> first two of those things would be a problem for vacuuming an unlogged
> table. I think you couldn't truncate it, at least, because that
> acquires an XID.
>
> > > Another part of the patch that quite uneasy and need a discussion is that 
> > > when the
> > > shutdown in the read-only state we do skip shutdown checkpoint and at a 
> > > restart, first
> > > startup recovery will be performed and latter the read-only state will be 
> > > restored to
> > > prohibit further WAL write irrespective of recovery checkpoint succeed or 
> > > not. The
> > > concern is here if this startup recovery checkpoint wasn't ok, then it 
> > > will never happen
> > > even if it's later put back into read-write mode.
> >
> > I am not able to understand this problem.  What do you mean by
> > "recovery checkpoint succeed or not", do you add a try..catch and skip
> > any error while performing recovery checkpoint?
>
> What I think should happen is that the end-of-recovery checkpoint
> should be skipped, and then if the system is put back into read-write
> mode later we should do it then. But I think right now the patch
> performs the end-of-recovery checkpoint before restoring the read-only
> state, which seems 100% wrong to me.
>
Yeah, we need more thought on how to proceed further.  I tend to agree with
Robert that the current behavior is not right, since writing the
end-of-recovery checkpoint violates the no-WAL-write rule.

Regards,
Amul




Re: [PATCH] Add support for choosing huge page size

2020-06-17 Thread Thomas Munro
Hi Odin,

Documentation syntax error "2MB" shows up as:

config.sgml:1605: parser error : Opening and ending tag mismatch:
literal line 1602 and para
   
  ^

Please install the documentation tools
https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun
configure and "make docs" to see these kinds of errors.

The build is currently failing on Windows:

undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h
line 143 at src/tools/msvc/Mkvcbuild.pm line 851.

I think that's telling us that you need to add this stuff into
src/tools/msvc/Solution.pm, so that we can say it doesn't have it.  I
don't have Windows but whenever you post a new version we'll see if
Windows likes it here:

http://cfbot.cputube.org/odin-ugedal.html

When using huge_pages=on, huge_page_size=1GB, but default
shared_buffers, I noticed that the error message reports the wrong
(unrounded) size in this message:

2020-06-18 02:06:30.407 UTC [73552] HINT:  This error usually means
that PostgreSQL's request for a shared memory segment exceeded
available memory, swap space, or huge pages. To reduce the request
size (currently 149069824 bytes), reduce PostgreSQL's shared memory
usage, perhaps by reducing shared_buffers or max_connections.

The request size was actually:

mmap(NULL, 1073741824, PROT_READ|PROT_WRITE,
     MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30<...

# echo ... > /sys/kernel/mm/transparent_hugepage/enabled
# echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

shared_buffers=8GB
dynamic_shared_memory_main_size=8GB

create table t as select generate_series(1, 1)::int i;
alter table t set (parallel_workers = 7);
create extension pg_prewarm;
select pg_prewarm('t');
set max_parallel_workers_per_gather=7;
set work_mem='1GB';

select count(*) from t t1 join t t2 using (i);

4KB pages: 12.42 seconds
2MB pages:  9.12 seconds
1GB pages:  9.07 seconds

Unfortunately I can't access the TLB miss counters on this system due
to virtualisation restrictions, and the systems where I can don't have
1GB pages.  According to cpuid(1) this system has a fairly typical
setup:

   cache and TLB information (2):
  0x63: data TLB: 2M/4M pages, 4-way, 32 entries
data TLB: 1G pages, 4-way, 4 entries
  0x03: data TLB: 4K pages, 4-way, 64 entries

This operation is touching about 8GB of data (scanning a 3.5GB table,
building a 4.5GB hash table), so 4 x 1GB of TLB coverage is not enough to do
this without TLB misses.

Let's try that again, except this time with shared_buffers=4GB,
dynamic_shared_memory_main_size=4GB, and only half as many tuples in
t, so it ought to fit:

4KB pages:  6.37 seconds
2MB pages:  4.96 seconds
1GB pages:  5.07 seconds

Well that's disappointing.  I wondered if this was something to do
with NUMA effects on this two node box, so I tried running that again
with postgres under numactl --cpunodebind 0 --membind 0 and I got:

4KB pages:  5.43 seconds
2MB pages:  4.05 seconds
1GB pages:  4.00 seconds

From this I can't really conclude that it's terribly useful to use
larger page sizes, but it's certainly useful to have the ability to do
further testing using the proposed GUC.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com




Re: Creating a function for exposing memory usage of backend process

2020-06-17 Thread Fujii Masao




On 2020/06/17 22:00, torikoshia wrote:

Hi,

As you may know better than I do, backend processes sometimes use a lot
of memory for various reasons, such as caches, prepared
statements and cursors.
When supporting PostgreSQL, I run into situations where I have to
investigate the reason for memory bloat.

AFAIK, the way to examine it is to attach a debugger and call
MemoryContextStats(TopMemoryContext); however, there are some difficulties
in doing that:

   - some production environments don't allow us to run a debugger easily
   - the many lines of memory context output are hard to analyze


Agreed. The feature to view how local memory contexts are used in
each process is very useful!



Using an extension (pg_stat_get_memory_context() in pg_cheat_funcs [1]),
we can get a view of the memory contexts, but only those of the backend
which executed pg_stat_get_memory_context().


[user interface]
If we have a function exposing the memory contexts of a specified PID,
we can easily examine them.
I imagine a user interface something like this:

   =# SELECT * FROM pg_stat_get_backend_memory_context(PID);


I'm afraid that this interface is not convenient when we want to monitor
the usage of local memory contexts for all the processes. For example,
I'd like to monitor how much memory is used in total to store prepared
statement information. For that purpose, I wonder if it's more convenient
to provide a view displaying the memory context usage of
all the processes.
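Purely as a sketch of that idea (none of these names exist today; they stand
in for a possible all-processes view):

    SELECT pid, name, parent, total_bytes, used_bytes
      FROM pg_stat_memory_contexts      -- hypothetical view name
     ORDER BY used_bytes DESC
     LIMIT 10;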

To provide that view, all the processes would need to save their local memory
context usage into shared memory or special files at some
convenient timing. For example, backends could do that at the end of every
query execution (while waiting for a new request from a client). OTOH,
a query on the view would scan and display all that information.

Of course there would be several issues with this idea. One issue is
the performance overhead caused when each process stores
its own memory context usage somewhere. Even if backends do that
while waiting for a new request from a client, non-negligible overhead
might occur, so a performance test is necessary. This also means that
we cannot see the memory context usage of a process in the middle of
query execution, since it's saved at the end of the query. If local memory
bloat occurs only during query execution and we want to investigate it, we
still need to use gdb to output the memory context information.

Another issue is that a large amount of shared memory might be
necessary to save the memory context usage of all the processes. We could
save the usage information into files instead, but that would cause
more overhead. If we use shared memory, a parameter similar to
track_activity_query_size might be necessary; that is, backends would save
only the specified amount of memory context information, and if it's zero,
the feature would be disabled.

Also, we should reduce the amount of information to save. For example,
instead of saving all the memory context information that MemoryContextStats()
prints, it might be better to save summary stats (per memory context
type) derived from it.




          name          |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes | some other attributes...
------------------------+------------------+-------+-------------+---------------+------------+-------------+------------+--------------------------
 TopMemoryContext       |                  |     0 |       68720 |             5 |       9936 |          16 |      58784 |
 TopTransactionContext  | TopMemoryContext |     1 |        8192 |             1 |       7720 |           0 |        472 |
 PL/pgSQL function      | TopMemoryContext |     1 |       16384 |             2 |       5912 |           1 |      10472 |
 PL/pgSQL function      | TopMemoryContext |     1 |       32768 |             3 |      15824 |           3 |      16944 |
 dynahash               | TopMemoryContext |     1 |        8192 |             1 |        512 |           0 |       7680 |
...


[rough implementation ideas and challenges]
I suppose communication between the process which runs
pg_stat_get_backend_memory_context() (referred to as A) and the
target backend (referred to as B) would be like:

   1. A sends a message to B ordering it to dump its memory contexts
   2. B dumps its memory contexts to some shared area
   3. A reads the shared area and returns it to the function invoker

To do so, there seem to be some challenges.

(1) how to share memory context information between backend processes
The number of memory contexts varies greatly depending on the
situation, so it's not appropriate to share the memory contexts using
fixed-size shared memory.
Also, using a file in 'stats_temp_directory' seems difficult considering
the background of the shared-memory based stats collector
proposal [2].
Instead, I'm thinking about using dsm_mq, which allows messages of
arbitrary length to be sent and received.

(2) how to send messages wanting memory contexts
Communicating via signal 

Missing HashAgg EXPLAIN ANALYZE details for parallel plans

2020-06-17 Thread David Rowley
Hi,

Now that HashAgg can spill to disk, we see a few more details in
EXPLAIN ANALYZE than we did previously, e.g. Peak Memory Usage, Disk
Usage.  However, the new code neglected to make EXPLAIN ANALYZE show
these new details for parallel workers.

Additionally, the new properties were all using
ExplainPropertyInteger(), which meant that we got output like:

 QUERY PLAN
-
 HashAggregate (actual time=31.724..87.638 rows=1000 loops=1)
   Group Key: a
   Peak Memory Usage: 97 kB
   Disk Usage: 3928 kB
   HashAgg Batches: 798
   ->  Seq Scan on ab (actual time=0.006..9.243 rows=10 loops=1)

Here each of the properties is on a line by itself.  This does not
really follow what other nodes do, e.g. Sort:

QUERY PLAN
---
 GroupAggregate (actual time=47.530..70.935 rows=1000 loops=1)
   Group Key: a
   ->  Sort (actual time=47.500..59.344 rows=10 loops=1)
 Sort Key: a
 Sort Method: external merge  Disk: 2432kB
 ->  Seq Scan on ab (actual time=0.004..8.476 rows=10 loops=1)

Where we stick all the properties on a single line.

The attached patch fixes both the missing parallel worker information
and puts the properties on a single line for format=text.

I'd like to push this in the next few days.

David


hashagg_explain_fix.patch
Description: Binary data


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-17 Thread Tatsuo Ishii
>> My concern is, FDW+2PC without atomic visibility could lead to data
>> inconsistency among servers in some cases. If my understanding is
>> correct, FDW+2PC (without atomic visibility) cannot prevent data
>> inconsistency in the case below. Initially table t1 has only one row
>> with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
>> ends up with different values of i, 0 and 1, on the two servers.
> 
> IIUC the following sequence won't happen because COMMIT PREPARED
> 's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'.

You are right.

> But as
> you mentioned, we cannot prevent data inconsistency even with FDW+2PC
> e.g., when S2 starts a transaction between COMMIT PREPARED on N1 and
> COMMIT PREPARED on N2 by S1.

Ok, example updated.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
 1
(1 row)

> The point is that this data inconsistency is
> caused by an inconsistent read, not by inconsistent commit
> results. I think there are various possibilities that can cause data
> inconsistency, but atomic commit and atomic visibility eliminate
> different ones. We can eliminate all possibilities of data
> inconsistency only after we support both 2PC and global MVCC.

IMO any permanent data inconsistency is a serious problem for users no
matter what the technical reasons are.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Review for GetWALAvailability()

2020-06-17 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in 
> > ReplicationSlotAcquireInternal.  I think we should call
> > ConditionVariablePrepareToSleep before the surrounding for statement
> > block.
> 
> OK, so what about the attached patch? I added
> ConditionVariablePrepareToSleep()
> just before entering the "for" loop in
> InvalidateObsoleteReplicationSlots().

Thanks.

ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line are needed for the case where slot is
given, but the check is already done in SearchNamedReplicationSlot if slot is
not given.  I would like something like the following instead, but I
don't insist on it.

ReplicationSlot *s = NULL;
...
if (!slot)
s = SearchNamedReplicationSlot(name);
else if(s->in_use && strcmp(name, NameStr(s->data.name)))
s = slot;


+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.  It might be better to leave it to the caller.  Currently
no such caller exists so I don't insist on this but the message should
be revised otherwise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8f1913ec7ff3809d11adf1611aba57e44cb42a82 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 10:55:49 +0900
Subject: [PATCH 1/3] 001

---
 src/backend/replication/slot.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b94e11a8e7..dcc76c4783 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -412,6 +412,14 @@ retry:
 	 */
 	if (IsUnderPostmaster)
 	{
+		/*
+		 * Get ready to sleep on it in case it is active if needed.
+		 * (We may end up not sleeping, but we don't want to do this while
+		 * holding the spinlock.)
+		 */
+		if (behavior != SAB_Inquire)
+			ConditionVariablePrepareToSleep(&s->active_cv);
+
 		SpinLockAcquire(&s->mutex);
 		if (s->active_pid == 0)
 			s->active_pid = MyProcPid;
@@ -438,12 +446,13 @@ retry:
 			return active_pid;
 
 		/* Wait here until we get signaled, and then restart */
-		ConditionVariablePrepareToSleep(&s->active_cv);
 		ConditionVariableSleep(&s->active_cv,
 			   WAIT_EVENT_REPLICATION_SLOT_DROP);
 		ConditionVariableCancelSleep();
 		goto retry;
 	}
+	else if (behavior != SAB_Inquire)
+		ConditionVariableCancelSleep();
 
 	/* Let everybody know we've modified this slot */
 	ConditionVariableBroadcast(&s->active_cv);
-- 
2.18.4

>From 1b2f94d6bfb4508b2cbf3d552a5615ae2959e90c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 11:25:25 +0900
Subject: [PATCH 2/3] 002

---
 src/backend/replication/slot.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dcc76c4783..8893516f00 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -380,7 +380,7 @@ static int
 ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name,
 			   SlotAcquireBehavior behavior)
 {
-	ReplicationSlot *s;
+	ReplicationSlot *s = NULL;
 	int			active_pid;
 
 retry:
@@ -393,8 +393,12 @@ retry:
 	 * acquire is not given. If the slot is not found, we either
 	 * return -1 or error out.
 	 */
-	s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
-	if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)
+	if (!slot)
+		s = SearchNamedReplicationSlot(name);
+	else if(s->in_use && strcmp(name, NameStr(s->data.name)))
+		s = slot;
+
+	if (s == NULL)
 	{
 		if (behavior == SAB_Inquire)
 		{
-- 
2.18.4

From 2d4a83abd2f278a9f9dc6e9329aed1acc191f2c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 11:33:29 +0900
Subject: [PATCH 3/3] 003

---
 src/backend/replication/slot.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8893516f00..25ae334a29 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -373,8 +373,10 @@ ReplicationSlotAcquire(const char *name, SlotAcquireBehavior behavior)
  * Mark the specified slot as used by this process.
  *
  * If *slot == NULL, search for the slot with the given name.
- *
  * See comments about the return value in Replicatio

Re: 回复:回复:回复:how to create index concurrently on partitioned table

2020-06-17 Thread Michael Paquier
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions,
> we may need to handle too many missing index entries in
> validate_index().  Especially for the first partition, a long time may
> have passed and many entries may be missing.  In this case, why
> don't we put the second and third phases together into one transaction
> for each partition?

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael


signature.asc
Description: PGP signature


Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-17 Thread Masahiko Sawada
On Wed, 17 Jun 2020 at 20:14, Amit Kapila  wrote:
>
> On Wed, Jun 17, 2020 at 1:34 PM Masahiko Sawada
>  wrote:
> >
> > On Sat, 13 Jun 2020 at 14:23, Amit Kapila  wrote:
> > >
> > > On Fri, Jun 12, 2020 at 6:11 PM Magnus Hagander  
> > > wrote:
> > > >
> > > > On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila  
> > > > wrote:
> > > >>
> > > >
> > > >
> > > > The problem with "lifetime of a process" is that it's not predictable. 
> > > > A replication process might "bounce" for any reason, and it is normally 
> > > > not a problem. But if you suddenly lose your stats when you do that, it 
> > > > starts to matter a lot more. Especially when you don't know if it 
> > > > bounced. (Sure you can look at the backend_start time, but that adds a 
> > > > whole different sets of complexitites).
> > > >
> > >
> > > It is not clear to me what is a good way to display the stats for a
> > > process that has exited or bounced due to whatever reason.  OTOH, if
> > > we just display per-slot stats, it is difficult to imagine how the
> > > user can make any sense out of it or in other words how such stats can
> > > be useful to users.
> >
> > If we have the reset function, the user can reset before doing logical
> > decoding so that the user can use the stats directly. Or I think we
> > can automatically reset the stats when logical decoding is performed
> > with different logical_decoding_work_mem value than the previous one.
> >
>
> I had written above in the context of persisting these stats.  I mean
> to say if the process has bounced or server has restarted then the
> previous stats might not make much sense because we were planning to
> use pid [1], so the stats from process that has exited might not make
> much sense or do you think that is okay?  If we don't want to persist
> and the lifetime of these stats is till the process is alive then we
> are fine.
>

Sorry for confusing you. The above my idea is about having the stats
per slots. That is, we add spill_txns, spill_count and spill_bytes to
pg_replication_slots or a new view pg_stat_logical_replication_slots
with some columns: slot_name plus these stats columns and stats_reset.
The idea is that the stats values accumulate until either the slot is
dropped, the server crashed, the user executes the reset function, or
logical decoding is performed with different logical_decoding_work_mem
value than the previous time. In other words, the stats values are
reset in either case. That way, I think the stats values always
correspond to logical decoding using the same slot with the same
logical_decoding_work_mem value.
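
To make that concrete, a minimal sketch of the per-slot counters (field
names follow the proposed columns; the struct itself, its placement, and
the reset handling are only assumptions for illustration, not part of any
posted patch) could look like this:

#include <stdint.h>

/*
 * Hypothetical per-slot spill statistics.  In the backend these would use
 * the backend's own types (int64, TimestampTz, NameData); plain C types
 * are used here so the sketch stands alone.
 */
typedef struct SlotSpillStats
{
	char	slot_name[64];	/* NAMEDATALEN-sized in the backend */
	int64_t	spill_txns;		/* transactions spilled to disk */
	int64_t	spill_count;	/* number of times transaction data spilled */
	int64_t	spill_bytes;	/* total bytes spilled */
	int64_t	stats_reset;	/* time of the last reset, epoch microseconds */
} SlotSpillStats;

/*
 * Reset the counters, as would happen when the slot is dropped, the reset
 * function is called, or logical_decoding_work_mem changes.
 */
static void
slot_spill_stats_reset(SlotSpillStats *stats, int64_t now)
{
	stats->spill_txns = 0;
	stats->spill_count = 0;
	stats->spill_bytes = 0;
	stats->stats_reset = now;
}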

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-17 Thread Masahiko Sawada
On Thu, 18 Jun 2020 at 08:31, Tatsuo Ishii  wrote:
>
> > okay, so it seems we need few things which middleware (Pangea) expects
> > if we have to follow the design of paper.
>
> Yes.
>
> > I haven't read the paper completely but it sounds quite restrictive
> > (like both commits and snapshots need to wait).
>
> Maybe. There is a performance evaluation in the paper. You might want
> to take a look at it.
>
> > Another point is that
> > do we want some middleware involved in the solution?   The main thing
> > I was looking into at this stage is do we think that the current
> > implementation proposed by the patch for 2PC is generic enough that we
> > would be later able to integrate the solution for atomic visibility?
>
> My concern is, FDW+2PC without atomic visibility could lead to data
> inconsistency among servers in some cases. If my understanding is
> correct, FDW+2PC (without atomic visibility) cannot prevent data
> inconsistency in the case below. Initially table t1 has only one row
> with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
> now has different value of i, 0 and 1.

IIUC the following sequence won't happen, because COMMIT PREPARED
's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'. But as
you mentioned, we cannot prevent data inconsistency even with FDW+2PC,
e.g., when S2 starts a transaction between S1's COMMIT PREPARED on N1
and its COMMIT PREPARED on N2. The point is that this data inconsistency
is led by an inconsistent read, not by inconsistent commit results.
There are several kinds of possibilities causing data inconsistency,
and atomic commit and atomic visibility each eliminate different ones.
We can eliminate all possibilities of data inconsistency only after we
support both 2PC and global MVCC.

>
> S1/N1: DROP TABLE t1;
> DROP TABLE
> S1/N1: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N1: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N2: DROP TABLE t1;
> DROP TABLE
> S1/N2: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N2: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N1: BEGIN;
> BEGIN
> S1/N2: BEGIN;
> BEGIN
> S1/N1: UPDATE t1 SET i = i + 1; -- i = 1
> UPDATE 1
> S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
> UPDATE 1
> S1/N1: PREPARE TRANSACTION 's1n1';
> PREPARE TRANSACTION
> S1/N1: COMMIT PREPARED 's1n1';
> COMMIT PREPARED
> S2/N1: BEGIN;
> BEGIN
> S2/N2: BEGIN;
> BEGIN
> S2/N2: DELETE FROM t1 WHERE i = 1;
> DELETE 0
> S2/N1: DELETE FROM t1 WHERE i = 1;
> DELETE 1
> S1/N2: PREPARE TRANSACTION 's1n2';
> PREPARE TRANSACTION
> S2/N1: PREPARE TRANSACTION 's2n1';
> PREPARE TRANSACTION
> S2/N2: PREPARE TRANSACTION 's2n2';
> PREPARE TRANSACTION
> S1/N2: COMMIT PREPARED 's1n2';
> COMMIT PREPARED
> S2/N1: COMMIT PREPARED 's2n1';
> COMMIT PREPARED
> S2/N2: COMMIT PREPARED 's2n2';
> COMMIT PREPARED
> S2/N1: SELECT * FROM t1;
>  i
> ---
> (0 rows)
>
> S2/N2: SELECT * FROM t1;
>  i
> ---
>  1
> (1 row)
>

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_regress cleans up tablespace twice.

2020-06-17 Thread Michael Paquier
On Wed, Jun 17, 2020 at 05:02:31PM +0900, Kyotaro Horiguchi wrote:
> It looks good to me as far as the Windows part goes. I agree that
> vcregress.pl doesn't need to parse EXTRA_REGRESS_OPTS, allowing a
> somewhat tighter bond between pg_regress and its call sites.

Thanks, applied this part to HEAD then after more testing.

> Chaining to TAP sounds nice as a goal.

I submitted a patch for that, but we had no clear agreement about how
to handle major upgrades, as this involves a somewhat large
refactoring of PostgresNode.pm so that each registered node records a
path to the binaries it uses.

> As the next step we need to amend GNUmakefile not to clean up the
> tablespace for the four test items. Then somehow treat tablespaces in
> non-default places?

Ah, you mean to not reset testtablespace where that's not necessary in
the tests by reworking the rules?  Yeah, perhaps we could do something
like that.  Not sure yet how to shape that in terms of code but if you
have a clear idea, please feel free to submit it.  I think that this
may be better if discussed on a different thread though.
--
Michael


signature.asc
Description: PGP signature


Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao




On 2020/06/18 3:04, Alvaro Herrera wrote:

> I think passing the slot name when the slot is also passed is useless
> and wasteful; it'd be better to pass NULL for the name and ignore the
> strcmp() in that case -- in fact I suggest to forbid passing both name
> and slot.  (Any failure there would risk raising an error during
> checkpoint, which is undesirable.)


Sounds reasonable.


> So I propose the following tweaks to your patch, and otherwise +1.


Thanks for the patch! It looks good to me.

Barring any objections, I will commit the patches in the master and
v13 branches later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Improve planner cost estimations for alternative subplans

2020-06-17 Thread Melanie Plageman
On Fri, Jun 5, 2020 at 9:08 AM Alexey Bashtanov  wrote:

>
> In [1] we found a situation where it leads to a suboptimal plan:
> it bloats the overall cost into large figures, making a decision
> related to an outer part of the plan look negligible to the planner,
> and as a result the planner doesn't elaborate on choosing the
> optimal one.
>
>
Did this geometric average method result in choosing the desired plan for
this case?


> The patch is to fix it. Our linear cost model cannot quite accommodate
> the piecewise-linear nature of alternative subplans,
> so it is based on ugly heuristics and still cannot be very precise,
> but I think it's better than the current one.
>
> Thoughts?
>
>
Is there another place in the planner where two alternatives are averaged
together and that cost is used?

To me, it feels a little bit weird that we are averaging the startup cost
of a plan which will always have a zero startup cost with that of a plan
which will always have a non-zero startup cost, and likewise the per-tuple
cost of a plan that will always have a negligible per-tuple cost with that
of one that might have a very large per-tuple cost.

I guess it feels different because instead of comparing alternatives you
are blending them.
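
As a toy illustration of that blending (invented numbers, plain C, and not
the actual formula used by the patch or by the planner):

#include <stdio.h>

typedef struct
{
	double	startup;	/* one-time cost paid before the first tuple */
	double	per_tuple;	/* cost per evaluation */
} AltCost;

/* Blend two alternatives with a simple arithmetic mean of each component. */
static AltCost
blend(AltCost a, AltCost b)
{
	AltCost r;

	r.startup = (a.startup + b.startup) / 2.0;
	r.per_tuple = (a.per_tuple + b.per_tuple) / 2.0;
	return r;
}

int
main(void)
{
	AltCost	hashed = {1000.0, 0.01};	/* pays startup, cheap per tuple */
	AltCost	unhashed = {0.0, 50.0};		/* free startup, costly per tuple */
	AltCost	mixed = blend(hashed, unhashed);

	/* The blended pair corresponds to neither plan that could actually run. */
	printf("blended: startup=%.2f per_tuple=%.2f\n",
		   mixed.startup, mixed.per_tuple);
	return 0;
}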

I don't have any academic basis for saying that the alternatives costs
shouldn't be averaged together for use in the rest of the plan, so I
could definitely be wrong.

-- 
Melanie Plageman


Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-06-17 Thread David G. Johnston
On Wed, Jun 17, 2020 at 4:32 PM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > I'm firmly of the belief that the existing behavior of DROP relation IF
> > EXISTS is flawed - it should not be an error if there is a namespace
> > collision but the relkind of the existing relation doesn't match the
> > relkind set by the DROP command.
>
>
The other thread:

https://www.postgresql.org/message-id/CAKFQuwY90%3DGSX_65cYdAm18TWCv4CvnPdHCuH92qfzKSYaFnxQ%40mail.gmail.com

> I don't particularly agree, as I said in the other thread.  The core
> point here is that it's not clear to me why the specific error of
> "wrong relkind" deserves a pass, while other errors such as "you're
> not the owner" don't.


Because if you're not the owner then by definition the expected target
exists and a drop is attempted - which can still fail.

> Both of those cases suggest that you're not
> targeting the relation you think you are, and both of them would get
> in the way of a subsequent CREATE.


Agreed, as noted on the other thread we actually are not sufficiently
paranoid in this situation.  Specifically, we allow dropping a relation
based upon a search_path search when the target it not on the first entry
in the search_path.  I'd be glad to see that hole closed up - but this is
still broken even when the name is always schema qualified.

> To me, success of DROP IF EXISTS
> should mean "the coast is clear to do a CREATE".  With an exception
> like this, a success would mean nothing at all.
>

To me and at least some users DROP IF EXISTS means that the specific object
I specified no longer exists, period.

If you want access to the behavior you describe go and write DROP ROUTINE.
As noted on the other thread I think that is a bad option but hey, it does
have the benefit of doing exactly what you describe.

Users can write the multiple drop commands necessary to get their create
command to execute successfully.  If the create command fails they can
react to that and figure out where their misunderstanding was.  Is that
really so terrible?

> Another point here is that we have largely the same issue with respect
> to different subclasses of routines (functions/procedures/aggregates)
> and types (base types/composite types/domains).  If we do change
> something then I'd want to see it done consistently across all these
> cases.


Ok.  I don't necessarily disagree.  In fact the patch I submitted, which is
the on-topic discussion for this thread, brings up the very point that
domain behavior here is presently inconsistent.

At least for DROP TABLE IF EXISTS, if we close up the search_path
resolution hole by introducing an actual "found relation in the wrong
location" error, then that risk - which exists outside of the IF EXISTS
logic - will have been removed, and instead of not dropping a table and
throwing an error we are simply not dropping a table.

So, in summary, this thread is to document the current behavior [actual doc
bug fix].  There is probably another thread buried in all of this for going
through and finding other undocumented behaviors for other object types
[potential doc bug fixes].  Then a thread for solidifying search_path
handling to actually fill in missing seemingly desirable safety features to
avoid drop target mis-identification (so we don't actually drop the wrong
object) [feature].  Then a thread to discuss whether or not dropping an
object that wasn't of the relkind the user specified should be an error
[bug fix held up due to insufficient safety features].  Then a thread to
discuss DROP ROUTINE [user choice of convenience over safety].

David J.


Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-06-17 Thread Tom Lane
"David G. Johnston"  writes:
> I'm firmly of the belief that the existing behavior of DROP relation IF
> EXISTS is flawed - it should not be an error if there is a namespace
> collision but the relkind of the existing relation doesn't match the
> relkind set by the DROP command.

I don't particularly agree, as I said in the other thread.  The core
point here is that it's not clear to me why the specific error of
"wrong relkind" deserves a pass, while other errors such as "you're
not the owner" don't.  Both of those cases suggest that you're not
targeting the relation you think you are, and both of them would get
in the way of a subsequent CREATE.  To me, success of DROP IF EXISTS
should mean "the coast is clear to do a CREATE".  With an exception
like this, a success would mean nothing at all.

Another point here is that we have largely the same issue with respect
to different subclasses of routines (functions/procedures/aggregates)
and types (base types/composite types/domains).  If we do change
something then I'd want to see it done consistently across all these
cases.

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-17 Thread Tatsuo Ishii
> okay, so it seems we need few things which middleware (Pangea) expects
> if we have to follow the design of paper.

Yes.

> I haven't read the paper completely but it sounds quite restrictive
> (like both commits and snapshots need to wait).

Maybe. There is a performance evaluation in the paper. You might want
to take a look at it.

> Another point is that
> do we want some middleware involved in the solution?   The main thing
> I was looking into at this stage is do we think that the current
> implementation proposed by the patch for 2PC is generic enough that we
> would be later able to integrate the solution for atomic visibility?

My concern is, FDW+2PC without atomic visibility could lead to data
inconsistency among servers in some cases. If my understanding is
correct, FDW+2PC (without atomic visibility) cannot prevent data
inconsistency in the case below. Initially table t1 has only one row
with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
now has different value of i, 0 and 1.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
 1
(1 row)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Minor issue with deduplication's single value strategy

2020-06-17 Thread Peter Geoghegan
nbtdedup.c's "single value strategy" is used when we have a leaf page
that is full of duplicates of the same value. We infer that we will
have many more in the future, and cooperate with the "single value
strategy" used within nbtsplitloc.c.  Deduplication should create 6
large posting list tuples that will remain on the page after an
anticipated page split, plus some remaining non-dedup'd regular tuples
that go on the new right page. So nbtdedup.c anticipates how an
eventual page split will need to work to keep space utilization high,
but not too high (though only in this specific "single value strategy"
case).

(Note that the choice of 6 posting list tuples is arbitrary -- it
seemed like limiting the size of posting list tuples to 1/3 of a page
(the long established maximum) was a bit too aggressive, so the
posting list tuple size limit was halved to 1/6 of the page.)
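
(In round numbers, with the default 8kB block size, that works out roughly
as in the toy calculation below; the real limits are derived from usable
page space rather than raw block size.)

#include <stdio.h>

int
main(void)
{
	const int	blcksz = 8192;	/* default PostgreSQL block size */

	/* long-established cap on the size of any single index tuple */
	printf("1/3-of-page tuple limit:      ~%d bytes\n", blcksz / 3);

	/* halved cap applied to posting list tuples by single value strategy */
	printf("1/6-of-page posting list cap: ~%d bytes\n", blcksz / 6);

	return 0;
}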

Sometimes, an append-only inserter of low cardinality data (just a few
distinct values) can leave some "single value" pages with perhaps 8 or
9 tuples -- not the intended 7 (i.e. not the intended 6 tuples plus 1
high key). This isn't really a problem. It can only happen when a
plain tuple is "wedged" between two existing max-sized posting list
tuples, where we cannot merge the plain tuple with either
adjoining/neighboring posting list tuple (since that violates the 1/6
of a page limit on posting list tuple size). This scenario is rare,
and in any case the space utilization is almost exactly what it's
supposed to be in the end (the BTREE_SINGLEVAL_FILLFACTOR "fill
factor" that was added to Postgres 12 is still respected in the
presence of deduplication in Postgres 13). This much is okay.

However, I noticed that it's also possible to confuse "single value
strategy" in nbtdedup.c into thinking that it has already observed 6
"max sized" tuples, when in fact it has not. This can lead to the
occasional page that contains perhaps 50 or 70 tuples. This is rare
enough that you have to really be paying attention to the structure of
the index to notice it -- most of my test case indexes that use
"single value strategy" a lot aren't even affected. I'm not okay with
this, because it's not how nbtdedup.c is intended to work -- clearly
nbtdedup.c gets confused as a consequence of the "plain tuple gets
wedged" issue. I'm not okay with this.

Attached patch nails down the behavior of single value strategy. This
slightly improves the space utilization in a small minority of my test
case indexes, though not by enough to matter. For example, the TPC-H
idx_customer_nationkey index goes from 1193 pages to 1182 pages. This
is an utterly insignificant issue, but there is no reason to allow it.
When I think about space utilization, I don't just look at the index
overall -- I also expect a certain amount of consistency within
related subsets of the index's key space.

-- 
Peter Geoghegan


v1-0001-Fix-nbtdedup.c-single-value-strategy-issue.patch
Description: Binary data


DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-06-17 Thread David G. Johnston
This is a follow-up to Bug # 16492 which also links to a thread sent to
-hackers back in 2018.

I'm firmly of the belief that the existing behavior of DROP relation IF
EXISTS is flawed - it should not be an error if there is a namespace
collision but the relkind of the existing relation doesn't match the
relkind set by the DROP command.

Since our documentation fails to elaborate on any additional behavior, and
uses the relkind in the description, our users (few as they may be) are
rightly calling this a bug.  I loosely believe that any behavior change in
this area should not be back-patched thus for released versions this is a
documentation bug.  I have attached a patch to fix that bug.

In putting together the patch I noticed that the existing drop_if_exists
regression tests exercise the DROP DOMAIN command.  Out of curiosity I
included that in my namespace testing and discovered that DROP DOMAIN
thinks of itself as being a relation for purposes of IF EXISTS but DROP
TABLE does not.  I modified both DROP DOMAIN and the Glossary in response
to this finding - though I suspect I will find disagreement with my
choice.  I looked at pg_class for some guidance, but a quick search for
RELKIND_ (DOMAIN) found nothing, so I decided I didn't know enough and
figured to punt on any further exploration of this inconsistency.

The documentation and tests need to go in and be back-patched.  After that
happens I'll see whether and/or how to go about trying to get my PoV on the
behavioral change committed.

David J.


drop-if-exists-docs-and-tests-v1.patch
Description: Binary data


Re: language cleanups in code and docs

2020-06-17 Thread Bruce Momjian
On Wed, Jun 17, 2020 at 01:59:26PM -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 2:23 PM Andres Freund  wrote:
> > 0002: code: s/master/primary/
> > 0003: code: s/master/leader/
> > 0006: docs: s/master/root/
> > 0007: docs: s/master/supervisor/
> 
> I'd just like to make the pointer here that there's value in trying to
> use different terminology for different things. I picked "leader" and
> "worker" for parallel query and tried to use them consistently because
> "master" and "slave" were being used widely to refer to physical
> replication, and I thought it would be clearer to use something
> different, so I did. It's confusing if we use the same word for the
> server from which others replicate, the table from which others
> inherit, the process which initiates parallelism, and the first
> process that is launched across the whole cluster, regardless of
> *which* word we use for those things. So, I think there is every
> possibility that with careful thought, we can actually make things
> clearer, in addition to avoiding the use of terms that are no longer
> welcome.

I think the question is whether we can improve our terms as part of this
rewording, or if we make them worse.  When we got rid of slave and made
it standby, I think we made things worse since many of the replicas were
not functioning as standbys.  Standby is a role, not a
status, while replica is a status.

The other issue is how the terms interlink with other terms.  When we
used master/slave, multi-master matched the wording, but replication
didn't match.  If we go with replica, replication works, and
primary/replica kind of fits, whereas master/replica does not.
Multi-master then no longer fits, multi-primary sounds odd, and
active-active doesn't match, though active-active is not used as much as
primary/replica, so maybe that is OK.  Ideally we would have all terms
matching, but maybe that is impossible.

My point is that these terms are symbolic (similes) --- the new terms
should link to their roles and to other terms in a logical way.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 17, 2020 at 3:45 PM Tom Lane  wrote:
>> The macros are kind of necessary unless you want to make s_lock.h
>> a bunch messier, because we use #ifdef tests on them.

> Where?

See the "Default Definitions", down near the end.

>> We could get rid of the double layer of macros, sure, but TBH that
>> sounds like change for the sake of change rather than a useful
>> improvement.

> Really? Multiple layers of macros seem like they pretty clearly make
> the source code harder to understand. There are plenty of places where
> such devices are necessary for one reason or another, but it doesn't
> seem like something we ought to keep around for no reason.

I wouldn't object to making the outer-layer macros in spin.h into static
inlines; as mentioned, that might have some debugging benefits.  But I
think messing with s_lock.h for marginal cosmetic reasons is a foolish
idea.  For one thing, there's no way whoever does it can verify all the
architecture-specific stanzas.  (I don't think we even have all of them
covered in the buildfarm.)

regards, tom lane




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 3:45 PM Tom Lane  wrote:
> Robert Haas  writes:
> > This seems like it's straight out of the department of pointless
> > abstraction layers. Maybe we should remove all of the S_WHATEVER()
> > stuff and just define SpinLockAcquire() where we currently define
> > S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.
> > And, as you say, make them static inline functions while we're at it.
>
> The macros are kind of necessary unless you want to make s_lock.h
> a bunch messier, because we use #ifdef tests on them.

Where?

> We could get rid of the double layer of macros, sure, but TBH that
> sounds like change for the sake of change rather than a useful
> improvement.

Really? Multiple layers of macros seem like they pretty clearly make
the source code harder to understand. There are plenty of places where
such devices are necessary for one reason or another, but it doesn't
seem like something we ought to keep around for no reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Tom Lane
Robert Haas  writes:
> This seems like it's straight out of the department of pointless
> abstraction layers. Maybe we should remove all of the S_WHATEVER()
> stuff and just define SpinLockAcquire() where we currently define
> S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.
> And, as you say, make them static inline functions while we're at it.

The macros are kind of necessary unless you want to make s_lock.h
a bunch messier, because we use #ifdef tests on them.

We could get rid of the double layer of macros, sure, but TBH that
sounds like change for the sake of change rather than a useful
improvement.

regards, tom lane




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 2:33 PM Andres Freund  wrote:
> > It seems odd and confusing that we have  both
> > S_LOCK() and s_lock(), anyway. Differentiating functions based on case
> > is not great practice.
>
> It's a terrible idea, yes.  Since we don't actually have any non-default
> implementations of S_LOCK, perhaps we should just rip it out?

I think we should rip out the conditional nature of the definition and
fix the comments. I don't think I prefer getting rid of it completely.

But then again on the other hand, what's the point of this crap anyway:

#define SpinLockInit(lock)  S_INIT_LOCK(lock)
#define SpinLockAcquire(lock) S_LOCK(lock)
#define SpinLockRelease(lock) S_UNLOCK(lock)
#define SpinLockFree(lock)  S_LOCK_FREE(lock)

This seems like it's straight out of the department of pointless
abstraction layers. Maybe we should remove all of the S_WHATEVER()
stuff and just define SpinLockAcquire() where we currently define
S_LOCK(), SpinLockRelease() where we currently define S_UNLOCK(), etc.

And, as you say, make them static inline functions while we're at it.
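
A minimal sketch of that, assuming the existing slock_t type and the S_*()
layer from s_lock.h (names and placement are illustrative, not a worked-out
patch):

#include "postgres.h"
#include "storage/s_lock.h"

/*
 * spin.h wrappers as static inline functions rather than bare macros, so
 * they can be attributed properly in debuggers and profilers.  The bodies
 * still delegate to the platform-specific S_*() layer.
 */
static inline void
SpinLockInit(volatile slock_t *lock)
{
	S_INIT_LOCK(lock);
}

static inline void
SpinLockAcquire(volatile slock_t *lock)
{
	S_LOCK(lock);
}

static inline void
SpinLockRelease(volatile slock_t *lock)
{
	S_UNLOCK(lock);
}

static inline bool
SpinLockFree(volatile slock_t *lock)
{
	return S_LOCK_FREE(lock);
}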

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Binary transfer vs Text transfer

2020-06-17 Thread Andres Freund
Hi,


On 2020-06-17 11:55:44 -0700, Aleksei Ivanov wrote:
> I have one question related to my current measurement results.
> 
> I am fetching integers in text format like:
> 
> Select * from table limit 1000. It takes 18.5 seconds to finish and the
> transferred data is 633 MB.
> 
> When I fetch the same data using a binary cursor, the transferred data is
> 480 MB, but the transfer time is 21.5 seconds.
> 
> So, I have one question: could someone explain what could be the reason
> why the transfer time for binary data is higher?

This thread might be interesting:
https://www.postgresql.org/message-id/CAMkU%3D1whbRDUwa4eayD9%2B59K-coxO9senDkPRbTn3cg0pUz4AQ%40mail.gmail.com

Greetings,

Andres Freund




Binary transfer vs Text transfer

2020-06-17 Thread Aleksei Ivanov
Hello, colleagues,

I have one question related to my current measurement results.

I am fetching integers in text format like:

Select * from table limit 1000. It takes 18.5 seconds to finish and the
transferred data is 633 MB.

When I fetch the same data using a binary cursor, the transferred data is
480 MB, but the transfer time is 21.5 seconds.

So, I have one question: could someone explain what could be the reason
why the transfer time for binary data is higher?

Best regards,


Re: Review for GetWALAvailability()

2020-06-17 Thread Alvaro Herrera
On 2020-Jun-17, Kyotaro Horiguchi wrote:

> @@ -342,7 +351,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>   else
>   nulls[i++] = true;
>  
> - walstate = GetWALAvailability(slot_contents.data.restart_lsn);
> + /* use last_invalidated_lsn when the slot is invalidated */
> + if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
> + targetLSN = slot_contents.last_invalidated_lsn;
> + else
> + targetLSN = slot_contents.data.restart_lsn;
> +
> + walstate = GetWALAvailability(targetLSN, last_removed_seg,
> +   
> slot_contents.active_pid != 0);

Yeah, this approach seems better overall.  I'll see if I can get this
done after lunch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Andres Freund
Hi,

On 2020-06-17 10:34:31 -0400, Robert Haas wrote:
> On Tue, Jun 16, 2020 at 3:28 PM Andres Freund  wrote:
> > > I think 0003 looks a little strange: it seems to be
> > > testing things that might be implementation details of other things,
> > > and I'm not sure that's really correct. In particular:
> > Hm? Isn't s_lock the, as its comment says, "platform-independent portion
> > of waiting for a spinlock."?  I also don't think we need to purely
> > follow external APIs in internal tests.
> 
> I feel like we at least didn't use to use that on all platforms, but I
> might be misremembering.

There's only one definition of S_LOCK, and s_lock is the only spinlock
related user of perform_spin_delay(). So I don't think so?


> It seems odd and confusing that we have  both
> S_LOCK() and s_lock(), anyway. Differentiating functions based on case
> is not great practice.

It's a terrible idea, yes.  Since we don't actually have any non-default
implementations of S_LOCK, perhaps we should just rip it out? It'd
probably be clearer if SpinLockAcquire() would be what uses TAS() and
falls back to s_lock (best renamed to s_lock_slowpath or such).

It'd perhaps also be good to make SpinLockAcquire() a static inline
instead of a #define, so it can be properly attributed in debuggers and
profilers.

Greetings,

Andres Freund




Re: language cleanups in code and docs

2020-06-17 Thread Andres Freund
Hi,

On 2020-06-17 13:59:26 -0400, Robert Haas wrote:
> On Mon, Jun 15, 2020 at 2:23 PM Andres Freund  wrote:
> > 0002: code: s/master/primary/
> > 0003: code: s/master/leader/
> > 0006: docs: s/master/root/
> > 0007: docs: s/master/supervisor/
> 
> I'd just like to make the pointer here that there's value in trying to
> use different terminology for different things. I picked "leader" and
> "worker" for parallel query and tried to use them consistently because
> "master" and "slave" were being used widely to refer to physical
> replication, and I thought it would be clearer to use something
> different, so I did.

Just to be clear, that's exactly what I tried to do in the above
patches. E.g. in 0003 I tried to follow the scheme you just
outlined. There's a part of that patch that addresses pg_dump, but most
of the rest is just parallelism related pieces that ended up using
master, even though leader is the more widely used term.  I assume you
were just saying that the above use of different terms is actually
helpful:

> It's confusing if we use the same word for the server from which
> others replicate, the table from which others inherit, the process
> which initiates parallelism, and the first process that is launched
> across the whole cluster, regardless of *which* word we use for those
> things. So, I think there is every possibility that with careful
> thought, we can actually make things clearer, in addition to avoiding
> the use of terms that are no longer welcome.

With which I wholeheartedly agree.

Greetings,

Andres Freund




Re: language cleanups in code and docs

2020-06-17 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 15, 2020 at 2:23 PM Andres Freund  wrote:
>> 0002: code: s/master/primary/
>> 0003: code: s/master/leader/
>> 0006: docs: s/master/root/
>> 0007: docs: s/master/supervisor/

> I'd just like to make the pointer here that there's value in trying to
> use different terminology for different things.

+1 for that.

regards, tom lane




Re: [patch] demote

2020-06-17 Thread Andres Freund
Hi,

On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:
> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
> objectives to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

To make sure we are on the same page: What your patch intends to do is
to leave the server running, but switch from being a primary to
replicating from another system. Correct?


> Main difference with Amul's patch is that all backends must be disconnected to
> proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no options to wait for the action to finish, as 
> we
> do with pg_promote(). Keeping read only session around could probably be
> achieved using global barrier as Amul did, but without all the complexity
> related to WAL writes prohibition.

FWIW just doing that for normal backends isn't sufficient, you also have
to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
to hint bits, the checkpoint record, and some more).


> There are still some questions in the current patch. As I wrote, it's a humble
> patch, a proof of concept, a bit naive.
> 
> Does it worth discussing it and improving it further or do I miss something
> obvious in this design that leads to a dead end?

I don't think there's a fundamental issue, but I think it needs to deal
with a lot more things than it does right now. StartupXLOG doesn't
currently deal correctly with subsystems that are already
initialized. And your patch doesn't make it so as far as I can tell.

Greetings,

Andres Freund




Re: More tzdb fun: POSIXRULES is being deprecated upstream

2020-06-17 Thread Tom Lane
Last year I wrote:
> Paul Eggert, in https://mm.icann.org/pipermail/tz/2019-June/028172.html:
>> zic’s -p option was intended as a transition from historical
>> System V code that treated TZ="XXXnYYY" as meaning US
>> daylight-saving rules in a time zone n hours west of UT,
>> with XXX abbreviating standard time and YYY abbreviating DST.
>> zic -p allows the tzdata installer to specify (say)
>> Europe/Brussels's rules instead of US rules.  This behavior
>> is not well documented and often fails in practice; for example it
>> does not work with current glibc for contemporary timestamps, and
>> it does not work in tzdb itself for timestamps after 2037.
>> So, document it as being obsolete, with the intent that it
>> will be removed in a future version.  This change does not
>> affect behavior of the default installation.

Well, we ignored this for a year, but it's about to become unavoidable:
http://mm.icann.org/pipermail/tz/2020-June/029093.html
IANA upstream is changing things so that by default there will not be
any "posixrules" file in the tz database.

That wouldn't directly affect our builds, since we don't use their
Makefile anyway.  But it will affect installations that use
--with-system-tzdata, which I believe is most vendor-packaged
Postgres installations.

It's possible that most or even all tzdata packagers will ignore
the change and continue to ship a posixrules file, for backwards
compatibility's sake.  But I doubt we should bet that way.
glibc-based distros, in particular, would have little motivation to
do so.  We should expect that, starting probably this fall, there
will be installations with no posixrules file.

The minimum thing that we have to do, I'd say, is to change the
documentation to explain what happens if there's no posixrules file.
However, in view of the fact that the posixrules feature doesn't work
past 2037 and isn't going to be fixed, maybe we should just nuke it
now rather than waiting for our hand to be forced.  I'm not sure that
I've ever heard of anyone replacing the posixrules file anyway.
(The fallback case is actually better in that it works for dates past
2037; it's worse only in that you can't configure it.)

I would definitely be in favor of "nuke it now" with respect to HEAD.
It's a bit more debatable for the back branches.  However, all branches
are going to be equally exposed to updated system tzdata trees, so
we've typically felt that changes in the tz-related code should be
back-patched.

Thoughts?

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Andres Freund
Hi,

On 2020-06-17 12:07:22 -0400, Robert Haas wrote:
> On Wed, Jun 17, 2020 at 10:58 AM Tom Lane  wrote:
> > I also think that putting such a thing into ALTER SYSTEM has got big
> > logical problems.  Someday we will probably want to have ALTER SYSTEM
> > write WAL so that standby servers can absorb the settings changes.
> > But if writing WAL is disabled, how can you ever turn the thing off again?
> 
> I mean, the syntax that we use for a feature like this is arbitrary. I
> picked this one, so I like it, but it can easily be changed if other
> people want something else. The rest of this argument doesn't seem to
> me to make very much sense. The existing ALTER SYSTEM functionality to
> modify a text configuration file isn't replicated today and I'm not
> sure why we should make it so, considering that replication generally
> only considers things that are guaranteed to be the same on the master
> and the standby, which this is not. But even if we did, that has
> nothing to do with whether some functionality that changes the system
> state without changing a text file ought to also be replicated. This
> is a piece of cluster management functionality and it makes no sense
> to replicate it. And no right-thinking person would ever propose to
> change a feature that renders the system read-only in such a way that
> it was impossible to deactivate it. That would be nuts.

I agree that the concrete syntax here doesn't seem to matter much. If
this worked by actually putting a GUC into the config file, it would
perhaps matter a bit more, but it doesn't afaict.  It seems good to
avoid new top-level statements, and ALTER SYSTEM seems to fit well.


I wonder if there's an argument about wanting to be able to execute this
command over a physical replication connection? I think this feature
fairly obviously is a building block for "gracefully fail over to this
standby", and it seems like it'd be nicer if that didn't potentially
require two pg_hba.conf entries for the to-be-promoted primary on the
current/old primary?


> > Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> > involves just killing the primary server, not expecting that you can
> > leisurely issue some commands to it first.
> 
> Yeah, that's exactly the problem I want to fix. If you kill the master
> server, then you have interrupted service, even for read-only queries.
> That sucks. Also, even if you don't care about interrupting service on
> the master, it's actually sorta hard to guarantee a clean switchover.
> The walsenders are supposed to send all the WAL from the master before
> exiting, but if the connection is broken for some reason, then the
> master is down and the standbys can't stream the rest of the WAL. You
> can start it up again, but then you might generate more WAL. You can
> try to copy the WAL around manually from one pg_wal directory to
> another, but that's not a very nice thing for users to need to do
> manually, and seems buggy and error-prone.

Also (I'm sure you're aware) if you just non-gracefully shut down the
old primary, you're going to have to rewind the old primary to be able
to use it as a standby. And if you non-gracefully stop you're gonna
incur checkpoint overhead, which is *massive* on non-toy
databases. There's a huge practical difference between a minor version
upgrade causing 10s of unavailability and causing 5min-30min.


> And how do you figure out where the WAL ends on the master and make
> sure that the standby replayed it all? If the master is up, it's easy:
> you just use the same queries you use all the time. If the master is
> down, you have to use some different technique that involves manually
> examining files or scrutinizing pg_controldata output. It's actually
> very difficult to get this right.

Yea, it's absurdly hard. I think it's really kind of ridiculous that we
expect others to get this right if we, the developers of this stuff,
can't really get it right because it's so complicated. Which imo makes
this:

> > Commands that involve a whole
> > bunch of subtle interlocking --- and, therefore, aren't going to work if
> > anything has gone wrong already anywhere in the server --- seem like a
> > particularly poor thing to be hanging your HA strategy on.

more of an argument for having this type of stuff builtin.


> It's important not to conflate controlled switchover with failover.
> When there's a failover, you have to accept some risk of data loss or
> service interruption; but a controlled switchover does not need to
> carry the same risks and there are plenty of systems out there where
> it doesn't.

Yup.

Greetings,

Andres Freund




Re: Review for GetWALAvailability()

2020-06-17 Thread Alvaro Herrera
I think passing the slot name when the slot is also passed is useless
and wasteful; it'd be better to pass NULL for the name and ignore the
strcmp() in that case -- in fact I suggest to forbid passing both name
and slot.  (Any failure there would risk raising an error during
checkpoint, which is undesirable.)

So I propose the following tweaks to your patch, and otherwise +1.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b94e11a8e7..d99d0e9b42 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -383,24 +383,25 @@ ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name,
 	ReplicationSlot *s;
 	int			active_pid;
 
+	AssertArg((slot == NULL) ^ (name == NULL));
+
 retry:
 	Assert(MyReplicationSlot == NULL);
 
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	/*
-	 * Search for the slot with the specified name if the slot to
-	 * acquire is not given. If the slot is not found, we either
-	 * return -1 or error out.
+	 * Search for the slot with the specified name if the slot to acquire is
+	 * not given. If the slot is not found, we either return -1 or error out.
 	 */
-	s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
-	if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)
+	s = slot ? slot : SearchNamedReplicationSlot(name);
+	if (s == NULL || !s->in_use ||
+		(name && strcmp(name, NameStr(s->data.name)) != 0))
 	{
+		LWLockRelease(ReplicationSlotControlLock);
+
 		if (behavior == SAB_Inquire)
-		{
-			LWLockRelease(ReplicationSlotControlLock);
 			return -1;
-		}
 		ereport(ERROR,
 (errcode(ERRCODE_UNDEFINED_OBJECT),
  errmsg("replication slot \"%s\" does not exist", name)));
@@ -1161,8 +1162,7 @@ restart:
 			 * we have to wait on that CV for the process owning
 			 * the slot to be terminated, later.
 			 */
-			wspid = ReplicationSlotAcquireInternal(s, NameStr(slotname),
-   SAB_Inquire);
+			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
 
 			/*
 			 * Exit the loop if we successfully acquired the slot or


Re: language cleanups in code and docs

2020-06-17 Thread Robert Haas
On Mon, Jun 15, 2020 at 2:23 PM Andres Freund  wrote:
> 0002: code: s/master/primary/
> 0003: code: s/master/leader/
> 0006: docs: s/master/root/
> 0007: docs: s/master/supervisor/

I'd just like to make the pointer here that there's value in trying to
use different terminology for different things. I picked "leader" and
"worker" for parallel query and tried to use them consistently because
"master" and "slave" were being used widely to refer to physical
replication, and I thought it would be clearer to use something
different, so I did. It's confusing if we use the same word for the
server from which others replicate, the table from which others
inherit, the process which initiates parallelism, and the first
process that is launched across the whole cluster, regardless of
*which* word we use for those things. So, I think there is every
possibility that with careful thought, we can actually make things
clearer, in addition to avoiding the use of terms that are no longer
welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 12:45 PM Tom Lane  wrote:
> > Writing hint bits and marking index tuples as killed do not write WAL
> > unless checksums are enabled.
>
> And your point is?  I thought enabling checksums was considered
> good practice these days.

I don't want to have an argument about what typical or best practices
are; I wasn't trying to make any point about that one way or the
other. I'm just saying that the operations you listed don't
necessarily all write WAL. In any event, even if they did, the larger
point is that standbys work like that, too, so it's not unprecedented
or illogical to think of such things.

> >> You're right that these are the same things that we already forbid on a
> >> standby, for the same reason, so maybe it won't be as hard to identify
> >> them as I feared.  I wonder whether we should envision this as "demote
> >> primary to standby" rather than an independent feature.
>
> > See my comments on the nearby pg_demote thread. I think we want both.
>
> Well, if pg_demote can be done for X amount of effort, and largely
> gets the job done, while this requires 10X or 100X the effort and
> introduces 10X or 100X as many bugs, I'm not especially convinced
> that we want both.

Sure: if two features duplicate each other, and one of them is way
more work and way more buggy, then it's silly to have both, and we
should just accept the easy, bug-free one. However, as I said in the
other email to which I referred you, I currently believe that these
two features actually don't duplicate each other and that using them
both together would be quite beneficial. Also, even if they did, I
don't know where you are getting the idea that this feature will be
10X or 100X more work and more buggy than the other one. I have looked
at this code prior to it being posted, but I haven't looked at the
other code at all; I am guessing that you have looked at neither. I
would be happy if you did, because it is often the case that
architectural issues that escape other people are apparent to you upon
examination, and it's always nice to know about those earlier rather
than later so that one can decide to (a) give up or (b) fix them. But
I see no point in speculating in the abstract that such issues may
exist and that they may be more severe in one case than the other. My
own guess is that, properly implemented, they are within 2-3X of each
in one direction or the other, not 10-100X. It is almost unbelievable
to me that the pg_demote patch could be 100X simpler than this one; if
it were, I'd be practically certain it was a 5-minute hack job
unworthy of any serious consideration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 17, 2020 at 12:27 PM Tom Lane  wrote:
>> Which of the things I mentioned don't require writing WAL?

> Writing hint bits and marking index tuples as killed do not write WAL
> unless checksums are enabled.

And your point is?  I thought enabling checksums was considered
good practice these days.

>> You're right that these are the same things that we already forbid on a
>> standby, for the same reason, so maybe it won't be as hard to identify
>> them as I feared.  I wonder whether we should envision this as "demote
>> primary to standby" rather than an independent feature.

> See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both. 

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 12:27 PM Tom Lane  wrote:
> Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

> You're right that these are the same things that we already forbid on a
> standby, for the same reason, so maybe it won't be as hard to identify
> them as I feared.  I wonder whether we should envision this as "demote
> primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

> >> I also think that putting such a thing into ALTER SYSTEM has got big
> >> logical problems.
>
> > ... no right-thinking person would ever propose to
> > change a feature that renders the system read-only in such a way that
> > it was impossible to deactivate it. That would be nuts.
>
> My point was that putting this in ALTER SYSTEM paints us into a corner
> as to what we can do with ALTER SYSTEM in the future: we won't ever be
> able to make that do anything that would require writing WAL.  And I
> don't entirely believe your argument that that will never be something
> we'd want to do.

I think that depends a lot on how you view ALTER SYSTEM. I believe it
would be reasonable to view ALTER SYSTEM as a catch-all for commands
that make system-wide state changes, even if those changes are not all
of the same kind as each other; some might be machine-local, and
others cluster-wide; some WAL-logged, and others not. I don't think
it's smart to view ALTER SYSTEM through a lens that boxes it into only
editing postgresql.auto.conf; if that were so, we ought to have called
it ALTER CONFIGURATION FILE or something rather than ALTER SYSTEM. For
that reason, I do not see the choice of syntax as painting us into a
corner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [patch] demote

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
 wrote:
> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
> objectives to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

Cool! This was vaguely on my hit list, but neither I nor any of my
colleagues had gotten the time and energy to have a go at it.

> Main difference with Amul's patch is that all backends must be disconnected to
> proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no options to wait for the action to finish, as 
> we
> do with pg_promote(). Keeping read only session around could probably be
> achieved using global barrier as Amul did, but without all the complexity
> related to WAL writes prohibition.
>
> There are still some questions in the current patch. As I wrote, it's a humble
> patch, a proof of concept, a bit naive.
>
> Does it worth discussing it and improving it further or do I miss something
> obvious in this design that leads to a dead end?

I haven't looked at your code, but I think we should view the two
efforts as complementing each other, not competing. With both patches
in play, a clean switchover would look like this:

- first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
to make the primary read only, killing off write transactions
- next use pg_ctl promote to promote the standby
- finally use pg_ctl demote (or whatever we decide to call it) to turn
the read-only primary into a standby of the new primary

I think this would be way better than what you have to do today,
which, as I mentioned in my reply to Tom on the other thread, is very
complicated and error-prone. I think with the combination of that
patch and this one (or some successor version of each) we could get to
a point where the tooling to do a clean switchover is relatively easy
to write and doesn't involve having to shut down the server completely
at any point. If we can do it while also preserving connections, at
least for read-only queries, that's a better user experience, but as
Tom pointed out over there, there are real concerns about the
complexity of these patches, so it may be that the approach you've
taken of just killing everything is safer and thus a superior choice
overall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Tom Lane
Robert Haas  writes:
> This seems like pretty dubious hand-waving. Of course, things that
> write WAL are going to be broken by a switch that prevents writing
> WAL; but if they were not, there would be no purpose in having such a
> switch, so that's not really an argument. But you seem to have mixed
> in some things that don't require writing WAL, and claimed without
> evidence that those would somehow also be broken.

Which of the things I mentioned don't require writing WAL?

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared.  I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

>> I also think that putting such a thing into ALTER SYSTEM has got big
>> logical problems.

> ... no right-thinking person would ever propose to
> change a feature that renders the system read-only in such a way that
> it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL.  And I
don't entirely believe your argument that that will never be something
we'd want to do.

regards, tom lane




Re: language cleanups in code and docs

2020-06-17 Thread David Steele

On 6/17/20 12:08 PM, Magnus Hagander wrote:
On Wed, Jun 17, 2020 at 4:15 PM Andrew Dunstan
<andrew.duns...@2ndquadrant.com> wrote:


I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
too many other uses of Block in the sources. Forbidden might be a better
substitution, or Banned maybe. BanList is even less characters than
BlackList.

I'd be OK with either of those really -- I went with block because it 
was the easiest one :)


Not sure the number of characters is the important part :) Banlist does 
make sense to me for other reasons though -- it's what it is, isn't it? 
It bans those oids from being used in the current session -- I don't 
think there's any struggle to "make that sentence work", which means 
that seems like the relevant term.


I've also seen allowList/denyList as an alternative. I do agree 
that blockList is a bit confusing since we often use block in a very 
different context.


I do think it's worth doing -- it's a small round of changes, and it 
doesn't change anything user-exposed, so the cost for us is basically zero.


+1

Regards,
--
-David
da...@pgmasters.net




Re: language cleanups in code and docs

2020-06-17 Thread Jonathan S. Katz
On 6/17/20 12:08 PM, Magnus Hagander wrote:
> 
> 
> On Wed, Jun 17, 2020 at 4:15 PM Andrew Dunstan
> <andrew.duns...@2ndquadrant.com>
> wrote:
> 
> 
> On 6/17/20 6:32 AM, Magnus Hagander wrote:
> >
> >
> >
> >
> >
> > In looking at this I realize we also have exactly one thing referred
> > to as "blacklist" in our codebase, which is the "enum blacklist" (and
> > then a small internal variable in pgindent). AFAICT, it's not actually
> > exposed to userspace anywhere, so we could probably make the attached
> > change to blocklist at no "cost" (the only thing changed is the name
> > of the hash table, and we definitely change things like that in normal
> > releases with no specific thought on backwards compat).
> >
> >
> 
> I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
> too many other uses of Block in the sources. Forbidden might be a better
> substitution, or Banned maybe. BanList is even less characters than
> BlackList.
> 
> 
> I know, bikeshedding here.
> 
> 
> I'd be OK with either of those really -- I went with block because it
> was the easiest one :)
> 
> Not sure the number of characters is the important part :) Banlist does
> make sense to me for other reasons though -- it's what it is, isn't it?
> It bans those oids from being used in the current session -- I don't
> think there's any struggle to "make that sentence work", which means
> that seems like the relevant term.
> 
> I do think it's worth doing -- it's a small round of changes, and it
> doesn't change anything user-exposed, so the cost for us is basically zero. 

+1. I know past efforts of ours to update our language have been
well-received, even long after the fact, and given this set has been
voiced actively in other fora and, as Magnus states, the cost for us to
change it is basically zero, we should just do it.

Jonathan





Re: language cleanups in code and docs

2020-06-17 Thread Magnus Hagander
On Wed, Jun 17, 2020 at 4:15 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 6/17/20 6:32 AM, Magnus Hagander wrote:
> >
> >
> >
> >
> >
> > In looking at this I realize we also have exactly one thing referred
> > to as "blacklist" in our codebase, which is the "enum blacklist" (and
> > then a small internal variable in pgindent). AFAICT, it's not actually
> > exposed to userspace anywhere, so we could probably make the attached
> > change to blocklist at no "cost" (the only thing changed is the name
> > of the hash table, and we definitely change things like that in normal
> > releases with no specific thought on backwards compat).
> >
> >
>
> I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
> too many other uses of Block in the sources. Forbidden might be a better
> substitution, or Banned maybe. BanList is even less characters than
> BlackList.
>
>
> I know, bikeshedding here.
>

I'd be OK with either of those really -- I went with block because it was
the easiest one :)

Not sure the number of characters is the important part :) Banlist does
make sense to me for other reasons though -- it's what it is, isn't it? It
bans those oids from being used in the current session -- I don't think
there's any struggle to "make that sentence work", which means that seems
like the relevant term.

I do think it's worth doing -- it's a small round of changes, and it
doesn't change anything user-exposed, so the cost for us is basically zero.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: language cleanups in code and docs

2020-06-17 Thread Andrew Dunstan


On 6/17/20 11:00 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 6/17/20 6:32 AM, Magnus Hagander wrote:
>>> In looking at this I realize we also have exactly one thing referred
>>> to as "blacklist" in our codebase, which is the "enum blacklist" (and
>>> then a small internal variable in pgindent). AFAICT, it's not actually
>>> exposed to userspace anywhere, so we could probably make the attached
>>> change to blocklist at no "cost" (the only thing changed is the name
>>> of the hash table, and we definitely change things like that in normal
>>> releases with no specific thought on backwards compat).
>> I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
>> too many other uses of Block in the sources. Forbidden might be a better
>> substitution, or Banned maybe. BanList is even less characters than
>> BlackList.
> I think worrying about blacklist/whitelist is carrying things a bit far
> in the first place.


For the small effort and minimal impact involved, I think it's worth
avoiding the bad publicity.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 10:58 AM Tom Lane  wrote:
> Aside from the points you mention, such a switch would break autovacuum.
> It would break the ability for scans to do HOT-chain cleanup, which would
> likely lead to some odd behaviors (if, eg, somebody flips the switch
> between where that's supposed to happen and where an update needs to
> happen on the same page).  It would break the ability for indexscans to do
> killed-tuple marking, which is critical for performance in some scenarios.
> It would break the ability to set tuple hint bits, which is even more
> critical for performance.  It'd possibly break, or at least complicate,
> logic in index AMs to deal with index format updates --- I'm fairly sure
> there are places that will try to update out-of-date data structures
> rather than cope with the old structure, even in nominally read-only
> searches.

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken. I don't think that's
the case, but even if it were, so what? We live with all of these
restrictions on standbys anyway.

> I also think that putting such a thing into ALTER SYSTEM has got big
> logical problems.  Someday we will probably want to have ALTER SYSTEM
> write WAL so that standby servers can absorb the settings changes.
> But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

> Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> involves just killing the primary server, not expecting that you can
> leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.
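
(The "same queries you use all the time" are, roughly, the existing WAL
position functions and views; shown here only as a sketch:)

-- On the primary, while it is still up:
SELECT pg_current_wal_lsn();
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
  FROM pg_stat_replication;

-- On the standby:
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();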

> Commands that involve a whole
> bunch of subtle interlocking --- and, therefore, aren't going to work if
> anything has gone wrong already anywhere in the server --- seem like a
> particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

> I also wonder
> what this accomplishes that couldn't be done much more simply by killing
> the walsenders.

Killing the walsenders does nothing ... the clients immediately reconnect.
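
(At the SQL level, "killing the walsenders" amounts to something like the
statement below; the standbys simply reconnect and streaming resumes:)

SELECT pg_terminate_backend(pid) FROM pg_stat_replication;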

> In short, I see a huge amount of complexity here, an ongoing source of
> hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

I do think this is complex and the risk of bugs that are hard to
identify or hard to fix certainly needs to be considered. I
strenuously disagree with the idea that there is not very much real
usefulness.

Re: improving basebackup.c's file-reading code

2020-06-17 Thread Robert Haas
On Wed, Apr 29, 2020 at 5:52 AM Hamid Akhtar  wrote:
> The idea and the patch looks good to me.
>
> It makes sense to change fread to basebackup_read_file which internally calls 
> pg_pread which is a portable function as opposed to read. As you've 
> mentioned, this is much better for error handling.

Thanks for the review. I have now committed the patch, after rebasing
and adjusting one comment slightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[patch] demote

2020-06-17 Thread Jehan-Guillaume de Rorthais
Hi,

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
objectives to mine, I decided to share the humble patch I am playing with to
step down an instance from primary to standby status.

I'm still wondering about the coding style, but as discussion about this kind
of feature is picking up, I'm sharing it at an early stage so it has a chance
to be discussed.

I'm opening a new discussion to avoid disturbing Amul's one. 

The design of my patch is similar to the crash recovery code, without resetting
the shared memory. It supports smart and fast demote. The only existing user
interface currently is "pg_ctl [-m smart|fast] demote". An SQL admin function,
eg. pg_demote(), would be easy to add.

The main difference from Amul's patch is that all backends must be disconnected
to proceed with the demote. Either we wait for them to disconnect (smart) or we
kill them (fast). This makes life much easier from the code point of view, but
much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin
would kill the session, with no option to wait for the action to finish, as we
do with pg_promote(). Keeping read-only sessions around could probably be
achieved using the global barrier as Amul did, but without all the complexity
related to WAL write prohibition.

There are still some questions in the current patch. As I wrote, it's a humble
patch, a proof of concept, a bit naive.

Is it worth discussing and improving further, or am I missing something
obvious in this design that leads to a dead end?

Thanks.

Regards,

[1] 
https://www.postgresql.org/message-id/flat/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw%40mail.gmail.com
From 2075441bebc47d3dd5b6e0a76e16f5ebb12858af Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH] Demote PoC

---
 src/backend/access/transam/xlog.c   |   3 +-
 src/backend/postmaster/postmaster.c | 206 ++--
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c | 105 
 src/include/catalog/pg_control.h|   1 +
 src/include/libpq/libpq-be.h|   7 +-
 src/include/utils/pidfile.h |   1 +
 7 files changed, 271 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..8a7f1a0855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8493,6 +8493,7 @@ ShutdownXLOG(int code, Datum arg)
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
 	/* Don't be chatty in standalone mode */
+	// FIXME: what message when demoting?
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
@@ -8760,7 +8761,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = DB_SHUTDOWNING; // DEMOTING?
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..465d020f9d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,6 +150,9 @@
 
 #define BACKEND_TYPE_WORKER		(BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
+/* file to signal demotion from primary to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 /*
  * List of active backends (or child processes anyway; we don't actually
  * know whether a given child has become a backend or is still in the
@@ -269,18 +272,23 @@ typedef enum
 static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 
 /* Startup/shutdown state */
-#define			NoShutdown		0
-#define			SmartShutdown	1
-#define			FastShutdown	2
-#define			ImmediateShutdown	3
-
-static int	Shutdown = NoShutdown;
+typedef enum StepDownState {
+	NoShutdown = 0, /* find better label? */
+	SmartShutdown,
+	SmartDemote,
+	FastShutdown,
+	FastDemote,
+	ImmediateShutdown
+} StepDownState;
+
+static StepDownState StepDown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -314,7 +322,7 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * will not be very long).
  *
  * Notice that this state variable does not distinguish *why* we entered
- * states later than PM_RUN --- Shutdown and FatalError must be consulted
+ * st

Re: jsonpath versus NaN

2020-06-17 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Jun 11, 2020 at 10:00 PM Tom Lane  wrote:
>> I don't think this is very relevant.  The SQL standard has not got the
>> concepts of Inf or NaN either (see 4.4.2 Characteristics of numbers),
>> therefore their definition is only envisioning that a string representing
>> a normal finite number should be castable to DOUBLE PRECISION.  Thus,
>> both of the relevant standards think that "numbers" are just finite
>> numbers.

> Yes, I see.  No standard insists we should support NaN.  However,
> standard claims .double() should behave the same as CAST to double.
> So, I think if CAST supports NaN, but .double() doesn't, it's still a
> violation.

No, I think you are completely misunderstanding the standard.  They
are saying that strings that look like legal numbers according to SQL
should be castable into numbers.  But NaN and Inf are not legal
numbers according to SQL, so there is nothing in that text that
justifies accepting "NaN".  Nor does the JSON standard provide any
support for that position.  So I think it is fine to leave NaN/Inf
out of the world of what you can write in jsonpath.

I'd be more willing to let the code do this if it didn't require such
a horrid, dangerous kluge to do so.  But it does, and I don't see any
easy way around that, so I think we should just take out the kluge.
And do so sooner not later, before some misguided user starts to
depend on it.

regards, tom lane




Re: tablespace_map code cleanup

2020-06-17 Thread Robert Haas
On Wed, May 27, 2020 at 8:11 PM Kyotaro Horiguchi
 wrote:
> That makes sense to me. Thanks!

Great. Thanks to you and Amit for reviewing. I have committed the patches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: language cleanups in code and docs

2020-06-17 Thread Tom Lane
Andrew Dunstan  writes:
> On 6/17/20 6:32 AM, Magnus Hagander wrote:
>> In looking at this I realize we also have exactly one thing referred
>> to as "blacklist" in our codebase, which is the "enum blacklist" (and
>> then a small internal variable in pgindent). AFAICT, it's not actually
>> exposed to userspace anywhere, so we could probably make the attached
>> change to blocklist at no "cost" (the only thing changed is the name
>> of the hash table, and we definitely change things like that in normal
>> releases with no specific thought on backwards compat).

> I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
> too many other uses of Block in the sources. Forbidden might be a better
> substitution, or Banned maybe. BanList is even less characters than
> BlackList.

I think worrying about blacklist/whitelist is carrying things a bit far
in the first place.

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Jun 16, 2020 at 7:26 PM amul sul  wrote:
>> Attached patch proposes $Subject feature which forces the system into 
>> read-only
>> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
>> WRITE executed.

> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well?

I think this is a really bad idea and should simply be rejected.

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page).  It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance.  It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.  Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.  Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.  I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

regards, tom lane




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 9:51 AM tushar  wrote:
> 1) ALTER SYSTEM
>
> postgres=# alter system read only;
> ALTER SYSTEM
> postgres=# alter  system reset all;
> ALTER SYSTEM
> postgres=# create table t1(n int);
> ERROR:  cannot execute CREATE TABLE in a read-only transaction
>
> Initially i thought after firing 'Alter system reset all' , it will be
> back to  normal.
>
> can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.

> 2)When i connected to postgres in a single user mode , i was not able to
> set the system in read only
>
> [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres
>
> PostgreSQL stand-alone backend 14devel
> backend> alter system read only;
> ERROR:  checkpointer is not running
>
> backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Robert Haas
On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila  wrote:
> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well?  If so, will the checkpointer process
> writes the current dirty pages and writes a checkpoint record or we
> skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)

> > If there are open transactions that have acquired an XID, the sessions are 
> > killed
> > before the barrier is absorbed.
>
> What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.
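
(A sketch of the behavior described above, assuming the proposed feature; the
table name t is made up and max_prepared_transactions > 0 is required:)

BEGIN;
INSERT INTO t VALUES (1);
PREPARE TRANSACTION 'px';        -- done while the system was still read-write

ALTER SYSTEM READ ONLY;          -- proposed syntax
COMMIT PREPARED 'px';            -- expected to fail: it would have to write WAL

ALTER SYSTEM READ WRITE;         -- proposed syntax
COMMIT PREPARED 'px';            -- now expected to succeed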

> What if vacuum is on an unlogged relation?  Do we allow writes via
> vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

> > Another part of the patch that quite uneasy and need a discussion is that 
> > when the
> > shutdown in the read-only state we do skip shutdown checkpoint and at a 
> > restart, first
> > startup recovery will be performed and latter the read-only state will be 
> > restored to
> > prohibit further WAL write irrespective of recovery checkpoint succeed or 
> > not. The
> > concern is here if this startup recovery checkpoint wasn't ok, then it will 
> > never happen
> > even if it's later put back into read-write mode.
>
> I am not able to understand this problem.  What do you mean by
> "recovery checkpoint succeed or not", do you add a try..catch and skip
> any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks)

2020-06-17 Thread Robert Haas
On Tue, Jun 16, 2020 at 3:28 PM Andres Freund  wrote:
> > I think 0003 looks a little strange: it seems to be
> > testing things that might be implementation details of other things,
> > and I'm not sure that's really correct. In particular:
>
> My main motivation was to have something that runs more often than than
> the embeded test in s_lock.c's that nobody ever runs (they wouldn't even
> pass with disabled spinlocks, as S_LOCK_FREE isn't implemented).

Sure, that makes sense.

> > + /* and that "contended" acquisition works */
> > + s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc");
> > + S_UNLOCK(&struct_w_lock.lock);
> >
> > I didn't think we had formally promised that s_lock() is actually
> > defined or working on all platforms.
>
> Hm? Isn't s_lock the, as its comment says, "platform-independent portion
> of waiting for a spinlock."?  I also don't think we need to purely
> follow external APIs in internal tests.

I feel like we at least didn't use to use that on all platforms, but I
might be misremembering. It seems odd and confusing that we have both
S_LOCK() and s_lock(), anyway. Differentiating functions based on case
is not great practice.

> Sure, there's a lot that'd pass. But it's more than we had before. It
> did catch a bug much quicker than I'd have otherwise found it, FWIW.
>
> I don't think an empty implementation would pass btw, as long as TAS is
> defined.

Fair enough.

> Yea, we could use something better. But I don't see that happening
> quickly, and having something seems better than nothing.
>
> That seems quite hard to achieve. I really just wanted to have something
> I can do some very basic tests to catch issues quicker.
>
> The atomics tests found numerous issues btw, despite also not testing
> concurrency.
>
> I think we generally have way too few of such trivial tests. They can
> find plenty "real world" issues, but more importantly make it much
> quicker to iterate when working on some piece of code.
>
> Without the tests I couldn't even reproduce a deadlock due to the
> nesting. So they imo are pretty essential?

I'm not telling you not to commit these; I'm just more skeptical of
whether they are the right approach than you seem to be. But that's
OK: people can like different things, and I don't know exactly what
would be better anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




回复:回复:回复:how to create index concurrently on partitioned table

2020-06-17 Thread 李杰(慎追)
> We can refer to the implementation in the ReindexRelationConcurrently,
> in the three phases of the REINDEX CONCURRENTLY,
> all indexes of the partitions are operated one by one in each phase.
> In this way, we can maintain the consistency of the entire partitioned table 
> index.
> After we implement CIC in this way, we can also complete reindex partitioned 
> table index concurrently (this is not done now.)

After careful analysis, I found there are two choices, and I am torn between
them.
We could handle the entire partition tree with one transaction in each of the
three phases of CIC, just as for ordinary tables.

However, there is a problem with that. If there are many partitions,
validate_index() may have to handle a large number of missing index entries;
especially for the first partition, a lot of time may have passed by then and
many entries will be missing.
In that case, why don't we put the second and third phases together into one
transaction per partition?

So, which scheme do you think is better?
Maintain consistency across all three phases,
or only maintain consistency in the first phase?
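
(For reference, the per-partition handling under discussion mirrors what users
already do by hand today with existing SQL; a sketch with made-up names:)

CREATE INDEX CONCURRENTLY part1_n_idx ON part1 (n);
CREATE INDEX CONCURRENTLY part2_n_idx ON part2 (n);
-- ... one CREATE INDEX CONCURRENTLY per partition ...
CREATE INDEX parttable_n_idx ON ONLY parttable (n);   -- parent index, initially invalid
ALTER INDEX parttable_n_idx ATTACH PARTITION part1_n_idx;
ALTER INDEX parttable_n_idx ATTACH PARTITION part2_n_idx;
-- ... the parent index becomes valid once every partition's index is attached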


Thank you very much,
 Regards,  Adger





Re: language cleanups in code and docs

2020-06-17 Thread Andrew Dunstan


On 6/17/20 6:32 AM, Magnus Hagander wrote:
>
>
>
>
>
> In looking at this I realize we also have exactly one thing referred
> to as "blacklist" in our codebase, which is the "enum blacklist" (and
> then a small internal variable in pgindent). AFAICT, it's not actually
> exposed to userspace anywhere, so we could probably make the attached
> change to blocklist at no "cost" (the only thing changed is the name
> of the hash table, and we definitely change things like that in normal
> releases with no specific thought on backwards compat).
>
>

I'm not sure I like doing s/Black/Block/ here. It reads oddly. There are
too many other uses of Block in the sources. Forbidden might be a better
substitution, or Banned maybe. BanList is even less characters than
BlackList.


I know, bikeshedding here.


cheers


andrew



-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [PATCH] Missing links between system catalog documentation pages

2020-06-17 Thread Dagfinn Ilmari Mannsåker
Fabien COELHO  writes:

> Hello Dagfinn,
>
>>> The attached patch makes the first mention of another system catalog or
>>> view (as well as pg_hba.conf in pg_hba_file_lines) a link, for easier
>>> navigation.
>
> Why only the first mention? It seems unlikely that I would ever read
> such chapter linearly, and even so that I would want to jump especially
> on the first occurrence but not on others, so ISTM that it should be
> done all mentions?

It's the first mention in the introductory paragraph of _each_ catalog
table/view page, not the first mention in the entire catalogs.sgml file.
E.g. https://www.postgresql.org/docs/current/catalog-pg-aggregate.html
has two mentions of pg_proc one word apart:

   Each entry in pg_aggregate is an extension of an entry in pg_proc. The
   pg_proc entry carries the aggregate's name, …

I didn't think there was much point in linkifying both in that case, and
other similar situations.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread tushar

On 6/16/20 7:25 PM, amul sul wrote:
Attached patch proposes $Subject feature which forces the system into
read-only mode where insert write-ahead log will be prohibited until
ALTER SYSTEM READ WRITE executed.


Thanks Amul.

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter  system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be 
back to  normal.


can't we have a syntax like - "Alter system set read_only='True' ; "

so that ALTER SYSTEM command syntax should be same for all.

postgres=# \h alter system
Command: ALTER SYSTEM
Description: change a server configuration parameter
Syntax:
ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' | 
DEFAULT }


ALTER SYSTEM RESET configuration_parameter
ALTER SYSTEM RESET ALL

How we are going to justify this in help command of ALTER SYSTEM ?

2)When i connected to postgres in a single user mode , i was not able to 
set the system in read only


[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres


PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR:  checkpointer is not running

backend>

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: language cleanups in code and docs

2020-06-17 Thread Jonathan S. Katz
On 6/17/20 6:06 AM, Laurenz Albe wrote:
> On Tue, 2020-06-16 at 19:55 -0400, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> On 6/15/20 2:22 PM, Andres Freund wrote:
 2) 'master' as a reference to the branch. Personally I be in favor of
 changing the branch name, but it seems like it'd be better done as a
 somewhat separate discussion to me, as it affects development
 practices to some degree.
>>> I'm OK with this, but I would need plenty of notice to get the buildfarm
>>> adjusted.
>>
>> "master" is the default branch name established by git, is it not?  Not
>> something we picked.  I don't feel like we need to be out front of the
>> rest of the world in changing that.  It'd be a cheaper change than some of
>> the other proposals in this thread, no doubt, but it would still create
>> confusion for new hackers who are used to the standard git convention.
> 
> I have the feeling that all this is going somewhat too far.

First, I +1 the changes Andres proposed overall. In addition to being
the right thing to do, it brings things in line with a lot of the
terminology we have been using to describe concepts in PostgreSQL for
years (e.g. primary/replica).

For the name of the git branch, I +1 following the convention of the git
upstream and making changes based on that. Understandably, it could break
things for a bit, but that will occur for a lot of other projects as
well and everyone will adapt. We have the benefit that we're just
beginning our new development cycle too, so this is a good time to
introduce a breaking change ;)

Jonathan





Re: language cleanups in code and docs

2020-06-17 Thread Jonathan S. Katz
On 6/17/20 6:32 AM, Magnus Hagander wrote:

> In looking at this I realize we also have exactly one thing referred to
> as "blacklist" in our codebase, which is the "enum blacklist" (and then
> a small internal variable in pgindent). AFAICT, it's not actually
> exposed to userspace anywhere, so we could probably make the attached
> change to blocklist at no "cost" (the only thing changed is the name of
> the hash table, and we definitely change things like that in normal
> releases with no specific thought on backwards compat).

+1. Though if we are doing that, we should handle "whitelist" too,
as the attached patch does. It's mostly in comments (with one Perl
variable), but I switched the language around to use "allowed".

Jonathan
diff --git a/contrib/postgres_fdw/postgres_fdw.h 
b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..3364a203c7 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,7 +77,7 @@ typedef struct PgFdwRelationInfo
booluse_remote_estimate;
Costfdw_startup_cost;
Costfdw_tuple_cost;
-   List   *shippable_extensions;   /* OIDs of whitelisted 
extensions */
+   List   *shippable_extensions;   /* OIDs of allowed extensions */
 
/* Cached catalog information. */
ForeignTable *table;
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..8a3dca9e90 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -111,7 +111,8 @@ InitializeShippableCache(void)
  *
  * Right now "shippability" is exclusively a function of whether the object
  * belongs to an extension declared by the user.  In the future we could
- * additionally have a whitelist of functions/operators declared one at a time.
+ * additionally have a list of allowed functions/operators declared one at a
+ * time.
  */
 static bool
 lookup_shippable(Oid objectId, Oid classId, PgFdwRelationInfo *fpinfo)
diff --git a/src/backend/access/hash/hashvalidate.c 
b/src/backend/access/hash/hashvalidate.c
index 6f14a9fb45..744d22b042 100644
--- a/src/backend/access/hash/hashvalidate.c
+++ b/src/backend/access/hash/hashvalidate.c
@@ -309,7 +309,7 @@ check_hash_func_signature(Oid funcid, int16 amprocnum, Oid 
argtype)
 * that are different from but physically compatible with the 
opclass
 * datatype.  In some of these cases, even a "binary coercible" 
check
 * fails because there's no relevant cast.  For the moment, fix 
it by
-* having a whitelist of allowed cases.  Test the specific 
function
+* having a list of allowed cases.  Test the specific function
 * identity, not just its input type, because hashvarlena() 
takes
 * INTERNAL and allowing any such function seems too scary.
 */
diff --git a/src/backend/utils/adt/lockfuncs.c 
b/src/backend/utils/adt/lockfuncs.c
index e992d1bbfc..267fc7adfd 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -632,9 +632,9 @@ pg_isolation_test_session_is_blocked(PG_FUNCTION_ARGS)
/*
 * Check if blocked_pid is waiting for a safe snapshot.  We could in
 * theory check the resulting array of blocker PIDs against the
-* interesting PIDs whitelist, but since there is no danger of 
autovacuum
-* blocking GetSafeSnapshot there seems to be no point in expending 
cycles
-* on allocating a buffer and searching for overlap; so it's presently
+* interesting list of allowed PIDs, but since there is no danger of
+* autovacuum blocking GetSafeSnapshot there seems to be no point in 
expending
+* cycles on allocating a buffer and searching for overlap; so it's 
presently
 * sufficient for the isolation tester's purposes to use a single 
element
 * buffer and check if the number of safe snapshot blockers is non-zero.
 */
diff --git a/src/tools/pginclude/README b/src/tools/pginclude/README
index a067c7f472..d1906a3516 100644
--- a/src/tools/pginclude/README
+++ b/src/tools/pginclude/README
@@ -64,7 +64,7 @@ with no prerequisite headers other than postgres.h (or 
postgres_fe.h
 or c.h, as appropriate).
 
 A small number of header files are exempted from this requirement,
-and are whitelisted in the headerscheck script.
+and are allowed in the headerscheck script.
 
 The easy way to run the script is to say "make -s headerscheck" in
 the top-level build directory after completing a build.  You should
@@ -86,7 +86,7 @@ the project's coding language is C, some people write 
extensions in C++,
 so it's helpful for include files to be C++-clean.
 
 A small number of header files are exempted from this requirement,
-and are whitelisted in the cpluspluscheck script.
+and are allowed in the cpluspluscheck script.
 
 The easy way to run the script is to say

Re: [Patch] ALTER SYSTEM READ ONLY

2020-06-17 Thread Amit Kapila
On Tue, Jun 16, 2020 at 7:26 PM amul sul  wrote:
>
> Hi,
>
> Attached patch proposes $Subject feature which forces the system into 
> read-only
> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
> WRITE executed.
>
> The high-level goal is to make the availability/scale-out situation better.  
> The feature
> will help HA setup where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting WAL writes at the end, in 
> case
> of network down on master or replication connections failures.
>
> For example, this feature allows for a controlled switchover without needing 
> to shut
> down the master. You can instead make the master read-only, wait until the 
> standby
> catches up, and then promote the standby. The master remains available for 
> read
> queries throughout, and also for WAL streaming, but without the possibility 
> of any
> new write transactions. After switchover is complete, the master can be shut 
> down
> and brought back up as a standby without needing to use pg_rewind. 
> (Eventually, it
> would be nice to be able to make the read-only master into a standby without 
> having
> to restart it, but that is a problem for another patch.)
>
> This might also help in failover scenarios. For example, if you detect that 
> the master
> has lost network connectivity to the standby, you might make it read-only 
> after 30 s,
> and promote the standby after 60 s, so that you never have two writable 
> masters at
> the same time. In this case, there's still some split-brain, but it's still 
> better than what
> we have now.
>
> Design:
> --
> The proposed feature is built atop of super barrier mechanism commit[1] to 
> coordinate
> global state changes to all active backends.  Backends which executed
> ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> process to change the requested WAL read/write state aka WAL prohibited and 
> WAL
> permitted state respectively.  When the checkpointer process sees the WAL 
> prohibit
> state change request, it emits a global barrier and waits until all backends 
> that
> participate in the ProcSignal absorbs it. Once it has done the WAL read/write 
> state in
> share memory and control file will be updated so that XLogInsertAllowed() 
> returns
> accordingly.
>

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well?  If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

> If there are open transactions that have acquired an XID, the sessions are 
> killed
> before the barrier is absorbed.
>

What about prepared transactions?

> They can't commit without writing WAL, and they
> can't abort without writing WAL, either, so we must at least abort the 
> transaction. We
> don't necessarily need to kill the session, but it's hard to avoid in all 
> cases because
> (1) if there are subtransactions active, we need to force the top-level abort 
> record to
> be written immediately, but we can't really do that while keeping the 
> subtransactions
> on the transaction stack, and (2) if the session is idle, we also need the 
> top-level abort
> record to be written immediately, but can't send an error to the client until 
> the next
> command is issued without losing wire protocol synchronization. For now, we 
> just use
> FATAL to kill the session; maybe this can be improved in the future.
>
> Open transactions that don't have an XID are not killed, but will get an 
> ERROR if they
> try to acquire an XID later, or if they try to write WAL without acquiring an 
> XID (e.g. VACUUM).
>

What if vacuum is on an unlogged relation?  Do we allow writes via
vacuum to unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section 
> that will write
> WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), 
> or
> AssertWALPermitted_HaveXID(). The latter variants are used when we know for 
> certain
> that inserting WAL here must be OK, either because we have an XID (we would 
> have
> been killed by a change to read-only if one had occurred) or for some other 
> reason.
>
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major transition, 
> so we emit
> log message for each successful execution of a ALTER SYSTEM READ {ONLY | 
> WRITE}
> command. Also, we have added a new GUC system_is_read_only which returns "on"
> when the system is in WAL prohibited state or recovery.
>
> Another part of the patch that quite uneasy and need a discussion is that 
> when the
> shutdown in the read-only state we do skip shutdown checkpoint and at a 
> restart, f

Creating a function for exposing memory usage of backend process

2020-06-17 Thread torikoshia

Hi,

As you may know better than I do, backend processes sometimes use a lot
of memory for various reasons, such as caches, prepared statements and
cursors.
When supporting PostgreSQL, I sometimes have to investigate the cause of
memory bloat.

AFAIK, the way to examine it is to attach a debugger and call
MemoryContextStats(TopMemoryContext); however, there are some
difficulties with doing that:

  - some production environments don't allow us to run a debugger easily
  - the output consists of many lines and is hard to analyze

Using an extension (pg_stat_get_memory_context() in pg_cheat_funcs[1]),
we can get a view of the memory contexts, but only those of the backend
that executed pg_stat_get_memory_context().


[user interface]
If we had a function exposing the memory contexts of a specified PID,
we could easily examine them.
I imagine a user interface something like this:

  =# SELECT * FROM pg_stat_get_backend_memory_context(PID);

           name           |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes | some other attributes...
 -------------------------+------------------+-------+-------------+---------------+------------+-------------+------------+--------------------------
  TopMemoryContext        |                  |     0 |       68720 |             5 |       9936 |          16 |      58784 |
  TopTransactionContext   | TopMemoryContext |     1 |        8192 |             1 |       7720 |           0 |        472 |
  PL/pgSQL function       | TopMemoryContext |     1 |       16384 |             2 |       5912 |           1 |      10472 |
  PL/pgSQL function       | TopMemoryContext |     1 |       32768 |             3 |      15824 |           3 |      16944 |
  dynahash                | TopMemoryContext |     1 |        8192 |             1 |        512 |           0 |       7680 |
  ...
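
A usage sketch might look like the following (the function and its columns
are the proposal above, nothing that exists yet; pg_backend_pid() restricts
it to the current backend):

  =# SELECT name, parent, level, total_bytes, used_bytes
       FROM pg_stat_get_backend_memory_context(pg_backend_pid())
      ORDER BY total_bytes DESC
      LIMIT 5;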


[rough implementation ideas and challenges]
I suppose communication between the process which runs
pg_stat_get_backend_memory_context() (referred to as A) and the
target backend (referred to as B) would go like this:

  1. A sends a message to B ordering it to dump its memory contexts
  2. B dumps its memory contexts to some shared area
  3. A reads the shared area and returns it to the function invoker

To do so, there seem to be some challenges.

(1) how to share memory context information between backend processes
The number of memory contexts varies greatly depending on the
situation, so it's not appropriate to share them using fixed-size
shared memory.
Also, using a file in 'stats_temp_directory' seems difficult considering
the background of the shared-memory based stats collector
proposal[2].
Instead, I'm thinking about using dsm_mq, which allows messages of
arbitrary length to be sent and received.

(2) how to send a message requesting the memory contexts
Communicating via a signal seems simple, but assigning a specific signal
number for this purpose seems wasteful.
I'm thinking about using fixed shared memory to hold a dsm_mq handle.
To send a message, A creates a dsm_mq and puts the handle in the shared
memory area. When B finds a handle, B dumps its memory contexts to the
corresponding dsm_mq.

However, for B to find the handle it needs to check the shared memory
periodically. I'm not sure about the suitable location and timing for this
check yet, and I doubt this way of communicating is acceptable because
it puts a certain additional load on all the backends.

(3) clarifying the necessary attributes
From reading the past discussion[3], it's not so clear what kind
of information should be exposed regarding memory contexts.


As a first step, to deal with (3) I'd like to add
pg_stat_get_backend_memory_context(), whose target is limited to the
local backend process.


Thanks for reading. What do you think?


[1] 
https://github.com/MasaoFujii/pg_cheat_funcs#setof-record-pg_stat_get_memory_context
[2] 
https://www.postgresql.org/message-id/flat/20180629.173418.190173462.horiguchi.kyot...@lab.ntt.co.jp
[3] 
https://www.postgresql.org/message-id/20190805171608.g22gxwmfr2r7uf6t%40alap3.anarazel.de



Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: Add A Glossary

2020-06-17 Thread Jürgen Purtz


On 17.06.20 02:09, Alvaro Herrera wrote:

On 2020-Jun-09, Jürgen Purtz wrote:


Can you agree to the following definitions? If no, we can alternatively
formulate for each of them: "Under discussion - currently not defined". My
proposals are inspired by chapter 2.2 Concepts: "Tables are grouped into
databases, and a collection of databases managed by a single PostgreSQL
server instance constitutes a database cluster."

After sleeping on it a few more times, I don't oppose the idea of making
"instance" be the running state and "database cluster" the on-disk stuff
that supports the instance.  Here's a patch that does things pretty much
along the lines you suggested.

I made small adjustments to "SQL objects":

* SQL objects in schemas were said to have their names unique in the
schema, but we failed to say anything about names of objects not in
schemas and global objects.  Added that.

* Had example object types for global objects and objects not in
schemas, but no examples for objects in schemas.  Added that.


Some programs whose output we could tweak per this:
pg_ctl

pg_ctl is a utility to initialize, start, stop, or control a PostgreSQL server.
  -D, --pgdata=DATADIR   location of the database storage area

to:

pg_ctl is a utility to initialize or control a PostgreSQL database cluster.
  -D, --pgdata=DATADIR   location of the database directory

pg_basebackup:

pg_basebackup takes a base backup of a running PostgreSQL server.

to:

pg_basebackup takes a base backup of a PostgreSQL instance.


+1, with two formal changes:

-  Rearrangement of term "Data page" to meet alphabetical order.

-  Add a missing tag in one case to meet XML well-formedness.


One last question: The definition of "Data directory" reads "... A
cluster's storage space comprises the data directory plus ..." and
'cluster' links to "glossary-instance". Shouldn't it link to
"glossary-db-cluster"?


--

Jürgen Purtz


diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index e29b55e5ac..0499f9044f 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -413,6 +413,22 @@

   
 
+  
+   Data page
+   
+
+ The basic structure used to store relation data.
+ All pages are of the same size.
+ Data pages are typically stored on disk, each in a specific file,
+ and can be read to shared buffers
+ where they can be modified, becoming
+ dirty.  They become clean when written
+ to disk.  New pages, which initially exist in memory only, are also
+ dirty until written.
+
+   
+  
+
   
Database

@@ -441,6 +457,7 @@
  cluster is also sometimes used to refer to an instance.
  (Don't confuse this term with the SQL command CLUSTER.)
 
+   
   
 
   
@@ -448,22 +465,6 @@

   
 
-  
-   Data page
-   
-
- The basic structure used to store relation data.
- All pages are of the same size.
- Data pages are typically stored on disk, each in a specific file,
- and can be read to shared buffers
- where they can be modified, becoming
- dirty.  They become clean when written
- to disk.  New pages, which initially exist in memory only, are also
- dirty until written.
-
-   
-  
-
   
Datum



Re: min_safe_lsn column in pg_replication_slots view

2020-06-17 Thread Fujii Masao




On 2020/06/15 16:35, Kyotaro Horiguchi wrote:

At Mon, 15 Jun 2020 13:44:31 +0900, Michael Paquier  wrote 
in

On Mon, Jun 15, 2020 at 12:40:03PM +0900, Fujii Masao wrote:

BTW, I just wonder why each row in pg_replication_slots needs to have
min_safe_lsn column? Basically min_safe_lsn should be the same between
every replication slots.


Indeed, that's confusing in its current shape.  I would buy putting
this value into pg_replication_slots if there were some consistency of
the data to worry about because of locking issues, but here this data
is controlled within info_lck, which is out of the the repslot LW
lock.  So I think that it is incorrect to put this data in this view
and that we should remove it, and that instead we had better push for
a system view that maps with the contents of XLogCtl.


Agreed. But as you know it's too late to do that for v13...
So firstly I'd like to fix the issues in pg_replication_slots view,
and then we can improve the things later for v14 if necessary.



It was once the difference from the safe_lsn to restart_lsn which is
distinct among slots. Then it was changed to the safe_lsn. I agree to
the discussion above, but it is needed anywhere since no one can know
the margin until the slot goes to the "lost" state without it.  (Note
that currently even wal_status and min_safe_lsn can be inconsistent in
a line.)

Just for the need for table-consistency and in-line consistency, we
could just remember the result of XLogGetLastRemovedSegno() around
taking info_lock in the function.  That doesn't make a practical
difference but makes the view look consistent.


Agreed. Thanks for the patch. Here are the review comments:


Not only the last removed segment but also the current write position
should be obtained at the beginning of pg_get_replication_slots()
and given to GetWALAvailability(), for consistency?


Even after applying your patch, min_safe_lsn is calculated for
each slot even though the calculated result is always the same,
which is a bit of a waste of cycles. Should we calculate min_safe_lsn
once at the beginning and use it for each slot?


XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0,

Isn't it better to use 1 as the second argument of the above,
in order to address the issue that I reported upthread?
Otherwise, the WAL file name that pg_walfile_name(min_safe_lsn) returns
would be confusing.
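
(For anyone puzzled by that: pg_walfile_name() maps an LSN lying exactly on a
segment boundary to the name of the preceding segment, which is documented
behavior, so computing min_safe_lsn with offset 0 yields a file name one
segment lower than expected. A sketch with the default 16MB segment size:)

SELECT pg_walfile_name('0/5000000');   -- exact boundary: previous segment's file name
SELECT pg_walfile_name('0/5000001');   -- one byte in: the segment that contains the LSN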

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pgstattuple: Have pgstattuple_approx accept TOAST tables

2020-06-17 Thread Laurenz Albe
On Fri, 2020-04-17 at 13:01 +0200, Peter Eisentraut wrote:
> I alluded to this in [0], but it's better discussed in its own thread.
> 
> I think the check that makes pgstattuple_approx reject TOAST tables is a 
> mistake.  They have visibility and free space map, and it works just 
> fine if the check is removed.
> 
> Attached is a patch to fix this and add some tests related to how 
> pgstattuple and pg_visibility accept TOAST tables for inspection.
> 
> 
> [0]: 
> https://www.postgresql.org/message-id/dc35a398-37d0-75ce-07ea-1dd71d98f...@2ndquadrant.com

I gave the patch a spin, and it passes regression tests and didn't
cause any problems when I played with it.

No upgrade or dump considerations, of course.

This is a clear improvement.

I'll mark the patch as "ready for committer".

Yours,
Laurenz Albe





Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-17 Thread Amit Kapila
On Wed, Jun 17, 2020 at 1:34 PM Masahiko Sawada
 wrote:
>
> On Sat, 13 Jun 2020 at 14:23, Amit Kapila  wrote:
> >
> > On Fri, Jun 12, 2020 at 6:11 PM Magnus Hagander  wrote:
> > >
> > > On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila  
> > > wrote:
> > >>
> > >
> > >
> > > The problem with "lifetime of a process" is that it's not predictable. A 
> > > replication process might "bounce" for any reason, and it is normally not 
> > > a problem. But if you suddenly lose your stats when you do that, it 
> > > starts to matter a lot more. Especially when you don't know if it 
> > > bounced. (Sure you can look at the backend_start time, but that adds a 
> > > whole different sets of complexitites).
> > >
> >
> > It is not clear to me what is a good way to display the stats for a
> > process that has exited or bounced due to whatever reason.  OTOH, if
> > we just display per-slot stats, it is difficult to imagine how the
> > user can make any sense out of it or in other words how such stats can
> > be useful to users.
>
> If we have the reset function, the user can reset before doing logical
> decoding so that the user can use the stats directly. Or I think we
> can automatically reset the stats when logical decoding is performed
> with different logical_decoding_work_mem value than the previous one.
>

I had written above in the context of persisting these stats.  I mean
to say if the process has bounced or server has restarted then the
previous stats might not make much sense because we were planning to
use pid [1], so the stats from process that has exited might not make
much sense or do you think that is okay?  If we don't want to persist
and the lifetime of these stats is till the process is alive then we
are fine.


[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k5nqeFdhpnCULpTh9TR%2B15rHZSbz0SDC6sZhr_v99SeKA%40mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/17 17:30, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao  
wrote in



On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
 wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:



So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about
this
case into the docs? Otherwise the users would be surprised at
termination
of backends by max_slot_wal_keep_size. I guess that it's basically
rarely
happen, though.


Well, if we could distinguish a walsender from a non-walsender
process,
then maybe it would make sense to leave backends alive.  But do we
want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?


I have no better opinion about this. So I agree to leave the logic as
it is
at least for now, i.e., we terminate the process owning the slot
whatever
the type of process is.


Agreed.


+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+   if (!killed && kill(active_pid, SIGTERM) == 0)
+   killed = true;
+   ConditionVariableTimedSleep(&slot->active_cv, 100,
+   
WAIT_EVENT_REPLICATION_SLOT_DROP);
+   } while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many
times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean; in this code path, signaling fails only when the target
process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?


I guess kill() can also fail if the PID now belongs to a process owned
by a different user.


Yes. This case means that the PostgreSQL process using the slot
disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill()
fails
for this reason, we don't need to kill() again.


  I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.

The first call to ConditionVariableTimedSleep doesn't actually
sleep, so the loop works as expected.  But we may make an extra call
to kill(2).  Calling ConditionVariablePrepareToSleep before the
loop would make it better.


Sorry I failed to understand your point...


My point is that ConditionVariableTimedSleep does *not* sleep on the CV
the first time in this usage. The new version avoids the useless
kill(2) call anyway, but may still make an extra call to
ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for-statement
block.


OK, so what about the attached patch? I added ConditionVariablePrepareToSleep()
just before entering the "for" loop in InvalidateObsoleteReplicationSlots().
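
For illustration, here is a hypothetical stand-alone helper sketching
that ordering (not the attached patch; it assumes the usual backend
headers such as replication/slot.h, storage/condition_variable.h,
storage/spin.h and pgstat.h, plus <signal.h>, and simplifies the field
accesses):

	/*
	 * Hypothetical helper: prepare to sleep on the slot's condition
	 * variable once, before the retry loop, so that the first
	 * ConditionVariableTimedSleep() call really sleeps instead of
	 * returning immediately.
	 */
	static void
	WaitForSlotRelease(ReplicationSlot *slot)
	{
		bool		signaled = false;

		ConditionVariablePrepareToSleep(&slot->active_cv);

		for (;;)
		{
			pid_t		active_pid;

			/* Read the current owner under the slot's spinlock. */
			SpinLockAcquire(&slot->mutex);
			active_pid = slot->active_pid;
			SpinLockRelease(&slot->mutex);

			if (active_pid == 0)
				break;			/* the slot has been released */

			/* Signal once; afterwards just wait in 100ms steps. */
			if (!signaled && kill(active_pid, SIGTERM) == 0)
				signaled = true;

			ConditionVariableTimedSleep(&slot->active_cv, 100,
										WAIT_EVENT_REPLICATION_SLOT_DROP);
		}

		ConditionVariableCancelSleep();
	}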

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..b94e11a8e7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,107 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+ 

Re: Operator class parameters and sgml docs

2020-06-17 Thread Alexander Korotkov
On Wed, Jun 17, 2020 at 2:50 AM Peter Geoghegan  wrote:
> On Tue, Jun 16, 2020 at 4:24 AM Alexander Korotkov
>  wrote:
> > Thank you for patience.  The documentation patch is attached.  I think
> > it requires review by native english speaker.
>
> * "...paramaters that controls" should be "...paramaters that control".
>
> * "with set of operator class specific option" should be "with a set
> of operator class specific options".
>
> * "The options could be accessible from each support function" should
> be "The options can be accessed from other support functions"

Fixed, thanks!

> It's very hard to write documentation like this, even for native
> English speakers. I think that it's important to have something in
> place, though. The GiST example helps a lot.

I've added a complete example for defining a set of parameters and
accessing them from another support function to the GiST
documentation.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


opclass_options_doc_2.patch
Description: Binary data


Re: language cleanups in code and docs

2020-06-17 Thread Magnus Hagander
On Mon, Jun 15, 2020 at 8:23 PM Andres Freund  wrote:

> Hi,
>
> We've removed the use of "slave" from most of the repo (one use
> remained, included here), but we didn't do the same for master. In the
> attached series I replaced most of the uses.
>
> 0001: tap tests: s/master/primary/
>   Pretty clear cut imo.
>
> 0002: code: s/master/primary/
>   This also includes a few minor other changes (s/in master/on the
>   primary/, a few 'the's added). Perhaps it'd be better to do those
>   separately?
>
> 0003: code: s/master/leader/
>   This feels pretty obvious. We've largely used the leader / worker
>   terminology, but there were a few uses of master left.
>
> 0004: code: s/master/$other/
>   This is most of the remaining uses of master in code. A number of
>   references to 'master' in the context of toast, a few uses of 'master
>   copy'. I guess some of these are a bit less clear cut.
>
> 0005: docs: s/master/primary/
>   These seem mostly pretty straightforward to me. The changes in
>   high-availability.sgml probably deserve the most attention.
>
> 0006: docs: s/master/root/
>   Here using root seems a lot better than master anyway (master seems
>   confusing in regard to inheritance scenarios). But perhaps parent
>   would be better? Went with root since it's about the topmost table.
>
> 0007: docs: s/master/supervisor/
>   I guess this could be a bit more contentious. Supervisor seems clearer
>   to me, but I can see why people would disagree. See also later point
>   about changes I have not done at this stage.
>
> 0008: docs: WIP multi-master rephrasing.
>   I like neither the new nor the old language much. I'd welcome input.
>
>
> After this series there are only two widespread use of 'master' in the
> tree.
> 1) 'postmaster'. As changing that would be somewhat invasive, the word
>is a bit more ambiguous, and it's largely just internal, I've left
>this alone for now. I personally would rather see this renamed as
>supervisor, which'd imo actually would also be a lot more
>descriptive. I'm willing to do the work, but only if there's at least
>some agreement.
> 2) 'master' as a reference to the branch. Personally I be in favor of
>changing the branch name, but it seems like it'd be better done as a
>somewhat separate discussion to me, as it affects development
>practices to some degree.
>
>
In looking at this I realize we also have exactly one thing referred to as
"blacklist" in our codebase, which is the "enum blacklist" (and then a
small internal variable in pgindent). AFAICT, it's not actually exposed to
userspace anywhere, so we could probably make the attached change to
blocklist at no "cost" (the only thing changed is the name of the hash
table, and we definitely change things like that in normal releases with no
specific thought on backwards compat).

//Magnus
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 14a8690019..6f93b41e4f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -75,7 +75,7 @@
 #define PARALLEL_KEY_PENDING_SYNCS			UINT64CONST(0x000B)
 #define PARALLEL_KEY_REINDEX_STATE			UINT64CONST(0x000C)
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0x000D)
-#define PARALLEL_KEY_ENUMBLACKLIST			UINT64CONST(0x000E)
+#define PARALLEL_KEY_ENUMBLOCKLIST			UINT64CONST(0x000E)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -211,7 +211,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 	Size		pendingsyncslen = 0;
 	Size		reindexlen = 0;
 	Size		relmapperlen = 0;
-	Size		enumblacklistlen = 0;
+	Size		enumblocklistlen = 0;
 	Size		segsize = 0;
 	int			i;
 	FixedParallelState *fps;
@@ -267,8 +267,8 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_estimate_chunk(&pcxt->estimator, reindexlen);
 		relmapperlen = EstimateRelationMapSpace();
 		shm_toc_estimate_chunk(&pcxt->estimator, relmapperlen);
-		enumblacklistlen = EstimateEnumBlacklistSpace();
-		shm_toc_estimate_chunk(&pcxt->estimator, enumblacklistlen);
+		enumblocklistlen = EstimateEnumBlocklistSpace();
+		shm_toc_estimate_chunk(&pcxt->estimator, enumblocklistlen);
 		/* If you add more chunks here, you probably need to add keys. */
 		shm_toc_estimate_keys(&pcxt->estimator, 11);
 
@@ -348,7 +348,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *error_queue_space;
 		char	   *session_dsm_handle_space;
 		char	   *entrypointstate;
-		char	   *enumblacklistspace;
+		char	   *enumblocklistspace;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -404,11 +404,11 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_RELMAPPER_STATE,
 	   relmapperspace);
 
-		/* Serialize enum blacklist state. */
-		enumblacklistspace = shm_toc_allocate(pcxt->toc, enumblacklistlen);
-		SerializeEnumBlacklist(enumblacklistspace, enumblacklistlen);
-		shm_toc_insert

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-17 Thread Amit Kapila
On Tue, Jun 16, 2020 at 2:36 PM Amit Kapila  wrote:
>
> On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila  wrote:
> >
> > I have few more comments on the patch
> > 0013-Change-buffile-interface-required-for-streaming-.patch:
> >
>
> Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
>

changes_filename(char *path, Oid subid, TransactionId xid)
 {
- char tempdirpath[MAXPGPATH];
-
- TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
- /*
- * We might need to create the tablespace's tempfile directory, if no
- * one has yet done so.
- */
- if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m",
- tempdirpath)));
-
- snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
- tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+ snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);

Today, I was studying this change and its impact.  Initially, I
thought that because the patch has removed the pgsql_tmp prefix from the
filename, it might create problems if the temporary files remain on
the disk after a crash.  Now that the patch has started using the BufFile
interface, that seems to be taken care of internally, because the
generated names look like
"base/pgsql_tmp/pgsql_tmp13774.0.sharedfileset/16393-513.changes.0".
Basically, it ensures that the file is created in a directory whose name
starts with pgsql_tmp.  I tried crashing the server in a situation
where the temp files remain, and after the restart they were removed.
So it seems okay to generate file names like that, but I still suggest
testing other paths, like backup, where we ignore files whose names
start with PG_TEMP_FILE_PREFIX.
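
For reference, the prefix check that such paths rely on boils down to
something like the following stand-alone sketch (the helper name is
invented and the macro is redefined locally; this is not the backend's
own code):

	#include <stdbool.h>
	#include <string.h>

	#define PG_TEMP_FILE_PREFIX "pgsql_tmp"

	/* Return true if a file or directory name carries the temp prefix. */
	static bool
	is_temp_file_name(const char *filename)
	{
		return strncmp(filename, PG_TEMP_FILE_PREFIX,
					   strlen(PG_TEMP_FILE_PREFIX)) == 0;
	}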

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: language cleanups in code and docs

2020-06-17 Thread Laurenz Albe
On Tue, 2020-06-16 at 19:55 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 6/15/20 2:22 PM, Andres Freund wrote:
> > > 2) 'master' as a reference to the branch. Personally I be in favor of
> > > changing the branch name, but it seems like it'd be better done as a
> > > somewhat separate discussion to me, as it affects development
> > > practices to some degree.
> > I'm OK with this, but I would need plenty of notice to get the buildfarm
> > adjusted.
> 
> "master" is the default branch name established by git, is it not?  Not
> something we picked.  I don't feel like we need to be out front of the
> rest of the world in changing that.  It'd be a cheaper change than some of
> the other proposals in this thread, no doubt, but it would still create
> confusion for new hackers who are used to the standard git convention.

I have the feeling that all this is going somewhat too far.

I feel fine with removing the term "slave", even though I have no qualms
about enslaving machines.

But the term "master" is not restricted to slavery.  It can just as well
imply expert knowledge (think academic degree), and it can denote someone
with the authority to command (there is nothing wrong with "servant", right?
Or do we have to abolish the term "server" too?).

I appreciate that other people's sensitivities might be different, and I
don't want to start a fight over it.  But renaming things makes the code
history harder to read, so it should be used with moderation.

Yours,
Laurenz Albe





Re: Asynchronous Append on postgres_fdw nodes.

2020-06-17 Thread Andrey V. Lepikhov

On 6/16/20 1:30 PM, Kyotaro Horiguchi wrote:

They return 25056 rows, which is far more than 9741 rows. So remote
join won.

Of course the number of returning rows is not the only factor of the
cost change but is the most significant factor in this case.


Thanks for your attention.
I see one slight flaw in this approach to asynchronous append:
AsyncAppend works only for ForeignScan subplans. If we have a
PartialAggregate, Join, or another more complicated subplan, we can't use
the asynchronous machinery.
That may lead to a situation where a small difference in a filter constant
causes a big difference in execution time.
I imagine an Append node that can switch its current subplan from time to
time, with all ForeignScan nodes of the overall plan added to one
queue. The scan buffer could be larger than a cursor fetch size, and each
IterateForeignScan() call could induce an asynchronous scan of another
ForeignScan node if the buffer is not full.
But these are only thoughts, not a proposal. I have no questions about
your patch right now.
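
(For illustration only: a self-contained sketch of that rotation idea;
every type and name below is invented, and it is not the patch under
review.  For simplicity a source that returns no row is treated as
exhausted.)

	#include <stdbool.h>

	typedef struct AsyncSource
	{
		bool		(*next_row) (struct AsyncSource *src, int *row_out);
		bool		exhausted;
	} AsyncSource;

	/* Return true and fill *row_out while any source still produces rows. */
	static bool
	round_robin_next(AsyncSource **sources, int nsources,
					 int *current, int *row_out)
	{
		int			tried;

		for (tried = 0; tried < nsources; tried++)
		{
			AsyncSource *src = sources[*current];

			*current = (*current + 1) % nsources;	/* rotate for next call */

			if (src->exhausted)
				continue;

			if (src->next_row(src, row_out))
				return true;

			src->exhausted = true;	/* simplification: no row means done */
		}

		return false;			/* every source is exhausted */
	}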


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com




Re: language cleanups in code and docs

2020-06-17 Thread Magnus Hagander
On Wed, Jun 17, 2020 at 3:44 AM Tom Lane  wrote:

> Alvaro Herrera  writes:
> > On 2020-Jun-16, Tom Lane wrote:
> >> "master" is the default branch name established by git, is it not?  Not
> >> something we picked.
>
> > Git itself is discussing this:
> >
> https://public-inbox.org/git/41438a0f-50e4-4e58-a3a7-3daaecb55...@jramsay.com.au/T/#t
> > and it seems that "main" is the winning choice.
>
> Oh, interesting.  If they do change I'd be happy to follow suit.
> But let's wait and see what they do, rather than possibly ending
> up with our own private convention.
>

I'm +1 for changing it (with good warning time to handle the buildfarm
situation), but also very much +1 for waiting to see exactly what upstream
(git) decides on and make sure we change to the same.  The worst possible
combination would be that we change it to something that's *different* than
upstream ends up with (even if upstream ends up being configurable).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Review for GetWALAvailability()

2020-06-17 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/06/17 12:10, Kyotaro Horiguchi wrote:
> > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
> >  wrote in
> >> On 2020-Jun-17, Fujii Masao wrote:
> >>> On 2020/06/17 3:50, Alvaro Herrera wrote:
> >>
> >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> >>> But do we want to do this? If we want, we should add the note about
> >>> this
> >>> case into the docs? Otherwise the users would be surprised at
> >>> termination
> >>> of backends by max_slot_wal_keep_size. I guess that it's basically
> >>> rarely
> >>> happen, though.
> >>
> >> Well, if we could distinguish a walsender from a non-walsender
> >> process,
> >> then maybe it would make sense to leave backends alive.  But do we
> >> want
> >> that?  I admit I don't know what would be the reason to have a
> >> non-walsender process with an active slot, so I don't have a good
> >> opinion on what to do in this case.
> > The non-walsender backend is actually doing replication work. It
> > rather should be killed?
> 
> I have no better opinion about this. So I agree to leave the logic as
> it is
> at least for now, i.e., we terminate the process owning the slot
> whatever
> the type of process is.

Agreed.

> > +   /*
> > +* Signal to terminate the process using the 
> > replication slot.
> > +*
> > +* Try to signal every 100ms until it succeeds.
> > +*/
> > +   if (!killed && kill(active_pid, SIGTERM) == 0)
> > +   killed = true;
> > +   ConditionVariableTimedSleep(&slot->active_cv, 100,
> > +   
> > WAIT_EVENT_REPLICATION_SLOT_DROP);
> > +   } while (ReplicationSlotIsActive(slot, NULL));
> 
>  Note that here you're signalling only once and then sleeping many
>  times
>  in increments of 100ms -- you're not signalling every 100ms as the
>  comment claims -- unless the signal fails, but you don't really expect
>  that.  On the contrary, I'd claim that the logic is reversed: if the
>  signal fails, *then* you should stop signalling.
> >>>
> >>> You mean; in this code path, signaling fails only when the target
> >>> process
> >>> disappears just before signaling. So if it fails, slot->active_pid is
> >>> expected to become 0 even without signaling more. Right?
> >>
> >> I guess kill() can also fail if the PID now belongs to a process owned
> >> by a different user.
> 
> Yes. This case means that the PostgreSQL process using the slot
> disappeared
> and the same PID was assigned to non-PostgreSQL process. So if kill()
> fails
> for this reason, we don't need to kill() again.
> 
> >  I think we've disregarded very quick reuse of
> >> PIDs, so we needn't concern ourselves with it.
> > The first call to ConditionVariableTimedSleep doesn't actually
> > sleep, so the loop works as expected.  But we may make an extra call
> > to kill(2).  Calling ConditionVariablePrepareToSleep before the
> > loop would make it better.
> 
> Sorry I failed to understand your point...

My point is that ConditionVariableTimedSleep does *not* sleep on the CV
the first time in this usage. The new version avoids the useless
kill(2) call anyway, but may still make an extra call to
ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for-statement
block.

> Anyway, the attached is the updated version of the patch. This fixes
> all the issues in InvalidateObsoleteReplicationSlots() that I reported
> upthread.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: calling procedures is slow and consumes extra much memory against calling function

2020-06-17 Thread Pavel Stehule
st 17. 6. 2020 v 7:52 odesílatel Amit Khandekar 
napsal:

> On Wed, 10 Jun 2020 at 17:12, Pavel Stehule 
> wrote:
> > st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar 
> napsal:
> >> Could you show an example testcase that tests this recursive scenario,
> >> with which your earlier patch fails the test, and this v2 patch passes
> >> it ? I am trying to understand the recursive scenario and the re-use
> >> of expr->plan.
> >
> >
> > it hangs on plpgsql tests. So you can apply first version of patch
> >
> > and "make check"
>
> I could not reproduce the make check hang with the v1 patch. But I
> could see a crash with the below testcase. So I understand the purpose
> of the plan_owner variable that you introduced in v2.
>
> Consider this recursive test :
>
> create or replace procedure p1(in r int) as $$
> begin
>RAISE INFO 'r : % ', r;
>if r < 3 then
>   call p1(r+1);
>end if;
> end
> $$ language plpgsql;
>
> do $$
> declare r int default 1;
> begin
> call p1(r);
> end;
> $$;
>
> In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
> consider this code of exec_stmt_call() with your v2 patch applied:
> if (expr->plan && !expr->plan->saved)
> {
>if (plan_owner)
>   SPI_freeplan(expr->plan);
>expr->plan = NULL;
> }
>
> Here, plan_owner is false. So SPI_freeplan() will not be called, and
> expr->plan is set to NULL. Now I have observed that the stmt pointer
> and expr pointer is shared between the p1() execution at this r=2
> level and the p1() execution at r=1 level. So after the above code is
> executed at r=2, when the upper level (r=1) exec_stmt_call() lands to
> the same above code snippet, it gets the same expr pointer, but it's
> expr->plan is already set to NULL without being freed. From this
> logic, it looks like the plan won't get freed whenever the expr/stmt
> pointers are shared across recursive levels, since expr->plan is set
> to NULL at the lowermost level ? Basically, the handle to the plan is
> lost so no one at the upper recursion level can explicitly free it
> using SPI_freeplan(), right ? This looks the same as the main issue
> where the plan does not get freed for non-recursive calls. I haven't
> got a chance to check if we can develop a testcase for this, similar
> to your testcase where the memory keeps on increasing.
>

This is a good consideration.

I am sending an updated patch.

Pavel



> -Amit
>
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index f41d675d65..88b8c7882a 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -2132,15 +2132,16 @@ static int
 exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt)
 {
 	PLpgSQL_expr *expr = stmt->expr;
+	volatile SPIPlanPtr	plan = expr->plan;
 	volatile LocalTransactionId before_lxid;
 	LocalTransactionId after_lxid;
 	volatile bool pushed_active_snap = false;
 	volatile int rc;
+	volatile bool plan_owner = false;
 
 	/* PG_TRY to ensure we clear the plan link, if needed, on failure */
 	PG_TRY();
 	{
-		SPIPlanPtr	plan = expr->plan;
 		ParamListInfo paramLI;
 
 		if (plan == NULL)
@@ -2155,6 +2156,16 @@ exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt)
 			 */
 			exec_prepare_plan(estate, expr, 0, estate->atomic);
 
+			/*
+			 * A plan that is not saved should be released explicitly;
+			 * otherwise any non-recursive use of a CALL statement leaks the
+			 * plan in SPI memory.  The created plan can be reused when the
+			 * procedure is called recursively, so the plan may only be
+			 * released in the recursion root call, i.e. in the call where
+			 * the expression had no plan assigned yet and the plan was made.
+			 */
+			plan_owner = true;
+
 			/*
 			 * The procedure call could end transactions, which would upset
 			 * the snapshot management in SPI_execute*, so don't let it do it.
@@ -2301,14 +2312,23 @@ exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt)
 		 * If we aren't saving the plan, unset the pointer.  Note that it
 		 * could have been unset already, in case of a recursive call.
 		 */
-		if (expr->plan && !expr->plan->saved)
+		if (plan && !plan->saved)
+		{
+			if (plan_owner)
+SPI_freeplan(plan);
+
 			expr->plan = NULL;
+		}
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
-	if (expr->plan && !expr->plan->saved)
+	if (plan && !plan->saved)
+	{
+		if (plan_owner)
+			SPI_freeplan(plan);
 		expr->plan = NULL;
+	}
 
 	if (rc < 0)
 		elog(ERROR, "SPI_execute_plan_with_paramlist failed executing query \"%s\": %s",


Re: remove some STATUS_* symbols

2020-06-17 Thread Peter Eisentraut

On 2020-06-12 09:30, Michael Paquier wrote:

On Thu, Jun 11, 2020 at 03:55:59PM +0200, Peter Eisentraut wrote:

On 2020-01-16 13:56, Robert Haas wrote:

IMHO, custom enums for each particular case would be a big improvement
over supposedly-generic STATUS codes. It makes it clearer which values
are possible in each code path, and it comes out nicer in the
debugger, too.


Given this feedback, I would like to re-propose the original patch, attached
again here.

After this, the use of the remaining STATUS_* symbols will be contained to
the frontend and backend libpq code, so it'll be more coherent.


I am still in a so-so state regarding this patch, but I find the
debugger argument a good one.  And please don't consider me as a
blocker.


Okay, I have committed it.


Add a separate enum for use in the locking APIs, which were the only
user.



+typedef enum
+{
+   PROC_WAIT_STATUS_OK,
+   PROC_WAIT_STATUS_WAITING,
+   PROC_WAIT_STATUS_ERROR,
+} ProcWaitStatus;


ProcWaitStatus, and more particularly PROC_WAIT_STATUS_WAITING are
strange names (the latter refers to "wait" twice).  What do you think
about renaming the enum to ProcStatus and the flags to PROC_STATUS_*?


I see your point, but I don't think that's better.  That would just 
invite someone else to use it for other process-related status things. 
We typically name enum constants like the type followed by a suffix. 
The fact that the suffix is similar to the prefix here is more of a 
coincidence.
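
As a purely hypothetical illustration of that convention (type name as
prefix, descriptive suffix per constant; the enum below is invented and
does not exist in the tree):

	typedef enum WorkerShutdownReason
	{
		WORKER_SHUTDOWN_REASON_NORMAL,
		WORKER_SHUTDOWN_REASON_ERROR,
		WORKER_SHUTDOWN_REASON_CRASH
	} WorkerShutdownReason;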


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-17 Thread Masahiko Sawada
On Sat, 13 Jun 2020 at 14:23, Amit Kapila  wrote:
>
> On Fri, Jun 12, 2020 at 6:11 PM Magnus Hagander  wrote:
> >
> > On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila  
> > wrote:
> >>
> >
> >
> > The problem with "lifetime of a process" is that it's not predictable. A 
> > replication process might "bounce" for any reason, and it is normally not a 
> > problem. But if you suddenly lose your stats when you do that, it starts to 
> > matter a lot more. Especially when you don't know if it bounced. (Sure you 
> > can look at the backend_start time, but that adds a whole different sets of 
> > complexitites).
> >
>
> It is not clear to me what is a good way to display the stats for a
> process that has exited or bounced due to whatever reason.  OTOH, if
> we just display per-slot stats, it is difficult to imagine how the
> user can make any sense out of it or in other words how such stats can
> be useful to users.

If we have the reset function, the user can reset before doing logical
decoding so that the user can use the stats directly. Or I think we
can automatically reset the stats when logical decoding is performed
with different logical_decoding_work_mem value than the previous one.
In either way, since the stats correspond to the logical decoding
using the same slot with the same parameter value the user can use
them directly.
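
A minimal, self-contained sketch of that auto-reset idea; every structure
and function name below is invented for illustration and is not part of
any patch:

	#include <stdint.h>

	typedef struct SpillStats
	{
		int			work_mem_kb;	/* logical_decoding_work_mem behind these counters */
		int64_t		spill_txns;
		int64_t		spill_count;
		int64_t		spill_bytes;
	} SpillStats;

	/* Clear the counters if decoding starts with a different work_mem. */
	static void
	spill_stats_maybe_reset(SpillStats *stats, int current_work_mem_kb)
	{
		if (stats->work_mem_kb != current_work_mem_kb)
		{
			stats->work_mem_kb = current_work_mem_kb;
			stats->spill_txns = 0;
			stats->spill_count = 0;
			stats->spill_bytes = 0;
		}
	}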

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_regress cleans up tablespace twice.

2020-06-17 Thread Kyotaro Horiguchi
Thanks for working on this.

At Wed, 17 Jun 2020 16:12:07 +0900, Michael Paquier  wrote 
in 
> On Fri, May 15, 2020 at 05:25:08PM +0900, Kyotaro Horiguchi wrote:
> > I thought of that but didn't in the patch.  I refrained from doing
> > that because the output directory is dedicatedly created at the only
> > place (pg_upgrade test) where the --outputdir is specified. (I think I
> > tend to do too-much.)
> 
> So, I have reviewed the patch aimed at removing the cleanup of
> testtablespace done with WIN32, and finished with the attached to
> clean up things.  I simplified the logic, to not have to parse
> REGRESS_OPTS for --outputdir (no need for a regex, leaving
> EXTRA_REGRESS_OPTS alone), and reworked the code so as the tablespace
> cleanup only happens only where we need to: check, installcheck and
> upgradecheck.  No need for that with contribcheck, modulescheck,
> plcheck and ecpgcheck.

It looks good to me as far as the Windows part is concerned. I agree that
vcregress.pl doesn't need to parse EXTRA_REGRESS_OPTS, by allowing a
somewhat tighter bond between the callers of pg_regress and pg_regress
itself.

> Note that after I changed my patch, this converged with a portion of
> patch 0002 you have posted here:
> https://www.postgresql.org/message-id/20200511.171354.514381788845037011.horikyota@gmail.com
> 
> Now about 0002, I tend to agree that we should try to do something
> about pg_upgrade test creating removing and then creating an extra
> testtablespace path that is not necessary as pg_upgrade test uses its
> own --outputdir.  I have not actually seen this stuff being a problem
> in practice as the main regression test suite runs first, largely
> before pg_upgrade test even with parallel runs so they have a low
> probability of conflict.  I'll try to think about a couple of options,

Agreed on probability. 

> one of them I have in mind now being that we could finally switch the
> upgrade tests to TAP and let test.sh go to the void.  This is an
> independent problem, so let's tackle both issues separately.

Changing to TAP sounds nice as a goal.

As the next step we need to amend GNUmakefile not to clean up the
tablespace for the four test items. Then somehow handle tablespaces at
a non-default place?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera  
wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:



So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.


Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive.  But do we want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.


The non-walsender backend is actually doing replication work. It
rather should be killed?


I have no better opinion about this. So I agree to leave the logic as it is
at least for now, i.e., we terminate the process owning the slot whatever
the type of process is.




+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+   if (!killed && kill(active_pid, SIGTERM) == 0)
+   killed = true;
+   ConditionVariableTimedSleep(&slot->active_cv, 100,
+   
WAIT_EVENT_REPLICATION_SLOT_DROP);
+   } while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?


I guess kill() can also fail if the PID now belongs to a process owned
by a different user.


Yes. This case means that the PostgreSQL process using the slot disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill() fails
for this reason, we don't need to kill() again.


 I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.


The first call to ConditionVariableTimedSleep doesn't actually
sleep, so the loop works as expected.  But we may make an extra call
to kill(2).  Calling ConditionVariablePrepareToSleep before the
loop would make it better.


Sorry I failed to understand your point...

Anyway, the attached is the updated version of the patch. This fixes
all the issues in InvalidateObsoleteReplicationSlots() that I reported
upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..a065d41d76 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,104 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning

Re: Remove dead forceSync parameter of XactLogCommitRecord()

2020-06-17 Thread Michael Paquier
On Tue, Jun 16, 2020 at 08:26:15PM -0700, Noah Misch wrote:
> I think someone planned to have XactLogCommitRecord() use its forceSync
> parameter instead of reading the forceSyncCommit global variable, but that
> didn't happen.  I'd like to remove the parameter, as attached.  This has no
> functional consequences, as detailed in the commit message.

+1.  Looks like an oversight of 4f1b890b to me.
--
Michael


signature.asc
Description: PGP signature


Re: TAP tests and symlinks on Windows

2020-06-17 Thread Michael Paquier
On Tue, Jun 16, 2020 at 07:53:26AM -0400, Andrew Dunstan wrote:
> Not one of them has it.

Argh.

> I think we'll need a dynamic test for its presence rather than just
> assuming it's there. (Use require in an eval for this).

Sure.  No problem with implementing an automatic detection.

> However, since all of them would currently fail we wouldn't actually
> have any test coverage. I could see about installing it on one or two
> animals (jacana would be a problem, it's using a very old and limited
> perl to run TAP tests.)

Okay.  This could be a problem, as jacana provides good
coverage AFAIK.  So it looks like we are really heading in the
direction of still skipping the test if there is no support for
symlinks in the environment.  At least that makes for fewer diffs in the
patch.
--
Michael


signature.asc
Description: PGP signature


Re: TAP tests and symlinks on Windows

2020-06-17 Thread Michael Paquier
On Tue, Jun 16, 2020 at 08:32:03AM -0400, Andrew Dunstan wrote:
> On 6/16/20 8:24 AM, Peter Eisentraut wrote:
>> MSYS2, which is basically Cygwin, emulates symlinks with junction
>> points, so this happens to work for our purpose.  We could therefore
>> enable these tests in that environment, if we could come up with a
>> reliable way to detect it.

Hmm.  In this case, does perl's -l correctly recognize a junction point
as a soft link or not?  We have a check based on that in
pg_basebackup's test, and -l fails when it sees a junction point,
forcing us to skip this test.

> From src/bin/pg_dump/t/010_dump_connstr.pl:
> 
> if ($^O eq 'msys' && `uname -or` =~ /^[2-9].*Msys/)

Smart.  This could become a central variable in TestLib.pm.
--
Michael


signature.asc
Description: PGP signature


Re: pg_regress cleans up tablespace twice.

2020-06-17 Thread Michael Paquier
On Fri, May 15, 2020 at 05:25:08PM +0900, Kyotaro Horiguchi wrote:
> I thought of that but didn't in the patch.  I refrained from doing
> that because the output directory is dedicatedly created at the only
> place (pg_upgrade test) where the --outputdir is specified. (I think I
> tend to do too-much.)

So, I have reviewed the patch aimed at removing the cleanup of
testtablespace done with WIN32, and finished with the attached to
clean up things.  I simplified the logic, to not have to parse
REGRESS_OPTS for --outputdir (no need for a regex, leaving
EXTRA_REGRESS_OPTS alone), and reworked the code so as the tablespace
cleanup only happens only where we need to: check, installcheck and
upgradecheck.  No need for that with contribcheck, modulescheck,
plcheck and ecpgcheck.

Note that after I changed my patch, this converged with a portion of
patch 0002 you have posted here:
https://www.postgresql.org/message-id/20200511.171354.514381788845037011.horikyota@gmail.com

Now about 0002, I tend to agree that we should try to do something
about pg_upgrade test creating removing and then creating an extra
testtablespace path that is not necessary as pg_upgrade test uses its
own --outputdir.  I have not actually seen this stuff being a problem
in practice as the main regression test suite runs first, largely
before pg_upgrade test even with parallel runs so they have a low
probability of conflict.  I'll try to think about a couple of options,
one of them I have in mind now being that we could finally switch the
upgrade tests to TAP and let test.sh go to the void.  This is an
independent problem, so let's tackle both issues separately.
--
Michael
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index f11a3b9e26..c8d190d248 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -494,28 +494,6 @@ convert_sourcefiles_in(const char *source_subdir, const char *dest_dir, const ch
 
 	snprintf(testtablespace, MAXPGPATH, "%s/testtablespace", outputdir);
 
-#ifdef WIN32
-
-	/*
-	 * On Windows only, clean out the test tablespace dir, or create it if it
-	 * doesn't exist.  On other platforms we expect the Makefile to take care
-	 * of that.  (We don't migrate that functionality in here because it'd be
-	 * harder to cope with platform-specific issues such as SELinux.)
-	 *
-	 * XXX it would be better if pg_regress.c had nothing at all to do with
-	 * testtablespace, and this were handled by a .BAT file or similar on
-	 * Windows.  See pgsql-hackers discussion of 2008-01-18.
-	 */
-	if (directory_exists(testtablespace))
-		if (!rmtree(testtablespace, true))
-		{
-			fprintf(stderr, _("\n%s: could not remove test tablespace \"%s\"\n"),
-	progname, testtablespace);
-			exit(2);
-		}
-	make_directory(testtablespace);
-#endif
-
 	/* finally loop on each file and do the replacement */
 	for (name = names; *name; name++)
 	{
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 3365ee578c..f4ea1ed9cb 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -123,6 +123,8 @@ sub installcheck_internal
 sub installcheck
 {
 	my $schedule = shift || 'serial';
+
+	CleanupTablespaceDirectory();
 	installcheck_internal($schedule);
 	return;
 }
@@ -143,6 +145,7 @@ sub check
 		"--temp-instance=./tmp_check");
 	push(@args, $maxconn) if $maxconn;
 	push(@args, $temp_config) if $temp_config;
+	CleanupTablespaceDirectory();
 	system(@args);
 	my $status = $? >> 8;
 	exit $status if $status;
@@ -571,7 +574,7 @@ sub upgradecheck
 	my $outputdir  = "$tmp_root/regress";
 	my @EXTRA_REGRESS_OPTS = ("--outputdir=$outputdir");
 	mkdir "$outputdir"|| die $!;
-	mkdir "$outputdir/testtablespace" || die $!;
+	CleanupTablespaceDirectory($outputdir);
 
 	my $logdir = "$topdir/src/bin/pg_upgrade/log";
 	rmtree($logdir);
@@ -737,6 +740,16 @@ sub InstallTemp
 	return;
 }
 
+sub CleanupTablespaceDirectory
+{
+	my $testdir = shift || getcwd();
+
+	my $testtablespace = "$testdir/testtablespace";
+
+	rmtree($testtablespace) if (-d $testtablespace);
+	mkdir($testtablespace);
+}
+
 sub usage
 {
 	print STDERR


signature.asc
Description: PGP signature