Re: Review for GetWALAvailability()

2020-06-26 Thread Alvaro Herrera
On 2020-Jun-25, Kyotaro Horiguchi wrote:

> It is an error-prone restriction, as discussed before.
> 
> Without changing the updater side, an invalid restart_lsn AND a valid
> invalidated_at can be regarded as the lost state. With the following
> change suggested by Fujii-san, we can avoid the confusing status.
> 
> With the attached first patch on top of the slot-dirtying fix below, we
> get "lost" for invalidated slots after a restart.

Makes sense.  I pushed this change, thanks.


> The confusing status can be avoided without fixing it, but I prefer to
> fix it.  As Fujii-san suggested upthread, couldn't we remember
> lastRemovedSegNo in the control file? (Yeah, it causes a bump of
> PG_CONTROL_VERSION and CATALOG_VERSION_NO?)

I think that's a pg14 change.  Feel free to submit a patch to the
commitfest.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-25 Thread Kyotaro Horiguchi
At Thu, 25 Jun 2020 14:35:34 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/06/25 12:57, Alvaro Herrera wrote:
> > On 2020-Jun-25, Fujii Masao wrote:
> > 
> >>/*
> >> * Find the oldest extant segment file. We get 1 until checkpoint 
> >> removes
> >> * the first WAL segment file since startup, which causes the status 
> >> being
> >> * wrong under certain abnormal conditions but that doesn't actually 
> >> harm.
> >> */
> >>oldestSeg = XLogGetLastRemovedSegno() + 1;
> >>
> >> I see the point of the above comment, but this can cause wal_status to
> >> be
> >> changed from "lost" to "unreserved" after the server restart. Isn't
> >> this
> >> really confusing? At least it seems better to document that behavior.
> > Hmm.
> >
> >> Or if we *can ensure* that the slot with invalidated_at set always
> >> means
> >> "lost" slot, we can judge that wal_status is "lost" without using
> >> fragile
> >> XLogGetLastRemovedSegno(). Thought?
> > Hmm, this sounds compelling -- I think it just means we need to ensure
> > we reset invalidated_at to zero if the slot's restart_lsn is set to a
> > correct position afterwards.
> 
> Yes.

It is an error-prone restriction, as discussed before.

Without changing the updater side, an invalid restart_lsn AND a valid
invalidated_at can be regarded as the lost state. With the following
change suggested by Fujii-san, we can avoid the confusing status.

With the attached first patch on top of the slot-dirtying fix below, we
get "lost" for invalidated slots after a restart.

> > I don't think we have any operation that
> > does that, so it should be safe -- hopefully I didn't overlook
> > anything?
> 
> We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
> just after setting invalidated_at and restart_lsn in
> InvalidateObsoleteReplicationSlots()?
> Otherwise, restart_lsn can go back to the previous value after the
> restart.
> 
> diff --git a/src/backend/replication/slot.c
> b/src/backend/replication/slot.c
> index e8761f3a18..5584e5dd2c 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1229,6 +1229,13 @@ restart:
> s->data.invalidated_at = s->data.restart_lsn;
> s->data.restart_lsn = InvalidXLogRecPtr;
> SpinLockRelease(&s->mutex);
> +
> +   /*
> + * Save this invalidated slot to disk, to ensure that the slot
> +* is still invalid even after the server restart.
> +*/
> +   ReplicationSlotMarkDirty();
> +   ReplicationSlotSave();
> ReplicationSlotRelease();
>   /* if we did anything, start from scratch */
> 
> Maybe we don't need to do this if the slot is temporary?

The only difference between temporary slots and persistent ones seems to
be the "persistency" attribute. Actually,
create_physical_replication_slot() does the above for temporary slots.

> > Neither copy nor advance seem to work with a slot that has invalid
> > restart_lsn.
> > 
> >> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> >> value even after the restart?
> > This seems more work to implement.
> 
> Yes.

The confusing status can be avoided without fixing it, but I prefer to
fix it.  As Fujii-san suggested upthread, couldn't we remember
lastRemovedSegNo in the control file? (Yeah, it causes a bump of
PG_CONTROL_VERSION and CATALOG_VERSION_NO?)
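
A minimal sketch of what that could look like (the field name and the
update site are illustrative assumptions, not an actual proposal):

	/* in ControlFileData (pg_control.h), an additional field such as: */
	XLogSegNo	lastRemovedSegNo;	/* last WAL segment removed or recycled */

	/* and wherever a segment is removed (xlog.c), keep it up to date: */
	ControlFile->lastRemovedSegNo = segno;
	UpdateControlFile();

(Writing pg_control for every removed segment is probably too heavy, so
it would more likely be folded into the control-file write that the
checkpoint does anyway.)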

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 7c8803f2bd7267d166f8a6f1a4ca5b129aa5ae96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 25 Jun 2020 17:03:24 +0900
Subject: [PATCH 1/2] Make slot invalidation persistent

---
 src/backend/replication/slot.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..26f874e3cb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,7 +1229,15 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
+		
 
 		/* if we did anything, start from scratch */
 		CHECK_FOR_INTERRUPTS();
-- 
2.18.4

>From 0792c41c915b87474958689a2acd5c214b393015 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 25 Jun 2020 17:04:01 +0900
Subject: [PATCH 2/2] Show correct value in pg_replication_slots.wal_status
 after restart

The column may show a bogus status until a checkpoint removes at least one
WAL segment. This patch changes the logic to detect the state so that
the column shows the correct status after restart.
---
 src/backend/replication/slotfuncs.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c 

Re: Review for GetWALAvailability()

2020-06-24 Thread Fujii Masao




On 2020/06/25 12:57, Alvaro Herrera wrote:

On 2020-Jun-25, Fujii Masao wrote:


/*
 * Find the oldest extant segment file. We get 1 until checkpoint 
removes
 * the first WAL segment file since startup, which causes the status 
being
 * wrong under certain abnormal conditions but that doesn't actually 
harm.
 */
oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after the server restart. Isn't this
really confusing? At least it seems better to document that behavior.


Hmm.


Or if we *can ensure* that the slot with invalidated_at set always means
"lost" slot, we can judge that wal_status is "lost" without using fragile
XLogGetLastRemovedSegno(). Thought?


Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards.


Yes.


I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?


We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
just after setting invalidated_at and restart_lsn in 
InvalidateObsoleteReplicationSlots()?
Otherwise, restart_lsn can go back to the previous value after the restart.

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..5584e5dd2c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,6 +1229,13 @@ restart:
s->data.invalidated_at = s->data.restart_lsn;
s->data.restart_lsn = InvalidXLogRecPtr;
SpinLockRelease(&s->mutex);
+
+   /*
+* Save this invalidated slot to disk, to ensure that the slot
+* is still invalid even after the server restart.
+*/
+   ReplicationSlotMarkDirty();
+   ReplicationSlotSave();
ReplicationSlotRelease();
 
/* if we did anything, start from scratch */


Maybe we don't need to do this if the slot is temporary?


Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.


Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
value even after the restart?


This seems more work to implement.


Yes.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-24 Thread Alvaro Herrera
On 2020-Jun-25, Fujii Masao wrote:

>   /*
>* Find the oldest extant segment file. We get 1 until checkpoint 
> removes
>* the first WAL segment file since startup, which causes the status 
> being
>* wrong under certain abnormal conditions but that doesn't actually 
> harm.
>*/
>   oldestSeg = XLogGetLastRemovedSegno() + 1;
> 
> I see the point of the above comment, but this can cause wal_status to be
> changed from "lost" to "unreserved" after the server restart. Isn't this
> really confusing? At least it seems better to document that behavior.

Hmm.

> Or if we *can ensure* that the slot with invalidated_at set always means
> "lost" slot, we can judge that wal_status is "lost" without using fragile
> XLogGetLastRemovedSegno(). Thought?

Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards.  I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?
Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.

> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> value even after the restart?

This seems more work to implement.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-24 Thread Fujii Masao




On 2020/06/25 3:27, Alvaro Herrera wrote:

Thanks for those corrections.

I have pushed this.  I think all problems Masao-san reported have been
dealt with, so we're done here.


Sorry for my late reply here...

Thanks for committing the patch and improving the feature!

/*
 * Find the oldest extant segment file. We get 1 until checkpoint 
removes
 * the first WAL segment file since startup, which causes the status 
being
 * wrong under certain abnormal conditions but that doesn't actually 
harm.
 */
oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after the server restart. Isn't this
really confusing? At least it seems better to document that behavior.

Or if we *can ensure* that the slot with invalidated_at set always means
"lost" slot, we can judge that wal_status is "lost" without using fragile
XLogGetLastRemovedSegno(). Thought?

Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
value even after the restart?

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-24 Thread Alvaro Herrera
Thanks for those corrections.

I have pushed this.  I think all problems Masao-san reported have been
dealt with, so we're done here.

Thanks!

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-23 Thread Kyotaro Horiguchi
At Tue, 23 Jun 2020 19:06:25 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-16, Kyotaro Horiguchi wrote:
> 
> > I saw that the "reserved" is the state where slots are working to
> > retain segments, and "normal" is the state to indicate that "WAL
> > segments are within max_wal_size", which is orthogonal to the notion
> > of "reserved".  So it seems to me useless when the retained WAL
> > segments cannot exceed max_wal_size.
> > 
> > With longer description they would be:
> > 
> > "reserved under max_wal_size"
> > "reserved over max_wal_size"
> > "lost some segements"
> 
> > Come to think of that, I realized that my trouble was just the
> > wording.  Do the following wordings make sense to you?
> > 
> > "reserved"  - retained within max_wal_size
> > "extended"  - retained over max_wal_size
> > "lost"  - lost some segments
> 
> So let's add Unreserved to denote the state that it's over the slot size
> but no segments have been removed yet:

Oh! Thanks for the better word. It looks good to me.

> * Reserved    under max_wal_size
> * Extended    past max_wal_size, but still within wal_keep_segments or
>   maximum slot size.
> * Unreserved  Past wal_keep_segments and the maximum slot size, but
>   not yet removed.  Recoverable condition.
> * Lost        lost segments.  Unrecoverable condition.

Looks good, too.

> It seems better to me to save the invalidation LSN in the persistent
> data rather than the in-memory data that's lost on restart.  As is, we
> would lose the status in a restart, which doesn't seem good to me.  It's
> just eight extra bytes to write ... should be pretty much free.

Agreed.

> This version I propose is based on the one you posted earlier today and
> is what I propose for commit.


-   /* slot does not reserve WAL. Either deactivated, or has never been 
active */
+   /*
+* slot does not reserve WAL. Either deactivated, or has never been 
active
+*/

Sorry, this is my fault. The change is useless.  The code for
WALAVAIL_REMOVED looks good.


 # Advance WAL again without checkpoint, reducing remain by 6 MB.
+$result = $node_master->safe_psql('postgres',
+   "SELECT wal_status, restart_lsn, min_safe_lsn FROM pg_replication_slots 
WHERE slot_name = 'rep1'"
+);
+print $result, "\n";

Sorry this is my fault, too. Removed in the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5a66115df1..49a881b262 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11239,19 +11239,29 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
Possible values are:

 
- normal means that the claimed files
+ reserved means that the claimed files
   are within max_wal_size.
 
 
- reserved means
+ extended means
   that max_wal_size is exceeded but the files are
-  still held, either by some replication slot or
-  by wal_keep_segments.
+  still retained, either by the replication slot or
+  by wal_keep_segments.
+ 
 
 
- lost means that some WAL files are
-  definitely lost and this slot cannot be used to resume replication
-  anymore.
+ 
+  unreserved means that the slot no longer
+  retains the required WAL files and some of them are to be removed at
+  the next checkpoint.  This state can return
+  to reserved or extended.
+ 
+
+
+ 
+  lost means that some required WAL files have
+  been removed and this slot is no longer usable.
+ 
 

The last two states are seen only when
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..4a4bb30418 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9485,15 +9485,20 @@ CreateRestartPoint(int flags)
  *		(typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
- * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
- *   a slot with this LSN cannot continue.
+ * * WALAVAIL_UNRESERVED means it is being lost and the next checkpoint will
+ *   remove reserved segments. The walsender using this slot may return to the
+ *   above.
+ *
+ * * WALAVAIL_REMOVED means it has been removed. A 

Re: Review for GetWALAvailability()

2020-06-23 Thread Alvaro Herrera
On 2020-Jun-16, Kyotaro Horiguchi wrote:

> I saw that the "reserved" is the state where slots are working to
> retain segments, and "normal" is the state to indicate that "WAL
> segments are within max_wal_size", which is orthogonal to the notion
> of "reserved".  So it seems to me useless when the retained WAL
> segments cannot exceed max_wal_size.
> 
> With longer description they would be:
> 
> "reserved under max_wal_size"
> "reserved over max_wal_size"
> "lost some segements"

> Come to think of that, I realized that my trouble was just the
> wording.  Do the following wordings make sense to you?
> 
> "reserved"  - retained within max_wal_size
> "extended"  - retained over max_wal_size
> "lost"  - lost some segments

So let's add Unreserved to denote the state where the slot is over the
slot size but no segments have been removed yet:

* Reserved  under max_wal_size
* Extended  past max_wal_size, but still within wal_keep_segments or
maximum slot size.
* Unreserved  Past wal_keep_segments and the maximum slot size, but
not yet removed.  Recoverable condition.
* Lost  lost segments.  Unrecoverable condition.
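
In GetWALAvailability() terms, the mapping would be roughly the following
(a sketch only; variable names are assumptions based on the existing
function, with targetSeg the segment holding the slot's restart_lsn):

	/*
	 * oldestSegMaxWalSize: oldest segment kept by max_wal_size alone
	 * oldestSlotSeg: oldest segment kept by wal_keep_segments /
	 *                max_slot_wal_keep_size
	 * oldestSeg: XLogGetLastRemovedSegno() + 1
	 */
	if (targetSeg >= oldestSegMaxWalSize)
		return WALAVAIL_RESERVED;	/* "reserved" */
	if (targetSeg >= oldestSlotSeg)
		return WALAVAIL_EXTENDED;	/* "extended" */
	if (targetSeg >= oldestSeg)
		return WALAVAIL_UNRESERVED;	/* "unreserved": not removed yet */
	return WALAVAIL_REMOVED;		/* "lost" */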


It seems better to me to save the invalidation LSN in the persistent
data rather than the in-memory data that's lost on restart.  As is, we
would lose the status in a restart, which doesn't seem good to me.  It's
just eight extra bytes to write ... should be pretty much free.
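
Concretely, the idea is to keep the value next to restart_lsn in the
on-disk part of the slot (a rough sketch of the placement, not the exact
patch text):

	/* in ReplicationSlotPersistentData (slot.h): */

	/* oldest LSN that might be required by this replication slot */
	XLogRecPtr	restart_lsn;

	/*
	 * restart_lsn is copied here when the slot is invalidated; being part
	 * of the persistent data, it is written out with the slot and thus
	 * survives a restart.
	 */
	XLogRecPtr	invalidated_at;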

This version I propose is based on the one you posted earlier today and
is what I propose for commit.

Thanks!

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..8945959f9d 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11239,19 +11239,29 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
Possible values are:

 
- normal means that the claimed files
+ reserved means that the claimed files
   are within max_wal_size.
 
 
- reserved means
+ extended means
   that max_wal_size is exceeded but the files are
-  still held, either by some replication slot or
-  by wal_keep_segments.
+  still retained, either by the replication slot or
+  by wal_keep_segments.
+ 
 
 
- lost means that some WAL files are
-  definitely lost and this slot cannot be used to resume replication
-  anymore.
+ 
+  unreserved means that the slot no longer
+  retains the required WAL files and some of them are to be removed at
+  the next checkpoint.  This state can return
+  to reserved or extended.
+ 
+
+
+ 
+  lost means that some required WAL files have
+  been removed and this slot is no longer usable.
+ 
 

The last two states are seen only when
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 34ede80c44..e455384b5b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9488,15 +9488,20 @@ CreateRestartPoint(int flags)
  *		(typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
- * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
- *   a slot with this LSN cannot continue.
+ * * WALAVAIL_UNRESERVED means it is being lost and the next checkpoint will
+ *   remove reserved segments. The walsender using this slot may return to the
+ *   above.
+ *
+ * * WALAVAIL_REMOVED means it has been removed. A replication stream on
+ *   a slot with this LSN cannot continue after a restart.
  *
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
@@ -9512,13 +9517,18 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * slot */
 	uint64		keepSegs;
 
-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 */
 	if (XLogRecPtrIsInvalid(targetLSN))
 		return WALAVAIL_INVALID_LSN;
 
 	currpos = GetXLogWriteRecPtr();
 
-	/* calculate oldest segment currently needed by slots */
+	/*
+	 * calculate the oldest segment currently reserved by all slots,
+	 * considering wal_keep_segments and 

Re: Review for GetWALAvailability()

2020-06-23 Thread Kyotaro Horiguchi
Thanks for looking at this.

At Fri, 19 Jun 2020 18:23:59 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-17, Kyotaro Horiguchi wrote:
> 
> > @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
> >  * the first WAL segment file since startup, which causes the status 
> > being
> >  * wrong under certain abnormal conditions but that doesn't actually 
> > harm.
> >  */
> > -   oldestSeg = XLogGetLastRemovedSegno() + 1;
> > +   oldestSeg = last_removed_seg + 1;
> >  
> > /* calculate oldest segment by max_wal_size and wal_keep_segments */
> > XLByteToSeg(currpos, currSeg, wal_segment_size);
> 
> This hunk should have updated the comment two lines above.  However:
> 
> > @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
> > rsinfo->setResult = tupstore;
> > rsinfo->setDesc = tupdesc;
> >  
> > +   /*
> > +* Remember the last removed segment at this point for the consistency 
> > in
> > +* this table. Since there's no interlock between slot data and
> > +* checkpointer, the segment can be removed in-between, but that doesn't
> > +* make any practical difference.
> > +*/
> > +   last_removed_seg = XLogGetLastRemovedSegno();
> 
> I am mystified as to why you added this change.  I understand that your
> point here is to make all slots reported their state as compared to the
> same LSN, but why do it like that?  If a segment is removed in between,
> it could mean that the view reports more lies than it would if we update
> the segno for each slot.  I mean, suppose two slots are lagging behind
> and one is reported as 'extended' because when we compute it it's still
> in range; then a segment is removed.  With your coding, we'll report
> both as extended, but with the original coding, we'll report the new one
> as lost.  By the time the user reads the result, they'll read one
> incorrect report with the original code, and two incorrect reports with
> your code.  So ... yes it might be more consistent, but what does that
> buy the user?

I agree with you. Anyway, the view may show "wrong" statuses if
concurrent WAL-file removal is running, but I understand it is better
for the numbers in a view to be consistent.  The change contributes
only to that point, which is why I noted "doesn't make any practical
difference".  Since it is going to be removed, I removed the changes
for that part.

https://www.postgresql.org/message-id/9ddfbf8c-2f67-904d-44ed-cf8bc5916...@oss.nttdata.com

> OTOH it makes GetWALAvailability gain a new argument, which we have to
> document.
> 
> > +   /*
> > +* However segments required by the slot has been lost, if walsender is
> > +* active the walsender can read into the first reserved slot.
> > +*/
> > +   if (slot_is_active)
> > +   return WALAVAIL_BEING_REMOVED;
> 
> I don't understand this comment; can you please clarify what you mean?

I have received comments that the "lost" state should be definitive,
that is, a state that must not go back to other states.  I got the same
comment from Fujii-san again.

Suppose we are starting from the following situation:

State A:
| seg n-1 | seg n |
 ^
 X (restart_lsn of slot S) - max_slot_wal_keep_size 

If the segment n-1 is removed, slot S's status becomes
"lost". However, if the walsender that is using the slot has not been
killed yet, the point X can move forward to segment n (State B).

State B:
| seg n-1 | seg n |
   ^
   X (restart_lsn of slot S) - max_slot_wal_keep_size 

This is the normal (or extended) state.  If we want the state "lost"
to be definitive, we cannot apply the label "lost" to State A while
the slot is active.

WALAVAIL_BEING_REMOVED (I noticed it had been removed for the wrong
reason, so I revived it in this patch [1]) was used for the same state,
that is, the segment at restart_lsn will be removed soon but has not
been removed yet.

1: 
https://www.postgresql.org/message-id/20200406.185027.648866525989475817.horikyota@gmail.com

> I admit I don't like this slot_is_active argument you propose to add to
> GetWALAvailability either; previously the function can be called with
> an LSN coming from anywhere, not just a slot; the new argument implies
> that the LSN comes from a slot.  (Your proposed patch doesn't document
> this one either.)

Agreed. I felt the same way at the time.  I came up with another
approach after hearing that from you.

In the attached patch, GetWALAvailability() returns the state assuming
the walsender is not active, and the caller (pg_get_replication_slots())
handles the case where the walsender is active.
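
Roughly, the caller-side adjustment looks like this (a sketch of the
idea only; the exact condition and the state reported for an active
walsender are in the attached patch):

	walstate = GetWALAvailability(slot_contents.data.restart_lsn);

	/*
	 * Sketch: if the required segment is gone but a walsender still has
	 * the slot acquired, it may still be reading from an already-opened
	 * file, so don't report the slot as "lost" yet.
	 */
	if (walstate == WALAVAIL_REMOVED && slot_contents.active_pid != 0)
		walstate = WALAVAIL_UNRESERVED;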

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 74fb205fc87397671392f4f877b2466d1d19869c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jun 2020 16:18:23 +0900
Subject: [PATCH] Some fixes of pg_replication_slots.wal_status

The column is shown as "lost" when the segment at the given LSN has
been removed by a checkpoint. But in 

Re: Review for GetWALAvailability()

2020-06-19 Thread Alvaro Herrera
On 2020-Jun-17, Kyotaro Horiguchi wrote:

> @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
>* the first WAL segment file since startup, which causes the status 
> being
>* wrong under certain abnormal conditions but that doesn't actually 
> harm.
>*/
> - oldestSeg = XLogGetLastRemovedSegno() + 1;
> + oldestSeg = last_removed_seg + 1;
>  
>   /* calculate oldest segment by max_wal_size and wal_keep_segments */
>   XLByteToSeg(currpos, currSeg, wal_segment_size);

This hunk should have updated the comment two lines above.  However:

> @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>   rsinfo->setResult = tupstore;
>   rsinfo->setDesc = tupdesc;
>  
> + /*
> +  * Remember the last removed segment at this point for the consistency 
> in
> +  * this table. Since there's no interlock between slot data and
> +  * checkpointer, the segment can be removed in-between, but that doesn't
> +  * make any practical difference.
> +  */
> + last_removed_seg = XLogGetLastRemovedSegno();

I am mystified as to why you added this change.  I understand that your
point here is to make all slots report their state as compared to the
same LSN, but why do it like that?  If a segment is removed in between,
it could mean that the view reports more lies than it would if we update
the segno for each slot.  I mean, suppose two slots are lagging behind
and one is reported as 'extended' because when we compute it it's still
in range; then a segment is removed.  With your coding, we'll report
both as extended, but with the original coding, we'll report the new one
as lost.  By the time the user reads the result, they'll read one
incorrect report with the original code, and two incorrect reports with
your code.  So ... yes it might be more consistent, but what does that
buy the user?

OTOH it makes GetWALAvailability gain a new argument, which we have to
document.

> + /*
> +  * However segments required by the slot has been lost, if walsender is
> +  * active the walsender can read into the first reserved slot.
> +  */
> + if (slot_is_active)
> + return WALAVAIL_BEING_REMOVED;

I don't understand this comment; can you please clarify what you mean?

I admit I don't like this slot_is_active argument you propose to add to
GetWALAvailability either; previously the function can be called with
an LSN coming from anywhere, not just a slot; the new argument implies
that the LSN comes from a slot.  (Your proposed patch doesn't document
this one either.)

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-19 Thread Fujii Masao




On 2020/06/18 16:36, Kyotaro Horiguchi wrote:

Mmm. I hurried too much..

At Thu, 18 Jun 2020 16:32:59 +0900 (JST), Kyotaro Horiguchi 
 wrote in

If name is specified (so slot is NULL) to
ReplicationSlotAcquireInternal and the slot is not found, the ereport
in the following code dereferences NULL.


That's bogus. It is using name in that case. Sorry for the noise.

I didn't find a problem from a brief look at it.


Thanks for the review! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-18 Thread Kyotaro Horiguchi
Mmm. I hurried too much..

At Thu, 18 Jun 2020 16:32:59 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> If name is specified (so slot is NULL) to
> ReplicationSlotAcquireInternal and the slot is not found, the ereport
> in the following code dereferences NULL.

That's bogus. It is using name in that case. Sorry for the noise.

I didn't find a problem from a brief look at it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-18 Thread Kyotaro Horiguchi
At Thu, 18 Jun 2020 14:54:47 +0900, Fujii Masao  
wrote in 
> Sorry, this caused a compiler failure. So I fixed that and
> attached the updated version of the patch.

At Thu, 18 Jun 2020 14:40:55 +0900, Fujii Masao  
wrote in 
> > + errmsg("replication slot \"%s\" does not exist", name)));
> > The error message is not right when the given slot doesn't match the
> > given name.
> 
> This doesn't happen after applying Alvaro's patch.

If name is specified (so slot is NULL) to
ReplicationSlotAcquireInternal and the slot is not found, the ereport
in the following code dereferences NULL.


if (s == NULL || !s->in_use)
{
LWLockRelease(ReplicationSlotControlLock);

if (behavior == SAB_Inquire)
return -1;
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
 errmsg("replication slot \"%s\" does not exist",
name ? name : NameStr(slot->data.name))));
}


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/18 14:40, Fujii Masao wrote:



On 2020/06/18 11:44, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in

ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.


OK, so what about the attached patch? I added
ConditionVariablePrepareToSleep()
just before entering the "for" loop in
InvalidateObsoleteReplicationSlots().


Thanks.


Thanks for the review!



ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


Fixed.

Also I added "Only one of slot and name can be specified." into
the comments of ReplicationSlotAcquireInternal().



The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.


Yes, you're right. I merged your 0001.patch into mine.

+    if (behavior != SAB_Inquire)
+    ConditionVariablePrepareToSleep(&s->active_cv);
+    else if (behavior != SAB_Inquire)

Isn't "behavior == SAB_Block" condition better here?
I changed the patch that way.

The attached is the updated version of the patch.
I also merged Alvaro's patch into this.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line is needed for the case slot is
given, but it is already done in SearchNamedReplicationSlot if slot is
not given.  I would like something like the following instead, but I
don't insist on it.


Yes, I got rid of the strcmp() check, but left the in_use check as it is.
I like that because it's simpler.



 ReplicationSlot *s = NULL;
 ...
 if (!slot)
 s = SearchNamedReplicationSlot(name);
 else if(s->in_use && strcmp(name, NameStr(s->data.name)))
 s = slot;


+    ereport(ERROR,
+    (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.


This doesn't happen after applying Alvaro's patch.

BTW, using "name" here is not valid because it may be NULL.
So I added the following code and used "slot_name" in log messages.

+    slot_name = name ? name : NameStr(slot->data.name);


Sorry, this caused a compiler failure. So I fixed that and
attached the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..a7bbcf3499 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,117 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning process otherwise.  If behavior is SAB_Error, then trying to
- * acquire an owned slot is an error.  If SAB_Block, we sleep until the
- * slot is released by the owning process.
+ * it's zero if we successfully acquired the slot, 

Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/18 11:44, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in

ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.


OK, so what about the attached patch? I added
ConditionVariablePrepareToSleep()
just before entering the "for" loop in
InvalidateObsoleteReplicationSlots().


Thanks.


Thanks for the review!



ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


Fixed.

Also I added "Only one of slot and name can be specified." into
the comments of ReplicationSlotAcquireInternal().



The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.


Yes, you're right. I merged your 0001.patch into mine.

+   if (behavior != SAB_Inquire)
+   ConditionVariablePrepareToSleep(&s->active_cv);
+   else if (behavior != SAB_Inquire)

Isn't "behavior == SAB_Block" condition better here?
I changed the patch that way.

The attached is the updated version of the patch.
I also merged Alvaro's patch into this.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line are needed for the case where the
slot is given, but it is already done in SearchNamedReplicationSlot if
the slot is not given.  I would like something like the following
instead, but I don't insist on it.


Yes, I got rid of the strcmp() check, but left the in_use check as it is.
I like that because it's simpler.



 ReplicationSlot *s = NULL;
 ...
 if (!slot)
 s = SearchNamedReplicationSlot(name);
 else if(s->in_use && strcmp(name, NameStr(s->data.name)))
 s = slot;


+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.


This doesn't happen after applying Alvaro's patch.

BTW, using "name" here is not valid because it may be NULL.
So I added the following code and used "slot_name" in log messages.

+   slot_name = name ? name : NameStr(slot->data.name);

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..77cf366ef1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,123 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+   ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning process otherwise.  If behavior is SAB_Error, then trying to
- * acquire an owned slot is an error.  If SAB_Block, we sleep until the
- * slot is released by the owning process.
+ * it's zero if we successfully acquired the slot, -1 if the slot no longer
+ * exists, or the PID of the owning process otherwise.  If behavior is
+ * SAB_Error, then 

Re: Review for GetWALAvailability()

2020-06-17 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 20:13:01 +0900, Fujii Masao  
wrote in 
> > ReplicationSlotAcquireInternal.  I think we should call
> > ConditionVariablePrepareToSleep before the surrounding for statement
> > block.
> 
> OK, so what about the attached patch? I added
> ConditionVariablePrepareToSleep()
> just before entering the "for" loop in
> InvalidateObsoleteReplicationSlots().

Thanks.

ReplicationSlotAcquireInternal:
+ * If *slot == NULL, search for the slot with the given name.

'*' seems needless here.


The patch moves ConditionVariablePrepareToSleep. We need to call the
function before looking into active_pid as originally commented.
Since it is not protected by ReplicationSlotControlLock, just before
releasing the lock is not correct.

The attached on top of the v3 fixes that.



+   s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
+   if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)

The conditions in the second line are needed for the case where the
slot is given, but it is already done in SearchNamedReplicationSlot if
the slot is not given.  I would like something like the following
instead, but I don't insist on it.

ReplicationSlot *s = NULL;
...
if (!slot)
s = SearchNamedReplicationSlot(name);
else if(s->in_use && strcmp(name, NameStr(s->data.name)))
s = slot;


+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("replication slot \"%s\" does not exist", name)));

The error message is not right when the given slot doesn't match the
given name.  It might be better to leave it to the caller.  Currently
no such caller exists so I don't insist on this but the message should
be revised otherwise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8f1913ec7ff3809d11adf1611aba57e44cb42a82 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 10:55:49 +0900
Subject: [PATCH 1/3] 001

---
 src/backend/replication/slot.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b94e11a8e7..dcc76c4783 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -412,6 +412,14 @@ retry:
 	 */
 	if (IsUnderPostmaster)
 	{
+		/*
+		 * Get ready to sleep on it in case it is active if needed.
+		 * (We may end up not sleeping, but we don't want to do this while
+		 * holding the spinlock.)
+		 */
+		if (behavior != SAB_Inquire)
+			ConditionVariablePrepareToSleep(&s->active_cv);
+
 		SpinLockAcquire(&s->mutex);
 		if (s->active_pid == 0)
 			s->active_pid = MyProcPid;
@@ -438,12 +446,13 @@ retry:
 			return active_pid;
 
 		/* Wait here until we get signaled, and then restart */
-		ConditionVariablePrepareToSleep(&s->active_cv);
 		ConditionVariableSleep(&s->active_cv,
 			   WAIT_EVENT_REPLICATION_SLOT_DROP);
 		ConditionVariableCancelSleep();
 		goto retry;
 	}
+	else if (behavior != SAB_Inquire)
+		ConditionVariableCancelSleep();
 
 	/* Let everybody know we've modified this slot */
 	ConditionVariableBroadcast(&s->active_cv);
-- 
2.18.4

>From 1b2f94d6bfb4508b2cbf3d552a5615ae2959e90c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 11:25:25 +0900
Subject: [PATCH 2/3] 002

---
 src/backend/replication/slot.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dcc76c4783..8893516f00 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -380,7 +380,7 @@ static int
 ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name,
 			   SlotAcquireBehavior behavior)
 {
-	ReplicationSlot *s;
+	ReplicationSlot *s = NULL;
 	int			active_pid;
 
 retry:
@@ -393,8 +393,12 @@ retry:
 	 * acquire is not given. If the slot is not found, we either
 	 * return -1 or error out.
 	 */
-	s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
-	if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)
+	if (!slot)
+		s = SearchNamedReplicationSlot(name);
+	else if(s->in_use && strcmp(name, NameStr(s->data.name)))
+		s = slot;
+
+	if (s == NULL)
 	{
 		if (behavior == SAB_Inquire)
 		{
-- 
2.18.4

>From 2d4a83abd2f278a9f9dc6e9329aed1acc191f2c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 18 Jun 2020 11:33:29 +0900
Subject: [PATCH 3/3] 003

---
 src/backend/replication/slot.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8893516f00..25ae334a29 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -373,8 +373,10 @@ ReplicationSlotAcquire(const char *name, SlotAcquireBehavior behavior)
  * Mark the specified slot as used by this process.
  *
  * If *slot == NULL, search for the slot with the given name.
- *
  * See comments about the return value in 

Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao




On 2020/06/18 3:04, Alvaro Herrera wrote:

I think passing the slot name when the slot is also passed is useless
and wasteful; it'd be better to pass NULL for the name and ignore the
strcmp() in that case -- in fact I suggest to forbid passing both name
and slot.  (Any failure there would risk raising an error during
checkpoint, which is undesirable.)


Sounds reasonable.


So I propose the following tweaks to your patch, and otherwise +1.


Thanks for the patch! It looks good to me.

Barring any objections, I will commit the patches in the master and
v13 branches later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-17 Thread Alvaro Herrera
On 2020-Jun-17, Kyotaro Horiguchi wrote:

> @@ -342,7 +351,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
>   else
>   nulls[i++] = true;
>  
> - walstate = GetWALAvailability(slot_contents.data.restart_lsn);
> + /* use last_invalidated_lsn when the slot is invalidated */
> + if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
> + targetLSN = slot_contents.last_invalidated_lsn;
> + else
> + targetLSN = slot_contents.data.restart_lsn;
> +
> + walstate = GetWALAvailability(targetLSN, last_removed_seg,
> +   
> slot_contents.active_pid != 0);

Yeah, this approach seems better overall.  I'll see if I can get this
done after lunch.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-17 Thread Alvaro Herrera
I think passing the slot name when the slot is also passed is useless
and wasteful; it'd be better to pass NULL for the name and ignore the
strcmp() in that case -- in fact I suggest to forbid passing both name
and slot.  (Any failure there would risk raising an error during
checkpoint, which is undesirable.)

So I propose the following tweaks to your patch, and otherwise +1.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b94e11a8e7..d99d0e9b42 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -383,24 +383,25 @@ ReplicationSlotAcquireInternal(ReplicationSlot *slot, const char *name,
 	ReplicationSlot *s;
 	int			active_pid;
 
+	AssertArg((slot == NULL) ^ (name == NULL));
+
 retry:
 	Assert(MyReplicationSlot == NULL);
 
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	/*
-	 * Search for the slot with the specified name if the slot to
-	 * acquire is not given. If the slot is not found, we either
-	 * return -1 or error out.
+	 * Search for the slot with the specified name if the slot to acquire is
+	 * not given. If the slot is not found, we either return -1 or error out.
 	 */
-	s = (slot == NULL) ? SearchNamedReplicationSlot(name) : slot;
-	if (s == NULL || !s->in_use || strcmp(name, NameStr(s->data.name)) != 0)
+	s = slot ? slot : SearchNamedReplicationSlot(name);
+	if (s == NULL || !s->in_use ||
+		(name && strcmp(name, NameStr(s->data.name)) != 0))
 	{
+		LWLockRelease(ReplicationSlotControlLock);
+
 		if (behavior == SAB_Inquire)
-		{
-			LWLockRelease(ReplicationSlotControlLock);
 			return -1;
-		}
 		ereport(ERROR,
 (errcode(ERRCODE_UNDEFINED_OBJECT),
  errmsg("replication slot \"%s\" does not exist", name)));
@@ -1161,8 +1162,7 @@ restart:
 			 * we have to wait on that CV for the process owning
 			 * the slot to be terminated, later.
 			 */
-			wspid = ReplicationSlotAcquireInternal(s, NameStr(slotname),
-   SAB_Inquire);
+			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
 
 			/*
 			 * Exit the loop if we successfully acquired the slot or


Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/17 17:30, Kyotaro Horiguchi wrote:

At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao  
wrote in



On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
 wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:



So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about
this
case into the docs? Otherwise the users would be surprised at
termination
of backends by max_slot_wal_keep_size. I guess that it's basically
rarely
happen, though.


Well, if we could distinguish a walsender from a non-walsender
process,
then maybe it would make sense to leave backends alive.  But do we
want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?


I have no better opinion about this. So I agree to leave the logic as
it is
at least for now, i.e., we terminate the process owning the slot
whatever
the type of process is.


Agreed.


+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+   if (!killed && kill(active_pid, SIGTERM) == 0)
+   killed = true;
+   ConditionVariableTimedSleep(&slot->active_cv, 100,
+   
WAIT_EVENT_REPLICATION_SLOT_DROP);
+   } while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many
times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean; in this code path, signaling fails only when the target
process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?


I guess kill() can also fail if the PID now belongs to a process owned
by a different user.


Yes. This case means that the PostgreSQL process using the slot
disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill()
fails
for this reason, we don't need to kill() again.


  I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.

The first call to ConditionVariableTimedSleep doesn't actually
sleep, so the loop works as expected.  But we may make an extra call
to kill(2).  Calling ConditionVariablePrepareToSleep before the
loop would make it better.


Sorry I failed to understand your point...


My point is the ConditionVariableTimedSleep does *not* sleep on the CV
first time in this usage. The new version anyway avoids useless
kill(2) call, but still may make an extra call to
ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.


OK, so what about the attached patch? I added ConditionVariablePrepareToSleep()
just before entering the "for" loop in InvalidateObsoleteReplicationSlots().

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..b94e11a8e7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 intmax_replication_slots = 0;  /* the maximum number 
of replication

 * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+   
  const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,107 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock,
+   LW_SHARED));
+
+   for 

Re: Review for GetWALAvailability()

2020-06-17 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2020/06/17 12:10, Kyotaro Horiguchi wrote:
> > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
> >  wrote in
> >> On 2020-Jun-17, Fujii Masao wrote:
> >>> On 2020/06/17 3:50, Alvaro Herrera wrote:
> >>
> >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> >>> But do we want to do this? If we want, we should add the note about
> >>> this
> >>> case into the docs? Otherwise the users would be surprised at
> >>> termination
> >>> of backends by max_slot_wal_keep_size. I guess that it's basically
> >>> rarely
> >>> happen, though.
> >>
> >> Well, if we could distinguish a walsender from a non-walsender
> >> process,
> >> then maybe it would make sense to leave backends alive.  But do we
> >> want
> >> that?  I admit I don't know what would be the reason to have a
> >> non-walsender process with an active slot, so I don't have a good
> >> opinion on what to do in this case.
> > The non-walsender backend is actually doing replication work. It
> > rather should be killed?
> 
> I have no better opinion about this. So I agree to leave the logic as
> it is
> at least for now, i.e., we terminate the process owning the slot
> whatever
> the type of process is.

Agreed.

> > +   /*
> > +* Signal to terminate the process using the 
> > replication slot.
> > +*
> > +* Try to signal every 100ms until it succeeds.
> > +*/
> > +   if (!killed && kill(active_pid, SIGTERM) == 0)
> > +   killed = true;
> > +   ConditionVariableTimedSleep(&slot->active_cv, 100,
> > +   
> > WAIT_EVENT_REPLICATION_SLOT_DROP);
> > +   } while (ReplicationSlotIsActive(slot, NULL));
> 
>  Note that here you're signalling only once and then sleeping many
>  times
>  in increments of 100ms -- you're not signalling every 100ms as the
>  comment claims -- unless the signal fails, but you don't really expect
>  that.  On the contrary, I'd claim that the logic is reversed: if the
>  signal fails, *then* you should stop signalling.
> >>>
> >>> You mean; in this code path, signaling fails only when the target
> >>> process
> >>> disappears just before signaling. So if it fails, slot->active_pid is
> >>> expected to become 0 even without signaling more. Right?
> >>
> >> I guess kill() can also fail if the PID now belongs to a process owned
> >> by a different user.
> 
> Yes. This case means that the PostgreSQL process using the slot
> disappeared
> and the same PID was assigned to non-PostgreSQL process. So if kill()
> fails
> for this reason, we don't need to kill() again.
> 
> >  I think we've disregarded very quick reuse of
> >> PIDs, so we needn't concern ourselves with it.
> > The first call to ConditionVariableTimedSleep doesn't actually
> > sleep, so the loop works as expected.  But we may make an extra call
> > to kill(2).  Calling ConditionVariablePrepareToSleep before the
> > loop would make it better.
> 
> Sorry I failed to understand your point...

My point is that ConditionVariableTimedSleep does *not* sleep on the CV
the first time in this usage. The new version anyway avoids the useless
kill(2) call, but may still make an extra call to
ReplicationSlotAcquireInternal.  I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.
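
Something along these lines, in other words (a sketch of the intended
shape only; variable names are assumptions based on the patches in this
thread):

	int			wspid;

	/*
	 * Prepare once, before the retry loop, so the first timed sleep on
	 * the CV really waits instead of returning immediately (which would
	 * cause an extra acquire/kill attempt).
	 */
	ConditionVariablePrepareToSleep(&slot->active_cv);

	for (;;)
	{
		wspid = ReplicationSlotAcquireInternal(slot, NULL, SAB_Inquire);
		if (wspid <= 0)
			break;				/* acquired the slot, or it is gone */

		(void) kill((pid_t) wspid, SIGTERM);
		ConditionVariableTimedSleep(&slot->active_cv, 100,
									WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();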

> Anyway, the attached is the updated version of the patch. This fixes
> all the issues in InvalidateObsoleteReplicationSlots() that I reported
> upthread.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-17 Thread Fujii Masao



On 2020/06/17 12:10, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera  
wrote in

On 2020-Jun-17, Fujii Masao wrote:

On 2020/06/17 3:50, Alvaro Herrera wrote:



So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.


Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive.  But do we want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.


The non-walsender backend is actually doing replication work. It
rather should be killed?


I have no better opinion about this. So I agree to leave the logic as it is
at least for now, i.e., we terminate the process owning the slot whatever
the type of process is.




+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+   if (!killed && kill(active_pid, SIGTERM) == 0)
+   killed = true;
+   ConditionVariableTimedSleep(&slot->active_cv, 100,
+                               WAIT_EVENT_REPLICATION_SLOT_DROP);
+   } while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?


I guess kill() can also fail if the PID now belongs to a process owned
by a different user.


Yes. This case means that the PostgreSQL process using the slot disappeared
and the same PID was assigned to non-PostgreSQL process. So if kill() fails
for this reason, we don't need to kill() again.


 I think we've disregarded very quick reuse of

PIDs, so we needn't concern ourselves with it.


The first call to ConditionVariableTimedSleep doesn't actually
sleep, so the loop works as expected.  But we may make an extra call
to kill(2).  Calling ConditionVariablePrepareToSleep before the
loop would make it better.


Sorry I failed to understand your point...

Anyway, the attached is the updated version of the patch. This fixes
all the issues in InvalidateObsoleteReplicationSlots() that I reported
upthread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..a065d41d76 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,9 @@ ReplicationSlot *MyReplicationSlot = NULL;
 int         max_replication_slots = 0;  /* the maximum number of replication
                                          * slots */
 
+static ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+static int  ReplicationSlotAcquireInternal(ReplicationSlot *slot,
+                                           const char *name, SlotAcquireBehavior behavior);
 static void ReplicationSlotDropAcquired(void);
 static void ReplicationSlotDropPtr(ReplicationSlot *slot);
 
@@ -322,77 +325,104 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 }
 
 /*
- * Find a previously created slot and mark it as used by this backend.
+ * Search for the named replication slot.
+ *
+ * Return the replication slot if found, otherwise NULL.
+ *
+ * The caller must hold ReplicationSlotControlLock in shared mode.
+ */
+static ReplicationSlot *
+SearchNamedReplicationSlot(const char *name)
+{
+   int i;
+   ReplicationSlot *slot = NULL;
+
+   for (i = 0; i < max_replication_slots; i++)
+   {
+           ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+   if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
+   {
+   slot = s;
+   break;
+   }
+   }
+
+   return slot;
+}
+
+/*
+ * Find a previously created slot and mark it as used by this process.
  *
  * The return value is only useful if behavior is SAB_Inquire, in which
- * it's zero if we successfully acquired the slot, or the PID of the
- * owning process otherwise.  If 

Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 10:17:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera  
> wrote in 
> > On 2020-Jun-16, Kyotaro Horiguchi wrote:
> > 
> > > I noticed another issue. If some required WALs are removed, the
> > > slot will be "invalidated", that is, restart_lsn is set to an invalid
> > > value. As a result we hardly ever see the "lost" state.
> > > 
> > > It can be "fixed" by remembering the validity of a slot separately
> > > from restart_lsn. Is that worth doing?
> > 
> > We discussed this before.  I agree it would be better to do this
> > in some way, but I fear that if we do it naively, some code might exist
> > that reads the LSN without realizing that it needs to check the validity
> > flag first.
> 
> Yes, that was my main concern about it. That's error-prone. How about
> remembering the LSN where invalidation happened?  It's safe since
> nothing other than the slot-monitoring functions would look at
> last_invalidated_lsn. It can be reset if active_pid is a valid pid.
> 
> InvalidateObsoleteReplicationSlots:
>  ...
>       SpinLockAcquire(&s->mutex);
> +     s->data.last_invalidated_lsn = s->data.restart_lsn;
>       s->data.restart_lsn = InvalidXLogRecPtr;
>       SpinLockRelease(&s->mutex);

The attached does that (Poc).  No document fix included.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d6fe205eb4..d3240d1e38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9485,20 +9485,25 @@ CreateRestartPoint(int flags)
  *		(typically a slot's restart_lsn)
  *
  * Returns one of the following enum values:
- * * WALAVAIL_NORMAL means targetLSN is available because it is in the range
- *   of max_wal_size.
  *
- * * WALAVAIL_PRESERVED means it is still available by preserving extra
+ * * WALAVAIL_RESERVED means targetLSN is available and it is in the range of
+ *   max_wal_size.
+ *
+ * * WALAVAIL_EXTENDED means it is still available by preserving extra
  *   segments beyond max_wal_size. If max_slot_wal_keep_size is smaller
  *   than max_wal_size, this state is not returned.
  *
+ * * WALAVAIL_BEING_REMOVED means it is being lost. The walsender using this
+ *   slot may return to the above.
+ *
  * * WALAVAIL_REMOVED means it is definitely lost. A replication stream on
  *   a slot with this LSN cannot continue.
  *
  * * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
  */
 WALAvailability
-GetWALAvailability(XLogRecPtr targetLSN)
+GetWALAvailability(XLogRecPtr targetLSN, XLogSegNo last_removed_seg,
+   bool slot_is_active)
 {
 	XLogRecPtr	currpos;		/* current write LSN */
 	XLogSegNo	currSeg;		/* segid of currpos */
@@ -9509,7 +9514,11 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * slot */
 	uint64		keepSegs;
 
-	/* slot does not reserve WAL. Either deactivated, or has never been active */
+	/*
+	 * slot does not reserve WAL. Either deactivated, or has never been active
+	 * The caller should have passed last_invalidated_lsn as targetLSN if the
+	 * slot has been invalidated.
+	 */
 	if (XLogRecPtrIsInvalid(targetLSN))
 		return WALAVAIL_INVALID_LSN;
 
@@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 * the first WAL segment file since startup, which causes the status being
 	 * wrong under certain abnormal conditions but that doesn't actually harm.
 	 */
-	oldestSeg = XLogGetLastRemovedSegno() + 1;
+	oldestSeg = last_removed_seg + 1;
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
@@ -9544,20 +9553,21 @@ GetWALAvailability(XLogRecPtr targetLSN)
 	 */
 	if (targetSeg >= oldestSeg)
 	{
-		/*
-		 * show "normal" when targetSeg is within max_wal_size, even if
-		 * max_slot_wal_keep_size is smaller than max_wal_size.
-		 */
-		if ((max_slot_wal_keep_size_mb <= 0 ||
-			 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
-			oldestSegMaxWalSize <= targetSeg)
-			return WALAVAIL_NORMAL;
-
-		/* being retained by slots */
-		if (oldestSlotSeg <= targetSeg)
+		/* show "reserved" when targetSeg is within max_wal_size */
+		if (oldestSegMaxWalSize <= targetSeg)
 			return WALAVAIL_RESERVED;
+
+		/* being retained by slots exceeding max_wal_size */
+		return WALAVAIL_EXTENDED;
 	}
 
+	/*
+	 * Even though the segments required by the slot have been lost, if the
+	 * walsender is still active it may catch up and return to one of the
+	 * states above.
+	 */
+	if (slot_is_active)
+		return WALAVAIL_BEING_REMOVED;
+
 	/* Definitely lost */
 	return WALAVAIL_REMOVED;
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..f141b29d28 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->candidate_xmin_lsn = InvalidXLogRecPtr;
 	

Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-17, Fujii Masao wrote:
> > On 2020/06/17 3:50, Alvaro Herrera wrote:
> 
> > So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> > But do we want to do this? If we want, we should add the note about this
> > case into the docs? Otherwise the users would be surprised at termination
> > of backends by max_slot_wal_keep_size. I guess that it's basically rarely
> > happen, though.
> 
> Well, if we could distinguish a walsender from a non-walsender process,
> then maybe it would make sense to leave backends alive.  But do we want
> that?  I admit I don't know what would be the reason to have a
> non-walsender process with an active slot, so I don't have a good
> opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

> > > > +   /*
> > > > +    * Signal to terminate the process using the replication slot.
> > > > +    *
> > > > +    * Try to signal every 100ms until it succeeds.
> > > > +    */
> > > > +   if (!killed && kill(active_pid, SIGTERM) == 0)
> > > > +           killed = true;
> > > > +   ConditionVariableTimedSleep(&slot->active_cv, 100,
> > > > +                               WAIT_EVENT_REPLICATION_SLOT_DROP);
> > > > +   } while (ReplicationSlotIsActive(slot, NULL));
> > > 
> > > Note that here you're signalling only once and then sleeping many times
> > > in increments of 100ms -- you're not signalling every 100ms as the
> > > comment claims -- unless the signal fails, but you don't really expect
> > > that.  On the contrary, I'd claim that the logic is reversed: if the
> > > signal fails, *then* you should stop signalling.
> > 
> > You mean; in this code path, signaling fails only when the target process
> > disappears just before signaling. So if it fails, slot->active_pid is
> > expected to become 0 even without signaling more. Right?
> 
> I guess kill() can also fail if the PID now belongs to a process owned
> by a different user.  I think we've disregarded very quick reuse of
> PIDs, so we needn't concern ourselves with it.

The first call to ConditionVariableTimedSleep doesn't actually
sleep, so the loop works as expected.  But we may make an extra call
to kill(2).  Calling ConditionVariablePrepareToSleep before the
loop would make it better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-17, Fujii Masao wrote:
> On 2020/06/17 3:50, Alvaro Herrera wrote:

> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> But do we want to do this? If we want, we should add the note about this
> case into the docs? Otherwise the users would be surprised at termination
> of backends by max_slot_wal_keep_size. I guess that it's basically rarely
> happen, though.

Well, if we could distinguish a walsender from a non-walsender process,
then maybe it would make sense to leave backends alive.  But do we want
that?  I admit I don't know what would be the reason to have a
non-walsender process with an active slot, so I don't have a good
opinion on what to do in this case.

> > > + /*
> > > +  * Signal to terminate the process using the replication slot.
> > > +  *
> > > +  * Try to signal every 100ms until it succeeds.
> > > +  */
> > > + if (!killed && kill(active_pid, SIGTERM) == 0)
> > > + killed = true;
> > > + ConditionVariableTimedSleep(&slot->active_cv, 100,
> > > +                             WAIT_EVENT_REPLICATION_SLOT_DROP);
> > > + } while (ReplicationSlotIsActive(slot, NULL));
> > 
> > Note that here you're signalling only once and then sleeping many times
> > in increments of 100ms -- you're not signalling every 100ms as the
> > comment claims -- unless the signal fails, but you don't really expect
> > that.  On the contrary, I'd claim that the logic is reversed: if the
> > signal fails, *then* you should stop signalling.
> 
> You mean; in this code path, signaling fails only when the target process
> disappears just before signaling. So if it fails, slot->active_pid is
> expected to become 0 even without signaling more. Right?

I guess kill() can also fail if the PID now belongs to a process owned
by a different user.  I think we've disregarded very quick reuse of
PIDs, so we needn't concern ourselves with it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-16 Thread Fujii Masao




On 2020/06/17 3:50, Alvaro Herrera wrote:

On 2020-Jun-17, Fujii Masao wrote:


While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
        (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
                wspid, NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.


Good point.


So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.



int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?


I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.


Understood. Thanks!



+   /*
+* Signal to terminate the process using the replication slot.
+*
+* Try to signal every 100ms until it succeeds.
+*/
+   if (!killed && kill(active_pid, SIGTERM) == 0)
+   killed = true;
+   ConditionVariableTimedSleep(&slot->active_cv, 100,
+                               WAIT_EVENT_REPLICATION_SLOT_DROP);
+   } while (ReplicationSlotIsActive(slot, NULL));


Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.


You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 14:31:43 -0400, Alvaro Herrera  
wrote in 
> On 2020-Jun-16, Kyotaro Horiguchi wrote:
> 
> > I noticed another issue. If some required WALs are removed, the
> > slot will be "invalidated", that is, restart_lsn is set to an invalid
> > value. As a result we hardly ever see the "lost" state.
> > 
> > It can be "fixed" by remembering the validity of a slot separately
> > from restart_lsn. Is that worth doing?
> 
> We discussed this before.  I agree it would be better to do this
> in some way, but I fear that if we do it naively, some code might exist
> that reads the LSN without realizing that it needs to check the validity
> flag first.

Yes, that was my main concern about it. That's error-prone. How about
remembering the LSN where invalidation happened?  It's safe since
nothing other than the slot-monitoring functions would look at
last_invalidated_lsn. It can be reset if active_pid is a valid pid.

InvalidateObsoleteReplicationSlots:
 ...
SpinLockAcquire(&s->mutex);
+   s->data.last_invalidated_lsn = s->data.restart_lsn;
s->data.restart_lsn = InvalidXLogRecPtr;
SpinLockRelease(&s->mutex);
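
For illustration, the monitoring side could then pick the LSN to report on
roughly like this (a sketch only; last_invalidated_lsn follows the naming
above, and targetLSN would then be handed to GetWALAvailability):

    /* Sketch: if restart_lsn has been invalidated, fall back to
     * last_invalidated_lsn so the slot can still be reported as "lost"
     * rather than as having never reserved WAL. */
    SpinLockAcquire(&s->mutex);
    if (!XLogRecPtrIsInvalid(s->data.restart_lsn))
        targetLSN = s->data.restart_lsn;
    else
        targetLSN = s->data.last_invalidated_lsn;
    SpinLockRelease(&s->mutex);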

> On the other hand, maybe this is not a problem in practice, because if
> such a bug occurs, what will happen is that trying to read WAL from such
> a slot will return the error message that the WAL file cannot be found.
> Maybe this is acceptable?

I'm not sure.  For my part, a problem with that is that we would need
to look into the server logs to know what is actually going on.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-16 Thread Kyotaro Horiguchi
At Wed, 17 Jun 2020 00:46:38 +0900, Fujii Masao  
wrote in 
> >> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> >> replication
> >>  slots array and uses the "for" loop in each scan. Also it calls
> >>  ReplicationSlotAcquire() for each "for" loop cycle, and
> >>  ReplicationSlotAcquire() uses another loop to scan replication slots
> >>  array. I don't think this is good design.
> >>
> >>  ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
> >>  InvalidateObsoleteReplicationSlots() already know the index of the
> >>  slot
> >>  that we want to find. The attached patch does that. Thought?
> > The inner loop is expected to run at most several times per
> > checkpoint, which won't be a serious problem. However, it is better if
> > we can get rid of that in a reasonable way.
> > The attached patch changes the behavior for SAB_Block. Before the
> > patch, it rescans from the first slot for the same name, but with the
> > patch it just rechecks the same slot.  The only caller of the function
> > with SAB_Block is ReplicationSlotDrop and I don't come up with a case
> > where another slot with the same name is created at different place
> > before the condition variable fires. But I'm not sure the change is
> > completely safe.
> 
> Yes, that change might not be safe. So I'm thinking another approach
> to
> fix the issues.
> 
> > Maybe some assertion is needed?
> > 
> >> 3. There is a corner case where the termination of walsender cleans up
> >>  the temporary replication slot while
> >>  InvalidateObsoleteReplicationSlots()
> >>  is sleeping on ConditionVariableTimedSleep(). In this case,
> >>  ReplicationSlotAcquire() is called in the subsequent cycle of the
> >>  "for"
> >>  loop, cannot find the slot and then emits ERROR message. This leads
> >>  to the failure of checkpoint by the checkpointer.
> > Agreed.
> > 
> >>  To avoid this case, if SAB_Inquire is specified,
> >>  ReplicationSlotAcquire()
> >>  should return the special value instead of emitting ERROR even when
> >>  it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
> >>  should
> >>  handle that special returned value.
> > I thought the same thing hearing that.
> 
> While reading InvalidateObsoleteReplicationSlots() code, I found
> another issue.
> 
>   ereport(LOG,
>           (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
>                   wspid, NameStr(slotname))));
>   (void) kill(wspid, SIGTERM);
> 
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.

Agreed.

> 
>   int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);
> 
> Why do we need to call ReplicationSlotAcquire() here and mark the slot
> as
> used by the checkpointer? Isn't it enough to check directly the slot's
> active_pid, instead?
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to
> call
> ReplicationSlotRelease()? ISTM that we don't need to do that if the
> slot's
> active_pid is zero. No?

My understanding of the reason is that we update a slot value here.
The restriction allows the owner of a slot to assume that none of the
slot's values change under it.

slot.h:104
| * - Individual fields are protected by mutex where only the backend owning
| * the slot is authorized to update the fields from its own slot.  The
| * backend owning the slot does not need to take this lock when reading its
| * own fields, while concurrent backends not owning this slot should take the
| * lock when reading this slot's data.
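
As a sketch of that convention (illustration only; new_restart_lsn and
observed_lsn are hypothetical locals):

    /* The owning backend updates its slot's fields under the per-slot
     * spinlock. */
    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.restart_lsn = new_restart_lsn;
    SpinLockRelease(&MyReplicationSlot->mutex);

    /* A backend that does not own slot "s" must take the same spinlock
     * even just to read those fields. */
    SpinLockAcquire(&s->mutex);
    observed_lsn = s->data.restart_lsn;
    SpinLockRelease(&s->mutex);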

> If my understanding is right, I'd like to propose the attached patch.
> It introduces DeactivateReplicationSlot() and replace the "for" loop
> in InvalidateObsoleteReplicationSlots() with
> it. ReplicationSlotAcquire()
> and ReplicationSlotRelease() are no longer called there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-17, Fujii Masao wrote:

> While reading InvalidateObsoleteReplicationSlots() code, I found another 
> issue.
> 
>   ereport(LOG,
>           (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
>                   wspid, NameStr(slotname))));
>   (void) kill(wspid, SIGTERM);
> 
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.

Good point.
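
One possible rewording, sketched only (the final message text is not
settled in this thread):

    /* Avoid claiming the victim is a walsender; it may be a regular backend. */
    ereport(LOG,
            (errmsg("terminating process %d because replication slot \"%s\" is too far behind",
                    wspid, NameStr(slotname))));
    (void) kill(wspid, SIGTERM);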

>   int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);
> 
> Why do we need to call ReplicationSlotAcquire() here and mark the slot as
> used by the checkpointer? Isn't it enough to check directly the slot's
> active_pid, instead?
> 
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to call
> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
> active_pid is zero. No?

I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.
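
In other words, the expected pattern is roughly the following (a sketch
using the interfaces quoted above, not the actual patch):

    /* Ownership rule: acquire the slot, modify it, release it.  With
     * SAB_Inquire, a zero return means we now own the slot. */
    if (ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire) == 0)
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.restart_lsn = InvalidXLogRecPtr;
        SpinLockRelease(&MyReplicationSlot->mutex);

        ReplicationSlotRelease();
    }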


> + /*
> +  * Signal to terminate the process using the replication slot.
> +  *
> +  * Try to signal every 100ms until it succeeds.
> +  */
> + if (!killed && kill(active_pid, SIGTERM) == 0)
> + killed = true;
> + ConditionVariableTimedSleep(&slot->active_cv, 100,
> +                             WAIT_EVENT_REPLICATION_SLOT_DROP);
> + } while (ReplicationSlotIsActive(slot, NULL));

Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that.  On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-16 Thread Alvaro Herrera
On 2020-Jun-16, Kyotaro Horiguchi wrote:

> I noticed another issue. If some required WALs are removed, the
> slot will be "invalidated", that is, restart_lsn is set to an invalid
> value. As a result we hardly ever see the "lost" state.
> 
> It can be "fixed" by remembering the validity of a slot separately
> from restart_lsn. Is that worth doing?

We discussed this before.  I agree it would be better to do this
in some way, but I fear that if we do it naively, some code might exist
that reads the LSN without realizing that it needs to check the validity
flag first.

On the other hand, maybe this is not a problem in practice, because if
such a bug occurs, what will happen is that trying to read WAL from such
a slot will return the error message that the WAL file cannot be found.
Maybe this is acceptable?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Review for GetWALAvailability()

2020-06-16 Thread Fujii Masao



On 2020/06/16 14:00, Kyotaro Horiguchi wrote:

At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao  
wrote in

In short, it is known behavior, but it was judged not worthwhile to
prevent it.
That can happen when the checkpointer removes up to the segment that is
being read by a walsender.  I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.
During development, I once added code to walsender to exit for that
reason, but it was finally moved into InvalidateObsoleteReplicationSlots
as a somewhat different function.


BTW, I read the code of InvalidateObsoleteReplicationSlots() and
probably
found some issues in it.

1. Each cycle of the "for" loop in
InvalidateObsoleteReplicationSlots()
 emits the log message  "terminating walsender ...". This means that
 if it takes more than 10ms for walsender to exit after it's signaled,
 the second and subsequent cycles would happen and output the same
 log message several times. IMO that log message should be output
 only once.


Sounds reasonable.


2. InvalidateObsoleteReplicationSlots() uses the loop to scan
replication 
 slots array and uses the "for" loop in each scan. Also it calls
 ReplicationSlotAcquire() for each "for" loop cycle, and
 ReplicationSlotAcquire() uses another loop to scan replication slots
 array. I don't think this is good design.

 ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
 InvalidateObsoleteReplicationSlots() already knows the index of the
 slot
 that we want to find. The attached patch does that. Thought?


The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot.  The only caller of the function
with SAB_Block is ReplicationSlotDrop and I can't come up with a case
where another slot with the same name is created at a different place
before the condition variable fires. But I'm not sure the change is
completely safe.


Yes, that change might not be safe. So I'm thinking another approach to
fix the issues.


Maybe some assertion is needed?


3. There is a corner case where the termination of walsender cleans up
 the temporary replication slot while
 InvalidateObsoleteReplicationSlots()
 is sleeping on ConditionVariableTimedSleep(). In this case,
 ReplicationSlotAcquire() is called in the subsequent cycle of the
 "for"
 loop, cannot find the slot and then emits ERROR message. This leads
 to the failure of checkpoint by the checkpointer.


Agreed.


 To avoid this case, if SAB_Inquire is specified,
 ReplicationSlotAcquire()
 should return the special value instead of emitting ERROR even when
 it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
 should
 handle that special returned value.


I thought the same thing hearing that.


While reading InvalidateObsoleteReplicationSlots() code, I found another issue.

ereport(LOG,
        (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
                wspid, NameStr(slotname))));
(void) kill(wspid, SIGTERM);

wspid indicates the PID of process using the slot. That process can be
a backend, for example, executing pg_replication_slot_advance().
So "walsender" in the above log message is not always correct.



int wspid = ReplicationSlotAcquire(NameStr(slotname), SAB_Inquire);

Why do we need to call ReplicationSlotAcquire() here and mark the slot as
used by the checkpointer? Isn't it enough to check directly the slot's
active_pid, instead?

Maybe ReplicationSlotAcquire() is necessary because
ReplicationSlotRelease() is called later? If so, why do we need to call
ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
active_pid is zero. No?

If my understanding is right, I'd like to propose the attached patch.
It introduces DeactivateReplicationSlot() and replaces the "for" loop
in InvalidateObsoleteReplicationSlots() with it. ReplicationSlotAcquire()
and ReplicationSlotRelease() are no longer called there.
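
A rough sketch of what such a helper might look like (hypothetical; the
real proposal is in the attached patch, and the ReplicationSlotIsActive
signature follows the code quoted earlier in the thread):

    /*
     * Wait until the slot is no longer in use, terminating the current
     * user if needed, without acquiring the slot ourselves.
     */
    static void
    DeactivateReplicationSlot(ReplicationSlot *slot)
    {
        int     active_pid;
        bool    killed = false;

        ConditionVariablePrepareToSleep(&slot->active_cv);
        while (ReplicationSlotIsActive(slot, &active_pid))
        {
            if (!killed && kill(active_pid, SIGTERM) == 0)
                killed = true;
            ConditionVariableTimedSleep(&slot->active_cv, 100,
                                        WAIT_EVENT_REPLICATION_SLOT_DROP);
        }
        ConditionVariableCancelSleep();
    }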

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 505445f2dc..b89b6da768 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -99,6 +99,8 @@ ReplicationSlot 

Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao  
wrote in 
> > In short, it is known behavior, but it was judged not worthwhile to
> > prevent it.
> > That can happen when the checkpointer removes up to the segment that is
> > being read by a walsender.  I think that doesn't happen (or happens
> > within a narrow time window?) for physical replication but does happen
> > for logical replication.
> > During development, I once added code to walsender to exit for that
> > reason, but it was finally moved into InvalidateObsoleteReplicationSlots
> > as a somewhat different function.
> 
> BTW, I read the code of InvalidateObsoleteReplicationSlots() and
> probably
> found some issues in it.
> 
> 1. Each cycle of the "for" loop in
> InvalidateObsoleteReplicationSlots()
> emits the log message  "terminating walsender ...". This means that
> if it takes more than 10ms for walsender to exit after it's signaled,
> the second and subsequent cycles would happen and output the same
> log message several times. IMO that log message should be output
> only once.

Sounds reasonable.

> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> replication 
> slots array and uses the "for" loop in each scan. Also it calls
> ReplicationSlotAcquire() for each "for" loop cycle, and
> ReplicationSlotAcquire() uses another loop to scan replication slots
> array. I don't think this is good design.
> 
> ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
> InvalidateObsoleteReplicationSlots() already know the index of the
> slot
> that we want to find. The attached patch does that. Thought?

The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.

The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot.  The only caller of the function
with SAB_Block is ReplicationSlotDrop and I can't come up with a case
where another slot with the same name is created at a different place
before the condition variable fires. But I'm not sure the change is
completely safe. Maybe some assertion is needed?
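
For instance, something like this near the recheck (illustration only;
s and name are the slot and name from the acquiring function):

    /* After being woken up we only recheck the same slot; make explicit
     * the assumption that it has not been reused under a different name. */
    Assert(!s->in_use || strcmp(name, NameStr(s->data.name)) == 0);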

> 3. There is a corner case where the termination of walsender cleans up
> the temporary replication slot while
> InvalidateObsoleteReplicationSlots()
> is sleeping on ConditionVariableTimedSleep(). In this case,
> ReplicationSlotAcquire() is called in the subsequent cycle of the
> "for"
> loop, cannot find the slot and then emits ERROR message. This leads
> to the failure of checkpoint by the checkpointer.

Agreed.

> To avoid this case, if SAB_Inquire is specified,
> ReplicationSlotAcquire()
> should return the special value instead of emitting ERROR even when
> it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
> should
> handle that special returned value.

I thought the same thing hearing that. 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Mon, 15 Jun 2020 18:59:49 +0900, Fujii Masao  
wrote in 
> > It was a kind of hard to decide. Even when max_slot_wal_keep_size is
> > smaller than max_wal_size, the segments more than
> > max_slot_wal_keep_size are not guaranteed to be kept.  In that case
> > the state transits as NORMAL->LOST skipping the "RESERVED" state.
> > Putting aside whether the setting is useful or not, I thought that the
> > state transition is somewhat abrupt.
> 
> IMO the direct transition of the state from normal to lost is ok to me
> if each state is clearly defined.
> 
> >> Or, if that condition is really necessary, the document should be
> >> updated so that the note about the condition is added.
> > Does the following make sense?
> > https://www.postgresql.org/docs/13/view-pg-replication-slots.html
> > normal means that the claimed files are within max_wal_size.
> > + If max_slot_wal_keep_size is smaller than max_wal_size, this state
> > + will not appear.
> 
> I don't think this change is enough. For example, when
> max_slot_wal_keep_size
> is smaller than max_wal_size and the amount of WAL files claimed by
> the slot
> is smaller than max_slot_wal_keep_size, "reserved" is reported. But
> which is
> inconsistent with the meaning of "reserved" in the docs.

You're right.

> To consider what should be reported in wal_status, could you tell me
> what
> purpose and how the users is expected to use this information?

I saw "reserved" as the state where slots are working to retain
segments, and "normal" as the state indicating that the WAL segments
are within max_wal_size, which is orthogonal to the notion of
"reserved".  So it seems useless to me when the retained WAL segments
cannot exceed max_wal_size.

With longer descriptions they would be:

"reserved under max_wal_size"
"reserved over max_wal_size"
"lost some segments"

Come to think of it, I realized that my trouble was just the
wording.  Do the following wordings make sense to you?

"reserved"  - retained within max_wal_size
"extended"  - retained over max_wal_size
"lost"  - lost some segments

With these wordings I can live with "not extended"=>"lost". Of course
more appropriate wordings are welcome.
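
On the reporting side that would translate into something like this (a
sketch only; walstate and walstatus are hypothetical locals, and the
WALAVAIL_* names follow the PoC patch elsewhere in the thread):

    /* Map the WAL availability states onto the proposed wal_status words. */
    switch (walstate)
    {
        case WALAVAIL_RESERVED:
            walstatus = "reserved";     /* retained, within max_wal_size */
            break;
        case WALAVAIL_EXTENDED:
            walstatus = "extended";     /* retained, beyond max_wal_size */
            break;
        case WALAVAIL_REMOVED:
            walstatus = "lost";         /* some required segments are gone */
            break;
        default:
            walstatus = NULL;           /* the slot reserves no WAL */
            break;
    }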

> Even if walsender is terminated during the state "lost", unless
> checkpointer
> removes the required WAL files, the state can go back to "reserved"
> after
> new replication connection is established. This is the same as what
> you're
> explaining at the above?

GetWALAvailability checks restart_lsn against lastRemovedSegNo, so
"lost" cannot be seen unless the checkpointer has actually removed
the segment at restart_lsn (and restart_lsn has not been invalidated).
However, walsenders are killed before those segments are actually
removed, so there are cases where a physical walreceiver reconnects
before RemoveOldXlogFiles removes all segments, which are then removed
after reconnection. "lost" can go back to "reserved" in that case. (A
physical walreceiver can connect to a slot with an invalid restart_lsn.)

I noticed another issue. If some required WALs are removed, the
slot will be "invalidated", that is, restart_lsn is set to an invalid
value. As a result we hardly ever see the "lost" state.

It can be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?
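
Concretely, the invalidation site would gain something like the following
(a sketch of the idea only; the follow-up messages instead remember the
LSN at which the invalidation happened):

    /* Mark the invalidation explicitly (hypothetical "invalidated" flag)
     * so monitoring can tell "lost" apart from a slot that never reserved
     * WAL, even after restart_lsn is cleared. */
    SpinLockAcquire(&s->mutex);
    s->data.invalidated = true;
    s->data.restart_lsn = InvalidXLogRecPtr;
    SpinLockRelease(&s->mutex);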

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Review for GetWALAvailability()

2020-06-15 Thread Fujii Masao



On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao  
wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

 some WAL files are definitely lost and this slot cannot be used to
 resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
  wal_segment_size) +
  1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.


Oops!  I don't want to believe I did that but it's definitely wrong.


The above should be the following?

 Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
 wal_keep_segments) + 1


Looks reasonable.


if ((max_slot_wal_keep_size_mb <= 0 ||
 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
  means that the claimed files are within max_wal_size". So whatever
  max_slot_wal_keep_size value is, IMO that "normal" should be
  reported if the WAL files claimed by the slot are within max_wal_size.
  Thought?


It was a kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments more than
max_slot_wal_keep_size are not guaranteed to be kept.  In that case
the state transits as NORMAL->LOST skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.


Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.


Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.


If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?


In short, it is known behavior, but it was judged not worthwhile to
prevent it.

That can happen when the checkpointer removes up to the segment that is
being read by a walsender.  I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.

During development, I once added code to walsender to exit for that
reason, but it was finally moved into InvalidateObsoleteReplicationSlots
as a somewhat different function.


BTW, I read the code of InvalidateObsoleteReplicationSlots() and probably
found some issues in it.

1. Each cycle of the "for" loop in InvalidateObsoleteReplicationSlots()
emits the log message  "terminating walsender ...". This means that
if it takes more than 10ms for walsender to exit after it's signaled,
the second and subsequent cycles would happen and output the same
log message several times. IMO that log message should be output
only once.

2. InvalidateObsoleteReplicationSlots() uses the loop to scan replication
slots array and uses the "for" loop in each scan. Also it calls
ReplicationSlotAcquire() for each "for" loop cycle, and
ReplicationSlotAcquire() uses another loop to scan replication slots
array. I don't think this is good design.

ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
InvalidateObsoleteReplicationSlots() already knows the index of the slot
that we want to find. The attached patch does that. Thought?

3. There is a corner case where the termination of walsender cleans up
the temporary replication slot while InvalidateObsoleteReplicationSlots()
is sleeping on ConditionVariableTimedSleep(). In this case,
ReplicationSlotAcquire() is called in the subsequent cycle of the "for"
loop, cannot find the slot and then emits ERROR message. This leads
to the failure of checkpoint by the checkpointer.

To avoid this case, if SAB_Inquire is specified, ReplicationSlotAcquire()
should return the special value instead of emitting ERROR even when
it cannot find the slot. Also 

Re: Review for GetWALAvailability()

2020-06-15 Thread Fujii Masao




On 2020/06/15 13:42, Kyotaro Horiguchi wrote:

At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao  
wrote in

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

 some WAL files are definitely lost and this slot cannot be used to
 resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
  wal_segment_size) +
  1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of
them.


Oops!  I don't want to believe I did that but it's definitely wrong.


The above should be the following?

 Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
 wal_keep_segments) + 1


Looks reasonable.


if ((max_slot_wal_keep_size_mb <= 0 ||
 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
  means that the claimed files are within max_wal_size". So whatever
  max_slot_wal_keep_size value is, IMO that "normal" should be
  reported if the WAL files claimed by the slot are within max_wal_size.
  Thought?


It was a kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments more than
max_slot_wal_keep_size are not guaranteed to be kept.  In that case
the state transits as NORMAL->LOST skipping the "RESERVED" state.
Putting aside whether the setting is useful or not, I thought that the
state transition is somewhat abrupt.


IMO the direct transition of the state from normal to lost is ok to me
if each state is clearly defined.



Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.


Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.


I don't think this change is enough. For example, when max_slot_wal_keep_size
is smaller than max_wal_size and the amount of WAL files claimed by the slot
is smaller than max_slot_wal_keep_size, "reserved" is reported. But that is
inconsistent with the meaning of "reserved" in the docs.

To consider what should be reported in wal_status, could you tell me for what
purpose and how users are expected to use this information?



If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?


In short, it is known behavior, but it was judged not worthwhile to
prevent it.

That can happen when the checkpointer removes up to the segment that is
being read by a walsender.  I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.

During development, I once added code to walsender to exit for that
reason, but it was finally moved into InvalidateObsoleteReplicationSlots
as a somewhat different function.  With the current mechanism, there's a
case where a once-invalidated slot comes back to life, but we decided to
allow that behavior and forgot to document it.

Anyway, if you see "lost", something bad is happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files are removed and this slot is
+ no longer usable after once disconnected during this status.

If it is crucial that the "lost" state may come back to reserved or
normal state,

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just removed segment
+ before being terminated.

There is a case where the state moves back to the reserved or normal state
when a walsender leaves the just-removed segment before being terminated.


Even if walsender is terminated during the state "lost", unless checkpointer
removes the required WAL files, the state can go back to "reserved" after
new replication connection is established. This is the 

Re: Review for GetWALAvailability()

2020-06-15 Thread Kyotaro Horiguchi
At Mon, 15 Jun 2020 13:42:25 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Oops!  I don't want to believe I did that but it's definitely wrong.

Hmm. Quite disappointing. The patch was just crap.
This is the right patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..199053dd4a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11240,18 +11240,25 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx

 
  normal means that the claimed files
-  are within max_wal_size.
+  are within max_wal_size. If
+  max_slot_wal_keep_size is smaller than
+  max_wal_size, this state will not appear.
 
 
  reserved means
   that max_wal_size is exceeded but the files are
   still held, either by some replication slot or
-  by wal_keep_segments.
+  by wal_keep_segments.
+  
 
 
- lost means that some WAL files are
-  definitely lost and this slot cannot be used to resume replication
-  anymore.
+  
+lost means that some required WAL files are
+			removed and this slot is no longer usable after once disconnected
+			during this state. Note that there are cases where the state moves
+			back to reserved or normal state when all wal senders have left
+			the just removed segment before being terminated.
+  
 

The last two states are seen only when
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..d6fe205eb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9528,8 +9528,8 @@ GetWALAvailability(XLogRecPtr targetLSN)
 
 	/* calculate oldest segment by max_wal_size and wal_keep_segments */
 	XLByteToSeg(currpos, currSeg, wal_segment_size);
-	keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
-			  wal_segment_size) + 1;
+	keepSegs = Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
+   wal_keep_segments) + 1;
 
 	if (currSeg > keepSegs)
 		oldestSegMaxWalSize = currSeg - keepSegs;


Re: Review for GetWALAvailability()

2020-06-14 Thread Kyotaro Horiguchi
At Sat, 13 Jun 2020 01:38:49 +0900, Fujii Masao  
wrote in 
> Hi,
> 
> The document explains that "lost" value that
> pg_replication_slots.wal_status reports means
> 
> some WAL files are definitely lost and this slot cannot be used to
> resume replication anymore.
> 
> However, I observed "lost" value while inserting lots of records,
> but replication could continue normally. So I wonder if
> pg_replication_slots.wal_status may have a bug.
> 
> wal_status is calculated in GetWALAvailability(), and probably I found
> some issues in it.
> 
> 
>   keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
> wal_segment_size) +
> 1;
> 
> max_wal_size_mb is the number of megabytes. wal_keep_segments is
> the number of WAL segment files. So it's strange to calculate max of
> them.

Oops!  I don't want to believe I did that but it's definitely wrong.

> The above should be the following?
> 
> Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
> wal_keep_segments) + 1

Looks reasonable.

>   if ((max_slot_wal_keep_size_mb <= 0 ||
>max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
>   oldestSegMaxWalSize <= targetSeg)
>   return WALAVAIL_NORMAL;
> 
> This code means that wal_status reports "normal" only when
> max_slot_wal_keep_size is negative or larger than max_wal_size.
> Why is this condition necessary? The document explains "normal
>  means that the claimed files are within max_wal_size". So whatever
>  max_slot_wal_keep_size value is, IMO that "normal" should be
>  reported if the WAL files claimed by the slot are within max_wal_size.
>  Thought?

It was kind of hard to decide. Even when max_slot_wal_keep_size is
smaller than max_wal_size, the segments beyond max_slot_wal_keep_size
are not guaranteed to be kept.  In that case the state transitions as
NORMAL->LOST, skipping the "RESERVED" state.  Putting aside whether the
setting is useful or not, I thought that the state transition is
somewhat abrupt.

> Or, if that condition is really necessary, the document should be
> updated so that the note about the condition is added.

Does the following make sense?

https://www.postgresql.org/docs/13/view-pg-replication-slots.html

normal means that the claimed files are within max_wal_size.
+ If max_slot_wal_keep_size is smaller than max_wal_size, this state
+ will not appear.

> If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
> but any those WAL files have not been removed yet, wal_status seems
> to report "lost". Is this expected behavior? Per the meaning of "lost"
> described in the document, "lost" should be reported only when
> any claimed files are removed, I think. Thought?
> 
> Or this behavior is expected and the document is incorrect?

In short, it is known behavior, but it was judged not worthwhile to
prevent it.

That can happen when the checkpointer removes up to the segment that is
being read by a walsender.  I think that doesn't happen (or happens
within a narrow time window?) for physical replication but does happen
for logical replication.

During development, I once added code to walsender to exit for that
reason, but it was finally moved into InvalidateObsoleteReplicationSlots
as a somewhat different function.  With the current mechanism, there's a
case where a once-invalidated slot comes back to life, but we decided to
allow that behavior and forgot to document it.

Anyway, if you see "lost", something bad is happening.

- lost means that some WAL files are definitely lost and this slot
- cannot be used to resume replication anymore.
+ lost means that some required WAL files are removed and this slot is
+ no longer usable after once disconnected during this status.

If it is crucial that the "lost" state may come back to reserved or
normal state, 

+ Note that there are cases where the state moves back to reserved or
+ normal state when all wal senders have left the just removed segment
+ before being terminated.

There is a case where the state moves back to the reserved or normal state
when a walsender leaves the just-removed segment before being terminated.

> BTW, if we want to implement GetWALAvailability() as the document
> advertises, we can simplify it like the attached POC patch.

I'm not sure it is right that the patch removes wal_keep_segments from
the function.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..199053dd4a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11240,18 +11240,25 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx

 
  normal means that the claimed files
-  are within max_wal_size.
+  are within max_wal_size. If
+  max_slot_wal_keep_size is smaller than
+   

Review for GetWALAvailability()

2020-06-12 Thread Fujii Masao

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to resume 
replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
  wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of them.
The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), wal_keep_segments) + 1
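
To make the unit mismatch concrete, here is a standalone sketch with
made-up settings (ConvertToXSegs and Max are simplified stand-ins for the
real macros, and the numbers are only illustrative):

    #include <stdio.h>

    /* Simplified stand-ins for the real macros (assumption for this sketch). */
    #define ConvertToXSegs(mb, segsize) \
        (((unsigned long long) (mb)) * 1024 * 1024 / (segsize))
    #define Max(a, b)   ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
        unsigned long long wal_segment_size = 16 * 1024 * 1024; /* 16MB segments */
        unsigned long long max_wal_size_mb = 1024;              /* 1GB */
        unsigned long long wal_keep_segments = 80;

        /* buggy: takes the max of megabytes and a segment count, then converts */
        unsigned long long buggy =
            ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                           wal_segment_size) + 1;

        /* intended: convert first, then take the max of two segment counts */
        unsigned long long fixed =
            Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size),
                wal_keep_segments) + 1;

        printf("buggy = %llu segments, fixed = %llu segments\n", buggy, fixed);
        /* prints: buggy = 65 segments, fixed = 81 segments */
        return 0;
    }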



if ((max_slot_wal_keep_size_mb <= 0 ||
 max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
 means that the claimed files are within max_wal_size". So whatever
 max_slot_wal_keep_size value is, IMO that "normal" should be
 reported if the WAL files claimed by the slot are within max_wal_size.
 Thought?

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.



If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?



BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 55cac186dc..0b9cca2173 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9504,62 +9504,29 @@ GetWALAvailability(XLogRecPtr targetLSN)
XLogSegNo   currSeg;/* segid of currpos */
XLogSegNo   targetSeg;  /* segid of targetLSN */
XLogSegNo   oldestSeg;  /* actual oldest segid */
-   XLogSegNo   oldestSegMaxWalSize;    /* oldest segid kept by max_wal_size */
-   XLogSegNo   oldestSlotSeg = InvalidXLogRecPtr;  /* oldest segid kept by
-                                                    * slot */
uint64  keepSegs;
 
/* slot does not reserve WAL. Either deactivated, or has never been 
active */
if (XLogRecPtrIsInvalid(targetLSN))
return WALAVAIL_INVALID_LSN;
 
-   currpos = GetXLogWriteRecPtr();
-
/* calculate oldest segment currently needed by slots */
XLByteToSeg(targetLSN, targetSeg, wal_segment_size);
-   KeepLogSeg(currpos, );
 
-   /*
-    * Find the oldest extant segment file. We get 1 until checkpoint removes
-    * the first WAL segment file since startup, which causes the status being
-    * wrong under certain abnormal conditions but that doesn't actually harm.
-    */
-   oldestSeg = XLogGetLastRemovedSegno() + 1;
+   /* Find the oldest extant segment file */
+   oldestSeg = XLogGetLastRemovedSegno();
 
-   /* calculate oldest segment by max_wal_size and wal_keep_segments */
+   if (targetSeg <= oldestSeg)
+   return WALAVAIL_REMOVED;
+
+   currpos = GetXLogWriteRecPtr();
XLByteToSeg(currpos, currSeg, wal_segment_size);
-   keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
- wal_segment_size) + 1;
+   keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size);
 
-   if (currSeg > keepSegs)
-   oldestSegMaxWalSize = currSeg - keepSegs;
-   else
-   oldestSegMaxWalSize = 1;
-
-   /*
-    * If max_slot_wal_keep_size has changed after the last call, the segment
-    * that would been kept by the current setting might have been lost by the
-    * previous setting. No point in showing normal or keeping status values
-    * if the targetSeg is known to be lost.
-    */
-   if (targetSeg >= oldestSeg)
-   {
-   /*
-* show "normal" when targetSeg is within max_wal_size, even if
-*