On Fri, Mar 25, 2016 at 9:20 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:29 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Also I felt a sense of discomfort regarding using [ and ] as special
>> characters for the priority method.
>> Because (, ) and [, ] look rather similar to each other, it would be
>> easy to make syntax errors once the nested style is supported.
>> And the synopsis of that in documentation is odd;
>>     synchronous_standby_names = 'N [ node_name [, ...] ]'
>>
>> This topic has been already discussed before but, we might want to
>> change it to other characters such as < and >?
>
> I personally would recommend against <>.  Those should mean less-than
> and greater-than, not grouping.  I think you could use parentheses,
> ().  There's nothing saying that has to mean any particular thing, so
> you may as well use it for the first thing implemented, perhaps.  Or
> you could use [] or {}.  It *is* important that you don't create
> confusing syntax summaries, but I don't think that's a reason to pick
> a nonstandard syntax for grouping.
>

I agree with you.
I've changed it to use parentheses.
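
For example, with three standbys whose application_name values are
node_a, node_b and node_c (hypothetical names, just for illustration),
the setting would look like this:

    synchronous_standby_names = '2 (node_a, node_b, node_c)'

With this, commits wait for the two highest-priority standbys that are
connected and streaming, i.e. node_a and node_b; if node_a disconnects,
node_b and node_c become the synchronous standbys. A plain list such as
'node_a, node_b' is still accepted and is parsed the same as
'1 (node_a, node_b)'.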

Regards,

--
Masahiko Sawada
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d48a13f..1650b6d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2902,20 +2902,18 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Specifies a comma-separated list of standby names that can support
-        <firstterm>synchronous replication</>, as described in
-        <xref linkend="synchronous-replication">.
-        At any one time there will be at most one active synchronous standby;
-        transactions waiting for commit will be allowed to proceed after
-        this standby server confirms receipt of their data.
-        The synchronous standby will be the first standby named in this list
-        that is both currently connected and streaming data in real-time
-        (as shown by a state of <literal>streaming</literal> in the
+        Specifies the standby names that can support <firstterm>synchronous replication</>
+        using either of two syntaxes: a comma-separated list, or a more flexible syntax
+        described in <xref linkend="dedicated-language-for-multi-sync-replication">.
+        Transactions waiting for commit will be allowed to proceed after a
+        configurable subset of standby servers confirms receipt of their data.
+        For the simple comma-separated list syntax, it is one server.
+        The synchronous standbys will be those named in this parameter that are both
+        currently connected and streaming data in real-time (as shown by a state
+        of <literal>streaming</> in the
         <link linkend="monitoring-stats-views-table">
         <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys.
-        If the current synchronous standby disconnects for whatever reason,
+        If any of the current synchronous standbys disconnects for whatever reason,
         it will be replaced immediately with the next-highest-priority standby.
         Specifying more than one standby name can allow very high availability.
        </para>
@@ -2923,9 +2921,10 @@ include_dir 'conf.d'
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
         <varname>primary_conninfo</> of the standby's WAL receiver.  There is
-        no mechanism to enforce uniqueness. In case of duplicates one of the
-        matching standbys will be chosen to be the synchronous standby, though
-        exactly which one is indeterminate.
+        no mechanism to enforce uniqueness. For each specified standby name,
+        only the specified count of standbys will be chosen to be synchronous
+        standbys, though exactly which ones is indeterminate.  The rest will
+        represent potential synchronous standbys.
         The special entry <literal>*</> matches any
         <varname>application_name</>, including the default application name
         of <literal>walreceiver</>.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d613e..5dd9fab 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1027,24 +1027,27 @@ primary_slot_name = 'node_a_slot'
 
    <para>
     Synchronous replication offers the ability to confirm that all changes
-    made by a transaction have been transferred to one synchronous standby
-    server. This extends the standard level of durability
+    made by a transaction have been transferred to one or more synchronous
+    standby servers. This extends the standard level of durability
     offered by a transaction commit. This level of protection is referred
-    to as 2-safe replication in computer science theory.
+    to as 2-safe replication in computer science theory, and group-1-safe
+    (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
+    more than <literal>remote_write</>.
    </para>
 
    <para>
     When requesting synchronous replication, each commit of a
     write transaction will wait until confirmation is
     received that the commit has been written to the transaction log on disk
-    of both the primary and standby server. The only possibility that data
-    can be lost is if both the primary and the standby suffer crashes at the
+    of both the primary and standby servers. The only possibility that data
+    can be lost is if both the primary and the standbys suffer crashes at the
     same time. This can provide a much higher level of durability, though only
-    if the sysadmin is cautious about the placement and management of the two
+    if the sysadmin is cautious about the placement and management of these
     servers.  Waiting for confirmation increases the user's confidence that the
     changes will not be lost in the event of server crashes but it also
     necessarily increases the response time for the requesting transaction.
-    The minimum wait time is the roundtrip time between primary to standby.
+    The minimum wait time is the roundtrip time between the primary and the
+    slowest synchronous standby.
    </para>
 
    <para>
@@ -2327,4 +2330,50 @@ LOG:  database system is ready to accept read only connections
 
  </sect1>
 
+ <sect1 id="dedicated-language-for-multi-sync-replication">
+  <title>Dedicated language for multiple synchronous replication</title>
+
+  <indexterm zone="high-availability">
+   <primary>Dedicated language for multiple synchronous replication</primary>
+  </indexterm>
+
+  <sect2 id="dedicated-language-for-multi-sync-replication-description">
+   <title>Description</title>
+   <para>
+    Multiple synchronous replication is set up flexibly by setting
+    <xref linkend="guc-synchronous-standby-names"> using the following syntax.
+   </para>
+
+   <synopsis>
+    synchronous_standby_names = '<replaceable class="PARAMETER">N</replaceable> ( <replaceable class="PARAMETER">node_name</replaceable> [, ...] )'
+   </synopsis>
+
+   <para>
+    This syntax defines a synchronous group: commits wait for confirmation from at
+    least N standbys in the group, whose members are given as a comma-separated list enclosed in
+    parentheses.  The special value <literal>*</> for a server name matches any standby.
+    When the list of group members is enclosed in parentheses, synchronous standbys are chosen from
+    that group using the priority method.
+   </para>
+
+   <note>
+    <para>
+     All ASCII characters except for special characters (',', '&quot;', '(', ')', ' ') are
+     allowed in unquoted standby names.  To use these special characters, the standby
+     name should be enclosed in double quotes.
+    </para>
+   </note>
+  </sect2>
+
+  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
+   <title>Priority Method</title>
+   <para>
+    The synchronous priority is given to standby servers in the order that they appear in the list.
+    The first named server has the highest priority. The priority method chooses synchronous standbys
+    from group members using the synchronous priority of each standby. The master server ensures that
+    modified data will be replicated to the N highest-priority standbys at that moment.
+   </para>
+  </sect2>
+ </sect1>
+
 </chapter>
diff --git a/src/backend/Makefile b/src/backend/Makefile
index d22dbbf..ec2dc7b 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -203,7 +203,7 @@ distprep:
 	$(MAKE) -C parser	gram.c gram.h scan.c
 	$(MAKE) -C bootstrap	bootparse.c bootscanner.c
 	$(MAKE) -C catalog	schemapg.h postgres.bki postgres.description postgres.shdescription
-	$(MAKE) -C replication	repl_gram.c repl_scanner.c
+	$(MAKE) -C replication	repl_gram.c repl_scanner.c syncrep_gram.c syncrep_scanner.c
 	$(MAKE) -C storage/lmgr	lwlocknames.h
 	$(MAKE) -C utils	fmgrtab.c fmgroids.h errcodes.h
 	$(MAKE) -C utils/misc	guc-file.c
@@ -320,6 +320,8 @@ maintainer-clean: distclean
 	      catalog/postgres.shdescription \
 	      replication/repl_gram.c \
 	      replication/repl_scanner.c \
+	      replication/syncrep_gram.c \
+	      replication/syncrep_scanner.c \
 	      storage/lmgr/lwlocknames.c \
 	      storage/lmgr/lwlocknames.h \
 	      utils/fmgroids.h \
diff --git a/src/backend/replication/.gitignore b/src/backend/replication/.gitignore
index 2a0491d..d1df614 100644
--- a/src/backend/replication/.gitignore
+++ b/src/backend/replication/.gitignore
@@ -1,2 +1,4 @@
 /repl_gram.c
 /repl_scanner.c
+/syncrep_gram.c
+/syncrep_scanner.c
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index b73370e..c99717e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
-	repl_gram.o slot.o slotfuncs.o syncrep.o
+	repl_gram.o slot.o slotfuncs.o syncrep.o syncrep_gram.o
 
 SUBDIRS = logical
 
@@ -24,5 +24,10 @@ include $(top_srcdir)/src/backend/common.mk
 # repl_scanner is compiled as part of repl_gram
 repl_gram.o: repl_scanner.c
 
-# repl_gram.c and repl_scanner.c are in the distribution tarball, so
-# they are not cleaned here.
+# syncrep_scanner is compiled as part of syncrep_gram
+syncrep_gram.o: syncrep_scanner.c
+syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEX_NO_BACKUP=yes
+
+# repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
+# are in the distribution tarball, so they are not cleaned here.
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 92faf4e..ba95e67 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -29,11 +29,12 @@
  * single ordered queue of waiting backends, so that we can avoid
  * searching the through all waiters each time we receive a reply.
  *
- * In 9.1 we support only a single synchronous standby, chosen from a
- * priority list of synchronous_standby_names. Before it can become the
- * synchronous standby it must have caught up with the primary; that may
- * take some time. Once caught up, the current highest priority standby
- * will release waiters from the queue.
+ * In 9.6 we support multiple synchronous standbys, chosen from a
+ * priority list of synchronous_standby_names. Before they can become the
+ * synchronous standbys they must have caught up with the primary; that may
+ * take some time. Once caught up, the current higher priority standbys
+ * which are considered as synchronous at that moment will release
+ * waiters from the queue.
  *
  * Portions Copyright (c) 2010-2016, PostgreSQL Global Development Group
  *
@@ -65,12 +66,15 @@ char	   *SyncRepStandbyNames;
 
 static bool announce_next_takeover = true;
 
+SyncRepConfigData *SyncRepConfig;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
+static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
+									   XLogRecPtr *flushPtr, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
 
 #ifdef USE_ASSERT_CHECKING
@@ -332,6 +336,10 @@ SyncRepInitConfig(void)
 {
 	int			priority;
 
+	/* Update the config data of synchronous replication */
+	SyncRepFreeConfig(SyncRepConfig);
+	SyncRepUpdateConfig();
+
 	/*
 	 * Determine if we are a potential sync standby and remember the result
 	 * for handling replies from standby.
@@ -349,62 +357,8 @@ SyncRepInitConfig(void)
 }
 
 /*
- * Find the WAL sender servicing the synchronous standby with the lowest
- * priority value, or NULL if no synchronous standby is connected. If there
- * are multiple standbys with the same lowest priority value, the first one
- * found is selected. The caller must hold SyncRepLock.
- */
-WalSnd *
-SyncRepGetSynchronousStandby(void)
-{
-	WalSnd	   *result = NULL;
-	int			result_priority = 0;
-	int			i;
-
-	for (i = 0; i < max_wal_senders; i++)
-	{
-		/* Use volatile pointer to prevent code rearrangement */
-		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
-		int			this_priority;
-
-		/* Must be active */
-		if (walsnd->pid == 0)
-			continue;
-
-		/* Must be streaming */
-		if (walsnd->state != WALSNDSTATE_STREAMING)
-			continue;
-
-		/* Must be synchronous */
-		this_priority = walsnd->sync_standby_priority;
-		if (this_priority == 0)
-			continue;
-
-		/* Must have a lower priority value than any previous ones */
-		if (result != NULL && result_priority <= this_priority)
-			continue;
-
-		/* Must have a valid flush position */
-		if (XLogRecPtrIsInvalid(walsnd->flush))
-			continue;
-
-		result = (WalSnd *) walsnd;
-		result_priority = this_priority;
-
-		/*
-		 * If priority is equal to 1, there cannot be any other WAL senders
-		 * with a lower priority, so we're done.
-		 */
-		if (this_priority == 1)
-			return result;
-	}
-
-	return result;
-}
-
-/*
  * Update the LSNs on each queue based upon our latest state. This
- * implements a simple policy of first-valid-standby-releases-waiter.
+ * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
@@ -413,7 +367,10 @@ void
 SyncRepReleaseWaiters(void)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
-	WalSnd	   *syncWalSnd;
+	XLogRecPtr	writePtr;
+	XLogRecPtr	flushPtr;
+	bool		got_oldest;
+	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
 
@@ -429,22 +386,37 @@ SyncRepReleaseWaiters(void)
 		return;
 
 	/*
-	 * We're a potential sync standby. Release waiters if we are the highest
-	 * priority standby.
+	 * We're a potential sync standby. Release waiters if there are
+	 * enough sync standbys and we are considered as sync.
 	 */
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-	syncWalSnd = SyncRepGetSynchronousStandby();
 
-	/* We should have found ourselves at least */
-	Assert(syncWalSnd != NULL);
+	/*
+	 * Check whether we are a sync standby or not, and calculate
+	 * the oldest positions among all sync standbys.
+	 */
+	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr, &am_sync);
+
+	/*
+	 * If we are managing the sync standby, though we weren't
+	 * prior to this, then announce we are now the sync standby.
+	 */
+	if (announce_next_takeover && am_sync)
+	{
+		announce_next_takeover = false;
+		ereport(LOG,
+				(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+						application_name, MyWalSnd->sync_standby_priority)));
+	}
 
 	/*
-	 * If we aren't managing the highest priority standby then just leave.
+	 * If the number of sync standbys is less than requested or we aren't
+	 * managing the sync standby then just leave.
 	 */
-	if (syncWalSnd != MyWalSnd)
+	if (!got_oldest || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
-		announce_next_takeover = true;
+		announce_next_takeover = !am_sync;
 		return;
 	}
 
@@ -452,34 +424,220 @@ SyncRepReleaseWaiters(void)
 	 * Set the lsn first so that when we wake backends they will release up to
 	 * this location.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
+		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
 		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
 	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
+	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
+		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
 		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
 	}
 
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
-		 numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
-	   numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+		 numwrite, (uint32) (writePtr >> 32), (uint32) writePtr,
+		 numflush, (uint32) (flushPtr >> 32), (uint32) flushPtr);
+}
+
+/*
+ * Calculate the oldest Write and Flush positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the oldest positions into *writePtr and *flushPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connected to a
+ * sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+						   bool *am_sync)
+{
+	List		*sync_standbys;
+	ListCell	*cell;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys();
+
+	/* Quick exit if there are not enough synchronous standbys */
+	if (list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		*am_sync = list_member_int(sync_standbys, MyWalSnd->slotno);
+		list_free(sync_standbys);
+		return false;
+	}
 
 	/*
-	 * If we are managing the highest priority standby, though we weren't
-	 * prior to this, then announce we are now the sync standby.
+	 * Scan through all sync standbys and calculate the oldest
+	 * Write and Flush positions.
 	 */
-	if (announce_next_takeover)
+	foreach (cell, sync_standbys)
 	{
-		announce_next_takeover = false;
-		ereport(LOG,
-				(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
-						application_name, MyWalSnd->sync_standby_priority)));
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+		XLogRecPtr	write;
+		XLogRecPtr	flush;
+
+		SpinLockAcquire(&walsnd->mutex);
+		write = walsnd->write;
+		flush = walsnd->flush;
+		SpinLockRelease(&walsnd->mutex);
+
+		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+			*writePtr = write;
+		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+			*flushPtr = flush;
+		if (walsnd == MyWalSnd)
+			*am_sync = true;
 	}
+
+	list_free(sync_standbys);
+	return true;
+}
+
+/*
+ * Return the list of sync standbys, or NIL if no sync standby is connected.
+ *
+ * If there are multiple standbys with the same priority,
+ * the first one found is considered as higher priority.
+ * The caller must hold SyncRepLock.
+ */
+List *
+SyncRepGetSyncStandbys(void)
+{
+	List	*result = NIL;
+	List	*pending = NIL;
+	int	lowest_priority;
+	int	next_highest_priority;
+	int	this_priority;
+	int	target_priority;
+	int	i;
+	WalSnd	*walsnd;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	lowest_priority = list_length(SyncRepConfig->members);
+	next_highest_priority = lowest_priority + 1;
+
+	/*
+	 * Find the sync standbys which have the highest priority (i.e., 1).
+	 * Also store all the other potential sync standbys into the pending list,
+	 * in order to scan it later and find other sync standbys from it quickly.
+	 */
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		this_priority = walsnd->sync_standby_priority;
+		if (this_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * If the priority is equal to 1, consider this standby as sync
+		 * and append it to the result list. Otherwise append this standby
+		 * to the pending list to check if it's actually sync or not later.
+		 */
+		if (this_priority == 1)
+		{
+			result = lappend_int(result, i);
+			if (list_length(result) == SyncRepConfig->num_sync)
+			{
+				list_free(pending);
+				return result;		/* Exit if got enough sync standbys */
+			}
+		}
+		else
+		{
+			pending = lappend_int(pending, i);
+
+			/*
+			 * Track the highest priority among the standbys in the pending
+			 * list, in order to use it as the starting priority for later scan
+			 * of the list. This is useful to find quickly the sync standbys
+			 * from the pending list later because we can skip unnecessary
+			 * scans for the unused priorities.
+			 */
+			if (this_priority < next_highest_priority)
+				next_highest_priority = this_priority;
+		}
+	}
+
+	/*
+	 * Consider all pending standbys as sync if the number of them plus
+	 * already-found sync ones does not exceed what the configuration requests.
+	 */
+	if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
+		return list_concat(result, pending);
+
+	/*
+	 * Find the sync standbys from the pending list.
+	 */
+	target_priority = next_highest_priority;
+	while (target_priority <= lowest_priority)
+	{
+		ListCell	*cell;
+		ListCell	*prev = NULL;
+		ListCell	*next;
+
+		next_highest_priority = lowest_priority + 1;
+
+		for (cell = list_head(pending); cell != NULL; cell = next)
+		{
+			i = lfirst_int(cell);
+			walsnd = &WalSndCtl->walsnds[i];
+
+			next = lnext(cell);
+
+			this_priority = walsnd->sync_standby_priority;
+			if (this_priority == target_priority)
+			{
+				result = lappend_int(result, i);
+				if (list_length(result) == SyncRepConfig->num_sync)
+				{
+					list_free(pending);
+					return result;		/* Exit if got enough sync standbys */
+				}
+
+				/*
+				 * Remove the entry for this sync standby from the list
+				 * to prevent us from looking at the same entry again.
+				 */
+				pending = list_delete_cell(pending, cell, prev);
+
+				continue;
+			}
+
+			if (this_priority < next_highest_priority)
+				next_highest_priority = this_priority;
+
+			prev = cell;
+		}
+
+		target_priority = next_highest_priority;
+	}
+
+	return result;
 }
 
 /*
@@ -493,8 +651,7 @@ SyncRepReleaseWaiters(void)
 static int
 SyncRepGetStandbyPriority(void)
 {
-	char	   *rawstring;
-	List	   *elemlist;
+	List	   *members;
 	ListCell   *l;
 	int			priority = 0;
 	bool		found = false;
@@ -506,20 +663,11 @@ SyncRepGetStandbyPriority(void)
 	if (am_cascading_walsender)
 		return 0;
 
-	/* Need a modifiable copy of string */
-	rawstring = pstrdup(SyncRepStandbyNames);
-
-	/* Parse string into list of identifiers */
-	if (!SplitIdentifierString(rawstring, ',', &elemlist))
-	{
-		/* syntax error in list */
-		pfree(rawstring);
-		list_free(elemlist);
-		/* GUC machinery will have already complained - no need to do again */
+	if (!SyncStandbysDefined())
 		return 0;
-	}
 
-	foreach(l, elemlist)
+	members = SyncRepConfig->members;
+	foreach(l, members)
 	{
 		char	   *standby_name = (char *) lfirst(l);
 
@@ -533,9 +681,6 @@ SyncRepGetStandbyPriority(void)
 		}
 	}
 
-	pfree(rawstring);
-	list_free(elemlist);
-
 	return (found ? priority : 0);
 }
 
@@ -643,6 +788,43 @@ SyncRepUpdateSyncStandbysDefined(void)
 	}
 }
 
+/*
+ * Parse synchronous_standby_names and update the config data
+ * of synchronous standbys.
+ */
+void
+SyncRepUpdateConfig(void)
+{
+	bool	parse_res;
+
+	if (!SyncStandbysDefined())
+		return;
+
+	/*
+	 * check_synchronous_standby_names() verifies the setting value of
+	 * synchronous_standby_names before this function is called. So
+	 * syncrep_yyparse() must not cause an error here.
+	 */
+	parse_res = syncrep_scanstr(SyncRepStandbyNames);
+	Assert(parse_res);
+
+	SyncRepConfig = syncrep_parse_result;
+	syncrep_parse_result = NULL;
+}
+
+/*
+ * Free a previously-allocated config data of synchronous replication.
+ */
+void
+SyncRepFreeConfig(SyncRepConfigData *config)
+{
+	if (!config)
+		return;
+
+	list_free_deep(config->members);
+	pfree(config);
+}
+
 #ifdef USE_ASSERT_CHECKING
 static bool
 SyncRepQueueIsOrderedByLSN(int mode)
@@ -687,32 +869,22 @@ SyncRepQueueIsOrderedByLSN(int mode)
 bool
 check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
-	char	   *rawstring;
-	List	   *elemlist;
-
-	/* Need a modifiable copy of string */
-	rawstring = pstrdup(*newval);
-
-	/* Parse string into list of identifiers */
-	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	if (*newval != NULL && (*newval)[0] != '\0')
 	{
-		/* syntax error in list */
-		GUC_check_errdetail("List syntax is invalid.");
-		pfree(rawstring);
-		list_free(elemlist);
-		return false;
-	}
-
-	/*
-	 * Any additional validation of standby names should go here.
-	 *
-	 * Don't attempt to set WALSender priority because this is executed by
-	 * postmaster at startup, not WALSender, so the application_name is not
-	 * yet correctly set.
-	 */
+		/* Parse value */
+		if (!(syncrep_scanstr(*newval)))
+		{
+			GUC_check_errcode(ERRCODE_SYNTAX_ERROR);
+			return false;
+		}
 
-	pfree(rawstring);
-	list_free(elemlist);
+		/*
+		 * syncrep_yyparse sets the global syncrep_parse_result as a side
+		 * effect. This function only needs to validate the value, so free
+		 * the result once parsing is done.
+		 */
+		SyncRepFreeConfig(syncrep_parse_result);
+	}
 
 	return true;
 }
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
new file mode 100644
index 0000000..380fedc
--- /dev/null
+++ b/src/backend/replication/syncrep_gram.y
@@ -0,0 +1,86 @@
+%{
+/*-------------------------------------------------------------------------
+ *
+ * syncrep_gram.y				- Parser for synchronous_standby_names
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/syncrep_gram.y
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "replication/syncrep.h"
+#include "utils/formatting.h"
+
+/* Result of the parsing is returned here */
+SyncRepConfigData	*syncrep_parse_result;
+
+static SyncRepConfigData *create_syncrep_config(char *num_sync, List *members);
+
+/*
+ * Bison doesn't allocate anything that needs to live across parser calls,
+ * so we can easily have it use palloc instead of malloc.  This prevents
+ * memory leaks if we error out during parsing.  Note this only works with
+ * bison >= 2.0.  However, in bison 1.875 the default is to use alloca()
+ * if possible, so there's not really much problem anyhow, at least if
+ * you're building with gcc.
+ */
+#define YYMALLOC palloc
+#define YYFREE   pfree
+
+%}
+
+%expect 0
+%name-prefix="syncrep_yy"
+
+%union
+{
+	char	   *str;
+	List	   *list;
+	SyncRepConfigData  *config;
+}
+
+%token <str> NAME NUM
+
+%type <config> result standby_config
+%type <list> standby_list
+%type <str> standby_name
+
+%start result
+
+%%
+result:
+		standby_config				{ syncrep_parse_result = $1; }
+;
+standby_config:
+		standby_list				{ $$ = create_syncrep_config("1", $1); }
+		| NUM '(' standby_list ')'		{ $$ = create_syncrep_config($1, $3); }
+;
+standby_list:
+		standby_name				{ $$ = list_make1($1);}
+		| standby_list ',' standby_name		{ $$ = lappend($1, $3);}
+;
+standby_name:
+		NAME					{ $$ = $1; }
+		| NUM					{ $$ = $1; }
+;
+%%
+
+static SyncRepConfigData *
+create_syncrep_config(char *num_sync, List *members)
+{
+	SyncRepConfigData *config =
+		(SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+
+	config->num_sync = atoi(num_sync);
+	config->members = members;
+	return config;
+}
+
+#include "syncrep_scanner.c"
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
new file mode 100644
index 0000000..5d986d0
--- /dev/null
+++ b/src/backend/replication/syncrep_scanner.l
@@ -0,0 +1,182 @@
+%{
+/*-------------------------------------------------------------------------
+ *
+ * syncrep_scanner.l
+ *	  a lexical scanner for synchronous_standby_names
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/syncrep_scanner.l
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "lib/stringinfo.h"
+
+/*
+ * flex emits a yy_fatal_error() function that it calls in response to
+ * critical errors like malloc failure, file I/O errors, and detection of
+ * internal inconsistency.  That function prints a message and calls exit().
+ * Mutate it to instead call our handler, which jumps out of the parser.
+ */
+#undef fprintf
+#define fprintf(file, fmt, msg) syncrep_flex_fatal(msg)
+
+/* Handles to the buffer that the lexer uses internally */
+static YY_BUFFER_STATE scanbufhandle;
+
+static StringInfoData xdbuf;
+
+static const char *syncrep_flex_fatal_errmsg;
+static sigjmp_buf *syncrep_flex_fatal_jmp;
+
+static void	syncrep_scanner_init(const char *str);
+static void	syncrep_scanner_finish(void);
+static int	syncrep_flex_fatal(const char *msg);
+%}
+
+%option 8bit
+%option never-interactive
+%option nounput
+%option noinput
+%option noyywrap
+%option warn
+%option prefix="syncrep_yy"
+
+/*
+ * <xd> delimited identifiers (double-quoted identifiers)
+ */
+%x xd
+
+space		[ \t\n\r\f\v]
+
+undquoted_start	[^ ,\(\)\"]
+undquoted_cont		[^,\(\) \t\n\r\f\v]
+undquoted_name    {undquoted_start}{undquoted_cont}*
+dquoted_name		[^\"]+
+
+/* Double-quoted string */
+dquote		\"
+xdstart		{dquote}
+xddouble		{dquote}{dquote}
+xdstop		{dquote}
+xdinside		{dquoted_name}
+
+%%
+{space}+		{ /* ignore */ }
+{xdstart}	{
+				initStringInfo(&xdbuf);
+				BEGIN(xd);
+		}
+<xd>{xddouble} {
+				appendStringInfoChar(&xdbuf, '\"');
+		}
+<xd>{xdinside} {
+				appendStringInfoString(&xdbuf, yytext);
+		}
+<xd>{xdstop} {
+				yylval.str = pstrdup(xdbuf.data);
+				pfree(xdbuf.data);
+				BEGIN(INITIAL);
+				return NAME;
+		}
+","			{ return ','; }
+"("			{ return '('; }
+")"			{ return ')'; }
+[1-9][0-9]*	{
+				yylval.str = pstrdup(yytext);
+				return NUM;
+		}
+{undquoted_name} {
+				yylval.str = pstrdup(yytext);
+				return NAME;
+		}
+%%
+
+void
+yyerror(const char *message)
+{
+	ereport(IsUnderPostmaster ? DEBUG2 : LOG,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("%s at or near \"%s\"", message, yytext)));
+}
+
+void
+syncrep_scanner_init(const char *str)
+{
+	Size		slen = strlen(str);
+	char	   *scanbuf;
+
+	/*
+	 * Might be left over after ereport()
+	 */
+	if (YY_CURRENT_BUFFER)
+		yy_delete_buffer(YY_CURRENT_BUFFER);
+
+	/*
+	 * Make a scan buffer with special termination needed by flex.
+	 */
+	scanbuf = (char *) palloc(slen + 2);
+	memcpy(scanbuf, str, slen);
+	scanbuf[slen] = scanbuf[slen + 1] = YY_END_OF_BUFFER_CHAR;
+	scanbufhandle = yy_scan_buffer(scanbuf, slen + 2);
+}
+
+void
+syncrep_scanner_finish(void)
+{
+	yy_delete_buffer(scanbufhandle);
+	scanbufhandle = NULL;
+}
+
+/*
+ * Flex fatal errors bring us here.  Stash the error message and jump back to
+ * syncrep_scanstr().  Assume all msg arguments point to string constants; this
+ * holds for flex 2.5.31 (earliest we support) and flex 2.5.35 (latest as of
+ * this writing).  Otherwise, we would need to copy the message.
+ *
+ * We return "int" since this takes the place of calls to fprintf().
+ */
+static int
+syncrep_flex_fatal(const char *msg)
+{
+	syncrep_flex_fatal_errmsg = msg;
+	siglongjmp(*syncrep_flex_fatal_jmp, 1);
+	return 0;					/* keep compiler quiet */
+}
+
+bool
+syncrep_scanstr(const char *str)
+{
+	int	parse_res;
+	bool	ret = true;
+	sigjmp_buf	flex_fatal_jmp;
+
+	if (sigsetjmp(flex_fatal_jmp, 1) == 0)
+		syncrep_flex_fatal_jmp = &flex_fatal_jmp;
+	else
+	{
+		/*
+		 * Regain control after a fatal, internal flex error.  It may have
+		 * corrupted parser state.  Consequently, abandon the string, but trust
+		 * that the state remains sane enough for yy_delete_buffer().
+		 */
+		elog(IsUnderPostmaster ? DEBUG2 : LOG, "%s", syncrep_flex_fatal_errmsg);
+		ret = false;
+		goto cleanup;
+	}
+
+	syncrep_scanner_init(str);
+	parse_res = syncrep_yyparse();
+
+	if (parse_res != 0)
+		ret = false;
+cleanup:
+	syncrep_scanner_finish();
+	return ret;
+}
\ No newline at end of file
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f98475c..0867cc4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2666,6 +2666,7 @@ WalSndShmemInit(void)
 		{
 			WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
 
+			walsnd->slotno = i;
 			SpinLockInit(&walsnd->mutex);
 		}
 	}
@@ -2751,7 +2752,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 	Tuplestorestate *tupstore;
 	MemoryContext per_query_ctx;
 	MemoryContext oldcontext;
-	WalSnd	   *sync_standby;
+	List	   *sync_standbys;
 	int			i;
 
 	/* check to see if caller supports us returning a tuplestore */
@@ -2780,12 +2781,22 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 	MemoryContextSwitchTo(oldcontext);
 
 	/*
-	 * Get the currently active synchronous standby.
+	 * Allocate and update the config data of synchronous replication,
+	 * and then get the currently active synchronous standbys.
 	 */
+	SyncRepUpdateConfig();
 	LWLockAcquire(SyncRepLock, LW_SHARED);
-	sync_standby = SyncRepGetSynchronousStandby();
+	sync_standbys = SyncRepGetSyncStandbys();
 	LWLockRelease(SyncRepLock);
 
+	/*
+	 * Free the previously-allocated config data because this backend
+	 * no longer needs it. The next call of this function must allocate
+	 * and update the config data again, because the synchronous
+	 * replication setting might change between calls.
+	 */
+	SyncRepFreeConfig(SyncRepConfig);
+
 	for (i = 0; i < max_wal_senders; i++)
 	{
 		WalSnd *walsnd = &WalSndCtl->walsnds[i];
@@ -2856,7 +2867,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
-			else if (walsnd == sync_standby)
+			else if (list_member_int(sync_standbys, i))
 				values[7] = CStringGetTextDatum("sync");
 			else
 				values[7] = CStringGetTextDatum("potential");
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 96e059b..561e44a 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -31,6 +31,18 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/*
+ * Struct for the configuration of synchronous replication.
+ */
+typedef struct SyncRepConfigData
+{
+	int	num_sync;	/* number of sync standbys that we need to wait for */
+	List	*members;	/* list of names of potential sync standbys */
+} SyncRepConfigData;
+
+extern SyncRepConfigData *syncrep_parse_result;
+extern SyncRepConfigData *SyncRepConfig;
+
 /* user-settable parameters for synchronous replication */
 extern char *SyncRepStandbyNames;
 
@@ -44,14 +56,24 @@ extern void SyncRepCleanupAtProcExit(void);
 extern void SyncRepInitConfig(void);
 extern void SyncRepReleaseWaiters(void);
 
+/* called by wal sender and user backend */
+extern List *SyncRepGetSyncStandbys(void);
+extern void SyncRepUpdateConfig(void);
+extern void SyncRepFreeConfig(SyncRepConfigData *config);
+
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
-/* forward declaration to avoid pulling in walsender_private.h */
-struct WalSnd;
-extern struct WalSnd *SyncRepGetSynchronousStandby(void);
-
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_commit(int newval, void *extra);
 
+/*
+ * Internal functions for parsing synchronous_standby_names grammar,
+ * in syncrep_gram.y and syncrep_scanner.l
+ */
+extern int  syncrep_yyparse(void);
+extern int  syncrep_yylex(void);
+extern void syncrep_yyerror(const char *str);
+extern bool	syncrep_scanstr(const char *str);
+
 #endif   /* _SYNCREP_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7794aa5..a125c57 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -32,6 +32,7 @@ typedef enum WalSndState
  */
 typedef struct WalSnd
 {
+	int		slotno;			/* index of this slot in WalSnd array */
 	pid_t		pid;			/* this walsender's process id, or 0 */
 	WalSndState state;			/* this walsender's state */
 	XLogRecPtr	sentPtr;		/* WAL has been sent up to this point */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1eedd19..157b8c3 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -389,6 +389,7 @@ sub init
 	  unless defined $params{hba_permit_replication};
 	$params{allows_streaming} = 0 unless defined $params{allows_streaming};
 	$params{has_archiving}    = 0 unless defined $params{has_archiving};
+	$params{allows_sync_rep} = 0 unless defined $params{allows_sync_rep};
 
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
@@ -413,6 +414,10 @@ sub init
 		print $conf "hot_standby = on\n";
 		print $conf "max_connections = 10\n";
 	}
+	if ($params{allows_sync_rep})
+	{
+		print $conf "synchronous_standby_names = 'standby1,standby2'\n";
+	}
 
 	if ($TestLib::windows_os)
 	{
@@ -706,6 +711,23 @@ sub promote
 	TestLib::system_log('pg_ctl', '-D', $pgdata, '-l', $logfile, 'promote');
 }
 
+=pod
+
+=item $node->reload()
+
+Wrapper for pg_ctl reload
+
+=cut
+
+sub reload
+{
+	my ($self)	= @_;
+	my $pgdata	= $self->data_dir;
+	my $name	= $self->name;
+	print "### Reloading node \"$name\"\n";
+	TestLib::system_log('pg_ctl', '-D', $pgdata, 'reload');
+}
+
 # Internal routine to enable streaming replication on a standby node.
 sub enable_streaming
 {
diff --git a/src/test/recovery/t/006_sync_rep.pl b/src/test/recovery/t/006_sync_rep.pl
new file mode 100644
index 0000000..0927d41
--- /dev/null
+++ b/src/test/recovery/t/006_sync_rep.pl
@@ -0,0 +1,105 @@
+# Minimal test testing synchronous replication sync_state transition
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+
+# Initialize master node with synchronous_standby_names = 'standby1,standby2'
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1,allows_sync_rep => 1);
+$node_master->start;
+my $backup_name = 'my_backup';
+
+my $check_sql = "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
+
+# Take backup
+$node_master->backup($backup_name);
+
+# Create standby1 linking to master
+my $node_standby_1 = get_new_node('standby1');
+$node_standby_1->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby_1->start;
+
+
+# Create standby2 linking to master
+my $node_standby_2 = get_new_node('standby2');
+$node_standby_2->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby_2->start;
+
+# Create standby3 linking to master
+my $node_standby_3 = get_new_node('standby3');
+$node_standby_3->init_from_backup($node_master, $backup_name, has_streaming => 1);
+$node_standby_3->start;
+
+# Create standby4
+my $node_standby_4 = get_new_node('standby4');
+$node_standby_4->init_from_backup($node_master, $backup_name, has_streaming => 1);
+
+# Check application sync_state on master initially
+my $result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|sync\nstandby2|2|potential\nstandby3|0|async", 'checked for synchronous standbys state for backward compatibility');
+
+# Change synchronous_standby_names to '*' and check sync_state.
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*';");
+$node_master->reload;
+
+# Only Standby1 should be 'sync'.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|sync\nstandby2|1|potential\nstandby3|1|potential", 'checked for synchronous standbys state for backward compatibility with asterisk');
+
+# Stop all standbys
+$node_standby_1->stop;
+$node_standby_2->stop;
+$node_standby_3->stop;
+
+# Change the synchronous_standby_names = '2(standby1,standby2,standby3)' and check sync_state.
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '2(standby1,standby2,standby3)';");
+$node_master->reload;
+
+$node_standby_2->start;
+$node_standby_3->start;
+
+# Standby2 and standby3 should be 'sync'.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby2|2|sync\nstandby3|3|sync", 'checked for synchronous standbys state transition 1');
+
+$node_standby_1->start;
+$node_standby_4->start;
+
+# Standby1 should be 'sync' instead of standby3, and standby3 should turn to 'potential'.
+# Standby4 should be added as 'async'.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|sync\nstandby2|2|sync\nstandby3|3|potential\nstandby4|0|async", 'checked for synchronous standbys state transition 2');
+
+# Change the synchronous_standby_names = '2(standby1,*,standby2)' and check sync_state
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '2(standby1,*,standby2)';");
+$node_master->reload;
+
+# Standby1 and standby2 should be 'sync', and sync_priority of standby2 should be 2, not 3.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|sync\nstandby2|2|sync\nstandby3|2|potential\nstandby4|2|potential", 'checked for synchronous standbys state with asterisk 1');
+
+# Change the synchronous_standby_names = '2(*)' and check sync state
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '2(*)';");
+$node_master->reload;
+
+# Since standby2 and standby3 have a higher index number in the WalSnd array, these standbys should be 'sync' instead of standby1.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|potential\nstandby2|1|sync\nstandby3|1|sync\nstandby4|1|potential", 'checked for synchronous standbys state with asterisk 2');
+
+# Stop standby3, which is currently considered 'sync'.
+$node_standby_3->stop;
+
+# Standby1 becomes 'sync'.
+$result = $node_master->safe_psql('postgres', $check_sql);
+print "$result \n";
+is($result, "standby1|1|sync\nstandby2|1|sync\nstandby4|1|potential", 'checked for synchronous standbys state with asterisk 3');
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index ebc2da8..dd9f7da 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -156,7 +156,7 @@ sub mkvcbuild
 		'bootparse.y');
 	$postgres->AddFiles('src/backend/utils/misc', 'guc-file.l');
 	$postgres->AddFiles('src/backend/replication', 'repl_scanner.l',
-		'repl_gram.y');
+		'repl_gram.y', 'syncrep_scanner.l', 'syncrep_gram.y');
 	$postgres->AddDefine('BUILDING_DLL');
 	$postgres->AddLibrary('secur32.lib');
 	$postgres->AddLibrary('ws2_32.lib');
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
