On Mon, Nov 14, 2016 at 5:39 PM, Masahiko Sawada <[email protected]> wrote:
> On Tue, Nov 8, 2016 at 10:12 PM, Michael Paquier
> <[email protected]> wrote:
>> On Tue, Nov 8, 2016 at 12:25 AM, Masahiko Sawada <[email protected]>
>> wrote:
>>> On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier
>>> <[email protected]> wrote:
>>>> + if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
>>>> + return SyncRepGetSyncStandbysPriority(am_sync);
>>>> + else /* SYNC_REP_QUORUM */
>>>> + return SyncRepGetSyncStandbysQuorum(am_sync);
>>>> Both routines share the same logic to detect if a WAL sender can be
>>>> selected as a candidate for sync evaluation or not, still per the
>>>> selection they do I agree that it is better to keep them as separate.
>>>>
>>>> + /* In quroum method, all sync standby priorities are always 1 */
>>>> + if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
>>>> + priority = 1;
>>>> Honestly I don't understand why you are enforcing that. Priority can
>>>> be important for users willing to switch from ANY to FIRST to have a
>>>> look immediately at what are the standbys that would become sync or
>>>> potential.
>>>
>>> I thought that since all standbys appearing in s_s_names list are
>>> treated equally in quorum method, these standbys should have same
>>> priority.
>>> If these standby have different sync_priority, it looks like that
>>> master server replicates to standby server based on priority.
>>
>> No actually, because we know that they are a quorum set, and that they
>> work in the same set. The concept of priorities has no real meaning
>> for quorum as there is no ordering of the elements. Another, perhaps
>> cleaner idea may be to mark the field as NULL actually.
>
> We know that but I'm concerned it might confuse the user.
> If these priorities are the same, it can obviously imply that all of
> the standby listed in s_s_names are handled equally.
>
>>>> It is not possible to guess from how many standbys this needs to wait
>>>> for. One idea would be to mark the sync_state not as "quorum", but
>>>> "quorum-N", or just add a new column to indicate how many in the set
>>>> need to give a commit confirmation.
>>>
>>> As Simon suggested before, we could support another feature that
>>> allows the client to control the quorum number.
>>> Considering adding that feature, I thought it's better to have and
>>> control that information as a GUC parameter.
>>> Thought?
>>
>> Similarly that would be a SIGHUP parameter? Why not. Perhaps my worry
>> is not that much legitimate, users could just look at s_s_names to
>> guess how many in hte set a commit needs to wait for.
>
> It would be PGC_USRSET similar to synchronous_commit. The user can
> specify it in statement level.
>
>> + <para>
>> + <literal>FIRST</> and <literal>ANY</> are case-insensitive word
>> + and the standby name having these words are must be double-quoted.
>> + </para>
>> s/word/words/.
>>
>> + <literal>FIRST</> and <literal>ANY</> specify the method of
>> + that how master server controls the standby servers.
>> A little bit hard to understand, I would suggest:
>> FIRST and ANY specify the method used by the master to control the
>> standby servers.
>>
>> + The keyword <literal>FIRST</>, coupled with an integer
>> + number N higher-priority standbys and makes transaction commit
>> + when their WAL records are received.
>> This is unclear to me. Here is a correction:
>> The keyword FIRST, coupled with an integer N, makes transaction commit
>> wait until WAL records are received fron the N standbys with higher
>> priority number.
>>
>> + <varname>synchronous_standby_names</>. For example, a setting
>> + of <literal>ANY 3 (s1, s2, s3, s4)</> makes transaction commits
>> + wait until receiving receipts from at least any three standbys
>> + of four listed servers <literal>s1</>, <literal>s2</>,
>> <literal>s3</>,
>> This could just mention WAL records instead of "receipts".
>>
>> Instead of saying "an integer number N", we could use <literal>num_sync</>.
>>
>> + If <literal>FIRST</> or <literal>ANY</> are not specified,
>> this parameter
>> + behaves as <literal>ANY</>. Note that this grammar is
>> incompatible with
>> + <productname>PostgresSQL</> 9.6, where no keyword specified
>> is equivalent
>> + as if <literal>FIRST</> was specified. In short, there is no
>> real need to
>> + specify <replaceable class="parameter">num_sync</replaceable> as
>> this
>> + behavior does not have changed, as well as it is not
>> necessary to mention
>> + pre-9.6 versions are the multi-sync grammar has been added in 9.6.
>> This paragraph could be reworked, say:
>> if FIRST or ANY are not specified this parameter behaves as if ANY is
>> used. Note that this grammar is incompatible with PostgreSQL 9.6 which
>> is the first version supporting multiple standbys with synchronous
>> replication, where no such keyword FIRST or ANY can be used. Note that
>> the grammar behaves as if FIRST is used, which is incompatible with
>> the post-9.6 behavior.
>>
>> + <entry>Synchronous state of this standby server. <literal>quorum-N</>
>> + , where N is the number of synchronous standbys that transactions
>> + need to wait for replies from, when standby is considered as a
>> + candidate of quorum commit.</entry>
>> Nitpicking: I think that the comma goes to the previous line if it is
>> the first character of a line.
>>
>> + if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
>> + return SyncRepGetSyncStandbysPriority(am_sync);
>> + else /* SYNC_REP_QUORUM */
>> + return SyncRepGetSyncStandbysQuorum(am_sync)
>> Or that?
>> if (PRIORITY)
>> return StandbysPriority();
>> else if (QUORUM)
>> return StandbysQuorum();
>> else
>> elog(ERROR, "Boom");
>>
>> + * In priority method, we need the oldest these positions among sync
>> + * standbys. In quorum method, we need the newest these positions
>> + * specified by SyncRepConfig->num_sync.
>> Last sentence is grammatically incorrect, and it would be more correct
>> to precise the Nth LSN positions to be able to select k standbys from
>> a set of n ones.
>>
>> + SpinLockAcquire(&walsnd->mutex);
>> + write_array[i] = walsnd->write;
>> + flush_array[i]= walsnd->flush;
>> + apply_array[i] = walsnd->flush;
>> + SpinLockRelease(&walsnd->mutex);
>> A nit: adding a space on the self of the second = character. And you
>> need to save the apply position of the WAL sender, not the flush
>> position in the array that is going to be ordered.
>>
>> /*
>> * More easily understood version of standby state. This is
>> purely
>> - * informational, not different from priority.
>> + * informational. In quorum method, we add the number to indicate
>> + * how many in the set need to give a commit confirmation.
>> */
>> if (priority == 0)
>> values[7] = CStringGetTextDatum("async");
>> else if (list_member_int(sync_standbys, i))
>> - values[7] = CStringGetTextDatum("sync");
>> + values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
>> + CStringGetTextDatum("sync") :
>> CStringGetTextDatum("quorum")
>> This code block and its explanation comments tell two different
>> stories. The comment is saying that something like "quorum-N" is used
>> but the code always prints "quorum".
>>
>> It may be a good idea in the test to check that when no keywords is
>> specified the group of standbys is in quorum mode.
>
> Yeah, I will add some tests.
>
> I will post new version patch incorporated other comments.
>
Attached latest version patch incorporated review comments. After more
thought, I agree and changed the value of standby priority in quorum
method so that it's not set 1 forcibly. The all standby priorities are
1 If s_s_names = 'ANY(*)'.
Please review this patch.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..e125dff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3028,42 +3028,75 @@ include_dir 'conf.d'
transactions waiting for commit will be allowed to proceed after
these standby servers confirm receipt of their data.
The synchronous standbys will be those whose names appear
- earlier in this list, and
+ in this list, and
that are both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
- <literal>pg_stat_replication</></link> view).
- Other standby servers appearing later in this list represent potential
- synchronous standbys. If any of the current synchronous
- standbys disconnects for whatever reason,
- it will be replaced immediately with the next-highest-priority standby.
- Specifying more than one standby name can allow very high availability.
+ <literal>pg_stat_replication</></link> view). If the keyword
+ <literal>FIRST</> is specified, other standby servers appearing
+ later in this list represent potential synchronous standbys.
+ If any of the current synchronous standbys disconnects for
+ whatever reason, it will be replaced immediately with the
+ next-highest-priority standby. Specifying more than one standby
+ name can allow very high availability.
</para>
<para>
This parameter specifies a list of standby servers using
either of the following syntaxes:
<synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
<replaceable class="parameter">standby_name</replaceable> [, ...]
</synopsis>
where <replaceable class="parameter">num_sync</replaceable> is
the number of synchronous standbys that transactions need to
wait for replies from,
and <replaceable class="parameter">standby_name</replaceable>
- is the name of a standby server. For example, a setting of
- <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
- until their WAL records are received by three higher-priority standbys
- chosen from standby servers <literal>s1</>, <literal>s2</>,
- <literal>s3</> and <literal>s4</>.
+ is the name of a standby server.
+ <literal>FIRST</> and <literal>ANY</> specify the method used by
+ the master to control the standby servres.
</para>
<para>
- The second syntax was used before <productname>PostgreSQL</>
+ The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+ transaction commit wait until WAL records are received from the
+ <literal>num_sync</> standbys with higher priority number.
+ For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+ makes transaction commits wait until their WAL records are received
+ by three higher-priority standbys chosen from standby servers
+ <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+ </para>
+ <para>
+ The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+ makes transaction commits wait until WAL records are received
+ from at least <literal>num_sync</> connected standbys among those
+ defined in the list of <varname>synchronous_standby_names</>. For
+ example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+ transaction commits wait until receiving WAL records from at least
+ any three standbys of four listed servers <literal>s1</>,
+ <literal>s2</>, <literal>s3</>, <literal>s4</>.
+ </para>
+ <para>
+ <literal>FIRST</> and <literal>ANY</> are case-insensitive words
+ and the standby name having these words are must be double-quoted.
+ </para>
+ <para>
+ The third syntax was used before <productname>PostgreSQL</>
version 9.6 and is still supported. It's the same as the first syntax
- with <replaceable class="parameter">num_sync</replaceable> equal to 1.
- For example, <literal>1 (s1, s2)</> and
- <literal>s1, s2</> have the same meaning: either <literal>s1</>
- or <literal>s2</> is chosen as a synchronous standby.
- </para>
+ with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+ equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+ have the same meaning: either <literal>s1</> or <literal>s2</> is
+ chosen as a synchronous standby.
+ </para>
+ <note>
+ <para>
+ If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+ behaves as if <literal>ANY</> is used. Note that this grammar is incompatible
+ with <productname>PostgresSQL</> 9.6 which is first version supporting multiple
+ standbys with synchronous replication, where no such keyword <literal>FIRST</>
+ or <literal>ANY</> can be used. Note that the grammer behaves as if <literal>FIRST</>
+ is used, which is incompatible with the post-9.6 version behavior.
+ </para>
+ </note>
<para>
The name of a standby server for this purpose is the
<varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..7a0a22a 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
An example of <varname>synchronous_standby_names</> for multiple
synchronous standbys is:
<programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
</programlisting>
In this example, if four standby servers <literal>s1</>, <literal>s2</>,
<literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
<literal>s2</> fails. <literal>s4</> is an asynchronous standby since
its name is not in the list.
</para>
+ <para>
+ Another example of <varname>synchronous_standby_names</> for multiple
+ synchronous standby is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+ In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+ <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+ <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+ candidates. The master server will wait for at least 2 replies from them.
+ <literal>s4</> is an asynchronous standby since its name is not in the list.
+ </para>
</sect3>
<sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..2c5f3de 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<row>
<entry><structfield>sync_state</></entry>
<entry><type>text</></entry>
- <entry>Synchronous state of this standby server</entry>
+ <entry>Synchronous state of this standby server. It is <literal>quorum</>
+ when standby is considered as a candidate of quorum commit.</entry>
</row>
</tbody>
</tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
# syncrep_scanner is complied as part of syncrep_gram
syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
syncrep_scanner.c: FLEX_NO_BACKUP=yes
# repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..bcc1317 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
*
* In 9.5 or before only a single standby could be considered as
* synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
*
* Before the standbys chosen from synchronous_standby_names can
* become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
static bool announce_next_takeover = true;
-static SyncRepConfigData *SyncRepConfig = NULL;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
static int SyncRepWakeQueue(bool all, int mode);
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
- XLogRecPtr *flushPtr,
- XLogRecPtr *applyPtr,
- bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+ XLogRecPtr *flushPtr,
+ XLogRecPtr *applyPtr,
+ bool *am_sync);
static int SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int cmp_lsn(const void *a, const void *b);
#ifdef USE_ASSERT_CHECKING
static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
XLogRecPtr writePtr;
XLogRecPtr flushPtr;
XLogRecPtr applyPtr;
- bool got_oldest;
+ bool got_recptr;
bool am_sync;
int numwrite = 0;
int numflush = 0;
@@ -413,11 +419,10 @@ SyncRepReleaseWaiters(void)
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
/*
- * Check whether we are a sync standby or not, and calculate the oldest
- * positions among all sync standbys.
+ * Check whether we are a sync standby or not, and calculate the synced
+ * positions among all sync standbys using method.
*/
- got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
- &applyPtr, &am_sync);
+ got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
/*
* If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +440,7 @@ SyncRepReleaseWaiters(void)
* If the number of sync standbys is less than requested or we aren't
* managing a sync standby then just leave.
*/
- if (!got_oldest || !am_sync)
+ if (!got_recptr || !am_sync)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -471,17 +476,50 @@ SyncRepReleaseWaiters(void)
}
/*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys using according to synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool *am_sync)
+{
+ /* Set default result */
+ if (am_sync != NULL)
+ *am_sync = false;
+
+ /* Quick exit if sync replication is not requested */
+ if (SyncRepConfig == NULL)
+ return NIL;
+
+ if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+ return SyncRepGetSyncStandbysPriority(am_sync);
+ else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+ return SyncRepGetSyncStandbysQuorum(am_sync);
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ "invalid synchronization method is specified \"%d\"",
+ SyncRepConfig->sync_method));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
*
* Return false if the number of sync standbys is less than
* synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
*
* On return, *am_sync is set to true if this walsender is connecting to
* sync standby. Otherwise it's set to false.
*/
static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
XLogRecPtr *applyPtr, bool *am_sync)
{
List *sync_standbys;
@@ -507,29 +545,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
return false;
}
- /*
- * Scan through all sync standbys and calculate the oldest Write, Flush
- * and Apply positions.
- */
- foreach(cell, sync_standbys)
+ if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
{
- WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
- XLogRecPtr write;
- XLogRecPtr flush;
- XLogRecPtr apply;
-
- SpinLockAcquire(&walsnd->mutex);
- write = walsnd->write;
- flush = walsnd->flush;
- apply = walsnd->apply;
- SpinLockRelease(&walsnd->mutex);
-
- if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
- *writePtr = write;
- if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
- *flushPtr = flush;
- if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
- *applyPtr = apply;
+ /*
+ * Scan through all sync standbys and calculate the oldest
+ * Write, Flush and Apply positions.
+ */
+ foreach (cell, sync_standbys)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+ XLogRecPtr write;
+ XLogRecPtr flush;
+ XLogRecPtr apply;
+
+ SpinLockAcquire(&walsnd->mutex);
+ write = walsnd->write;
+ flush = walsnd->flush;
+ apply = walsnd->apply;
+ SpinLockRelease(&walsnd->mutex);
+
+ if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+ *writePtr = write;
+ if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+ *flushPtr = flush;
+ if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+ *applyPtr = apply;
+ }
+ }
+ else /* SYNC_REP_QUORUM */
+ {
+ XLogRecPtr *write_array;
+ XLogRecPtr *flush_array;
+ XLogRecPtr *apply_array;
+ int len;
+ int i = 0;
+
+ len = list_length(sync_standbys);
+ write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+ flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+ apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+ foreach (cell, sync_standbys)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+ SpinLockAcquire(&walsnd->mutex);
+ write_array[i] = walsnd->write;
+ flush_array[i] = walsnd->flush;
+ apply_array[i] = walsnd->apply;
+ SpinLockRelease(&walsnd->mutex);
+
+ i++;
+ }
+
+ qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+ qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+ qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+ /*
+ * Get N-th newest Write, Flush, Apply positions
+ * specified by SyncRepConfig->num_sync.
+ */
+ *writePtr = write_array[SyncRepConfig->num_sync - 1];
+ *flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+ *applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+ pfree(write_array);
+ pfree(flush_array);
+ pfree(apply_array);
}
list_free(sync_standbys);
@@ -537,17 +620,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
}
/*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
*
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+ List *result = NIL;
+ int i;
+
+ Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Must be active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Must be streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Must be synchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Must have a valid flush position */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * Consider this standby as candidate of sync and append
+ * it to the result.
+ */
+ result = lappend_int(result, i);
+ if (am_sync != NULL && walsnd == MyWalSnd)
+ *am_sync = true;
+ }
+
+ return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
* the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
*
* On return, *am_sync is set to true if this walsender is connecting to
* sync standby. Otherwise it's set to false.
*/
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
{
List *result = NIL;
List *pending = NIL;
@@ -560,13 +692,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
volatile WalSnd *walsnd; /* Use volatile pointer to prevent code
* rearrangement */
- /* Set default result */
- if (am_sync != NULL)
- *am_sync = false;
-
- /* Quick exit if sync replication is not requested */
- if (SyncRepConfig == NULL)
- return NIL;
+ Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;
@@ -892,6 +1018,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
#endif
/*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+ XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+ XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+ if (lsn1 > lsn2)
+ return -1;
+ else if (lsn1 == lsn2)
+ return 0;
+ else
+ return 1;
+}
+
+/*
* ===========================================================
* Synchronous Replication functions executed by any process
* ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
char *syncrep_parse_error_msg;
static SyncRepConfigData *create_syncrep_config(const char *num_sync,
- List *members);
+ List *members, int sync_method);
/*
* Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
SyncRepConfigData *config;
}
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
%type <config> result standby_config
%type <list> standby_list
@@ -60,8 +60,10 @@ result:
;
standby_config:
- standby_list { $$ = create_syncrep_config("1", $1); }
- | NUM '(' standby_list ')' { $$ = create_syncrep_config($1, $3); }
+ standby_list { $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+ | NUM '(' standby_list ')' { $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+ | ANY NUM '(' standby_list ')' { $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+ | FIRST NUM '(' standby_list ')' { $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
;
standby_list:
@@ -77,7 +79,7 @@ standby_name:
static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
{
SyncRepConfigData *config;
int size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
config->config_size = size;
config->num_sync = atoi(num_sync);
+ config->sync_method = sync_method;
config->nmembers = list_length(members);
ptr = config->member_names;
foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit [0-9]
ident_start [A-Za-z\200-\377_]
ident_cont [A-Za-z\200-\377_0-9\$]
identifier {ident_start}{ident_cont}*
+any_ident any
+first_ident first
dquote \"
xdstart {dquote}
@@ -64,6 +66,14 @@ xdinside [^"]+
%%
{space}+ { /* ignore */ }
+{any_ident} {
+ yylval.str = pstrdup(yytext);
+ return ANY;
+ }
+{first_ident} {
+ yylval.str = pstrdup(yytext);
+ return FIRST;
+ }
{xdstart} {
initStringInfo(&xdbuf);
BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..04fe994 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2860,12 +2860,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* More easily understood version of standby state. This is purely
- * informational, not different from priority.
+ * informational. In quorum method, since all standbys are considered as
+ * a candidate of quorum commit standby state is always 'quorum'.
*/
if (priority == 0)
values[7] = CStringGetTextDatum("async");
else if (list_member_int(sync_standbys, i))
- values[7] = CStringGetTextDatum("sync");
+ values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+ CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[7] = CStringGetTextDatum("potential");
}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY 0
+#define SYNC_REP_QUORUM 1
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
int num_sync; /* number of sync standbys that we need to
* wait for */
int nmembers; /* number of members in the following list */
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;
+extern SyncRepConfigData *SyncRepConfig;
+
/* communication variables for parsing synchronous_standby_names GUC */
extern SyncRepConfigData *syncrep_parse_result;
extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..c502d20 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
# Query checking sync_priority and sync_state of each standby
my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
$node_master, qq(standby2|2|sync
standby3|3|sync),
'2 synchronous standbys',
- '2(standby1,standby2,standby3)');
+ 'FIRST 2(standby1,standby2,standby3)');
# Start standby1
$node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
standby3|3|sync
standby4|1|sync),
'num_sync exceeds the num of potential sync standbys',
- '6(standby4,standby0,standby3,standby2)');
+ 'FIRST 6(standby4,standby0,standby3,standby2)');
# The setting that * comes before another standby name is acceptable
# but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
standby3|2|potential
standby4|2|potential),
'asterisk comes before another standby name',
- '2(standby1,*,standby2)');
+ 'FIRST 2(standby1,*,standby2)');
# Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
# earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
standby3|1|sync
standby4|1|potential),
'multiple standbys having the same priority are chosen as sync',
- '2(*)');
+ 'FIRST 2(*)');
# Stop Standby3 which is considered in 'sync' state.
$node_standby_3->stop;
@@ -172,3 +172,34 @@ test_sync_state(
standby2|1|sync
standby4|1|potential),
'potential standby found earlier in array is promoted to sync');
+
+# Check that the state of standbys listed as a voter are having
+# same priority when synchronous_standby_names uses quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Check that state of standbys are not the same as the behaviour of that
+# 'ANY' is specified.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'not specify synchronization method',
+'2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that set setting of 'ANY 2(*)' chooses all standbys as
+# voter.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers