On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote: > Thank you for the new patch. > > At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.m...@gmail.com> > wrote in <cad21aodcn1ftoccyrqpu6fmy1xnpddakdtcbhw1r9m1mpm0...@mail.gmail.com> >> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.m...@gmail.com> >> wrote: >> > I agree with #3 way and the s_s_name format you suggested. >> > I think that It's extensible and is tolerable for future changes. >> > I'm going to implement the patch based on this idea if other hackers >> > agree with this design. >> >> Please find the attached draft patch which supports multi sync replication. >> This patch adds a GUC parameter synchronous_replication_method, which >> represent the method of synchronous replication. >> >> [Design of replication method] >> synchronous_replication_method has two values; 'priority' and >> '1-priority' for now. >> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the >> future. >> >> * s_r_method = '1-priority' >> This method is for backward compatibility, so the syntax of s_s_names >> is same as today. >> The behavior is same as well. >> >> * s_r_method = 'priority' >> This method is for multiple synchronous replication using priority method. >> The syntax of s_s_names is, >> <number of sync standbys>, <standby name> [, ...] > > Is there anyone opposed to this? > >> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, >> node3' means that the master waits for acknowledge from at least 2 >> lowest priority servers. >> If 4 standbys(node1 - node4) are available, the master server waits >> acknowledge from 'node1' and 'node2. >> The each status of wal senders are; >> >> =# select application_name, sync_state from pg_stat_replication order >> by application_name; >> application_name | sync_state >> ------------------+------------ >> node1 | sync >> node2 | sync >> node3 | potential >> node4 | async >> (4 rows) >> >> After 'node2' crashed, the master will wait for acknowledge from >> 'node1' and 'node3'. >> The each status of wal senders are; >> >> =# select application_name, sync_state from pg_stat_replication order >> by application_name; >> application_name | sync_state >> ------------------+------------ >> node1 | sync >> node3 | sync >> node4 | async >> (3 rows) >> >> [Changing replication method] >> When we want to change the replication method, we have to change the >> s_r_method at first, and then do pg_reload_conf(). >> After changing replication method, we can change the s_s_names.
Thank you for reviewing the patch! Please find attached latest patch. > Mmm. I should be able to be changed at once, because s_r_method > and s_s_names contradict each other during the intermediate > state. Sorry to confuse you. I meant the case where we want to change the replication method using ALTER SYSTEM. >> [Expanding replication method] >> If we want to expand new replication method additionally, we need to >> implement two functions for each replication method: >> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) >> This function obtains the list of standbys considered as synchronous >> at that time, and return its length. >> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> This function obtains LSNs(write, flush) considered as synced. >> >> Also, this patch debug code is remain yet, you can debug this behavior >> using by enable DEBUG_REPLICATION macro. >> >> Please give me feedbacks. > > I haven't looked into this fully (sorry) but I'm concerned about > several points. > > > - I feel that some function names looks too long. For example > SyncRepGetSynchronousStandbysOnePriority occupies more than the > half of a line. (However, the replication code alrady has many > long function names..) Yeah, it would be better to change 'Synchronous' to 'Sync' at least. > - The comment below of SyncRepGetSynchronousStandbyOnePriority, > > /* Find lowest priority standby */ > > The code where the comment is for is doing the correct > thing. Howerver, the comment is confusing. A lower priority > *value* means a higher priority. Fixed. > - SyncRepGetSynchronousStandbys checks all if()s even when the > first one matches. Use switch or "else if" there if you they > are exclusive each other. Fixed. > - Do you intende the DEBUG_REPLICATION code in > SyncRepGetSynchronousStandbys*() to be the final shape? The > same code blocks which can work for both method should be in > their common caller but SyncRepGetSyncLsns*() are > headache. Although it might need more refactoring, I'm sorry > but I don't see a desirable shape for now. I'm not going to DEBUG_REPLICAION code to be the final shape. These codes are removed from this version patch. > By the way, palloc(20)/free() in such short term looks > ineffective. > > - SyncRepGetSyncLsnsPriority > > For the comment "/* Find lowest XLogRecPtr of both write and > flush from sync_nodes */", LSN is compared as early or late so > the comment would be better to be something like "Keep/Collect > the earliest write and flush LSNs among prioritized standbys". Fixed. > And what is more important, this block handles write and flush > LSN jumbled and it reults in missing the earliest(= most > delayed) LSN for certain cases. The following is an example. > > Standby 1: write LSN = 10, flush LSN = 5 > Standby 2: write LSN = 8 , flush LSN = 6 > > For this case, finally we get tmp_write = 10 and tmp_flush = 5 > from the current code, where tmp_write has wrong value since > LSN = 10 has *not* been written yet on standby 2. (the names > "tmp_*" don't seem appropriate here) > You are right. We have to handle write and flush LSN individually, and to get each lowest LSN. For example in this case, we have to get write = 8, flush = 5. I've change the logic so that. Regards, -- Masahiko Sawada
000_multi_sync_replication_v3.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers