Re: [ClusterLabs] large cluster - failure recovery
On 11/04/2015 12:55 PM, Digimer wrote:
> On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
>> Hi,
>>
>> I have a cluster of 32 nodes, and after some tuning was able to have
>> it started and running,
>
> This is not supported by RH for a reason; it's hard to get the timing
> right. SUSE supports up to 32 nodes, but they must be doing some
> serious magic behind the scenes.
>
> I would *strongly* recommend dividing this up into a few smaller
> clusters... 8 nodes per cluster would be the max I'd feel comfortable
> with. You need your cluster to solve more problems than it causes...

Hi Radoslaw,

RH supports up to 16. 32 should be possible with recent
pacemaker+corosync versions and careful tuning, but it's definitely
leading-edge.

An alternative with pacemaker 1.1.10+ (1.1.12+ recommended) is
Pacemaker Remote, which easily scales to dozens of nodes:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html

Pacemaker Remote is a really good approach once you start pushing the
limits of cluster nodes, and probably better than trying to get
corosync to handle more nodes. (There are long-term plans for improving
corosync's scalability, but that doesn't help you now.)

>> but it does not recover from a node disconnect-connect failure.
>> It regains quorum, but the CIB does not recover to a synchronized
>> state, and "cibadmin -Q" times out.
>>
>> Is there anything I can do with corosync or pacemaker parameters to
>> make it recover from such a situation?
>> (Everything works for smaller clusters.)
>>
>> In my case it is OK for a node to disconnect (all the major
>> resources are shut down) and later reconnect to the cluster (the
>> running monitoring agent will clean up and restart major resources
>> if needed), so I do not have STONITH configured.
>>
>> Details:
>> OS: CentOS 6
>> Pacemaker: Pacemaker 1.1.9-1512.el6
>
> Upgrade.

If you can upgrade to the latest CentOS 6.7, you can get a much newer
Pacemaker. But Pacemaker is probably not what is limiting your cluster
size; the newer version's main benefit would be Pacemaker Remote
support. (Of course there are plenty of bug fixes and new features as
well.)

>> Corosync: Corosync Cluster Engine, version '2.3.2'
>
> This is not supported on EL6 at all. Please stick with corosync 1.4
> and use the cman plugin as the quorum provider.

CentOS is self-supported anyway, so if you're willing to handle your
own upgrades and such, there is nothing wrong with compiling. But
corosync is up to 2.3.5, so you're already behind. :) I'd recommend
compiling libqb 0.17.2 if you're compiling recent corosync and/or
pacemaker.

Alternatively, CentOS 7 will have recent versions of everything.

>> Corosync configuration:
>> token: 1
>> #token_retransmits_before_loss_const: 10
>> consensus: 15000
>> join: 1000
>> send_join: 80
>> merge: 1000
>> downcheck: 2000
>> #rrp_problem_count_timeout: 5000
>> max_network_delay: 150 # for azure
>>
>> Some logs:
>>
>> [...]
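Regarding the Pacemaker Remote suggestion above, a minimal sketch of
the cluster-side setup (hypothetical node name and address; assumes
pcs, and that pacemaker_remoted is installed on the remote host with
/etc/pacemaker/authkey copied from a cluster node and TCP port 3121
reachable):

    # A remote node is defined as a resource on the cluster side; once
    # it starts, the remote host can run resources like any other node
    # without joining corosync membership.
    pcs resource create remote-node-1 ocf:pacemaker:remote \
        server=10.0.0.101 reconnect_interval=60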
Re: [ClusterLabs] large cluster - failure recovery
Thank you Ken and Digimer for all your suggestions.

On Wed, Nov 4, 2015 at 2:32 PM, Ken Gaillot wrote:
> [...]
Re: [ClusterLabs] large cluster - failure recovery
On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
> Hi,
>
> I have a cluster of 32 nodes, and after some tuning was able to have
> it started and running,

This is not supported by RH for a reason; it's hard to get the timing
right. SUSE supports up to 32 nodes, but they must be doing some
serious magic behind the scenes.

I would *strongly* recommend dividing this up into a few smaller
clusters... 8 nodes per cluster would be the max I'd feel comfortable
with. You need your cluster to solve more problems than it causes...

> but it does not recover from a node disconnect-connect failure.
> It regains quorum, but the CIB does not recover to a synchronized
> state, and "cibadmin -Q" times out.
>
> Is there anything I can do with corosync or pacemaker parameters to
> make it recover from such a situation?
> (Everything works for smaller clusters.)
>
> In my case it is OK for a node to disconnect (all the major resources
> are shut down) and later reconnect to the cluster (the running
> monitoring agent will clean up and restart major resources if
> needed), so I do not have STONITH configured.
>
> Details:
> OS: CentOS 6
> Pacemaker: Pacemaker 1.1.9-1512.el6

Upgrade.

> Corosync: Corosync Cluster Engine, version '2.3.2'

This is not supported on EL6 at all. Please stick with corosync 1.4
and use the cman plugin as the quorum provider.

> Corosync configuration:
> token: 1
> #token_retransmits_before_loss_const: 10
> consensus: 15000
> join: 1000
> send_join: 80
> merge: 1000
> downcheck: 2000
> #rrp_problem_count_timeout: 5000
> max_network_delay: 150 # for azure
>
> Some logs:
>
> [...]
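To spell out that stack: on EL6 it is configured via
/etc/cluster/cluster.conf rather than corosync.conf. Roughly, with
hypothetical cluster and node names, following the EL6 "Clusters from
Scratch" approach (fencing is redirected to pacemaker via fence_pcmk):

    ccs -f /etc/cluster/cluster.conf --createcluster mycluster
    ccs -f /etc/cluster/cluster.conf --addnode node1.example.com
    ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
    ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1.example.com
    ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1.example.com pcmk-redirect port=node1.example.com
    # repeat the node/method/fence-instance steps for each node, copy
    # cluster.conf to all nodes, then on each node:
    service cman start
    service pacemaker start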
[ClusterLabs] large cluster - failure recovery
Hi,

I have a cluster of 32 nodes, and after some tuning was able to have it
started and running, but it does not recover from a node
disconnect-connect failure. It regains quorum, but the CIB does not
recover to a synchronized state, and "cibadmin -Q" times out.

Is there anything I can do with corosync or pacemaker parameters to
make it recover from such a situation?
(Everything works for smaller clusters.)

In my case it is OK for a node to disconnect (all the major resources
are shut down) and later reconnect to the cluster (the running
monitoring agent will clean up and restart major resources if needed),
so I do not have STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'

Corosync configuration:
token: 1
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure

Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not
applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not
applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not
applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not
applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application
of an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN  ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
apply_xml_diff: Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not
applied to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not
applied to 1.15046.3: current "num_updates" is greater than required
[...]

P.S. Sorry if this should have been posted to the corosync list; it is
the CIB synchronization that fails, so this group seemed to me the
right place.

--
Best Regards,
Radoslaw Garbacz
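For anyone debugging similar CIB sync failures: the version stamps in
the diff messages above ("1.9254.1", "1.15046.2", ...) are the CIB's
admin_epoch.epoch.num_updates. One way to see how far the nodes have
diverged (a sketch, not something done in this thread) is to compare
the CIB header on every node:

    # The first line of the query output is the <cib ...> root
    # element, which carries the admin_epoch, epoch, and num_updates
    # attributes; run on each node and compare.
    cibadmin -Q -t 30 | head -n 1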