Re: [ClusterLabs] large cluster - failure recovery

2015-11-04 Thread Ken Gaillot
On 11/04/2015 12:55 PM, Digimer wrote:
> On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
>> Hi,
>>
>> I have a cluster of 32 nodes, and after some tuning I was able to get it
>> started and running,
> 
> This is not supported by RH for a reason; it's hard to get the timing
> right. SUSE supports up to 32 nodes, but they must be doing some serious
> magic behind the scenes.
> 
> I would *strongly* recommend dividing this up into a few smaller
> clusters... 8 nodes per cluster would be the max I'd feel comfortable with.
> You need your cluster to solve more problems than it causes...

Hi Radoslaw,

RH supports up to 16. 32 should be possible with recent
pacemaker+corosync versions and careful tuning, but it's definitely
leading-edge.

An alternative with pacemaker 1.1.10+ (1.1.12+ recommended) is Pacemaker
Remote, which easily scales to dozens of nodes:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html

Pacemaker Remote is a really good approach once you start pushing the
limits of cluster nodes. Probably better than trying to get corosync to
handle more nodes. (There are long-term plans for improving corosync's
scalability, but that doesn't help you now.)
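
Roughly, bringing up a remote node looks like this (the host name "node33"
is just an example, and package names are from memory, so double-check
against the doc above):

  # on the machine that will become a remote node
  yum install pacemaker-remote resource-agents pcs
  mkdir -p /etc/pacemaker
  # copy /etc/pacemaker/authkey from an existing cluster node to the same path
  service pacemaker_remote start

  # on any full cluster node: register the remote node as a resource
  pcs resource create node33 ocf:pacemaker:remote server=node33 \
      op monitor interval=30s

Once that resource starts, node33 can run resources like any other node, but
it never joins the corosync membership, which is what keeps the messaging
layer small.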

>> but it does not recover from a node disconnect-connect failure.
>> It regains quorum, but CIB does not recover to a synchronized state and
>> "cibadmin -Q" times out.
>>
>> Is there anything I can do with corosync or pacemaker parameters to make
>> it recover from such a situation?
>> (Everything works for smaller clusters.)
>>
>> In my case it is OK for a node to disconnect (all the major resources
>> are shut down)
>> and later reconnect to the cluster (the running monitoring agent will
>> clean up and restart major resources if needed),
>> so I do not have STONITH configured.
>>
>> Details:
>> OS: CentOS 6
>> Pacemaker: Pacemaker 1.1.9-1512.el6
> 
> Upgrade.

If you can upgrade to the latest CentOS 6.7, you can get a much newer
Pacemaker. But Pacemaker is probably not what's limiting your node count;
the newer version's main benefit would be Pacemaker Remote support. (Of
course there are plenty of bug fixes and new features as well.)
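
Something like the following is usually all it takes (the package list is
the stock EL6 HA stack; adjust as needed):

  yum clean all
  yum update                       # brings the node up to 6.7
  yum install pacemaker cman pcs   # pulls in the matching corosync 1.4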

>> Corosync: Corosync Cluster Engine, version '2.3.2'
> 
> This is not supported on EL6 at all. Please stick with corosync 1.4 and
> use the cman plugin as the quorum provider.

CentOS is self-supported anyway, so if you're willing to handle your own
upgrades and such, nothing wrong with compiling. But corosync is up to
2.3.5 so you're already behind. :) I'd recommend compiling libqb 0.17.2
if you're compiling recent corosync and/or pacemaker.
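
If you do go the source route, build order matters: libqb first, then
corosync, then pacemaker, each with the usual autotools steps. Something
like this (the pacemaker version is only an example; pick whatever recent
release you want):

  cd libqb-0.17.2        && ./autogen.sh && ./configure && make && make install
  cd ../corosync-2.3.5   && ./autogen.sh && ./configure && make && make install
  cd ../pacemaker-1.1.13 && ./autogen.sh && ./configure && make && make install
  ldconfig   # so corosync and pacemaker find the freshly installed libqb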

Alternatively, CentOS 7 will have recent versions of everything.

>> Corosync configuration:
>> token: 1
>> #token_retransmits_before_loss_const: 10
>> consensus: 15000
>> join: 1000
>> send_join: 80
>> merge: 1000
>> downcheck: 2000
>> #rrp_problem_count_timeout: 5000
>> max_network_delay: 150 # for azure
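
For reference, those settings all belong in the totem { } block of
corosync.conf; written out they would look roughly like this (transport:
udpu is an assumption on my part, since multicast is usually unavailable in
cloud networks):

  totem {
      version: 2
      transport: udpu        # assumption: unicast, no multicast in Azure/EC2
      token: 1               # as quoted above; token is in ms, so this looks truncated
      consensus: 15000
      join: 1000
      send_join: 80
      merge: 1000
      downcheck: 2000
      max_network_delay: 150 # for azure
  }
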
>>
>>
>> Some logs:
>>
>> [...]
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not
>> applied to 1.9275.1: current "epoch" is greater than required
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
>> of an update diff failed (-1006)
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not
>> applied to 1.9275.1: current "epoch" is greater than required
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
>> of an update diff failed (-1006)
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not
>> applied to 1.9275.1: current "epoch" is greater than required
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
>> of an update diff failed (-1006)
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not
>> applied to 1.9275.1: current "epoch" is greater than required
>> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
>> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
>> of an update diff failed (-1006)
>> [...]
>>
>> [...]
>> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
>> cib_native_perform_op_delegate: Couldn't perform cib_query
>> operation (timeout=120s): Operation already in progress (-114)
>> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
>> get_cib_copy:   Couldnt retrieve the CIB
>> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
>> cib_native_perform_op_delegate: Couldn't perform cib_query
>> 

Re: [ClusterLabs] large cluster - failure recovery

2015-11-04 Thread Radoslaw Garbacz
Thank you Ken and Digimer for all your suggestions.


Re: [ClusterLabs] large cluster - failure recovery

2015-11-04 Thread Digimer
On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
> Hi,
> 
> I have a cluster of 32 nodes, and after some tuning I was able to get it
> started and running,

This is not supported by RH for a reason; it's hard to get the timing
right. SUSE supports up to 32 nodes, but they must be doing some serious
magic behind the scenes.

I would *strongly* recommend dividing this up into a few smaller
clusters... 8 nodes per cluster would be the max I'd feel comfortable with.
You need your cluster to solve more problems than it causes...

> but it does not recover from a node disconnect-connect failure.
> It regains quorum, but CIB does not recover to a synchronized state and
> "cibadmin -Q" times out.
> 
> Is there anything I can do with corosync or pacemaker parameters to make
> it recover from such a situation?
> (Everything works for smaller clusters.)
> 
> In my case it is OK for a node to disconnect (all the major resources
> are shut down)
> and later reconnect to the cluster (the running monitoring agent will
> clean up and restart major resources if needed),
> so I do not have STONITH configured.
> 
> Details:
> OS: CentOS 6
> Pacemaker: Pacemaker 1.1.9-1512.el6

Upgrade.

> Corosync: Corosync Cluster Engine, version '2.3.2'

This is not supported on EL6 at all. Please stick with corosync 1.4 and
use the cman plugin as the quorum provider.
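
In practice that means the packaged EL6 stack; with pcs the setup looks
roughly like this (node names are placeholders, and you would list all 32):

  yum install pacemaker cman pcs
  # pcs on EL6 writes /etc/cluster/cluster.conf and uses cman for quorum
  pcs cluster setup --name bigcluster node01 node02 node03
  pcs cluster start --all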

> Corosync configuration:
> token: 1
> #token_retransmits_before_loss_const: 10
> consensus: 15000
> join: 1000
> send_join: 80
> merge: 1000
> downcheck: 2000
> #rrp_problem_count_timeout: 5000
> max_network_delay: 150 # for azure
> 
> 
> Some logs:
> 
> [...]
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not
> applied to 1.9275.1: current "epoch" is greater than required
> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> of an update diff failed (-1006)
> [...]
> 
> [...]
> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
> cib_native_perform_op_delegate: Couldn't perform cib_query
> operation (timeout=120s): Operation already in progress (-114)
> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
> get_cib_copy:   Couldnt retrieve the CIB
> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
> cib_native_perform_op_delegate: Couldn't perform cib_query
> operation (timeout=120s): Operation already in progress (-114)
> Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
> get_cib_copy:   Couldnt retrieve the CIB
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 14 20 31 30 8 25 18 7 4
> Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [MAIN  ]
> Completed service synchronization, ready to provide service.
> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
> Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
> Members[32]: 14 20 31 30 8 25 18 7 4
> [...]
> 
> [...]
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
> update_cib_cache_cb:[cib_diff_notify] Patch aborted: Application of
> an update diff failed (-1006)
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
> apply_xml_diff: Digest mis-match: expected
> 01192e5118739b7c33c23f7645da3f45, calculated
> f8028c0c98526179ea5df0a2ba0d09de
> Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:  warning:
> cib_process_diff:   Diff 1.15046.2 -> 1.15046.3 from local not
> applied to 1.15046.2: