Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, 3 November 2015 18:02
> To: Nuno Pereira; 'Cluster Labs - All topics related to open-source
> clustering welcomed'
> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>
> On 11/03/2015 05:38 AM, Nuno Pereira wrote:
> > [...]
> >
> >> Your configuration looks appropriate, so it sounds like something is
> >> still starting the opensips services outside cluster control. Pacemaker
> >> recovers from multiple running instances by stopping them all, then
> >> starting on the expected node.
> >
> > Yesterday I removed pacemaker from starting on boot, and tested it: the
> > problem persists. Also, I checked the logs, and opensips wasn't started
> > on the PSIP-SRV01-passive machine, the one that was rebooted.
> > Is it possible to change that behaviour, as it is undesirable for our
> > environment? For example, to stop it on only one of the hosts.
> >
> >> You can verify that Pacemaker did not start the extra instances by
> >> looking for start messages in the logs (they will look like "Operation
> >> SRV01-opensips_start_0" etc.).
> >
> > On the rebooted node I don't see 2 starts, only 2 stops: a failed one
> > for the service that wasn't supposed to run there, and a normal one for
> > the service that was supposed to run there:
> >
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error: process_lrm_event:
> >   Operation SRV02-opensips_stop_0 (node=PSIP-SRV01-passive, call=52,
> >   status=4, cib-update=23, confirmed=true) Error
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice: process_lrm_event:
> >   Operation SRV01-opensips_stop_0: ok (node=PSIP-SRV01-passive, call=51,
> >   rc=0, cib-update=24, confirmed=true)
> >
> >> The other question is why did the stop command fail. The logs should
> >> shed some light on that too; look for the equivalent "_stop_0" operation
> >> and the messages around it. The resource agent might have reported an
> >> error, or it might have timed out.
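Ken's advice above amounts to filtering the cluster log for the relevant operation keys. A sketch of such a filter, run here against a captured excerpt (a hypothetical file name); on a real node point it at the actual log, whose path varies by distribution (e.g. /var/log/messages or /var/log/cluster/corosync.log on a cman-era stack):

```shell
# Save a log excerpt like the one quoted above to a scratch file.
cat > excerpt.log <<'EOF'
Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error: process_lrm_event: Operation SRV02-opensips_stop_0 (call=52, status=4) Error
Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice: process_lrm_event: Operation SRV01-opensips_stop_0: ok (call=51, rc=0)
Nov 02 23:01:24 [1689] PSIP-SRV01-passive lrmd: info: some unrelated line
EOF

# Show every start/stop operation for either opensips resource.
grep -E 'SRV0[12]-opensips_(start|stop)_0' excerpt.log
```

The same pattern with `_start_0` alone would confirm whether Pacemaker ever issued a start on the node in question.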
Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Monday, 2 November 2015 19:53
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>
> On 11/02/2015 01:24 PM, Nuno Pereira wrote:
> > Hi all.
> >
> > We have one cluster that has 9 nodes and 20 resources.
> >
> > Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
> > PSIP-SRV02-active and PSIP-SRV02-passive.
> >
> > They should provide an lsb:opensips service, 2 by 2:
> >
> > . The SRV01-opensips and SRV01-IP resources should be active on one of
> >   PSIP-SRV01-active or PSIP-SRV01-passive;
> >
> > . The SRV02-opensips and SRV02-IP resources should be active on one of
> >   PSIP-SRV02-active or PSIP-SRV02-passive.
> >
> > Everything works fine until the moment that one of those nodes is
> > rebooted. In the last case the problem occurred with a reboot of
> > PSIP-SRV01-passive, which wasn't providing the service at that moment.
> >
> > To be noted that all opensips nodes had the opensips service set to
> > start on boot by init.d, which was removed in the meantime.
> >
> > The problem is that the service SRV01-opensips is detected to be started
> > on both PSIP-SRV01-active and PSIP-SRV01-passive, and SRV02-opensips is
> > detected to be started on both PSIP-SRV01-active and PSIP-SRV02-active.
> >
> > After that, and several operations done by the cluster, which include
> > actions to stop SRV01-opensips on both PSIP-SRV01-active and
> > PSIP-SRV01-passive, and to stop SRV02-opensips on PSIP-SRV01-active and
> > PSIP-SRV02-active, which fail on PSIP-SRV01-passive, the resource
> > SRV01-opensips becomes unmanaged.
> >
> > Any ideas on how to fix this?
> >
> > Nuno Pereira
> >
> > G9Telecom
>
> Your configuration looks appropriate, so it sounds like something is
> still starting the opensips services outside cluster control. Pacemaker
> recovers from multiple running instances by stopping them all, then
> starting on the expected node.

Yesterday I removed pacemaker from starting on boot, and tested it: the
problem persists. Also, I checked the logs, and opensips wasn't started on
the PSIP-SRV01-passive machine, the one that was rebooted.

Is it possible to change that behaviour, as it is undesirable for our
environment? For example, to stop it on only one of the hosts.

> You can verify that Pacemaker did not start the extra instances by
> looking for start messages in the logs (they will look like "Operation
> SRV01-opensips_start_0" etc.).

On the rebooted node I don't see 2 starts, only 2 stops: a failed one for
the service that wasn't supposed to run there, and a normal one for the
service that was supposed to run there:

Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error: process_lrm_event:
  Operation SRV02-opensips_stop_0 (node=PSIP-SRV01-passive, call=52,
  status=4, cib-update=23, confirmed=true) Error
Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice: process_lrm_event:
  Operation SRV01-opensips_stop_0: ok (node=PSIP-SRV01-passive, call=51,
  rc=0, cib-update=24, confirmed=true)

> The other question is why did the stop command fail. The logs should
> shed some light on that too; look for the equivalent "_stop_0" operation
> and the messages around it. The resource agent might have reported an
> error, or it might have timed out.

I see this:

Nov 02 23:01:24 [1689] PSIP-SRV01-passive lrmd: warning: operation_finished:
  SRV02-opensips_stop_0:1983 - terminated with signal 15
Nov 02 23:01:24 [1689] PSIP-BBT01-passive lrmd: info: log_finished:
  finished - rsc:SRV02-opensips action:stop call_id:52 pid:1983
  exit-code:1 exec-time:79ms queue-time:0ms

As can be seen above, the call_id of the failed stop is greater than that of
the successful one, but it finishes first. Also, as both operations are
stopping the exact same service, the last one fails.
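Since both resources wrap the same init script, the script's exit codes are worth a close look: Pacemaker probes lsb resources by calling the script's status action and trusts its exit code (0 = running, 3 = stopped). A minimal sketch of compliant status semantics follows; the function name and pidfile path are assumptions for illustration, not taken from the opensips script:

```shell
# Sketch of an LSB-compliant "status" action. Pacemaker's probes rely on
# these exit codes: 0 means running, 3 means stopped. A script that returns
# 0 for a stopped daemon makes every probe report the resource as running.
opensips_status() {
    pidfile=$1    # hypothetical pidfile location
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return 0  # program is running
    fi
    return 3      # program is not running (must NOT be 0)
}

if opensips_status /nonexistent/opensips.pid; then
    echo "running"
else
    echo "status exit code for a stopped daemon: $?"
fi
# prints: status exit code for a stopped daemon: 3
```

If the real script returns 0 (or 1) from status when the daemon is down, that alone would explain Pacemaker "detecting" opensips as started on nodes where nothing ever launched it.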
And in the case of the one that failed, it wasn't supposed to be stopped or
started on that host, as configured.

Might it be related to some problem with the init.d script of opensips, like
an invalid result code, or something? I checked
http://refspecs.linuxbase.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
and didn't find any problem, but I might have missed some use case.

Nuno Pereira

G9Telecom

smime.p7s
Description: S/MIME cryptographic signature

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Multiple OpenSIPS services on one cluster
Hi all.

We have one cluster that has 9 nodes and 20 resources.

Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
PSIP-SRV02-active and PSIP-SRV02-passive.

They should provide an lsb:opensips service, 2 by 2:

. The SRV01-opensips and SRV01-IP resources should be active on one of
  PSIP-SRV01-active or PSIP-SRV01-passive;

. The SRV02-opensips and SRV02-IP resources should be active on one of
  PSIP-SRV02-active or PSIP-SRV02-passive.

The relevant configuration is the following:

Resources:
 Resource: SRV01-IP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.0.0.1 cidr_netmask=27
  Meta Attrs: target-role=Started
  Operations: monitor interval=8s (SRV01-IP-monitor-8s)
 Resource: SRV01-opensips (class=lsb type=opensips)
  Operations: monitor interval=8s (SRV01-opensips-monitor-8s)
 Resource: SRV02-IP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.0.0.2 cidr_netmask=27
  Operations: monitor interval=8s (SRV02-IP-monitor-8s)
 Resource: SRV02-opensips (class=lsb type=opensips)
  Operations: monitor interval=30 (SRV02-opensips-monitor-30)

Location Constraints:
  Resource: SRV01-opensips
    Enabled on: PSIP-SRV01-active (score:100) (id:prefer1-srv01-active)
    Enabled on: PSIP-SRV01-passive (score:99) (id:prefer3-srv01-active)
  Resource: SRV01-IP
    Enabled on: PSIP-SRV01-active (score:100) (id:prefer-SRV01-ACTIVE)
    Enabled on: PSIP-SRV01-passive (score:99) (id:prefer-SRV01-PASSIVE)
  Resource: SRV02-IP
    Enabled on: PSIP-SRV02-active (score:100) (id:prefer-SRV02-ACTIVE)
    Enabled on: PSIP-SRV02-passive (score:99) (id:prefer-SRV02-PASSIVE)
  Resource: SRV02-opensips
    Enabled on: PSIP-SRV02-active (score:100) (id:prefer-SRV02-ACTIVE)
    Enabled on: PSIP-SRV02-passive (score:99) (id:prefer-SRV02-PASSIVE)
Ordering Constraints:
  SRV01-IP then SRV01-opensips (score:INFINITY) (id:SRV01-opensips-after-ip)
  SRV02-IP then SRV02-opensips (score:INFINITY) (id:SRV02-opensips-after-ip)
Colocation Constraints:
  SRV01-opensips with SRV01-IP (score:INFINITY) (id:SRV01-opensips-with-ip)
  SRV02-opensips with SRV02-IP (score:INFINITY) (id:SRV02-opensips-with-ip)

Cluster Properties:
 cluster-infrastructure: cman
 .
 symmetric-cluster: false

Everything works fine until the moment that one of those nodes is rebooted.
In the last case the problem occurred with a reboot of PSIP-SRV01-passive,
which wasn't providing the service at that moment.

To be noted that all opensips nodes had the opensips service set to start on
boot by init.d, which was removed in the meantime.

The problem is that the service SRV01-opensips is detected to be started on
both PSIP-SRV01-active and PSIP-SRV01-passive, and SRV02-opensips is
detected to be started on both PSIP-SRV01-active and PSIP-SRV02-active.

After that, and several operations done by the cluster, which include actions
to stop SRV01-opensips on both PSIP-SRV01-active and PSIP-SRV01-passive, and
to stop SRV02-opensips on PSIP-SRV01-active and PSIP-SRV02-active, which fail
on PSIP-SRV01-passive, the resource SRV01-opensips becomes unmanaged.

Any ideas on how to fix this?

Nuno Pereira

G9Telecom
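For anyone reproducing this layout, the SRV01 half of the configuration above corresponds roughly to the following pcs commands. This is a sketch only: names and scores are taken from the posted config, but the exact pcs syntax depends on the pcs version in use, so verify against `pcs --help` before running anything.

```shell
# Resources: the floating IP and the opensips init-script wrapper.
pcs resource create SRV01-IP ocf:heartbeat:IPaddr2 \
    ip=10.0.0.1 cidr_netmask=27 op monitor interval=8s
pcs resource create SRV01-opensips lsb:opensips op monitor interval=8s

# Prefer the active node; fall back to the passive one.
pcs constraint location SRV01-opensips prefers PSIP-SRV01-active=100
pcs constraint location SRV01-opensips prefers PSIP-SRV01-passive=99
pcs constraint location SRV01-IP prefers PSIP-SRV01-active=100
pcs constraint location SRV01-IP prefers PSIP-SRV01-passive=99

# Bring the IP up first, and keep service and IP together.
pcs constraint order SRV01-IP then SRV01-opensips
pcs constraint colocation add SRV01-opensips with SRV01-IP INFINITY
```

With symmetric-cluster=false, resources only run where a location constraint enables them, which is why each resource gets exactly two "prefers" entries.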