> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, November 3, 2015 6:02 PM
> To: Nuno Pereira; 'Cluster Labs - All topics related to open-source clustering welcomed'
> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>
> On 11/03/2015 05:38 AM, Nuno Pereira wrote:
> >> -----Original Message-----
> >> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >> Sent: Monday, November 2, 2015 7:53 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
> >>
> >> On 11/02/2015 01:24 PM, Nuno Pereira wrote:
> >>> Hi all.
> >>>
> >>> We have one cluster with 9 nodes and 20 resources.
> >>>
> >>> Four of those nodes are PSIP-SRV01-active, PSIP-SRV01-passive,
> >>> PSIP-SRV02-active and PSIP-SRV02-passive.
> >>> They should provide an lsb:opensips service, two by two:
> >>>
> >>> . The SRV01-opensips and SRV01-IP resources should be active on one
> >>> of PSIP-SRV01-active or PSIP-SRV01-passive;
> >>> . The SRV02-opensips and SRV02-IP resources should be active on one
> >>> of PSIP-SRV02-active or PSIP-SRV02-passive.
> >>>
> >>> Everything works fine until the moment one of those nodes is
> >>> rebooted. In the latest case the problem occurred with a reboot of
> >>> PSIP-SRV01-passive, which wasn't providing the service at that moment.
> >>>
> >>> Note that all opensips nodes had the opensips service started on
> >>> boot by init, which has since been removed.
> >>> The problem is that the service SRV01-opensips is detected as started
> >>> on both PSIP-SRV01-active and PSIP-SRV01-passive, and SRV02-opensips
> >>> is detected as started on both PSIP-SRV01-active and PSIP-SRV02-active.
> >>>
> >>> After that, and several operations done by the cluster, which
> >>> include actions to stop SRV01-opensips on both PSIP-SRV01-active
> >>> and PSIP-SRV01-passive, and to stop SRV02-opensips on
> >>> PSIP-SRV01-active and PSIP-SRV02-active, and which fail on
> >>> PSIP-SRV01-passive, the resource SRV01-opensips becomes unmanaged.
> >>>
> >>> Any ideas on how to fix this?
> >>>
> >>> Nuno Pereira
> >>> G9Telecom
> >>
> >> Your configuration looks appropriate, so it sounds like something is
> >> still starting the opensips services outside cluster control.
> >> Pacemaker recovers from multiple running instances by stopping them
> >> all, then starting on the expected node.
> > Yesterday I removed pacemaker from starting on boot and tested it:
> > the problem persists.
> > Also, I checked the logs, and opensips wasn't started on the
> > PSIP-SRV01-passive machine, the one that was rebooted.
> > Is it possible to change that behaviour, as it is undesirable for our
> > environment? For example, to stop it on only one of the hosts.
> >
> >> You can verify that Pacemaker did not start the extra instances by
> >> looking for start messages in the logs (they will look like
> >> "Operation SRV01-opensips_start_0" etc.).
> > On the rebooted node I don't see 2 starts, only 2 stops: a failed one
> > for the service that wasn't supposed to run there, and a normal one
> > for the service that was supposed to run there:
> >
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error:
> > process_lrm_event: Operation SRV02-opensips_stop_0
> > (node=PSIP-SRV01-passive, call=52, status=4, cib-update=23,
> > confirmed=true) Error
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice:
> > process_lrm_event: Operation SRV01-opensips_stop_0: ok
> > (node=PSIP-SRV01-passive, call=51, rc=0, cib-update=24,
> > confirmed=true)
> >
> >> The other question is why the stop command failed.
> >> The logs should shed some light on that too; look for the equivalent
> >> "_stop_0" operation and the messages around it. The resource agent
> >> might have reported an error, or it might have timed out.
> > I see this:
> >
> > Nov 02 23:01:24 [1689] PSIP-SRV01-passive lrmd: warning:
> > operation_finished: SRV02-opensips_stop_0:1983 - terminated with
> > signal 15
> > Nov 02 23:01:24 [1689] PSIP-BBT01-passive lrmd: info: log_finished:
> > finished - rsc: SRV02-opensips action:stop call_id:52 pid:1983
> > exit-code:1 exec-time:79ms queue-time:0ms
> >
> > As can be seen above, the call_id of the failed stop is greater than
> > that of the successful one, but it finishes first.
> > Also, as both operations are stopping the exact same service, the
> > last one fails. And the one that fails wasn't supposed to be stopped
> > or started on that host, per the configuration.
>
> I think I see what's happening. I overlooked that SRV01-opensips and
> SRV02-opensips are using the same LSB init script. That means Pacemaker
> can't distinguish one instance from the other. If it runs "status" for
> one instance, it will return "running" if *either* instance is running.
> If it tries to stop one instance, that will stop whichever one is running.
>
> I don't know what version of Pacemaker you're running, but 1.1.13 has a
> "resource-discovery" feature that could be used to make Pacemaker ignore
> SRV01-opensips on the nodes that run SRV02-opensips, and vice versa:
> http://blog.clusterlabs.org/blog/2014/feature-spotlight-controllable-resource-discovery/

That sounds consistent with what we have seen. Unfortunately, I'm using
version 1.1.12-4 from yum on CentOS 6.2, so I don't have that option. I
may test whether it's available (is there any pcs command to check it?),
but I would need to clone some hosts for that.
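For the record, if the resource-discovery option is available, the idea
from the blog post could be sketched roughly as below with pcs. The node
and resource names are the ones from this thread; the exact pcs syntax
may differ between versions, so treat this as an untested sketch:

```shell
# Sketch: ban each opensips resource from the other pair's nodes AND
# tell Pacemaker not to probe it there (resource-discovery=never,
# Pacemaker >= 1.1.13), so the shared init script is never "status"-ed
# for the wrong instance.
pcs constraint location add srv01-no-probe-a SRV01-opensips PSIP-SRV02-active  -INFINITY resource-discovery=never
pcs constraint location add srv01-no-probe-p SRV01-opensips PSIP-SRV02-passive -INFINITY resource-discovery=never
pcs constraint location add srv02-no-probe-a SRV02-opensips PSIP-SRV01-active  -INFINITY resource-discovery=never
pcs constraint location add srv02-no-probe-p SRV02-opensips PSIP-SRV01-passive -INFINITY resource-discovery=never
```

The same would apply to the SRV01-IP and SRV02-IP resources if they are
colocated with the opensips ones.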
> Alternatively, you could clone the LSB resource instead of having two,
> but that would be tricky with your other requirements. What are your
> reasons for wanting to restrict each instance to two specific nodes,
> rather than letting Pacemaker select any two of the four nodes to run
> the resources?

SRV01-opensips and SRV02-opensips are actually the same service with
different configurations, created for different purposes and used by
different clients, and they shouldn't run on the same host. If I ran
SRV01-opensips on a SRV02 host, those clients wouldn't have service, and
vice versa. I think I'll go with this approach.

> Another option would be to write an OCF script to use instead of the
> LSB one. You'd need to add a parameter to distinguish the two instances
> (maybe the IP it's bound to?), and make start/stop/status operate only
> on the specified instance. That way, Pacemaker could run "status" for
> both instances and get the right result for each. It looks like someone
> did write one a while back, but it needs work (I notice stop always
> returns success, which is bad): http://anders.com/cms/259

I already had it here, and it suffers from branding problems (references
to OpenSER instead of OpenSIPS, etc.). It doesn't seem to work:

# ocf-tester -o ip=127.0.0.1 -n OpenSIPS /usr/lib/ocf/resource.d/anders.com/OpenSIPS
Beginning tests for /usr/lib/ocf/resource.d/anders.com/OpenSIPS...
* rc=7: Monitoring an active resource should return 0
* rc=7: Probing an active resource should return 0
* rc=7: Monitoring an active resource should return 0
* rc=7: Monitoring an active resource should return 0
Tests failed: /usr/lib/ocf/resource.d/anders.com/OpenSIPS failed 4 tests

> > Might it be related to any problem with the init.d script of
> > opensips, like an invalid result code or something? I checked
> > http://refspecs.linuxbase.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
> > and didn't find any problem, but I might have missed some use case.
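On the LSB question: the part of the spec Pacemaker actually leans on is
a small set of exit codes. A tiny helper, just to make the expectations
concrete (the 0/3 status codes are from the LSB spec; the helper itself
is purely illustrative, not part of Pacemaker):

```shell
# Interpret an init script's "status" exit code the way Pacemaker does
# (simplified; see the LSB appendix of Pacemaker Explained):
#   0 -> service is running        3 -> service is stopped
# Anything else is treated as an error state.
lsb_status_desc() {
    case "$1" in
        0) echo "running" ;;
        3) echo "stopped" ;;
        *) echo "non-compliant status code: $1" ;;
    esac
}

# Other expectations that matter to Pacemaker:
#   "start" on an already-running service must exit 0
#   "stop" on an already-stopped service must exit 0
```

Running the script's status action by hand on a node where the instance
is stopped and checking the code against this table is a quick first test.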
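Going back to the OCF-agent idea Ken outlined: the key point is that
monitor and stop must act on one specific instance only. A minimal
sketch, assuming each OpenSIPS instance is launched with its own pidfile
(a hypothetical convention for illustration, not something the
anders.com agent does):

```shell
# Instance-aware monitor/stop sketch. Each instance is identified by a
# dedicated pidfile (hypothetical convention), so "monitor" for one
# instance cannot report the other instance as running.
# OCF exit codes: 0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING.
opensips_monitor() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 7           # no pidfile: not running
    pid=$(cat "$pidfile")
    kill -0 "$pid" 2>/dev/null && return 0  # process alive: running
    return 7                                # stale pidfile: not running
}

opensips_stop() {
    pidfile="$1"
    opensips_monitor "$pidfile" || return 0 # already stopped is success
    kill "$(cat "$pidfile")"
    # a real agent would poll here and escalate to SIGKILL on timeout,
    # returning non-zero only if the process survives
}
```

With this shape, Pacemaker can run "status" for both instances on the
same node and get the right answer for each, which is exactly what the
shared LSB script cannot do.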
> You can follow this guide to verify the script's LSB compliance as far
> as it matters to Pacemaker:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb

I can't fully test it right now, but in at least one case the script
doesn't work well. OpenSIPS requires the IP to be present, or it doesn't
start. On a host without the HA IP, the script returns 0, but the
process dies and isn't running one second later. That doesn't help.

Nuno Pereira
G9Telecom
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org