On 11/03/2015 01:40 PM, Nuno Pereira wrote:
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Tuesday, November 3, 2015 18:02
>> To: Nuno Pereira; 'Cluster Labs - All topics related to open-source
>> clustering welcomed'
>> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>>
>> On 11/03/2015 05:38 AM, Nuno Pereira wrote:
>>>> -----Original Message-----
>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>>> Sent: Monday, November 2, 2015 19:53
>>>> To: users@clusterlabs.org
>>>> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>>>>
>>>> On 11/02/2015 01:24 PM, Nuno Pereira wrote:
>>>>> Hi all.
>>>>>
>>>>> We have one cluster with 9 nodes and 20 resources.
>>>>>
>>>>> Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
>>>>> PSIP-SRV02-active and PSIP-SRV02-passive.
>>>>> They should provide an lsb:opensips service, two by two:
>>>>> . The SRV01-opensips and SRV01-IP resources should be active on one
>>>>> of PSIP-SRV01-active or PSIP-SRV01-passive;
>>>>> . The SRV02-opensips and SRV02-IP resources should be active on one
>>>>> of PSIP-SRV02-active or PSIP-SRV02-passive.
>>>>>
>>>>> Everything works fine until the moment one of those nodes is
>>>>> rebooted. In the last case the problem occurred with a reboot of
>>>>> PSIP-SRV01-passive, which wasn't providing the service at that moment.
>>>>>
>>>>> Note that all opensips nodes had the opensips service set to start on
>>>>> boot by initd; that has since been removed.
>>>>> The problem is that the service SRV01-opensips is detected as started
>>>>> on both PSIP-SRV01-active and PSIP-SRV01-passive, and SRV02-opensips
>>>>> is detected as started on both PSIP-SRV01-active and PSIP-SRV02-active.
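[The placement described above can be expressed with pcs constraints. A sketch only, using the resource and node names from this thread; scores and IDs are illustrative and should be checked against the actual CIB:]

```shell
# Sketch: pin each opensips instance to its own active/passive pair,
# keep it with its floating IP, and start the IP first.
pcs constraint location SRV01-opensips prefers PSIP-SRV01-active PSIP-SRV01-passive
pcs constraint location SRV02-opensips prefers PSIP-SRV02-active PSIP-SRV02-passive

pcs constraint colocation add SRV01-opensips with SRV01-IP INFINITY
pcs constraint order SRV01-IP then SRV01-opensips
pcs constraint colocation add SRV02-opensips with SRV02-IP INFINITY
pcs constraint order SRV02-IP then SRV02-opensips
```

[Note that in a symmetric cluster "prefers" only biases placement; to forbid the other pair outright, add matching "avoids" constraints, e.g. `pcs constraint location SRV01-opensips avoids PSIP-SRV02-active PSIP-SRV02-passive`.]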
>>>>>
>>>>> After that, and after several operations done by the cluster
>>>>> (including actions to stop SRV01-opensips on both PSIP-SRV01-active
>>>>> and PSIP-SRV01-passive, and to stop SRV02-opensips on
>>>>> PSIP-SRV01-active and PSIP-SRV02-active, which fail on
>>>>> PSIP-SRV01-passive), the resource SRV01-opensips becomes unmanaged.
>>>>>
>>>>> Any ideas on how to fix this?
>>>>>
>>>>> Nuno Pereira
>>>>> G9Telecom
>>>>
>>>> Your configuration looks appropriate, so it sounds like something is
>>>> still starting the opensips services outside cluster control. Pacemaker
>>>> recovers from multiple running instances by stopping them all, then
>>>> starting on the expected node.
>>> Yesterday I removed pacemaker from starting on boot, and tested it:
>>> the problem persists.
>>> Also, I checked the logs and opensips wasn't started on the
>>> PSIP-SRV01-passive machine, the one that was rebooted.
>>> Is it possible to change that behaviour, as it is undesirable for our
>>> environment? For example, only to stop it on one of the hosts.
>>>
>>>> You can verify that Pacemaker did not start the extra instances by
>>>> looking for start messages in the logs (they will look like "Operation
>>>> SRV01-opensips_start_0" etc.).
>>> On the rebooted node I don't see 2 starts, only 2 stops: a failed one,
>>> for the service that wasn't supposed to run there, and a normal one for
>>> the service that was supposed to run there:
>>>
>>> Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error:
>>> process_lrm_event: Operation SRV02-opensips_stop_0 (node=PSIP-SRV01-passive,
>>> call=52, status=4, cib-update=23, confirmed=true) Error
>>> Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice:
>>> process_lrm_event: Operation SRV01-opensips_stop_0: ok (node=PSIP-SRV01-passive,
>>> call=51, rc=0, cib-update=24, confirmed=true)
>>>
>>>> The other question is why did the stop command fail.
>>>> The logs should shed some light on that too; look for the equivalent
>>>> "_stop_0" operation and the messages around it. The resource agent
>>>> might have reported an error, or it might have timed out.
>>> I see this:
>>>
>>> Nov 02 23:01:24 [1689] PSIP-SRV01-passive lrmd: warning:
>>> operation_finished: SRV02-opensips_stop_0:1983 - terminated with signal 15
>>> Nov 02 23:01:24 [1689] PSIP-BBT01-passive lrmd: info: log_finished:
>>> finished - rsc: SRV02-opensips action:stop call_id:52 pid:1983 exit-code:1
>>> exec-time:79ms queue-time:0ms
>>>
>>> As can be seen above, the call_id of the failed stop is greater than
>>> that of the successful one, but it finishes first.
>>> Also, as both operations are stopping the exact same service, the last
>>> one fails. And the one that fails wasn't supposed to be stopped or
>>> started on that host, as configured.
>>
>> I think I see what's happening. I overlooked that SRV01-opensips and
>> SRV02-opensips are using the same LSB init script. That means Pacemaker
>> can't distinguish one instance from the other. If it runs "status" for
>> one instance, it will return "running" if *either* instance is running.
>> If it tries to stop one instance, that will stop whichever one is running.
>>
>> I don't know what version of Pacemaker you're running, but 1.1.13 has a
>> feature, "resource-discovery", that could be used to make Pacemaker
>> ignore SRV01-opensips on the nodes that run SRV02-opensips, and vice
>> versa:
>> http://blog.clusterlabs.org/blog/2014/feature-spotlight-controllable-resource-discovery/
> That sounds consistent with what we have seen.
> Unfortunately I'm using version 1.1.12-4 from yum on CentOS 6.2, so I
> don't have that option.
> I may test if it's available (is there any pcs command to check it?), but
> I would need to clone some hosts for that.
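[For reference, the resource-discovery feature mentioned above is an option on location constraints. A sketch, assuming Pacemaker >= 1.1.13 with a pcs version that exposes the option; constraint IDs are made up for illustration:]

```shell
# Ban each instance from the other pair's nodes AND skip probing it there,
# so the shared LSB script's ambiguous "status" is never consulted.
pcs constraint location add no-srv01-on-02a SRV01-opensips PSIP-SRV02-active -INFINITY resource-discovery=never
pcs constraint location add no-srv01-on-02p SRV01-opensips PSIP-SRV02-passive -INFINITY resource-discovery=never
pcs constraint location add no-srv02-on-01a SRV02-opensips PSIP-SRV01-active -INFINITY resource-discovery=never
pcs constraint location add no-srv02-on-01p SRV02-opensips PSIP-SRV01-passive -INFINITY resource-discovery=never
```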
6.2 doesn't have it; I'm not sure about 6.7; I think 7.1 does.

>> Alternatively, you could clone the LSB resource instead of having two,
>> but that would be tricky with your other requirements. What are your
>> reasons for wanting to restrict each instance to two specific nodes,
>> rather than letting Pacemaker select any two of the four nodes to run
>> the resources?
> SRV01-opensips and SRV02-opensips are actually the same service with
> different configurations, created for different purposes, used by
> different clients, and they shouldn't run on the same host.
> If I ran SRV01-opensips on a SRV02 host, clients wouldn't have service,
> and vice versa.
>
> I think I'll go with this.

The unusual part would be referencing a particular clone instance in
constraints (opensips:0 and opensips:1 instead of just opensips). I've
never done that, but it's worth trying.

>> Another option would be to write an OCF script to use instead of the
>> LSB one. You'd need to add a parameter to distinguish the two instances
>> (maybe the IP it's bound to?), and make start/stop/status operate only
>> on the specified instance. That way, Pacemaker could run "status" for
>> both instances and get the right result for each. It looks like someone
>> did write one a while back, but it needs work (I notice stop always
>> returns success, which is bad): http://anders.com/cms/259
> I already had it here, and it suffers from branding problems (references
> to OpenSER instead of OpenSIPS, etc.).
> It doesn't seem to work:
>
> # ocf-tester -o ip=127.0.0.1 -n OpenSIPS /usr/lib/ocf/resource.d/anders.com/OpenSIPS
> Beginning tests for /usr/lib/ocf/resource.d/anders.com/OpenSIPS...
> * rc=7: Monitoring an active resource should return 0
> * rc=7: Probing an active resource should return 0
> * rc=7: Monitoring an active resource should return 0
> * rc=7: Monitoring an active resource should return 0
> Tests failed: /usr/lib/ocf/resource.d/anders.com/OpenSIPS failed 4 tests

Yes, it would definitely need some development work.

>>> Might it be related to any problem with the init.d script of opensips,
>>> like an invalid result code, or something? I checked
>>> http://refspecs.linuxbase.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
>>> and didn't find any problem, but I might have missed some use case.
>>
>> You can follow this guide to verify the script's LSB compliance as far
>> as it matters to Pacemaker:
>>
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
> I can't fully test it right now, but at least in one case the script
> doesn't work well.
> OpenSIPS requires the IP or it doesn't start. On a host without the HA
> IP, the script returns 0 but the process dies and isn't running one
> second later. That doesn't help.
>
> Nuno Pereira
> G9Telecom

That's not ideal, but it wouldn't be a problem, because you've got
colocation/ordering constraints that ensure Pacemaker won't try to start
opensips unless the IP is up.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
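[On the OCF-agent idea discussed in this thread (a per-instance parameter so status and stop touch only their own instance), a minimal sketch of the status logic. It assumes each opensips instance is started with its own pidfile, e.g. via `opensips -P <pidfile>`; the function and variable names are illustrative, not from any existing agent:]

```shell
#!/bin/sh
# Instance-aware status check: each instance is identified by its own
# pidfile, so probing one instance can never "see" the other one.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

opensips_status() {
    pidfile="$1"
    # No pidfile: this instance was never started, or was stopped cleanly.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pidfile present but process gone: report "not running" rather than
    # success, so Pacemaker's probes get an accurate answer.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
```

[A stop action built on the same check must confirm the process is really gone before returning 0, unlike the script at anders.com/cms/259, which always reports success on stop.]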