[ClusterLabs] Need advice: deep pacemaker integration, best approach?

2024-06-09 Thread alexey
Hi All,

 

We intend to integrate Pacemaker as the failover engine of a very specific
product. A hand-built prototype works pretty well. It includes a couple of
dozen coordinated resources that implement one target application instance
together with its full network configuration. The prototype was built with the
pcs shell, but that process is far too complex and error-prone for mass
rollout by field engineers.

 

Our goal is to develop a kind of configuration shell that lets a user set up,
monitor and manage application instances as entities, not as a set of cluster
resources. That is, the user deals with application settings and status, and
the shell translates them into resource configuration and status and back.

 

The shell will be written in Python, as that suits us best for now. The
question for me: what is the best approach to put Pacemaker under the hood?
I did not consider building it on top of pcs, since pcs output is quite hard
to parse, so I will have to use a more machine-friendly interface to Pacemaker
for sure; the question is which one fits our needs best.

 

It seems the best way is to use custom resource agents, XML structures and
cibadmin to manage the configuration and get status information. However, it
is not clear: should cibadmin be used exclusively, or are there other APIs for
pulling and pushing the Pacemaker configuration?
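For illustration, the kind of CLI round trip I currently have in mind (a sketch only, using the stock Pacemaker tools; exact options may differ between versions):

# Pull the whole CIB, or only one section of it, as XML
cibadmin --query > cib.xml
cibadmin --query --scope resources > resources.xml

# Edit the XML offline, then push the section back in one shot
cibadmin --replace --scope resources --xml-file resources.xml

# Machine-readable cluster status (recent Pacemaker versions)
crm_mon --one-shot --output-as=xml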

 

Also, it is not clear how to manage resource errors and cleanup. Are there
other ways than calling crm_resource for cleanup and for restarting a failed
resource? Could it be done via CIB manipulation, e.g. by forcibly deleting
lrm history records?
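For context, the crm_resource calls I mean are roughly these (a sketch only; the resource name is illustrative and the options should be checked against the installed version):

# Clear failure history and re-probe a resource after an error
crm_resource --cleanup --resource my_app

# Force a restart of a resource
crm_resource --restart --resource my_app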

 

I understand that the source code is the ultimate answer to any question, but
I would be very grateful for any advice from those who have the answers at
their fingertips.

 

Thank you in advance for sharing your thoughts and experience!

 

Sincerely,

 

Alex

 

 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Is XSD definition of CIB available?

2024-06-06 Thread alexey
Hi all,

 

Is there an XSD schema for the Pacemaker CIB available as a document, to see
the full XML syntax and definitions?

 

I tried searching the sources, but without success.
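So far the closest thing I have spotted are RELAX NG (.rng) schemas rather than XSD; if those are indeed the authoritative definition, something like this should locate and use them on a packaged install (a sketch, paths illustrative):

# Schemas shipped with the Pacemaker packages
ls /usr/share/pacemaker/*.rng

# Validate the live configuration against the schema it declares
crm_verify --live-check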

 

Thank you in advance!

 

Alex

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-17 Thread alexey
Hi Alexander,

 

AFAIK, Pacemaker itself deals only with an XML-based configuration database (the
CIB), shared across the whole cluster. Each time you call pcs or any other tool,
it takes the XML (or part of it) from Pacemaker, tweaks it and then pushes it
back. Each time XML is pushed, Pacemaker re-evaluates the new configuration,
looks at the current state, and schedules the transition from the current state
to the target state. I can't point you to the exact place in the docs where this
is described, but it comes from the Pacemaker documentation.
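By the way, if you want to preview what Pacemaker would schedule for a given configuration before pushing it, crm_simulate can show the transition (a sketch; option spellings may vary between versions):

# Preview the transition Pacemaker would compute for an edited CIB file
crm_simulate --xml-file new.xml --simulate

# Or inspect what it would do with the live cluster state
crm_simulate --live-check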

 

Therefore, each pcs command triggers this process immediately, and it seems some
async side effects can result from that. On the other hand, you may make ANY
number of changes in one stroke if Pacemaker receives the new config with all of
those changes already in it. So you need to make the management tools FIRST
prepare all the changes and THEN push them all at once. Then there is no need to
run separate changes in the background, because the preparation is very fast,
and the final application will also be done as fast as possible.

 

Miroslav showed an example for bulk delete, but this is the general way to
manage any massive change. Any operations can be done this way! You dump the
Pacemaker CIB to a file, apply all the changes against the file instead of
writing each one to the CIB, and then push the result back; Pacemaker will then
schedule all the changes at once.

 

You may mix ANY commands: add, change, delete; just use the -f option so the
changes are applied against the file. You may keep the original to push a diff
(as in Miroslav's example), or you may just push the whole changed config;
AFAIK, there is no difference.

 

###

# Make a copy of CIB into local file

pcs cluster cib config.xml

 

# do changes against file

pcs -f config.xml resource create 

 

pcs -f config.xml constraint

 

pcs -f config.xml resource disable 

 

pcs -f config.xml resource remove 

 

# And finally push the whole 'configuration' scope back (note: no diff here,
# we push only the config scope)

pcs cluster cib-push config.xml --config

 



 

And Pacemaker applies all the changes at once.

 

Miroslav's example is taken from the pcs man page, for the 'cluster cib-push'
command. My example works too.

 

Have a good failover! Means no failover at all )))

 

Alex

 

 

From: Users  On Behalf Of Александр Руденко
Sent: Friday, May 17, 2024 6:46 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Disabled resources after parallel removing of group

 

Miroslav, thank you! 

It helps me understand that it's not a configuration issue.

BTW, is it okay to create new resources in parallel?

On timeline it looks like:

pcs resource create resA1  --group groupA

pcs resource create resB1  --group groupB
resA1 Started
pcs resource create resA2  --group groupA

res B1 Started
pcs resource create resB2  --group groupB

res A2 Started

res B2 Started

 

For now, it works okay)

In our case, cluster events like 'create' and 'remove' are generated by users,
and for now we don't have any queue for operations. But now I realize that we
need a queue for 'remove' operations. Maybe we need a queue for 'create'
operations too?

 

Fri, 17 May 2024 at 17:49, Miroslav Lisik <mli...@redhat.com>:

Hi Aleksandr!

It is not safe to use `pcs resource remove` command in parallel because
you run into the same issues as you already described. Processes run by
remove command are not synchronized.

Unfortunately, remove command does not support more than one resource
yet.

If you really need to remove resources at once you can use this method:
1. get the current cib configuration:
pcs cluster cib > original.xml

2. create a new copy of the file:
cp original.xml new.xml

3. disable all to be removed resources using -f option and new
configuration file:
pcs -f new.xml resource disable ...

4. remove resources using -f option and new configuration file:
pcs -f new.xml resource remove 
...

5. push new cib configuration to the cluster
pcs cluster cib-push new.xml diff-against=original.xml


On 5/17/24 13:47, Александр Руденко wrote:
> Hi!
> 
> I am new in the pacemaker world, and I, unfortunately, have problems 
> with simple actions like group removal. Please, help me understand when 
> I'm wrong.
> 
> For simplicity I will use standard resources like IPaddr2 (but we have 
> this problem on any type of our custom resources).
> 
> I have 5 groups like this:
> 
> Full List of Resources:
>* Resource Group: group-1:
>  * ip-11 (ocf::heartbeat:IPaddr2): Started vdc16
>  * ip-12 (ocf::heartbeat:IPaddr2): Started vdc16
>* Resource Group: group-2:
>  * ip-21 (ocf::heartbeat:IPaddr2): Started vdc17
>  * ip-22 (ocf::heartbeat:IPaddr2): Started vdc17
>* Resource Group: group-3:
>  * ip-31 (ocf::heartbeat:IPaddr2): Started vdc18
>  * ip-32 (ocf::heartbeat:IPaddr2): Started vdc18
>* Resource 

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-03 Thread alexey
Hi,

> > Also, I've done wireshark capture and found great mess in TCP, it
> > seems like connection between qdevice and qnetd really stops for some
> > time and packets won't deliver.
> 
> Could you check UDP? I guess there is a lot of UDP packets sent by corosync
> which probably makes TCP to not go thru.
Very improbable. UDP itself can't prevent TCP from working, and 1 Gbps links
seem too wide for corosync to overload them.
Also, an overload usually leads to SOME packets being dropped, but this is an
entirely different case: NO TCP packets pass at all. I have two captures from
the two sides, and I can see that for some time each party sends TCP packets,
but the other party does not receive them at all.

> >
> > For my guess, it match corosync syncing activities, and I suspect that
> > corosync prevent any other traffic on the interface it use for rings.
> >
> > As I switch qnetd and qdevice to use different interface it seems to
> > work fine.
> 
> Actually having dedicated interface just for corosync/knet traffic is optimal
> solution. qdevice+qnetd on the other hand should be as close to "customer" as
> possible.
> 
I am sure qnetd is not intended as a proof of network reachability; it is only
an arbiter that provides quorum resolution. Therefore, in my view it is better
to keep it on the intra-cluster network with a high-priority transport. If we
need a solution based on network reachability, there are other ways to provide
it (see the sketch below).
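For example, one common approach I am aware of is a cloned ping resource plus a location rule (a sketch only; the gateway address and resource names are illustrative):

# Clone a ping resource that monitors reachability of a reference host
pcs resource create net-check ocf:pacemaker:ping host_list=192.168.100.254 op monitor interval=10s clone

# Keep the app away from nodes where the ping attribute is missing or zero
pcs constraint location my_app rule score=-INFINITY pingd lt 1 or not_defined pingd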

> So if you could have two interfaces (one just for corosync, second for
> qnetd+qdevice+publicly accessible services) it might be a solution?
> 
Yes, this way it works, but I wish to know WHY it won't work on the shared
interface.

> > So, the question is: does corosync really temporary blocks any other
> > traffic on the interface it uses? Or it is just a coincidence? If it
> > blocks, is
> 
> Nope, no "blocking". But it sends quite some few UDP packets and I guess it can
> really use all available bandwidth so no TCP goes thru.
Use up all of the available 1 Gbps? Impossible.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-03 Thread alexey
Hi,

> > Thanks great for your suggestion, probably I need to think about this
> > way too, however, the project environment is not a good one to rely on
> > fencing and, moreover, we can't control the bottom layer a trusted
> > way.
> 
> That is a problem. A VM being gone is not the only possible failure scenario. 
> For
> example, a kernel or device driver issue could temporarily freeze the node, or
> networking could temporarily drop out, causing the node to appear lost to
> Corosync, but the node could be responsive again (with the app running) after 
> the
> app has been started on the other node.
> 
> If there's no problem with the app running on both nodes at the same time, 
> then
> that's fine, but that's rarely the case. If an IP address is needed, or 
> shared storage
> is used, simultaneous access will cause problems that only fencing can avoid.
Pacemaker takes a very pessimistic approach if you set resources to require
quorum.
If a network outage triggers changes, quorum is lost first, and only afterwards
does the cluster try to rebuild it. Therefore there are two questions:
1. How to keep the active app running?
2. How to prevent two copies from being started.
In my view, quorum-dependent resource management performs well on both points
(a sketch of the settings I mean follows below).
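To be explicit about what I mean by "requiring quorum", roughly (a sketch, assuming pcs; the resource name is illustrative and the meta-attribute handling should be double-checked against the docs):

# Stop resources when the partition has no quorum (the default policy)
pcs property set no-quorum-policy=stop

# Make the resource itself require quorum in order to be started
pcs resource meta my_app requires=quorum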

> > my goal is to keep the app from moves (e.g. restarts) as long as
> > possible. This means only two kinds of moves accepted: current host
> > fail (move to other with restart) or admin move (managed move at
> > certain time with restart). Any other troubles should NOT trigger app
> > down/restart. Except of total connectivity loss where no second node,
> > no arbiter => stop service.
> 
> Total connectivity loss may not be permanent. Fencing ensures the connectivity
> will not be restored after the app is started elsewhere.
Nothing bad happens if connectivity is restored and the node is alive, as long
as it has already taken the app down because of lost quorum.

 
> Pacemaker 2.0.4 and later supports priority-fencing-delay which allows the 
> node
> currently running the app to survive. The node not running the app will wait 
> the
> configured amount of time before trying to fence the other node. Of course 
> that
> does add more time to the recovery if the node running the app is really gone.
I am not sure I understand how that works.
Imagine connectivity is lost only between the nodes, but not to the other
parties, and Node1 runs the app. Everything is fine, Node2 is off.
So we start Node2 with the intention of restoring the cluster.
Node2 starts, tries to find its partner, fails, and fences Node1 out,
while Node1 does not even know that Node2 has started.

Is that correct?
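For reference, this is how I understand such a delay would be configured (a sketch only; priority-fencing-delay is the documented cluster property, while the resource name and values are illustrative):

# Give the app resource a priority so its node is preferred to survive
pcs resource meta my_app priority=10

# The node NOT running the prioritized resource waits before fencing its peer
pcs property set priority-fencing-delay=20s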

> >
> > Therefore, quorum-based management seems better way for my exact case.
> 
> Unfortunately it's unsafe without fencing.
You may say I am stupid, but I really can't understand why quorum-based
resource management is unreliable without fencing.
Could a host hold quorum a bit longer, while another host has already gained
quorum and started the app? Probably it could.
But fencing is not immediate either, so it cannot protect 100% against
short-lived parallel runs.

> That does complicate the situation. Ideally there would be some way to request
> the VM to be immediately destroyed (whether via fence_xvm, a cloud provider
> API, or similar).
What do you mean by "destroyed"? That it is shut down?

> 
> >
> > Please, mind all the above is from my common sense and quite poor
> > fundamental knowledge in clustering. And please be so kind to correct
> > me if I am wrong at any point.
> >
> > Sincerely,
> >
> > Alex
> > -Original Message-
> > From: Users  On Behalf Of Ken Gaillot
> > Sent: Thursday, May 2, 2024 5:55 PM
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed 
> > Subject: Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice
> > connenction disrupted.
> >
> > I don't see fencing times in here -- fencing is absolutely essential.
> >
> > With the setup you describe, I would drop qdevice. With fencing,
> > quorum is not strictly required in a two-node cluster (two_node should
> > be set in corosync.conf). You can set priority-fencing-delay to reduce
> > the chance of simultaneous fencing. For VMs, you can use fence_xvm,
> > which is extremely quick.
> >
> > On Thu, 2024-05-02 at 02:56 +0300, ale...@pavlyuts.ru wrote:
> > > Hi All,
> > >
> > > I am trying to build application-specific 2-node failover cluster
> > > using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9,
> > > knet transport.
> > >
> > > For some reason I can’t use 3-node then I have to use
> > > qnetd+qdevice
> > > 3.0.1.
> > >
> > > The main goal Is to protect custom app which is not cluster-aware by
> > > itself. It is quite stateful, can’t store the state outside memory
> > > and take some time to get converged with other parts of the system,
> > > then the best scenario is “failover is a restart with same config”,
> > > but each unnecessary restart is painful. So, if failover done, app
> > > must retain on the backup node until it fail or admin push it back,
> > > this work well with stickiness param.
> 

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-02 Thread alexey
Dear Ken, 

First of all, there is no fencing at all; it is off.

Thanks a lot for your suggestion; probably I need to think about this approach
too. However, the project environment is not a good one to rely on fencing
and, moreover, we can't control the bottom layer in a trusted way.

As I understand it, fence_xvm just kills the VM that is not inside the quorate
partition, or, in a two-host case, the one that shoots first survives. But my
goal is to keep the app from moving (i.e. restarting) as long as possible. This
means only two kinds of moves are acceptable: current host failure (move to the
other host with a restart) or an admin move (a managed move at a chosen time,
with a restart). Any other trouble should NOT trigger an app stop/restart,
except total connectivity loss where there is no second node and no arbiter
=> stop the service.

AFAIK, fencing in a two-node cluster creates a non-deterministic fence race,
and even though it guarantees that only one node survives, it has no regard for
whether the app is already running on that node or not. So, the situation: one
node already runs the app, while the other has lost its connection to the first,
but not to the fence device, and wins the race => kills the currently active
node => the app restarts. That is exactly what I am trying to avoid.

Therefore, quorum-based management seems the better way for my exact case.

Also, VM fencing relies on the idea that all VMs are inside a well-managed
first-layer cluster, with its own quorum/fencing in place or on separate nodes,
and that VMs are never moved around without a careful fencing reconfiguration.
In my case I can't be sure about either point; I do not manage the bottom
layer. The most I can do is request that each of my VMs (nodes, arbiter) is
located on a different physical node; this may protect the app from node
failure and gives more freedom to take nodes offline for service. Also, I have
to limit the overall VM count: multiple app instances (VM pairs) need to run at
once, with one extra VM as arbiter for all of them (2*N+1), rather than three
nodes per instance (3*N), which would be more reasonable in my opinion, but not
for the one who allocates the resources.

Please keep in mind that all of the above comes from my common sense and rather
poor fundamental knowledge of clustering. And please be so kind as to correct
me if I am wrong on any point.

Sincerely,

Alex
-Original Message-
From: Users  On Behalf Of Ken Gaillot
Sent: Thursday, May 2, 2024 5:55 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice 
connenction disrupted.

I don't see fencing times in here -- fencing is absolutely essential.

With the setup you describe, I would drop qdevice. With fencing, quorum is not 
strictly required in a two-node cluster (two_node should be set in 
corosync.conf). You can set priority-fencing-delay to reduce the chance of 
simultaneous fencing. For VMs, you can use fence_xvm, which is extremely quick.

On Thu, 2024-05-02 at 02:56 +0300, ale...@pavlyuts.ru wrote:
> Hi All,
>  
> I am trying to build application-specific 2-node failover cluster 
> using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, knet 
> transport.
>  
> For some reason I can’t use 3-node then I have to use qnetd+qdevice 
> 3.0.1.
>  
> The main goal Is to protect custom app which is not cluster-aware by 
> itself. It is quite stateful, can’t store the state outside memory and 
> take some time to get converged with other parts of the system, then 
> the best scenario is “failover is a restart with same config”, but 
> each unnecessary restart is painful. So, if failover done, app must 
> retain on the backup node until it fail or admin push it back, this 
> work well with stickiness param.
>  
> So, the goal is to detect serving node fail ASAP and restart it ASAP 
> on other node, using DRBD-synced config/data. ASAP means within 5-7 
> sec, not 30 or more.
>  
> I was tried different combinations of timing, and finally got 
> acceptable result within 5 sec for the best case. But! The case is 
> very unstable.
>  
> My setup is a simple: two nodes on VM, and one more VM as arbiter 
> (qnetd), VMs under Proxmox and connected by net via external ethernet 
> switch to get closer to reality where “nodes VM” should locate as VM 
> on different PHY hosts in one rack.
>  
> Then, it was adjusted for faster detect and failover.
> In Corosync, left the token default 1000ms, but add
> “heartbeat_failures_allowed: 3”, this made corosync catch node failure 
> for about 200ms (4x50ms heartbeat).
> Both qnet and qdevice was run with  net_heartbeat_interval_min=200 to 
> allow play with faster hearbeats and detects Also, quorum.device.net 
> has timeout: 500, sync_timeout: 3000, algo:
> LMS.
>  
> The testing is to issue “ate +%M:%S.%N && qm stop 201”, and then check 
> the logs on timestamp when the app started on the “backup”
> host. And, when backup host boot again, the test is to check the logs 
> for the app was not restarted.
>  
> Sometimes switchover work like a charm but sometimes it may delay for 
> dozens of seconds.

[ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-01 Thread alexey
Hi All,

 

I am trying to build an application-specific 2-node failover cluster using
Ubuntu 22, Pacemaker 2.1.2 + Corosync 3.1.6 and DRBD 9.2.9, with knet transport.

 

For some reason I can't use a 3-node setup, so I have to use qnetd+qdevice 3.0.1.

 

The main goal is to protect a custom app which is not cluster-aware by itself.
It is quite stateful, can't store its state outside memory, and takes some
time to converge with the other parts of the system, so the best scenario is
"failover is a restart with the same config", but each unnecessary restart is
painful. So, once a failover is done, the app must stay on the backup node
until it fails or an admin pushes it back; this works well with the stickiness
parameter (see the sketch below).
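A sketch of the stickiness setting I mean (assuming pcs; the exact defaults syntax differs slightly between pcs versions, and the resource name is illustrative):

# Keep resources where they are after a failover instead of moving them back
pcs resource defaults update resource-stickiness=INFINITY

# Or set it per resource
pcs resource meta my_app resource-stickiness=INFINITY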

 

So, the goal is to detect a serving-node failure ASAP and restart the app ASAP
on the other node, using DRBD-synced config/data. ASAP means within 5-7
seconds, not 30 or more.

 

I have tried different combinations of timings and finally got an acceptable
result of about 5 seconds in the best case. But! The behaviour is very unstable.

 

My setup is simple: two nodes on VMs, plus one more VM as arbiter (qnetd). The
VMs run under Proxmox and are connected via an external Ethernet switch, to get
closer to the real deployment, where the "node VMs" will be located on
different physical hosts in one rack.

 

Then it was tuned for faster detection and failover (a corosync.conf sketch
follows the list):

1.  In Corosync, the token was left at the default 1000 ms, but
"heartbeat_failures_allowed: 3" was added; this makes Corosync catch a node
failure in about 200 ms (4 x 50 ms heartbeats).
2.  Both qnetd and qdevice were run with net_heartbeat_interval_min=200 to
allow playing with faster heartbeats and detection.
3.  Also, quorum.device.net has timeout: 500, sync_timeout: 3000, algo: LMS.
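Roughly, the relevant corosync.conf fragment looks like this (a hedged reconstruction from the settings above; the arbiter address is illustrative, and the exact placement of timeout/sync_timeout vs. the net{} subsection should be checked against the corosync.conf(5) and corosync-qdevice(8) man pages):

totem {
    token: 1000                     # left at the default
    heartbeat_failures_allowed: 3
}

quorum {
    provider: corosync_votequorum
    device {
        model: net
        timeout: 500
        sync_timeout: 3000
        net {
            host: 192.168.100.3     # qnetd arbiter, address illustrative
            algorithm: lms
        }
    }
}

# net_heartbeat_interval_min=200 was set on qnetd/qdevice separately, as noted above.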

 

The testing is to issue "date +%M:%S.%N && qm stop 201" and then check the
logs for the timestamp when the app started on the "backup" host. And when
the stopped host boots up again, the test is to check the logs to confirm
the app was not restarted.

 

Sometimes the switchover works like a charm, but sometimes it may be delayed
for dozens of seconds.

Sometimes, when the primary host boots up again, the secondary holds quorum
well and keeps the app running; sometimes quorum is lost first (and Pacemaker
stops the app) and then regained, and Pacemaker brings the app up again, so an
unwanted restart happens.

 

My investigation shows the following difference between the "good" and "bad" cases:

 

Good case: all the logs are clear and reasonable.

 

Bad case: qnetd loses the connection to the second node just after the
connection loss to the "failed" node is detected, and then it may take dozens
of seconds to restore it. All this time qdevice keeps trying to connect to
qnetd and fails:

 

Example: host 192.168.100.1 is sent to failure, 100.2 is the failover target:

 

From qnetd:

May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
:::192.168.100.1:60686 doesn't sent any message during 600ms.
Disconnecting

May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
:::192.168.100.1:60686 (init_received 1, cluster bsc-test-cluster,
node_id 1) disconnect

May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: Client
0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect

May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms:   server going down
0

>>> This is an unexpected disconnect; in the normal scenario the connection persists

May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
:::192.168.100.2:32790 doesn't sent any message during 600ms.
Disconnecting

May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
:::192.168.100.2:32790 (init_received 1, cluster bsc-test-cluster,
node_id 2) disconnect

May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: Client
0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect

May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms:   server going down
0

May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected

May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster name =
bsc-test-cluster

May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls started = 0

May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls peer certificate
verified = 0

May 01 23:30:56 arbiter corosync-qnetd[6338]:   node_id = 2

May 01 23:30:56 arbiter corosync-qnetd[6338]:   pointer = 0x55a6fc6363d0

May 01 23:30:56 arbiter corosync-qnetd[6338]:   addr_str =
:::192.168.100.2:57736

May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)

May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster dump:

May 01 23:30:56 arbiter corosync-qnetd[6338]: client =
:::192.168.100.2:57736, node_id = 2

May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
:::192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
initial node list.

May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 99

May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:

May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 1,
data_center_id = 0, node_state = not set

May 01 23:30:56 arbiter corosync-qnetd[6338]: 1 node_id = 2,
data_center_id = 0, node_state = not set

May 01 23:30:56 arbiter corosync-qnetd[6338]: 

Re: [ClusterLabs] lrmd segfault

2017-02-01 Thread alexey

Yes, it was running on bare-metal.

After upgrading to SL7.3 and pacemaker 1.1.15 the problem is gone.
I hope it is gone forever.

--
Alexey Kurnosov

> ... and it is running on bare-metal?
> Just to be sure it is not due to some code-patching done by a hypervisor ...
> 


pgpDdYGEVrMdf.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] lrmd segfault

2017-01-31 Thread alexey

As I said, we used the rpm from the standard repo; it is unlikely that it was
compiled incorrectly. And according to the spec, the L5630 (the node's CPU) has
SSE4.2 support. And in that case it should be an illegal instruction exception,
not a segfault.
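(For what it's worth, a trivial check on the node itself; a sketch:)

# Prints sse4_2 if the CPU advertises SSE4.2 support
grep -m1 -o 'sse4_2' /proc/cpuinfo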

--
Alexey Kurnosov

On Tue, Jan 31, 2017 at 07:34:18AM +0100, Kristoffer Grönlund wrote:
> ale...@kurnosov.spb.ru writes:
> 
> > [ Unknown signature status ]
> >
> > Hi All.
> >
> > We have the heterogeneous corosync/pacemaker cluster of 5 nodes: 3 
> > SL7(Scientific linux) and 2 SL6.
> > SL7 pacemaker installed from a standard repo (corosync - 2.3.4, pacemaker - 
> > 1.1.13-10), SL6 build from sources (same version).
> > The cluster not unified, some nodes have RA which other do not have. crmsh 
> > used for management.
> > SL6 nodes runs surprisingly smoothly, but SL7 steady segfaulting in the 
> > exactly same place.
> > Here is an example:
> >
> 
> Just from looking at the core dump, it looks like your processor doesn't
> support the SSE extensions used by the newer version of the code. You'll
> need to recompile and disable use of those extensions.
> 
> It looks like the code is using SSE 4.2, which is relatively new:
> 
> https://en.wikipedia.org/wiki/SSE4#SSE4.2
> 
> Cheers,
> Kristoffer
> 
> > Core was generated by `/usr/libexec/pacemaker/lrmd'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  __strcasecmp_l_sse42 () at 
> > ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
> > 164 movdqu  (%rdi), %xmm1
> > (gdb) bt
> > #0  __strcasecmp_l_sse42 () at 
> > ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
> > #1  0x7fed076136dc in crm_str_eq (a=, b=b@entry=0xed7070 
> > "DRBD_D16", use_case=use_case@entry=0) at utils.c:1416
> > #2  0x7fed073eaafa in is_op_blocked (rsc=0xed7070 "DRBD_D16") at 
> > services.c:644
> > #3  0x7fed073eac1d in services_action_async (op=0xed58e0, 
> > action_callback=) at services.c:625
> > #4  0x00404e4a in lrmd_rsc_execute_service_lib (cmd=0xed9e10, 
> > rsc=0xed4500) at lrmd.c:1242
> > #5  lrmd_rsc_execute (rsc=0xed4500) at lrmd.c:1308
> > #6  lrmd_rsc_dispatch (user_data=0xed4500, user_data@entry= > variable: value has been optimized out>) at lrmd.c:1317
> > #7  0x7fed07634c73 in crm_trigger_dispatch (source=0xed54c0, 
> > callback=, userdata=) at mainloop.c:107
> > #8  0x7fed055cb7aa in g_main_dispatch (context=0xeb4d40) at gmain.c:3109
> > #9  g_main_context_dispatch (context=context@entry=0xeb4d40) at gmain.c:3708
> > #10 0x7fed055cbaf8 in g_main_context_iterate (context=0xeb4d40, 
> > block=block@entry=1, dispatch=dispatch@entry=1, self=) at 
> > gmain.c:3779
> > #11 0x7fed055cbdca in g_main_loop_run (loop=0xe96510) at gmain.c:3973
> > #12 0x004028ce in main (argc=, argv=0x7ffe9b3b0fd8) 
> > at main.c:476
> >
> > Any help would be appreciated.
> >
> > --
> > Alexey Kurnosov
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com


pgpU0oc7XiVQ8.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] lrmd segfault

2017-01-30 Thread alexey

Hi All.

We have a heterogeneous corosync/pacemaker cluster of 5 nodes: 3 SL7
(Scientific Linux) and 2 SL6.
On SL7, pacemaker is installed from the standard repo (corosync 2.3.4,
pacemaker 1.1.13-10); on SL6 it is built from sources (same version).
The cluster is not unified; some nodes have RAs which others do not have.
crmsh is used for management.
The SL6 nodes run surprisingly smoothly, but SL7 steadily segfaults in exactly
the same place.
Here is an example:

Core was generated by `/usr/libexec/pacemaker/lrmd'.
Program terminated with signal 11, Segmentation fault.
#0  __strcasecmp_l_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164 movdqu  (%rdi), %xmm1
(gdb) bt
#0  __strcasecmp_l_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1  0x7fed076136dc in crm_str_eq (a=, b=b@entry=0xed7070 
"DRBD_D16", use_case=use_case@entry=0) at utils.c:1416
#2  0x7fed073eaafa in is_op_blocked (rsc=0xed7070 "DRBD_D16") at 
services.c:644
#3  0x7fed073eac1d in services_action_async (op=0xed58e0, 
action_callback=) at services.c:625
#4  0x00404e4a in lrmd_rsc_execute_service_lib (cmd=0xed9e10, 
rsc=0xed4500) at lrmd.c:1242
#5  lrmd_rsc_execute (rsc=0xed4500) at lrmd.c:1308
#6  lrmd_rsc_dispatch (user_data=0xed4500, user_data@entry=) at lrmd.c:1317
#7  0x7fed07634c73 in crm_trigger_dispatch (source=0xed54c0, 
callback=, userdata=) at mainloop.c:107
#8  0x7fed055cb7aa in g_main_dispatch (context=0xeb4d40) at gmain.c:3109
#9  g_main_context_dispatch (context=context@entry=0xeb4d40) at gmain.c:3708
#10 0x7fed055cbaf8 in g_main_context_iterate (context=0xeb4d40, 
block=block@entry=1, dispatch=dispatch@entry=1, self=) at 
gmain.c:3779
#11 0x7fed055cbdca in g_main_loop_run (loop=0xe96510) at gmain.c:3973
#12 0x004028ce in main (argc=, argv=0x7ffe9b3b0fd8) at 
main.c:476

Any help would be appreciated.

--
Alexey Kurnosov


pgpOLPeiAUqoD.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org