Re: [ClusterLabs] fence-virtd reach remote server serial/VM channel/TCP
Hello Jan,

I know that increasing the complexity of a deployment reduces the availability of a service, so it is no surprise to me that running services that should be highly available on virtual machines is frowned upon. However, services are regularly run on VMs and HA is desired, even if the only thing to protect against is the downtime when the kernel needs to be upgraded or a daemon needs to be restarted. So I think fence-virt has a use case.

My use case currently is to build an HA cluster of VMs, which currently host a simple mirror for software packages. They're stored on shared storage, which has a partition formatted with GFS2 on it. I use pcs(d), pacemaker, corosync and fence-virt over a serial device to fence hosts. Obviously, a single serial connection only reaches the local hypervisor. I currently only have one hypervisor, but could expand to more. I'm doing this because I want to write a doc about clustering on Linux in the year 2015, so clustering on VMs is definitely a use case that I will describe.

I know that multicast should actually work in common use cases, but I found that, for some reason, the bridge device of the VMs doesn't forward traffic for the default multicast group of fence-virt to the other bridge ports, rendering it useless. I haven't dug deeper into why that happens, but through Googling I found that it's a common problem that bridge devices on Linux don't forward some types of traffic. Obviously, if multicast works, one can just relay the multicast traffic over several other interfaces to pass requests between networks.

The man page of fence_virt.conf mentions libvirt-qmf as a backend, instead of libvirt, which should be able to route fencing requests to the correct host by using Apache QMF. I figure that's the correct backend for such a purpose.
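One way to check whether the bridge is dropping these frames is to watch for the group's Ethernet destination address on each bridge port (e.g. with tcpdump -e). The IPv4-to-MAC mapping is fixed by RFC 1112, so a small helper can compute the address to filter on. This is a sketch; 225.0.0.12 is assumed to be fence_virt's default multicast group — verify against your fence_virt.conf:

```python
import ipaddress

def multicast_mac(group):
    """Map an IPv4 multicast group to the Ethernet MAC it is delivered to
    (RFC 1112): the fixed 01:00:5e prefix plus the low 23 bits of the
    group address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("%s is not a multicast address" % group)
    low23 = int(addr) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)

# Assumed fence_virt default group -- check fence_virt.conf:
print(multicast_mac("225.0.0.12"))  # 01:00:5e:00:00:0c
```

If the frames arrive on the bridge but never leave the other ports, the bridge's multicast snooping (the multicast_snooping knob under /sys/class/net/&lt;bridge&gt;/bridge/) is one thing worth checking.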
Mit freundlichen Grüßen/Kind Regards,
Noel Kuntze

GPG Key ID: 0x63EC6658
Fingerprint: 23CA BB60 2146 05E7 7278 6592 3839 298F 63EC 6658

On 05.08.2015 at 21:09, Jan Pokorný wrote:
> On 02/08/15 16:30 +0200, Noel Kuntze wrote:
>> I would like to know if it is possible for fence-virtd to relay a request from a client, which it received via serial, VM channel or TCP connection from an agent, to another daemon, if the VM that should be fenced does not run on the same host as the contacted daemon.
>
> First, it doesn't sound like a very commendable, or at least common, setup to have virtualized cluster nodes spread around multiple hosts. When increasing the complexity of a deployment, new points of failure can be introduced, defeating the purpose of HA. Could you please share details of your use case?
>
> To your question, it might (hypothetically) be doable if you manage to put the guests on the first host together with the other host into the same multicast-friendly network, or relay multicast packets between those remote sides by other means. Alternatively, you might implement such relaying directly as a fence_virtd module (backend), possibly reusing some code from the client side (fence_virt).
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Corosync: 100% cpu (corosync 2.3.5, libqb 0.17.1, pacemaker 1.1.13)
hi,

I've built a recent cluster stack from sources on Debian Jessie and I can't get rid of CPU spikes. Corosync blocks the entire system for seconds on every simple transition, even itself:

drbdtest1 corosync[4734]: [MAIN ] Corosync main process was not scheduled for 2590.4512 ms (threshold is 2400.0000 ms). Consider token timeout increase.

and even drbd:

drbdtest1 kernel: drbd p1: PingAck did not arrive in time.

My previous build (corosync 1.4.6, libqb 0.17.0, pacemaker 1.1.12) works fine on these nodes with the same corosync/pacemaker setup. What should I try? It's a test environment; the issue is 100% reproducible in seconds. Network traffic is minimal all the time and there is no I/O load.

*Pacemaker config:*

node 167969573: drbdtest1
node 167969574: drbdtest2
primitive drbd_p1 ocf:linbit:drbd \
    params drbd_resource=p1 \
    op monitor interval=30
primitive drbd_p2 ocf:linbit:drbd \
    params drbd_resource=p2 \
    op monitor interval=30
primitive dummy_test ocf:pacemaker:Dummy \
    meta allow-migrate=true \
    params state=/var/run/activenode
primitive fence_libvirt stonith:external/libvirt \
    params hostlist=drbdtest1,drbdtest2 hypervisor_uri=qemu+ssh://libvirt-fencing@mgx4/system \
    op monitor interval=30
primitive fs_boot Filesystem \
    params device=/dev/null directory=/boot fstype=* \
    meta is-managed=false \
    op monitor interval=20 timeout=40 on-fail=block OCF_CHECK_LEVEL=20
primitive fs_f1 Filesystem \
    params device=/dev/drbd/by-res/p1 directory=/mnt/p1 fstype=ext4 options=commit=60,barrier=0,data=writeback \
    op monitor interval=20 timeout=40 \
    op start timeout=300 interval=0 \
    op stop timeout=180 interval=0
primitive ip_10.3.3.138 IPaddr2 \
    params ip=10.3.3.138 cidr_netmask=32 \
    op monitor interval=10s timeout=20s
primitive sysinfo ocf:pacemaker:SysInfo \
    op start timeout=20s interval=0 \
    op stop timeout=20s interval=0 \
    op monitor interval=60s
group dummy-group dummy_test
ms ms_drbd_p1 drbd_p1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd_p2 drbd_p2 \
    meta master-max=2 master-node-max=1 clone-max=2 notify=true
clone fencing_by_libvirt fence_libvirt \
    meta globally-unique=false
clone fs_boot_clone fs_boot
clone sysinfos sysinfo \
    meta globally-unique=false
location fs1_on_high_load fs_f1 \
    rule -inf: cpu_load gte 4
colocation dummy_coloc inf: dummy-group ms_drbd_p2:Master
colocation f1a-coloc inf: fs_f1 ms_drbd_p1:Master
colocation f1b-coloc inf: fs_f1 fs_boot_clone:Started
order dummy_order inf: ms_drbd_p2:promote dummy-group:start
order orderA inf: ms_drbd_p1:promote fs_f1:start
property cib-bootstrap-options: \
    dc-version=1.1.13-6052cd1 \
    cluster-infrastructure=corosync \
    expected-quorum-votes=2 \
    no-quorum-policy=ignore \
    symmetric-cluster=true \
    placement-strategy=default \
    last-lrm-refresh=1438735742 \
    have-watchdog=false
property cib-bootstrap-options-stonith: \
    stonith-enabled=true \
    stonith-action=reboot
rsc_defaults rsc-options: \
    resource-stickiness=100

*corosync.conf:*

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.3.3.37
        mcastaddr: 225.0.0.37
        mcastport: 5403
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
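The 2400 ms threshold in the warning is consistent with 80% of the configured token timeout (0.8 × 3000 ms = 2400 ms), so raising token — as the message itself suggests — widens the allowed scheduling pause. Note this only masks whatever is starving corosync of CPU, it does not fix it. A hedged sketch of the change against the totem section above:

```
totem {
    version: 2
    # Raising token widens the scheduling-pause warning threshold,
    # which corosync apparently derives as 0.8 * token
    # (0.8 * 3000 ms = the 2400 ms seen in the log above).
    # 10000 here is an illustrative value, not a recommendation.
    token: 10000
}
```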
[ClusterLabs] Failure of apache services
Hi,

this is my pcs status:

Online: [ node1 node2 ]

Full list of resources:

 WebSite	(ocf::heartbeat:apache):	Started node1

Failed actions:
    WebSite_start_0 on node2 'unknown error' (1): call=28, status=complete, last-rc-change='Wed Aug 5 08:26:47 2015', queued=0ms, exec=3158ms

Traceback (most recent call last):
  File "/usr/sbin/pcs", line 138, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/pcs", line 127, in main
    status.status_cmd(argv)
  File "/usr/lib/python2.6/site-packages/pcs/status.py", line 13, in status_cmd
    full_status()
  File "/usr/lib/python2.6/site-packages/pcs/status.py", line 60, in full_status
    utils.serviceStatus( )
  File "/usr/lib/python2.6/site-packages/pcs/utils.py", line 1504, in serviceStatus
    if is_systemctl():
  File "/usr/lib/python2.6/site-packages/pcs/utils.py", line 1476, in is_systemctl
    elif re.search(r'Foobar Linux release 6\.', issue):
NameError: global name 'issue' is not defined

I am not able to run my apache service on the second node.

--
With Regards
P.Vijay
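The traceback is a pcs bug, not a cluster problem: is_systemctl() references a variable named issue that was never bound on this code path. The sketch below is a hypothetical simplification of that failure pattern (not the actual pcs source) — the variable is only assigned when a file read succeeds, so the later re.search() blows up with a NameError when it doesn't:

```python
import re

def is_systemctl_buggy(issue_path="/nonexistent-issue"):
    """Reproduces the bug pattern: 'issue' is only bound inside the
    try block, so referencing it after a failed read raises NameError
    (UnboundLocalError, a NameError subclass)."""
    try:
        issue = open(issue_path).read()
    except IOError:
        pass  # 'issue' is left unbound here
    if re.search(r'Foobar Linux release 6\.', issue):
        return False
    return True

def is_systemctl_fixed(issue_path="/nonexistent-issue"):
    """Same logic with 'issue' initialized up front, so an unreadable
    file degrades gracefully instead of crashing."""
    issue = ""
    try:
        issue = open(issue_path).read()
    except IOError:
        pass
    if re.search(r'Foobar Linux release 6\.', issue):
        return False
    return True
```

Upgrading pcs (or patching utils.py along the lines of the fixed variant) makes `pcs status` print the status instead of the traceback; the failed apache start on node2 is a separate issue to chase in the apache agent's logs.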
Re: [ClusterLabs] Disabling resources and adding apache instances
Cluster name: pacemaker1
Last updated: Wed Aug 5 09:07:27 2015
Last change: Wed Aug 5 08:58:24 2015
Stack: cman
Current DC: node1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
2 Resources configured

Online: [ node1 node2 ]

Full list of resources:

 ClusterIP	(ocf::heartbeat:IPaddr2):	Started node2
 WebSite	(ocf::heartbeat:apache):	Started node1

Failed actions:
    WebSite_monitor_0 on node2 'unknown error' (1): call=96, status=complete, last-rc-change='Wed Aug 5 08:53:24 2015', queued=1ms, exec=51ms

Traceback (most recent call last):
  File "/usr/sbin/pcs", line 138, in <module>
    main(sys.argv[1:])
  File "/usr/sbin/pcs", line 127, in main
    status.status_cmd(argv)
  File "/usr/lib/python2.6/site-packages/pcs/status.py", line 13, in status_cmd
    full_status()
  File "/usr/lib/python2.6/site-packages/pcs/status.py", line 60, in full_status
    utils.serviceStatus( )
  File "/usr/lib/python2.6/site-packages/pcs/utils.py", line 1504, in serviceStatus
    if is_systemctl():
  File "/usr/lib/python2.6/site-packages/pcs/utils.py", line 1476, in is_systemctl
    elif re.search(r'Foobar Linux release 6\.', issue):
NameError: global name 'issue' is not defined

This is the error that I got after the location constraint, and ClusterIP started on node1.

On Wed, Aug 5, 2015 at 12:37 PM, Andrei Borzenkov arvidj...@gmail.com wrote:
> On Wed, Aug 5, 2015 at 9:23 AM, Vijay Partha vijaysarath...@gmail.com wrote:
>> Hi, I have 2 doubts.
>>
>> 1.) If I disable a resource and reboot the node, will pacemaker restart the service?
>
> What exactly does "disable" mean? There is no such operation in pacemaker.
>
>> Or how can I stop the service so that, after rebooting, the service is started automatically by pacemaker?
>
> Unfortunately pacemaker does not really provide any way to temporarily stop a resource. You can set the target role to Stopped, which will trigger a resource stop. Then the resource won't be started after reboot, because you told it to remain Stopped. The same applies to is-managed=false.
>
> If I'm wrong and it is possible, I would be very interested to learn it.
>
>> 2.) How can I create apache instances in such a way that one instance runs on one node and another instance runs on the second node?
>
> Just define two resources and set location constraints for each.

--
With Regards
P.Vijay
[ClusterLabs] Antw: Re: Disabling resources and adding apache instances
Andrei Borzenkov arvidj...@gmail.com wrote on 05.08.2015 at 09:07 in message caa91j0xsx3u6xnexjts1u9pcwpovrqwjhodqpvmsexonudb...@mail.gmail.com:
> On Wed, Aug 5, 2015 at 9:23 AM, Vijay Partha vijaysarath...@gmail.com wrote:
>> Hi, I have 2 doubts.
>>
>> 1.) If I disable a resource and reboot the node, will pacemaker restart the service?
>
> What exactly does "disable" mean? There is no such operation in pacemaker.
>
>> Or how can I stop the service so that, after rebooting, the service is started automatically by pacemaker?
>
> Unfortunately pacemaker does not really provide any way to temporarily stop a resource. You can set the target role to Stopped, which will trigger a resource stop. Then the resource won't be started after reboot, because you told it to remain Stopped. The same applies to is-managed=false.

Actually it does: you can have time-based rules. If you add location constraints for a resource disallowing it to run anywhere for some time, I guess it will work ;-)

> If I'm wrong and it is possible, I would be very interested to learn it.

Sometimes things happen that nobody can explain, unfortunately.

>> 2.) How can I create apache instances in such a way that one instance runs on one node and another instance runs on the second node?
>
> Just define two resources and set location constraints for each.

Regards,
Ulrich

P.S.: Thought of the day: Is there any reasonable use for a nuclear bomb? If not, why have one? ("We use it, because we have it" is not considered to be a valid answer)
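Ulrich's time-based-rule idea could look roughly like the CIB fragment below. This is only a sketch: the resource name and the dates are made up, and the exact date_expression syntax should be checked against the Pacemaker documentation:

```xml
<rsc_location id="stop-website-maintenance" rsc="WebSite">
  <!-- -INFINITY score: WebSite may not run anywhere while the rule matches -->
  <rule id="stop-website-maintenance-rule" score="-INFINITY">
    <date_expression id="stop-website-maintenance-window" operation="in_range"
                     start="2015-08-05 22:00:00" end="2015-08-06 02:00:00"/>
  </rule>
</rsc_location>
```

Once the window passes, the rule stops matching and pacemaker starts the resource again on its own, which is the behaviour the original question asked for.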
Re: [ClusterLabs] apache services
On 08/05/2015 04:05 AM, Vijay Partha wrote:
> Hi, I need to run the apache service on both nodes in a cluster. httpd is listening on port 80 on the first node and on port 81 on the second. I am not able to add these instances separately; rather, both of them are starting on the same node1. Even if I move the service I get an error: WebSite1_start_0 on node2 'unknown error' (1): call=27, status=complete, last-rc-change='Wed Aug 5 11:02:47 2015', queued=1ms, exec=3146ms. Please help me out.

You have two separate issues:

1. Both instances are starting on the same node; and
2. Moving an instance produces an error.

For #1, the answer is colocation constraints (which are distinct from location constraints and ordering constraints). Colocation constraints say that two resources should be kept together (if the score is positive) or kept apart (if the score is negative).

For #2, pacemaker is asking the resource agent to perform an action, and the resource agent is saying it can't. Look at the logs to try to find the error reported by the resource agent. You can also try running the resource agent manually.
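A negative-score colocation keeping the two apache instances apart could look like this as a CIB fragment (a sketch; WebSite1/WebSite2 are the resource names from the thread):

```xml
<!-- -INFINITY: never place WebSite2 on the node running WebSite1 -->
<rsc_colocation id="apache-apart" rsc="WebSite2" with-rsc="WebSite1" score="-INFINITY"/>
```

With pcs, the equivalent should be something like `pcs constraint colocation add WebSite2 with WebSite1 -INFINITY` (check `pcs constraint colocation --help` for the exact syntax of your pcs version). A score of -INFINITY keeps them strictly apart; a finite negative score merely expresses a preference.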
[ClusterLabs] Running pacemaker 1.1.13 with legacy plugin or heartbeat
FYI to anyone running the legacy plugin or heartbeat as pacemaker's communication layer:

Use-after-free memory issues can cause segfault crashes in the cib when using pacemaker 1.1.13 with the legacy plugin. Heartbeat is likely to be affected as well. Clusters using CMAN or corosync 2 as the communication layer are not affected.

If switching to CMAN or corosync 2 isn't an option for you, I strongly recommend using a vendor that supports your communication layer, as they are more likely to do thorough testing and provide fixes.

If anyone wants a targeted patch, I can provide one, but I would recommend instead using the upstream master branch as of at least commit 0f8059e. That branch includes an overhaul of the affected code area, as well as other bug fixes.

--
Ken Gaillot kgail...@redhat.com
Re: [ClusterLabs] fence-virtd reach remote server serial/VM channel/TCP
On 05/08/15 03:09 PM, Jan Pokorný wrote:
> On 02/08/15 16:30 +0200, Noel Kuntze wrote:
>> I would like to know if it is possible for fence-virtd to relay a request from a client, which it received via serial, VM channel or TCP connection from an agent, to another daemon, if the VM that should be fenced does not run on the same host as the contacted daemon.
>
> First, it doesn't sound like a very commendable, or at least common, setup to have virtualized cluster nodes spread around multiple hosts. When increasing the complexity of a deployment, new points of failure can be introduced, defeating the purpose of HA. Could you please share details of your use case?

To interject: it is not something I would do, but I've heard of cases where a separate department handles hardware and the devops types are restricted to VMs only. In such a case, you would want to span hosts to protect against a host failure. Not sure if this is Noel's use case, of course.

> To your question, it might (hypothetically) be doable if you manage to put the guests on the first host together with the other host into the same multicast-friendly network, or relay multicast packets between those remote sides by other means. Alternatively, you might implement such relaying directly as a fence_virtd module (backend), possibly reusing some code from the client side (fence_virt).

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.
Ok, I’ll look into it. Thanks for retesting.

On 5 Aug 2015, at 4:00 pm, renayama19661...@ybb.ne.jp wrote:

Hi Andrew,

>> Do you know if this behaviour still exists? A LOT of work went into the remote node logic in the last couple of months, its possible this was fixed as a side-effect.

I have now confirmed it with the latest Pacemaker (pacemaker-eefdc909a41b571dc2e155f7b14b5ef0368f2de7). The phenomenon still occurs: on the first clean up, pacemaker fails in the connection with pacemaker_remote; the second attempt succeeds. The problem does not seem to be settled somehow or other. I incorporated my log into the latest code again:

---
(snip)
static size_t
crm_remote_recv_once(crm_remote_t * remote)
{
    int rc = 0;
    size_t read_len = sizeof(struct crm_remote_header_v0);
    struct crm_remote_header_v0 *header = crm_remote_header(remote);

    if (header) {
        /* Stop at the end of the current message */
        read_len = header->size_total;
    }

    /* automatically grow the buffer when needed */
    if (remote->buffer_size < read_len) {
        remote->buffer_size = 2 * read_len;
        crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
        remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
        CRM_ASSERT(remote->buffer != NULL);
    }

#ifdef HAVE_GNUTLS_GNUTLS_H
    if (remote->tls_session) {
        if (remote->buffer == NULL) {
            crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
                     remote->buffer_size, read_len);
        }
        rc = gnutls_record_recv(*(remote->tls_session),
                                remote->buffer + remote->buffer_offset,
                                remote->buffer_size - remote->buffer_offset);
(snip)
---

When Pacemaker first fails in the connection with the remote node, my log is printed. My log is not printed by the second connection.
[root@sl7-01 ~]# tail -f /var/log/messages | grep YAMA
Aug 5 14:46:25 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:26 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:28 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:30 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
Aug 5 14:46:31 sl7-01 crmd[21306]: info: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
(snip)

Best Regards,
Hideo Yamauchi.

----- Original Message -----
From: renayama19661...@ybb.ne.jp
To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 2015/8/4, Tue 18:40
Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.

Hi Andrew,

>> Do you know if this behaviour still exists? A LOT of work went into the remote node logic in the last couple of months, its possible this was fixed as a side-effect.

I have not confirmed it with the latest yet. I will confirm it.

Many Thanks!
Hideo Yamauchi.

----- Original Message -----
From: Andrew Beekhof and...@beekhof.net
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 2015/8/4, Tue 13:16
Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: [Question] About movement of pacemaker_remote.

On 12 May 2015, at 12:12 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> The problem seems to be the buffer becoming NULL after running crm_resource -C, once the remote node has been rebooted. I incorporated log statements into the source code and confirmed it:
crm_remote_recv_once(crm_remote_t * remote)
{
    (snip)
    /* automatically grow the buffer when needed */
    if (remote->buffer_size < read_len) {
        remote->buffer_size = 2 * read_len;
        crm_trace("Expanding buffer to %u bytes", remote->buffer_size);
        remote->buffer = realloc_safe(remote->buffer, remote->buffer_size + 1);
        CRM_ASSERT(remote->buffer != NULL);
    }

#ifdef HAVE_GNUTLS_GNUTLS_H
    if (remote->tls_session) {
        if (remote->buffer == NULL) {
            crm_info("### YAMAUCHI buffer is NULL [buffer_zie[%d] readlen[%d]",
                     remote->buffer_size, read_len);
        }
        rc = gnutls_record_recv(*(remote->tls_session),
                                remote->buffer + remote->buffer_offset,
                                remote->buffer_size - remote->buffer_offset);
    (snip)

May 12 10:54:01 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL [buffer_zie[1326] readlen[40]
May 12 10:54:02 sl7-01 crmd[30447]: info: crm_remote_recv_once: ### YAMAUCHI buffer is NULL
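The logged values hint at why the buffer can reach gnutls_record_recv() as NULL: with buffer_zie[1326] and readlen[40], the grow branch (buffer_size < read_len) is skipped, so a buffer pointer that was freed or never allocated is used as-is. This Python model of the control flow is a simplification for illustration, not the actual Pacemaker code:

```python
def recv_once(buffer, buffer_size, read_len):
    """Model of the buffer handling in crm_remote_recv_once(): the
    buffer is only (re)allocated when the recorded size is too small,
    so a NULL buffer whose recorded size already covers read_len is
    handed to the receive call unallocated."""
    if buffer_size < read_len:
        buffer_size = 2 * read_len
        buffer = bytearray(buffer_size + 1)  # realloc_safe() in the C code
    # gnutls_record_recv() would now write into 'buffer':
    assert buffer is not None, "reading into a NULL buffer"
    return buffer, buffer_size
```

With the log's values (buffer=None, buffer_size=1326, read_len=40) the assertion fires, matching the observed failure; a fix would have to either reset buffer_size when the buffer is released, or check the pointer (not just the size) before skipping the allocation.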