Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
> 24.08.2015 12:35, Tom Yates wrote:
>> I've got a failover firewall pair where the external interface is ADSL;
>> that is, PPPoE. I've defined the service thus:
>>
>>     primitive ExternalIP lsb:hb-adsl-helper \
>>         op monitor interval=60s
>>
>> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>>
>>     #!/bin/bash
>>     RETVAL=0
>>     start() {
>>         /sbin/pppoe-start
>>     }
>>     stop() {
>>         /sbin/pppoe-stop
>>     }
>>     case $1 in
>>         start)
>>             start
>>             ;;
>>         stop)
>>             stop
>>             ;;
>>         status)
>>             /sbin/ifconfig ppp0 > /dev/null && exit 0
>>             exit 1
>>             ;;
>>         *)
>>             echo $"Usage: $0 {start|stop|status}"
>>             exit 3
>>     esac
>>     exit $?

Pacemaker expects that LSB agents follow the LSB spec for return codes, and won't be able to behave properly if they don't:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb

However, it's just as easy to write an OCF agent, which gives you more flexibility (accepting parameters, etc.):
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf

>> The problem is that sometimes the ADSL connection falls over, as they do, eg:
>>
>>     Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
>>     Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
>>     Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
>>     Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
>>     Aug 20 11:42:13 positron pppd[2469]: Modem hangup
>>     Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
>>     Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
>>     Aug 20 11:42:13 positron pppd[2469]: Exit.
>>     Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.
>>
>> CRMd then logs a bunch of stuff, followed by
>>
>>     Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
>>     Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
>>     [...]
>>     Aug 20 11:42:18 positron pppoe-stop: Killing pppd
>>     Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
>>     Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.
>>
>> At this point, the PPPoE connection is down, and stays down. CRMd doesn't
>> fail the group which contains both internal and external interfaces over
>> to the other node, but nor does it try to restart the service. I'm fairly
>> sure this is because I've done something boneheaded, but I can't get my
>> bone head around what it might be. Any light anyone can shed is much
>> appreciated.
>
> If the stop operation failed, the resource state is undefined; pacemaker
> won't do anything with this resource. Either make sure the script returns
> success when appropriate, or the only option is to fence the node where
> the resource was active.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
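[Editorial note] The return-code point can be made concrete. Below is a minimal sketch of an LSB-compliant status check, not the poster's script: the function name is invented, and it assumes a Linux host where an interface's presence can be tested via /sys/class/net instead of parsing ifconfig output.

```shell
#!/bin/bash
# Sketch only (illustrative, not the original hb-adsl-helper).
# The LSB spec requires "status" to exit 0 when the service is running
# and 3 when it is stopped; a bare "exit 1" means "program is dead and
# pid file exists", which misleads pacemaker.
iface_status() {
    local ifname="${1:-ppp0}"
    if [ -d "/sys/class/net/$ifname" ]; then
        return 0    # interface present: report "running"
    else
        return 3    # interface absent: report "stopped" (LSB), not 1
    fi
}

# Loopback always exists on Linux, so this reports "running":
iface_status lo && echo "lo is up"
```

With status returning 3 for a downed ppp0, pacemaker's monitor would see the resource as stopped and could recover it, instead of being left with an undefined state.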
[ClusterLabs] [Announce] libqb v0.17.2 release
This is mainly a bug fix release, but also includes a new split-logging feature.

Changes from v0.17.1 to v0.17.2:

- Implement extended information logging (aka split logging)
- Switch libtool soname versioning from -version-number to -version-info
- High: loop: fix resource starvation in mainloop code
- Fix: valgrind invalid file descriptor warning
- Fix: unlink files bound to unix domain sockets
- Fix: resolve compile error for Solaris
- Fix alignment issues on hppa
- Fix description of qbutil.h
- Fix comment typo: neccessary -> necessary
- Fix comment typos: incomming -> incoming
- Low: examples: fix race condition in glib mainloop example
- Low: build: update .gitignore for vim swap files and make check output
- Low: check_ipc: generate unique server names for tests
- Low: check_ipc: give connection stress tests for shm and socket unique names
- Low: tests: regression tests for stress testing loop_poll ipc create/destroy
- ipc test improvements
- The udata member of the kevent struct is a void *
- Fix several warnings under clang
- Add Doxygen description for qbipc_common.h
- doc: improve README and RPM description
- Clear DOT_FONTNAME, since FreeSans is not included any more; the new default is Helvetica
- Remove obsolete options from doxyfiles
- Do not suppress echoing of Doxygen commands

The current release tarball is here:
https://github.com/ClusterLabs/libqb/archive/v0.17.2.tar.gz

The github repository is here:
https://github.com/ClusterLabs/libqb

Please report bugs and issues in bugzilla:
https://bugzilla.redhat.com
Re: [ClusterLabs] [Slightly OT] OCFS2 over LVM
On 24/08/15 07:55 AM, Jorge Fábregas wrote:
> On 08/24/2015 06:52 AM, Kai Dupke wrote:
>> Not sure what you want to run on top of your 2-node cluster, but OCFS2
>> is only needed when you need a shared file system.
>
> This is for an application that manages the high-availability by itself
> (in an active/active fashion) and the only thing that's needed from the
> OS is a shared filesystem. I quickly thought about NFS but then the
> reliability of the NFS server was questioned etc. I could create an NFS
> cluster for that, but that would be two more servers. You get the idea.
> I'm still googling NFSv4 vs OCFS2. If anyone here has experience (going
> from one to the other) I'd like to hear it.
>
>> For plain failover with volumes managed by cLVM you don't need OCFS2
>> (and can save one level of complexity).
>
> This is my first time using a cluster filesystem and indeed I get it:
> there are lots of things to be taken care of and many possible ways to
> break it. Thanks, Jorge

Speaking from a gfs2 background, but assuming it's similar in concept to ocfs2...

Cluster locking comes at a performance cost. All locks need to be coordinated between the nodes, and that will always be slower than local locking only. Cluster filesystems are also far less commonly used than options like NFS.

Using a pair of nodes with a traditional file system exported by NFS and made accessible by a floating (virtual) IP address gives you redundancy without incurring the complexity and performance overhead of cluster locking. You won't need clvmd either. The trade-off, though, is that if/when the primary fails, the nfs daemon will appear to restart to the users, and that may require a reconnection (not sure, I use nfs sparingly).

Generally speaking, I recommend avoiding cluster FSes unless they're really required. I say this as a person who uses gfs2 in every cluster I build, but I do so carefully and in limited uses.
In my case, gfs2 backs ISOs and XML definition files for VMs, things that change rarely, so the cluster locking overhead is all but a non-issue; and I have to have DLM for clustered LVM anyway, so I've already incurred the complexity costs. So hey, why not.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
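[Editorial note] A hedged sketch of what the NFS-plus-floating-IP setup described above might look like on the pacemaker side, in pcs syntax. The resource names, block device, export path, and IP address are all invented placeholders; real deployments also need fencing configured.

```shell
# Sketch only: a traditional (non-cluster) filesystem, an NFS server
# bound to it, and a floating IP, grouped so they start in order and
# fail over together. No DLM, no clvmd, no cluster locking.
pcs resource create shared_fs ocf:heartbeat:Filesystem \
    device=/dev/sdb1 directory=/srv/export fstype=ext4
pcs resource create nfs_daemon ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/srv/export/nfsinfo
pcs resource create float_ip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24
pcs resource group add nfs_group shared_fs nfs_daemon float_ip
```

Grouping the three resources keeps them colocated on one node; if that node fails, the whole group moves, and clients reconnect to the same floating IP on the survivor.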
[ClusterLabs] 0 Nodes configured in crm_mon
Hi all,

I'm trying to run a corosync2 + pacemaker setup on Debian Jessie (only for testing purposes). I've successfully compiled all components using this guide:
http://clusterlabs.org/wiki/Compiling_on_Debian

Unfortunately, if I run crm_mon I don't see any nodes:

    Last updated: Mon Aug 24 17:36:00 2015
    Last change: Mon Aug 24 17:17:42 2015
    Current DC: NONE
    0 Nodes configured
    0 Resources configured

I don't see any errors in the corosync log either: http://pastebin.com/bJX66B9e

This is my corosync.conf:

    # Please read the corosync.conf.5 manual page
    totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.122.0
            mcastport: 5405
            ttl: 1
        }
        transport: udpu
    }

    logging {
        fileline: off
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
            subsys: QUORUM
            debug: off
        }
    }

    nodelist {
        node {
            ring0_addr: 192.168.122.172
            #nodeid: 1
        }
        node {
            ring0_addr: 192.168.122.113
            #nodeid: 2
        }
    }

    quorum {
        # Enable and configure quorum subsystem (default: off)
        # see also corosync.conf.5 and votequorum.5
        #provider: corosync_votequorum
    }

Used components:

- pacemaker: 1.1.12
- corosync: 2.3.5
- libqb: 0.17.1

Did I miss something?

Thanks!
Stan
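[Editorial note] A likely culprit in this config is the commented-out quorum provider: with corosync 2.x, pacemaker learns about cluster membership through votequorum, so crm_mon can show 0 nodes until it is enabled. A hedged sketch of the corrected fragment follows; the nodeids are chosen for illustration, and two_node assumes this really is a two-node cluster.

```
nodelist {
    node {
        ring0_addr: 192.168.122.172
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.122.113
        nodeid: 2
    }
}

quorum {
    # required for pacemaker on corosync 2.x
    provider: corosync_votequorum
    two_node: 1
}
```

After changing this, corosync on both nodes needs a restart for the membership to form.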
Re: [ClusterLabs] Cluster.conf
The cluster.conf is needed by cman, and on RHEL 6, pacemaker needs to use cman as the quorum provider. So you need a skeleton cluster.conf, and it is different from cib.xml.

If you use pcs/pcsd to set up pacemaker on RHEL 6.7, it should configure everything for you, so you should be able to go straight to setting up pacemaker and not worry about cman/corosync directly.

digimer

On 24/08/15 01:52 PM, Streeter, Michelle N wrote:
> If I have a cluster.conf file in /etc/cluster, my cluster will not start.
> Pacemaker 1.1.11, Corosync 1.4.7, cman 3.0.12. But if I do not have a
> cluster.conf file then my cluster does start with my current
> configuration. However, when I try to stop the cluster, it won't stop
> unless I have my cluster.conf file in place. How can I dump my cib to my
> cluster.conf file so my cluster will start with the conf file in place?
>
> Michelle Streeter
> ASC2 MCS – SDE/ACL/SDL/EDL OKC Software Engineer
> The Boeing Company

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
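[Editorial note] For reference, the skeleton cluster.conf mentioned above is small. The sketch below is illustrative only: the cluster name and node names are invented placeholders, and a real two-node cluster also needs fence devices defined rather than the empty element shown here.

```xml
<!-- Minimal sketch of a cluster.conf for Pacemaker-on-CMAN (RHEL 6 style).
     pcs/ccs would generate something equivalent; names are placeholders. -->
<cluster name="mycluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1"/>
    <clusternode name="node2.example.com" nodeid="2"/>
  </clusternodes>
  <fencedevices/>
</cluster>
```

The file only describes membership and quorum for cman; all resources, constraints, and operations stay in the CIB, which is why the two files should never be interchanged.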
Re: [ClusterLabs] Cluster.conf
On 24/08/15 17:52, Streeter, Michelle N wrote:
> If I have a cluster.conf file in /etc/cluster, my cluster will not start.
> Pacemaker 1.1.11, Corosync 1.4.7, cman 3.0.12. But if I do not have a
> cluster.conf file then my cluster does start with my current configuration.

I don't think the CMAN component can operate without that file (the location can possibly be overridden with the $COROSYNC_CLUSTER_CONFIG_FILE environment variable). What distro, or at least what commands to bring the cluster up, do you use?

> However, when I try to stop the cluster, it won't stop unless I have my
> cluster.conf file in place. How can I dump my cib to my cluster.conf file

Note that cluster.conf and the CIB serve different purposes, at least in a Pacemaker+CMAN setup (akin to RHEL 6.x for x being 5+), so you don't want to interchange them.

> so my cluster will start with the conf file in place?

--
Jan (Poki)
[ClusterLabs] CRM managing ADSL connection; failure not handled
I've got a failover firewall pair where the external interface is ADSL; that is, PPPoE. I've defined the service thus:

    primitive ExternalIP lsb:hb-adsl-helper \
        op monitor interval=60s

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

    #!/bin/bash
    RETVAL=0
    start() {
        /sbin/pppoe-start
    }
    stop() {
        /sbin/pppoe-stop
    }
    case $1 in
        start)
            start
            ;;
        stop)
            stop
            ;;
        status)
            /sbin/ifconfig ppp0 > /dev/null && exit 0
            exit 1
            ;;
        *)
            echo $"Usage: $0 {start|stop|status}"
            exit 3
    esac
    exit $?

The problem is that sometimes the ADSL connection falls over, as they do, eg:

    Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
    Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
    Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
    Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
    Aug 20 11:42:13 positron pppd[2469]: Modem hangup
    Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
    Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
    Aug 20 11:42:13 positron pppd[2469]: Exit.
    Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.

CRMd then logs a bunch of stuff, followed by

    Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
    Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
    [...]
    Aug 20 11:42:18 positron pppoe-stop: Killing pppd
    Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
    Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.

At this point, the PPPoE connection is down, and stays down. CRMd doesn't fail the group which contains both internal and external interfaces over to the other node, but nor does it try to restart the service. I'm fairly sure this is because I've done something boneheaded, but I can't get my bone head around what it might be. Any light anyone can shed is much appreciated.
--
Tom Yates - http://www.teaparty.net
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
24.08.2015 12:35, Tom Yates wrote:
> I've got a failover firewall pair where the external interface is ADSL;
> that is, PPPoE. I've defined the service thus:
>
>     primitive ExternalIP lsb:hb-adsl-helper \
>         op monitor interval=60s
>
> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>
>     #!/bin/bash
>     RETVAL=0
>     start() {
>         /sbin/pppoe-start
>     }
>     stop() {
>         /sbin/pppoe-stop
>     }
>     case $1 in
>         start)
>             start
>             ;;
>         stop)
>             stop
>             ;;
>         status)
>             /sbin/ifconfig ppp0 > /dev/null && exit 0
>             exit 1
>             ;;
>         *)
>             echo $"Usage: $0 {start|stop|status}"
>             exit 3
>     esac
>     exit $?
>
> The problem is that sometimes the ADSL connection falls over, as they do, eg:
>
>     Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
>     Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
>     Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
>     Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
>     Aug 20 11:42:13 positron pppd[2469]: Modem hangup
>     Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
>     Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
>     Aug 20 11:42:13 positron pppd[2469]: Exit.
>     Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.
>
> CRMd then logs a bunch of stuff, followed by
>
>     Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
>     Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
>     [...]
>     Aug 20 11:42:18 positron pppoe-stop: Killing pppd
>     Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
>     Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.
>
> At this point, the PPPoE connection is down, and stays down. CRMd doesn't
> fail the group which contains both internal and external interfaces over
> to the other node, but nor does it try to restart the service. I'm fairly
> sure this is because I've done something boneheaded, but I can't get my
> bone head around what it might be.
> Any light anyone can shed is much appreciated.

If the stop operation failed, the resource state is undefined; pacemaker won't do anything with this resource. Either make sure the script returns success when appropriate, or the only option is to fence the node where the resource was active.
Re: [ClusterLabs] CRM managing ADSL connection; failure not handled
24.08.2015 13:32, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>> 24.08.2015 12:35, Tom Yates wrote:
>>> I've got a failover firewall pair where the external interface is ADSL;
>>> that is, PPPoE. I've defined the service thus:
>>
>> If stop operation failed resource state is undefined; pacemaker won't do
>> anything with this resource. Either make sure script returns success
>> when appropriate or the only option is to make it fence node where
>> resource was active.
>
> Andrei, thank you for your prompt and helpful response. If I understand
> you aright, my problem is that the stop script didn't return a 0 (OK)
> exit status, so CRM didn't know where to go. Is the exit status of the
> stop script how CRM determines the status of the stop operation?

Correct.

> And if that gives exit code 0, it will then try to do a
> /etc/init.d/script start?

If the resource was previously active and stop was attempted as cleanup after a resource failure, yes, it should attempt to start it again.

> Does CRM also use the output of /etc/init.d/script status to determine
> continuing successful operation?

It definitely does not use the *output* of the script, only the return code. If the question is whether it probes the resource in addition to checking the stop exit code, I do not think so (I know it does this in some cases for systemd resources).
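[Editorial note] The point that only the exit code matters, never the printed output, can be seen with a toy example. This is invented for illustration, not the poster's script: whatever text a status command prints is discarded, and the cluster acts solely on $?.

```shell
#!/bin/bash
# Toy illustration: output is ignored, only the exit code counts.
fake_status() {
    echo "everything looks fine"   # pacemaker never reads this text
    return 3                       # ...but it does read this: "stopped"
}

# Capture the exit code in a set-e-safe way and show it:
if fake_status >/dev/null; then rc=0; else rc=$?; fi
echo "rc=$rc"   # prints: rc=3
```

Despite the reassuring message on stdout, a resource manager checking this status would conclude the service is stopped, because the code is 3.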
Re: [ClusterLabs] [Slightly OT] OCFS2 over LVM
On 08/24/2015 06:20 PM, Digimer wrote:
> Cluster locking comes at a performance cost. All locks need to be
> coordinated between the nodes, and that will always be slower than local
> locking only. They are also far less commonly used than options like nfs.

Right.

> Using a pair of nodes with a traditional file system exported by NFS and
> made accessible by a floating (virtual) IP address gives you redundancy
> without incurring the complexity and performance overhead of cluster
> locking.

Then you have to copy all data over the network, which limits data throughput.

> Also, you won't need clvmd either. The trade-off though is that if/when
> the primary fails, the nfs daemon will appear to restart to the users
> and that may require a reconnection (not sure, I use nfs sparingly).

AFAIK NFS failover includes an NFS timeout, which can be tuned, but it may add extra time before the failover is finished from the client's perspective.

> Generally speaking, I recommend always avoiding cluster FSes unless
> they're really required.

Full ACK.

greetings
Kai Dupke
Senior Product Manager
Server Product Line
--
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone: +49-(0)5102-9310828
Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766
WWW: www.suse.com
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)