Re: [Linux-cluster] Need advice Redhat Clusters
On 2017-07-30 02:03 PM, deepesh kumar wrote:
> I need to set up a 2-node HA active/passive Red Hat cluster on RHEL 6.9.
>
> Should I start with rgmanager or pacemaker?
>
> Do I need a quorum disk (is it mandatory?), and what fence method should I use?
>
> Thanks to great friends..!!!
>
> --
> DEEPESH KUMAR

Hi Deepesh,

Note that this channel is deprecated; please use Clusterlabs - Users (cc'ed here).

Use pacemaker, but on RHEL 6 it will need the cman plugin. Only existing deployments should still be using rgmanager.

The fence method you use will depend on what your nodes are; IPMI is available on most servers, so fence_ipmilan is quite common. Switched PDUs from APC are also popular, and they use fence_apc_snmp, etc.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
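For a concrete starting point, a minimal sketch of the pacemaker-on-cman stack described above (package names are the stock RHEL 6 ones; the contents of cluster.conf, including fencing, are up to you and are only hinted at here):

```shell
# On RHEL 6, pacemaker rides on top of cman for membership and quorum.
yum install -y pacemaker cman pcs

# Define the cluster in /etc/cluster/cluster.conf first (node names,
# fence devices, and two_node="1" expected_votes="1" for a 2-node
# cluster), then bring the stack up in order:
service cman start
service pacemaker start

# Start on boot:
chkconfig cman on
chkconfig pacemaker on
```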
Re: [Linux-cluster] GFS2 Errors
On 2017-07-18 07:25 PM, Kristián Feldsam wrote:
> Hello, I see GFS2 errors in the log today and there is nothing about them on the net, so I am writing to this mailing list.
>
> node2 19.07.2017 01:11:55 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-4549568322848002755
> node2 19.07.2017 01:10:56 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-8191295421473926116
> node2 19.07.2017 01:10:48 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-8225402411152149004
> node2 19.07.2017 01:10:47 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-8230186816585019317
> node2 19.07.2017 01:10:45 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-8242007238441787628
> node2 19.07.2017 01:10:39 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-8250926852732428536
> node3 19.07.2017 00:16:02 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-5150933278940354602
> node3 19.07.2017 00:16:02 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
> node3 19.07.2017 00:16:02 kernel kernerr vmscan: shrink_slab: gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
>
> Would somebody explain these errors? The cluster looks to be working normally. I enabled vm.zone_reclaim_mode = 1 on the nodes...
>
> Thank you!

Please post this to the Clusterlabs - Users list. This ML is deprecated.
--
Digimer
Re: [Linux-cluster] HA cluster 6.5 redhat active passive Error
> :Database is failed
> Apr 28 21:08:17 12RHAPPTR04V rgmanager[2183]: #13: Service service:Database failed to stop cleanly
> Apr 28 21:08:28 12RHAPPTR04V rgmanager[2183]: State change: 12RHAPPTR03V UP
> Apr 28 21:08:46 12RHAPPTR04V kernel: fuse init (API version 7.14)
> Apr 28 21:08:46 12RHAPPTR04V seahorse-daemon[4044]: DNS-SD initialization failed: Daemon not running
> Apr 28 21:08:46 12RHAPPTR04V seahorse-daemon[4044]: init gpgme version 1.1.8
> Apr 28 21:08:46 12RHAPPTR04V pulseaudio[4099]: pid.c: Stale PID file, overwriting.
> Apr 28 21:09:38 12RHAPPTR04V ricci[4367]: Executing '/usr/bin/virsh nodeinfo'
>
> thanks
> Deepesh kumar

Hi Deepesh,

You probably got a notice that the linux-cluster list is deprecated. I am replying to the new list, clusterlabs. You will want to subscribe there and continue over there, as there are many more people on that list.

For clvmd, you need to set the following in lvm.conf:

global {
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 0
}

This assumes you are not trying to use non-clustered LVM and clustered LVM at the same time. If you are, you probably don't want to. If you do anyway, don't set the fallback variables.

With this, you then start cman, then start clvmd. With clvmd running, new VGs default to the clustered type. You can override this with 'vgcreate -c{y,n}'.

If you still have trouble, please share your full cluster.conf (obfuscate passwords, please).

--
Digimer
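The start order and the clustered-VG override mentioned above can be sketched as follows (the VG names and block devices are placeholders):

```shell
# Cluster stack first, then clustered LVM.
service cman start
service clvmd start

# With clvmd running, new VGs default to the clustered type;
# the -c flag makes the choice explicit.
vgcreate -cy vg_shared /dev/sdb   # clustered VG (placeholder names)
vgcreate -cn vg_local  /dev/sdc   # explicitly non-clustered VG
```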
Re: [Linux-cluster] Rhel 7.2 Pacemaker cluster - gfs2 file system- NFS document
On 28/04/17 06:34 AM, Dawood Munavar S M wrote:
> Hello All,
>
> Could you please share any links/documents to create an NFS HA cluster over the gfs2 file system using Pacemaker.
>
> Currently I have completed up to mounting the gfs2 file systems on the cluster nodes, and now I need to create cluster resources for the NFS server, exports, and the mount on the client.
>
> Thanks,
> Munavar.

I use gfs2 quite a bit, but not nfs. Can I make a suggestion? Don't use gfs2 for this. You will have much better performance if you use an active/passive failover with a non-clustered FS. GFS2, like any cluster FS, needs to have the cluster handle locks, which is always going to be slower (by a fair amount) than traditional internal FS locking.

The common NFS HA cluster setup is to have the cluster promote/connect the backing storage (drbd/iscsi), mount the FS, start nfs and then take a floating IP address.

GFS2 is an excellent FS for situations where it is needed, and should be avoided anywhere possible. :)

--
Digimer
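As a rough illustration of the active/passive stack described above (resource names, the mount point, device, and IP are all placeholders; the backing-storage resource, e.g. DRBD or iSCSI, is omitted):

```shell
# Filesystem -> NFS server -> floating IP, grouped so they start in
# order on one node and fail over as a unit.
pcs resource create nfs_fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/srv/nfs fstype=xfs
pcs resource create nfs_daemon ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/srv/nfs/nfsinfo
pcs resource create nfs_ip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24
pcs resource group add nfs_group nfs_fs nfs_daemon nfs_ip
```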
Re: [Linux-cluster] Active/passive cluster between physical and VM
On 22/03/17 03:11 AM, Amjad Syed wrote:
> Hello,
>
> We are planning to build a 2-node active/passive cluster using pacemaker.
> Can the cluster be built between one physical machine and one VM in CentOS 7.x?
> If yes, what can be used as the fencing agent?

So long as the traffic between the nodes is not molested, it should work fine. As for fencing, it depends on your hardware and hypervisor... Using a generic example, you could use fence_ipmilan to fence the hardware node and fence_virsh to fence a KVM/qemu-based VM.

PS - I've cc'ed the Clusterlabs - Users ML. This list is deprecated, so please switch over to there (http://lists.clusterlabs.org/mailman/listinfo/users).

--
Digimer
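Sketching the generic example above with pcs (all addresses, credentials, and node names are placeholders):

```shell
# One fence device per node type; pcmk_host_list ties each device
# to the node it is able to power-cycle.
pcs stonith create fence_hw fence_ipmilan \
    ipaddr=10.0.0.10 login=admin passwd=secret \
    pcmk_host_list=physical-node
pcs stonith create fence_vm fence_virsh \
    ipaddr=10.0.0.20 login=root identity_file=/root/.ssh/id_rsa \
    port=vm-node pcmk_host_list=vm-node
```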
Re: [Linux-cluster] unable to start mysql as a clustered service, OK stand-alone
Please ask again on the Clusterlabs - Users list. This list is (quite) deprecated now.

http://clusterlabs.org/mailman/listinfo/users

digimer

On 08/08/16 06:40 PM, berg...@merctech.com wrote:
> I've got a 3-node CentOS 6 cluster and I'm trying to add mysql 5.1 as a new service. Other cluster services (IP addresses, Postgresql, applications) work fine.
>
> The mysql config file and data files are located on shared, cluster-wide storage (GPFS).
>
> On each node, I can successfully start mysql via:
>     service mysqld start
> and via:
>     rg_test test /etc/cluster/cluster.conf start service mysql
>
> (in each case, the corresponding command with the 'stop' option will also successfully shut down mysql).
>
> However, attempting to start the mysql service with clusvcadm results in the service failing over from one node to the next, and being marked as "stopped" after the last node.
>
> Each failover happens very quickly, in about 5 seconds. I suspect that rgmanager isn't waiting long enough for mysql to start before checking if it is running, and I have added startup delays in cluster.conf, but they don't seem to be honored. Nothing is written into the mysql log file at this time -- no startup or failure messages. The only log entries (/var/log/messages, /var/log/cluster/*, etc.) reference rgmanager, not the mysql process itself.
>
> Any suggestions?
> RHCS components:
>     cman-3.0.12.1-78.el6.x86_64
>     luci-0.26.0-78.el6.centos.x86_64
>     rgmanager-3.0.12.1-26.el6_8.3.x86_64
>     ricci-0.16.2-86.el6.x86_64
>
> ----- /etc/cluster/cluster.conf (edited) -----
>
> config_file="/var/lib/pgsql/data/postgresql.conf" name="PostgreSQL8" postmaster_user="postgres" startup_wait="25"/>
>
> config_file="/cluster_shared/mysql_centos6/etc/my.cnf" listen_address="192.168.169.173" name="mysql" shutdown_wait="10" startup_wait="30"/>
>
> restart_expire_time="180">
>
> ----- /var/log/cluster/rgmanager.log from attempt to start mysql with clusvcadm -----
> Aug 08 11:58:16 rgmanager Recovering failed service service:mysql
> Aug 08 11:58:16 rgmanager [ip] Link for eth2: Detected
> Aug 08 11:58:16 rgmanager [ip] Adding IPv4 address 192.168.169.173/24 to eth2
> Aug 08 11:58:16 rgmanager [ip] Pinging addr 192.168.169.173 from dev eth2
> Aug 08 11:58:18 rgmanager [ip] Sending gratuitous ARP: 192.168.169.173 c8:1f:66:e8:bb:34 brd ff:ff:ff:ff:ff:ff
> Aug 08 11:58:19 rgmanager [mysql] Verifying Configuration Of mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Verifying Configuration Of mysql:mysql > Succeed
> Aug 08 11:58:19 rgmanager [mysql] Monitoring Service mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed
> Aug 08 11:58:19 rgmanager [mysql] Monitoring Service mysql:mysql > Service Is Not Running
> Aug 08 11:58:19 rgmanager [mysql] Starting Service mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Looking For IP Address > Succeed - IP Address Found
> Aug 08 11:58:20 rgmanager [mysql] Starting Service mysql:mysql > Succeed
> Aug 08 11:58:21 rgmanager [mysql] Monitoring Service mysql:mysql
> Aug 08 11:58:21 rgmanager 1 events processed
> Aug 08 11:58:21 rgmanager [mysql] Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed
> Aug 08 11:58:21 rgmanager [mysql] Monitoring Service mysql:mysql > Service Is Not Running
> Aug 08 11:58:21 rgmanager start on mysql "mysql" returned 7 (unspecified)
> Aug 08 11:58:21 rgmanager #68: Failed to start service:mysql; return value: 1
> Aug 08 11:58:21 rgmanager Stopping service service:mysql
> Aug 08 11:58:21 rgmanager [mysql] Verifying Configuration Of mysql:mysql
> Aug 08 11:58:21 rgmanager [mysql] Verifying Configuration Of mysql:mysql > Succeed
> Aug 08 11:58:21 rgmanager [mysql] Stopping Service mysql:mysql
> Aug 08 11:58:21 rgmanager [mysql] Checking Existence Of File /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't Exist
> Aug 08 11:58:21 rgmanager [mysql] Stopping Service mysql:mysql > Su
Re: [Linux-cluster] Fencing Question
On 06/06/16 05:37 PM, Andrew Kerber wrote:
> I am doing some experimentation with Linux clustering, and am still fairly new to it. I have built a cluster as a proof of concept running a PostgreSQL 9.5 database on gfs2 using VMware Workstation 12.0 and RHEL 7. GFS2 requires a fencing resource, which I have managed to create using fence_virsh, and the clustering software thinks the fencing is working. However, it will not actually shut down a node, and I have not been able to figure out the appropriate parameters for VMware Workstation to get it to work. I tried fence_scsi also, but that doesn't seem to work with a shared vmdk. Has anyone figured out a fencing agent that will work with VMware Workstation?
>
> Failing that, is there a comprehensive set of instructions for creating my own fencing agent?
>
> --
> Andrew W. Kerber
>
> 'If at first you don't succeed, don't take up skydiving.'

The 'fence_vmware' agent (and its helpers) is designed specifically for VMware. I've not used it myself, but I've heard of many people using it successfully.

Side note: GFS2, for all its greatness, is not fast (nothing using cluster locking will be). Be sure to performance-test before production. If you find the performance is not good, consider active/passive on a standard FS.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
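Whatever agent ends up being used, it is worth exercising it from the command line before wiring it into the cluster. As one example, a status query with fence_vmware_soap (which targets ESXi/vCenter APIs rather than Workstation) might look like this; the host, credentials, and VM name are all placeholders:

```shell
# Ask the hypervisor for the guest's power state; if this fails or
# hangs, fencing from inside the cluster will fail the same way.
fence_vmware_soap -a vcenter.example.com -l admin -p secret \
    -o status -n my-guest-vm
```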
Re: [Linux-cluster] Copying the result continuously.
On 10/03/16 10:55 PM, Elsaid Younes wrote:
> Hi all,
>
> I wish to be able to run a long simulation through the gromacs program, using the MPI method. I want to modify the input data after every sub-task. I think that is the meaning of the following code, which is part of my script:
>
>     cat << EOF > copyfile.sh
>     #!/bin/sh
>     cp -p result*.dat $SLURM_SUBMIT_DIR
>     EOF
>     chmod u+x copyfile.sh
>     srun -n $SLURM_NNODES -N $SLURM_NNODES cp copyfile.sh $SNIC_TMP
>
> And I have to srun copyfile.sh at the end of every processor:
>
>     srun -n $SLURM_NNODES -N $SLURM_NNODES copyfile.sh
>
> Is there something wrong? I need to know the meaning of result*.
>
> Thanks in advance,
> /Elsaid

Hi Elsaid,

Your question is on-topic here, so I hope someone here might be able to help you. Do note, though, that most discussion here is related to availability clustering; HPC clustering is fairly lightly represented. So if you can think of other places to ask as well as here, you might want to cross-post your question to those other lists.

cheers

--
Digimer
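The quoted heredoc looks mangled in transit; a sketch of what it was presumably meant to do (copyfile.sh and the result*.dat glob come from the quoted script; SLURM_SUBMIT_DIR, SLURM_NNODES, and SNIC_TMP are standard Slurm/SNIC variables):

```shell
# Write a helper script that copies every file matching the shell glob
# "result*.dat" (result1.dat, result_final.dat, ...) back to the
# directory the job was submitted from.
cat << 'EOF' > copyfile.sh
#!/bin/sh
cp -p result*.dat "$SLURM_SUBMIT_DIR"
EOF
chmod u+x copyfile.sh

# In the job script it would then be staged and run per node, e.g.:
#   srun -n $SLURM_NNODES -N $SLURM_NNODES cp copyfile.sh $SNIC_TMP
#   ...compute step producing result*.dat...
#   srun -n $SLURM_NNODES -N $SLURM_NNODES ./copyfile.sh
```

So "result*" is nothing Slurm-specific: it is an ordinary shell glob matching any file in the current directory whose name starts with "result" and ends with ".dat".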
Re: [Linux-cluster] CMAN Failed to start on Secondary Node
Working fencing is required. The rgmanager component waits for a successful fence message before beginning recovery (to prevent split-brain).

On 05/03/16 04:47 AM, Shreekant Jena wrote:
> Secondary node:
> --------------
> [root@Node2 ~]# cat /etc/cluster/cluster.conf
> post_join_delay="3"/>
> restricted="1">
> priority="1"/>
> priority="1"/>
> name="PE51SPM1">
> force_fsck="1" force_unmount="1" fsid="3446" fstype="ext3" mountpoint="/SPIM/admin" name="admin" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="17646" fstype="ext3" mountpoint="/flatfile_upload" name="flatfile_upload" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="64480" fstype="ext3" mountpoint="/oracle" name="oracle" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="60560" fstype="ext3" mountpoint="/SPIM/datafile_01" name="datafile_01" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="48426" fstype="ext3" mountpoint="/SPIM/datafile_02" name="datafile_02" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="54326" fstype="ext3" mountpoint="/SPIM/redolog_01" name="redolog_01" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="23041" fstype="ext3" mountpoint="/SPIM/redolog_02" name="redolog_02" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="46362" fstype="ext3" mountpoint="/SPIM/redolog_03" name="redolog_03" options="" self_fence="1"/>
> force_fsck="1" force_unmount="1" fsid="58431" fstype="ext3" mountpoint="/SPIM/archives_01" name="archives_01" options="" self_fence="1"/>
>
> [root@Node2 ~]# clustat
> msg_open: Invalid argument
> Member Status: Inquorate
>
> Resource Group Manager not running; no service information available.
>
> Membership information not available
>
> Primary node:
> -------------
> [root@Node1 ~]# clustat
> Member Status: Quorate
>
> Member Name        Status
> ------ ----        ------
> Node1              Online, Local, rgmanager
> Node2              Offline
>
> Service Name       Owner (Last)       State
> ------- ----       ----- ------       -----
> Package1           Node1              started
>
> On Sat, Mar 5, 2016 at 12:17 PM, Digimer <li...@alteeve.ca <mailto:li...@alteeve.ca>> wrote:
>> Please share your cluster.conf (only obfuscate passwords please) and the output of 'clustat' from each node.
>>
>> digimer
>>
>> On 05/03/16 01:46 AM, Shreekant Jena wrote:
>>> Dear All,
>>>
>>> I have a 2-node cluster, but after a reboot the secondary node is showing offline, and cman failed to start.
>>>
>>> Please find below logs on the secondary node:
>>>
>>> root@EI51SPM1 cluster]# clustat
>>> msg_open: Invalid argument
>>> Member Status: Inquorate
>>>
>>> Resource Group Manager not running; no service information available.
>>>
>>> Me
Re: [Linux-cluster] CMAN Failed to start on Secondary Node
Please share your cluster.conf (only obfuscate passwords, please) and the output of 'clustat' from each node.

digimer

On 05/03/16 01:46 AM, Shreekant Jena wrote:
> Dear All,
>
> I have a 2-node cluster, but after a reboot the secondary node is showing offline, and cman failed to start.
>
> Please find below logs on the secondary node:
>
> root@EI51SPM1 cluster]# clustat
> msg_open: Invalid argument
> Member Status: Inquorate
>
> Resource Group Manager not running; no service information available.
>
> Membership information not available
> [root@EI51SPM1 cluster]# tail -10 /var/log/messages
> Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
> Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
> Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
> Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
> Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
> Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
> [root@EI51SPM1 cluster]#
> [root@EI51SPM1 cluster]# cman_tool status
> Protocol version: 5.0.1
> Config version: 166
> Cluster name: IVRS_DB
> Cluster ID: 9982
> Cluster Member: No
> Membership state: Joining
> [root@EI51SPM1 cluster]# cman_tool nodes
> Node  Votes Exp Sts  Name
> [root@EI51SPM1 cluster]#
>
> Thanks & regards
> SHREEKANTA JENA

--
Digimer
Re: [Linux-cluster] GFS2 mount hangs for some disks
Can you re-ask this on the Clusterlabs users' list? This list is being phased out.

http://clusterlabs.org/mailman/listinfo/users

digimer

On 05/01/16 01:37 PM, B.Baransel BAĞCI wrote:
> Hi list,
>
> I have some problems with GFS2 with failed nodes. After one of the cluster nodes is fenced and rebooted, it cannot mount some of the gfs2 file systems but hangs on the mount operation. No output. I've waited nearly 10 minutes to mount a single disk, but it didn't respond. The only solution is to shut down all nodes and do a clean start of the cluster. I'm suspecting journal size or file system quotas.
>
> I have an 8-node rhel-6 cluster with GFS2-formatted disks which are all mounted by all nodes. There are two types of disk:
>
> Type A:
>     ~50 GB disk capacity
>     8 journals with size 512 MB
>     block-size: 1024
>     very small files (avg: 50 bytes - symlinks)
>     ~500,000 files (inodes)
>     Usage: 10%
>     Nearly no write IO (under 1000 files per day)
>     No user quota (quota=off)
>     Mount options: async,quota=off,nodiratime,noatime
>
> Type B:
>     ~1 TB disk capacity
>     8 journals with size 512 MB
>     block-size: 4096
>     relatively small files (avg: 20 KB)
>     ~5,000,000 files (inodes)
>     Usage: 20%
>     write IO ~50,000 files per day
>     user quota is on (some of the users exceeded quota)
>     Mount options: async,quota=on,nodiratime,noatime
>
> To improve performance, I set the journal size to 512 MB instead of the 128 MB default. All disks are connected with fiber from SAN storage. All disks are on clustered LVM. All nodes are connected to each other with a private Gb switch.
>
> For example, after "node5" failed and was fenced, it can re-enter the cluster. When I try "service gfs2 start", it can mount the "Type A" disks, but hangs on the first "Type B" disk. The logs stop at the "Trying to join cluster lock_dlm" message:
>
> ...
> Jan 05 00:01:52 node5 lvm[4090]: Found volume group "VG_of_TYPE_A"
> Jan 05 00:01:52 node5 lvm[4119]: Activated 2 logical volumes in volume group VG_of_TYPE_A
> Jan 05 00:01:52 node5 lvm[4119]: 2 logical volume(s) in volume group "VG_of_TYPE_A" now active
> Jan 05 00:01:52 node5 lvm[4119]: Wiping internal VG cache
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Starting up
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Ready
> Jan 05 00:02:26 node5 kernel: GFS2 (built Dec 12 2014 16:06:57) installed
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeA1"
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: Joined cluster. Now mounting FS...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5, already locked for use
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Looking at journal...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Done
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeA2"
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: Joined cluster. Now mounting FS...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5, already locked for use
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Looking at journal...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Done
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "TESTCLS:typeB1"
>
> I've waited nearly 10 minutes in this state without response or log output. In this state, I cannot do `ls` on another node for this file system. Any idea of the cause of the problem? How is the cluster affected by journal size or count?

--
Digimer
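For reference, the journal count and size discussed above are set at mkfs time and journals can be added later; a sketch using the cluster/fs names from the quoted logs, with a placeholder device and mount point:

```shell
# 8 journals (one per node) of 512 MB each, fixed at filesystem creation:
mkfs.gfs2 -p lock_dlm -t TESTCLS:typeB1 -j 8 -J 512 /dev/vg_cluster/lv_typeB

# More journals can be added later, against the mounted filesystem:
gfs2_jadd -j 2 /mount/point
```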
Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies
On 04/12/15 09:14 AM, Kelvin Edmison wrote:
> On 12/03/2015 09:31 PM, Digimer wrote:
>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>> I am hoping that someone can help me understand the problems I'm having with linux clustering for VMs.
>>>>>
>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a service is always available. The hosts and guests are both RHEL 6.7. The goal is to have only one of the two VMs running at a time.
>>>>>
>>>>> The configuration works when we test/simulate VM deaths and graceful VM host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
>>>>>
>>>>> However, when we simulate the sudden isolation of host A (e.g. ifdown eth0), two things happen:
>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors appear in the logs on host B
>>>>> 2) when the 'failed' node is returned to service, the cman service on host B dies.
>>>> If the node's host is dead, then there is no way for the survivor to determine the state of the lost VM node. The cluster is not allowed to take "no answer" as confirmation of fence success.
>>>>
>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup method where, if fence_xvm fails, it moves on and reboots the host itself.
>>> Thank you for the suggestion. The hosts do have IPMI. I'll explore it, but I'm a little concerned about what it means for the other non-clustered VM workloads that exist on these two servers.
>>>
>>> Do you have any thoughts as to why host B's cman process is dying when 'host A' returns?
>>>
>>> Thanks,
>>>   Kelvin
>> It's not dying, it's blocking. When a node is lost, dlm blocks until fenced tells it that the fence was successful. If fenced can't contact the lost node's fence method(s), then it doesn't succeed and dlm stays blocked. To anything that uses DLM, like rgmanager, it appears like the host is hung, but it is by design. The logic is that, as bad as it is to hang, it's better than risking a split-brain.
> When I said the cman service is dying, I should have further qualified it. I mean that the corosync process is no longer running (ps -ef | grep corosync does not show it) and, after recovering the failed host A, manual intervention (service cman start) was required on host B to recover full cluster services.
>
> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s " $SERVICE; service $SERVICE status; done
> ricci        ricci (pid 5469) is running...
> fence_virtd  fence_virtd (pid 4862) is running...
> cman         Found stale pid file
> rgmanager    rgmanager (pid 5366) is running...
>
> Thanks,
>   Kelvin

Oh, now that is interesting...

You'll want input from Fabio, Chrissie or one of the other core devs, I suspect.

If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and if you can reproduce it reliably, can you create a new thread with the reproducer?

--
Digimer
Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies
On 04/12/15 01:52 PM, Kelvin Edmison wrote:
> On 12/04/2015 12:49 PM, Digimer wrote:
>> On 04/12/15 09:14 AM, Kelvin Edmison wrote:
>>> On 12/03/2015 09:31 PM, Digimer wrote:
>>>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>>>> I am hoping that someone can help me understand the problems I'm having with linux clustering for VMs.
>>>>>>>
>>>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a service is always available. The hosts and guests are both RHEL 6.7. The goal is to have only one of the two VMs running at a time.
>>>>>>>
>>>>>>> The configuration works when we test/simulate VM deaths and graceful VM host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
>>>>>>>
>>>>>>> However, when we simulate the sudden isolation of host A (e.g. ifdown eth0), two things happen:
>>>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors appear in the logs on host B
>>>>>>> 2) when the 'failed' node is returned to service, the cman service on host B dies.
>>>>>> If the node's host is dead, then there is no way for the survivor to determine the state of the lost VM node. The cluster is not allowed to take "no answer" as confirmation of fence success.
>>>>>>
>>>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup method where, if fence_xvm fails, it moves on and reboots the host itself.
>>>>> Thank you for the suggestion. The hosts do have IPMI. I'll explore it, but I'm a little concerned about what it means for the other non-clustered VM workloads that exist on these two servers.
>>>>>
>>>>> Do you have any thoughts as to why host B's cman process is dying when 'host A' returns?
>>>>>
>>>>> Thanks,
>>>>>   Kelvin
>>>> It's not dying, it's blocking. When a node is lost, dlm blocks until fenced tells it that the fence was successful. If fenced can't contact the lost node's fence method(s), then it doesn't succeed and dlm stays blocked. To anything that uses DLM, like rgmanager, it appears like the host is hung, but it is by design. The logic is that, as bad as it is to hang, it's better than risking a split-brain.
>>> When I said the cman service is dying, I should have further qualified it. I mean that the corosync process is no longer running (ps -ef | grep corosync does not show it) and, after recovering the failed host A, manual intervention (service cman start) was required on host B to recover full cluster services.
>>>
>>> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s " $SERVICE; service $SERVICE status; done
>>> ricci        ricci (pid 5469) is running...
>>> fence_virtd  fence_virtd (pid 4862) is running...
>>> cman         Found stale pid file
>>> rgmanager    rgmanager (pid 5366) is running...
>>>
>>> Thanks,
>>>   Kelvin
>> Oh, now that is interesting...
>>
>> You'll want input from Fabio, Chrissie or one of the other core devs, I suspect.
>>
>> If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and if you can reproduce it reliably, can you create a new thread with the reproducer?
> It's RHEL proper in both host and guest, and we can reproduce it reliably.

Excellent! Please reply here with the rhbz#. I'm keen to see what comes of it.

--
Digimer
Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies
On 03/12/15 02:19 PM, Kelvin Edmison wrote:
> I am hoping that someone can help me understand the problems I'm having with linux clustering for VMs.
>
> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a service is always available. The hosts and guests are both RHEL 6.7. The goal is to have only one of the two VMs running at a time.
>
> The configuration works when we test/simulate VM deaths and graceful VM host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
>
> However, when we simulate the sudden isolation of host A (e.g. ifdown eth0), two things happen:
> 1) the VM on host B does not start, and repeated fence_xvm errors appear in the logs on host B
> 2) when the 'failed' node is returned to service, the cman service on host B dies.

If the node's host is dead, then there is no way for the survivor to determine the state of the lost VM node. The cluster is not allowed to take "no answer" as confirmation of fence success.

If your hosts have IPMI, then you could add fence_ipmilan as a backup method where, if fence_xvm fails, it moves on and reboots the host itself.

> This is my cluster.conf file (some elisions re: hostnames)
>
> key_file="/etc/cluster/fence_xvm_hostA.key" multicast_address="239.255.1.10" name="virtfence1"/>
> key_file="/etc/cluster/fence_xvm_hostB.key" multicast_address="239.255.2.10" name="virtfence2"/>
> use_virsh="1"/>
>
> Thanks for any help you can offer,
> Kelvin Edmison

--
Digimer
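With a fence_xvm setup like the one quoted above, each guest's view of the hosts can be checked by hand (the multicast addresses and key files are taken from the quoted cluster.conf):

```shell
# From a VM node, ask each host's fence_virtd daemon for its guest list;
# if either of these hangs or errors, cluster fencing will fail the
# same way when that host is isolated.
fence_xvm -o list -a 239.255.1.10 -k /etc/cluster/fence_xvm_hostA.key
fence_xvm -o list -a 239.255.2.10 -k /etc/cluster/fence_xvm_hostB.key
```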
Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies
On 03/12/15 08:39 PM, Kelvin Edmison wrote: > On 12/03/2015 06:14 PM, Digimer wrote: >> On 03/12/15 02:19 PM, Kelvin Edmison wrote: >>> I am hoping that someone can help me understand the problems I'm having >>> with linux clustering for VMs. >>> >>> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a >>> service is always available. The hosts and guests are both RHEL 6.7. >>> The goal is to have only one of the two VMs running at a time. >>> >>> The configuration works when we test/simulate VM deaths and graceful VM >>> host shutdowns, and administrative switchovers (i.e. clusvcadm -r ). >>> >>> However, when we simulate the sudden isolation of host A (e.g. ifdown >>> eth0), two things happen >>> 1) the VM on host B does not start, and repeated fence_xvm errors appear >>> in the logs on host B >>> 2) when the 'failed' node is returned to service, the cman service on >>> host B dies. >> If the node's host is dead, then there is no way for the survivor to >> determine the state of the lost VM node. The cluster is not allowed to >> take "no answer" as confirmation of fence success. >> >> If your hosts have IPMI, then you could add fence_ipmilan as a backup >> method where, if fence_xvm fails, it moves on and reboots the host >> itself. > > Thank you for the suggestion. The hosts do have ipmi. I'll explore it > but I'm a little concerned about what it means for the other > non-clustered VM workloads that exist on these two servers. > > Do you have any thoughts as to why host B's cman process is dying when > 'host A' returns? > > Thanks, > Kelvin It's not dying; it's blocking. When a node is lost, dlm blocks until fenced tells it that the fence was successful. If fenced can't contact the lost node's fence method(s), then it doesn't succeed and dlm stays blocked. To anything that uses DLM, like rgmanager, it appears as though the host is hung, but this is by design. The logic is that, as bad as it is to hang, it's better than risking a split-brain. 
As for what will happen to non-cluster services, well, if I can be blunt, you shouldn't mix the two. If something is important enough to make HA, then it is important enough for dedicated hardware in my opinion. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Alternative to resource monitor polling?
I would ask on the Cluster Labs mailing list; Either users or Developers. digimer On 15/10/15 03:42 PM, Vallevand, Mark K wrote: > Is this the correct forum for questions like this? > > > > Ubuntu 12.04 LTS > > pacemaker 1.1.10 > > cman 3.1.7 > > corosync 1.4.6 > > > > One more question: > > If my cluster has no resources, it seems like it takes 20s for a stopped > node to be detected. Is the value really 20s and is it a parameter that > can be adjusted? > > > > Thanks. > > > > Regards. > Mark K Vallevand mark.vallev...@unisys.com > <mailto:mark.vallev...@unisys.com> > Never try and teach a pig to sing: it's a waste of time, and it annoys > the pig. > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY > MATERIAL and is thus for use only by the intended recipient. If you > received this in error, please contact the sender and delete the e-mail > and its attachments from all computers. > > *From:*linux-cluster-boun...@redhat.com > [mailto:linux-cluster-boun...@redhat.com] *On Behalf Of *Vallevand, Mark K > *Sent:* Thursday, October 15, 2015 12:19 PM > *To:* linux clustering > *Subject:* [Linux-cluster] Alternative to resource monitor polling? > > > > Is there an alternative to resource monitor polling to detect a resource > failure? > > If, for example, a resource failure is detected by our own software, > could it signal clustering that a resource has failed? > > > > Regards. > Mark K Vallevand mark.vallev...@unisys.com > <mailto:mark.vallev...@unisys.com> > Never try and teach a pig to sing: it's a waste of time, and it annoys > the pig. > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY > MATERIAL and is thus for use only by the intended recipient. If you > received this in error, please contact the sender and delete the e-mail > and its attachments from all computers. 
> > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Linux cluster] DLM not start
On 02/09/15 09:58 AM, Nguyễn Trường Sơn wrote: > How can i use fencing? > > Do you mean "pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld > op monitor interval=60s on-fail=fence" > > > It is still error. > > I have Centos 7.0, with pacemaker-1.1.12-22.el7_1.2.x86_64 Fencing is a process where a lost node is removed from the cluster, usually by rebooting it with IPMI, cutting power using a switched PDU, etc. Exactly how you do fencing depends on your environment and on what fence devices are available to you. DLM requires working fencing. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
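[Editor's illustration] On a Pacemaker cluster like the CentOS 7 one above, a stonith device has to exist and be enabled before the `on-fail=fence` option on the controld resource can do anything. A hedged sketch of the shape such a configuration might take follows; the agent choice, node name, address and credentials are hypothetical, and these are cluster-side commands, not something runnable outside the cluster:

```shell
# Hypothetical fence device; substitute an agent that matches your hardware.
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list="node1" ipaddr="10.0.0.1" \
    login="admin" passwd="secret" \
    op monitor interval=60s

# Stonith must be enabled, or DLM will never be told a fence succeeded.
pcs property set stonith-enabled=true

# Only then create the DLM resource, as suggested in the thread.
pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld \
    op monitor interval=60s on-fail=fence
```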
Re: [Linux-cluster] new cluster setup error
On 30/06/15 10:51 PM, Megan . wrote: Good Evening! Anyone seen this before? I just setup these boxes and i'm trying to create a new cluster. I set the ricci password on all of the nodes, started ricci. I try to create cluster and i get the below. Thanks! Centos 6.6 2.6.32-504.23.4.el6.x86_64 ccs-0.16.2-75.el6_6.2.x86_64 ricci-0.16.2-75.el6_6.1.x86_64 cman-3.0.12.1-68.el6.x86_64 [root@admin1-dit cluster]# ccs --createcluster test Traceback (most recent call last): File "/usr/sbin/ccs", line 2450, in <module> main(sys.argv[1:]) File "/usr/sbin/ccs", line 286, in main if (createcluster): create_cluster(clustername) File "/usr/sbin/ccs", line 939, in create_cluster elif get_cluster_conf_xml() != f.read(): File "/usr/sbin/ccs", line 884, in get_cluster_conf_xml xml = send_ricci_command(cluster, "get_cluster.conf") File "/usr/sbin/ccs", line 2340, in send_ricci_command dom = minidom.parseString(res[1].replace('\t','')) File "/usr/lib64/python2.6/xml/dom/minidom.py", line 1928, in parseString return expatbuilder.parseString(string) File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 940, in parseString return builder.parseString(string) File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 223, in parseString parser.Parse(string, True) xml.parsers.expat.ExpatError: no element found: line 1, column 0 Are the ricci and modclusterd daemons running? Does your firewall allow TCP ports 11111 and 16851 between nodes? Does the file /etc/cluster/cluster.conf exist and, if so, does 'ls -lahZ' show: -rw-r-. root root system_u:object_r:cluster_conf_t:s0 /etc/cluster/cluster.conf -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
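[Editor's illustration] Digimer's checklist above can be walked with a short command fragment on each node. These are RHEL 6 era commands and only meaningful on the cluster nodes themselves; the ports shown are ricci's (11111/tcp) and modclusterd's (16851/tcp):

```shell
# Are the daemons ccs talks to actually running?
service ricci status
service modclusterd status

# Is traffic allowed on ricci (11111/tcp) and modclusterd (16851/tcp)?
iptables -L -n | grep -E '11111|16851'

# Does cluster.conf exist with the expected SELinux context?
ls -lahZ /etc/cluster/cluster.conf
```

The ExpatError in the traceback ("no element found: line 1, column 0") is consistent with ricci returning an empty response, which is why the daemon and firewall checks come first.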
Re: [Linux-cluster] gfs2-utils 3.1.8 released
Hi Andrew, Congrats!! Want to add the cluster labs mailing list to your list of release announcement locations? digimer On 07/04/15 01:03 PM, Andrew Price wrote: Hi, I am happy to announce the 3.1.8 release of gfs2-utils. This release includes the following visible changes: * Performance improvements in fsck.gfs2, mkfs.gfs2 and gfs2_edit savemeta. * Better checking of journals, the jindex, system inodes and inode 'goal' values in fsck.gfs2 * gfs2_jadd and gfs2_grow are now separate programs instead of symlinks to mkfs.gfs2. * Improved test suite and related documentation. * No longer clobbers the configure script's --sbindir option. * No longer depends on perl. * Various minor bug fixes and enhancements. See below for a complete list of changes. The source tarball is available from: https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.8.tar.gz Please test, and report bugs against the gfs2-utils component of Fedora rawhide: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedoracomponent=gfs2-utilsversion=rawhide Regards, Andy Changes since version 3.1.7: Abhi Das (6): fsck.gfs2: fix broken i_goal values in inodes gfs2_convert: use correct i_goal values instead of zeros for inodes tests: test for incorrect inode i_goal values mkfs.gfs2: addendum to fix broken i_goal values in inodes gfs2_utils: more gfs2_convert i_goal fixes gfs2-utils: more fsck.gfs2 i_goal fixes Andrew Price (58): gfs2-utils tests: Build unit tests with consistent cpp flags libgfs2: Move old rgrp layout functions into fsck.gfs2 gfs2-utils build: Add test coverage option fsck.gfs2: Fix memory leak in pass2 gfs2_convert: Fix potential memory leaks in adjust_inode gfs2_edit: Fix signed value used as array index in print_ld_blks gfs2_edit: Set umask before calling mkstemp in savemetaopen() gfs2_edit: Fix use-after-free in find_wrap_pt libgfs2: Clean up broken rgrp length check libgfs2: Remove superfluous NULL check from gfs2_rgrp_free libgfs2: Fail fd comparison if the fds are negative 
libgfs2: Fix check for O_RDONLY fsck.gfs2: Remove dead code from scan_inode_list mkfs.gfs2: Terminate lockproto and locktable strings explicitly libgfs2: Add generic field assignment and print functions gfs2_edit: Use metadata description to print and assign fields gfs2l: Switch to lgfs2_field_assign libgfs2: Remove device_name from struct gfs2_sbd libgfs2: Remove path_name from struct gfs2_sbd libgfs2: metafs_path improvements gfs2_grow: Don't use PATH_MAX in main_grow gfs2_jadd: Don't use fixed size buffers for paths libgfs2: Remove orig_journals from struct gfs2_sbd gfs2l: Check unchecked returns in openfs gfs2-utils configure: Fix exit with failure condition gfs2-utils configure: Remove checks for non-existent -W flags gfs2_convert: Don't use a fixed sized buffer for device path gfs2_edit: Add bounds checking for the journalN keyword libgfs2: Make find_good_lh and jhead_scan static Build gfs2_grow, gfs2_jadd and mkfs.gfs2 separately gfs2-utils: Honour --sbindir gfs2-utils configure: Use AC_HELP_STRING in help messages fsck.gfs2: Improve reporting of pass timings mkfs.gfs2: Revert default resource group size gfs2-utils tests: Add keywords to tests gfs2-utils tests: Shorten TESTSUITEFLAGS to TOPTS gfs2-utils tests: Improve docs gfs2-utils tests: Skip unit tests if check is not found gfs2-utils tests: Document usage of convenience macros fsck.gfs2: Fix 'initializer element is not constant' build error fsck.gfs2: Simplify bad_journalname gfs2-utils build: Add a configure script summary mkfs.gfs2: Remove unused declarations gfs2-utils/tests: Fix unit tests for older check libraries fsck.gfs2: Fix memory leaks in pass1_process_rgrp libgfs2: Use the correct parent for rgrp tree insertion libgfs2: Remove some obsolete function declarations gfs2-utils: Move metafs handling into gfs2/mkfs/ gfs2_grow/jadd: Use a matching context mount option in mount_gfs2_meta gfs2_edit savemeta: Don't read rgrps twice fsck.gfs2: Fetch directory inodes early in pass2() libgfs2: Remove 
some unused data structures gfs2-utils: Tidy up Makefile.am files gfs2-utils build: Remove superfluous passive header checks gfs2-utils: Consolidate some bad constants strings gfs2-utils: Update translation template libgfs2: Fix potential NULL deref in linked_leaf_search() gfs2_grow: Put back the definition of FALLOC_FL_KEEP_SIZE Bob Peterson (15): fsck.gfs2: Detect and correct corrupt journals fsck.gfs2: Change basic dentry checks for too long of file names
Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes
Do the logs show whether the fence succeeded or failed? Can you please post the logs from the surviving two nodes starting just before the failure until a few minutes after? digimer On 31/01/15 12:10 AM, cluster lab wrote: Some more information: Cluster is a three nodes cluster, One of its node (ID == 1) fenced because of network failure ... After fence this problem borned ... On Sat, Jan 31, 2015 at 8:28 AM, cluster lab cluster.l...@gmail.com wrote: Hi, There isn't any unusual state or message, Also GFS logs (gfs, dlm) are silent ... Is there any chance to find source of problem? On Thu, Jan 29, 2015 at 7:04 PM, Bob Peterson rpete...@redhat.com wrote: - Original Message - On affected node: stat FILE | grep Inode stat: cannot stat `FILE': Input/output error On other node: stat PublicDNS1-OS.qcow2 | grep Inode Device: fd06h/64774d Inode: 267858 Links: 1 Something funky going on. I'd check dmesg for withdraw messages, etc., on the affected node. Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes
On 31/01/15 01:52 AM, cluster lab wrote: Jan 21 17:07:43 ost-pvm2 fenced[47840]: fencing node ost-pvm1 There are no messages about this succeeding or failing... It looks like only 15 seconds' worth of logs. Can you please share the full amount of time I mentioned before, from both nodes? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes
That looks OK. Can you touch a file from one node and see it on the other and vice-versa? Is there anything in either node's log files when you run 'qemu-img check'? On 29/01/15 12:34 AM, cluster lab wrote: Node2: # dlm_tool ls dlm lockspaces name VMStorage3 id0xb26438a2 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 name VMStorage2 id0xab7f09e3 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 name VMStorage1 id0x80525a20 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 === Node1: # dlm_tool ls dlm lockspaces name VMStorage3 id0xb26438a2 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 2,2 members 1 2 name VMStorage2 id0xab7f09e3 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 2,2 members 1 2 name VMStorage1 id0x80525a20 flags 0x0008 fs_reg changemember 2 joined 1 remove 0 failed 0 seq 2,2 members 1 2 On Thu, Jan 29, 2015 at 8:57 AM, Digimer li...@alteeve.ca wrote: On 28/01/15 11:50 PM, cluster lab wrote: Hi, In a two node cluster, i received to different result from qemu-img check on just one file: node1 # qemu-img check VMStorage/x.qcow2 No errors were found on the image. Node2 # qemu-img check VMStorage/x.qcow2 qemu-img: Could not open 'VMStorage/x.qcow2 All other files are OK, and the cluster works properly. What is the problem? Packages: kernel: 2.6.32-431.5.1.el6.x86_64 GFS2: gfs2-utils-3.0.12.1-23.el6.x86_64 corosync: corosync-1.4.1-17.el6.x86_64 Best Regards What does 'dlm_tool ls' show? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
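[Editor's illustration] The cross-node visibility test Digimer suggests in this thread can be done with a short command fragment. The mount point is illustrative (the thread's lockspaces suggest something like /VMStorage1); run the first command on one node, the check on the other, then swap directions:

```shell
# On node1: create a marker file on the shared GFS2 mount.
touch /VMStorage1/.crossnode-test

# On node2: the file should be visible at once; clean it up afterwards.
ls -l /VMStorage1/.crossnode-test && rm /VMStorage1/.crossnode-test

# While testing (and while running 'qemu-img check'), watch for
# withdraw or DLM messages on both nodes:
tail -f -n 0 /var/log/messages
```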
Re: [Linux-cluster] [Pacemaker] HA Summit Key-signing Party
On 26/01/15 09:14 AM, Jan Pokorný wrote: Hello cluster masters, On 13/01/15 00:31 -0500, Digimer wrote: Any concerns/comments/suggestions, please speak up ASAP! I'd like to throw a key-signing party as it will be a perfect opportunity to build a web of trust amongst us. If you haven't incorporated OpenPGP to your communication with the world yet, I would recommend at least considering it, even more in the post-Snowden era. You can use it to prove authenticity/integrity of the data you emit (signing; not just for email as is the case with this one, but also for SW releases and more), provide privacy/confidentiality of interchanged data (encryption; again, typical scenario is a private email, e.g., when you responsibly report a vulnerability to the respective maintainers), or both. In case you have no experience with this technology, there are plentiful resources on GnuPG (most renowned FOSS implementation): - https://www.gnupg.org/documentation/howtos.en.html - http://cryptnet.net/fdp/crypto/keysigning_party/en/keysigning_party.html#prep (preparation steps for a key-signing party) - ... To make the verification process as smooth and as little time-consuming as possible, I would stick with a list-based method: http://cryptnet.net/fdp/crypto/keysigning_party/en/keysigning_party.html#list_based and volunteer for a role of a coordinator. What's needed? 
Once you have a key pair (and provided that you are using GnuPG), please run the following sequence: # figure out the key ID for the identity to be verified; # IDENTITY is either your associated email address/your name # if only single key ID matches, specific key otherwise # (you can use gpg -K to select a desired ID at the sec line) KEY=$(gpg --with-colons 'IDENTITY' | grep '^pub' | cut -d: -f5) # export the public key to a file that is suitable for exchange gpg --export -a -- $KEY $KEY # verify that you have an expected data to share gpg --with-fingerprint -- $KEY with IDENTITY adjusted as per the instruction above, and send me the resulting $KEY file, preferably in a signed (or even encrypted[*]) email from an address associated with that very public key of yours. [*] You can find my public key at public keyservers: http://pool.sks-keyservers.net/pks/lookup?op=vindexsearch=0x60BCBB4F5CD7F9EF Indeed, the trust in this key should be ephemeral/one-off (e.g., using a temporary keyring, not a universal one before we proceed with the signing :) Timeline? Best if you send me your public keys before 2015-02-02. I will then compile a list of the attendees together with their keys and publish it at https://people.redhat.com/jpokorny/keysigning/2015-ha/ so you can print it out and be ready for the party. Thanks for your cooperation, looking forward to this side-event and hope this will be beneficial to all involved. P.S. There's now an opportunity to visit an exhibition of the Bohemian Crown Jewels replicas directly in Brno (sorry, Google Translate only) https://translate.google.com/translate?sl=autotl=enjs=yprev=_thl=enie=UTF-8u=http%3A%2F%2Fwww.letohradekbrno.cz%2F%3Fidm%3D55 =o, keysigning is a brilliant idea! I can put the keys in the plan wiki, too. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
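[Editor's illustration] The archive appears to have mangled the gpg sequence in the mail above: the output redirect is missing (the exported key has to land in a file to be sent), and the listing command word seems to have been dropped. A cleaned-up sketch of the same steps, assuming a GnuPG 1.x era toolchain as in the original mail:

```shell
# Figure out the key ID for the identity to be verified; IDENTITY is
# your associated email address or name (use `gpg -K` first to pick a
# specific key if several match).
KEY=$(gpg --with-colons --list-keys 'IDENTITY' | grep '^pub' | cut -d: -f5)

# Export the public key to a file suitable for exchange.
gpg --export -a -- "$KEY" > "$KEY"

# Verify that the file contains the expected key before sending it.
gpg --with-fingerprint -- "$KEY"
```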
Re: [Linux-cluster] [Pacemaker] [Linux-HA] [ha-wg-technical] [RFC] Organizing HA Summit 2015
Woohoo!! Will be very nice to see you. :) I've added you. Can you give me a short sentence to introduce yourself to people who haven't met you? Madi On 13/01/15 11:33 PM, Yusuke Iida wrote: Hi Digimer, I am Iida to participate from NTT along with Mori. I want you added to the list of participants. I'm sorry contact is late. Regards, Yusuke 2014-12-23 2:13 GMT+09:00 Digimer li...@alteeve.ca: It will be very nice to see you again! Will Ikeda-san be there as well? digimer On 22/12/14 03:35 AM, Keisuke MORI wrote: Hi all, Really late response but, I will be joining the HA summit, with a few colleagues from NTT. See you guys in Brno, Thanks, 2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com: Hello, it occured to me that if you want to use the opportunity and double as as tourist while being in Brno, it's about the right time to consider reservations/ticket purchases this early. At least in some cases it is a must, e.g., Villa Tugendhat: http://rezervace.spilberk.cz/langchange.aspx?mrsname=languageId=2returnUrl=%2Flist On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote: DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February. -- Jan ___ ha-wg-technical mailing list ha-wg-techni...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] [Planning] Organizing HA Summit 2015
Hi all, With Fabio away for now, I (and others) are working on the final preparations for the summit. This is your chance to speak up and influence the planning! Objections/suggestions? Speak now please. :) In particular, please raise topics you want to discuss. Either add them to the wiki directly or email me and I will update the wiki for you. (Note that registration is closed because of spammers, if you want an account just let me know and I'll open it back up). The plan is; * Informal atmosphere with limited structure to make sure key topics are addressed. Two ways topics will be discussed; ** Someone will guide a given topic they want to raise for ~45 minutes, 15 minutes for QA ** Round-table style discussion with no one person leading (though it would be nice to have someone taking notes). People presenting are asked not to use slides. Hand-outs are fine and either a white-board or paper flip-board will be available for illustrating ideas and flushing out concepts. The summit will start at 9:00 and go until 17:00. We'll go for a semi-official summit dinner and drinks around 6pm on the 4th (location to be determined). Those staying in Brno are more than welcome to join an informal dinner and drinks (and possibly some sight-seeing, etc) the evening of the 5th. Any concerns/comments/suggestions, please speak up ASAP! -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] HA Summit 2015 - plan wiki closed for registration
Spammers got through the captcha, *sigh*. If anyone wants to create an account to edit, please email me off-list and I'll get you setup ASAP. Sorry for the hassle. http://plan.alteeve.ca/index.php/Main_Page -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
On 08/01/15 07:17 AM, Cao, Vinh wrote: Hi Digimer, You are correct. I do need to configure fencing. But before fencing, I need to have these servers become member of cluster first. If they are not member of cluster set. Doesn't matter I try to configure fencing or not. My cluster won't work. Thanks for your help. Vinh Define the fence methods right from the start. As soon as the cluster forms, the first thing you do is run 'fence_check -f' on all nodes. If there is a problem, fix it. Only then do you add services. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
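[Editor's illustration] As a command fragment, the startup discipline Digimer describes (fencing first, services last) looks like this on RHEL 6 / cman tooling; `fence_check -f` is the flag named in the mail above:

```shell
# 1. Define the fence methods in cluster.conf before anything else.
# 2. As soon as the cluster forms, verify fencing from every node:
fence_check -f

# 3. Fix any reported fence failure before adding services or
#    mounting GFS2 filesystems.
```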
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
Please configure fencing. If you don't, it _will_ cause you problems. On 07/01/15 09:48 PM, Cao, Vinh wrote: Hi Digimer, No we're not supporting multicast. I'm trying to use Broadcast, but Redhat support is saying better to use transport=udpu. Which I did set and it is complaining time out. I did try to set broadcast, but somehow it didn't work either. Let me give broadcast a try again. Thanks, Vinh -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer Sent: Wednesday, January 07, 2015 5:51 PM To: linux clustering Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster It looks like a network problem... Does your (virtual) switch support multicast properly and have you opened up the proper ports in the firewall? On 07/01/15 05:32 PM, Cao, Vinh wrote: Hi Digimer, Yes, I just did. Looks like they are failing. I'm not sure why that is. Please see the attachment for all servers log. By the way, I do appreciated all the helps I can get. Vinh -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer Sent: Wednesday, January 07, 2015 4:33 PM To: linux clustering Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster Quorum is enabled by default. I need to see the entire logs from all five nodes, as I mentioned in the first email. Please disable cman from starting on boot, configure fencing properly and then reboot all nodes cleanly. Start the 'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, start cman on all five nodes. When things settle down, copy/paste all the log output please. 
On 07/01/15 04:29 PM, Cao, Vinh wrote: Hi Digimer, Here is from the logs: [root@ustlvcmsp1954 ~]# tail -f /var/log/messages Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync profile loading service Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Using quorum provider quorum_cman Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1 Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1 Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [CPG ] chosen downlist: sender r(0) ip(10.30.197.108) ; members(old:0 left:0) Jan 7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Completed service synchronization, ready to provide service. Jan 7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Unloading all Corosync service engines. 
Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync configuration service Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01 Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync profile loading service Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01 Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90 Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 Jan 7 16:15:06 ustlvcmsp1954 corosync[8182]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055. Jan 7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed Then it die at: Starting cman...[ OK ] Waiting for quorum... Timed-out waiting for cluster [FAILED] Yes, I did made changes with: fence_daemon post_join_delay=30/ the problem is still there. One thing I don't know why cluster is looking for quorum? I did have any disk quorum setup in cluster.conf file. Any helps can I get appreciated. Vinh -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer Sent: Wednesday, January 07, 2015 3:59 PM To: linux clustering Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster On 07/01/15 03:39 PM, Cao, Vinh wrote: Hello Digimer, Yes, I would agrre with you RHEL6.4 is old. We patched
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
It looks like a network problem... Does your (virtual) switch support multicast properly, and have you opened the proper ports in the firewall?

On 07/01/15 05:32 PM, Cao, Vinh wrote:

Hi Digimer,

Yes, I just did. Looks like they are failing. I'm not sure why that is. Please see the attachment for all the servers' logs. By the way, I appreciate all the help I can get.

Vinh

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

Quorum is enabled by default. I need to see the entire logs from all five nodes, as I mentioned in the first email. Please disable cman from starting on boot, configure fencing properly and then reboot all nodes cleanly. Start 'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, start cman on all five nodes. When things settle down, copy/paste all the log output, please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:

[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Using quorum provider quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [CPG ] chosen downlist: sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Completed service synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Unloading all Corosync service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:

Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>; the problem is still there. One thing I don't understand is why the cluster is looking for quorum -- I didn't have any quorum disk set up in the cluster.conf file. Any help I can get is appreciated.

Vinh

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patch monthly, but I'm not sure why these servers are still at 6.4. Most of our systems are 6.6. Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via /etc/fstab.

[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
  <clusternodes>
    <clusternode name="ustlvcmsp1954" nodeid="1"/>
    <clusternode name="ustlvcmsp1955" nodeid="2"/>
    <clusternode name="ustlvcmsp1956" nodeid="3"/>
    <clusternode name="ustlvcmsp1957" nodeid="4"/>
    <clusternode name="ustlvcmsp1958" nodeid="5"/>
  </clusternodes>

You don't configure the fencing for the nodes...
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
Did you configure fencing properly?

On 07/01/15 05:32 PM, Cao, Vinh wrote:

Hi Digimer,

Yes, I just did. Looks like they are failing. I'm not sure why that is. Please see the attachment for all the servers' logs. By the way, I appreciate all the help I can get.

Vinh

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

Quorum is enabled by default. I need to see the entire logs from all five nodes, as I mentioned in the first email. Please disable cman from starting on boot, configure fencing properly and then reboot all nodes cleanly. Start 'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, start cman on all five nodes. When things settle down, copy/paste all the log output, please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:

[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Using quorum provider quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [CPG ] chosen downlist: sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Completed service synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Unloading all Corosync service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:

Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>; the problem is still there. One thing I don't understand is why the cluster is looking for quorum -- I didn't have any quorum disk set up in the cluster.conf file. Any help I can get is appreciated.

Vinh

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patch monthly, but I'm not sure why these servers are still at 6.4. Most of our systems are 6.6. Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via /etc/fstab.

[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
  <clusternodes>
    <clusternode name="ustlvcmsp1954" nodeid="1"/>
    <clusternode name="ustlvcmsp1955" nodeid="2"/>
    <clusternode name="ustlvcmsp1956" nodeid="3"/>
    <clusternode name="ustlvcmsp1957" nodeid="4"/>
    <clusternode name="ustlvcmsp1958" nodeid="5"/>
  </clusternodes>

You don't configure the fencing for the nodes... If anything causes a fence, the cluster will lock up (by design).
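The fix Digimer is pointing at is to give each <clusternode> a <fence> block tying it to one of the fence_vmware_soap devices defined in the <fencedevices> section. A minimal sketch for one node is below; the method name and the "port" value (the VM's name or UUID in vCenter) are placeholders that must match your own VMware inventory:

```xml
<clusternode name="ustlvcmsp1954" nodeid="1">
  <fence>
    <method name="vmware">
      <!-- "port" is the VM name/UUID in vCenter; this value is hypothetical -->
      <device name="p1954" port="ustlvcmsp1954" ssl="on"/>
    </method>
  </fence>
</clusternode>
```

With a block like this on every node, a failed node can actually be power-fenced instead of leaving the fence domain blocked.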
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
My first thought would be to set <fence_daemon post_join_delay="30"/> in cluster.conf. If that doesn't work, please share your configuration file. Then, with all nodes offline, open a terminal to each node and run 'tail -f -n 0 /var/log/messages'. With that running, start all the nodes and wait for things to settle down, then paste the five nodes' output as well.

Also, 6.4 is pretty old, why not upgrade to 6.6?

digimer

On 07/01/15 03:10 PM, Cao, Vinh wrote:

Hello Cluster guru,

I'm trying to set up a Red Hat 6.4 OS cluster with 5 nodes. With two nodes I don't have any issue, but with 5 nodes, when I ran clustat I got 3 nodes online and the other two offline. When I start one that is offline with 'service cman start', I get:

[root@ustlvcmspxxx ~]# service cman status
corosync is stopped
[root@ustlvcmsp1954 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Network Manager... [ OK ]
   Global setup... [ OK ]
   Loading kernel modules... [ OK ]
   Mounting configfs... [ OK ]
   Starting cman... [ OK ]
   Waiting for quorum... Timed-out waiting for cluster [FAILED]
Stopping cluster:
   Leaving fence domain... [ OK ]
   Stopping gfs_controld... [ OK ]
   Stopping dlm_controld... [ OK ]
   Stopping fenced... [ OK ]
   Stopping cman... [ OK ]
   Waiting for corosync to shutdown: [ OK ]
   Unloading kernel modules... [ OK ]
   Unmounting configfs... [ OK ]

Can you help?

Thank you,
Vinh

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
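The "Waiting for quorum... Timed-out" failure follows directly from the vote math. With cman's defaults (one vote per node, no quorum disk), a five-node cluster needs a strict majority of votes, so a node that starts alone (the log shows Members[1]: 1) can never become quorate on its own. A sketch of the arithmetic:

```shell
# cman default: each node contributes 1 vote; quorum is a strict majority.
nodes=5
quorum=$(( nodes / 2 + 1 ))
echo "expected_votes=$nodes quorum=$quorum"   # quorum=3
# A node booting alone has 1 vote, which is below 3, so 'service cman
# start' waits for more members to join and eventually times out.
```

This is why starting cman on all five nodes at roughly the same time (or within post_join_delay) matters far more than any single node's configuration.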
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patch monthly, but I'm not sure why these servers are still at 6.4. Most of our systems are 6.6. Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via /etc/fstab.

[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
  <clusternodes>
    <clusternode name="ustlvcmsp1954" nodeid="1"/>
    <clusternode name="ustlvcmsp1955" nodeid="2"/>
    <clusternode name="ustlvcmsp1956" nodeid="3"/>
    <clusternode name="ustlvcmsp1957" nodeid="4"/>
    <clusternode name="ustlvcmsp1958" nodeid="5"/>
  </clusternodes>

You don't configure the fencing for the nodes... If anything causes a fence, the cluster will lock up (by design).

  <fencedevices>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.108" login="rhfence" name="p1954" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.109" login="rhfence" name="p1955" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.110" login="rhfence" name="p1956" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.111" login="rhfence" name="p1957" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.112" login="rhfence" name="p1958" passwd=""/>
  </fencedevices>
</cluster>

clustat shows:

Cluster Status for p1954_to_p1958 @ Wed Jan 7 15:38:00 2015
Member Status: Quorate

 Member Name      ID   Status
 ------ ----      --   ------
 ustlvcmsp1954    1    Offline
 ustlvcmsp1955    2    Online, Local
 ustlvcmsp1956    3    Online
 ustlvcmsp1957    4    Offline
 ustlvcmsp1958    5    Online

I need to make them all online, so I can use fencing for mounting the shared disk.

Thanks,
Vinh

What about the log entries from the start-up? Did you try the post_join_delay config?
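As a sanity check on the clustat output above: three of five members online is exactly the majority threshold, which is why the cluster reports Quorate even with two nodes down. Counting it mechanically (the member list is copied verbatim from the post):

```shell
# Member list as shown by clustat above.
members='ustlvcmsp1954 1 Offline
ustlvcmsp1955 2 Online, Local
ustlvcmsp1956 3 Online
ustlvcmsp1957 4 Offline
ustlvcmsp1958 5 Online'
# "Offline" does not match the pattern "Online", so this counts only
# the genuinely online members.
online=$(printf '%s\n' "$members" | grep -c 'Online')
echo "online=$online quorum_needed=3"
```

Losing one more node would drop the cluster below quorum and freeze GFS2 activity cluster-wide.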
-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:16 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

My first thought would be to set <fence_daemon post_join_delay="30"/> in cluster.conf. If that doesn't work, please share your configuration file. Then, with all nodes offline, open a terminal to each node and run 'tail -f -n 0 /var/log/messages'. With that running, start all the nodes and wait for things to settle down, then paste the five nodes' output as well.

Also, 6.4 is pretty old, why not upgrade to 6.6?

digimer

On 07/01/15 03:10 PM, Cao, Vinh wrote:

Hello Cluster guru,

I'm trying to set up a Red Hat 6.4 OS cluster with 5 nodes. With two nodes I don't have any issue, but with 5 nodes, when I ran clustat I got 3 nodes online and the other two offline. When I start one that is offline with 'service cman start', I get:

[root@ustlvcmspxxx ~]# service cman status
corosync is stopped
[root@ustlvcmsp1954 ~]# service cman start
Starting cluster:
   Checking if cluster has been disabled at boot... [ OK ]
   Checking Network Manager... [ OK ]
   Global setup... [ OK ]
   Loading kernel modules... [ OK ]
   Mounting configfs... [ OK ]
   Starting cman... [ OK ]
   Waiting for quorum... Timed-out waiting for cluster [FAILED]
Stopping cluster:
   Leaving fence domain... [ OK ]
   Stopping gfs_controld... [ OK ]
   Stopping dlm_controld... [ OK ]
   Stopping fenced... [ OK ]
   Stopping cman... [ OK ]
   Waiting for corosync to shutdown: [ OK ]
   Unloading kernel modules... [ OK ]
   Unmounting configfs... [ OK ]

Can you help?

Thank you,
Vinh

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster
Quorum is enabled by default. I need to see the entire logs from all five nodes, as I mentioned in the first email. Please disable cman from starting on boot, configure fencing properly and then reboot all nodes cleanly. Start 'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, start cman on all five nodes. When things settle down, copy/paste all the log output, please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:

[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Using quorum provider quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [CPG ] chosen downlist: sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]: [MAIN ] Completed service synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Unloading all Corosync service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:

Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>; the problem is still there. One thing I don't understand is why the cluster is looking for quorum -- I didn't have any quorum disk set up in the cluster.conf file. Any help I can get is appreciated.

Vinh

-----Original Message-----
From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patch monthly, but I'm not sure why these servers are still at 6.4. Most of our systems are 6.6. Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via /etc/fstab.

[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
  <clusternodes>
    <clusternode name="ustlvcmsp1954" nodeid="1"/>
    <clusternode name="ustlvcmsp1955" nodeid="2"/>
    <clusternode name="ustlvcmsp1956" nodeid="3"/>
    <clusternode name="ustlvcmsp1957" nodeid="4"/>
    <clusternode name="ustlvcmsp1958" nodeid="5"/>
  </clusternodes>

You don't configure the fencing for the nodes... If anything causes a fence, the cluster will lock up (by design).

  <fencedevices>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.108" login="rhfence" name="p1954" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.109" login="rhfence" name="p1955" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.110" login="rhfence" name="p1956" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.111" login="rhfence" name="p1957" passwd=""/>
    <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.112" login="rhfence" name="p1958" passwd=""/>
  </fencedevices>
</cluster>
Re: [Linux-cluster] [ha-wg-technical] [Pacemaker] [RFC] Organizing HA Summit 2015
It will be very nice to see you again! Will Ikeda-san be there as well?

digimer

On 22/12/14 03:35 AM, Keisuke MORI wrote:

Hi all,

Really late response, but I will be joining the HA summit, with a few colleagues from NTT. See you guys in Brno.

Thanks,

2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com:

Hello,

it occurred to me that if you want to use the opportunity and double as a tourist while being in Brno, it's about the right time to consider reservations/ticket purchases this early. At least in some cases it is a must, e.g., Villa Tugendhat: http://rezervace.spilberk.cz/langchange.aspx?mrsname=languageId=2returnUrl=%2Flist

On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:

DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2-day dedicated HA summit the 4th and the 5th of February.

--
Jan

___
ha-wg-technical mailing list
ha-wg-techni...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] new cluster acting odd
On 01/12/14 11:34 AM, Dan Riley wrote:

Ha, I was unaware this was part of the folklore. We have a couple of 9-node clusters; it did take some tuning to get them stable, and we are thinking about splitting one of them. For our clusters, we found uniform configuration helped a lot, so a mix of physical hosts and VMs in the same (largish) cluster would make me a little nervous, don't know about anyone else's feelings.

Personally, I only build 2-node clusters. When I need more resource, I drop in another pair. This allows all my clusters, going back to 2008/9, to have nearly identical configurations. In HA, I would argue, nothing is more useful than consistency and simplicity. That said, I'd not fault anyone for going to 4 or 5 nodes. Beyond that, though, I would argue that the cluster should be broken up. In HA, uptime should always trump resource utilization efficiency. Mixing real and virtual machines strikes me as an avoidable complexity, too.

Something fence-related is not working. We used to see something like this, and it usually traced back to "shouldn't be possible" inconsistencies in the fence group membership. Once the fence group gets blocked by a mismatched membership list, everything above it breaks. Sometimes a 'fence_tool ls -n' on all the cluster members will reveal an inconsistency ('fence_tool dump' also, but that's hard to interpret without digging into the group membership protocols). If you can find an inconsistency, manually fencing the nodes in the minority might repair it.

In all my years, I've never seen this happen in production. If you can create a reproducer, I would *love* to see/examine it!

At the time, I did quite a lot of staring at fence_tool dumps, but never figured out how the fence group was getting into "shouldn't be possible" inconsistencies. This was also all late RHEL 5 and early RHEL 6, so it may not be applicable anymore. HA in 6.2+ seems to be quite a bit more stable (I think for more reasons than just the HA stuff).

For this reason, I am staying on RHEL 6 until at least 7.2+ is out. :)

My recommendation would be to schedule a maintenance window and then stop everything except cman (no rgmanager, no gfs2, etc). Then methodically test crashing all nodes (I like 'echo c > /proc/sysrq-trigger') and verify they are fenced and then recover properly. It's worth disabling cman and rgmanager from starting at boot (period, but particularly for this test). If you can reliably (and repeatedly) crash - fence - rejoin, then I'd start loading back services and re-trying. If the problem reappears only under load, then that's an indication of the problem, too.

I'd agree--start at the bottom of the stack and work your way up.

-dan

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
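The crash-fence-recover test described above can be scripted as a checklist. The sketch below is a dry run by default and only executes when CONFIRM=yes, because the sysrq write crashes the kernel instantly; the service names assume a stock RHEL 6 cman/rgmanager stack:

```shell
# Dry-run by default; set CONFIRM=yes on a disposable test node to execute.
run() {
  if [ "${CONFIRM:-no}" = yes ]; then "$@"; else echo "would run: $*"; fi
}
run service rgmanager stop                 # strip the stack down to cman first
run umount -a -t gfs2                      # no GFS2 mounts during the test
run sh -c 'echo c > /proc/sysrq-trigger'   # crash this node; peers should fence it
```

After the node is fenced and rebooted, confirm it rejoins cleanly (e.g. with 'cman_tool status') before layering rgmanager and GFS2 back on and repeating under load.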
Re: [Linux-cluster] new cluster acting odd
On 01/12/14 01:03 PM, Megan . wrote:

We have 11 10-20TB GFS2 mounts that I need to share across all nodes. It's the only reason we went with the cluster solution. I don't know how we could split it up into different smaller clusters.

I would do this, personally: a 2-node cluster; DRBD (on top of local disks or a pair of SANs, one per node), exported over NFS and configured in a simple single-primary (master/slave) configuration with a floating IP.

GFS2, like any clustered filesystem, requires cluster locking. This locking comes with a non-trivial overhead. Exporting over NFS allows you to avoid this bottleneck, and with a simple 2-node cluster behind the scenes, you maintain full HA. In HA, nothing is more important than simplicity. Said another way: a cluster isn't beautiful when there is nothing left to add; it is beautiful when there is nothing left to take away.

On Mon, Dec 1, 2014 at 12:14 PM, Digimer li...@alteeve.ca wrote:

On 01/12/14 11:56 AM, Megan . wrote:

Thank you for your replies. The cluster is intended to be 9 nodes, but I haven't finished building the remaining 2. Our production cluster is expected to be similar in size. What tuning should I be looking at? Here is a link to our config: http://pastebin.com/LUHM8GQR (I had to remove IP addresses.)

Can you simplify those fencedevice definitions? I would wonder if the set timeouts could be part of the problem. Always start with the simplest possible configurations and only add options in response to actual issues discovered in testing.

I can try to simplify. I had the longer timeouts because of what I saw happening on the physical boxes: the box would be on its way down/up and the fence command would fail, but the box actually did come back online. The physicals take 10-15 minutes to reboot and I wasn't sure how to handle timeout issues, so I made the timeouts a bit extreme for testing. I'll try to make the config more vanilla for troubleshooting.

I'm not really sure why the state of the node should impact the fence action in any way. Fencing is supposed to work regardless of the state of the target. Fencing works like this (with a default config, on most fence agents):

1. Force off
2. Verify off
3. Try to boot, don't care if it succeeds.

So once the node is confirmed off by the agent, the fence is considered a success. How long (if at all) it takes for the node to reboot does not factor in.

I tried the method of 'echo c > /proc/sysrq-trigger' to crash a node; the cluster kept seeing it as online and never fenced it, yet I could no longer ssh to the node. I did this on a physical and a VM box with the same result. I had to 'fence_node <node>' to get it to reboot, but it came up split-brained (thinking it was the only one online). Now that node has cman down and the rest of the cluster sees it as still online.

Then corosync failed to detect the fault. That is a sign, to me, of a fundamental network or configuration issue. Corosync should have shown messages about a node being lost and reconfiguring. If that didn't happen, then you're not even up to the point where fencing factors in. Did you configure corosync.conf? When it came up, did it think it was quorate or inquorate?

corosync.conf didn't work, since it seems the Red Hat HA cluster doesn't use that file: http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf. I tried it since we wanted to put the multicast traffic on a different bond/VLAN, but we figured out the file isn't used.

Right, I wanted to make sure that, if you had tried, you've since removed the corosync.conf entirely. Corosync is fully controlled by the cman cluster.conf file.

I thought fencing was working because I'm able to do 'fence_node <node>' and see the box reboot and come back online. I did have to get the FC version of the fence agents because of an issue with the idrac agent not working properly.

We are running fence-agents-3.1.6-1.fc14.x86_64.

That tells you that the configuration of the fence agents is working, but it doesn't test failure detection. You can use the 'fence_check' tool to see if the cluster can talk to everything, but in the end, the only useful test is to simulate an actual crash.

Wait; 'fc14'?! What OS are you using?

We are CentOS 6.6. I went with the Fedora agents because of this exact issue: http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7. I read that it was fixed in the next version, which I could only find for Fedora.

It would be *much* better to file a bug report (https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%206) - Version: 6.6 - Component: fence-agents. Mixing RPMs from other OSes is not a good idea at all.

fence_tool dump worked on one of my nodes, but it is just hanging on the rest.

[root@map1-uat ~]# fence_tool dump
1417448610 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
1417448610 fenced 3.0.12.1 started
1417448610 connected to dbus
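For the 2-node DRBD+NFS design Digimer recommends earlier in this thread, the rgmanager side reduces to a failover domain, the filesystem, the export, and a floating IP grouped into one service. The sketch below is illustrative only: the node names, device, mountpoint, and addresses are placeholders, and it assumes DRBD itself is brought up outside rgmanager (e.g. via its init script):

```xml
<rm>
  <failoverdomains>
    <failoverdomain name="nfs_domain" ordered="0" restricted="1">
      <failoverdomainnode name="node1"/>
      <failoverdomainnode name="node2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="nfs_svc" domain="nfs_domain" autostart="1" recovery="relocate">
    <fs name="data" device="/dev/drbd0" mountpoint="/export" fstype="ext4" force_unmount="1">
      <nfsexport name="exports">
        <nfsclient name="lan" target="10.0.0.0/24" options="rw,sync"/>
      </nfsexport>
    </fs>
    <ip address="10.0.0.50" monitor_link="1"/>
  </service>
</rm>
```

Because only one node is primary at a time, the filesystem can be plain ext4 and no DLM or GFS2 locking overhead is involved; clients follow the floating IP on failover.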
Re: [Linux-cluster] [ha-wg-technical] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015
On 27/11/14 11:52 AM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project, so it should be a neutral enough platform. Also, it's not owned by $megacorp (I wish!), so spying/privacy shouldn't be an issue, I hope. If there is concern, I can set up https.

If no one else gets to it before me, I'll start collating the data from the mailing list onto that wiki tomorrow (maaaybe today, depends).

The wiki requires registration, but that's it. I'm not bothering with captchas because, in my experience, spammers walk right through them anyway. I do have edits email me, so I can catch and roll back any spam quickly.

Ok, I was getting 3~5 spam accounts created per day. To deal with this, I set up the 'questy' captcha program with five (random) questions that should be easy to answer, even for non-English speakers. Just the same, if anyone has any trouble registering, please feel free to email me directly and I will be happy to help.

Madi

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Cluster-devel] [Pacemaker] Wiki for planning created - Re: [RFC] Organizing HA Summit 2015
On 29/11/14 12:45 AM, Fabio M. Di Nitto wrote:

On 11/28/2014 8:10 PM, Jan Pokorný wrote:

On 28/11/14 00:37 -0500, Digimer wrote:

On 28/11/14 12:33 AM, Fabio M. Di Nitto wrote:

On 11/27/2014 5:52 PM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page [...]

Awesome! Thanks for taking care of it. Do you have a chance to add also an instance of etherpad to the site? Mostly to do collaborative editing while we sit all around the same table. Otherwise we can use a public instance and copy/paste the info into the wiki after that.

Never tried setting up etherpad before, but if it runs on RHEL 6, I should have no problem setting it up.

Provided no conspiracy to be started, there are a bunch of popular instances, e.g. http://piratepad.net/

Right, some of them only store etherpads for 30 days. Just be careful which one we choose, or we make our own.

Fabio

I'll set one up, but I'll need a few days; I'm out of the country at the moment. It's not needed until the conference, is it? Or will you want to have it before then?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015
I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project, so it should be a neutral enough platform. Also, it's not owned by $megacorp (I wish!), so spying/privacy shouldn't be an issue, I hope. If there is concern, I can set up https.

If no one else gets to it before me, I'll start collating the data from the mailing list onto that wiki tomorrow (maaaybe today, depends).

The wiki requires registration, but that's it. I'm not bothering with captchas because, in my experience, spammers walk right through them anyway. I do have edits email me, so I can catch and roll back any spam quickly.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Cluster-devel] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015
On 28/11/14 12:33 AM, Fabio M. Di Nitto wrote:

On 11/27/2014 5:52 PM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project, so it should be a neutral enough platform. Also, it's not owned by $megacorp (I wish!), so spying/privacy shouldn't be an issue, I hope. If there is concern, I can set up https.

If no one else gets to it before me, I'll start collating the data from the mailing list onto that wiki tomorrow (maaaybe today, depends).

The wiki requires registration, but that's it. I'm not bothering with captchas because, in my experience, spammers walk right through them anyway. I do have edits email me, so I can catch and roll back any spam quickly.

Awesome! Thanks for taking care of it. Do you have a chance to add also an instance of etherpad to the site? Mostly to do collaborative editing while we sit all around the same table. Otherwise we can use a public instance and copy/paste the info into the wiki after that.

Fabio

Never tried setting up etherpad before, but if it runs on RHEL 6, I should have no problem setting it up.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Pacemaker] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015
On 26/11/14 05:28 PM, Andrew Beekhof wrote:

On 27 Nov 2014, at 8:18 am, Marek "marx" Grac mg...@redhat.com wrote:

On 11/26/2014 08:00 PM, Michael Schwartzkopff wrote:

On Thursday, 27 November 2014, 00:13:11, Rajagopal Swaminathan wrote:

Greetings,

Guys, I am a poor Indian whom the US of A abhors, and I have successfully deployed over 5 CentOS/RHEL clusters varying from 4-6 nodes. May I know where this event is held?

Brno, Slovakia. Next international airport: Vienna.

Brno is quite close to Slovakia, but it is in the Czech Republic. International airports around are Vienna, Prague, and mostly low-cost ones in Brno and Bratislava.

Anyone want to meet in Munich and share a car? :-)

I might be up for that. I've not looked into flights yet, though I do have a standing invitation for beer in Vienna, so I'm sort of planning to fly through there. Apparently there is a very convenient bus from Vienna to Brno. Why Munich? (Don't get me wrong, I loved it there last year!)

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [ha-wg-technical] [Cluster-devel] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015
On 25/11/14 04:31 PM, Andrew Beekhof wrote: Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? Personally I'm interested in talking about scaling - with pacemaker-remoted and/or a new messaging/membership layer. Other design-y topics: - SBD - degraded mode - improved notifications This may be something my company can bring to the table. We just hired a dev whose principal goal is to develop an alert system for HA. We're modelling it heavily on the fence/resource agent model, with a scan core and scan agents. It's sort of like existing tools, but designed specifically for HA clusters and heavily focused on not interfering with the host any more than absolutely necessary. By Feb., it should be mostly done. We're doing this for our own needs, but it might be a framework worth talking about, if nothing else to see if others consider it a fit. Of course, it will be entirely open source. *If* there is interest, I could put together a(n informal) talk on it with a demo. - containerisation of services (cgroups, docker, virt) - resource-agents (upstream releases, handling of pull requests, testing) User-facing topics could include recent features (ie. pacemaker-remoted, crm_resource --restart) and common deployment scenarios (eg. NFS) that people get wrong. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Cluster-devel] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015
On 26/11/14 12:51 AM, Fabio M. Di Nitto wrote: On 11/25/2014 10:54 AM, Lars Marowsky-Bree wrote: On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote: Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I'd prefer, at least for this round, to keep dates/location and explore the option to allow people to join remotely. After all, there are tons of tools between google hangouts and others that would allow that. That is, in my experience, the absolute worst. It creates second-class participants and is a PITA for everyone. I agree, it is still a way for people to join in tho. I personally disagree. In my experience, one either does a face-to-face meeting, or a virtual one that puts everyone on the same footing. Mixing both works really badly unless the team already knows each other. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes same here. No difference.. we have one crazy guy in Australia.. Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? I'd say either a google doc or any random etherpad/wiki instance will do just fine. As for the topics: - corosync qdevice and plugins (network, disk, integration with sdb?, others?) - corosync RRP / libknet integration/replacement - fence autodetection/autoconfiguration For the user-facing topics (that is, if there are enough participants, and I only got 1 user confirmation so far): - demos, cluster 101, tutorials - get feedback - get feedback - get more feedback Fabio I'd be happy to do a cluster 101 or similar, if there is interest. 
Not sure if that would be particularly appealing to anyone coming to our meeting, as I think anyone interested is probably well past 101. :) Anyway, you guys know my background, let me know if there is a topic you'd like me to cover for the user side of things. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [ha-wg-technical] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015
On 26/11/14 12:51 AM, Fabio M. Di Nitto wrote: On 11/25/2014 10:54 AM, Lars Marowsky-Bree wrote: On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote: Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I'd prefer, at least for this round, to keep dates/location and explore the option to allow people to join remotely. After all, there are tons of tools between google hangouts and others that would allow that. That is, in my experience, the absolute worst. It creates second-class participants and is a PITA for everyone. I agree, it is still a way for people to join in tho. I personally disagree. In my experience, one either does a face-to-face meeting, or a virtual one that puts everyone on the same footing. Mixing both works really badly unless the team already knows each other. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes same here. No difference.. we have one crazy guy in Australia.. Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? I'd say either a google doc or any random etherpad/wiki instance will do just fine. As for the topics: - corosync qdevice and plugins (network, disk, integration with sdb?, others?) - corosync RRP / libknet integration/replacement - fence autodetection/autoconfiguration For the user-facing topics (that is, if there are enough participants, and I only got 1 user confirmation so far): - demos, cluster 101, tutorials - get feedback - get feedback - get more feedback Fabio Ok, I do have a topic I want to add; Merging the dozen different mailing lists, IRC channels and other support forums. 
This thread is a good example of how thinly the community is spread. A 'dev', 'user' and 'announce' list should be enough for all of HA. Likewise, one IRC channel should be enough, too. The trick will be discussing this without bikeshedding. :) digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015
On 24/11/14 09:54 AM, Fabio M. Di Nitto wrote: On 11/24/2014 3:39 PM, Lars Marowsky-Bree wrote: On 2014-09-08T12:30:23, Fabio M. Di Nitto fdini...@redhat.com wrote: Folks, Fabio, thanks for organizing this and getting the ball rolling. And again sorry for being late to said game; I was busy elsewhere. However, it seems that the idea for such a HA Summit in Brno/Feb 2015 hasn't exactly fallen on fertile ground, even with the suggested user/client day. (Or if there was a lot of feedback, it wasn't public.) I wonder why that is, and if/how we can make this more attractive? I suspect a lot of it is that, given people's busy schedules, February seems far away. Also, I wonder how much discussion has happened outside of these lists. Is it really that there hasn't been much feedback? Fabio started this ball rolling, so I would be interested to hear what he's heard. Frankly, as might have been obvious ;-), for me the venue is an issue. It's not easy to reach, and I'm theoretically fairly close in Germany already. I wonder if we could increase participation with a virtual meeting (on either those dates or another), similar to what the Ceph Developer Summit does? Requested feedback given; Virtual meetings are never as good, and I really don't like this idea. In my experience, just as much productive decision making happens in the unofficial after-hours activities as during formal(ish) meetings/presentations. I think it is very important that the meeting remain in-person if at all possible. Those appear really productive and make it possible for a wide range of interested parties from all over the world to attend, regardless of travel times, or even just attend select sessions (that would otherwise make it hard to justify travel expenses and time off). Alternatively, would a relocation to a more connected venue help, such as Vienna xor Prague? Personally, I don't care where we meet, but I do believe Fabio already ruled out a relocation. 
I'd love to get some more feedback from the community. I agree. some feedback would be useful. digimer puts on her flame-retardant pantaloons and waits for the worst... -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015
On 24/11/14 10:12 AM, Lars Marowsky-Bree wrote: Beijing, the US, Tasmania (OK, one crazy guy), various countries in Oh, bring him! crazy++ -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015
All the cool kids will be there. You want to be a cool kid, right? :p On 01/11/14 01:06 AM, Fabio M. Di Nitto wrote: just a kind reminder. On 9/8/2014 12:30 PM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face to face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back with DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February. The goal for this meeting is to, beside to get to know each other and all social aspect of those events, tune the directions of the various HA projects and explore common areas of improvements. I am also very open to the idea of extending to 3 days, 1 one dedicated to customers/users and 2 dedicated to developers, by starting the 3rd. Thoughts? Fabio PS Please hit reply all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ Could you please let me know by end of Nov if you are interested or not? I have heard only from few people so far. Cheers Fabio ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] daemon cpg_join error retrying
On 29/10/14 06:16 PM, Andrew Beekhof wrote: On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) lk...@cisco.com wrote: I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. How to check the cluster name of a GFS file system? I had a similar configuration running fine in multiple other setups with no such issue. I don't really recall. Hopefully someone more familiar with GFS2 can chime in.

# gfs2_tool sb /dev/c01n01_vg0/shared table
current lock table name = "an-cluster-01:shared"

Replace with your device, of course. :)

Also, one more issue I am seeing in one other setup: a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages, every 4 secs. I am running with default TOTEM settings, with the token timeout set to 10 secs. Even after I increase the token and consensus values, it goes on flooding the same message at the newly defined consensus interval (eg: if I increase it to 10 secs, then I see new membership formed messages every 10 secs).

Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.

It does not sound like your network is particularly healthy. Are you using multicast or udpu? If multicast, it might be worth trying udpu. Agreed. Persistent multicast required? 
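As a quick illustration of the comparison being discussed: the part before the ':' in the GFS2 lock table is the cluster name the filesystem was created for, and it must match the cluster name in cluster.conf. This is just a sketch using the sample value from the output above; on a live node you would pull the real values from gfs2_tool and cman_tool.

```shell
# Lock table as reported by: gfs2_tool sb <device> table
locktable="an-cluster-01:shared"

# The text before ':' is the cluster name the filesystem expects
fs_cluster="${locktable%%:*}"
echo "$fs_cluster"

# On a live node, compare against the running cluster name (not run here):
#   cman_tool status | awk -F': ' '/Cluster Name/ {print $2}'
```

If the two names differ, gfs2 mounts and cpg joins will misbehave, which matches the symptoms in this thread.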
Thanks Lax -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew Beekhof Sent: Wednesday, October 29, 2014 2:42 PM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) lk...@cisco.com wrote: Hi All, In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. Even after I force kill the pacemaker processes and reboot the server and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue? Thanks Lax -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
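For reference, switching a cman-based cluster from multicast to udpu, as suggested in this thread, is a single attribute in cluster.conf. A minimal sketch (the cluster name here is hypothetical; bump config_version and restart the cluster stack after the change):

```xml
<cluster name="example" config_version="2">
  <!-- udpu = UDP unicast: sidesteps switch multicast/IGMP-snooping problems -->
  <cman transport="udpu"/>
</cluster>
```

udpu trades a little efficiency for not depending on the switch handling multicast correctly, which is often the culprit behind "new membership was formed" floods.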
Re: [Linux-cluster] Rhel BootLoader, Single-user mode password Interactive Boot in a Cloud environment
On 22/10/14 04:44 AM, Sunhux G wrote: We run a cloud service; our vCenter is not accessible to our tenants and their IT support, so I would say console access is not feasible unless the tenant/customer IT come to our DC. If the following 3 hardenings are done on our tenant/customer RHEL Linux VMs, what's the impact on the tenants' sysadmin IT operation? a) CIS 1.5.3 Set Boot Loader Password *:* if this password is set, when a tenant reboots (shutdown -r) their VM, will it prompt for the bootloader password at the console each time? If so, is there any way the tenant could still get their VM booted up if they have no access to vCenter's console? b) CIS 1.5.4 Require Authentication for Single-User Mode *:* Does Linux allow ssh access while in single-user mode? Can this 'single-user mode password' be entered via an ssh session (without access to the console), assuming a certain 'terminal' service is started up / running while in single-user mode? c) CIS 1.5.5 Disable Interactive Boot *:* what's the general consensus on this? Disable or enable? Our corporate hardening guide does not mention this item. So if the tenant wishes to boot up step by step (ie pausing at each startup script), they can't do it? Feel free to add any other impacts that anyone can think of. Lastly, how do people out there grant console access to their tenants in a Cloud environment without security compromise (I mean without granting vCenter access)? I heard that we can customize vCenter to grant limited access to tenants; is this so? Sun Hi Sun, Did you mean to post this to the vmware mailing list? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
On 15/10/14 10:15 AM, Marek marx Grac wrote: On 10/14/2014 01:01 PM, Digimer wrote: Hi Marek et. al., This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer version of python is a bit of a concern. Has this been tested on RHEL 6 with an APC with the 6.x firmware? Current release do not contain required patch, it will be in next one (or z-stream if someone request it). The upstream release work as expected (retested today) on Fedora20/RHEL7. Fact that upstream release can not be run on RHEL6 is new issue for me but we did not try that before. m, Consider it officially requested. We use APC switched PDUs as backup fence devices extensively, so this would pretty heavily hurt us if we started getting v6 firmware. Should I open a RHBZ? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
On 13/10/14 03:10 PM, Thomas Meier wrote: Hi When configuring PDU fencing in my 2-node-cluster I ran into some problems with the fence_apc_snmp agent. Turning a node off works fine, but fence_apc_snmp then exits with an error. When I do this manually (from node2):

fence_apc_snmp -a node1 -n 1 -o off

the output of the command is not the expected:

Success: Powered OFF

but in my case:

Returned 2: Error in packet.
Reason: (genError) A general failure occured
Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21

When I check the PDU, the port is without power, so this part works. But it seems that the fence agent can't read the status of the PDU and then exits with an error. The same seems to happen when fenced calls the agent. The agent also exits with an error, fencing can't succeed, and the cluster hangs. From the logfile:

fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent

My Setup: - CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed. - APC AP8953 PDU with firmware 6.1 - 2-node-cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2 - fencing agents in use: fence_ipmilan (working) and fence_apc_snmp I did some research, and it looks like my fence-agents package is too old for my APC firmware. I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/ Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875 it says: fence_apc_snmp: Add support for firmware 6.x I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build of fence_apc_snmp doesn't work. 
It gives:

[root@box1]# fence_apc_snmp -v -a node1 -n 1 -o status
Traceback (most recent call last):
  File "/usr/sbin/fence_apc_snmp", line 223, in <module>
    main()
  File "/usr/sbin/fence_apc_snmp", line 197, in main
    options = check_input(device_opt, process_input(device_opt))
  File "/usr/share/fence/fencing.py", line 705, in check_input
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
TypeError: __init__() got an unexpected keyword argument 'stream'

I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and if so, install the right version of fence_apc_snmp on the cluster without breaking things, but I'm a bit clueless how to build myself a working version. Maybe you have some tips? Thanks in advance Thomas Hi Marek et. al., This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer version of python is a bit of a concern. Has this been tested on RHEL 6 with an APC with the 6.x firmware? cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
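An aside on the TypeError above: RHEL 6 ships Python 2.6, where (to my understanding) logging.StreamHandler's parameter was still named strm, so the stream= keyword used by newer fence-agents raises TypeError there. Passing the stream positionally is portable across old and new Pythons; a small sketch:

```python
import logging
import sys

# StreamHandler(stream=sys.stderr) fails on Python 2.6, where the parameter
# was named 'strm'. Passing the stream positionally works on both old and
# new interpreters, which is what a backported patch would do.
handler = logging.StreamHandler(sys.stderr)
print(handler.stream is sys.stderr)  # -> True
```

So a one-line change in /usr/share/fence/fencing.py (dropping the keyword) would likely make the newer agent run on the older interpreter, though that is a local workaround, not an official fix.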
Re: [Linux-cluster] cLVM unusable on quorated cluster
On 03/10/14 10:35 AM, Daniel Dehennin wrote: Hello, I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN for an OpenNebula cluster. As I'm new to the cluster world, I have a hard time figuring out why things sometimes go really wrong and where I must look to find answers. My OpenNebula frontend, running in a VM, does not manage to run the resources and my syslog has a lot of:

#+begin_src
ocfs2_controld: Unable to open checkpoint ocfs2:controld: Object does not exist
#+end_src

When this happens, other nodes have problems:

#+begin_src
root@nebula3:~# LANG=C vgscan
cluster request failed: Host is down
Unable to obtain global lock.
#+end_src

But things look fine in “crm_mon”:

#+begin_src
root@nebula3:~# crm_mon -1
Last updated: Fri Oct 3 16:25:43 2014
Last change: Fri Oct 3 14:51:59 2014 via cibadmin on nebula1
Stack: openais
Current DC: nebula3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
32 Resources configured.

Node quorum: standby
Online: [ nebula3 nebula2 nebula1 ]
OFFLINE: [ one ]

Stonith-nebula3-IPMILAN (stonith:external/ipmi): Started nebula2
Stonith-nebula2-IPMILAN (stonith:external/ipmi): Started nebula3
Stonith-nebula1-IPMILAN (stonith:external/ipmi): Started nebula2
Clone Set: ONE-Storage-Clone [ONE-Storage]
    Started: [ nebula1 nebula3 nebula2 ]
    Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
Quorum-Node (ocf::heartbeat:VirtualDomain): Started nebula3
Stonith-Quorum-Node (stonith:external/libvirt): Started nebula3
#+end_src

I don't know how to interpret dlm_tool information:

#+begin_src
root@nebula3:~# dlm_tool ls -n
dlm lockspaces
name          CCB10CE8D4FF489B9A2ECB288DACF2D7
id            0x09250e49
flags         0x0008 fs_reg
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       1189587136 1206364352 1223141568
all nodes
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none

name          clvmd
id            0x4104eefa
flags         0x
change        member 3 joined 0 remove 1 failed 0 seq 4,4
members       1189587136 1206364352 1223141568
all nodes
nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
#+end_src

Is there any documentation on troubleshooting DLM/cLVM? Regards. Can you paste your full pacemaker config and the logs from the other nodes, starting just before the lost node went away? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] clvmd issues
On 03/10/14 12:57 PM, manish vaidya wrote: First, I apologise for the late reply; the delay was because I couldn't believe I'd get any response from the site. I am a newcomer, and I had already posted this problem on many online forums, but they didn't give any response. Thanks to all for taking my problem seriously. ** response from you: are you using clvmd? if your answer is yes, you need to be sure your pv is visible to your cluster nodes *** I am using clvmd. When I use the pvscan command, the cluster hangs. I want to reproduce this situation again for perfection; for example, when I try to run the pvcreate command in the cluster, a message should come: lock from node2, node3. I have created a new cluster, and this new cluster is working fine. How to do this? Any setting in lvm.conf? Can you share your setup please? What kind of cluster? What version? What is the configuration file? Was there anything interesting in the system logs? etc. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015
On 08/09/14 06:30 AM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face to face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back with DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February. The goal for this meeting is to, beside to get to know each other and all social aspect of those events, tune the directions of the various HA projects and explore common areas of improvements. I am also very open to the idea of extending to 3 days, 1 one dedicated to customers/users and 2 dedicated to developers, by starting the 3rd. Thoughts? Fabio PS Please hit reply all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ How is this looking? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster
On 09/09/14 03:14 AM, Amjad Syed wrote:

<device lanplus="" name="inspuripmi" action="reboot"/>

Something is breaking the network during the shutdown, a fence is being called, and both nodes are killing the other, causing a dual fence. So you have a set of problems, I think. First, disable acpid on both nodes. Second, change the quoted line (only) to:

<device lanplus="" name="inspuripmi" delay="15" action="reboot"/>

If I am right, this will mean that 192.168.10.10 will stay up and fence .11. Third, what bonding mode are you using? I would only use mode=1. Fourth, please set the node names to match 'uname -n' on both nodes. Be sure the names translate to the IPs you want (via /etc/hosts, ideally). Fifth, as Sivaji suggested, please put switch(es) between the nodes. If it still tries to fence when a node shuts down (watch /var/log/messages and look for 'fencing node ...'), please paste your logs from both nodes. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
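For context, a complete fence stanza with the suggested delay might look like the sketch below. The node and device names, IP, and credentials are hypothetical; the attribute names are those used by RHEL 6 cman with fence_ipmilan. The idea is that delay gives the favoured node a head start, so in a dual-fence race exactly one node survives:

```xml
<clusternode name="node1.example.com" nodeid="1">
  <fence>
    <method name="ipmi">
      <!-- delay="15" makes the other node wait before fencing node1,
           so node1 wins if both try to fence at once -->
      <device name="ipmi_node1" action="reboot" delay="15"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice name="ipmi_node1" agent="fence_ipmilan"
               ipaddr="192.168.10.10" login="admin" passwd="secret" lanplus=""/>
</fencedevices>
```

Only one node in a two-node cluster should get the delay; if both have it, the race is back.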
Re: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015
On 08/09/14 06:30 AM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face to face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back with DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February. The goal for this meeting is to, beside to get to know each other and all social aspect of those events, tune the directions of the various HA projects and explore common areas of improvements. I am also very open to the idea of extending to 3 days, 1 one dedicated to customers/users and 2 dedicated to developers, by starting the 3rd. Thoughts? Fabio PS Please hit reply all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ I think this is a good idea. 3 days may be a good idea, as well. I think I would be more useful trying to bring the user's perspective more so than a developer's. So on that, I would like to propose a discussion on merging some of the disparate lists, channels, sites, etc. to help simplify life for new users looking for help from or to wanting to join the HA community. I also understand that Fabio will buy the first round of drinks. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Please help me on cluster error
Can you share your cluster information please? This could be a network problem, as the messages below happen when the network between the nodes isn't fast enough or has too much latency, so cluster traffic is considered lost and re-requested. If you don't have fencing working properly, and if a network issue caused a node to be declared lost, clustered LVM (and anything else using cluster locking) will fail (by design). If you share your configuration and more of your logs, it will help us understand what is happening. Please also tell us what version of the cluster software you're using. digimer On 30/08/14 10:12 AM, manish vaidya wrote: I created a four-node cluster in a kvm environment, but I faced an error when creating a new pv, such as pvcreate /dev/sdb1; I got an error: lock from node2, lock from node3. Also strange cluster logs:

Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e 5f
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f 60
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63 64
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69 6a
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84 85
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a 9b

Please help me on this issue. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] corosync ring failure
Any logs in the switch? Is the multicast group being deleted/recreated?

On 23/07/14 11:53 AM, C. Handel wrote:

Hi, I run a cluster with two corosync rings. One of the rings is marked faulty every forty seconds, only to recover a second later. The other ring is stable. I have no idea how to debug this.

We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12 and corosync 1.4.1. The cluster consists of three machines. Ring 1 runs on 10-gigabit interfaces, ring 0 on 1-gigabit interfaces. Neither ring leaves its respective switch. Corosync communication is udpu, rrp_mode is passive.

cluster.conf:

<cluster config_version="30" name="aslfile">
  <cman transport="udpu"/>
  <fence_daemon post_join_delay="120" post_fail_delay="30"/>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk" action="off"/>
  </fencedevices>
  <quorumd cman_label="qdisk" device="/dev/mapper/mpath-091quorump1" min_score="1" votes="2"/>
  <clusternodes>
    <clusternode name="asl430m90" nodeid="430">
      <altname name="asl430"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="asl430m90"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="asl431m90" nodeid="431">
      <altname name="asl431"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="asl431m90"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="asl432m90" nodeid="432">
      <altname name="asl432"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="asl432m90"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>

syslog:

Jul 23 17:48:34 asl431 corosync[3254]: [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:48:35 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:14 asl431 corosync[3254]: [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:49:15 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]: [TOTEM ] Automatically recovered ring 1

Greetings Christoph

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
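When an RRP ring flaps like this, a useful first diagnostic step (a suggestion, not from the original thread) is to watch the ring state on each node while the fault/recovery cycle happens. Corosync 1.x ships corosync-cfgtool for this:

```sh
# Print the status of each configured ring on the local node;
# run on every node while the fault is occurring.
corosync-cfgtool -s

# With cman on top of corosync, overall cluster state is also visible here:
cman_tool status
```

Comparing which node(s) see ring 1 faulty, and correlating the timestamps with switch logs, helps narrow the fault to a host, a NIC, or the switch itself.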
Re: [Linux-cluster] Error in Cluster.conf
On 24/06/14 08:55 AM, Jan Pokorný wrote:

On 24/06/14 13:56 +0200, Fabio M. Di Nitto wrote:

On 6/24/2014 12:32 PM, Amjad Syed wrote:
Hello, I am getting the following error when I run ccs_config_validate:

Relax-NG validity error : Extra element clusternodes in interleave

You defined clusternodes twice.

That, plus there are more issues discoverable by the more powerful validator jing (packaged in Fedora and RHEL 7, for instance, admittedly not for RHEL 6/EPEL):

$ jing cluster.rng cluster.conf
cluster.conf:13:47: error: element fencedvice not allowed anywhere; expected the element end-tag or element fencedevice
cluster.conf:15:23: error: element clusternodes not allowed here; expected the element end-tag or element clvmd, dlm, fence_daemon, fence_xvmd, gfs_controld, group, logging, quorumd, rm, totem or uidgid
cluster.conf:26:76: error: IDREF fence_node2 without matching ID
cluster.conf:19:77: error: IDREF fence_node1 without matching ID

So it also spotted:
- a typo: fencedvice instead of fencedevice
- broken referential integrity; the name attribute of a device tag is prescribed to match the name of a defined fencedevice

Hope this helps.

-- Jan

Also, without fence methods defined for the nodes, rgmanager will block the first time there is an issue.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
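A minimal, hypothetical cluster.conf fragment illustrating the referential-integrity rule jing enforces — the name in each device tag must match the name of a defined fencedevice (node names, the IP, and credentials below are placeholders):

```xml
<cluster config_version="2" name="mycluster">
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="ipmi">
          <!-- "fence_node1" must match a fencedevice name below,
               or the validator reports: IDREF fence_node1 without matching ID -->
          <device name="fence_node1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- note correct spelling: fencedevice, not fencedvice -->
    <fencedevice name="fence_node1" agent="fence_ipmilan"
                 ipaddr="10.0.0.1" login="admin" passwd="secret"/>
  </fencedevices>
</cluster>
```

Running jing (or ccs_config_validate) again after each fix confirms the schema and the ID/IDREF pairs line up.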
Re: [Linux-cluster] Online change of fence device options - possible?
On 23/06/14 02:16 PM, Digimer wrote:

On 23/06/14 02:09 PM, Vasil Valchev wrote:

Hello, I have a RHEL 6.5 cluster using rgmanager. The fence devices are fence_ipmilan, fencing through HP iLO 4. The issue is that the fence devices weren't configured entirely correctly: recently, after a node failure, the fence agent was returning failures (even though it was fencing the node successfully), which apparently can be avoided by adding the power_wait option to the fence device configuration. My question is: after changing the fence device (I think directly through the .conf will be fine?), incrementing the config version, and syncing the .conf through the cluster software, is something else necessary to apply the change (e.g. a cman reload)? Will the new fence option be used the next time a fencing action is performed? And lastly, can all of this be performed while the cluster and services are operational, or do they have to be stopped/restarted? Regards, Vasil

This should be fine. As you said: update the fence config, increment the config_version, save and exit. Run 'ccs_config_validate' and, if that passes, 'cman_tool version -r'. Note that for this to work, you need to have set the 'ricci' user's shell password, and have the 'ricci' and 'modclusterd' daemons running. Once done, run 'fence_check'[1] to verify that the fence config works (it makes a status call to check). If that works, you're good to go. You can also crontab the fence_check call and have it email you or something, so that you can catch fence failures earlier. digimer

1. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config

I should clarify: you can update the config while the cluster is online. No fences will be called and you do not need to restart anything. cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
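The online update procedure described above can be sketched as a short shell session (the power_wait value and config_version numbers are illustrative only):

```sh
# 1. Edit the config: add power_wait to the fence device and bump the
#    version, e.g. config_version="42" -> config_version="43".
vi /etc/cluster/cluster.conf

# 2. Validate the edited configuration before pushing it anywhere.
ccs_config_validate

# 3. Push the new version to all nodes. Requires the ricci and
#    modclusterd daemons running and the ricci user's password set.
cman_tool version -r

# 4. Verify fencing still works; this only makes status calls,
#    it does not fence anything.
fence_check
```

All of this can be done with the cluster and services online; nothing is restarted and no fence action is triggered.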
Re: [Linux-cluster] Online change of fence device options - possible?
On 23/06/14 02:09 PM, Vasil Valchev wrote:

Hello, I have a RHEL 6.5 cluster using rgmanager. The fence devices are fence_ipmilan, fencing through HP iLO 4. The issue is that the fence devices weren't configured entirely correctly: recently, after a node failure, the fence agent was returning failures (even though it was fencing the node successfully), which apparently can be avoided by adding the power_wait option to the fence device configuration. My question is: after changing the fence device (I think directly through the .conf will be fine?), incrementing the config version, and syncing the .conf through the cluster software, is something else necessary to apply the change (e.g. a cman reload)? Will the new fence option be used the next time a fencing action is performed? And lastly, can all of this be performed while the cluster and services are operational, or do they have to be stopped/restarted? Regards, Vasil

This should be fine. As you said: update the fence config, increment the config_version, save and exit. Run 'ccs_config_validate' and, if that passes, 'cman_tool version -r'. Note that for this to work, you need to have set the 'ricci' user's shell password, and have the 'ricci' and 'modclusterd' daemons running. Once done, run 'fence_check'[1] to verify that the fence config works (it makes a status call to check). If that works, you're good to go. You can also crontab the fence_check call and have it email you or something, so that you can catch fence failures earlier. digimer

1. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] fence Agent
On 22/06/14 03:55 AM, Amjad Syed wrote:

Hello, I am trying to set up a simple 2-node cluster in active/passive mode for Oracle high availability. We are using one INSPUR server and one HP ProLiant (a management decision based on hardware availability), and we are seeing if we can use IPMI as the fencing method. CCHS though supports HP iLO, Dell IPMI and IBM, but not INSPUR. So the basic question I have is: can we use fence_ilo (for the HP) and fence_ipmilan (for the INSPUR)? If anyone has experience with fence_ipmilan or can point to resources, it would really be appreciated. Sincerely, Amjad

fence_ipmilan works with just about every IPMI-based out-of-band management interface. Most of the branded ones, like DRAC, RSA, iLO, etc., are fundamentally based on IPMI. I've used fence_ipmilan on iLO personally and it's fine. If you can show what 'ipmitool' command you use to check whether the peer is powered on or off, then you should be able to translate it quite easily into a matching fence_ipmilan call (check 'man fence_ipmilan' for the switches). Once you can check the power status of the peer(s) with fence_ipmilan, you're 95% of the way there. cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
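As a concrete illustration of the ipmitool-to-fence_ipmilan translation described above, here is a hypothetical pair of commands — the address, credentials and the lanplus flag are placeholders; check 'man fence_ipmilan' on your system for the exact switches your version supports:

```sh
# Raw IPMI power status query with ipmitool (LAN+ interface assumed):
ipmitool -I lanplus -H 10.0.0.10 -U admin -P secret chassis power status

# Roughly equivalent status check through the fence agent:
fence_ipmilan -P -a 10.0.0.10 -l admin -p secret -o status
```

Once the '-o status' call returns the correct power state for each peer, the same values become the ipaddr/login/passwd attributes of the fencedevice entry in cluster.conf.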
Re: [Linux-cluster] Node is randomly fenced
On 18/06/14 02:20 PM, YB Tan Sri Dato Sri' Adli a.k.a Dell wrote:

Hi, the Linux clustering will only be working perfectly if you run the Linux operating systems between nodes. Allow a persistent root ssh connection on top of same-specification hardware platforms. To perform a test or proof of concept, you may run and configure between two nodes. The databases for clustering will be configured right after the two nodes' Linux operating systems run with a persistent root-access ssh connection. Sent from Yahoo Mail for iPhone https://overview.mail.yahoo.com?.src=iOS

You have said this a couple of times now, and I am not sure why. There is no need to have persistent, root-access SSH between nodes. It's helpful in some cases, sure, but certainly not required. Corosync, which provides cluster membership and communication, handles internode traffic itself, on its own UDP port (using multicast by default, or unicast if configured). There is also nothing restricting you to two nodes. It's a good configuration, and one I use personally, but there are many 3+ node clusters out there. As for a database cluster, that would depend entirely on which database you are using and whether you are using tools specific to that DB or a more generic HA stack like corosync + pacemaker. Cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Two-node cluster GFS2 confusing
I don't use VMware myself, but I think fence_vmware will work for you. Please note that simply enabling stonith is not enough. As you realize, you need a configured and working fence method. If you try using the command line, you can play with the command's switches, asking for 'status'. When that returns properly, you will then just need to convert the switches into arguments for pacemaker. Read the man page for 'fence_vmware', and then try calling:

fence_vmware ... -o status

Fill in the switches and values you need based on the instructions in 'man fence_vmware'. digimer

On 18/06/14 09:51 PM, Le Trung Kien wrote:

Hi, as Digimer suggested, I changed the property stonith-enabled=true. But now I don't know which fencing method I should use, because my two Red Hat nodes run on VMware Workstation, with OpenFiler as the shared SCSI LUN storage. I attempted to use fence_scsi, but no luck; I got this error:

Jun 19 08:35:58 server1 stonith_admin[3837]: notice: crm_log_args: Invoked: stonith_admin --reboot server2 --tolerance 5s
Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) failed with rc=255

Here is my fencing configuration:

<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <cman expected_votes="1" cluster_id="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="server1" votes="1" nodeid="1">
      <fence>
        <method name="scsi">
          <device name="scsi_dev" key="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" votes="1" nodeid="2">
      <fence>
        <method name="scsi">
          <device name="scsi_dev" key="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_dev" aptpl="1" logfile="/tmp/fence_scsi.log"/>
  </fencedevices>
</cluster>

And the log /tmp/fence_scsi.log shows:

Jun 18 19:49:40 fence_scsi: [error] no devices found

I will try vmware_soap to see if it works.
Kien Le

-----Original Message----- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer Sent: Wednesday, June 18, 2014 11:18 AM To: linux clustering Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing

On 16/06/14 07:43 AM, Le Trung Kien wrote:

Hello everyone, I'm new to Linux clustering. I have built a two-node cluster (without qdisk) that includes: Red Hat 6.4, cman, pacemaker, gfs2. My cluster can fail over (back and forth) between the two nodes for these three resources: ClusterIP, WebFS (Filesystem GFS2, mounting /dev/sdc on /mnt/gfs2_storage), WebSite (apache service). My problem occurs when I stop/start the nodes in the following order (starting with both nodes up):

1. Stop node1 (shutdown) - all resources fail over to node2 - all resources still working on node2
2. Stop node2 (stop services: pacemaker, then cman) - all resources stop (of course)
3. Start node1 (start services: cman, then pacemaker) - only ClusterIP started, WebFS failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

ClusterIP (ocf::heartbeat:IPaddr2): Started server1
WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED

Failed actions: WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error

Here is my /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="server1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

Here is my 'crm configure show': [snip] stonith-enabled=false

Well, this is a problem. When cman detects a failure (well, corosync detects it, but cman is told), it initiates a fence request. The fence daemon informs DLM, which blocks. Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured in fencing, pacemaker will fail
Re: [Linux-cluster] Two-node cluster GFS2 confusing
On 16/06/14 07:43 AM, Le Trung Kien wrote:

Hello everyone, I'm new to Linux clustering. I have built a two-node cluster (without qdisk) that includes: Red Hat 6.4, cman, pacemaker, gfs2. My cluster can fail over (back and forth) between the two nodes for these three resources: ClusterIP, WebFS (Filesystem GFS2, mounting /dev/sdc on /mnt/gfs2_storage), WebSite (apache service). My problem occurs when I stop/start the nodes in the following order (starting with both nodes up):

1. Stop node1 (shutdown) - all resources fail over to node2 - all resources still working on node2
2. Stop node2 (stop services: pacemaker, then cman) - all resources stop (of course)
3. Start node1 (start services: cman, then pacemaker) - only ClusterIP started, WebFS failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

ClusterIP (ocf::heartbeat:IPaddr2): Started server1
WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED

Failed actions: WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error

Here is my /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="server1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

Here is my 'crm configure show': [snip] stonith-enabled=false

Well, this is a problem. When cman detects a failure (well, corosync detects it, but cman is told), it initiates a fence request. The fence daemon informs DLM, which blocks.
Then fenced calls the configured 'fence_pcmk', which just passes the request up to pacemaker. Without stonith configured, pacemaker will fail to fence, of course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by design. If you configure proper fencing in pacemaker (and test it to make sure it works), then pacemaker *would* succeed in fencing and return a success to fence_pcmk. Then fenced is told that the fence succeeded, and DLM cleans up lost locks and returns to normal operation. So please configure and test real stonith in pacemaker and see if your problem is resolved.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
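A hedged sketch of what a working stonith setup might look like with the crm shell used in this thread. The agent name (fence_vmware_soap, as the poster planned to try) and every parameter value — vCenter address, credentials, VM name — are assumptions to be replaced with whatever a manual '-o status' test shows works:

```sh
# Test the agent from the command line first (all values are placeholders):
fence_vmware_soap -a vcenter.example.com -l admin -p secret \
    -n server2 --ssl -o status

# If that prints the VM's power status, wire the same values into pacemaker:
crm configure primitive fence-server2 stonith:fence_vmware_soap \
    params ipaddr="vcenter.example.com" login="admin" passwd="secret" \
           port="server2" ssl="1" pcmk_host_list="server2" \
    op monitor interval="60s"
crm configure property stonith-enabled=true
```

With a stonith primitive per node and stonith-enabled=true, the fence_pcmk redirect from cman has something real to call, so DLM can unblock after a node failure.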
Re: [Linux-cluster] 2-node cluster fence loop
Have you tried simple things like disabling iptables or selinux, just to test? If that doesn't work, and it's a small cluster, try unicast and see if that helps (again, even if just to test).

On 12/06/14 10:29 AM, Arun G Nair wrote:

We have multicast enabled on the switch. I've also tried the multicast.py tool from RH's knowledge base to test multicast, and I see the expected output, though the tool uses a different multicast IP (I guess that shouldn't matter). I've tried increasing post_join_delay to 360 seconds to give me enough time to check everything on both nodes. One node still gets fenced. The `clustat` output says the other node is offline on both servers. So one node can't see the other? This again points to an issue with multicast. Any other clues as to what/where to look?

On Wed, Jun 11, 2014 at 8:33 PM, Digimer li...@alteeve.ca mailto:li...@alteeve.ca wrote:

On 11/06/14 10:48 AM, Arun G Nair wrote:

Hello, what are the reasons for fence loops when only cman is started? We have an RHEL 6.5 2-node cluster which goes into a fence loop every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried increasing the fence post_join_delay but one of the nodes still gets fenced. The cluster works fine if we use unicast UDP. Thanks,

Hi, when cman starts, it waits post_join_delay seconds for the peer to connect. If the peer has not joined after that time expires (6 seconds by default, iirc), it gives up and calls a fence against the peer to put it into a known state. Corosync is what determines membership, and it is started by cman. The rgmanager only handles resource start/stop/relocate/recovery and has nothing to do with fencing directly. Corosync is what uses multicast.
So as you seem to have already surmised, multicast is probably not working in your environment. Have you enabled multicast traffic on the firewall? Do your switches support multicast properly? digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Arun G Nair Sr. Sysadmin Dimension Data | Ph: (800) 664-9973
Feedback? We're listening http://www.surveymonkey.com/s/XRCYXBH
Re: [Linux-cluster] Node is randomly fenced
To confirm; Have you tried with the bonds setup where each node has one link into either switch? I just want to be sure you've ruled out all the network hardware. Also please confirm that you used mode=1 (active-passive) bonding. Assuming this doesn't help, then I would say that I was wrong in assuming it was network related. The next thing I would look at is corosync. Do you see any messages about totem retransmit? On 12/06/14 11:32 AM, Schaefer, Micah wrote: Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and fenced, then node3 was fenced when node4 came back online. The network topology is as follows: switch1: node1, node3 (two connections) switch2: node2, node4 (two connections) switch1 ― switch2 All on the same subnet I set up monitoring at 100 millisecond of the nics in active-backup mode, and saw no messages about link problems before the fence. I see multicast between the servers using tcpdump. Any more ideas? On 6/12/14, 12:19 AM, Digimer li...@alteeve.ca wrote: I considered that, but I would expect more nodes to be lost. On 12/06/14 12:12 AM, Netravali, Ganesh wrote: Make sure multicast is enabled across the switches. -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Schaefer, Micah Sent: Thursday, June 12, 2014 1:20 AM To: linux clustering Subject: Re: [Linux-cluster] Node is randomly fenced Okay, I set up active/ backup bonding and will watch for any change. 
This is the network side:

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets

This is the server side:

em1  Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
     inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
     inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
     RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
     TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
     Interrupt:34 Memory:d500-d57f

I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch on the same subnet. I'll work on cross-connecting the bonded interfaces to different switches.

On 6/11/14, 3:28 PM, Digimer li...@alteeve.ca wrote:

The first thing I would do is get a second NIC and configure active-passive bonding. Network issues are too common to ignore in HA setups. Ideally, I would span the links across separate stacked switches. As for debugging the issue, I can only recommend looking closely at the system and switch logs for clues.

On 11/06/14 02:55 PM, Schaefer, Micah wrote:

I have the issue on two of my nodes. Each node has one 10 Gb connection. No bonding, single link. What else can I look at? I manage the network too. I don't see any link-down notifications, and don't see any errors on the ports.

On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote:

On 11/06/14 02:21 PM, Schaefer, Micah wrote:

It failed again, even after deleting all the other failover domains. Cluster conf: http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting whether it really is a network issue or something else?

Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)

This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide, as only one member was lost, but I would certainly put my money on a network problem somewhere, somehow. Do you use bonding?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Node is randomly fenced
On 12/06/14 12:33 PM, yvette hirth wrote: On 06/12/2014 08:32 AM, Schaefer, Micah wrote: Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and fenced, then node3 was fenced when node4 came back online. The network topology is as follows: switch1: node1, node3 (two connections) switch2: node2, node4 (two connections) switch1 ― switch2 All on the same subnet I set up monitoring at 100 millisecond of the nics in active-backup mode, and saw no messages about link problems before the fence. I see multicast between the servers using tcpdump. Any more ideas? spanning-tree scans/rebuilds happen on 10Gb circuits just like they do on 1Gb circuits, and when they happen, traffic on the switches *can* come to a grinding halt, depending upon the switch firmware and the type of spanning-tree scan/rebuild being done. you may want to check your switch logs to see if any spanning-tree rebuilds were being done at the time of the fence. just an idea, and hth yvette hirth When I've seen this (I now disable STP entirely), it blocks all traffic so I would expect multiple/all nodes to partition off on their own. Still, worth looking into. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Node is randomly fenced
On 12/06/14 12:48 PM, Schaefer, Micah wrote:

As far as the switches go, both are Cisco Catalyst 6509-E; no spanning-tree changes are happening, and all the ports have PortFast enabled for these servers. My switch logging level is very high and I have no messages relating to the time frames or ports. TOTEM reports that “A processor joined or left the membership…”, but that isn't enough detail. Also note that I did not have these issues until adding the new servers, node3 and node4, to the cluster. Node1 and node2 do not fence each other (unless a real issue is there), and they are on different switches.

Then I can't imagine it being the network anymore. Seeing as both node 3 and 4 get fenced, it's likely not hardware either. Are the workloads on 3 and 4 much higher (or are the computers much slower) than on 1 and 2? I'm wondering if the nodes are simply not keeping up with corosync traffic. You might try adjusting the corosync token timeout and retransmit counts to see if that reduces the node losses.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
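The token-timeout tuning suggested above goes in a totem element inside cluster.conf on a cman cluster. The values below are illustrative starting points only, not recommendations — longer timeouts make the cluster more tolerant of stalls at the cost of slower failure detection:

```xml
<!-- Inside <cluster> in /etc/cluster/cluster.conf; example values only. -->
<totem token="20000" token_retransmits_before_loss_const="10"/>
```

After editing, increment config_version and propagate the config as usual ('ccs_config_validate', then 'cman_tool version -r').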
Re: [Linux-cluster] Node is randomly fenced
:44:53 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] got commit token
Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86
Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334
Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state.
Jun 12 14:44:54 corosync [TOTEM ] got commit token
Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state.
Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages in recovery.
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state
Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0
Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1
Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state.
Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms, flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms, flushing membership messages.

On 6/12/14, 1:55 PM, Schaefer, Micah micah.schae...@jhuapl.edu wrote:

I just found that the clock on node1 was off by about a minute and a half compared to the rest of the nodes. I am running ntp, so not sure why the time wasn't synced up.
I wonder if node1, being behind, would think it was not receiving updates from the other nodes?

On 6/12/14, 1:29 PM, Digimer li...@alteeve.ca wrote:

Even if the token changes stop the immediate fencing, please don't leave it at that. There is something fundamentally wrong that you need to identify/fix. Keep us posted!

On 12/06/14 01:24 PM, Schaefer, Micah wrote:

The servers do not run any tasks other than the tasks in the cluster service group. Nodes 3 and 4 are physical servers with a lot of horsepower, and nodes 1 and 2 are virtual machines with far fewer resources
Re: [Linux-cluster] 2-node cluster fence loop
On 11/06/14 10:48 AM, Arun G Nair wrote:

Hello, what are the reasons for fence loops when only cman is started? We have an RHEL 6.5 2-node cluster which goes into a fence loop every time we start cman on both nodes. Either one fences the other. Multicast seems to be working properly. My understanding is that without rgmanager running there won't be a multicast group subscription? I don't see the multicast address in 'netstat -g' unless rgmanager is running. I've tried increasing the fence post_join_delay but one of the nodes still gets fenced. The cluster works fine if we use unicast UDP. Thanks,

Hi, when cman starts, it waits post_join_delay seconds for the peer to connect. If the peer has not joined after that time expires (6 seconds by default, iirc), it gives up and calls a fence against the peer to put it into a known state. Corosync is what determines membership, and it is started by cman. The rgmanager only handles resource start/stop/relocate/recovery and has nothing to do with fencing directly. Corosync is what uses multicast. So as you seem to have already surmised, multicast is probably not working in your environment. Have you enabled multicast traffic on the firewall? Do your switches support multicast properly? digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
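One practical way to verify multicast between cluster nodes, independent of cman, is the omping utility (packaged separately; hostnames below are placeholders). Run the same command on each node at roughly the same time:

```sh
# On node1 AND node2 simultaneously; each side reports unicast and
# multicast response statistics for the other. Clean unicast with
# lost multicast points at switch/firewall multicast handling.
omping node1 node2
```

If multicast turns out to be unreliable on the network and cannot be fixed, switching cman/corosync to UDP unicast (transport="udpu") is a legitimate workaround, as the poster already observed.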
Re: [Linux-cluster] Node is randomly fenced
On 11/06/14 02:21 PM, Schaefer, Micah wrote: It failed again, even after deleting all the other failover domains. Cluster conf http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting if it really is a network issue or something else? Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration. Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide as only one member was lost, but I would certainly put my money on a network problem somewhere, somehow. Do you use bonding? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Node is randomly fenced
The first thing I would do is get a second NIC and configure active-passive bonding. Network issues are too common to ignore in HA setups. Ideally, I would span the links across separate stacked switches. As for debugging the issue, I can only recommend looking closely at the system and switch logs for clues. On 11/06/14 02:55 PM, Schaefer, Micah wrote: I have the issue on two of my nodes. Each node has 1ea 10gb connection. No bonding, single link. What else can I look at? I manage the network too. I don't see any link down notifications, don't see any errors on the ports. On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote: On 11/06/14 02:21 PM, Schaefer, Micah wrote: It failed again, even after deleting all the other failover domains. Cluster conf http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting if it really is a network issue or something else? Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration. Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide as only one member was lost, but I would certainly put my money on a network problem somewhere, somehow. Do you use bonding? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Node is randomly fenced
I considered that, but I would expect more nodes to be lost. On 12/06/14 12:12 AM, Netravali, Ganesh wrote: Make sure multicast is enabled across the switches. -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Schaefer, Micah Sent: Thursday, June 12, 2014 1:20 AM To: linux clustering Subject: Re: [Linux-cluster] Node is randomly fenced Okay, I set up active/backup bonding and will watch for any change. This is the network side: 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored 0 output errors, 0 collisions, 0 interface resets This is the server side: em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB) Interrupt:34 Memory:d500-d57f I need to run some fiber, but for now two nodes are plugged into one switch and the other two nodes into a separate switch that are on the same subnet. I'll work on cross-connecting the bonded interfaces to different switches. On 6/11/14, 3:28 PM, Digimer li...@alteeve.ca wrote: The first thing I would do is get a second NIC and configure active-passive bonding. Network issues are too common to ignore in HA setups. Ideally, I would span the links across separate stacked switches. As for debugging the issue, I can only recommend looking closely at the system and switch logs for clues. On 11/06/14 02:55 PM, Schaefer, Micah wrote: I have the issue on two of my nodes. Each node has 1ea 10gb connection. No bonding, single link. What else can I look at? I manage the network too. I don't see any link down notifications, don't see any errors on the ports. 
On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote: On 11/06/14 02:21 PM, Schaefer, Micah wrote: It failed again, even after deleting all the other failover domains. Cluster conf http://pastebin.com/jUXkwKS4 I turned corosync output to debug. How can I go about troubleshooting if it really is a network issue or something else? Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration. Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1) This is, to me, *strongly* indicative of a network issue. It's not likely switch-wide as only one member was lost, but I would certainly put my money on a network problem somewhere, some how. Do you use bonding? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
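For reference, an active-backup bond on RHEL 6 looks roughly like the fragment below. This is a sketch: the device names, IP address, and miimon interval are placeholders, not values taken from Micah's setup.

```
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative)
DEVICE=bond0
IPADDR=10.70.100.101
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for the second slave NIC)
DEVICE=em1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

With each slave cabled to a different switch, a single switch or link failure causes a sub-second failover instead of a corosync membership change.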
Re: [Linux-cluster] Node is randomly fenced
On 04/06/14 10:59 AM, Schaefer, Micah wrote: I have a 4 node cluster, running a single service group. I have been seeing node1 fence node3 while node3 is actively running the service group at random intervals. Rgmanager logs show no failures in service checks, and no other logs provide any useful information. How can I go about finding out why node1 is fencing node3? I currently set up the failover domain to be restricted and not include node3. cluster.conf : http://pastebin.com/xYy6xp6N Random fencing is almost always caused by network failures. Can you look at the system logs, starting a little before the fence and continuing until after the fence completes, and paste them here? I suspect you will see corosync complaining. If this is true, do your switches support persistent multicast? Do you use active/passive bonding? Have you tried different switch/cable/NIC? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] fence_ipmilan / custom hardware target address (ipmitool -t hexaddr)
On 15/05/14 02:39 PM, Jeff Johnson wrote: Greetings, I am looking to adapt fence_ipmilan to interact with a custom implementation of an IPMI BMC. Doing so requires the use of ipmitool's -t option to bridge IPMI requests to a specified internal (non-networked) hardware address. I do not see this option existing in fence_ipmilan or any of the other fence_agents modules. The ipmitool operation would be '/path/to/ipmitool -t 0x42 chassis power operation'. No network, IP, Auth, User, Password or other arguments required. I want to check with the developers to see if there is an existing path for this use case before submitting a patch for consideration. Thanks, --Jeff Marek Grac, who I've cc'ed here, would be the best person to give advice on this. As a user, I think a simple patch to add your option would be fine. I do not believe (though stand to be corrected) that address, user or password is currently required with fence_ipmilan. If I am wrong and it is required, then perhaps forking fence_ipmilan to something like fence_ipmihw (or whatever) and then pushing it out as a new agent should be easy and could work. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] clusvcadm
On 07/05/14 03:05 PM, Paras pradhan wrote: Well, I have a qdisk with 3 votes. That's why it is 6. Here is the log. I see some GFS hangs but no issue with GFS mounts at this time. http://pastebin.com/MP4BF86c I am seeing this in clumond.log; not sure if it is related or what it is. Mon May 5 21:58:20 2014 clumond: Peer (vprd3.domain): pruning queue 23340-11670 Tue May 6 01:38:57 2014 clumond: Peer (vprd3.domain): pruning queue 23340-11670 Tue May 6 01:39:02 2014 clumond: Peer (vprd1.domain): pruning queue 23340-11670 Thanks Paras Was there a failed fence action prior to this? If so, DLM is probably blocked. Can you post your logs starting from just prior to the network interruption? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] KVM Live migration when node's FS is read-only
Hi all, So I hit a weird issue last week... (EL6 + cman + rgmanager + drbd) For reasons unknown, a client thought they could start yanking and replacing hard drives on a running node. Obviously, that did not end well. The VMs that had been running on the node continued to operate fine; they just started using the peer's storage. The problem came when I tried to live-migrate the VMs over to the still-good node. The old host couldn't write to its logs, and the live migration failed. Once the migration failed, rgmanager also stopped working. In the end, I had to manually fence the node (corosync never failed, so it didn't get automatically fenced). This caused the VMs running on the node to reboot, causing a ~40 second outage. It strikes me that the system *should* have been able to migrate, had it not tried to write to the logs. Is there a way, or can there be made a way, to migrate VMs off of a node whose underlying FS is read-only/corrupt/destroyed, so long as the programs in memory are still working? I am sure this is part an rgmanager, part KVM/qemu question. Thanks for any feedback! -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Simple data replication in a cluster
On 03/04/14 04:58 PM, Vallevand, Mark K wrote: I'm looking for a simple way to replicate data within a cluster. It looks like my resources will be self-configuring and may need to push the changes they see to all nodes in the cluster. The idea being that when a node crashes, the resource will have its configuration present on the node on which it is restarted. We're talking about a few KB of data, probably in one file, probably text. A typical cluster would have multiple resources (more than two), one resource per node and one extra node. Ideas? Could I use the CIB directly to replicate data? Use cibadmin to update something and sync? How big can a resource parameter be? Could a resource modify its parameters so that they are replicated throughout the cluster? Is there a simple file replication Resource Agent? DRBD seems like overkill. Regards. Mark K Vallevand mark.vallev...@unisys.com If you don't want to use DRBD + gfs2 (what I use), then you'll probably want to look at corosync directly for keeping the data in sync. Pacemaker itself is a cluster resource manager, and I don't think the CIB is well suited for general data sync'ing. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] GFS2 unformat helper tool
On 30/03/14 10:34 AM, Hamid Jafarian wrote: Hi, We developed GFS2 volume unformat helper tool. Read about this code at: http://pdnsoft.com/en/web/pdnen/blog/-/blogs/gfs2-unformat-helper-tool-1 Regards Thanks for sharing this! Madi -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] unformat gfs2
That is very good news! Now, about your backups... ;) Look forward to seeing your code! digimer On 22/03/14 04:13 PM, Mr.Pine wrote: Good news for all: I successfully recovered all of my data (1.5 TB) without even one bit lost! My program took only 1 hour to do all the jobs on my 1.7 TB partition. (I could not wait 100 days for my bash script to finish.) I will publish my source code very soon for public use. Special thanks to Bob for the help. Mr.Pine On Wed, Mar 19, 2014 at 4:58 PM, Bob Peterson rpete...@redhat.com wrote: - Original Message - Hi, The script is very, very slow, so I should write a program in C/C++. I need some confidence about the data structures and data locations on disk. As I reviewed blocks of data: All reserved blocks (GFS2-specific blocks) start with: 0x01161970 The block type is stored at byte #8. The type of the start block of each resource group is: 2 Bitmaps are in block types 2 and 3. In block type 2, bitmap info starts from byte #129 In block type 3, bitmap info starts from byte #25 The length of RGs is constant, 5 in my volume (output of gfs2_edit -p rindex /dev/..) Is this info right? The logic of my program seems like it should be: (1) Loop over the device and temporarily store the block IDs of dinode blocks, and also their bitmap locations (2) Change the bitmap state of those blocks to 3 (11) Bob, could you confirm this? Regards Pine. Hi Pine, This is correct. The length of RGs is properly determined by the values in the rindex system file, but 5 is very common, and is usually constant. (It may change if you used gfs2_grow or gfs2_convert from gfs1). The bitmap is 2 bits per block in the resource group, and it's relative to the start of the particular rgrp. You should probably use the same algorithm in libgfs2 to change the proper bit in the bitmaps. You can get this from the public gfs2-utils git tree. 
Regards, Bob Peterson Red Hat File Systems -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
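To make Bob's description concrete, here is a small Python sketch of the 2-bits-per-block bitmap arithmetic. The helper names are mine, and the packing order (4 blocks per byte, low-order bits first) follows the gfs2 convention as I understand it — verify against libgfs2 in the gfs2-utils tree before touching real metadata.

```python
# GFS2 resource-group bitmaps: 2 bits per block, 4 blocks per byte,
# block 0 of each byte in the low-order bits (assumption; check libgfs2).
GFS2_BLKST_FREE = 0
GFS2_BLKST_USED = 1
GFS2_BLKST_DINODE = 3  # 0b11 -- the state Pine's recovery sets

def set_block_state(bitmap: bytearray, rel_block: int, state: int) -> None:
    """Set the 2-bit allocation state of a block, relative to its rgrp."""
    byte_index = rel_block // 4          # which bitmap byte
    shift = (rel_block % 4) * 2          # which 2-bit slot in that byte
    cleared = bitmap[byte_index] & ~(0b11 << shift) & 0xFF
    bitmap[byte_index] = cleared | ((state & 0b11) << shift)

def get_block_state(bitmap: bytes, rel_block: int) -> int:
    """Read back the 2-bit allocation state of a block."""
    byte_index = rel_block // 4
    shift = (rel_block % 4) * 2
    return (bitmap[byte_index] >> shift) & 0b11
```

For example, marking relative block 5 as a dinode flips bits 2-3 of bitmap byte 1. Remember the bitmap region itself starts at byte 129 of a type-2 block and byte 25 of a type-3 block, per Pine's notes above.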
Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
On 18/03/14 09:27 PM, Digimer wrote: Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600 action name=stop timeout=10m / /vm I already use ccs to create the entry: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600/ via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Thanks! As per the request on #linux-cluster, I have opened a rhbz for this: https://bugzilla.redhat.com/show_bug.cgi?id=1079032 -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
On 20/03/14 03:31 PM, Digimer wrote: On 18/03/14 09:27 PM, Digimer wrote: Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600 action name=stop timeout=10m / /vm I already use ccs to create the entry: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600/ via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Thanks! As per the request on #linux-cluster, I have opened a rhbz for this: https://bugzilla.redhat.com/show_bug.cgi?id=1079032 Split the rgmanager section out: https://bugzilla.redhat.com/show_bug.cgi?id=1079039 -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
On 19/03/14 06:31 PM, Chris Feist wrote: On 03/18/2014 08:27 PM, Digimer wrote: Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600 action name=stop timeout=10m / /vm I already use ccs to create the entry: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600/ via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Unfortunately currently ccs doesn't support setting resource actions. However it's my understanding that rgmanager doesn't check timeouts unless __enforce_timeouts is set to 1. So you shouldn't be seeing a vm resource go to failed if it takes a long time to stop. Are you trying to make the vm resource fail if it takes longer than 10 minutes to stop? I was afraid you were going to say that. :( The problem is that after calling 'disable' against the VM service, rgmanager waits two minutes. If the service isn't closed in that time, the server is forced off (at least, this was the behaviour when I last tested this). The concern is that, by default, windows installs queue updates to install when the system shuts down. During this time, windows makes it very clear that you should not power off the system during the updates. So if this timer is hit, and the VM is forced off, the guest OS can be damaged. Of course, we can debate the (lack of) wisdom of this behaviour, and I already document this concern (and even warn people to check for updates before stopping the server), it's not sufficient. 
If a user doesn't read the warning, or simply forgets to check, the consequences can be non-trivial. If ccs can't be made to add this attribute, and if the behaviour persists (I will test shortly after sending this reply), then I will have to edit the cluster.conf directly, something I am loath to do if at all avoidable. Cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
On 19/03/14 07:45 PM, Digimer wrote: On 19/03/14 06:31 PM, Chris Feist wrote: On 03/18/2014 08:27 PM, Digimer wrote: Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600 action name=stop timeout=10m / /vm I already use ccs to create the entry: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600/ via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Unfortunately currently ccs doesn't support setting resource actions. However it's my understanding that rgmanager doesn't check timeouts unless __enforce_timeouts is set to 1. So you shouldn't be seeing a vm resource go to failed if it takes a long time to stop. Are you trying to make the vm resource fail if it takes longer than 10 minutes to stop? I was afraid you were going to say that. :( The problem is that after calling 'disable' against the VM service, rgmanager waits two minutes. If the service isn't closed in that time, the server is forced off (at least, this was the behaviour when I last tested this). The concern is that, by default, windows installs queue updates to install when the system shuts down. During this time, windows makes it very clear that you should not power off the system during the updates. So if this timer is hit, and the VM is forced off, the guest OS can be damaged. Of course, we can debate the (lack of) wisdom of this behaviour, and I already document this concern (and even warn people to check for updates before stopping the server), it's not sufficient. 
If a user doesn't read the warning, or simply forgets to check, the consequences can be non-trivial. If ccs can't be made to add this attribute, and if the behaviour persists (I will test shortly after sending this reply), then I will have to edit the cluster.conf directly, something I am loath to do if at all avoidable. Cheers Confirmed; I called disable on a VM with gnome running, so that I could abort the VM's shut down. an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date Wed Mar 19 21:06:29 EDT 2014 Local machine disabling vm:vm01-rhel6...Success Wed Mar 19 21:08:36 EDT 2014 2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been a windows guest in the middle of installing updates, it would be highly likely to be screwed now. To confirm, I changed the config to: <vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"> <action name="stop" timeout="10m"/> </vm> Then I repeated the test: an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date Wed Mar 19 21:13:18 EDT 2014 Local machine disabling vm:vm01-rhel6...Success Wed Mar 19 21:23:31 EDT 2014 10 minutes and 13 seconds before the cluster killed the server, much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes; I understand that this blocks other processes, but in an HA environment, I'd strongly argue that safe > speed). digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
On 19/03/14 10:12 PM, Pavel Herrmann wrote: Hi On Wednesday 19 of March 2014 21:26:56 Digimer wrote: On 19/03/14 07:45 PM, Digimer wrote: On 19/03/14 06:31 PM, Chris Feist wrote: On 03/18/2014 08:27 PM, Digimer wrote: Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600 action name=stop timeout=10m / /vm I already use ccs to create the entry: vm name=vm01-win2008 domain=primary_n01 autostart=0 path=/shared/definitions/ exclusive=0 recovery=restart max_restarts=2 restart_expire_time=600/ via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Unfortunately currently ccs doesn't support setting resource actions. However it's my understanding that rgmanager doesn't check timeouts unless __enforce_timeouts is set to 1. So you shouldn't be seeing a vm resource go to failed if it takes a long time to stop. Are you trying to make the vm resource fail if it takes longer than 10 minutes to stop? I was afraid you were going to say that. :( The problem is that after calling 'disable' against the VM service, rgmanager waits two minutes. If the service isn't closed in that time, the server is forced off (at least, this was the behaviour when I last tested this). The concern is that, by default, windows installs queue updates to install when the system shuts down. During this time, windows makes it very clear that you should not power off the system during the updates. So if this timer is hit, and the VM is forced off, the guest OS can be damaged. 
Of course, we can debate the (lack of) wisdom of this behaviour, and I already document this concern (and even warn people to check for updates before stopping the server), it's not sufficient. If a user doesn't read the warning, or simply forgets to check, the consequences can be non-trivial. If ccs can't be made to add this attribute, and if the behaviour persists (I will test shortly after sending this reply), then I will have to edit the cluster.conf directly, something I am loath to do if at all avoidable. Cheers Confirmed; I called disable on a VM with gnome running, so that I could abort the VM's shut down. an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date Wed Mar 19 21:06:29 EDT 2014 Local machine disabling vm:vm01-rhel6...Success Wed Mar 19 21:08:36 EDT 2014 2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been a windows guest in the middle of installing updates, it would be highly likely to be screwed now. Is this really the best way to handle such an event? From what I remember, Windows can (or could, I don't have any 'modern' windows laying around) be told to shutdown without updating. maybe a wiser approach would be to make the stop event (which I believe is delivered to the guest as pressing the ACPI power button) trigger a shutdown without updates. keep in mind that doing system updates on timer is dangerous, irrelevant of the actual time regards Pavel Herrmann This assumes that we can modify how windows behaves. Unless there is a magic ACPI event that windows will reliably interpret as power off without updating, we can't rely on this. We have clients (and I am sure we aren't the only ones) who install their own OSes without any input from us. As mentioned earlier, we do document the risks, but that's not good enough. We can't force users to read. So we have a choice; Take mitigating steps or let the user shoot themselves in the foot because they should have known better. 
As personally satisfying as option #2 might seem, option #1 is the more professional approach, I would _strongly_ argue. digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] unformat gfs2
On 18/03/14 09:38 AM, Mr.Pine wrote: I have accidentally reformatted a GFS cluster. We need to unformat it.. is there any way to recover disk ? I read this post http://web.archiveorange.com/archive/v/TUhSn11xEn9QxXBIZ0k6 it say that I can use gfs2_edit to recover data. I need more details about changing block map to 0xff tnx Do you have a support agreement with Red Hat? If so, open a ticket with them. If not, then you can try also asking for help in freenode's #linux-cluster channel. It says no gfs support, but that's to prevent confusion with tracking open tickets, which won't apply if you don't have official red hat support. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Adding a stop timeout to a VM service using 'ccs'
Hi all, I would like to tell rgmanager to give more time for VMs to stop. I want this: <vm name="vm01-win2008" domain="primary_n01" autostart="0" path="/shared/definitions/" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"> <action name="stop" timeout="10m"/> </vm> I already use ccs to create the entry: <vm name="vm01-win2008" domain="primary_n01" autostart="0" path="/shared/definitions/" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/> via: ccs -h localhost --activate --sync --password secret \ --addvm vm01-win2008 \ --domain=primary_n01 \ path=/shared/definitions/ \ autostart=0 \ exclusive=0 \ recovery=restart \ max_restarts=2 \ restart_expire_time=600 I'm hoping it's a simple additional switch. :) Thanks! -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] [Linux-HA] Problems with fence_apc agent and accessing APC AP8965
Does 'fence_apc_snmp -a hac-pdu1 -n 1 -o status' work? What about 'fence_apc -a hac-pdu1 -l user -p passwd -n 1 -o status'? digimer On 23/02/14 08:57 PM, Andrew Beekhof wrote: Forwarding to linux-cluster which has more people knowledgeable on this set of fencing agents. On 22 Feb 2014, at 12:38 am, Tony Stocker tony.stoc...@nasa.gov wrote: All, I have a bigger issue regarding dual power supplies and fence_apc that I'm going to eventually need to resolve. But at this point I'm simply having basic issues getting the fence_apc agent to be able to access the devices in general, to wit: # fence_apc --ssh --ip=hac-pdu1 --plug=1 --username=blah --password=blah --verbose --action=status Unable to connect/login to fencing device However I can manually SSH into the device just fine: # ssh blah@hac-pdu1 blah@hac-pdu1's password: American Power Conversion Network Management Card AOS v5.1.9 (c) Copyright 2010 All Rights Reserved RPDU 2g v5.1.6 --- Name : hac-pdu1 Date : 02/21/2014 Contact : syst...@mail.myserver123.com Time : 13:12:02 Location : C101, HAC Rack 1 User : Administrator Up Time : 223 Days 17 Hours 0 Minutes Stat : P+ N4+ N6+ A+ Type ? for command listing Use tcpip command for IP address(-i), subnet(-s), and gateway(-g) apc So perhaps the place to start first is simply getting the fence_apc agent (provided by CentOS/RHEL package fence-agents-3.1.5-35.el6_5.3.x86_64) to actually be able to work correctly. Once that's done, I'll still need help on the dual power supply issue. I'm not seeing any attempts to login in the APC's logs file, though I do see connections when I manually login, e.g.: 02/21/2014 13:13:11System: Console user 'apc' logged out from 192.168.1.216. 02/21/2014 13:12:40System: Console user 'apc' logged in from 192.168.1.216. A manual 'telnet [name] 22' command also works fine from the command line: # telnet hac-pdu1 22 Trying 192.168.1.222... Connected to hac-1-pdu1 (192.168.1.222). Escape character is '^]'. 
SSH-2.0-cryptlib However fence_apc_snmp **does** seem to work: # fence_apc_snmp --snmp-version=1 --community=public --ip=hac-pdu1 --plug=1 --username=blah --password=blah --verbose --action=status /usr/bin/snmpwalk -m '' -Oeqn -v '1' -c 'public' 'hac-pdu1:161' '.1.3.6.1.2.1.1.2.0' No log handling enabled - turning on stderr logging Created directory: /var/lib/net-snmp/mib_indexes .1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.318.1.3.4.6 Trying APC Master Switch (fallback) /usr/bin/snmpget -m '' -Oeqn -v '1' -c 'public' 'hac-pdu1:161' '.1.3.6.1.4.1.318.1.1.4.4.2.1.3.1' .1.3.6.1.4.1.318.1.1.4.4.2.1.3.1 1 Status: ON Does anyone have any ideas as to why fence_apc is not working? Thanks! Tony ___ Linux-HA mailing list linux...@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
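[Editorial aside] Since fence_apc_snmp works from the command line, the corresponding cluster.conf wiring built from those same parameters might look like the sketch below. This is an assumption for illustration only: the device name "pdu1" and the node stanza are hypothetical, while the agent, address, community, and plug number are taken from the working test command above.

```xml
<!-- Sketch only: "pdu1" and the node wiring are hypothetical;
     ipaddr/community/port come from the working fence_apc_snmp test. -->
<clusternode name="node1" nodeid="1">
    <fence>
        <method name="pdu">
            <device name="pdu1" port="1"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice agent="fence_apc_snmp" name="pdu1"
                 ipaddr="hac-pdu1" snmp_version="1" community="public"/>
</fencedevices>
```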
Re: [Linux-cluster] CLVM CMAN live adding nodes
This is not true. I change things outside the rm tags often without restarting the cluster. It would be a significant flaw if that were the case. On 22/02/14 04:33 AM, emmanuel segura wrote: As far as I know, if you need to modify anything outside the rm.../rm tag (used by rgmanager) in the cluster.conf file, you need to restart the whole cluster stack; with cman+rgmanager I have never seen how to add or remove a node from the cluster without restarting cman. 2014-02-22 6:21 GMT+01:00 Bjoern Teipel bjoern.tei...@internetbrands.com: Hi all, who's using CLVM with CMAN in a cluster with more than 2 nodes in production? Did you guys manage to live-add a new node to the cluster while everything is running? I'm only able to add nodes while the cluster stack is shut down. That's certainly not a good idea when you have to run CLVM on hypervisors and you need to shut down all VMs to add a new box. It would also be good if you pasted some of your configs using IPMI fencing. Thanks in advance, Bjoern -- this is my life and I live it for as long as God wills -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
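[Editorial aside] Digimer's point, that cluster.conf changes can be pushed to a running cman cluster, matches the standard RHEL 6 procedure: edit cluster.conf, increment its config_version attribute, then run 'cman_tool version -r' to distribute and activate it. A sketch follows; the version bump is done with sed so it can be exercised offline, and the final cman_tool step is shown only as a comment since it needs a live cluster.

```shell
# Sketch: bump config_version in a copy of cluster.conf, then (on a
# real cluster) tell cman to reload and distribute the new config.
conf=cluster.conf
# Use the real file if present, otherwise a minimal stand-in for the demo.
cp /etc/cluster/cluster.conf "$conf" 2>/dev/null || \
    printf '<cluster config_version="20" name="demo"/>\n' > "$conf"

old=$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$conf")
new=$((old + 1))
sed -i "s/config_version=\"$old\"/config_version=\"$new\"/" "$conf"
grep config_version "$conf"

# On a live RHEL 6 cman cluster, the change is then activated with:
#   cman_tool version -r
```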
[Linux-cluster] Creating clustered LVM snapshots, locking and exclusivity
Hi all, I want to get clustered LV snapshotting working. I was under the impression it was simply a matter of disabling the LV on the other node (2-node cluster here). However, this fails because of locking issues. I can change the peer node's LV to 'inactive' with (confirmed with lvscan): [root@an-c05n01 ~]# lvchange -aln /dev/an-c05n02_vg0/vm01-rhel2_0 [root@an-c05n01 ~]# lvscan inactive '/dev/an-c05n02_vg0/vm01-rhel2_0' [50.00 GiB] inherit But I still can't create a snapshot on the other node running the VM: [root@an-c05n02 ~]# lvcreate -L 25GiB --snapshot -n vm01-rhel2_0_snapshot /dev/an-c05n02_vg0/vm01-rhel2_0 vm01-rhel2_0 must be active exclusively to create snapshot So I try to set it exclusive: [root@an-c05n02 ~]# lvchange -aey /dev/an-c05n02_vg0/vm01-rhel2_0 Error locking on node an-c05n02.alteeve.ca: Device or resource busy If I stop the VM running on the LV, then I can set the exclusive lock, boot the VM and later create the snapshot fine: [root@an-c05n02 ~]# lvcreate -L 25GiB --snapshot -n vm01-rhel2_0_snapshot /dev/an-c05n02_vg0/vm01-rhel2_0 Logical volume vm01-rhel2_0_snapshot created But then later, I can't remove the exclusive value, so I can't re-activate the LV after deleting the snapshot. I have to shut the VM down again in order to remove the exclusive flag. I'm assuming it's possible to snapshot clustered LVs while they're in use, without stopping what is using them twice... Can someone help clarify what the magical incantation is? Thanks! -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Linux-cluster] PowerEdge R610 idrac express fencing
On 19/02/14 07:30 PM, Michael Mendoza wrote: Good afternoon. We are trying to configure 2 dell R610 with idrac6 EXPRESS in cluster with redhat 5.10 x64. For testing we are using the command fence_ipmilan. We can ping idrac on the remote host. fence_ipmilan -a X.X.X.X -l usern -p -t 200 -o status -v -- works Over 3 minutes to confirm a fence action is an extremely long time! fence_ipmilan -a X.X.X.X -l usern -p -t 200 -o reboot -v The problem is the server reboots, but while it reboots the idrac6 reboots too, so host A after approx 120 seconds loses the connection and gets the following message. So you're saying that the IPMI interface, after rebooting the host, fails to respond for two full minutes? That strikes me as a reason to call Dell and ask for help. That can't be normal. Spawning: '/usr/bin/ipmitool -I lan -H 'X.X.X.X' -U 'usern' -P '[set]' -v -v -v chassis power on'... Spawned: '/usr/bin/ipmitool -I lan -H 'X.X.X.X' -U 'usern' -P '[set]' -v -v -v chassis power on' - PID 10104 Looking for: 'Password:', val = 1 'Unable to establish LAN', val = 11 'IPMI mutex', val = 14 'Unsupported cipher suite ID', val = 2048 'read_rakp2_message: no support for', val = 2048 'Up/On', val = 0 ExpectToken returned 11 Reaping pid 10104 Failed cman version is CMAN-2.0.115.118.el5_10.3 Is this an old existing cluster, or a new one you're trying to build? However I have another host with centos 6.4 and CMAN 3.0... and the connection is not lost. I run the same command, the server reboots as well as idrac, the ping comes back and the ipmi connection is not lost. Are these nodes in the same cluster? cman 2 and 3 are only designed to work together in maintenance mode, for rolling upgrades. Am I doing something wrong? I used the -t and -T options, even at 300 / 400, and it doesn't matter: the connection is cut after 120 seconds. On centos it works fine. (I already opened a case with redhat and am waiting for an answer.) Thanks It might be that the 120 second upper limit is a bug.
-- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
[Linux-cluster] What condition would cause clvmd to exit with '143' when status called?
I hit this in a program I use to monitor 'clvmd' on a local and peer node: 989; [ DEBUG ] - get_daemon_state(); daemon: [clvmd], node: [peer] 1002; [ DEBUG ] - shell call: [/usr/bin/sshr...@an-c07n02.alteeve.ca /etc/init.d/clvmd status; echo clvmd:\$?] 1019; [ DEBUG ] - line: [clvmd (pid 4114 4098) is running...] 1011; [ DEBUG ] - peer::daemon::clvmd::rc: [143] 1019; [ DEBUG ] - line: [bash: line 1: 4096 Terminated /etc/init.d/clvmd status] Daemon: [clvmd] is in an unknown state on: [an-c07n02.alteeve.ca]. Status return code was: [143]. The line numbers and debug messages are my program, not clvmd. Any idea why this would happen? This was from a node that had just been intentionally crashed, was fenced and booted back up. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
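[Editorial aside] One general observation about the exit code itself: shells report a process killed by a signal as 128 plus the signal number, so 143 is 128 + 15, i.e. SIGTERM, which is consistent with the "Terminated" line captured above. A quick, self-contained demonstration (nothing to do with clvmd specifically):

```shell
# Shells encode "killed by signal N" as exit code 128 + N.
# SIGTERM is signal 15, so a TERM'd process reports 143.
sleep 30 &
pid=$!
kill -TERM "$pid"
rc=0
wait "$pid" || rc=$?
echo "exit code: $rc"   # prints: exit code: 143
```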
Re: [Linux-cluster] Question about cluster behavior
Replies in-line: On 14/02/14 12:07 PM, FABIO FERRARI wrote: So it's not a normal behavior, I guess. Here is my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="59" name="mail">
<clusternodes>
<clusternode name="eta.mngt.unimo.it" nodeid="1">
<fence>
<method name="fence-eta">
<device name="fence-eta"/>
</method>
</fence>
</clusternode>
<clusternode name="beta.mngt.unimo.it" nodeid="2">
<fence>
<method name="fence-beta">
<device name="fence-beta"/>
</method>
</fence>
</clusternode>
<clusternode name="guerro.mngt.unimo.it" nodeid="3">
<fence>
<method name="fence-guerro">
<device name="fence-guerro" port="Guerro" ssl="on" uuid="4213f370-9572-63c7-26e4-22f0f43843aa"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="5"/>

You generally don't need to set this, the cluster can calculate it.

<quorumd label="mail-qdisk"/>

You don't set any votes, so the default is 1. So with expected votes being 5, that means all three nodes have to be up, or two nodes and qdisk.

<rm>
<resources>
<ip address="155.185.44.61/24" sleeptime="10"/>
<mysql config_file="/etc/my.cnf" listen_address="155.185.44.61" name="mysql" shutdown_wait="10" startup_wait="10"/>
<script file="/etc/init.d/httpd" name="httpd"/>
<script file="/etc/init.d/postfix" name="postfix"/>
<script file="/etc/init.d/dovecot" name="dovecot"/>
<fs device="/dev/mapper/mailvg-maillv" force_fsck="1" force_unmount="1" fsid="58161" fstype="xfs" mountpoint="/cl" name="mailvg-maillv" options="defaults,noauto" self_fence="1"/>
<lvm lv_name="maillv" name="lvm-mailvg-maillv" self_fence="1" vg_name="mailvg"/>
</resources>
<failoverdomains>
<failoverdomain name="mailfailoverdomain" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="eta.mngt.unimo.it" priority="1"/>
<failoverdomainnode name="beta.mngt.unimo.it" priority="2"/>
<failoverdomainnode name="guerro.mngt.unimo.it" priority="3"/>
</failoverdomain>
</failoverdomains>
<service domain="mailfailoverdomain" max_restarts="3" name="mailservices" recovery="restart" restart_expire_time="600">
<fs ref="mailvg-maillv">
<ip ref="155.185.44.61/24">
<mysql ref="mysql">
<script ref="httpd"/>
<script ref="postfix"/>
<script ref="dovecot"/>
</mysql>
</ip>
</fs>
</service>
</rm>
<fencedevices>
<fencedevice agent="fence_ipmilan" auth="password" ipaddr="155.185.135.105" lanplus="on" login="root" name="fence-eta" passwd="**" privlvl="ADMINISTRATOR"/>
<fencedevice agent="fence_ipmilan" auth="password" ipaddr="155.185.135.106" lanplus="on" login="root" name="fence-beta" passwd="**" privlvl="ADMINISTRATOR"/>
<fencedevice agent="fence_vmware_soap" ipaddr="155.185.0.10" login="etabetaguerro" name="fence-guerro" passwd="**"/>
</fencedevices>
</cluster>

What log file do you need? There are many in /var/log/cluster... By default, /var/log/messages is the most useful. Checking 'cman_tool status' and 'clustat' is also good. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
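[Editorial aside] The vote arithmetic behind Digimer's quorum comment above can be spelled out. cman's quorum threshold is expected_votes / 2 + 1 (integer division), so with expected_votes=5 the cluster needs 3 votes: either all three 1-vote nodes, or two nodes plus the 1-vote qdisk. A sketch of that arithmetic, runnable outside the cluster:

```shell
# cman quorum threshold: expected_votes / 2 + 1 (integer division).
expected_votes=5
quorum=$(( expected_votes / 2 + 1 ))
echo "quorum: $quorum votes"   # prints: quorum: 3 votes

# Three 1-vote nodes meet it; so do two nodes plus the 1-vote qdisk.
three_nodes=3
two_nodes_plus_qdisk=$(( 2 + 1 ))
echo "3 nodes: $three_nodes, 2 nodes + qdisk: $two_nodes_plus_qdisk"
```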
Re: [Linux-cluster] backup best practice when using Luci
On 10/02/14 09:12 AM, Benjamin Budts wrote: Ladies Gents (I won’t make that same mistake again ;) ), First, thank you to the lady who helped explain to me how to force an OK on fencing that is failing. A 2 node config Luci : I would like to put a backup solution in place for the cluster config / node config / fencing etc... What would you recommend? Or does Luci archive versions of config-files somewhere? Basically, if shit hits the fan I would like to untar a golden image of a config on luci and push it back to the nodes… Thx The main config file is /etc/cluster/cluster.conf. A copy of this file should be on all nodes at once, so even if you didn't have a backup proper, you should be able to copy it from another node. Beyond that, I personally backup the following (sample taken from a node called 'an-c05n01'):

mkdir ~/base
cd ~/base
mkdir root
mkdir -p etc/sysconfig/network-scripts/
mkdir -p etc/udev/rules.d/
# Root user
rsync -av /root/.bashrc root/
rsync -av /root/.ssh root/
# Directories
rsync -av /etc/ssh etc/
rsync -av /etc/apcupsd etc/
rsync -av /etc/cluster etc/
rsync -av /etc/drbd.* etc/
rsync -av /etc/lvm etc/
# Specific files.
rsync -av /etc/sysconfig/network-scripts/ifcfg-{eth*,bond*,vbr*} etc/sysconfig/network-scripts/
rsync -av /etc/udev/rules.d/70-persistent-net.rules etc/udev/rules.d/
rsync -av /etc/sysconfig/network etc/sysconfig/
rsync -av /etc/hosts etc/
rsync -av /etc/ntp.conf etc/
# Save recreating user accounts.
rsync -av /etc/passwd etc/
rsync -av /etc/group etc/
rsync -av /etc/shadow etc/
rsync -av /etc/gshadow etc/
# If you have the cluster built and want to backup its configs.
mkdir etc/cluster
mkdir etc/lvm
rsync -av /etc/cluster/cluster.conf etc/cluster/
rsync -av /etc/lvm/lvm.conf etc/lvm/
# NOTE: DRBD won't work until you've manually created the partitions.
rsync -av /etc/drbd.d etc/
# If you had to manually set the UUID in libvirtd;
mkdir etc/libvirt
rsync -av /etc/libvirt/libvirt.conf etc/libvirt/
# If you're running RHEL and want to backup your registration info;
rsync -av /etc/sysconfig/rhn etc/sysconfig/
# Pack it up
# NOTE: Change the name to suit your node.
tar -cvf base_an-c05n01.tar etc root

I then push the resulting tar file to my PXE server. I have a kickstart script that does a minimal rhel6 install, plus the cluster stuff, and then has a %post script that downloads this tar and extracts it. This way, when the node needs to be rebuilt, it's 95% ready to go. I still need to do things like 'drbdadm create-md res', but it's still very quick to restore a node. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
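[Editorial aside] The %post restore side of this scheme is essentially "fetch the tar and unpack it over /". Below is a self-contained sketch of that pack-and-restore round trip; the real kickstart would use the site's PXE URL (hypothetical here), and the demo confines everything to a scratch directory instead of a real root filesystem.

```shell
# Sketch of the backup/restore round trip, confined to a scratch dir.
work=$(mktemp -d)
cd "$work"

# "Node" side: collect a couple of files the way the rsync list does.
mkdir -p base/etc/cluster base/root
echo '<cluster/>' > base/etc/cluster/cluster.conf
echo 'alias ll="ls -l"' > base/root/.bashrc
tar -cf base_node.tar -C base etc root

# "%post" side: on a real node this would be roughly
#   wget http://pxe-server/base_node.tar && tar -xvf base_node.tar -C /
# (the pxe-server URL is a placeholder). Here we extract to a fake root:
mkdir newroot
tar -xf base_node.tar -C newroot
ls newroot/etc/cluster/cluster.conf
```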
Re: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage
On 07/02/14 11:13 AM, Benjamin Budts wrote: Gents, We're not all gents. ;) I have a 2 node setup (with quorum disk), redhat 6.5 and a luci mgmt console. Everything has been configured and we’re doing failover tests now. A couple of questions I have: When I simulate a complete power failure of a server's PDUs (no more access to idrac fencing or APC PDU fencing), I can see that the fencing of the node which was running the application fails. I noticed that unless fencing returns an OK I’m stuck and my application won’t start on my 2nd node. Which is ok I guess, because no fencing could mean there is still I/O on my SAN. This is expected. If a lost node can't be put into a known state, there is no safe way to proceed. To do so would be to risk a split brain at least, and data loss/corruption at worst. The way I deal with this is to have nodes with redundant power supplies and use two PDUs and two UPSes. This way, the failure of one circuit / UPS / PDU doesn't knock out the power to the mainboard of the nodes, so you don't lose IPMI. Clustat also shows on the active node that the 1st node is still running the application. That's likely because rgmanager uses DLM, and DLM blocks until the fence succeeds, so it can't update its view. How can I intervene manually, so as to force a start of the application on the node that is still alive? If you are *100% ABSOLUTELY SURE* that the lost node has been powered off, then you can run 'fence_ack_manual'. Please be super careful about this though. If you do this, in the heat of the moment with clients or bosses yelling at you, and the peer isn't really off (ie: it's only hung), you risk serious problems. I cannot emphasize strongly enough the caution needed when using this command. Is there a way to tell the cluster, don’t take into account node 1 anymore and don’t try to fence anymore, just start the application on the node that is still ok? No. That would risk a split brain and data corruption.
The only safe option for the cluster, in the face of a failed fence, is to hang. As bad as it is to hang, it's better than risking corruption. I can’t possibly wait until power returns to that server. Downtime could be too long. See the solution I mentioned earlier. If I tell a node to leave the cluster in Luci, I would like it to remain a non-cluster member after the reboot of that node. It rejoins the cluster automatically after a reboot. Any way to prevent this? Thx Don't let cman and rgmanager start on boot. This is always my policy. If a node failed and got fenced, I want it to reboot, so that I can log into it and figure out what happened, but I do _not_ want it back in the cluster until I've determined it is healthy. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine
On 01/02/14 01:35 PM, nik600 wrote: Dear all i need some clarification about clustering with rhel 6.4 i have a cluster with 2 nodes in active/passive configuration, i simply want to have a virtual ip and migrate it between the 2 nodes. i've noticed that if i reboot or manually shut down a node the failover works correctly, but if i power-off one node the cluster doesn't failover on the other node. Another strange situation is that if i power off all the nodes and then switch on only one, the cluster doesn't start on the active node. I've read the manual and documentation at https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html and i understand that the problem is related to fencing, but the problem is that my 2 nodes are on 2 virtual machines; i can't control the hardware and can't issue any custom command on the host side. I've tried to use fence_xvm but i'm not sure about it, because if my VM has powered off, how can it reply to fence_xvm messages? Here are my logs when i power off the VM: == /var/log/cluster/fenced.log == Feb 01 18:50:22 fenced fencing node mynode02 Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: error from agent Feb 01 18:50:53 fenced fence mynode02 failed I've tried to force the manual fence with: fence_ack_manual mynode02 and in this case the failover works properly. The point is: as i'm not using any shared filesystem but i'm only sharing apache with a virtual ip, i won't have any split-brain scenario so i don't need fencing, or not? So, is there the possibility to have a simple dummy fencing? here is my config.xml:

<?xml version="1.0"?>
<cluster config_version="20" name="hacluster">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="0"/>
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="mynode01" nodeid="1" votes="1">
<fence>
<method name="mynode01">
<device domain="mynode01" name="mynode01"/>
</method>
</fence>
</clusternode>
<clusternode name="mynode02" nodeid="2" votes="1">
<fence>
<method name="mynode02">
<device domain="mynode02" name="mynode02"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_xvm" name="mynode01"/>
<fencedevice agent="fence_xvm" name="mynode02"/>
</fencedevices>
<rm log_level="7">
<failoverdomains>
<failoverdomain name="MYSERVICE" nofailback="0" ordered="0" restricted="0">
<failoverdomainnode name="mynode01" priority="1"/>
<failoverdomainnode name="mynode02" priority="2"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" exclusive="0" name="MYSERVICE" recovery="relocate">
<ip address="192.168.1.239" monitor_link="on" sleeptime="2"/>
<apache config_file="conf/httpd.conf" name="apache" server_root="/etc/httpd" shutdown_wait="0"/>
</service>
</rm>
</cluster>

Thanks to all in advance. The fence_virtd/fence_xvm agent works by using multicast to talk to the VM host. So the off confirmation comes from the hypervisor, not the target. Depending on your setup, you might find better luck with fence_virsh (I have to use this as there is a known multicast issue with Fedora hosts). Can you try, as a test if nothing else, if 'fence_virsh' will work for you? fence_virsh -a <host ip> -l root -p <host root pw> -n <virsh name for target vm> -o status If this works, it should be trivial to add to cluster.conf. If that works, then you have a working fence method. However, I would recommend switching back to fence_xvm if you can. The fence_virsh agent is dependent on libvirtd running, which some consider a risk. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
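[Editorial aside] "Trivial to add to cluster.conf" would look roughly like the sketch below. All values here are placeholders (the host IP, the root password, and the virsh domain name of the VM); the attribute names are the usual ones for fence_virsh.

```xml
<!-- Sketch only: every value is a placeholder.
     "port" is the virsh domain name of the VM being fenced. -->
<clusternode name="mynode02" nodeid="2" votes="1">
    <fence>
        <method name="virsh">
            <device name="virsh-fence" port="mynode02-vm"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice agent="fence_virsh" name="virsh-fence"
                 ipaddr="192.168.1.1" login="root" passwd="secret"/>
</fencedevices>
```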
Re: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine
Ooooh, I'm not sure what option you have then. I suppose fence_virtd/fence_xvm is your best option, but you're going to need to have the admin configure the fence_virtd side. On 01/02/14 03:50 PM, nik600 wrote: My problem is that i don't have root access at host level. On 01/feb/2014 19:49, Digimer li...@alteeve.ca wrote: On 01/02/14 01:35 PM, nik600 wrote: Dear all i need some clarification about clustering with rhel 6.4 i have a cluster with 2 nodes in active/passive configuration, i simply want to have a virtual ip and migrate it between the 2 nodes. i've noticed that if i reboot or manually shut down a node the failover works correctly, but if i power-off one node the cluster doesn't failover on the other node. Another strange situation is that if i power off all the nodes and then switch on only one, the cluster doesn't start on the active node. I've read the manual and documentation at https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html and i understand that the problem is related to fencing, but the problem is that my 2 nodes are on 2 virtual machines; i can't control the hardware and can't issue any custom command on the host side. I've tried to use fence_xvm but i'm not sure about it, because if my VM has powered off, how can it reply to fence_xvm messages? Here are my logs when i power off the VM: == /var/log/cluster/fenced.log == Feb 01 18:50:22 fenced fencing node mynode02 Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: error from agent Feb 01 18:50:53 fenced fence mynode02 failed I've tried to force the manual fence with: fence_ack_manual mynode02 and in this case the failover works properly. The point is: as i'm not using any shared filesystem but i'm only sharing apache with a virtual ip, i won't have any split-brain scenario so i don't need fencing, or not? So, is there the possibility to have a simple dummy fencing? here is my config.xml:

<?xml version="1.0"?>
<cluster config_version="20" name="hacluster">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="0"/>
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="mynode01" nodeid="1" votes="1">
<fence>
<method name="mynode01">
<device domain="mynode01" name="mynode01"/>
</method>
</fence>
</clusternode>
<clusternode name="mynode02" nodeid="2" votes="1">
<fence>
<method name="mynode02">
<device domain="mynode02" name="mynode02"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_xvm" name="mynode01"/>
<fencedevice agent="fence_xvm" name="mynode02"/>
</fencedevices>
<rm log_level="7">
<failoverdomains>
<failoverdomain name="MYSERVICE" nofailback="0" ordered="0" restricted="0">
<failoverdomainnode name="mynode01" priority="1"/>
<failoverdomainnode name="mynode02" priority="2"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" exclusive="0" name="MYSERVICE" recovery="relocate">
<ip address="192.168.1.239" monitor_link="on" sleeptime="2"/>
<apache config_file="conf/httpd.conf" name="apache" server_root="/etc/httpd" shutdown_wait="0"/>
</service>
</rm>
</cluster>

Thanks to all in advance. The fence_virtd/fence_xvm agent works by using multicast to talk to the VM host. So the off confirmation comes from the hypervisor, not the target. Depending on your setup, you might find better luck with fence_virsh (I have to use this as there is a known multicast issue with Fedora hosts). Can you try