Re: [Linux-cluster] Need advice Redhat Clusters

2017-07-30 Thread Digimer
On 2017-07-30 02:03 PM, deepesh kumar wrote:
> I need to set up 2 node HA Active Passive redhat cluster on rhel 6.9.
> 
> Should I start with rgmanager or pacemaker ..
> 
> Do I need Quorum disk ..(mandatory )   and what fence method should I use.
> 
> Thanks to great friends..!!!
> 
> -- 
> DEEPESH KUMAR

Hi Deepesh,

  Note that this channel is deprecated, please use clusterlabs - users
(cc'ed here).

  Use pacemaker, but on RHEL 6 it will need the cman plugin. Only
existing deployments should still be using rgmanager. The fence method
you use will depend on what your nodes are; IPMI is available on most
servers, so fence_ipmilan is very common. Switched PDUs from APC are
also popular, and they use fence_apc_snmp, etc.
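
  If you want to sanity-check a fence device before wiring it into the
cluster, the agents can be run by hand. A rough sketch only (the
addresses, credentials and outlet number below are placeholders, and
exact flags can vary between fence-agents versions):

# query the power state of a peer node over IPMI (lanplus)
fence_ipmilan -a 10.0.0.11 -l admin -p secret -P -o status

# the same idea for a switched APC PDU via SNMP, outlet 3
fence_apc_snmp -a 10.0.0.20 -c private -n 3 -o status

  Once 'status' (and then 'reboot') works reliably from both nodes, the
same parameters go into the cluster configuration.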


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 Errors

2017-07-18 Thread Digimer
On 2017-07-18 07:25 PM, Kristián Feldsam wrote:
> Hello, I see GFS2 errors in the log today and there is nothing about
> them on the net, so I am writing to this mailing list.
> 
> node2 19.07.2017 01:11:55 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-4549568322848002755
> node2 19.07.2017 01:10:56 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-8191295421473926116
> node2 19.07.2017 01:10:48 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-8225402411152149004
> node2 19.07.2017 01:10:47 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-8230186816585019317
> node2 19.07.2017 01:10:45 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-8242007238441787628
> node2 19.07.2017 01:10:39 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-8250926852732428536
> node3 19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete
> nr=-5150933278940354602
> node3 19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
> node3 19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab:
> gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
> 
> Would somebody explain these errors? The cluster looks like it is
> working normally. I enabled vm.zone_reclaim_mode = 1 on the nodes...
> 
> Thank you!

Please post this to the Clusterlabs - Users list. This ML is deprecated.


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] HA cluster 6.5 redhat active passive Error

2017-04-29 Thread Digimer
:Database
> is failed
> Apr 28 21:08:17 12RHAPPTR04V rgmanager[2183]: #13: Service
> service:Database failed to stop cleanly
> Apr 28 21:08:28 12RHAPPTR04V rgmanager[2183]: State change: 12RHAPPTR03V UP
> Apr 28 21:08:46 12RHAPPTR04V kernel: fuse init (API version 7.14)
> Apr 28 21:08:46 12RHAPPTR04V seahorse-daemon[4044]: DNS-SD
> initialization failed: Daemon not running
> Apr 28 21:08:46 12RHAPPTR04V seahorse-daemon[4044]: init gpgme version 1.1.8
> Apr 28 21:08:46 12RHAPPTR04V pulseaudio[4099]: pid.c: Stale PID file,
> overwriting.
> Apr 28 21:09:38 12RHAPPTR04V ricci[4367]: Executing '/usr/bin/virsh
> nodeinfo'
> 
> 
> thanks 
> Deepesh kumar

Hi Deepesh,

  You probably got a notice that the linux-cluster list is deprecated. I
am replying to the new list, clusterlabs. You will want to subscribe and
continue the discussion there, as there are many more people on that list.

  For clvmd, you need to edit lvm.conf and set:

global {
locking_type = 3
fallback_to_clustered_locking = 1
fallback_to_local_locking = 0
}

  This assumes you are not trying to use LVM and clustered LVM at the
same time. If you are, you probably don't want to. If you do anyway,
don't set the fallback variables.

  With this, you then start cman, then start clvmd. With clvmd running,
new VGs default to clustered type. You can override this with 'vgcreate
-c{y,n}'.
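
  As a rough illustration of that ordering (the VG, LV and device names
below are only placeholders):

# run on every node, after lvm.conf has been edited as above
service cman start
service clvmd start

# with clvmd running, new VGs default to clustered; -cy makes it explicit
vgcreate -cy clustered_vg /dev/sdb1
lvcreate -L 10G -n data_lv clustered_vg

# a 'c' in the attribute column confirms the VG is clustered
vgs -o vg_name,vg_attr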

  If you still have trouble, please share your full cluster.conf
(obfuscate passwords, please).

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Rhel 7.2 Pacemaker cluster - gfs2 file system- NFS document

2017-04-28 Thread Digimer
On 28/04/17 06:34 AM, Dawood Munavar S M wrote:
> Hello All,
> 
> Could you please share any links/documents to create NFS HA cluster over
> gfs2 file system using Pacemaker.
> 
> Currently I have completed till mounting of gfs2 file systems on cluster
> nodes and now I need to create cluster resources for NFS server, exports
> and mount on client.
> 
> Thanks,
> Munavar.

I use gfs2 quite a bit, but not nfs.

Can I make a suggestion? Don't use gfs2 for this.

You will have much better performance if you use an active/passive
failover with a non-clustered FS. GFS2, like any cluster FS, needs to
have the cluster handle locks, which is always going to be slower (by a
fair amount) than traditional internal FS locking.

The common NFS HA cluster setup is to have the cluster promote/connect
the backing storage (drbd/iscsi), mount the FS, start nfs and then take
a floating IP address.
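
To give a rough idea only (resource names, device paths and addresses
below are made up, and the agent parameters should be checked against
the resource-agents shipped with RHEL 7.2), an active/passive NFS group
in pcs tends to look something like:

# one group so everything fails over together, in order
pcs resource create nfs_fs Filesystem device=/dev/vg_nfs/lv_nfs \
    directory=/srv/nfs fstype=xfs --group nfs_group
pcs resource create nfs_server nfsserver \
    nfs_shared_infodir=/srv/nfs/nfsinfo --group nfs_group
pcs resource create nfs_export exportfs clientspec=192.168.100.0/24 \
    options=rw,sync,no_root_squash directory=/srv/nfs/data fsid=1 \
    --group nfs_group
pcs resource create nfs_ip IPaddr2 ip=192.168.100.200 cidr_netmask=24 \
    --group nfs_group

Clients mount the floating IP, and pacemaker moves the whole stack to
the peer node on failure.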

GFS2 is an excellent FS for situations where it is needed, and should be
avoided anywhere possible. :)

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Active/passive cluster between physical and VM

2017-03-22 Thread Digimer
On 22/03/17 03:11 AM, Amjad Syed wrote:
> Hello,
> 
> We are planning to build a 2 node Active/passive cluster  using pacemaker.
> Can the cluster be build between one physical and one VM machine in
> Centos 7.x?
> If yes, what can be used as fencing agent ? 

So long as the traffic between the nodes is not molested, it should work
fine. As for fencing, it depends on your hardware and hypervisor...
Using a generic example, you could use fence_ipmilan to fence the
hardware node and fence_virsh to fence a KVM/qemu based VM.
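
Both agents can be tested by hand before adding them to the cluster
configuration; the host, guest name and credentials below are just
placeholders:

# the physical node, via its IPMI/BMC
fence_ipmilan -a 10.0.0.11 -l admin -p secret -o status

# the VM, by asking its KVM host over ssh/libvirt
fence_virsh -a kvm-host.example.com -l root -p secret -n guest-node2 -o status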

PS - I've cc'ed the clusterlabs - users ML. This list is deprecated, so
please switch over to it
(http://lists.clusterlabs.org/mailman/listinfo/users).

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] unable to start mysql as a clustered service, OK stand-alone

2016-08-08 Thread Digimer
Please ask again on the Clusterlabs - Users list. This list is (quite)
deprecated now.

http://clusterlabs.org/mailman/listinfo/users

digimer

On 08/08/16 06:40 PM, berg...@merctech.com wrote:
> I've got a 3-node CentOS6 cluster and I'm trying to add mysql 5.1 as a new 
> service. Other cluster
> services (IP addresses, Postgresql, applications) work fine.
> 
> The mysql config file and data files are located on shared, cluster-wide 
> storage (GPFS).
> 
> On each node, I can successfully start mysql via:
>   service mysqld start
> and via:
>   rg_test test /etc/cluster/cluster.conf start service mysql
> 
> (in each case, the corresponding command with the 'stop' option will also 
> successfully shut down mysql).
> 
> However, attempting to start the mysql service with clusvcadm results in the 
> service failing over
> from one node to the next, and being marked as "stopped" after the last node.
> 
> Each failover happens very quickly, in about 5 seconds. I suspect that 
> rgmanager isn't waiting long
> enough for mysql to start before checking if it is running and I have added 
> startup delays in
> cluster.conf, but they don't seem to be honored. Nothing is written into the 
> mysql log file at this
> time -- no startup or failure messages. The only log entries 
> (/var/log/messages, /var/log/cluster/*,
> etc) reference rgmanager, not the mysql process itself.
> 
> 
> Any suggestions?
> 
> 
> RHCS components:
>   cman-3.0.12.1-78.el6.x86_64
>   luci-0.26.0-78.el6.centos.x86_64
>   rgmanager-3.0.12.1-26.el6_8.3.x86_64
>   ricci-0.16.2-86.el6.x86_64
> 
> 
> - /etc/cluster/cluster.conf (edited) -
> 
> 
> 
>  config_file="/var/lib/pgsql/data/postgresql.conf" name="PostgreSQL8" 
> postmaster_user="postgres" startup_wait="25"/>
> 
>  config_file="/cluster_shared/mysql_centos6/etc/my.cnf" 
> listen_address="192.168.169.173" name="mysql" shutdown_wait="10" 
> startup_wait="30"/>
> 
>  restart_expire_time="180">
> 
> 
> 
> 
> 
> 
> --
> 
> 
> - /var/log/cluster/rgmanager.log from attempt to start 
> mysql with clusvcadm ---
> Aug 08 11:58:16 rgmanager Recovering failed service service:mysql
> Aug 08 11:58:16 rgmanager [ip] Link for eth2: Detected
> Aug 08 11:58:16 rgmanager [ip] Adding IPv4 address 192.168.169.173/24 to eth2
> Aug 08 11:58:16 rgmanager [ip] Pinging addr 192.168.169.173 from dev eth2
> Aug 08 11:58:18 rgmanager [ip] Sending gratuitous ARP: 192.168.169.173 
> c8:1f:66:e8:bb:34 brd ff:ff:ff:ff:ff:ff
> Aug 08 11:58:19 rgmanager [mysql] Verifying Configuration Of mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Verifying Configuration Of mysql:mysql > 
> Succeed
> Aug 08 11:58:19 rgmanager [mysql] Monitoring Service mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Checking Existence Of File 
> /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed
> Aug 08 11:58:19 rgmanager [mysql] Monitoring Service mysql:mysql > Service Is 
> Not Running
> Aug 08 11:58:19 rgmanager [mysql] Starting Service mysql:mysql
> Aug 08 11:58:19 rgmanager [mysql] Looking For IP Address > Succeed -  IP 
> Address Found
> Aug 08 11:58:20 rgmanager [mysql] Starting Service mysql:mysql > Succeed
> Aug 08 11:58:21 rgmanager [mysql] Monitoring Service mysql:mysql
> Aug 08 11:58:21 rgmanager 1 events processed
> Aug 08 11:58:21 rgmanager [mysql] Checking Existence Of File 
> /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed
> Aug 08 11:58:21 rgmanager [mysql] Monitoring Service mysql:mysql > Service Is 
> Not Running
> Aug 08 11:58:21 rgmanager start on mysql "mysql" returned 7 (unspecified)
> Aug 08 11:58:21 rgmanager #68: Failed to start service:mysql; return value: 1
> Aug 08 11:58:21 rgmanager Stopping service service:mysql
> Aug 08 11:58:21 rgmanager [mysql] Verifying Configuration Of mysql:mysql
> Aug 08 11:58:21 rgmanager [mysql] Verifying Configuration Of mysql:mysql > 
> Succeed
> Aug 08 11:58:21 rgmanager [mysql] Stopping Service mysql:mysql
> Aug 08 11:58:21 rgmanager [mysql] Checking Existence Of File 
> /var/run/cluster/mysql/mysql:mysql.pid [mysql:mysql] > Failed - File Doesn't 
> Exist
> Aug 08 11:58:21 rgmanager [mysql] Stopping Service mysql:mysql > Su

Re: [Linux-cluster] Fencing Question

2016-06-06 Thread Digimer
On 06/06/16 05:37 PM, Andrew Kerber wrote:
> I am doing some experimentation with Linux clustering, and still fairly
> new on it. I have built a cluster as a proof of concept running a
> PostgreSQL 9.5 database on gfs2 using VMware workstation 12.0 and
> RHEL7.  GFS2 requires a fencing resource, which I have managed to create
> using fence_virsh.  And the clustering software thinks the fencing is
> working.  However, it will not actually shut down a node, and I have not
> been able to figure out the appropriate parameters for VMware
> workstation to get it to work.  I tried fence-scsi also, but that doesnt
> seem to work with a shared vmdk,   Has anyone figured out a fencing
> agent that will work with VMware workstation?
> 
> Failing that, is there a comprehensive set of instructions for creating
> my own fencing agent?
> 
> 
> -- 
> Andrew W. Kerber
> 
> 'If at first you dont succeed, dont take up skydiving.'

The 'fence_vmware' agent (and its helpers) is designed specifically for
VMware. I've not used it myself, but I've heard of many people using it
successfully.
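
If you want to try it, you can exercise the agent directly first. The
vCenter/ESXi address, credentials and VM name below are placeholders,
and option names can differ between agent versions:

# ask vCenter/ESXi for the power state of the guest
fence_vmware_soap --ip vcenter.example.com --username fenceuser \
    --password secret --plug cluster-node1 --ssl --action status
# (some agent versions also need --ssl-insecure for a
#  self-signed certificate)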

Side note;

GFS2, for all its greatness, is not fast (nothing using cluster locking
will be). Be sure to performance test before production. If you find the
performance is not good, consider active/passive on a standard FS.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Copying the result continuously.

2016-03-10 Thread Digimer
On 10/03/16 10:55 PM, Elsaid Younes wrote:
> 
> Hi all,
> 
> I wish to be able to run a long simulation through the gromacs program,
> using MPI. I want to modify the input data after every sub-task.
> I think that is the meaning of the following code, which is part of my script.
> 
> cat <<EOF > copyfile.sh
> #!/bin/sh
> cp -p result*.dat $SLURM_SUBMIT_DIR
> EOF
> chmod u+x copyfile.sh
> srun -n $SLURM_NNODES -N $SLURM_NNODES cp copyfile.sh $SNIC_TMP
> 
> And I have to srun copyfile.sh at the end of every processor.
> 
> srun -n $SLURM_NNODES -N $SLURM_NNODES copyfile.sh
> 
> Is there something wrong? I need to know what the meaning of result* is.
> 
> Thanks in advance,
> /Elsaid

Hi Elsaid,

  Your question is on-topic for here, so I hope someone here might be
able to help you.

  Do note, though, that most discussion here is related to availability
clustering; HPC clustering is only lightly represented. So if you can
think of other places to ask as well as here, you might want to
cross-post your questions to those other lists.

cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] CMAN Failed to start on Secondary Node

2016-03-05 Thread Digimer
Working fencing is required. The rgmanager component waits for a
successful fence message before beginning recovery (to prevent
split-brains).

On 05/03/16 04:47 AM, Shreekant Jena wrote:
> secondary node
> 
> --
> [root@Node2 ~]# cat /etc/cluster/cluster.conf
> 
> 
>  post_join_delay="3"/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  restricted="1">
>  priority="1"/>
>  priority="1"/>
> 
> 
> 
> 
> 
>  name="PE51SPM1">
>  force_fsck="1" force_unmount="1" fsid="3446" fstype="ext3"
> mountpoint="/SPIM/admin" name="admin" options="" self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="17646" fstype="ext3"
> mountpoint="/flatfile_upload" name="flatfile_upload" options=""
> self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="64480" fstype="ext3"
> mountpoint="/oracle" name="oracle" options="" self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="60560" fstype="ext3"
> mountpoint="/SPIM/datafile_01" name="datafile_01" options=""
> self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="48426" fstype="ext3"
> mountpoint="/SPIM/datafile_02" name="datafile_02" options=""
> self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="54326" fstype="ext3"
> mountpoint="/SPIM/redolog_01" name="redolog_01" options="" self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="23041" fstype="ext3"
> mountpoint="/SPIM/redolog_02" name="redolog_02" options="" self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="46362" fstype="ext3"
> mountpoint="/SPIM/redolog_03" name="redolog_03" options="" self_fence="1"/>
>  force_fsck="1" force_unmount="1" fsid="58431" fstype="ext3"
> mountpoint="/SPIM/archives_01" name="archives_01" options=""
> self_fence="1"/>
> 
>         
> 
> 
> 
> 
> 
> [root@Node2 ~]# clustat
> msg_open: Invalid argument
> Member Status: Inquorate
> 
> Resource Group Manager not running; no service information available.
> 
> Membership information not available
> 
> 
> 
> Primary Node
> 
> -
> [root@Node1 ~]# clustat
> Member Status: Quorate
> 
>   Member Name  Status
>   --   --
>   Node1 Online, Local, rgmanager
>   Node2 Offline
> 
>   Service Name Owner (Last)   State
>   ---  - --   -
>   Package1 Node1started
> 
> 
> On Sat, Mar 5, 2016 at 12:17 PM, Digimer <li...@alteeve.ca
> <mailto:li...@alteeve.ca>> wrote:
> 
> Please share your cluster.conf (only obfuscate passwords please) and the
> output of 'clustat' from each node.
> 
> digimer
> 
> On 05/03/16 01:46 AM, Shreekant Jena wrote:
> > Dear All,
> >
> > I have a 2 node cluster but after reboot secondary node is showing
> > offline . And cman failed to start .
> >
> > Please find below logs on secondary node:-
> >
> > root@EI51SPM1 cluster]# clustat
> > msg_open: Invalid argument
> > Member Status: Inquorate
> >
> > Resource Group Manager not running; no service information available.
> >
> > Me

Re: [Linux-cluster] CMAN Failed to start on Secondary Node

2016-03-04 Thread Digimer
Please share your cluster.conf (only obfuscate passwords please) and the
output of 'clustat' from each node.

digimer

On 05/03/16 01:46 AM, Shreekant Jena wrote:
> Dear All,
> 
> I have a 2 node cluster but after reboot secondary node is showing
> offline . And cman failed to start .
> 
> Please find below logs on secondary node:-
> 
> root@EI51SPM1 cluster]# clustat
> msg_open: Invalid argument
> Member Status: Inquorate
> 
> Resource Group Manager not running; no service information available.
> 
> Membership information not available
> [root@EI51SPM1 cluster]# tail -10 /var/log/messages
> Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect:
> Connection refused
> Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
> Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
> connection.
> Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect:
> Connection refused
> Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
> connection.
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect:
> Connection refused
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
> connection.
> Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect:
> Connection refused
> Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
> [root@EI51SPM1 cluster]#
> [root@EI51SPM1 cluster]# cman_tool status
> Protocol version: 5.0.1
> Config version: 166
> Cluster name: IVRS_DB
> Cluster ID: 9982
> Cluster Member: No
> Membership state: Joining
> [root@EI51SPM1 cluster]# cman_tool nodes
> Node  Votes Exp Sts  Name
> [root@EI51SPM1 cluster]#
> [root@EI51SPM1 cluster]#
> 
> 
> Thanks & regards 
> SHREEKANTA JENA
> 
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS2 mount hangs for some disks

2016-01-05 Thread Digimer
Can you re-ask this on the clusterlabs user's list? This list is being
phased out.

http://clusterlabs.org/mailman/listinfo/users

digimer

On 05/01/16 01:37 PM, B.Baransel BAĞCI wrote:
> Hi list,
> 
> I have some problems with GFS2 with failed nodes. After one of the
> cluster nodes fenced and rebooted, it cannot mount some of the gfs2 file
> systems but hangs on the mount operation. No output. I've waited nearly
> 10 minutes to mount single disk but it didn't respond. Only solution is
> to shutdown all nodes and clean start of the cluster. I'm suspecting
> journal size or file system quotas.
> 
> I have 8-node rhel-6 cluster with GFS2 formatted disks which are all
> mounted by all nodes.
> There are two types of disk:
> Type A :
> ~50 GB disk capacity
> 8 journal with size 512MB
> block-size: 1024
> very small files (Avg: 50 byte - sym.links)
> ~500.000 file (inode)
> Usage: 10%
> Nearly no write IO (under 1000 file per day)
> No user quota (quota=off)
> Mount options: async,quota=off,nodiratime,noatime
> 
> Type B :
> ~1 TB disk capacity
> 8 journal with size 512MB
> block-size: 4096
> relatively small files (Avg: 20 KB)
> ~5.000.000 file (inode)
> Usage: 20%
> write IO ~50.000 file per day
> user quota is on (some of the users exceeded quota)
> Mount options: async,quota=on,nodiratime,noatime
> 
> To improve performance, I set journal size to 512 MB instead of 128 MB
> default. All disk are connected with fiber from SAN Storage. All disk on
> cluster LVM. All nodes connected to each other with private Gb-switch.
> 
> For example, after "node5" failed and fenced, it can re-enter the
> cluster. When i try "service gfs2 start", it can mount "Type A" disks,
> but hangs on the first "Type B" disk. The logs hang on the "Trying to join
> cluster lock_dlm" message:
> 
> ...
> Jan 05 00:01:52 node5 lvm[4090]: Found volume group "VG_of_TYPE_A"
> Jan 05 00:01:52 node5 lvm[4119]: Activated 2 logical volumes in
> volume group VG_of_TYPE_A
> Jan 05 00:01:52 node5 lvm[4119]: 2 logical volume(s) in volume group
> "VG_of_TYPE_A" now active
> Jan 05 00:01:52 node5 lvm[4119]: Wiping internal VG cache
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Starting up
> Jan 05 00:02:26 node5 kernel: Slow work thread pool: Ready
> Jan 05 00:02:26 node5 kernel: GFS2 (built Dec 12 2014 16:06:57)
> installed
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeA1"
> Jan 05 00:02:26 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: Joined
> cluster. Now mounting FS...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5,
> already locked for use
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5:
> Looking at journal...
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA1.5: jid=5: Done
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeA2"
> Jan 05 00:02:27 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: Joined
> cluster. Now mounting FS...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5,
> already locked for use
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5:
> Looking at journal...
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=TESTCLS:typeA2.5: jid=5: Done
> Jan 05 00:02:28 node5 kernel: GFS2: fsid=: Trying to join cluster
> "lock_dlm", "TESTCLS:typeB1"
> 
> 
> I've waited nearly 10 minutes in this state without respond or log. In
> this state, I cannot do `ls` in another nodes for this file system. Any
> idea of the cause of the problem? How is the cluster affected by journal
> size or count?


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies

2015-12-04 Thread Digimer
On 04/12/15 09:14 AM, Kelvin Edmison wrote:
> 
> 
> On 12/03/2015 09:31 PM, Digimer wrote:
>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>> I am hoping that someone can help me understand the problems I'm
>>>>> having
>>>>> with linux clustering for VMs.
>>>>>
>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure
>>>>> that a
>>>>> service is always available.  The hosts and guests are both RHEL 6.7.
>>>>> The goal is to have only one of the two VMs running at a time.
>>>>>
>>>>> The configuration works when we test/simulate VM deaths and
>>>>> graceful VM
>>>>> host shutdowns, and administrative switchovers (i.e. clusvcadm -r ).
>>>>>
>>>>> However, when we simulate the sudden isolation of host A (e.g. ifdown
>>>>> eth0), two things happen
>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors
>>>>> appear
>>>>> in the logs on host B
>>>>> 2) when the 'failed' node is returned to service, the cman service on
>>>>> host B dies.
>>>> If the node's host is dead, then there is no way for the survivor to
>>>> determine the state of the lost VM node. The cluster is not allowed to
>>>> take "no answer" as confirmation of fence success.
>>>>
>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup
>>>> method where, if fence_xvm fails, it moves on and reboots the host
>>>> itself.
>>> Thank you for the suggestion.  The hosts do have ipmi.  I'll explore it
>>> but I'm a little concerned about what it means for the other
>>> non-clustered VM workloads that exist on these two servers.
>>>
>>> Do you have any thoughts as to why host B's cman process is dying when
>>> 'host A' returns?
>>>
>>> Thanks,
>>>Kelvin
>> It's not dieing, it's blocking. When a node is lost, dlm blocks until
>> fenced tells it that the fence was successful. If fenced can't contact
>> the lost node's fence method(s), then it doesn't succeed and dlm stays
>> blocked. To anything that uses DLM, like rgmanager, it appears like the
>> host is hung but it is by design. The logic is that, as bad as it is to
>> hang, it's better than risking a split-brain.
> when I said the cman service is dying, I should have further qualified
> it. I mean that the corosync process is no longer running (ps -ef | grep
> corosync does not show it)  and after recovering the failed host A,
> manual intervention (service cman start) was required on host B to
> recover full cluster services.
> 
> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do
> printf "%-12s   " $SERVICE; service $SERVICE status; done
> ricci  ricci (pid  5469) is running...
> fence_virtdfence_virtd (pid  4862) is running...
> cman   Found stale pid file
> rgmanager  rgmanager (pid  5366) is running...
> 
> 
> Thanks,
>   Kelvin

Oh now that is interesting...

You'll want input from Fabio, Chrissie or one of the other core devs, I
suspect.

If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and
if you can reproduce it reliably, can you create a new thread with the
reproducer?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies

2015-12-04 Thread Digimer
On 04/12/15 01:52 PM, Kelvin Edmison wrote:
> 
> 
> On 12/04/2015 12:49 PM, Digimer wrote:
>> On 04/12/15 09:14 AM, Kelvin Edmison wrote:
>>>
>>> On 12/03/2015 09:31 PM, Digimer wrote:
>>>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>>>> I am hoping that someone can help me understand the problems I'm
>>>>>>> having
>>>>>>> with linux clustering for VMs.
>>>>>>>
>>>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure
>>>>>>> that a
>>>>>>> service is always available.  The hosts and guests are both RHEL
>>>>>>> 6.7.
>>>>>>> The goal is to have only one of the two VMs running at a time.
>>>>>>>
>>>>>>> The configuration works when we test/simulate VM deaths and
>>>>>>> graceful VM
>>>>>>> host shutdowns, and administrative switchovers (i.e. clusvcadm -r ).
>>>>>>>
>>>>>>> However, when we simulate the sudden isolation of host A (e.g.
>>>>>>> ifdown
>>>>>>> eth0), two things happen
>>>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors
>>>>>>> appear
>>>>>>> in the logs on host B
>>>>>>> 2) when the 'failed' node is returned to service, the cman
>>>>>>> service on
>>>>>>> host B dies.
>>>>>> If the node's host is dead, then there is no way for the survivor to
>>>>>> determine the state of the lost VM node. The cluster is not
>>>>>> allowed to
>>>>>> take "no answer" as confirmation of fence success.
>>>>>>
>>>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup
>>>>>> method where, if fence_xvm fails, it moves on and reboots the host
>>>>>> itself.
>>>>> Thank you for the suggestion.  The hosts do have ipmi.  I'll
>>>>> explore it
>>>>> but I'm a little concerned about what it means for the other
>>>>> non-clustered VM workloads that exist on these two servers.
>>>>>
>>>>> Do you have any thoughts as to why host B's cman process is dying when
>>>>> 'host A' returns?
>>>>>
>>>>> Thanks,
>>>>> Kelvin
>>>> It's not dieing, it's blocking. When a node is lost, dlm blocks until
>>>> fenced tells it that the fence was successful. If fenced can't contact
>>>> the lost node's fence method(s), then it doesn't succeed and dlm stays
>>>> blocked. To anything that uses DLM, like rgmanager, it appears like the
>>>> host is hung but it is by design. The logic is that, as bad as it is to
>>>> hang, it's better than risking a split-brain.
>>> when I said the cman service is dying, I should have further qualified
>>> it. I mean that the corosync process is no longer running (ps -ef | grep
>>> corosync does not show it)  and after recovering the failed host A,
>>> manual intervention (service cman start) was required on host B to
>>> recover full cluster services.
>>>
>>> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do
>>> printf "%-12s   " $SERVICE; service $SERVICE status; done
>>> ricci  ricci (pid  5469) is running...
>>> fence_virtdfence_virtd (pid  4862) is running...
>>> cman   Found stale pid file
>>> rgmanager  rgmanager (pid  5366) is running...
>>>
>>>
>>> Thanks,
>>>Kelvin
>> Oh now that is interesting...
>>
>> You'll want input from Fabio, Chrissie or one of the other core devs, I
>> suspect.
>>
>> If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and
>> if you can reproduce it reliably, can you create a new thread with the
>> reproducer?
> It's RHEL proper in both host and guest, and we can reproduce it reliably.

Excellent!

Please reply here with the rhbz#. I'm keen to see what comes of it.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies

2015-12-03 Thread Digimer
On 03/12/15 02:19 PM, Kelvin Edmison wrote:
> 
> I am hoping that someone can help me understand the problems I'm having
> with linux clustering for VMs.
> 
> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a
> service is always available.  The hosts and guests are both RHEL 6.7.
> The goal is to have only one of the two VMs running at a time.
> 
> The configuration works when we test/simulate VM deaths and graceful VM
> host shutdowns, and administrative switchovers (i.e. clusvcadm -r ).
> 
> However, when we simulate the sudden isolation of host A (e.g. ifdown
> eth0), two things happen
> 1) the VM on host B does not start, and repeated fence_xvm errors appear
> in the logs on host B
> 2) when the 'failed' node is returned to service, the cman service on
> host B dies.

If the node's host is dead, then there is no way for the survivor to
determine the state of the lost VM node. The cluster is not allowed to
take "no answer" as confirmation of fence success.

If your hosts have IPMI, then you could add fence_ipmilan as a backup
method where, if fence_xvm fails, it moves on and reboots the host itself.
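
Both layers can be checked by hand before relying on them; the guest
domain name and IPMI details below are placeholders:

# first layer: can the surviving host reach fence_virtd and see the guest?
fence_xvm -o list
fence_xvm -H guest-node1 -o status

# backup layer: can it power-cycle the peer host itself over IPMI?
fence_ipmilan -a 10.0.1.11 -l admin -p secret -o status

In cluster.conf the two are then listed as separate methods for the
node, in order, so fence_xvm is tried first and fence_ipmilan only if
it fails.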

> This is my cluster.conf file (some elisions re: hostnames)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  key_file="/etc/cluster/fence_xvm_hostA.key"
> multicast_address="239.255.1.10" name="virtfence1"/>
>  key_file="/etc/cluster/fence_xvm_hostB.key"
> multicast_address="239.255.2.10" name="virtfence2"/>
> 
> 
> 
> 
>  use_virsh="1"/>
> 
> 
> 
> 
> 
> Thanks for any help you can offer,
>   Kelvin Edmison
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies

2015-12-03 Thread Digimer
On 03/12/15 08:39 PM, Kelvin Edmison wrote:
> On 12/03/2015 06:14 PM, Digimer wrote:
>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>> I am hoping that someone can help me understand the problems I'm having
>>> with linux clustering for VMs.
>>>
>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a
>>> service is always available.  The hosts and guests are both RHEL 6.7.
>>> The goal is to have only one of the two VMs running at a time.
>>>
>>> The configuration works when we test/simulate VM deaths and graceful VM
>>> host shutdowns, and administrative switchovers (i.e. clusvcadm -r ).
>>>
>>> However, when we simulate the sudden isolation of host A (e.g. ifdown
>>> eth0), two things happen
>>> 1) the VM on host B does not start, and repeated fence_xvm errors appear
>>> in the logs on host B
>>> 2) when the 'failed' node is returned to service, the cman service on
>>> host B dies.
>> If the node's host is dead, then there is no way for the survivor to
>> determine the state of the lost VM node. The cluster is not allowed to
>> take "no answer" as confirmation of fence success.
>>
>> If your hosts have IPMI, then you could add fence_ipmilan as a backup
>> method where, if fence_xvm fails, it moves on and reboots the host
>> itself.
> 
> Thank you for the suggestion.  The hosts do have ipmi.  I'll explore it
> but I'm a little concerned about what it means for the other
> non-clustered VM workloads that exist on these two servers.
> 
> Do you have any thoughts as to why host B's cman process is dying when
> 'host A' returns?
> 
> Thanks,
>   Kelvin

It's not dying, it's blocking. When a node is lost, dlm blocks until
fenced tells it that the fence was successful. If fenced can't contact
the lost node's fence method(s), then it doesn't succeed and dlm stays
blocked. To anything that uses DLM, like rgmanager, it appears as though
the host is hung, but that is by design. The logic is that, as bad as it
is to hang, it's better than risking a split-brain.

As for what will happen to non-cluster services, well, if I can be
blunt, you shouldn't mix the two. If something is important enough to
make HA, then it is important enough for dedicated hardware in my opinion.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Alternative to resource monitor polling?

2015-10-15 Thread Digimer
I would ask on the Cluster Labs mailing list; either the Users or the Developers list.

digimer

On 15/10/15 03:42 PM, Vallevand, Mark K wrote:
> Is this the correct forum for questions like this?
> 
>  
> 
> Ubuntu 12.04 LTS
> 
> pacemaker 1.1.10
> 
> cman 3.1.7
> 
> corosync 1.4.6
> 
>  
> 
> One more question:
> 
> If my cluster has no resources, it seems like it takes 20s for a stopped
> node to be detected.  Is the value really 20s and is it a parameter that
> can be adjusted?
> 
>  
> 
> Thanks.
> 
>  
> 
> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com
> <mailto:mark.vallev...@unisys.com>
> Never try and teach a pig to sing: it's a waste of time, and it annoys
> the pig.
> 
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is thus for use only by the intended recipient. If you
> received this in error, please contact the sender and delete the e-mail
> and its attachments from all computers.
> 
> *From:*linux-cluster-boun...@redhat.com
> [mailto:linux-cluster-boun...@redhat.com] *On Behalf Of *Vallevand, Mark K
> *Sent:* Thursday, October 15, 2015 12:19 PM
> *To:* linux clustering
> *Subject:* [Linux-cluster] Alternative to resource monitor polling?
> 
>  
> 
> Is there an alternative to resource monitor polling to detect a resource
> failure?
> 
> If, for example, a resource failure is detected by our own software,
> could it signal clustering that a resource has failed?
> 
>  
> 
> Regards.
> Mark K Vallevand   mark.vallev...@unisys.com
> <mailto:mark.vallev...@unisys.com>
> Never try and teach a pig to sing: it's a waste of time, and it annoys
> the pig.
> 
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is thus for use only by the intended recipient. If you
> received this in error, please contact the sender and delete the e-mail
> and its attachments from all computers.
> 
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Linux cluster] DLM not start

2015-09-02 Thread Digimer
On 02/09/15 09:58 AM, Nguyễn Trường Sơn wrote:
> How can i use fencing?
> 
> Do you mean "pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld
> op monitor interval=60s on-fail=fence"
> 
> 
> It is still error.
> 
> I have Centos 7.0, with pacemaker-1.1.12-22.el7_1.2.x86_64

Fencing is a process where a lost node is removed from the cluster,
usually by rebooting it with IPMI, cutting power using a switched PDU,
etc. How exactly you do fencing depends on your environment and the
fence devices you have available.

DLM requires working fencing.
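
On CentOS 7 with pcs, the stonith devices are usually created before the
dlm resource. A minimal sketch assuming IPMI, with all addresses and
credentials as placeholders (parameter names may differ slightly between
fence-agents versions):

# one fence device per node
pcs stonith create fence_node1 fence_ipmilan ipaddr=10.0.0.11 \
    login=admin passwd=secret lanplus=1 pcmk_host_list=node1 \
    op monitor interval=60s
pcs stonith create fence_node2 fence_ipmilan ipaddr=10.0.0.12 \
    login=admin passwd=secret lanplus=1 pcmk_host_list=node2 \
    op monitor interval=60s

# stonith must be enabled for on-fail=fence to have any effect
pcs property set stonith-enabled=true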

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] new cluster setup error

2015-06-30 Thread Digimer
On 30/06/15 10:51 PM, Megan . wrote:
 Good Evening!
 
 Anyone seen this before?  I just setup these boxes and i'm trying to
 create a new cluster.  I set the ricci password on all of the nodes,
 started ricci.  I try to create cluster and i get the below.
 
 Thanks!
 
 
 Centos 6.6
  2.6.32-504.23.4.el6.x86_64 
 
 ccs-0.16.2-75.el6_6.2.x86_64
 ricci-0.16.2-75.el6_6.1.x86_64
 cman-3.0.12.1-68.el6.x86_64
 
 [root@admin1-dit cluster]# ccs --createcluster test
 
 Traceback (most recent call last):
   File "/usr/sbin/ccs", line 2450, in <module>
     main(sys.argv[1:])
   File "/usr/sbin/ccs", line 286, in main
     if (createcluster): create_cluster(clustername)
   File "/usr/sbin/ccs", line 939, in create_cluster
     elif get_cluster_conf_xml() != f.read():
   File "/usr/sbin/ccs", line 884, in get_cluster_conf_xml
     xml = send_ricci_command(cluster, "get_cluster.conf")
   File "/usr/sbin/ccs", line 2340, in send_ricci_command
     dom = minidom.parseString(res[1].replace('\t',''))
   File "/usr/lib64/python2.6/xml/dom/minidom.py", line 1928, in parseString
     return expatbuilder.parseString(string)
   File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
     return builder.parseString(string)
   File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
     parser.Parse(string, True)
 xml.parsers.expat.ExpatError: no element found: line 1, column 0

Are the ricci and modclusterd daemons running? Does your firewall allow
TCP ports 11111 and 16851 between nodes? Does the file
/etc/cluster/cluster.conf exist and, if so, does 'ls -lahZ' show:

-rw-r-----. root root system_u:object_r:cluster_conf_t:s0
/etc/cluster/cluster.conf
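
For reference, a quick way to check those pieces on each node (the
iptables lines assume the stock EL6 firewall):

# both daemons must be running on every node
service ricci status
service modclusterd status

# open the management ports if the firewall is blocking them
iptables -I INPUT -p tcp --dport 11111 -j ACCEPT   # ricci
iptables -I INPUT -p tcp --dport 16851 -j ACCEPT   # modclusterd
service iptables save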

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2-utils 3.1.8 released

2015-04-07 Thread Digimer
Hi Andrew,

  Congrats!!

  Want to add the cluster labs mailing list to your list of release
announcement locations?

digimer

On 07/04/15 01:03 PM, Andrew Price wrote:
 Hi,
 
 I am happy to announce the 3.1.8 release of gfs2-utils. This release
 includes the following visible changes:
 
   * Performance improvements in fsck.gfs2, mkfs.gfs2 and gfs2_edit
 savemeta.
   * Better checking of journals, the jindex, system inodes and inode
 'goal' values in fsck.gfs2
   * gfs2_jadd and gfs2_grow are now separate programs instead of
 symlinks to mkfs.gfs2.
   * Improved test suite and related documentation.
   * No longer clobbers the configure script's --sbindir option.
   * No longer depends on perl.
   * Various minor bug fixes and enhancements.
 
 See below for a complete list of changes. The source tarball is
 available from:
   https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.8.tar.gz
 
 Please test, and report bugs against the gfs2-utils component of Fedora
 rawhide:
 
 https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide
 
 
 Regards,
 Andy
 
 Changes since version 3.1.7:
 
 Abhi Das (6):
   fsck.gfs2: fix broken i_goal values in inodes
   gfs2_convert: use correct i_goal values instead of zeros for inodes
   tests: test for incorrect inode i_goal values
   mkfs.gfs2: addendum to fix broken i_goal values in inodes
   gfs2_utils: more gfs2_convert i_goal fixes
   gfs2-utils: more fsck.gfs2 i_goal fixes
 
 Andrew Price (58):
   gfs2-utils tests: Build unit tests with consistent cpp flags
   libgfs2: Move old rgrp layout functions into fsck.gfs2
   gfs2-utils build: Add test coverage option
   fsck.gfs2: Fix memory leak in pass2
   gfs2_convert: Fix potential memory leaks in adjust_inode
   gfs2_edit: Fix signed value used as array index in print_ld_blks
   gfs2_edit: Set umask before calling mkstemp in savemetaopen()
   gfs2_edit: Fix use-after-free in find_wrap_pt
   libgfs2: Clean up broken rgrp length check
   libgfs2: Remove superfluous NULL check from gfs2_rgrp_free
   libgfs2: Fail fd comparison if the fds are negative
   libgfs2: Fix check for O_RDONLY
   fsck.gfs2: Remove dead code from scan_inode_list
   mkfs.gfs2: Terminate lockproto and locktable strings explicitly
   libgfs2: Add generic field assignment and print functions
   gfs2_edit: Use metadata description to print and assign fields
   gfs2l: Switch to lgfs2_field_assign
   libgfs2: Remove device_name from struct gfs2_sbd
   libgfs2: Remove path_name from struct gfs2_sbd
   libgfs2: metafs_path improvements
   gfs2_grow: Don't use PATH_MAX in main_grow
   gfs2_jadd: Don't use fixed size buffers for paths
   libgfs2: Remove orig_journals from struct gfs2_sbd
   gfs2l: Check unchecked returns in openfs
   gfs2-utils configure: Fix exit with failure condition
   gfs2-utils configure: Remove checks for non-existent -W flags
   gfs2_convert: Don't use a fixed sized buffer for device path
   gfs2_edit: Add bounds checking for the journalN keyword
   libgfs2: Make find_good_lh and jhead_scan static
   Build gfs2_grow, gfs2_jadd and mkfs.gfs2 separately
   gfs2-utils: Honour --sbindir
   gfs2-utils configure: Use AC_HELP_STRING in help messages
   fsck.gfs2: Improve reporting of pass timings
   mkfs.gfs2: Revert default resource group size
   gfs2-utils tests: Add keywords to tests
   gfs2-utils tests: Shorten TESTSUITEFLAGS to TOPTS
   gfs2-utils tests: Improve docs
   gfs2-utils tests: Skip unit tests if check is not found
   gfs2-utils tests: Document usage of convenience macros
   fsck.gfs2: Fix 'initializer element is not constant' build error
   fsck.gfs2: Simplify bad_journalname
   gfs2-utils build: Add a configure script summary
   mkfs.gfs2: Remove unused declarations
   gfs2-utils/tests: Fix unit tests for older check libraries
   fsck.gfs2: Fix memory leaks in pass1_process_rgrp
   libgfs2: Use the correct parent for rgrp tree insertion
   libgfs2: Remove some obsolete function declarations
   gfs2-utils: Move metafs handling into gfs2/mkfs/
   gfs2_grow/jadd: Use a matching context mount option in
 mount_gfs2_meta
   gfs2_edit savemeta: Don't read rgrps twice
   fsck.gfs2: Fetch directory inodes early in pass2()
   libgfs2: Remove some unused data structures
   gfs2-utils: Tidy up Makefile.am files
   gfs2-utils build: Remove superfluous passive header checks
   gfs2-utils: Consolidate some bad constants strings
   gfs2-utils: Update translation template
   libgfs2: Fix potential NULL deref in linked_leaf_search()
   gfs2_grow: Put back the definition of FALLOC_FL_KEEP_SIZE
 
 Bob Peterson (15):
   fsck.gfs2: Detect and correct corrupt journals
   fsck.gfs2: Change basic dentry checks for too long of file names

Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes

2015-01-30 Thread Digimer
Do the logs show whether the fence succeeded or failed? Can you please post
the logs from the two surviving nodes, starting just before the failure
until a few minutes after?


digimer

On 31/01/15 12:10 AM, cluster lab wrote:

Some more information:

The cluster is a three-node cluster.
One of its nodes (ID == 1) was fenced because of a network failure ...

After the fence, this problem appeared ...


On Sat, Jan 31, 2015 at 8:28 AM, cluster lab cluster.l...@gmail.com wrote:

Hi,

There isn't any unusual state or message,
and the GFS logs (gfs, dlm) are silent ...

Is there any chance to find the source of the problem?

On Thu, Jan 29, 2015 at 7:04 PM, Bob Peterson rpete...@redhat.com wrote:

- Original Message -

On affected node:

stat FILE | grep Inode
stat: cannot stat `FILE': Input/output error

On other node:
stat PublicDNS1-OS.qcow2 | grep Inode
Device: fd06h/64774d    Inode: 267858  Links: 1


Something funky going on.
I'd check dmesg for withdraw messages, etc., on the affected node.

Regards,

Bob Peterson
Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes

2015-01-30 Thread Digimer

On 31/01/15 01:52 AM, cluster lab wrote:

Jan 21 17:07:43 ost-pvm2 fenced[47840]: fencing node ost-pvm1


There are no messages about this succeeding or failing... It looks like
only 15 seconds' worth of logs. Can you please share the full amount of
time I mentioned before, from both nodes?


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS2: Could not open the file on one of the nodes

2015-01-28 Thread Digimer
That looks OK. Can you touch a file from one node and see it on the 
other and vice-versa? Is there anything in either node's log files when 
you run 'qemu-img check'?


On 29/01/15 12:34 AM, cluster lab wrote:

Node2: # dlm_tool ls
dlm lockspaces
name  VMStorage3
id0xb26438a2
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 1,1
members   1 2

name  VMStorage2
id0xab7f09e3
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 1,1
members   1 2

name  VMStorage1
id0x80525a20
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 1,1
members   1 2
===
Node1: # dlm_tool ls
dlm lockspaces
name  VMStorage3
id0xb26438a2
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 2,2
members   1 2

name  VMStorage2
id0xab7f09e3
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 2,2
members   1 2

name  VMStorage1
id0x80525a20
flags 0x0008 fs_reg
changemember 2 joined 1 remove 0 failed 0 seq 2,2
members   1 2

On Thu, Jan 29, 2015 at 8:57 AM, Digimer li...@alteeve.ca wrote:

On 28/01/15 11:50 PM, cluster lab wrote:


Hi,

In a two node cluster, I received two different results from qemu-img
check on just one file:

node1 # qemu-img check VMStorage/x.qcow2
No errors were found on the image.

Node2 # qemu-img check VMStorage/x.qcow2
qemu-img: Could not open 'VMStorage/x.qcow2

All other files are OK, and the cluster works properly.
What is the problem?


Packages:
kernel: 2.6.32-431.5.1.el6.x86_64
GFS2: gfs2-utils-3.0.12.1-23.el6.x86_64
corosync: corosync-1.4.1-17.el6.x86_64

Best Regards



What does 'dlm_tool ls' show?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] HA Summit Key-signing Party

2015-01-26 Thread Digimer

On 26/01/15 09:14 AM, Jan Pokorný wrote:

Hello cluster masters,

On 13/01/15 00:31 -0500, Digimer wrote:

Any concerns/comments/suggestions, please speak up ASAP!


I'd like to throw a key-signing party as it will be a perfect
opportunity to build a web of trust amongst us.

If you haven't incorporated OpenPGP to your communication with the
world yet, I would recommend at least considering it, even more in
the post-Snowden era.  You can use it to prove authenticity/integrity
of the data you emit (signing; not just for email as is the case
with this one, but also for SW releases and more), provide
privacy/confidentiality of interchanged data (encryption; again,
typical scenario is a private email, e.g., when you responsibly
report a vulnerability to the respective maintainers), or both.

In case you have no experience with this technology, there are
plentiful resources on GnuPG (most renowned FOSS implementation):
- https://www.gnupg.org/documentation/howtos.en.html
- http://cryptnet.net/fdp/crypto/keysigning_party/en/keysigning_party.html#prep
   (preparation steps for a key-signing party)
- ...

To make the verification process as smooth and as little
time-consuming as possible, I would stick with a list-based method:
http://cryptnet.net/fdp/crypto/keysigning_party/en/keysigning_party.html#list_based
and volunteer for a role of a coordinator.


What's needed?
Once you have a key pair (and provided that you are using GnuPG), please
run the following sequence:

 # figure out the key ID for the identity to be verified;
 # IDENTITY is either your associated email address/your name
 # if only single key ID matches, specific key otherwise
 # (you can use gpg -K to select a desired ID at the sec line)
 KEY=$(gpg --with-colons 'IDENTITY' | grep '^pub' | cut -d: -f5)

 # export the public key to a file that is suitable for exchange
 gpg --export -a -- $KEY > $KEY

 # verify that you have an expected data to share
 gpg --with-fingerprint -- $KEY

with IDENTITY adjusted as per the instruction above, and send me the
resulting $KEY file, preferably in a signed (or even encrypted[*]) email
from an address associated with that very public key of yours.

[*] You can find my public key at public keyservers:
http://pool.sks-keyservers.net/pks/lookup?op=vindex&search=0x60BCBB4F5CD7F9EF
Indeed, the trust in this key should be ephemeral/one-off
(e.g., using a temporary keyring, not a universal one before we proceed
with the signing :)


Timeline?
Best if you send me your public keys before 2015-02-02.  I will then
compile a list of the attendees together with their keys and publish
it at https://people.redhat.com/jpokorny/keysigning/2015-ha/
so you can print it out and be ready for the party.

Thanks for your cooperation, looking forward to this side-event and
hope this will be beneficial to all involved.


P.S. There's now an opportunity to visit an exhibition of the Bohemian
Crown Jewels replicas directly in Brno (sorry, Google Translate only)
https://translate.google.com/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fwww.letohradekbrno.cz%2F%3Fidm%3D55


=o, keysigning is a brilliant idea!

I can put the keys in the plan wiki, too.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [Linux-HA] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2015-01-13 Thread Digimer

Woohoo!!

Will be very nice to see you. :)

I've added you. Can you give me a short sentence to introduce yourself 
to people who haven't met you?


Madi

On 13/01/15 11:33 PM, Yusuke Iida wrote:

Hi Digimer,

I am Iida to participate from NTT along with Mori.
I want you added to the list of participants.

I'm sorry contact is late.

Regards,
Yusuke

2014-12-23 2:13 GMT+09:00 Digimer li...@alteeve.ca:

It will be very nice to see you again! Will Ikeda-san be there as well?

digimer

On 22/12/14 03:35 AM, Keisuke MORI wrote:


Hi all,

Really late response but,
I will be joining the HA summit, with a few colleagues from NTT.

See you guys in Brno,
Thanks,


2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com:


Hello,

it occurred to me that if you want to use the opportunity and double
as a tourist while being in Brno, it's about the right time to
consider reservations/ticket purchases this early.
At least in some cases it is a must, e.g., Villa Tugendhat:


http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist

On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:


DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

My suggestion would be to have a 2 days dedicated HA summit the 4th and
the 5th of February.



--
Jan

___
ha-wg-technical mailing list
ha-wg-techni...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical








--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
___
Linux-HA mailing list
linux...@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems







--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

[Linux-cluster] [Planning] Organizing HA Summit 2015

2015-01-12 Thread Digimer

Hi all,

  With Fabio away for now, I (and others) are working on the final 
preparations for the summit. This is your chance to speak up and 
influence the planning! Objections/suggestions? Speak now please. :)


  In particular, please raise topics you want to discuss. Either add 
them to the wiki directly or email me and I will update the wiki for 
you. (Note that registration is closed because of spammers; if you want 
an account, just let me know and I'll open it back up).


The plan is;

* Informal atmosphere with limited structure to make sure key topics are 
addressed.


Two ways topics will be discussed;

** Someone will guide a given topic they want to raise for ~45 minutes, 
with 15 minutes for Q&A.


** Round-table style discussion with no one person leading (though it 
would be nice to have someone taking notes).


People presenting are asked not to use slides. Hand-outs are fine and 
either a white-board or paper flip-board will be available for 
illustrating ideas and flushing out concepts.


The summit will start at 9:00 and go until 17:00. We'll go for a 
semi-official summit dinner and drinks around 6pm on the 4th (location 
to be determined). Those staying in Brno are more than welcome to join 
an informal dinner and drinks (and possibly some sight-seeing, etc) the 
evening of the 5th.


Any concerns/comments/suggestions, please speak up ASAP!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] HA Summit 2015 - plan wiki closed for registration

2015-01-11 Thread Digimer

Spammers got through the captcha, *sigh*.

If anyone wants to create an account to edit, please email me off-list 
and I'll get you setup ASAP. Sorry for the hassle.


http://plan.alteeve.ca/index.php/Main_Page

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-08 Thread Digimer

On 08/01/15 07:17 AM, Cao, Vinh wrote:

Hi Digimer,

You are correct. I do need to configure fencing. But before fencing, I need to 
have these servers become members of the cluster first.
If they are not members of the cluster, it doesn't matter whether I configure 
fencing or not; my cluster won't work.

Thanks for your help.
Vinh


Define the fence methods right from the start. As soon as the cluster 
forms, the first thing you do is run 'fence_check -f' on all nodes. If 
there is a problem, fix it. Only then do you add services.
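
A rough illustration of that order of operations, using only tools already mentioned in this thread (the node name is a placeholder, and fencing it really will power-cycle it):

   # on each node, once cman is up and the cluster has formed
   fence_check -f            # check the configured fence methods, as recommended above
   clustat                   # confirm all members are listed
   fence_node ustlvcmsp1954  # deliberately fence one node to prove it works

Only once a deliberate fence like that works reliably should services or GFS2 be layered on top.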


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer

Please configure fencing. If you don't, it _will_ cause you problems.

On 07/01/15 09:48 PM, Cao, Vinh wrote:

Hi Digimer,

No we're not supporting multicast. I'm trying to use Broadcast, but Redhat 
support is saying better to use transport=udpu. Which I did set and it is 
complaining time out.
I did try to set broadcast, but somehow it didn't work either.

Let me give broadcast a try again.

Thanks,
Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 5:51 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

It looks like a network problem... Does your (virtual) switch support multicast 
properly and have you opened up the proper ports in the firewall?

On 07/01/15 05:32 PM, Cao, Vinh wrote:

Hi Digimer,

Yes, I just did. Looks like they are failing. I'm not sure why that is.
Please see the attachment for all servers log.

By the way, I do appreciated all the helps I can get.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

Quorum is enabled by default. I need to see the entire logs from all five 
nodes, as I mentioned in the first email. Please disable cman from starting on 
boot, configure fencing properly and then reboot all nodes cleanly. Start the 
'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, 
start cman on all five nodes. When things settle down, copy/paste all the log 
output please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:
[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Using quorum provider 
quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Compatibility mode set 
to whitetank.  Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [CPG   ] chosen downlist: 
sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Unloading all Corosync 
service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Corosync Cluster 
Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:
Starting cman...[  OK  ]
  Waiting for quorum... Timed-out waiting for cluster
  [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>, but the problem 
is still there. One thing I don't understand is why the cluster is looking for quorum?
I don't have a quorum disk set up in the cluster.conf file.

Any help I can get is appreciated.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patched monthly, but I'm not sure why these servers are still at 6.4.

Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer
It looks like a network problem... Does your (virtual) switch support 
multicast properly and have you opened up the proper ports in the firewall?


On 07/01/15 05:32 PM, Cao, Vinh wrote:

Hi Digimer,

Yes, I just did. Looks like they are failing. I'm not sure why that is.
Please see the attachment for all servers log.

By the way, I do appreciated all the helps I can get.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

Quorum is enabled by default. I need to see the entire logs from all five 
nodes, as I mentioned in the first email. Please disable cman from starting on 
boot, configure fencing properly and then reboot all nodes cleanly. Start the 
'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, 
start cman on all five nodes. When things settle down, copy/paste all the log 
output please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:
[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Using quorum provider 
quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Compatibility mode set 
to whitetank.  Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [CPG   ] chosen downlist: 
sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Unloading all Corosync 
service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Corosync Cluster 
Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:
   Starting cman...[  OK  ]
 Waiting for quorum... Timed-out waiting for cluster
 [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>, but the problem 
is still there. One thing I don't understand is why the cluster is looking for quorum?
I don't have a quorum disk set up in the cluster.conf file.

Any help I can get is appreciated.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patched monthly, but I'm not 
sure why these servers are still at 6.4. Most of our systems are 6.6.

Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via 
/etc/fstab.
[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
   <clusternodes>
   <clusternode name="ustlvcmsp1954" nodeid="1"/>
   <clusternode name="ustlvcmsp1955" nodeid="2"/>
   <clusternode name="ustlvcmsp1956" nodeid="3"/>
   <clusternode name="ustlvcmsp1957" nodeid="4"/>
   <clusternode name="ustlvcmsp1958" nodeid="5"/>
   </clusternodes>


You don't configure the fencing for the nodes... If anything causes a fence, the cluster will lock up (by design).

Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer

Did you configure fencing properly?

On 07/01/15 05:32 PM, Cao, Vinh wrote:

Hi Digimer,

Yes, I just did. Looks like they are failing. I'm not sure why that is.
Please see the attachment for all servers log.

By the way, I do appreciated all the helps I can get.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 4:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

Quorum is enabled by default. I need to see the entire logs from all five 
nodes, as I mentioned in the first email. Please disable cman from starting on 
boot, configure fencing properly and then reboot all nodes cleanly. Start the 
'tail -f -n 0 /var/log/messages' on all five nodes, then in another window, 
start cman on all five nodes. When things settle down, copy/paste all the log 
output please.

On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:
[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Using quorum provider 
quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Compatibility mode set 
to whitetank.  Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [CPG   ] chosen downlist: 
sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Unloading all Corosync 
service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Corosync Cluster 
Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:
   Starting cman...[  OK  ]
 Waiting for quorum... Timed-out waiting for cluster
 [FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>, but the problem 
is still there. One thing I don't understand is why the cluster is looking for quorum?
I don't have a quorum disk set up in the cluster.conf file.

Any help I can get is appreciated.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patched monthly, but I'm not 
sure why these servers are still at 6.4. Most of our systems are 6.6.

Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via 
/etc/fstab.
[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
   <clusternodes>
   <clusternode name="ustlvcmsp1954" nodeid="1"/>
   <clusternode name="ustlvcmsp1955" nodeid="2"/>
   <clusternode name="ustlvcmsp1956" nodeid="3"/>
   <clusternode name="ustlvcmsp1957" nodeid="4"/>
   <clusternode name="ustlvcmsp1958" nodeid="5"/>
   </clusternodes>


You don't configure the fencing for the nodes... If anything causes a fence, 
the cluster will lock up (by design).

Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer
My first thought would be to set <fence_daemon post_join_delay="30"/> in 
cluster.conf.
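
For reference, that element sits directly under <cluster> in /etc/cluster/cluster.conf; a minimal sketch, reusing the poster's cluster name and leaving everything else untouched:

<cluster config_version="16" name="p1954_to_p1958">
  <fence_daemon post_join_delay="30"/>
  ...
</cluster>

(config_version has to be bumped every time cluster.conf is edited, and the new file distributed to all nodes.)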


If that doesn't work, please share your configuration file. Then, with 
all nodes offline, open a terminal to each node and run 'tail -f -n 0 
/var/log/messages'. With that running, start all the nodes and wait for 
things to settle down, then paste the five nodes' output as well.


Also, 6.4 is pretty old, why not upgrade to 6.6?

digimer

On 07/01/15 03:10 PM, Cao, Vinh wrote:

Hello Cluster guru,

I'm trying to setup Redhat 6.4 OS cluster with 5 nodes. With two nodes I
don’t have any issue.

But with 5 nodes, when I ran clustat I got 3 nodes online and the other
two off line.

When I start the one that are off line. Service cman start. I got:

[root@ustlvcmspxxx ~]# service cman status

corosync is stopped

[root@ustlvcmsp1954 ~]# service cman start

Starting cluster:

Checking if cluster has been disabled at boot...[  OK  ]

Checking Network Manager... [  OK  ]

Global setup... [  OK  ]

Loading kernel modules...   [  OK  ]

Mounting configfs...[  OK  ]

Starting cman...[  OK  ]

Waiting for quorum... Timed-out waiting for cluster

[FAILED]

Stopping cluster:

Leaving fence domain... [  OK  ]

Stopping gfs_controld...[  OK  ]

Stopping dlm_controld...[  OK  ]

Stopping fenced...  [  OK  ]

Stopping cman...[  OK  ]

Waiting for corosync to shutdown:   [  OK  ]

Unloading kernel modules... [  OK  ]

Unmounting configfs...  [  OK  ]

Can you help?

Thank you,

Vinh






--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patched monthly, but I'm not 
sure why these servers are still at 6.4. Most of our systems are 6.6.

Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via 
/etc/fstab.
[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
 <clusternodes>
 <clusternode name="ustlvcmsp1954" nodeid="1"/>
 <clusternode name="ustlvcmsp1955" nodeid="2"/>
 <clusternode name="ustlvcmsp1956" nodeid="3"/>
 <clusternode name="ustlvcmsp1957" nodeid="4"/>
 <clusternode name="ustlvcmsp1958" nodeid="5"/>
 </clusternodes>


You don't configure the fencing for the nodes... If anything causes a 
fence, the cluster will lock up (by design).
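
To make that concrete: each <clusternode> needs a <fence> block that references one of the <fencedevice> entries below. A sketch for the first node only — the port value (the VM's name as vCenter knows it) is an assumption here, since it is not in the posted config:

 <clusternode name="ustlvcmsp1954" nodeid="1">
  <fence>
   <method name="vmware">
    <device name="p1954" port="ustlvcmsp1954"/>
   </method>
  </fence>
 </clusternode>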



 <fencedevices>
 <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.108" login="rhfence" name="p1954" passwd=""/>
 <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.109" login="rhfence" name="p1955" passwd=""/>
 <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.110" login="rhfence" name="p1956" passwd=""/>
 <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.111" login="rhfence" name="p1957" passwd=""/>
 <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.112" login="rhfence" name="p1958" passwd=""/>
 </fencedevices>
</cluster>

clustat show:

Cluster Status for p1954_to_p1958 @ Wed Jan  7 15:38:00 2015
Member Status: Quorate

  Member Name ID   Status
  --   --
  ustlvcmsp1954   1 Offline
  ustlvcmsp1955   2 Online, 
Local
  ustlvcmsp1956   3 Online
  ustlvcmsp1957   4 Offline
  ustlvcmsp1958   5 Online

I need to make them all online, so I can use fencing for mounting shared disk.

Thanks,
Vinh


What about the log entries from the start-up? Did you try the 
post_join_delay config?




-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:16 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

My first thought would be to set <fence_daemon post_join_delay="30"/> in 
cluster.conf.

If that doesn't work, please share your configuration file. Then, with all 
nodes offline, open a terminal to each node and run 'tail -f -n 0 
/var/log/messages'. With that running, start all the nodes and wait for things 
to settle down, then paste the five nodes' output as well.

Also, 6.4 is pretty old, why not upgrade to 6.6?

digimer

On 07/01/15 03:10 PM, Cao, Vinh wrote:

Hello Cluster guru,

I'm trying to setup Redhat 6.4 OS cluster with 5 nodes. With two nodes
I don't have any issue.

But with 5 nodes, when I ran clustat I got 3 nodes online and the
other two off line.

When I start the one that are off line. Service cman start. I got:

[root@ustlvcmspxxx ~]# service cman status

corosync is stopped

[root@ustlvcmsp1954 ~]# service cman start

Starting cluster:

 Checking if cluster has been disabled at boot...[  OK  ]

 Checking Network Manager... [  OK  ]

 Global setup... [  OK  ]

 Loading kernel modules...   [  OK  ]

 Mounting configfs...[  OK  ]

 Starting cman...[  OK  ]

Waiting for quorum... Timed-out waiting for cluster

 [FAILED]

Stopping cluster:

 Leaving fence domain... [  OK  ]

 Stopping gfs_controld...[  OK  ]

 Stopping dlm_controld...[  OK  ]

 Stopping fenced...  [  OK  ]

 Stopping cman...[  OK  ]

 Waiting for corosync to shutdown:   [  OK  ]

 Unloading kernel modules... [  OK  ]

 Unmounting configfs...  [  OK  ]

Can you help?

Thank you,

Vinh






--
Digimer
Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is 
trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

2015-01-07 Thread Digimer
Quorum is enabled by default. I need to see the entire logs from all 
five nodes, as I mentioned in the first email. Please disable cman from 
starting on boot, configure fencing properly and then reboot all nodes 
cleanly. Start the 'tail -f -n 0 /var/log/messages' on all five nodes, 
then in another window, start cman on all five nodes. When things settle 
down, copy/paste all the log output please.


On 07/01/15 04:29 PM, Cao, Vinh wrote:

Hi Digimer,

Here is from the logs:
[root@ustlvcmsp1954 ~]# tail -f /var/log/messages
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync profile loading service
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Using quorum provider 
quorum_cman
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine loaded: 
corosync cluster quorum service v0.1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Compatibility mode set 
to whitetank.  Using V1 and V2 of the synchronization engine.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [QUORUM] Members[1]: 1
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [CPG   ] chosen downlist: 
sender r(0) ip(10.30.197.108) ; members(old:0 left:0)
Jan  7 16:14:01 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jan  7 16:14:01 ustlvcmsp1954 rgmanager[8099]: Waiting for quorum to form
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Unloading all Corosync 
service engines.
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync extended virtual synchrony service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync configuration service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster closed process group service v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster config database access v1.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync profile loading service
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: openais checkpoint service B.01.01
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync CMAN membership service 2.90
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [SERV  ] Service engine 
unloaded: corosync cluster quorum service v0.1
Jan  7 16:15:06 ustlvcmsp1954 corosync[8182]:   [MAIN  ] Corosync Cluster 
Engine exiting with status 0 at main.c:2055.
Jan  7 16:15:06 ustlvcmsp1954 rgmanager[8099]: Quorum formed

Then it dies at:
  Starting cman...[  OK  ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]

Yes, I did make the change with <fence_daemon post_join_delay="30"/>, but the problem 
is still there. One thing I don't understand is why the cluster is looking for quorum?
I don't have a quorum disk set up in the cluster.conf file.

Any help I can get is appreciated.

Vinh

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, January 07, 2015 3:59 PM
To: linux clustering
Subject: Re: [Linux-cluster] needs helps GFS2 on 5 nodes cluster

On 07/01/15 03:39 PM, Cao, Vinh wrote:

Hello Digimer,

Yes, I would agree with you that RHEL 6.4 is old. We patched monthly, but I'm not 
sure why these servers are still at 6.4. Most of our systems are 6.6.

Here is my cluster config. All I want is to use the cluster to have GFS2 mounted via 
/etc/fstab.
[root@ustlvcmsp1955 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="p1954_to_p1958">
  <clusternodes>
  <clusternode name="ustlvcmsp1954" nodeid="1"/>
  <clusternode name="ustlvcmsp1955" nodeid="2"/>
  <clusternode name="ustlvcmsp1956" nodeid="3"/>
  <clusternode name="ustlvcmsp1957" nodeid="4"/>
  <clusternode name="ustlvcmsp1958" nodeid="5"/>
  </clusternodes>


You don't configure the fencing for the nodes... If anything causes a fence, 
the cluster will lock up (by design).


  <fencedevices>
  <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.108" login="rhfence" name="p1954" passwd=""/>
  <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.109" login="rhfence" name="p1955" passwd=""/>
  <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.110" login="rhfence" name="p1956" passwd=""/>
  <fencedevice agent="fence_vmware_soap" ipaddr="10.30.197.111" login="rhfence" name="p1957" passwd=""/>
  <fencedevice agent

Re: [Linux-cluster] [ha-wg-technical] [Pacemaker] [RFC] Organizing HA Summit 2015

2014-12-22 Thread Digimer

It will be very nice to see you again! Will Ikeda-san be there as well?

digimer

On 22/12/14 03:35 AM, Keisuke MORI wrote:

Hi all,

Really late response but,
I will be joining the HA summit, with a few colleagues from NTT.

See you guys in Brno,
Thanks,


2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com:

Hello,

it occurred to me that if you want to use the opportunity to double
as a tourist while in Brno, it's about the right time to
consider reservations/ticket purchases.
At least in some cases it is a must, e.g., Villa Tugendhat:

http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist

On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:

DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

My suggestion would be to have a 2 days dedicated HA summit the 4th and
the 5th of February.


--
Jan

___
ha-wg-technical mailing list
ha-wg-techni...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical








--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] new cluster acting odd

2014-12-01 Thread Digimer
If the problem reappears only 
under load, then that's an indication of the problem, too.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] new cluster acting odd

2014-12-01 Thread Digimer

On 01/12/14 11:34 AM, Dan Riley wrote:

Ha, I was unaware this was part of the folklore.  We have a couple of 9-node 
clusters, it did take some tuning to get them stable, and we are thinking about 
splitting one of them.  For our clusters, we found uniform configuration helped 
a lot, so a mix of physical hosts and VMs in the same (largish) cluster would 
make me a little nervous, don't know about anyone else's feelings.


Personally, I only build 2-node clusters. When I need more resource, I 
drop-in another pair. This allows all my clusters, going back to 2008/9, 
to have nearly identical configurations. In HA, I would argue, nothing 
is more useful than consistency and simplicity.


That said, I'd not fault anyone for going to 4 or 5 node. Beyond that 
though, I would argue that the cluster should be broken up. In HA, 
uptime should always trump resource utilization efficiency.


Mixing real and virtual machines strikes me as an avoidable complexity, too.


Something fence related is not working.


We used to see something like this, and it usually traced back to shouldn't be possible 
inconsistencies in the fence group membership.  Once the fence group gets blocked by a mismatched membership 
list, everything above it breaks.  Sometimes a fence_tool ls -n on all the cluster members will 
reveal an inconsistency (fence_tool dump also, but that's hard to interpret without digging into 
the group membership protocols).  If you can find an inconsistency, manually fencing the nodes in the 
minority might repair it.


In all my years, I've never seen this happen in production. If you can 
create a reproducer, I would *love* to see/examine it!



At the time, I did quite a lot of staring at fence_tool dumps, but never figured out how 
the fence group was getting into shouldn't be possible inconsistencies.  This 
was also all late RHEL5 and early RHEL6, so may not be applicable anymore.


HA in 6.2+ seems to be quite a bit more stable (I think for more reasons 
than just the HA stuff). For this reason, I am staying on RHEL 6 until 
at least 7.2+ is out. :)



My recommendation would be to schedule a maintenance window and then stop 
everything except cman (no rgmanager, no gfs2, etc). Then methodically test 
crashing all nodes (I like 'echo c > /proc/sysrq-trigger') and verify they are 
fenced and then recover properly. It's worth disabling cman and rgmanager from 
starting at boot (period, but particularly for this test).

If you can reliably (and repeatedly) crash - fence - rejoin, then I'd start 
loading back services and re-trying. If the problem reappears only under load, then 
that's an indication of the problem, too.
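
A minimal version of that crash/fence/rejoin loop, using only commands that appear elsewhere in this thread (node names are placeholders, and the echo genuinely crashes the kernel, so keep it inside the maintenance window):

   # on the victim node, from its console:
   echo c > /proc/sysrq-trigger

   # on a surviving node, watch the fence fire and the membership recover:
   tail -f /var/log/cluster/fenced.log
   cman_tool nodes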


I'd agree--start at the bottom of the stack and work your way up.

-dan


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] new cluster acting odd

2014-12-01 Thread Digimer
.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] new cluster acting odd

2014-12-01 Thread Digimer

On 01/12/14 01:03 PM, Megan . wrote:

We have 11 10-20TB GFS2 mounts that I need to share across all nodes.
Its the only reason we went with the cluster solution.  I don't know
how we could split it up into different smaller clusters.


I would do this, personally;

2-Node cluster; DRBD (on top of local disks or a pair of SANs, one per 
node), exported over NFS and configured in a simple single-primary 
(master/slave) configuration with a floating IP.
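
As a very rough sketch of the DRBD half of that stack (the resource name, hostnames, IPs and backing disk are all placeholders; the NFS export and floating IP then sit on top of /dev/drbd0 on whichever node is primary):

resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;
    on node1 { address 10.20.0.1:7788; }
    on node2 { address 10.20.0.2:7788; }
}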


GFS2, like any clustered filesystem, requires cluster locking. This 
locking comes with a non-trivial overhead. Exporting NFS allows you to 
avoid this bottle-neck and with a simple 2-node cluster behind the 
scenes, you maintain full HA.


In HA, nothing is more important than simplicity. Said another way;

A cluster isn't beautiful when there is nothing left to add. It is 
beautiful when there is nothing left to take away.



On Mon, Dec 1, 2014 at 12:14 PM, Digimer li...@alteeve.ca wrote:

On 01/12/14 11:56 AM, Megan . wrote:


Thank you for your replies.

The cluster is intended to be 9 nodes, but i haven't finished building
the remaining 2.  Our production cluster is expected to be similar in
size.  What tuning should I be looking at?


Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
remove IP addresses.



Can you simplify those fencedevice definitions? I would wonder if the set
timeouts could be part of the problem. Always start with the simplest
possible configurations and only add options in response to actual issues
discovered in testing.


I can try to simplify.  I had the longer timeouts because what I saw
happening on the physical boxes, was the box would be on its way
down/up and the fence command would fail, but the box actually did
come back online.  The physicals take 10-15 minutes to reboot and i
wasn't sure how to handle timeout issues, so i made the timeouts a bit
extreme for testing. I'll try to make the config more vanilla for
troubleshooting.


I'm not really sure why the state of the node should impact the fence 
action in any way. Fencing is supposed to work, regardless of the state 
of the target.


Fencing works like this (with a default config, on most fence agents);

1. Force off
2. Verify off
3. Try to boot, don't care if it succeeds.

So once the node is confirmed off by the agent, the fence is considered 
a success. How long (if at all) it takes for the node to reboot does not 
factor in.
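
That is also why a manual test is a fair check of the configuration even on slow-booting hardware; something like the following (node name is a placeholder) returns success as soon as the power-off is confirmed, long before a 10-15 minute reboot completes:

   fence_node ustlvcmsp1954 && echo "fence OK"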



I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
the cluster kept seeing it as online and never fenced it, yet i could
no longer ssh to the node.  I did this on a physical and VM box with
the same result.  I had to fence_node node to get it to reboot, but it
came up split brained (thinking it was the only one online). Now that
node has cman down and the rest of the cluster sees it as still
online.



Then corosync failed to detect the fault. That is a sign, to me, of a
fundamental network or configuration issue. Corosync should have shown
messages about a node being lost and reconfiguring. If that didn't happen,
then you're not even up to the point where fencing factors in.

Did you configure corosync.conf? When it came up, did it think it was
quorate or inquorate?


corosync.conf didn't work since it seems the RedHat HA Cluster doesn't
use that file.  http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf
  I tried it since we wanted to try to put the multicast traffic on a
different bond/vlan but we figured out the file isn't used.


Right, I wanted to make sure that, if you had tried, you've since 
removed the corosync.conf entirely. Corosync is fully controlled by the 
cman cluster.conf file.
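
For what it's worth, the usual way to steer cluster traffic onto a particular bond/VLAN in a cman setup is to have the <clusternode> names resolve to addresses on that network; if a specific multicast group is wanted, it can be pinned in cluster.conf with something like this (sketch only, the address is a placeholder in the site-local range):

   <cman>
     <multicast addr="239.192.100.1"/>
   </cman>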



I thought fencing was working because i'm able to do fence_node node
and see the box reboot and come back online.  I did have to get the FC
version of the fence_agents because of an issue with the idrac agent
not working properly.  We are running fence-agents-3.1.6-1.fc14.x86_64



That tells you that the configuration of the fence agents is working, but it
doesn't test failure detection. You can use the 'fence_check' tool to see if
the cluster can talk to everything, but in the end, the only useful test is
to simulate an actual crash.

Wait; 'fc14' ?! What OS are you using?


We are Centos 6.6.  I went with the fedora core agents because of this
exact issue 
http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
  I read that it was fixed in the next version, which i could only find
for FC.


It would be *much* better to file a bug report 
(https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%206) 
- Version: 6.6 - Component: fence-agents


Mixing RPMs from other OSes is not a good idea at all.


fence_tool dump worked on one of my nodes, but it is just hanging on the
rest.

[root@map1-uat ~]# fence_tool dump
1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/fenced.log
1417448610 fenced 3.0.12.1 started
1417448610 connected to dbus

Re: [Linux-cluster] [ha-wg-technical] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015

2014-11-29 Thread Digimer

On 27/11/14 11:52 AM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing:

http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project,
so it should be a neutral enough platform. Also, it's not owned by
$megacorp (I wish!), so spying/privacy shouldn't be an issue I hope. If
there is concern, I can setup https.

If no one else gets to it before me, I'll start collating the data from
the mailing list onto that wiki tomorrow (maaaybe today, depends).

The wiki requires registration, but that's it. I'm not bothering with
captchas because, in my experience, spammers walk right through them
anyway. I do have edits email me, so I can catch and roll back any spam
quickly.


Ok, I was getting 3~5 spam accounts created per day. To deal with this, 
I setup 'questy' captcha program with five (random) questions that 
should be easy to answer, even for non-english speakers. Just the same, 
if anyone has any trouble registering, please feel free to email me 
directly and I will be happy to help.


Madi

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Cluster-devel] [Pacemaker] Wiki for planning created - Re: [RFC] Organizing HA Summit 2015

2014-11-28 Thread Digimer

On 29/11/14 12:45 AM, Fabio M. Di Nitto wrote:



On 11/28/2014 8:10 PM, Jan Pokorný wrote:

On 28/11/14 00:37 -0500, Digimer wrote:

On 28/11/14 12:33 AM, Fabio M. Di Nitto wrote:

On 11/27/2014 5:52 PM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing:

http://plan.alteeve.ca/index.php/Main_Page

[...]


Awesome! thanks for taking care of it. Do you have a chance to add also
an instance of etherpad to the site?

Mostly to do collaborative editing while we sit all around the same table.

Otherwise we can use a public instance and copy paste info after that in
the wiki.


Never tried setting up etherpad before, but if it runs on rhel 6, I should
have no problem setting it up.


Provided no conspiracy to be started, there are a bunch of popular
instances, e.g. http://piratepad.net/



Right, some of them only store etherpads for 30 days. Just be careful
the one we choose or we make our own.

Fabio


I'll set one up, but I'll need a few days, I'm out of the country at the 
moment. It's not needed until the conference, is it? Or will you want to 
have it before then?


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015

2014-11-27 Thread Digimer

I just created a dedicated/fresh wiki for planning and organizing:

http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project, 
so it should be a neutral enough platform. Also, it's not owned by 
$megacorp (I wish!), so spying/privacy shouldn't be an issue I hope. If 
there is concern, I can setup https.


If no one else gets to it before me, I'll start collating the data from 
the mailing list onto that wiki tomorrow (maaaybe today, depends).


The wiki requires registration, but that's it. I'm not bothering with 
captchas because, in my experience, spammers walk right through them 
anyway. I do have edits email me, so I can catch and roll back any spam 
quickly.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Cluster-devel] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015

2014-11-27 Thread Digimer

On 28/11/14 12:33 AM, Fabio M. Di Nitto wrote:



On 11/27/2014 5:52 PM, Digimer wrote:

I just created a dedicated/fresh wiki for planning and organizing:

http://plan.alteeve.ca/index.php/Main_Page

Other than the domain, it has no association with any existing project,
so it should be a neutral enough platform. Also, it's not owned by
$megacorp (I wish!), so spying/privacy shouldn't be an issue I hope. If
there is concern, I can setup https.

If no one else gets to it before me, I'll start collating the data from
the mailing list onto that wiki tomorrow (maaaybe today, depends).

The wiki requires registration, but that's it. I'm not bothering with
captchas because, in my experience, spammers walk right through them
anyway. I do have edits email me, so I can catch and roll back any spam
quickly.



Awesome! thanks for taking care of it. Do you have a chance to add also
an instance of etherpad to the site?

Mostly to do collaborative editing while we sit all around the same table.

Otherwise we can use a public instance and copy paste info after that in
the wiki.

Fabio


Never tried setting up etherpad before, but if it runs on rhel 6, I 
should have no problem setting it up.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-26 Thread Digimer

On 26/11/14 05:28 PM, Andrew Beekhof wrote:



On 27 Nov 2014, at 8:18 am, Marek marx Grac mg...@redhat.com wrote:


On 11/26/2014 08:00 PM, Michael Schwartzkopff wrote:

Am Donnerstag, 27. November 2014, 00:13:11 schrieb Rajagopal Swaminathan:

Greetings,


Guys, I am a poor Indian whom the US of A abhors, and I have successfully
deployed over 5 CentOS/RHEL clusters varying from 4-6 nodes.

May I Know where this event is held?

Brno, Slovakia. Next international Airport: Vienna.

Brno is quite close to Slovakia but it is in Czech Republic. International 
airports around are Vienna, Prague and mostly low-cost ones in Brno and 
Bratislava


Anyone want to meet in munich and share a car? :-)


I might be up for that. I've not looked into flights yet, though I do 
have a standing invitation for beer in Vienna, so I'm sort of planning 
to fly through there. Apparently there is a very convenient bus from 
Vienna to Brno.


Why Munich? (Don't get me wrong, I loved it there last year!)

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [ha-wg-technical] [Cluster-devel] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-25 Thread Digimer

On 25/11/14 04:31 PM, Andrew Beekhof wrote:

Yeah, but you're already bringing him for your personal conference.
That's a bit different. ;-)

OK, let's switch tracks a bit. What *topics* do we actually have? Can we
fill two days? Where would we want to collect them?


Personally I'm interested in talking about scaling - with pacemaker-remoted 
and/or a new messaging/membership layer.

Other design-y topics:
- SBD
- degraded mode
- improved notifications


This may be something my company can bring to the table. We just hired a 
dev whose principal goal is to develop an alert system for HA. We're 
modelling it heavily on the fence/resource agent model, with a scan 
core and scan agents. It's sort of like existing tools, but designed 
specifically for HA clusters and heavily focused on not interfering with 
the host any more than necessary. By Feb., it should be mostly done.


We're doing this for our own needs, but it might be a framework worth 
talking about, if nothing else to see if others consider it a fit. Of 
course, it will be entirely open source. *If* there is interest, I could 
put together a(n informal) talk on it with a demo.



- containerisation of services (cgroups, docker, virt)
- resource-agents (upstream releases, handling of pull requests, testing)

User-facing topics could include recent features (ie. pacemaker-remoted, 
crm_resource --restart) and common deployment scenarios (eg. NFS) that people 
get wrong.




--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Cluster-devel] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015

2014-11-25 Thread Digimer

On 26/11/14 12:51 AM, Fabio M. Di Nitto wrote:



On 11/25/2014 10:54 AM, Lars Marowsky-Bree wrote:

On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote:


Yeah, well, devconf.cz is not such an interesting event for those who do
not wear the fedora ;-)

That would be the perfect opportunity for you to convert users to Suse ;)



I´d prefer, at least for this round, to keep dates/location and explore
the option to allow people to join remotely. Afterall there are tons of
tools between google hangouts and others that would allow that.

That is, in my experience, the absolute worst. It creates second class
participants and is a PITA for everyone.

I agree, it is still a way for people to join in tho.


I personally disagree. In my experience, one either does a face-to-face
meeting, or a virtual one that puts everyone on the same footing.
Mixing both works really badly unless the team already knows each
other.


I know that an in-person meeting is useful, but we have a large team in
Beijing, the US, Tasmania (OK, one crazy guy), various countries in
Europe etc.

Yes same here. No difference.. we have one crazy guy in Australia..


Yeah, but you're already bringing him for your personal conference.
That's a bit different. ;-)

OK, let's switch tracks a bit. What *topics* do we actually have? Can we
fill two days? Where would we want to collect them?


I´d say either a google doc or any random etherpad/wiki instance will do
just fine.

As for the topics:
- corosync qdevice and plugins (network, disk, integration with sdb?,
   others?)
- corosync RRP / libknet integration/replacement
- fence autodetection/autoconfiguration

For the user facing topics (that is if there are enough participants and
I only got 1 user confirmation so far):

- demos, cluster 101, tutorials
- get feedback
- get feedback
- get more feedback

Fabio


I'd be happy to do a cluster 101 or similar, if there is interest. Not 
sure if that would be particularly appealing to anyone coming to our 
meeting, as I think anyone interested is probably well past 101. :) 
Anyway, you guys know my background, let me know if there is a topic 
you'd like me to cover for the user side of things.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [ha-wg-technical] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015

2014-11-25 Thread Digimer

On 26/11/14 12:51 AM, Fabio M. Di Nitto wrote:



On 11/25/2014 10:54 AM, Lars Marowsky-Bree wrote:

On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote:


Yeah, well, devconf.cz is not such an interesting event for those who do
not wear the fedora ;-)

That would be the perfect opportunity for you to convert users to Suse ;)



I´d prefer, at least for this round, to keep dates/location and explore
the option to allow people to join remotely. Afterall there are tons of
tools between google hangouts and others that would allow that.

That is, in my experience, the absolute worst. It creates second class
participants and is a PITA for everyone.

I agree, it is still a way for people to join in tho.


I personally disagree. In my experience, one either does a face-to-face
meeting, or a virtual one that puts everyone on the same footing.
Mixing both works really badly unless the team already knows each
other.


I know that an in-person meeting is useful, but we have a large team in
Beijing, the US, Tasmania (OK, one crazy guy), various countries in
Europe etc.

Yes same here. No difference.. we have one crazy guy in Australia..


Yeah, but you're already bringing him for your personal conference.
That's a bit different. ;-)

OK, let's switch tracks a bit. What *topics* do we actually have? Can we
fill two days? Where would we want to collect them?


I´d say either a google doc or any random etherpad/wiki instance will do
just fine.

As for the topics:
- corosync qdevice and plugins (network, disk, integration with sdb?,
   others?)
- corosync RRP / libknet integration/replacement
- fence autodetection/autoconfiguration

For the user facing topics (that is if there are enough participants and
I only got 1 user confirmation so far):

- demos, cluster 101, tutorials
- get feedback
- get feedback
- get more feedback

Fabio


Ok, I do have a topic I want to add;

Merging the dozen different mailing lists, IRC channels and other 
support forums. This thread is a good example of how thinly the 
community is spread.


A 'dev', 'user', 'announce' list should be enough for all HA. Likewise, 
one IRC channel should be enough, too.


The trick will be discussing this without bikeshedding. :)

digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Digimer

On 24/11/14 09:54 AM, Fabio M. Di Nitto wrote:

On 11/24/2014 3:39 PM, Lars Marowsky-Bree wrote:

On 2014-09-08T12:30:23, Fabio M. Di Nitto fdini...@redhat.com wrote:

Folks, Fabio,

thanks for organizing this and getting the ball rolling. And again sorry
for being late to said game; I was busy elsewhere.

However, it seems that the idea for such a HA Summit in Brno/Feb 2015
hasn't exactly fallen on fertile grounds, even with the suggested
user/client day. (Or if there was a lot of feedback, it wasn't
public.)

I wonder why that is, and if/how we can make this more attractive?


I suspect a lot of it is that, given people's busy schedules, February 
seems far away. Also, I wonder how much discussion has happened outside 
of these lists. Is it really that there hasn't been much feedback?


Fabio started this ball rolling, so I would be interested to hear what 
he's heard.



Frankly, as might have been obvious ;-), for me the venue is an issue.
It's not easy to reach, and I'm theoretically fairly close in Germany
already.

I wonder if we could increase participation with a virtual meeting (on
either those dates or another), similar to what the Ceph Developer
Summit does?


Requested feedback given;

Virtual meetings are never as good, and I really don't like this idea. 
In my experience, just as much productive decision making happens in the 
unofficial after-hours activities as during formal(ish) 
meetings/presentations.


I think it is very important that the meeting remain in-person if at all 
possible.



Those appear really productive and make it possible for a wide range of
interested parties from all over the world to attend, regardless of
travel times, or even just attend select sessions (that would otherwise
make it hard to justify travel expenses  time off).

Alternatively, would a relocation to a more connected venue help, such
as Vienna xor Prague?


Personally, I don't care where we meet, but I do believe Fabio already 
ruled out a relocation.



I'd love to get some more feedback from the community.


I agree. some feedback would be useful.


digimer puts on her flame-retardant pantaloons and waits for the worst...

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Digimer

On 24/11/14 10:12 AM, Lars Marowsky-Bree wrote:

Beijing, the US, Tasmania (OK, one crazy guy), various countries in


Oh, bring him! crazy++

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-10-31 Thread Digimer

All the cool kids will be there.

You want to be a cool kid, right?

:p

On 01/11/14 01:06 AM, Fabio M. Di Nitto wrote:

just a kind reminder.

On 9/8/2014 12:30 PM, Fabio M. Di Nitto wrote:

All,

it's been almost 6 years since we had a face to face meeting for all
developers and vendors involved in Linux HA.

I'd like to try and organize a new event and piggy-back with DevConf in
Brno [1].

DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

My suggestion would be to have a 2 days dedicated HA summit the 4th and
the 5th of February.

The goal for this meeting is to, beside to get to know each other and
all social aspect of those events, tune the directions of the various HA
projects and explore common areas of improvements.

I am also very open to the idea of extending to 3 days, 1 one dedicated
to customers/users and 2 dedicated to developers, by starting the 3rd.

Thoughts?

Fabio

PS Please hit reply all or include me in CC just to make sure I'll see
an answer :)

[1] http://devconf.cz/


Could you please let me know by end of Nov if you are interested or not?

I have heard only from few people so far.

Cheers
Fabio
___
Linux-HA mailing list
linux...@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems




--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] daemon cpg_join error retrying

2014-10-29 Thread Digimer

On 29/10/14 06:16 PM, Andrew Beekhof wrote:



On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) lk...@cisco.com wrote:


I wonder if there is a mismatch between the cluster name in cluster.conf and 
the cluster name the GFS filesystem was created with.

How do I check the cluster name of the GFS file system? I had a similar configuration 
running fine in multiple other setups with no such issue.


I don't really recall. Hopefully someone more familiar with GFS2 can chime in.


# gfs2_tool sb /dev/c01n01_vg0/shared table
current lock table name = an-cluster-01:shared

Replace with your device, of course. :)
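
To compare that against the cluster name the running cluster is using, something 
like this should work (a sketch; the grep pattern may need adjusting to your 
cman_tool output):

# cman_tool status | grep -i "cluster name"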





Also, one more issue I am seeing in another setup: a repeated flood of 'A 
processor joined or left the membership and a new membership was formed' 
messages every 4 secs. I am running with default TOTEM settings with the token 
timeout at 10 secs. Even after I increase the token and consensus values to be 
higher, it goes on flooding the same message at the newly defined consensus interval 
(e.g.: if I increase it to 10 secs, then I see new membership formed messages 
every 10 secs)

Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: 
sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service 
synchronization, ready to provide service.

Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: 
sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service 
synchronization, ready to provide service.


It does not sound like your network is particularly healthy.
Are you using multicast or udpu? If multicast, it might be worth trying udpu


Agreed. Persistent multicast required?


Thanks
Lax


-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Andrew Beekhof
Sent: Wednesday, October 29, 2014 2:42 PM
To: linux clustering
Subject: Re: [Linux-cluster] daemon cpg_join error retrying



On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) lk...@cisco.com wrote:

Hi All,

In one of my setups, I keep getting 'gfs_controld[10744]: daemon 
cpg_join error  retrying'. I have a 2-node setup with pacemaker and corosync.


I wonder if there is a mismatch between the cluster name in cluster.conf and 
the cluster name the GFS filesystem was created with.



Even after I force kill the pacemaker processes and reboot the server and bring 
the pacemaker back up, it keeps giving cpg_join error. Is  there any way to fix 
this issue?


Thanks
Lax

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster






--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Rhel BootLoader, Single-user mode password Interactive Boot in a Cloud environment

2014-10-22 Thread Digimer

On 22/10/14 04:44 AM, Sunhux G wrote:

We run a cloud service and our vCenter is not accessible to our tenants
and their IT support, so I would say console access is not feasible
unless the tenant/customer IT come to our DC.

If the following 3 hardenings are done on our tenant/customer RHEL
Linux VM, what's the impact to the tenant's sysadmin and IT operation?


a) CIS 1.5.3 Set Boot Loader Password *:*
 if this password is set, when the tenant reboots (shutdown -r)
 their VM each time, will it prompt for the bootloader
 password at console?  If so, is there any way the tenant
 could still get their VM booted up if they have no access
 to vCenter's console?

b) CIS 1.5.4 Require Authentication for Single-User Mode *:*
 Does Linux allow ssh access while in single-user mode, and
 can this 'single-user mode password' be entered via an
 ssh session (without access to console), assuming certain
 'terminal' service is started up / running while in single
 user mode

c) CIS 1.5.5 Disable Interactive Boot *:*
 what's the general consensus on this? Disable or enable?
 Our corporate hardening guide does not mention this item.
 So if the tenant wishes to boot up step by step (ie pausing
 at each startup script), they can't do it?

Feel free to add any other impacts that anyone can think of

Lastly, how do people out there grant console access to their
tenants in a cloud environment without compromising security
(I mean without granting vCenter access)? I heard that we can
customize vCenter to grant tenants limited vCenter access;
is this so?


Sun


Hi Sun,

  Did you mean to post this to the vmware mailing list?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)

2014-10-15 Thread Digimer

On 15/10/14 10:15 AM, Marek marx Grac wrote:


On 10/14/2014 01:01 PM, Digimer wrote:


Hi Marek et. al.,

  This is a RHEL 6.5 install, so Kristoffer's comment about needing a
newer version of python is a bit of a concern. Has this been tested on
RHEL 6 with an APC with the 6.x firmware?


The current release does not contain the required patch; it will be in the next one
(or a z-stream if someone requests it). The upstream release works as
expected (retested today) on Fedora 20/RHEL 7. The fact that the upstream release
cannot be run on RHEL 6 is a new issue for me, but we did not try that before.

m,


Consider it officially requested. We use APC switched PDUs as backup 
fence devices extensively, so this would pretty heavily hurt us if we 
started getting v6 firmware.


Should I open a RHBZ?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)

2014-10-14 Thread Digimer

On 13/10/14 03:10 PM, Thomas Meier wrote:

Hi

When configuring PDU fencing in my 2-node-cluster I ran into some problems with
the fence_apc_snmp agent. Turning a node off works fine, but
fence_apc_snmp then exits with error.



When I do this manually (from node2):

fence_apc_snmp -a node1 -n 1 -o off

the output of the command is not the expected:

Success: Powered OFF

but in my case:

Returned 2: Error in packet.
Reason: (genError) A general failure occured
Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21


When I check the PDU, the port is without power, so this part works.
But it seems that the fence agent can't read the status of the PDU
and then exits with error. The same seems to happen when fenced
is calling the agent. The agent also exits with an error and fencing can't 
succeed
and the cluster hangs.


From the logfile:


 fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from 
agent


My setup:
   - CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed
   - APC AP8953 PDU with firmware 6.1
   - 2-node cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2
   - fencing agents in use: fence_ipmilan (working) and fence_apc_snmp


I did some research, and it looks like my fence-agents package is 
too old for my APC firmware.

I've already found the fence-agents repo: 
https://git.fedorahosted.org/cgit/fence-agents.git/

Here 
https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875
it says: fence_apc_snmp: Add support for firmware 6.x


I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but 
my build
of fence_apc_snmp doesn't work.

It gives:

[root@box1]# fence_apc_snmp -v -a node1 -n 1 -o status
Traceback (most recent call last):
   File /usr/sbin/fence_apc_snmp, line 223, in module
 main()
   File /usr/sbin/fence_apc_snmp, line 197, in main
 options = check_input(device_opt, process_input(device_opt))
   File /usr/share/fence/fencing.py, line 705, in check_input
 logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
TypeError: __init__() got an unexpected keyword argument 'stream'


I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and 
if so,
install the right version of fence_apc_snmp on the cluster without breaking 
things,
but I'm a bit clueless how to build me a working version.


Maybe you have some tips?



Thanks in advance

Thomas


Hi Marek et. al.,

  This is a RHEL 6.5 install, so Kristoffer's comment about needing a 
newer version of python is a bit of a concern. Has this been tested on 
RHEL 6 with an APC with the 6.x firmware?
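
For what it's worth, that traceback looks like Python 2.6's logging.StreamHandler()
not accepting the 'stream' keyword (it only grew that name in 2.7). As a rough,
untested workaround sketch for anyone wanting to test the upstream agent on EL6
in the meantime (path taken from the traceback above):

# pass sys.stderr positionally so the call works on python 2.6 and 2.7
sed -i 's/logging.StreamHandler(stream=sys.stderr)/logging.StreamHandler(sys.stderr)/' \
    /usr/share/fence/fencing.py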


cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] cLVM unusable on quorated cluster

2014-10-03 Thread Digimer

On 03/10/14 10:35 AM, Daniel Dehennin wrote:

Hello,

I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
for an OpenNebula cluster.

As I'm new to the cluster world, I have a hard time figuring out why things
sometimes go really wrong and where I must look to find answers.

My OpenNebula frontend, running in a VM, does not manage to run the
resources and my syslog has a lot of:

#+begin_src
ocfs2_controld: Unable to open checkpoint ocfs2:controld: Object does not 
exist
#+end_src

When this happens, other nodes have problem:

#+begin_src
root@nebula3:~# LANG=C vgscan
   cluster request failed: Host is down
   Unable to obtain global lock.
#+end_src

But things look fine in “crm_mon”:

#+begin_src
root@nebula3:~# crm_mon -1

Last updated: Fri Oct  3 16:25:43 2014
Last change: Fri Oct  3 14:51:59 2014 via cibadmin on nebula1
Stack: openais
Current DC: nebula3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
32 Resources configured.


Node quorum: standby
Online: [ nebula3 nebula2 nebula1 ]
OFFLINE: [ one ]

  Stonith-nebula3-IPMILAN(stonith:external/ipmi):Started nebula2
  Stonith-nebula2-IPMILAN(stonith:external/ipmi):Started nebula3
  Stonith-nebula1-IPMILAN(stonith:external/ipmi):Started nebula2
  Clone Set: ONE-Storage-Clone [ONE-Storage]
  Started: [ nebula1 nebula3 nebula2 ]
  Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
  Quorum-Node(ocf::heartbeat:VirtualDomain): Started nebula3
  Stonith-Quorum-Node   (stonith:external/libvirt):   Started nebula3
#+end_src

I don't know how to interpret the dlm_tool information:

#+begin_src
root@nebula3:~# dlm_tool ls -n
dlm lockspaces
name  CCB10CE8D4FF489B9A2ECB288DACF2D7
id0x09250e49
flags 0x0008 fs_reg
changemember 3 joined 1 remove 0 failed 0 seq 2,2
members   1189587136 1206364352 1223141568
all nodes
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none

name  clvmd
id0x4104eefa
flags 0x
changemember 3 joined 0 remove 1 failed 0 seq 4,4
members   1189587136 1206364352 1223141568
all nodes
nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
#+end_src




Is there any documentation on troubleshooting DLM/cLVM?

Regards.


Can you paste your full pacemaker config and the logs from the other 
nodes starting just before the lost node went away?


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] clvmd issues

2014-10-03 Thread Digimer

On 03/10/14 12:57 PM, manish vaidya wrote:

First, I apologise for the late reply; the delay was because I did not believe there
would be any response from the site. I am a newcomer, and I had already posted this
problem on many online forums, but they didn't give any response.

Thank you all for taking my problem seriously.

** response from you

are you using clvmd? if your answer is yes, you need to be sure your PV

is visible to your cluster nodes

*** I am using clvmd. When I use the pvscan command, the cluster hangs.

I want to reproduce this situation again, such that when I
try to run the pvcreate command in the cluster, the message 'lock from
node2 & node3' should appear. I have created a new cluster, and this new cluster is working
fine.
How do I do this? Is there any setting in lvm.conf?


Can you share your setup please?

What kind of cluster? What version? What is the configuration file? Was 
there anything interesting in the system logs? etc.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015

2014-09-22 Thread Digimer

On 08/09/14 06:30 AM, Fabio M. Di Nitto wrote:

All,

it's been almost 6 years since we had a face to face meeting for all
developers and vendors involved in Linux HA.

I'd like to try and organize a new event and piggy-back with DevConf in
Brno [1].

DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

My suggestion would be to have a 2 days dedicated HA summit the 4th and
the 5th of February.

The goal for this meeting is to, beside to get to know each other and
all social aspect of those events, tune the directions of the various HA
projects and explore common areas of improvements.

I am also very open to the idea of extending to 3 days, one dedicated
to customers/users and two dedicated to developers, by starting on the 3rd.

Thoughts?

Fabio

PS Please hit reply all or include me in CC just to make sure I'll see
an answer :)

[1] http://devconf.cz/


How is this looking?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster

2014-09-09 Thread Digimer

On 09/09/14 03:14 AM, Amjad Syed wrote:

<device lanplus="" name="inspuripmi" action="reboot"/>


Something is breaking the network during the shutdown, a fence is being 
called and both nodes are killing the other, causing a dual fence. So 
you have a set of problems, I think.


First, disable acpid on both nodes.

Second, change the quoted line (only) to:

<device lanplus="" name="inspuripmi" delay="15" action="reboot"/>

If I am right, this will mean that 192.168.10.10 will stay up and fence .11

Third, what bonding mode are you using? I would only use mode=1.

Fourth, please set the node names to match 'uname -n' on both nodes. Be 
sure the names translate to the IPs you want (via /etc/hosts, ideally).


Fifth, as Sivaji suggested, please put switch(es) between the nodes.

If it still tries to fence when a node shuts down (watch 
/var/log/messages and look for 'fencing node ...'), please paste your 
logs from both nodes.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015

2014-09-08 Thread Digimer

On 08/09/14 06:30 AM, Fabio M. Di Nitto wrote:

All,

it's been almost 6 years since we had a face to face meeting for all
developers and vendors involved in Linux HA.

I'd like to try and organize a new event and piggy-back with DevConf in
Brno [1].

DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

My suggestion would be to have a 2 days dedicated HA summit the 4th and
the 5th of February.

The goal for this meeting is to, beside to get to know each other and
all social aspect of those events, tune the directions of the various HA
projects and explore common areas of improvements.

I am also very open to the idea of extending to 3 days, one dedicated
to customers/users and two dedicated to developers, by starting on the 3rd.

Thoughts?

Fabio

PS Please hit reply all or include me in CC just to make sure I'll see
an answer :)

[1] http://devconf.cz/


I think this is a good idea. Three days may be worthwhile, as well.

I think I would be more useful trying to bring the user's perspective 
more so than a developer's. So on that, I would like to propose a 
discussion on merging some of the disparate lists, channels, sites, etc. 
to help simplify life for new users looking for help from, or wanting 
to join, the HA community.


I also understand that Fabio will buy the first round of drinks. :)

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Please help me on cluster error

2014-08-30 Thread Digimer

Can you share your cluster information please?

This could be a network problem, as the messages below happen when the 
network between the nodes isn't fast enough or has too long latency and 
cluster traffic is considered lost and re-requested.


If you don't have fencing working properly, and if a network issue 
caused a node to be declared lost, clustered LVM (and anything else 
using cluster locking) will fail (by design).


If you share your configuration and more of your logs, it will help us 
understand what is happening. Please also tell us what version of the 
cluster software you're using.
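
For example (a sketch; adjust paths to your distro), the output of these from
each node would give us most of what we need:

cman_tool status                  # cluster name, version and quorum state
rpm -q corosync cman lvm2-cluster # exact package versions
cat /etc/cluster/cluster.conf     # the cluster configuration
tail -n 200 /var/log/messages     # logs around the time of the errors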


digimer

On 30/08/14 10:12 AM, manish vaidya wrote:

I created a four-node cluster in a KVM environment, but I faced an error when
creating a new PV, such as pvcreate /dev/sdb1;
I got an error: lock from node 2 & lock from node 3.

I also see strange cluster logs:

Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e

Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
5f
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f
60
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63
64
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69
6a
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84
85
Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a
9b


Please help me on this issue
http://sigads.rediff.com/RealMedia/ads/click_nx.ads/www.rediffmail.com/signatureline.htm@Middle?

Get your own *FREE* website, *FREE* domain  *FREE* mobile app with
Company email.
*Know More *
http://track.rediff.com/click?url=___http://businessemail.rediff.com/email-ids-for-companies-with-less-than-50-employees?sc_cid=sign-1-10-13___cmp=hostlnk=sign-1-10-13nsrv1=host






--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] corosync ring failure

2014-07-23 Thread Digimer

Any logs in the switch? Is the multicast group being deleted/recreated?

On 23/07/14 11:53 AM, C. Handel wrote:

hi,

I run a cluster with two corosync rings. One of the rings is marked
faulty every forty seconds, only to recover a second later;
the other ring is stable.

I have no idea how I should debug this.


We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12, corosync 1.4.1.
The cluster consists of three machines. Ring1 is running on 10-gigabit
interfaces, Ring0 on 1-gigabit interfaces. Neither ring leaves its
respective switch.

corosync communication is udpu, rrp_mode is passive

cluster.conf:

<cluster config_version="30" name="aslfile">

<cman transport="udpu">
</cman>

<fence_daemon post_join_delay="120" post_fail_delay="30"/>

<fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk" action="off"/>
</fencedevices>

<quorumd
    cman_label="qdisk"
    device="/dev/mapper/mpath-091quorump1"
    min_score="1"
    votes="2">
</quorumd>

<clusternodes>
    <clusternode name="asl430m90" nodeid="430">
        <altname name="asl430"/>
        <fence>
            <method name="pcmk-redirect">
                <device name="pcmk" port="asl430m90"/>
            </method>
        </fence>
    </clusternode>
    <clusternode name="asl431m90" nodeid="431">
        <altname name="asl431"/>
        <fence>
            <method name="pcmk-redirect">
                <device name="pcmk" port="asl431m90"/>
            </method>
        </fence>
    </clusternode>
    <clusternode name="asl432m90" nodeid="432">
        <altname name="asl432"/>
        <fence>
            <method name="pcmk-redirect">
                <device name="pcmk" port="asl432m90"/>
            </method>
        </fence>
    </clusternode>
</clusternodes>
</cluster>


syslog


Jul 23 17:48:34 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1
interface 140.181.134.212 FAULTY
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:14 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1
interface 140.181.134.212 FAULTY
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1



Greetings
Christoph




--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Error in Cluster.conf

2014-06-24 Thread Digimer

On 24/06/14 08:55 AM, Jan Pokorný wrote:

On 24/06/14 13:56 +0200, Fabio M. Di Nitto wrote:

On 6/24/2014 12:32 PM, Amjad Syed wrote:

Hello

I am getting the following error when I run ccs_config_validate

ccs_config_validate
Relax-NG validity error : Extra element clusternodes in interleave


You defined clusternodes.. twice.


That, plus there are more issues discoverable by the more powerful validator
jing (packaged in Fedora and RHEL 7, for instance; admittedly not
for RHEL 6/EPEL):

$ jing cluster.rng cluster.conf

cluster.conf:13:47: error:
   element fencedvice not allowed anywhere; expected the element
   end-tag or element fencedevice
cluster.conf:15:23: error:
   element clusternodes not allowed here; expected the element
   end-tag or element clvmd, dlm, fence_daemon, fence_xvmd,
   gfs_controld, group, logging, quorumd, rm, totem or
   uidgid
cluster.conf:26:76: error:
   IDREF fence_node2 without matching ID
cluster.conf:19:77: error:
   IDREF fence_node1 without matching ID


So it also spotted:
- a typo in fencedvice
- broken referential integrity; it is prescribed that the name attribute
   of a device tag should match the name of a defined fencedevice

Hope this helps.

-- Jan


Also, without fence methods defined for the nodes, rgmanager will block 
the first time there is an issue.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Online change of fence device options - possible?

2014-06-23 Thread Digimer

On 23/06/14 02:16 PM, Digimer wrote:

On 23/06/14 02:09 PM, Vasil Valchev wrote:

Hello,

I have a RHEL 6.5 cluster, using rgmanager.
The fence devices are fence_ipmilan - fencing through HP iLO4.

The issue is the fence devices weren't configured entirely correctly -
recently, after a node failure, the fence agent was returning failures
(even though it was fencing the node successfully), which apparently can
be avoided by setting the power_wait option in the fence device
configuration.

My question is - after changing the fence device (I think directly
through the .conf will be fine?), iterating the config version, and
syncing the .conf through the cluster software - is something else
necessary to apply the change (eg. cman reload)?

Will the new fence option be used the next time a fencing action is
performed?

And lastly can all of this be performed while the cluster and services
are operational or they have to be stopped/restarted?


Regards,
Vasil


This should be fine. As you said; Update the fence config, increment the
config_version, save and exit. Run 'ccs_config_validate' and if that
passes, 'cman_tool version -r'. Note that for this to work, you need to
have set the 'ricci' user's shell password as well as have the 'ricci'
and 'modclusterd' daemons running.

Once done, run 'fence_check'[1] to verify that the fence config works
(it makes a status call to check). If that works, you're good to go.

You can also crontab the fence_check call and have it email you or
something so that you can catch fence failures earlier.

digimer

1.
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config


I should clarify; You can update the config while the cluster is online. 
No fences will be called and you do not need to restart anything.
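
For reference, a minimal sketch of that workflow on one node (assuming the
ricci password is already set and ricci/modclusterd are running, as above):

vi /etc/cluster/cluster.conf   # edit the fence options, bump config_version
ccs_config_validate            # sanity-check the new config
cman_tool version -r           # push it to the other nodes via ricci
fence_check                    # confirm every fence method still answers a status call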


cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Online change of fence device options - possible?

2014-06-23 Thread Digimer

On 23/06/14 02:09 PM, Vasil Valchev wrote:

Hello,

I have a RHEL 6.5 cluster, using rgmanager.
The fence devices are fence_ipmilan - fencing through HP iLO4.

The issue is the fence devices weren't configured entirely correctly -
recently, after a node failure, the fence agent was returning failures
(even though it was fencing the node successfully), which apparently can
be avoided by setting the power_wait option in the fence device configuration.

My question is - after changing the fence device (I think directly
through the .conf will be fine?), iterating the config version, and
syncing the .conf through the cluster software - is something else
necessary to apply the change (eg. cman reload)?

Will the new fence option be used the next time a fencing action is
performed?

And lastly can all of this be performed while the cluster and services
are operational or they have to be stopped/restarted?


Regards,
Vasil


This should be fine. As you said; Update the fence config, increment the 
config_version, save and exit. Run 'ccs_config_validate' and if that 
passes, 'cman_tool version -r'. Note that for this to work, you need to 
have set the 'ricci' user's shell password as well as have the 'ricci' 
and 'modclusterd' daemons running.


Once done, run 'fence_check'[1] to verify that the fence config works 
(it makes a status call to check). If that works, you're good to go.


You can also crontab the fence_check call and have it email you or 
something so that you can catch fence failures earlier.


digimer

1. 
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_Fence_check_to_Verify_our_Fencing_Config


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] fence Agent

2014-06-22 Thread Digimer

On 22/06/14 03:55 AM, Amjad Syed wrote:

Hello,

I am trying to setup a simple 2 node cluster in active/passive mode for
oracle high availability

We are using one INSPUR server and one HP ProLiant (a management decision
based on hardware availability) and we are seeing if we can use IPMI
as the fencing method.

CCHS, though, supports HP iLO, Dell IPMI, and IBM, but not INSPUR.

So the basic question I have is whether we can use fence_ilo (for HP)
and fence_ipmilan (for INSPUR)?

If anyone has any experience with fence_ipmilan, or can point to resources,
it would really be appreciated.

Sincerely,
Amjad


fence_ipmilan works with just about every IPMI-based out of band 
management interface. Most of those branded ones, like DRAC, RSA, iLO, 
etc are fundamentally based on IPMI. I've used fence_ipmilan on iLO 
personally and it's fine.


If you can show what 'ipmitool' command you use that can show if the 
peer is powered on or off, then you should be able to translate it quite 
easily to a matching fence_ipmilan call (check man fence_ipmilan for the 
switches). Once you can check the power status of the peer(s) with 
fence_ipmilan, you're 95% of the way there.
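
For example, a status check might look something like this (address and
credentials are placeholders; check man fence_ipmilan for the switches your
version supports, and drop -P if your BMC does not speak IPMI 2.0 lanplus):

fence_ipmilan -a 192.168.10.11 -l admin -p secret -P -o status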


cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-18 Thread Digimer

On 18/06/14 02:20 PM, YB Tan Sri Dato Sri' Adli a.k.a Dell wrote:

Hi,

The linux clustering will be only working perfectly if you run the linux
operating systems between nodes. Allow root ssh persistent connection on
top of same specifications hardware platform.

To perform test or proof of concept, you may allow to run and configure
between two nodes.

The databases for clustering will be configure right after the two nodes
linux operating systems run with persistent root access ssh connection.

Sent from Yahoo Mail for iPhone https://overview.mail.yahoo.com?.src=iOS


You have said this a couple times now, and I am not sure why. There is 
no need to have persistent, root access SSH between nodes. It's helpful 
in some cases, sure, but certainly not required. Corosync, which 
provides cluster membership and communication, handles internode traffic 
itself, on it's own TCP port (using multicast by default or unicast if 
configured).


There is also nothing restricting you to two nodes. It's a good 
configuration, and one I use personally, but there are many 3+ node 
clusters out there.


As for a database cluster, that would depend entirely on which database 
you are using and whether you are using tools specific for that DB or a 
more generic HA stack like corosync + pacemaker.


Cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Two-node cluster GFS2 confusing

2014-06-18 Thread Digimer
I don't use VMware myself, but I think fence_vmware will work for you. 
Please note that simply enabling stonith is not enough. As you realize, 
you need a configured and working fence method.


If you try using the command line, you can play with the command's 
switches, asking for 'status'. When that returns properly, you will then 
just need to convert the switches into arguments for pacemaker.


Read the man page for 'fence_vmware', and then try calling:

fence_vmware ... -o status

Fill in the switches and values you need based on the instructions in 
'man fence_vmware'.
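
For example, with the SOAP variant of the agent a status call might look
something like this (host, credentials and VM name are placeholders; the
exact switches depend on the agent version, so do check the man page):

fence_vmware_soap -a vcenter.example.com -l admin -p secret -z -n server2 -o status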


digimer

On 18/06/14 09:51 PM, Le Trung Kien wrote:

Hi,

As Digimer suggested, I change property

stonith-enabled=true

But now I don't know which fencing method I should use, because my two Red Hat 
nodes are running on VMware Workstation, with OpenFiler as the shared SCSI LUN storage.

I attempted to use fence_scsi, but no luck, I got this error:

Jun 19 08:35:58 server1 stonith_admin[3837]:   notice: crm_log_args: Invoked: 
stonith_admin --reboot server2 --tolerance 5s
Jun 19 08:36:08 server1 root: fence_pcmk[3836]: Call to fence server2 (reset) 
failed with rc=255

Here is my fencing configuration:

<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
<cman expected_votes="1" cluster_id="1"/>
<fence_daemon post_fail_delay="0" post_join_delay="30"/>
<clusternodes>
    <clusternode name="server1" votes="1" nodeid="1">
        <fence>
            <method name="scsi">
                <device name="scsi_dev" key="1"/>
            </method>
        </fence>
    </clusternode>
    <clusternode name="server2" votes="1" nodeid="2">
        <fence>
            <method name="scsi">
                <device name="scsi_dev" key="2"/>
            </method>
        </fence>
    </clusternode>
</clusternodes>
<fencedevices>
    <fencedevice agent="fence_scsi" name="scsi_dev" aptpl="1"
                 logfile="/tmp/fence_scsi.log"/>
</fencedevices>
</cluster>

And the log: /tmp/fence_scsi.log show:

Jun 18 19:49:40 fence_scsi: [error] no devices found

I will try vmware_soap to see if it works.

Kien Le

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Digimer
Sent: Wednesday, June 18, 2014 11:18 AM
To: linux clustering
Subject: Re: [Linux-cluster] Two-node cluster GFS2 confusing

On 16/06/14 07:43 AM, Le Trung Kien wrote:

Hello everyone,

I'm new to Linux clustering. I have built a two-node cluster (without 
qdisk) that includes:

Redhat 6.4
cman
pacemaker
gfs2

My cluster could fail-over (back and forth) between two nodes for
these 3 resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on
/mnt/gfs2_storage), WebSite ( apache service)

My problem occurs when I stop/start node in the following order: (when
both nodes started)

1. Stop: node1 (shutdown) - all resource fail-over on node2 - all
resources still working on node2 2. Stop: node2 (stop service:
pacemaker then cman) - all resources stop (of course) 3. Start: node1
(start service: cman then pacemaker) - only ClusterIP started, WebFS
failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014 Last change: Mon Jun 16
14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

   ClusterIP  (ocf::heartbeat:IPaddr2):   Started server1
   WebFS  (ocf::heartbeat:Filesystem):Started server1 (unmanaged) FAILED

Failed actions:
  WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out):
unknown error

Here is my /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="server1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

Here is my: crm configure show



snip


  stonith-enabled=false \


Well this is a problem.

When cman detects a failure (well corosync, but cman is told), it initiates a 
fence request. The fence daemon informs DLM, which blocks.
Then fenced calls the configured 'fence_pcmk', which just passes the request up 
to pacemaker.

Without stonith configured in fencing, pacemaker will fail

Re: [Linux-cluster] Two-node cluster GFS2 confusing

2014-06-17 Thread Digimer

On 16/06/14 07:43 AM, Le Trung Kien wrote:

Hello everyone,

I'm new to Linux clustering. I have built a two-node cluster (without 
qdisk) that includes:

Redhat 6.4
cman
pacemaker
gfs2

My cluster could fail-over (back and forth) between two nodes for these 3 
resources: ClusterIP, WebFS (Filesystem GFS2 mount /dev/sdc on 
/mnt/gfs2_storage), WebSite ( apache service)

My problem occurs when I stop/start node in the following order: (when both 
nodes started)

1. Stop: node1 (shutdown) - all resource fail-over on node2 - all resources 
still working on node2
2. Stop: node2 (stop service: pacemaker then cman) - all resources stop (of 
course)
3. Start: node1 (start service: cman then pacemaker) - only ClusterIP started, 
WebFS failed, WebSite not started

Status:

Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.

Online: [ server1 ]
OFFLINE: [ server2 ]

  ClusterIP  (ocf::heartbeat:IPaddr2):   Started server1
  WebFS  (ocf::heartbeat:Filesystem):Started server1 (unmanaged) FAILED

Failed actions:
 WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error

Here is my /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="server1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>

Here is my: crm configure show



snip


 stonith-enabled=false \


Well this is a problem.

When cman detects a failure (well corosync, but cman is told), it 
initiates a fence request. The fence daemon informs DLM, which blocks. 
Then fenced calls the configured 'fence_pcmk', which just passes the 
request up to pacemaker.


Without stonith configured in fencing, pacemaker will fail to fence, of 
course. Thus, DLM sits blocked, so DRBD (and clustered LVM) hang, by 
design.


If you configure proper fencing in pacemaker (and test it to make sure it 
works), then pacemaker *would* succeed in fencing and return a success 
to fence_pcmk. Then fenced is told that the fence succeeded, DLM cleans 
up lost locks, and things return to normal operation.


So please configure and test real stonith in pacemaker and see if your 
problem is resolved.
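
As a rough sketch only (crm shell syntax, hypothetical credentials and VM name;
verify the agent name and its parameters against the metadata on your systems),
a stonith resource for a VMware-hosted node could look something like:

crm configure primitive fence-server2 stonith:fence_vmware_soap \
    params ipaddr="vcenter.example.com" login="admin" passwd="secret" \
           port="server2" ssl="1" action="reboot" \
    op monitor interval="60s"

Do the same for server1, test both with 'stonith_admin --reboot <node>', and
only then re-run your failure scenario.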


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] 2-node cluster fence loop

2014-06-12 Thread Digimer
Have you tried simple things like disabling iptables or selinux, just to 
test? If that doesn't work, and it's a small cluster, try unicast and 
see if that helps (again, even if just to test).
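
If you want to try unicast, it is a one-attribute change (sketched below; bump
config_version, then restart cman on both nodes so the new transport is used):

# in /etc/cluster/cluster.conf, on the cman element:
#   <cman transport="udpu"/>
ccs_config_validate      # validate the edited config before restarting anything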


On 12/06/14 10:29 AM, Arun G Nair wrote:

We have multicast enabled on the switch. I've also tried the
multicast.py tool from RH's knowledge base to test multicast and I see
the expected output, though the tool uses a different multicast IP (I
guess that shouldn't matter). I've tried increasing the post_join_delay
to 360 seconds to give me enough time to check everything on both the
nodes. One node still gets fenced. `clustat` output says the other node
is offline on both servers. So one node can't see the other one ? This
again points to issue with multicast. Any other clues as to what/where
to look ?


On Wed, Jun 11, 2014 at 8:33 PM, Digimer li...@alteeve.ca
mailto:li...@alteeve.ca wrote:

On 11/06/14 10:48 AM, Arun G Nair wrote:

Hello,

 What are the reasons for fence loops when only cman is
started ? We
have an RHEL 6.5 2-node cluster which goes in to a fence loop
and every
time we start cman on both nodes. Either one fences the other.
Multicast
seems to be working properly. My understanding is that without
rgmanager
running there won't be a multicast group subscription ? I don't
see the
multicast address in 'netstat -g' unless rgmanager is running. I've
tried to increase the fence post_join_delay but one of the nodes
still
gets fenced.

The cluster works fine if we use unicast UDP.

Thanks,


Hi,

   When cman starts, it waits post_join_delay seconds for the peer
to connect. If, after that time expires (6 seconds by default,
iirc), it gives up and calls a fence against the peer to put it into
a known state.

   Corosync is what determines membership, and it is started by
cman. The rgmanager only handles resource
start/stop/relocate/recovery and has nothing to do with fencing
directly. Corosync is what uses multicast.

   So as you seem to have already surmised, multicast is probably
not working in your environment. Have you enabled multicast traffic
on the firewall? Do your switches support multicast properly?

digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person
without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com mailto:Linux-cluster@redhat.com
https://www.redhat.com/__mailman/listinfo/linux-cluster
https://www.redhat.com/mailman/listinfo/linux-cluster




--
Arun G Nair
Sr. Sysadmin
Dimension Data | Ph: (800) 664-9973
Feedback? We're listening http://www.surveymonkey.com/s/XRCYXBH





--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-12 Thread Digimer
To confirm; Have you tried with the bonds setup where each node has one
link into either switch? I just want to be sure you've ruled out all the
network hardware. Also please confirm that you used mode=1
(active-passive) bonding.

Assuming this doesn't help, then I would say that I was wrong in
assuming it was network related. The next thing I would look at is
corosync. Do you see any messages about totem retransmit?

On 12/06/14 11:32 AM, Schaefer, Micah wrote:
 Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
 fenced, then node3 was fenced when node4 came back online. The network
 topology is as follows:
 switch1: node1, node3 (two connections)
 switch2: node2, node4 (two connections)
 switch1 ― switch2
 All on the same subnet
 
 I set up monitoring at 100 millisecond of the nics in active-backup mode,
 and saw no messages about link problems before the fence.
 
 I see multicast between the servers using tcpdump.
 
 
 Any more ideas?
 
 
 
 
 
 On 6/12/14, 12:19 AM, Digimer li...@alteeve.ca wrote:
 
 I considered that, but I would expect more nodes to be lost.

 On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
 Make sure multicast is enabled across the switches.

 -Original Message-
 From: linux-cluster-boun...@redhat.com
 [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Schaefer, Micah
 Sent: Thursday, June 12, 2014 1:20 AM
 To: linux clustering
 Subject: Re: [Linux-cluster] Node is randomly fenced

 Okay, I set up active/ backup bonding and will watch for any change.

 This is the network side:
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets



 This is the server side:

 em1   Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
 inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
 inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
 TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0
 GiB)
 Interrupt:34 Memory:d500-d57f



 I need to run some fiber, but for now two nodes are plugged into one
 switch and the other two nodes into a separate switch that are on the
 same subnet. I'll work on cross connecting the bonded interfaces to
 different switches.



 On 6/11/14, 3:28 PM, Digimer li...@alteeve.ca wrote:

 The first thing I would do is get a second NIC and configure
 active-passive bonding. network issues are too common to ignore in HA
 setups. Ideally, I would span the links across separate stacked
 switches.

 As for debugging the issue, I can only recommend to look closely at the
 system and switch logs for clues.

 On 11/06/14 02:55 PM, Schaefer, Micah wrote:
 I have the issue on two of my nodes. Each node has 1ea 10gb
 connection.
 No
 bonding, single link. What else can I look at? I manage the network
 too. I don't see any link down notifications, don't see any errors on
 the ports.




 On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote:

 On 11/06/14 02:21 PM, Schaefer, Micah wrote:
 It failed again, even after deleting all the other failover domains.

 Cluster conf
 http://pastebin.com/jUXkwKS4

 I turned corosync output to debug. How can I go about
 troubleshooting if  it really is a network issue or something else?



 Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11
 14:10:17 corosync [TOTEM ] A processor failed, forming new
 configuration.
 Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29
 corosync [TOTEM ] A processor joined or left the membership and a
 new membership was formed.
 Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
 ip(10.70.100.101) ; members(old:4 left:1)

 This is, to me, *strongly* indicative of a network issue. It's not
 likely switch-wide as only one member was lost, but I would
 certainly put my money on a network problem somewhere, some how.

 Do you use bonding?

 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/ What if the cure for
 cancer is trapped in the mind of a person without access to
 education?

 --
 Linux-cluster mailing list
 Linux-cluster@redhat.com
 https://www.redhat.com/mailman/listinfo/linux-cluster




 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer
 is trapped in the mind of a person without access to education?

 --
 Linux-cluster mailing list
 Linux-cluster@redhat.com
 https://www.redhat.com/mailman/listinfo/linux-cluster




 -- 
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?

 -- 
 Linux-cluster mailing list
 Linux-cluster@redhat.com
 https://www.redhat.com/mailman/listinfo/linux-cluster
 
 


-- 
Digimer
Papers and Projects: https://alteeve.ca

Re: [Linux-cluster] Node is randomly fenced

2014-06-12 Thread Digimer
On 12/06/14 12:33 PM, yvette hirth wrote:
 On 06/12/2014 08:32 AM, Schaefer, Micah wrote:
 
 Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
 fenced, then node3 was fenced when node4 came back online. The network
 topology is as follows:
 switch1: node1, node3 (two connections)
 switch2: node2, node4 (two connections)
 switch1 ― switch2
 All on the same subnet

 I set up monitoring at 100 millisecond of the nics in active-backup mode,
 and saw no messages about link problems before the fence.

 I see multicast between the servers using tcpdump.

 Any more ideas?
 
 spanning-tree scans/rebuilds happen on 10Gb circuits just like they do
 on 1Gb circuits, and when they happen, traffic on the switches *can*
 come to a grinding halt, depending upon the switch firmware and the type
 of spanning-tree scan/rebuild being done.
 
 you may want to check your switch logs to see if any spanning-tree
 rebuilds were being done at the time of the fence.
 
 just an idea, and hth
 yvette hirth

When I've seen this (I now disable STP entirely), it blocks all traffic
so I would expect multiple/all nodes to partition off on their own.
Still, worth looking into. :)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Node is randomly fenced

2014-06-12 Thread Digimer
On 12/06/14 12:48 PM, Schaefer, Micah wrote:
 As far as the switch goes, both are Cisco Catalyst 6509-E, no spanning
 tree changes are happening and all the ports have port-fast enabled for
 these servers. My switch logging level is very high and I have no messages
 in relation to the time frames or ports.
 
 TOTEM reports that “A processor joined or left the membership…”, but that
 isn’t enough detail.
 
 Also note that I did not have these issues until adding new servers: node3
 and node4 to the cluster. Node1 and node2 do not fence each other (unless
 a real issue is there), and they are on different switches.

Then I can't imagine it being network anymore. Seeing as both node 3 and
4 get fenced, it's likely not hardware either. Are the workloads on 3
and 4 much higher (or are the computers much slower) than 1 and 2? I'm
wondering if the nodes are simply not keeping up with corosync traffic.
You might try adjusting the corosync token timeout and retransmit counts
to see if that reduces the node losses.
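
As a sketch only (the values are illustrative, not recommendations; verify the
attribute names against your cluster.conf schema), the tuning goes on the totem
element in cluster.conf:

#   <totem token="20000" token_retransmits_before_loss_const="10"/>
ccs_config_validate
# then restart cman node by node so corosync picks up the new totem values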

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-12 Thread Digimer
:44:53 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] got commit token
Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq received 86
Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334
Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state.
Jun 12 14:44:54 corosync [TOTEM ] got commit token
Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state.
Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104:
Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep 10.70.100.101
Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received flag 1
Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages
in recovery.
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
flag0 retrans queue empty 1 count 0, aru 
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
flag0 retrans queue empty 1 count 1, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
flag0 retrans queue empty 1 count 2, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
flag0 retrans queue empty 1 count 3, aru 0
Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0 install
seq 0 aru 0 0
Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state
Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0
Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1
Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state.
Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
flushing membership messages.
Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms,
flushing membership messages.









On 6/12/14, 1:55 PM, Schaefer, Micah micah.schae...@jhuapl.edu wrote:


I just found that the clock on node1 was off by about a minute and a half
compared to the rest of the nodes.

I am running ntp, so not sure why the time wasn’t synced up. Wonder if
node1 being behind, would think it was not receiving updates from the
other nodes?







On 6/12/14, 1:29 PM, Digimer li...@alteeve.ca wrote:


Even if the token changes stop the immediate fencing, don't leave it
please. There is something fundamentally wrong that you need to
identify/fix.

Keep us posted!

On 12/06/14 01:24 PM, Schaefer, Micah wrote:

The servers do not run any tasks other than the tasks in the cluster
service group.

Nodes 3 and 4 are physical servers with a lot of horsepower, and nodes 1
and 2 are virtual machines with far fewer resources.

Re: [Linux-cluster] 2-node cluster fence loop

2014-06-11 Thread Digimer

On 11/06/14 10:48 AM, Arun G Nair wrote:

Hello,

What are the reasons for fence loops when only cman is started ? We
have an RHEL 6.5 2-node cluster which goes in to a fence loop and every
time we start cman on both nodes. Either one fences the other. Multicast
seems to be working properly. My understanding is that without rgmanager
running there won't be a multicast group subscription ? I don't see the
multicast address in 'netstat -g' unless rgmanager is running. I've
tried to increase the fence post_join_delay but one of the nodes still
gets fenced.

The cluster works fine if we use unicast UDP.

Thanks,


Hi,

  When cman starts, it waits post_join_delay seconds for the peer to 
connect. If the peer has not joined after that time expires (6 seconds by 
default, iirc), it gives up and calls a fence against the peer to put it into a known state.


  Corosync is what determines membership, and it is started by cman. 
The rgmanager only handles resource start/stop/relocate/recovery and has 
nothing to do with fencing directly. Corosync is what uses multicast.


  So as you seem to have already surmised, multicast is probably not 
working in your environment. Have you enabled multicast traffic on the 
firewall? Do your switches support multicast properly?
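
Two quick things worth checking, sketched below (node names are placeholders):
that the firewall passes corosync's traffic, and that multicast actually flows
between the nodes:

# corosync uses UDP ports 5404/5405 by default
iptables -I INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
iptables -I INPUT -p udp -m multiport --dports 5404,5405 -j ACCEPT
# test multicast end to end (omping is packaged for RHEL 6); run on both nodes
omping node1 node2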


digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-11 Thread Digimer
On 11/06/14 02:21 PM, Schaefer, Micah wrote:
 It failed again, even after deleting all the other failover domains.
 
 Cluster conf
 http://pastebin.com/jUXkwKS4
 
 I turned corosync output to debug. How can I go about troubleshooting if
 it really is a network issue or something else?
 
 
 
 Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
 Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
 configuration.
 Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
 Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
 membership and a new membership was formed.
 Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
 ip(10.70.100.101) ; members(old:4 left:1)

This is, to me, *strongly* indicative of a network issue. It's not
likely switch-wide as only one member was lost, but I would certainly
put my money on a network problem somewhere, some how.

Do you use bonding?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-11 Thread Digimer
The first thing I would do is get a second NIC and configure 
active-passive bonding. network issues are too common to ignore in HA 
setups. Ideally, I would span the links across separate stacked switches.
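
As a rough sketch, an active-passive (mode=1) bond on RHEL 6 looks something 
like this (interface names and addresses are examples only):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BONDING_OPTS="mode=1 miimon=100"
  IPADDR=10.70.100.101
  NETMASK=255.255.255.0
  ONBOOT=yes

  # /etc/sysconfig/network-scripts/ifcfg-em1 (and the same for the second slave)
  DEVICE=em1
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes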


As for debugging the issue, I can only recommend to look closely at the 
system and switch logs for clues.


On 11/06/14 02:55 PM, Schaefer, Micah wrote:

I have the issue on two of my nodes. Each node has 1ea 10gb connection. No
bonding, single link. What else can I look at? I manage the network too. I
don't see any link-down notifications, don't see any errors on the ports.




On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote:


On 11/06/14 02:21 PM, Schaefer, Micah wrote:

It failed again, even after deleting all the other failover domains.

Cluster conf
http://pastebin.com/jUXkwKS4

I turned corosync output to debug. How can I go about troubleshooting if
it really is a network issue or something else?



Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:4 left:1)


This is, to me, *strongly* indicative of a network issue. It's not
likely switch-wide as only one member was lost, but I would certainly
put my money on a network problem somewhere, some how.

Do you use bonding?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster






--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-11 Thread Digimer

I considered that, but I would expect more nodes to be lost.

On 12/06/14 12:12 AM, Netravali, Ganesh wrote:

Make sure multicast is enabled across the switches.

-Original Message-
From: linux-cluster-boun...@redhat.com 
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Schaefer, Micah
Sent: Thursday, June 12, 2014 1:20 AM
To: linux clustering
Subject: Re: [Linux-cluster] Node is randomly fenced

Okay, I set up active/ backup bonding and will watch for any change.

This is the network side:
  0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
  0 output errors, 0 collisions, 0 interface resets



This is the server side:

em1   Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
   inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
   inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
   RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
   TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000
   RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
   Interrupt:34 Memory:d500-d57f



I need to run some fiber, but for now two nodes are plugged into one switch and 
the other two nodes into a separate switch that are on the same subnet. I'll 
work on cross connecting the bonded interfaces to different switches.



On 6/11/14, 3:28 PM, Digimer li...@alteeve.ca wrote:


The first thing I would do is get a second NIC and configure
active-passive bonding. network issues are too common to ignore in HA
setups. Ideally, I would span the links across separate stacked switches.

As for debugging the issue, I can only recommend to look closely at the
system and switch logs for clues.

On 11/06/14 02:55 PM, Schaefer, Micah wrote:

I have the issue on two of my nodes. Each node has 1ea 10gb connection.
No
bonding, single link. What else can I look at? I manage the network
too. I don't see any link-down notifications, don't see any errors on
the ports.




On 6/11/14, 2:29 PM, Digimer li...@alteeve.ca wrote:


On 11/06/14 02:21 PM, Schaefer, Micah wrote:

It failed again, even after deleting all the other failover domains.

Cluster conf
http://pastebin.com/jUXkwKS4

I turned corosync output to debug. How can I go about
troubleshooting if  it really is a network issue or something else?



Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4 Jun 11
14:10:17 corosync [TOTEM ] A processor failed, forming new
configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3 Jun 11 14:10:29
corosync [TOTEM ] A processor joined or left the membership and a
new membership was formed.
Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
ip(10.70.100.101) ; members(old:4 left:1)


This is, to me, *strongly* indicative of a network issue. It's not
likely switch-wide as only one member was lost, but I would
certainly put my money on a network problem somewhere, some how.

Do you use bonding?

--
Digimer
Papers and Projects: https://alteeve.ca/w/ What if the cure for
cancer is trapped in the mind of a person without access to
education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster






--
Digimer
Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer
is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster






--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Node is randomly fenced

2014-06-04 Thread Digimer

On 04/06/14 10:59 AM, Schaefer, Micah wrote:

I have a 4 node cluster, running a single service group. I have been
seeing node1 fence node3 while node3 is actively running the service group
at random intervals.

Rgmanager logs show no failures in service checks, and no other logs
provide any useful information. How can I go about finding out why node1
is fencing node3?

I currently set up the failover domain to be restricted and not include
node3.

cluster.conf : http://pastebin.com/xYy6xp6N


Random fencing is almost always caused by network failures. Can you look 
are the system logs, starting a little before the fence and continuing 
until after the fence completes, and paste them here? I suspect you will 
see corosync complaining.


If this is true, do your switches support persistent multicast? Do you 
use active/passive bonding? Have you tried different switch/cable/NIC?


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] fence_ipmilan / custom hardware target address (ipmitool -t hexaddr)

2014-05-15 Thread Digimer

On 15/05/14 02:39 PM, Jeff Johnson wrote:

Greetings,

I am looking to adapt fence_ipmilan to interact with a custom
implementation of an IPMI BMC. Doing so requires the use of ipmitool's
-t option to bridge IPMI requests to a specified internal
(non-networked) hardware address.

I do not see this option existing in fence_ipmilan or any of the other
fence_agents modules.

The ipmitool operation would be '/path/to/ipmitool -t 0x42 chassis power
operation'. No network, IP, Auth, User, Password or other arguments
required.

I want to check with the developers to see if there is an existing path
for this use case before submitting a patch for consideration.

Thanks,

--Jeff


Marek Grac, who I've cc'ed here, would be the best person to give advice 
on this.


As a user, I think a simple patch to add your option would be fine. I do 
not believe (though stand to be corrected) that address, user or 
password is currently required with fence_ipmilan.


If I am wrong and it is required, then perhaps forking fence_ipmilan to 
something like fence_ipmihw (or whatever) and then pushing it out as a 
new agent should be easy and could work.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] clusvcadm

2014-05-07 Thread Digimer

On 07/05/14 03:05 PM, Paras pradhan wrote:

Well, I have a qdisk with 3 votes. That's why it is 6.

Here is the log. I see some GFS hung but no issue with GFS mounts at
this time.

http://pastebin.com/MP4BF86c

I am seeing this in clumond.log; not sure if this is related or what it is.

Mon May  5 21:58:20 2014 clumond: Peer (vprd3.domain): pruning queue
23340-11670

Tue May  6 01:38:57 2014 clumond: Peer (vprd3.domain): pruning queue
23340-11670

Tue May  6 01:39:02 2014 clumond: Peer (vprd1.domain): pruning queue
23340-11670


Thanks
Paras


Was there a failed fence action prior to this? If so, DLM is probably 
blocked.


Can you post your logs starting from just prior to the network interruption?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] KVM Live migration when node's FS is read-only

2014-04-15 Thread Digimer

Hi all,

  So I hit a weird issue last week... (EL6 + cman + rgmanager + drbd)

  For reasons unknown, a client thought they could start yanking and 
replacing hard drives on a running node. Obviously, that did not end 
well. The VMs that had been running on the node continued to operate 
fine and they just started using the peer's storage.


  The problem came when I tried to live-migrate the VMs over to the 
still-good node. Obviously, the old host couldn't write to logs, and the 
live-migration failed. Once the migration failed, rgmanager also stopped 
working. In the end, I had to manually fence the node 
(corosync never failed, so it didn't get automatically fenced).


  This obviously caused the VMs running on the node to reboot, causing 
a ~40 second outage. It strikes me that the system *should* have been 
able to migrate, had it not tried to write to the logs.


  Is there a way, or can there be made a way, to migrate VMs off of a 
node whose underlying FS is read-only/corrupt/destroyed, so long as the 
programs in memory are still working?


  I am sure this is part a part rgmanager, part KVM/qemu question.

Thanks for any feedback!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Simple data replication in a cluster

2014-04-03 Thread Digimer

On 03/04/14 04:58 PM, Vallevand, Mark K wrote:

I’m looking for a simple way to replicate data within a cluster.

It looks like my resources will be self-configuring and may need to push
changes they see to all nodes in the cluster.  The idea being that when
a node crashes, the resource will have its configuration present on the
node on which it is restarted.  We’re talking about a few kb of data,
probably in one file, probably text.  A typical cluster would have
multiple resources (more than two), one resource per node and one extra
node.

Ideas?

Could I use the CIB directly to replicate data?  Use cibadmin to update
something and sync?

How big can a resource parameter be?  Could a resource modify its
parameters so that they are replicated throughout the cluster?

Is there a simple file replication Resource Agent?

DRBD seems like overkill.

Regards.
Mark K Vallevand mark.vallev...@unisys.com


If you don't want to use DRBD + gfs2 (what I use), then you'll probably 
want to look at corosync directly for keeping the data in sync. 
Pacemaker itself is a cluster resource manager and I don't think the cib 
is well suited for general data sync'ing.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS2 unformat helper tool

2014-03-30 Thread Digimer

On 30/03/14 10:34 AM, Hamid Jafarian wrote:

Hi,

We developed GFS2 volume unformat helper tool.
Read about this code at:
http://pdnsoft.com/en/web/pdnen/blog/-/blogs/gfs2-unformat-helper-tool-1

Regards


Thanks for sharing this!

Madi

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] unformat gfs2

2014-03-22 Thread Digimer

That is very good news! Now, about your backups... ;)

Look forward to seeing your code!

digimer

On 22/03/14 04:13 PM, Mr.Pine wrote:

Good news for all :
I successfully recovered all of my data (1.5 TB) without even one bit lost!
My program took only 1 hour to do all the jobs on my 1.7 TB
partition. (I could not wait 100 days for my bash script to finish.)

I will publish my  source code very soon for the public use.

Special thanks to Bob for the help.

Mr.Pine

On Wed, Mar 19, 2014 at 4:58 PM, Bob Peterson rpete...@redhat.com wrote:

- Original Message -

Hi,

The script is very, very slow, so I should write a program in C/C++.

I need some confidence about data structures and data location on disk.
As i reviewed blocks of data:

All reserved blocks (GFS2-specific blocks) start with: 0x01161970 (the GFS2 magic number).
The block type is stored at byte #8.
The type of the first block of each resource group is 2.
Bitmaps are in block types 2 & 3.
In block type 2, the bitmap info starts from byte #129.
In block type 3, the bitmap info starts from byte #25.
The length of RGs is constant, 5 in my volume (output of gfs2_edit -p rindex
/dev/..)

Is this info right?

Logic of my program seams should be like this:

(1)
Loop over the device and temporarily store the block IDs of dinode blocks,
along with their bitmap locations.

(2)
Change the bitmap entries of those blocks to 3 (binary 11)

Bob, could you confirm this?

Regards
Pine.


Hi Pine,

This is correct. The length of RGs is properly determined by the values
in the rindex system file, but 5 is very common, and is usually constant.
(It may change if you used gfs2_grow or gfs2_convert from gfs1).
The bitmap is 2 bits per block in the resource group, and it's relative
to the start of the particular rgrp. You should probably use the same
algorithm in libgfs2 to change the proper bit in the bitmaps. You can
get this from the public gfs2-utils git tree.
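
In rough shell terms, the byte/bit math for one block (relative to the rgrp,
and assuming its entry falls in the first bitmap block where the bitmap starts
at byte 129, as above; the bit ordering within the byte still needs to be
checked against libgfs2) would be something like:

  idx=12345                     # example block index, relative to the rgrp
  byte=$(( 129 + idx / 4 ))     # 2 bits per block => 4 blocks per byte
  shift=$(( (idx % 4) * 2 ))    # bit position of this block's 2-bit entry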

Regards,

Bob Peterson
Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-20 Thread Digimer

On 18/03/14 09:27 PM, Digimer wrote:

Hi all,

   I would like to tell rgmanager to give more time for VMs to stop. I
want this:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>

via:

ccs -h localhost --activate --sync --password secret \
  --addvm vm01-win2008 \
  --domain=primary_n01 \
  path=/shared/definitions/ \
  autostart=0 \
  exclusive=0 \
  recovery=restart \
  max_restarts=2 \
  restart_expire_time=600

I'm hoping it's a simple additional switch. :)

Thanks!


As per the request on #linux-cluster, I have opened a rhbz for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1079032

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-20 Thread Digimer

On 20/03/14 03:31 PM, Digimer wrote:

On 18/03/14 09:27 PM, Digimer wrote:

Hi all,

   I would like to tell rgmanager to give more time for VMs to stop. I
want this:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>

via:

ccs -h localhost --activate --sync --password secret \
  --addvm vm01-win2008 \
  --domain=primary_n01 \
  path=/shared/definitions/ \
  autostart=0 \
  exclusive=0 \
  recovery=restart \
  max_restarts=2 \
  restart_expire_time=600

I'm hoping it's a simple additional switch. :)

Thanks!


As per the request on #linux-cluster, I have opened a rhbz for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1079032


Split the rgmanager section out:

https://bugzilla.redhat.com/show_bug.cgi?id=1079039

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-19 Thread Digimer

On 19/03/14 06:31 PM, Chris Feist wrote:

On 03/18/2014 08:27 PM, Digimer wrote:

Hi all,

   I would like to tell rgmanager to give more time for VMs to stop. I
want this:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>

via:

ccs -h localhost --activate --sync --password secret \
  --addvm vm01-win2008 \
  --domain=primary_n01 \
  path=/shared/definitions/ \
  autostart=0 \
  exclusive=0 \
  recovery=restart \
  max_restarts=2 \
  restart_expire_time=600

I'm hoping it's a simple additional switch. :)


Unfortunately currently ccs doesn't support setting resource actions.
However it's my understanding that rgmanager doesn't check timeouts
unless __enforce_timeouts is set to 1.  So you shouldn't be seeing a
vm resource go to failed if it takes a long time to stop.  Are you
trying to make the vm resource fail if it takes longer than 10 minutes
to stop?


I was afraid you were going to say that. :(

The problem is that after calling 'disable' against the VM service, 
rgmanager waits two minutes. If the service isn't closed in that time, 
the server is forced off (at least, this was the behaviour when I last 
tested this).


The concern is that, by default, windows installs queue updates to 
install when the system shuts down. During this time, windows makes it 
very clear that you should not power off the system during the updates. 
So if this timer is hit, and the VM is forced off, the guest OS can be 
damaged.


Of course, we can debate the (lack of) wisdom of this behaviour, and I 
already document this concern (and even warn people to check for updates 
before stopping the server), it's not sufficient. If a user doesn't read 
the warning, or simply forgets to check, the consequences can be 
non-trivial.


If ccs can't be made to add this attribute, and if the behaviour 
persists (I will test shortly after sending this reply), then I will 
have to edit the cluster.conf directly, something I am loath to do if at 
all avoidable.
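
If I do end up hand-editing, the sequence would presumably be the usual one
(paths and commands are the stock RHEL 6 ones):

  vi /etc/cluster/cluster.conf   # add the action child and bump config_version
  ccs_config_validate            # sanity-check the edited file
  cman_tool version -r           # push the new config to the peer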


Cheers

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-19 Thread Digimer

On 19/03/14 07:45 PM, Digimer wrote:

On 19/03/14 06:31 PM, Chris Feist wrote:

On 03/18/2014 08:27 PM, Digimer wrote:

Hi all,

   I would like to tell rgmanager to give more time for VMs to stop. I
want this:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>

via:

ccs -h localhost --activate --sync --password secret \
  --addvm vm01-win2008 \
  --domain=primary_n01 \
  path=/shared/definitions/ \
  autostart=0 \
  exclusive=0 \
  recovery=restart \
  max_restarts=2 \
  restart_expire_time=600

I'm hoping it's a simple additional switch. :)


Unfortunately currently ccs doesn't support setting resource actions.
However it's my understanding that rgmanager doesn't check timeouts
unless __enforce_timeouts is set to 1.  So you shouldn't be seeing a
vm resource go to failed if it takes a long time to stop.  Are you
trying to make the vm resource fail if it takes longer than 10 minutes
to stop?


I was afraid you were going to say that. :(

The problem is that after calling 'disable' against the VM service,
rgmanager waits two minutes. If the service isn't closed in that time,
the server is forced off (at least, this was the behaviour when I last
tested this).

The concern is that, by default, windows installs queue updates to
install when the system shuts down. During this time, windows makes it
very clear that you should not power off the system during the updates.
So if this timer is hit, and the VM is forced off, the guest OS can be
damaged.

Of course, we can debate the (lack of) wisdom of this behaviour, and I
already document this concern (and even warn people to check for updates
before stopping the server), it's not sufficient. If a user doesn't read
the warning, or simply forgets to check, the consequences can be
non-trivial.

If ccs can't be made to add this attribute, and if the behaviour
persists (I will test shortly after sending this reply), then I will
have to edit the cluster.conf directly, something I am loath to do if at
all avoidable.

Cheers


Confirmed;

I called disable on a VM with gnome running, so that I could abort the 
VM's shut down.


an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been 
a windows guest in the middle of installing updates, it would be highly 
likely to be screwed now.


To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2"
 name="vm01-rhel6" path="/shared/definitions/" recovery="restart"
 restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, much 
less likely to interrupt an in-progress OS update (truth be told, I plan 
to set 30 minutes).


I understand that this blocks other processes, but in an HA environment, 
I'd strongly argue that safe > speed.


digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-19 Thread Digimer

On 19/03/14 10:12 PM, Pavel Herrmann wrote:

Hi

On Wednesday 19 of March 2014 21:26:56 Digimer wrote:

On 19/03/14 07:45 PM, Digimer wrote:

On 19/03/14 06:31 PM, Chris Feist wrote:

On 03/18/2014 08:27 PM, Digimer wrote:

Hi all,

I would like to tell rgmanager to give more time for VMs to stop. I

want this:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>

via:

ccs -h localhost --activate --sync --password secret \

   --addvm vm01-win2008 \
   --domain=primary_n01 \
   path=/shared/definitions/ \
   autostart=0 \
   exclusive=0 \
   recovery=restart \
   max_restarts=2 \
   restart_expire_time=600

I'm hoping it's a simple additional switch. :)


Unfortunately currently ccs doesn't support setting resource actions.
However it's my understanding that rgmanager doesn't check timeouts
unless __enforce_timeouts is set to 1.  So you shouldn't be seeing a
vm resource go to failed if it takes a long time to stop.  Are you
trying to make the vm resource fail if it takes longer than 10 minutes
to stop?


I was afraid you were going to say that. :(

The problem is that after calling 'disable' against the VM service,
rgmanager waits two minutes. If the service isn't closed in that time,
the server is forced off (at least, this was the behaviour when I last
tested this).

The concern is that, by default, windows installs queue updates to
install when the system shuts down. During this time, windows makes it
very clear that you should not power off the system during the updates.
So if this timer is hit, and the VM is forced off, the guest OS can be
damaged.

Of course, we can debate the (lack of) wisdom of this behaviour, and I
already document this concern (and even warn people to check for updates
before stopping the server), it's not sufficient. If a user doesn't read
the warning, or simply forgets to check, the consequences can be
non-trivial.

If ccs can't be made to add this attribute, and if the behaviour
persists (I will test shortly after sending this reply), then I will
have to edit the cluster.conf directly, something I am loath to do if at
all avoidable.

Cheers


Confirmed;

I called disable on a VM with gnome running, so that I could abort the
VM's shut down.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been
a windows guest in the middle of installing updates, it would be highly
likely to be screwed now.


Is this really the best way to handle such an event?

 From what I remember, Windows can (or could, I don't have any 'modern' windows
laying around) be told to shutdown without updating. maybe a wiser approach
would be to make the stop event (which I believe is delivered to the guest as
pressing the ACPI power button) trigger a shutdown without updates.

keep in mind that doing system updates on timer is dangerous, irrelevant of
the actual time

regards
Pavel Herrmann


This assumes that we can modify how windows behaves. Unless there is a 
magic ACPI event that windows will reliably interpret as power off 
without updating, we can't rely on this.


We have clients (and I am sure we aren't the only ones) who install 
their own OSes without any input from us. As mentioned earlier, we do 
document the risks, but that's not good enough. We can't force users to 
read.


So we have a choice; Take mitigating steps or let the user shoot 
themselves in the foot because they should have known better. As 
personally satisfying as option #2 might seem, option #1 is the more 
professional approach, I would _strongly_ argue.


digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] unformat gfs2

2014-03-18 Thread Digimer

On 18/03/14 09:38 AM, Mr.Pine wrote:

I have accidentally reformatted a GFS cluster.
We need to unformat it... is there any way to recover the disk?

I read this post
http://web.archiveorange.com/archive/v/TUhSn11xEn9QxXBIZ0k6

it say that I can use gfs2_edit to recover data.
I need more details about changing block map to 0xff

tnx


Do you have a support agreement with Red Hat? If so, open a ticket with 
them. If not, then you can try also asking for help in freenode's 
#linux-cluster channel. It says no gfs support, but that's to prevent 
confusion with tracking open tickets, which won't apply if you don't 
have official red hat support.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

2014-03-18 Thread Digimer

Hi all,

  I would like to tell rgmanager to give more time for VMs to stop. I 
want this:


<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>

I already use ccs to create the entry:

<vm name="vm01-win2008" domain="primary_n01" autostart="0"
 path="/shared/definitions/" exclusive="0" recovery="restart"
 max_restarts="2" restart_expire_time="600"/>


via:

ccs -h localhost --activate --sync --password secret \
 --addvm vm01-win2008 \
 --domain=primary_n01 \
 path=/shared/definitions/ \
 autostart=0 \
 exclusive=0 \
 recovery=restart \
 max_restarts=2 \
 restart_expire_time=600

I'm hoping it's a simple additional switch. :)

Thanks!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] [Linux-HA] Problems with fence_apc agent and accessing APC AP8965

2014-02-23 Thread Digimer
Does 'fence_apc_snmp -a hac-pdu1 -n 1 -o status' work? What about 
'fence_apc -a hac-pdu1 -l user -p passwd -n 1 -o status'?


digimer

On 23/02/14 08:57 PM, Andrew Beekhof wrote:

Forwarding to linux-cluster which has more people knowledgeable on this set of 
fencing agents.

On 22 Feb 2014, at 12:38 am, Tony Stocker tony.stoc...@nasa.gov wrote:



All,

I have a bigger issue regarding dual power supplies and fence_apc that I'm 
going to eventually need to resolve.  But at this point I'm simply having basic 
issues getting the fence_apc agent to be able to access the devices in general, 
to wit:

# fence_apc --ssh --ip=hac-pdu1 --plug=1 --username=blah --password=blah 
--verbose --action=status
Unable to connect/login to fencing device


However I can manually SSH into the device just fine:

# ssh  blah@hac-pdu1
blah@hac-pdu1's password:


American Power Conversion   Network Management Card AOS v5.1.9
(c) Copyright 2010 All Rights Reserved  RPDU 2g v5.1.6
---
Name  : hac-pdu1  Date : 02/21/2014
Contact   : syst...@mail.myserver123.com  Time : 13:12:02
Location  : C101, HAC Rack 1  User : Administrator
Up Time   : 223 Days 17 Hours 0 Minutes   Stat : P+ N4+ N6+ A+


Type ? for command listing
Use tcpip command for IP address(-i), subnet(-s), and gateway(-g)

apc


So perhaps the place to start first is simply getting the fence_apc agent 
(provided by CentOS/RHEL package fence-agents-3.1.5-35.el6_5.3.x86_64) to 
actually be able to work correctly.  Once that's done, I'll still need help on 
the dual power supply issue.

I'm not seeing any attempts to login in the APC's logs file, though I do see 
connections when I manually login, e.g.:
02/21/2014  13:13:11System: Console user 'apc' logged out from 
192.168.1.216.
02/21/2014  13:12:40System: Console user 'apc' logged in from 
192.168.1.216.

A manual 'telnet [name] 22' command also works fine from the command line:
# telnet hac-pdu1 22
Trying 192.168.1.222...
Connected to hac-1-pdu1 (192.168.1.222).
Escape character is '^]'.
SSH-2.0-cryptlib


However fence_apc_snmp **does** seem to work:

# fence_apc_snmp --snmp-version=1 --community=public --ip=hac-pdu1 --plug=1 
--username=blah --password=blah --verbose --action=status
/usr/bin/snmpwalk -m '' -Oeqn  -v '1' -c 'public' 'hac-pdu1:161' 
'.1.3.6.1.2.1.1.2.0'
No log handling enabled - turning on stderr logging
Created directory: /var/lib/net-snmp/mib_indexes
.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.318.1.3.4.6

Trying APC Master Switch (fallback)
/usr/bin/snmpget -m '' -Oeqn  -v '1' -c 'public' 'hac-pdu1:161' 
'.1.3.6.1.4.1.318.1.1.4.4.2.1.3.1'
.1.3.6.1.4.1.318.1.1.4.4.2.1.3.1 1

Status: ON


Does anyone have any ideas as to why fence_apc is not working?


Thanks!
Tony

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

___
Linux-HA mailing list
linux...@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems







--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] CLVM CMAN live adding nodes

2014-02-22 Thread Digimer
This is not true. I change things outside the rm tags often without 
restarting the cluster. It would be a significant flaw if that were the 
case.
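
For what it's worth, the rough sequence I would expect to use (hostnames,
addresses and credentials are placeholders, and the new node needs a working
fence method defined before it joins):

  ccs -h node1 --addnode node3.example.com
  ccs -h node1 --addfencedev fence_n03 agent=fence_ipmilan \
      ipaddr=10.20.1.3 login=admin passwd=secret lanplus=1
  ccs -h node1 --addmethod ipmi node3.example.com
  ccs -h node1 --addfenceinst fence_n03 node3.example.com ipmi
  ccs -h node1 --sync --activate
  # then start cman and clvmd on the new node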


On 22/02/14 04:33 AM, emmanuel segura wrote:

I know that if you need to modify anything outside the <rm>...</rm> tags (used
by rgmanager) in the cluster.conf file, you need to restart the whole
cluster stack; with cman+rgmanager I have never seen how to add a node to,
or remove a node from, the cluster without restarting cman.




2014-02-22 6:21 GMT+01:00 Bjoern Teipel
bjoern.tei...@internetbrands.com:

Hi all,

who's using CLVM with CMAN in a cluster with more than 2 nodes in
production ?
Did you manage to get it to live-add a new node to the cluster
while everything is running?
I'm only able to add nodes while the cluster stack is shutdown.
That's certainly not a good idea when you have to run CLVM on
hypervisors and you need to shutdown all VMs to add a new box.
It would also be good if you could paste some of your configs using IPMI fencing.

Thanks in advance,
Bjoern

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




--
this is my life and I live it for as long as God wills





--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] Creating clustered LVM snapshots, locking and exclusivity

2014-02-20 Thread Digimer

Hi all,

  I want to get clustered LV snapshotting working. I was under the 
impression it was simply a matter of disabling the LV on the other node 
(2-node cluster here). However, this fails because of locking issues.


  I can change the peer node's LV to 'inactive' with (confirmed with 
lvscan):


[root@an-c05n01 ~]# lvchange -aln /dev/an-c05n02_vg0/vm01-rhel2_0
[root@an-c05n01 ~]# lvscan

  inactive  '/dev/an-c05n02_vg0/vm01-rhel2_0' [50.00 GiB] inherit

  But I still can't create snapshot on the other node running the VM:

[root@an-c05n02 ~]# lvcreate -L 25GiB --snapshot -n 
vm01-rhel2_0_snapshot /dev/an-c05n02_vg0/vm01-rhel2_0

  vm01-rhel2_0 must be active exclusively to create snapshot

So I try to set it exclusive:

[root@an-c05n02 ~]# lvchange -aey /dev/an-c05n02_vg0/vm01-rhel2_0
  Error locking on node an-c05n02.alteeve.ca: Device or resource busy

If I stop the VM running on the LV, then I can set the exclusive lock, 
boot the VM and later create the snapshot fine:


[root@an-c05n02 ~]# lvcreate -L 25GiB --snapshot -n 
vm01-rhel2_0_snapshot /dev/an-c05n02_vg0/vm01-rhel2_0

  Logical volume vm01-rhel2_0_snapshot created

But then later, I can't remove the exclusive value, so I can't re-activate 
the LV after deleting the snapshot. I have to shut the VM down again in 
order to remove the exclusive flag.


I'm assuming it's possible to snapshot clustered LVs while they're in 
use, without stopping what is using them twice... Can someone help 
clarify what the magical incantation is?


Thanks!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] PowerEdge R610 idrac express fencing

2014-02-19 Thread Digimer

On 19/02/14 07:30 PM, Michael Mendoza wrote:

Good afternoon.

We are trying to configure 2 dell R610 with idrac6 EXPRESS in cluster
with redhat 5.10 x64.

For testing we are using the command fence_ipmilan. We can ping idrac on
the remote host.

fence_ipmilan -a X.X.X.X -l usern -p   -t 200  -o status  -v   -- works


Over 3 minutes to confirm a fence action is an extremely long time!


fence_ipmilan -a X.X.X.X -l usern -p   -t 200  -o reboot -v

The problem is that the server reboots, but while it reboots the iDRAC6
reboots too, so host A loses the connection after approximately 120
seconds and gets the following message.


So you're saying that the IPMI interface, after rebooting the host, 
fails to respond for two full minutes? That strikes me as a reason to 
call Dell and ask for help. That can't be normal.



Spawning: '/usr/bin/ipmitool -I lan -H 'X.X.X.X' -U 'usern' -P
'[set]' -v -v -v chassis power on'...
Spawned: '/usr/bin/ipmitool -I lan -H 'X.X.X.X' -U 'usern' -P
'[set]' -v -v -v chassis power on' - PID 10104
Looking for:
'Password:', val = 1
'Unable to establish LAN', val = 11
'IPMI mutex', val = 14
'Unsupported cipher suite ID', val = 2048
'read_rakp2_message: no support for', val = 2048
'Up/On', val = 0
ExpectToken returned 11
Reaping pid 10104
Failed


cman version is CMAN-2.0.115.118.e15_10.3


This is an old existing cluster, or a new one you're trying to build?


however I have another host with CentOS 6.4 and cman 3.0... and the
connection is not lost. I run the same command, the server reboots as
well as the iDRAC, the ping comes back and the IPMI connection is not lost.


Are these nodes in the same cluster? cman 2 and 3 are only designed to 
work together in maintenance mode, for rolling upgrades.



Am I doing something wrong? I used the -t and -T options, even with 300/400,
and it doesn't matter; the connection is cut after 120 seconds. On CentOS it
works fine. (I already opened a case with Red Hat and am waiting for an
answer.)
Thanks


It might be that the 120 second upper limit is a bug.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] What condition would cause clvmd to exit with '143' when status called?

2014-02-16 Thread Digimer

I hit this in a program I use to monitor 'clvmd' on a local and peer node:


989; [ DEBUG ] - get_daemon_state(); daemon: [clvmd], node: [peer]
1002; [ DEBUG ] - shell call: [/usr/bin/sshr...@an-c07n02.alteeve.ca 
/etc/init.d/clvmd status; echo clvmd:\$?]

1019; [ DEBUG ] - line: [clvmd (pid 4114 4098) is running...]
1011; [ DEBUG ] - peer::daemon::clvmd::rc: [143]
1019; [ DEBUG ] - line: [bash: line 1:  4096 Terminated 
/etc/init.d/clvmd status]

Daemon: [clvmd] is in an unknown state on: [an-c07n02.alteeve.ca].
Status return code was: [143].


The line numbers and debug messages are my program, not clvmd. Any idea 
why this would happen?


This was from a node that had just been intentionally crashed, was 
fenced and booted back up.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Question about cluster behavior

2014-02-14 Thread Digimer

Replies in-line:

On 14/02/14 12:07 PM, FABIO FERRARI wrote:

So it's not a normal behavior, I guess.

Here is my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="59" name="mail">
    <clusternodes>
        <clusternode name="eta.mngt.unimo.it" nodeid="1">
            <fence>
                <method name="fence-eta">
                    <device name="fence-eta"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="beta.mngt.unimo.it" nodeid="2">
            <fence>
                <method name="fence-beta">
                    <device name="fence-beta"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="guerro.mngt.unimo.it" nodeid="3">
            <fence>
                <method name="fence-guerro">
                    <device name="fence-guerro" port="Guerro" ssl="on"
                     uuid="4213f370-9572-63c7-26e4-22f0f43843aa"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="5"/>


You generally don't need to set this, the cluster can calculate it.


    <quorumd label="mail-qdisk"/>


You don't set any votes, so the default is 1. So with expected votes 
being 5, that means all three nodes have to be up or two nodes and qdisk.
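
Roughly, assuming the qdisk keeps its default single vote:

  quorum needed = floor(expected_votes / 2) + 1 = floor(5 / 2) + 1 = 3
  votes on hand = (3 nodes x 1) + (qdisk x 1) = 4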



    <rm>
        <resources>
            <ip address="155.185.44.61/24" sleeptime="10"/>
            <mysql config_file="/etc/my.cnf" listen_address="155.185.44.61"
             name="mysql" shutdown_wait="10" startup_wait="10"/>
            <script file="/etc/init.d/httpd" name="httpd"/>
            <script file="/etc/init.d/postfix" name="postfix"/>
            <script file="/etc/init.d/dovecot" name="dovecot"/>
            <fs device="/dev/mapper/mailvg-maillv" force_fsck="1"
             force_unmount="1" fsid="58161" fstype="xfs" mountpoint="/cl"
             name="mailvg-maillv" options="defaults,noauto" self_fence="1"/>
            <lvm lv_name="maillv" name="lvm-mailvg-maillv" self_fence="1"
             vg_name="mailvg"/>
        </resources>
        <failoverdomains>
            <failoverdomain name="mailfailoverdomain" nofailback="1"
             ordered="1" restricted="1">
                <failoverdomainnode name="eta.mngt.unimo.it" priority="1"/>
                <failoverdomainnode name="beta.mngt.unimo.it" priority="2"/>
                <failoverdomainnode name="guerro.mngt.unimo.it" priority="3"/>
            </failoverdomain>
        </failoverdomains>
        <service domain="mailfailoverdomain" max_restarts="3"
         name="mailservices" recovery="restart" restart_expire_time="600">
            <fs ref="mailvg-maillv">
                <ip ref="155.185.44.61/24">
                    <mysql ref="mysql">
                        <script ref="httpd"/>
                        <script ref="postfix"/>
                        <script ref="dovecot"/>
                    </mysql>
                </ip>
            </fs>
        </service>
    </rm>
    <fencedevices>
        <fencedevice agent="fence_ipmilan" auth="password"
         ipaddr="155.185.135.105" lanplus="on" login="root"
         name="fence-eta" passwd="**" privlvl="ADMINISTRATOR"/>
        <fencedevice agent="fence_ipmilan" auth="password"
         ipaddr="155.185.135.106" lanplus="on" login="root"
         name="fence-beta" passwd="**" privlvl="ADMINISTRATOR"/>
        <fencedevice agent="fence_vmware_soap" ipaddr="155.185.0.10"
         login="etabetaguerro" name="fence-guerro" passwd="**"/>
    </fencedevices>
</cluster>

What log file do you need? There are many in /var/log/cluster..


By default, /var/log/messages is the most useful. Checking 'cman_tool 
status' and 'clustat' are also good.


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] backup best practice when using Luci

2014-02-10 Thread Digimer

On 10/02/14 09:12 AM, Benjamin Budts wrote:

Ladies & Gents (I won't make that same mistake again ;) ),

First, thank you to the lady who helped explain to me how to force an OK
when fencing fails.

A 2-node config & Luci:

I would like to put a backup solution in place for the cluster config /
node config / fencing etc...

What would you recommend? Or does Luci archive versions of
config files somewhere?

Basically, if shit hits the fan I would like to untar a golden image of
a config on luci and push it back to the nodes…

Thx


The main config file is /etc/cluster/cluster.conf. A copy of this file 
should be on all nodes at once, so even if you didn't have a backup 
proper, you should be able to copy it from another node.


Beyond that, I personally back up the following (sample taken from a node 
called 'an-c05n01'):



mkdir ~/base
cd ~/base
mkdir root
mkdir -p etc/sysconfig/network-scripts/
mkdir -p etc/udev/rules.d/

# Root user
rsync -av /root/.bashrc  root/
rsync -av /root/.ssh root/

# Directories
rsync -av /etc/ssh   etc/
rsync -av /etc/apcupsd   etc/
rsync -av /etc/cluster   etc/
rsync -av /etc/drbd.*etc/
rsync -av /etc/lvm   etc/

# Specific files.
rsync -av /etc/sysconfig/network-scripts/ifcfg-{eth*,bond*,vbr*} 
etc/sysconfig/network-scripts/

rsync -av /etc/udev/rules.d/70-persistent-net.rules etc/udev/rules.d/
rsync -av /etc/sysconfig/network etc/sysconfig/
rsync -av /etc/hosts etc/
rsync -av /etc/ntp.conf  etc/

# Save recreating user accounts.
rsync -av /etc/passwdetc/
rsync -av /etc/group etc/
rsync -av /etc/shadowetc/
rsync -av /etc/gshadow   etc/

# If you have the cluster built and want to back up its configs.
mkdir etc/cluster
mkdir etc/lvm
rsync -av /etc/cluster/cluster.conf etc/cluster/
rsync -av /etc/lvm/lvm.conf etc/lvm/
# NOTE: DRBD won't work until you've manually created the partitions.
rsync -av /etc/drbd.d etc/

# If you had to manually set the UUID in libvirtd;
mkdir etc/libvirt
rsync -av /etc/libvirt/libvirt.conf etc/libvirt/

# If you're running RHEL and want to backup your registration info;
rsync -av /etc/sysconfig/rhn etc/sysconfig/

# Pack it up
# NOTE: Change the name to suit your node.
tar -cvf base_an-c05n01.tar etc root


I then push the resulting tar file to my PXE server. I have a kickstart 
script that does a minimal rhel6 install, plus the cluster stuff, and 
then has a %post script that downloads this tar and extracts it.


This way, when the node needs to be rebuilt, it's 95% ready to go. I 
still need to do things like 'drbdadm create-md res', but it's still 
very quick to restore a node.
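
The %post bit itself is tiny; something along these lines (the URL and tar
name are examples):

  %post
  wget -O /tmp/base.tar http://pxe.example.com/configs/base_an-c05n01.tar
  tar -xvf /tmp/base.tar -C /
  %end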


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage

2014-02-07 Thread Digimer

On 07/02/14 11:13 AM, Benjamin Budts wrote:

Gents,


We're not all gents. ;)


I have a 2-node setup (with quorum disk), redhat 6.5 & a luci mgmt console.

Everything has been configured and we’re doing failover tests now.

Couple of questions I have :

· When I simulate a complete power failure of a server's PDUs (no more
access to iDRAC fencing or APC PDU fencing) I can see that the fencing
of the node that was running the application fails <-- I noticed that unless
fencing returns an OK I'm stuck and my application won't start on my
2nd node. Which is OK I guess, because no fencing could mean there is
still I/O on my SAN.


This is expected. If a lost node can't be put into a known state, there 
is no safe way to proceed. To do so would be to risk a split brain at 
least, and data loss/corruption at worst.


The way I deal with this is to have nodes with redundant power supplies 
and use two PDUs and two UPSes. This way, the failure of one circuit / 
UPS / PDU doesn't knock out the power to the mainboard of the nodes, so 
you don't lose IPMI.
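
If you can add a switched PDU later, a second fence method can back up IPMI;
a sketch with ccs (device name, address, community and port are placeholders):

  ccs -h node1 --addfencedev pdu1 agent=fence_apc_snmp \
      ipaddr=10.20.2.1 community=private
  ccs -h node1 --addmethod pdu node1.example.com
  ccs -h node1 --addfenceinst pdu1 node1.example.com pdu port=1
  ccs -h node1 --sync --activate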



Clustat also shows on the active node that the 1st node is still
running the application.


That's likely because rgmanager uses DLM, and DLM blocks until the fence 
succeeds, so it can't update its view.



How can I intervene manually, so as to force a start of the application
on the node that is still alive ?


If you are *100% ABSOLUTELY SURE* that the lost node has been powered 
off, then you can run 'fence_ack_manual'. Please be super careful about 
this though. If you do this, in the heat of the moment with clients or 
bosses yelling at you, and the peer isn't really off (ie: it's only 
hung), you risk serious problems.


I can not emphasis strongly enough the caution needed when using this 
command.



Is there a way to tell the cluster, don’t take into account node 1
anymore and don’t try to fence anymore, just start the application on
the node that is still ok ?


No. That would risk a split brain and data corruption. The only safe 
option for the cluster, in the face of a failed fence, is to hang. As 
bad as it is to hang, it's better than risking corruption.



I can’t possibly wait until power returns to that server. Downtime could
be too long.


See the solution I mentioned earlier.


·If I tell a node to leave the cluster in Luci, I would like it to
remain a non-cluster member after the reboot of that node. It rejoins
the cluster automatically after a reboot. Any way to prevent this ?

Thx


Don't let cman and rgmanager start on boot. This is always my policy. If 
a node failed and got fenced, I want it to reboot, so that I can log 
into it and figure out what happened, but I do _not_ want it back in the 
cluster until I've determined it is healthy.
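
Concretely, on RHEL 6 nodes that is just:

  chkconfig cman off
  chkconfig rgmanager off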


hth

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine

2014-02-01 Thread Digimer

On 01/02/14 01:35 PM, nik600 wrote:

Dear all

i need some clarification about clustering with rhel 6.4

i have a cluster with 2 node in active/passive configuration, i simply
want to have a virtual ip and migrate it between 2 nodes.

i've noticed that if i reboot or manually shut down a node the failover
works correctly, but if i power-off one node the cluster doesn't
failover on the other node.

Another strange situation is that if i power off all the nodes and then
switch on only one, the cluster doesn't start on the active node.

I've read manual and documentation at

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html

and i've understand that the problem is related to fencing, but the
problem is that my 2 nodes are on 2 virtual machine , i can't control
hardware and can't issue any custom command on the host-side.

I've tried to use fence_xvm but i'm not sure about it because if my VM
is powered off, how can it reply to fence_xvm messages?

Here my logs when i power off the VM:

== /var/log/cluster/fenced.log ==
Feb 01 18:50:22 fenced fencing node mynode02
Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result:
error from agent
Feb 01 18:50:53 fenced fence mynode02 failed

I've tried to force the manual fence with:

fence_ack_manual mynode02

and in this case the failover works properly.

The point is: as i'm not using any shared filesystem but i'm only
sharing apache with a virtual ip, i won't have any split-brain scenario
so i don't need fencing, or not?

So, is there the possibility to have a simple dummy fencing?

here is my config.xml:

<?xml version="1.0"?>
<cluster config_version="20" name="hacluster">
    <fence_daemon clean_start="0" post_fail_delay="0"
     post_join_delay="0"/>
    <cman expected_votes="1" two_node="1"/>
    <clusternodes>
        <clusternode name="mynode01" nodeid="1" votes="1">
            <fence>
                <method name="mynode01">
                    <device domain="mynode01" name="mynode01"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="mynode02" nodeid="2" votes="1">
            <fence>
                <method name="mynode02">
                    <device domain="mynode02" name="mynode02"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice agent="fence_xvm" name="mynode01"/>
        <fencedevice agent="fence_xvm" name="mynode02"/>
    </fencedevices>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="MYSERVICE" nofailback="0"
             ordered="0" restricted="0">
                <failoverdomainnode name="mynode01" priority="1"/>
                <failoverdomainnode name="mynode02" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources/>
        <service autostart="1" exclusive="0" name="MYSERVICE"
         recovery="relocate">
            <ip address="192.168.1.239" monitor_link="on" sleeptime="2"/>
            <apache config_file="conf/httpd.conf" name="apache"
             server_root="/etc/httpd" shutdown_wait="0"/>
        </service>
    </rm>
</cluster>

Thanks to all in advance.


The fence_virtd/fence_xvm agent works by using multicast to talk to the 
VM host. So the off confirmation comes from the hypervisor, not the 
target.
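
A quick way to see whether the guest can reach fence_virtd on the host at all
(assuming the host admin has it running and has copied the key into the
guests) is:

  fence_xvm -o list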


Depending on your setup, you might find better luck with fence_virsh (I 
have to use this as there is a known multicast issue with Fedora hosts). 
Can you try, as a test if nothing else, if 'fence_virsh' will work for you?


fence_virsh -a <host ip> -l root -p <host root pw> -n <virsh name for 
target vm> -o status


If this works, it should be trivial to add to cluster.conf. If that 
works, then you have a working fence method. However, I would recommend 
switching back to fence_xvm if you can. The fence_virsh agent is 
dependent on libvirtd running, which some consider a risk.


hth

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine

2014-02-01 Thread Digimer
Ooooh, I'm not sure what option you have then. I suppose 
fence_virtd/fence_xvm is your best option, but you're going to need to 
have the admin configure the fence_virtd side.


On 01/02/14 03:50 PM, nik600 wrote:

My problem is that i don't have root access at host level.

On 01/Feb/2014 19:49, Digimer li...@alteeve.ca wrote:

On 01/02/14 01:35 PM, nik600 wrote:

Dear all

i need some clarification about clustering with rhel 6.4

i have a cluster with 2 node in active/passive configuration, i
simply
want to have a virtual ip and migrate it between 2 nodes.

i've noticed that if i reboot or manually shut down a node the
failover
works correctly, but if i power-off one node the cluster doesn't
failover on the other node.

Another stange situation is that if power off all the nodes and then
switch on only one the cluster doesn't start on the active node.

I've read manual and documentation at



https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html

and i've understand that the problem is related to fencing, but the
problem is that my 2 nodes are on 2 virtual machine , i can't
control
hardware and can't issue any custom command on the host-side.

I've tried to use fence_xvm but i'm not sure about it because if
my VM
has powered-off, how can it reply to fence_vxm messags?

Here my logs when i power off the VM:

== /var/log/cluster/fenced.log ==
Feb 01 18:50:22 fenced fencing node mynode02
Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm
result:
error from agent
Feb 01 18:50:53 fenced fence mynode02 failed

I've tried to force the manual fence with:

fence_ack_manual mynode02

and in this case the failover works properly.

The point is: as i'm not using any shared filesystem but i'm only
sharing apache with a virtual ip, i won't have any split-brain
scenario
so i don't need fencing, or not?

So, is there the possibility to have a simple dummy fencing?

here is my config.xml:

?xml version=1.0?
cluster config_version=20 name=hacluster
  fence_daemon clean_start=0 post_fail_delay=0
post_join_delay=0/
  cman expected_votes=1 two_node=1/
  clusternodes
  clusternode name=mynode01 nodeid=1 votes=1
  fence
  method name=mynode01
  device domain=mynode01
name=mynode01/
  /method
  /fence
  /clusternode
  clusternode name=mynode02 nodeid=2 votes=1
  fence
  method name=mynode02
  device domain=mynode02
name=mynode02/
  /method
  /fence
  /clusternode
  /clusternodes
  fencedevices
  fencedevice agent=fence_xvm name=mynode01/
  fencedevice agent=fence_xvm name=mynode02/
  /fencedevices
  rm log_level=7
  failoverdomains
  failoverdomain name=MYSERVICE
nofailback=0
ordered=0 restricted=0
  failoverdomainnode
name=mynode01
priority=1/
  failoverdomainnode
name=mynode02
priority=2/
  /failoverdomain
  /failoverdomains
  resources/
  service autostart=1 exclusive=0
name=MYSERVICE
recovery=relocate
  ip address=192.168.1.239
monitor_link=on
sleeptime=2/
apache config_file=conf/httpd.conf name=apache
server_root=/etc/httpd shutdown_wait=0/
  /service
  /rm
/cluster

Thanks to all in advance.


The fence_virtd/fence_xvm agent works by using multicast to talk to
the VM host. So the off confirmation comes from the hypervisor,
not the target.

Depending on your setup, you might find better luck with fence_virsh
(I have to use this as there is a known multicast issue with Fedora
hosts). Can you try
