Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Ken Gaillot
On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
 On 24.08.2015 12:35, Tom Yates wrote:
 I've got a failover firewall pair where the external interface is ADSL;
 that is, PPPoE.  I've defined the service thus:

 primitive ExternalIP lsb:hb-adsl-helper \
  op monitor interval=60s

 and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

 #!/bin/bash
 RETVAL=0
 start() {
  /sbin/pppoe-start
 }
 stop() {
  /sbin/pppoe-stop
 }
 case $1 in
start)
  start
  ;;
stop)
  stop
  ;;
status)
  /sbin/ifconfig ppp0 > /dev/null && exit 0
  exit 1
  ;;
*)
  echo $"Usage: $0 {start|stop|status}"
  exit 3
 esac
 exit $?

Pacemaker expects that LSB agents follow the LSB spec for return codes,
and won't be able to behave properly if they don't:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
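
For illustration, here is a sketch of an LSB-conformant status/stop
pair for a helper like this (one assumption, based on your logs, is
that pppoe-stop returns nonzero when the link is already down):

status() {
    # LSB: 0 = running, 3 = not running (1/2 mean dead with stale pid/lock)
    if /sbin/ifconfig ppp0 > /dev/null 2>&1; then
        return 0
    fi
    return 3
}
stop() {
    # LSB requires "stop" on an already-stopped service to return 0
    status || return 0
    /sbin/pppoe-stop
    status && return 1   # link still up after pppoe-stop: genuine failure
    return 0
}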


However, it's just as easy to write an OCF agent, which gives you more
flexibility (accepting parameters, etc.):

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
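
As a rough, untested sketch (the ppp0 interface name and the pppoe-*
paths are taken from your script), a minimal OCF agent could look like:

#!/bin/bash
# Minimal OCF agent sketch wrapping pppoe-start/pppoe-stop.
: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs
__OCF_ACTION=$1

meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<resource-agent name="pppoe">
  <version>0.1</version>
  <longdesc lang="en">Manages a PPPoE link via pppoe-start/pppoe-stop.</longdesc>
  <shortdesc lang="en">PPPoE link</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="60s"/>
    <action name="stop" timeout="60s"/>
    <action name="monitor" interval="60s" timeout="20s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

pppoe_monitor() {
    /sbin/ifconfig ppp0 > /dev/null 2>&1 && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

pppoe_start() {
    pppoe_monitor && return $OCF_SUCCESS      # already running
    /sbin/pppoe-start || return $OCF_ERR_GENERIC
    pppoe_monitor
}

pppoe_stop() {
    pppoe_monitor || return $OCF_SUCCESS      # stopping a stopped resource succeeds
    /sbin/pppoe-stop
    pppoe_monitor || return $OCF_SUCCESS
    return $OCF_ERR_GENERIC                   # link still up after stop
}

case $__OCF_ACTION in
    meta-data) meta_data; exit $OCF_SUCCESS ;;
    start)     pppoe_start ;;
    stop)      pppoe_stop ;;
    monitor)   pppoe_monitor ;;
    *)         exit $OCF_ERR_UNIMPLEMENTED ;;
esac
exit $?

Installed under e.g. ${OCF_ROOT}/resource.d/local/pppoe, it would be
configured as ocf:local:pppoe (the "local" provider name is just an
example).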

 The problem is that sometimes the ADSL connection falls over, as they
 do, eg:

 Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
 Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
 Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
 164420300 bytes.
 Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
 Aug 20 11:42:13 positron pppd[2469]: Modem hangup
 Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
 1735: Input/output error
 Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
 Aug 20 11:42:13 positron pppd[2469]: Exit.
 Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
 attempting re-connection.

 CRMd then logs a bunch of stuff, followed by

 Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
 Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
 additional parameters are needed.
 [...]
 Aug 20 11:42:18 positron pppoe-stop: Killing pppd
 Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
 Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
 process 28357 exited with return code 1.


 At this point, the PPPoE connection is down, and stays down.  CRMd
 doesn't fail the group which contains both internal and external
 interfaces over to the other node, but nor does it try to restart the
 service.  I'm fairly sure this is because I've done something
 boneheaded, but I can't get my bone head around what it might be.

 Any light anyone can shed is much appreciated.


 
 If the stop operation fails, the resource state is undefined; pacemaker
 won't do anything further with this resource. Either make sure the
 script returns success when appropriate, or the only remaining option
 is to fence the node where the resource was active.
 
 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Announce] libqb v0.17.2 release

2015-08-24 Thread Christine Caulfield
This is mainly a bug fix release, but also includes a new split-logging
feature.

Changes v0.17.1 - v0.17.2

Implement extended information logging (aka split logging)
switch libtool soname versioning from -version-number to -version-info
High: loop: fixes resource starvation in mainloop code
Fix: valgrind invalid file descriptor warning
Fix: Unlink files bound to unix domain sockets
Fix: resolves compile error for solaris
Fix alignment issues on hppa
Fix description of qbutil.h
Fix comment typo: neccessary -> necessary
Fix comment typos: incomming -> incoming
Low: examples: fix race condition in glib mainloop example
Low: build: update .gitignore for vim swap files and make check output
Low: check_ipc: generate unique server names for tests
Low: check_ipc: give connection stress tests for shm and socket
unique names
Low: tests: regression tests for stress testing loop_poll ipc
create/destroy
ipc test improvements.
The udata member of the kevent struct is a void *
Fixes several warnings under clang
Add Doxygen description for qbipc_common.h
doc: improve README and RPM description
Clear DOT_FONTNAME, since FreeSans is not included any more. The new
default is Helvetica.
Remove obsolete options from doxyfiles
Do not suppress echoing of Doxygen commands

The current release tarball is here:
https://github.com/ClusterLabs/libqb/archive/v0.17.2.tar.gz

The github repository is here:
https://github.com/ClusterLabs/libqb

Please report bugs and issues in bugzilla:
https://bugzilla.redhat.com




Re: [ClusterLabs] [Slightly OT] OCFS2 over LVM

2015-08-24 Thread Digimer
On 24/08/15 07:55 AM, Jorge Fábregas wrote:
 On 08/24/2015 06:52 AM, Kai Dupke wrote:
 Not sure what you want to run on top of your 2-node cluster, but OCFS2
 is only needed when you need a shared file system.
 
 This is for an application that manages high availability by itself
 (in an active/active fashion); the only thing that's needed from the
 OS is a shared filesystem.  I briefly considered NFS, but then the
 reliability of the NFS server was questioned, etc.  I could create an NFS
 cluster for that, but that would be two more servers.  You get the idea.
 
 I'm still googling NFSv4 vs OCFS2.  If anyone here has experience
 (going from one to the other) I'd like to hear it.
 
 
 For plain failover with volumes managed by cLVM you don't need OCFS2
 (and can save one level of complexity).
 
 This is my first time using a cluster filesystem and indeed I get it:
 there's lots of things to be taken care of and many possible ways to break it.
 
 Thanks,
 Jorge

Speaking from a gfs2 background, but assuming it's similar in concept to
ocfs2...

Cluster locking comes at a performance cost. All locks need to be
coordinated between the nodes, and that will always be slower than local
locking only. They are also far less commonly used than options like nfs.

Using a pair of nodes with a traditional file system exported by NFS and
made accessible by a floating (virtual) IP address gives you redundancy
without incurring the complexity and performance overhead of cluster
locking. Also, you won't need clvmd either. The trade-off, though, is
that if/when the primary fails, the nfs daemon will appear to restart
from the users' perspective, and that may require a reconnection (not
sure, I use nfs sparingly).
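
For a sense of what that looks like, a plain NFS failover setup in crm
shell syntax might be sketched as below (the device, paths and IP are
made up for illustration):

primitive p_fs ocf:heartbeat:Filesystem \
        params device=/dev/sdb1 directory=/srv/nfs fstype=ext4
primitive p_nfs ocf:heartbeat:nfsserver \
        params nfs_shared_infodir=/srv/nfs/info
primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24
group g_nfs p_fs p_nfs p_vip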

Generally speaking, I recommend always avoiding cluster FSes unless
they're really required. I say this as a person who uses gfs2 in every
cluster I build, but I do so carefully and in limited uses. In my case,
gfs2 backs ISOs and XML definition files for VMs, things that change
rarely, so cluster locking overhead is all but a non-issue; and since I
have to have DLM for clustered LVM anyway, I've already incurred the
complexity costs, so hey, why not.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] 0 Nodes configured in crm_mon

2015-08-24 Thread Stanislav Kopp
Hi all,

I'm trying to run a corosync2 + pacemaker setup on Debian Jessie (only
for testing purposes). I've successfully compiled all components using
this guide: http://clusterlabs.org/wiki/Compiling_on_Debian

Unfortunately, if I run crm_mon I don't see any nodes.

###
Last updated: Mon Aug 24 17:36:00 2015
Last change: Mon Aug 24 17:17:42 2015
Current DC: NONE
0 Nodes configured
0 Resources configured


I don't see any errors in corosync log either: http://pastebin.com/bJX66B9e

This is my corosync.conf

###

# Please read the corosync.conf.5 manual page
totem {
version: 2

crypto_cipher: none
crypto_hash: none

interface {
ringnumber: 0
bindnetaddr: 192.168.122.0
mcastport: 5405
ttl: 1
}
transport: udpu
}

logging {
fileline: off
to_logfile: yes
to_syslog: no
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: QUORUM
debug: off
}
}

nodelist {
node {
ring0_addr: 192.168.122.172
#nodeid: 1
}

node {
ring0_addr: 192.168.122.113
#nodeid: 2
}
}

quorum {
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
#provider: corosync_votequorum
}
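
(For reference: one thing that stands out above is that the quorum
provider is commented out, and pacemaker on corosync 2.x relies on
votequorum for node membership. An enabled section for a two-node
cluster would look like the following sketch; two_node: 1 applies only
to two-node setups.)

quorum {
provider: corosync_votequorum
two_node: 1
}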



used components:

pacemaker: 1.1.12
corosync: 2.3.5
libqb: 0.17.1


Did I miss something?

Thanks!
Stan



Re: [ClusterLabs] Cluster.conf

2015-08-24 Thread Digimer
The cluster.conf is needed by cman, and in RHEL 6, pacemaker needs to
use cman as the quorum provider. So you need a skeleton cluster.conf and
it is different from cib.xml.

If you use pcs/pcsd to setup pacemaker on RHEL 6.7, it should configure
everything for you, so you should be able to go straight to setting up
pacemaker and not worry about cman/corosync directly.
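
As a sketch (hostnames are placeholders), the pcs flow on RHEL 6.7 is
roughly:

pcs cluster auth node1.example.com node2.example.com
pcs cluster setup --name mycluster node1.example.com node2.example.com
pcs cluster start --all

On RHEL 6 the setup step generates /etc/cluster/cluster.conf for you,
with fence_pcmk redirecting CMAN fencing requests to pacemaker.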

digimer

On 24/08/15 01:52 PM, Streeter, Michelle N wrote:
 If I have a cluster.conf file in /etc/cluster, my cluster will not
 start (Pacemaker 1.1.11, Corosync 1.4.7, cman 3.0.12).  But if I do not
 have a cluster.conf file then my cluster does start with my current
 configuration.  However, when I try to stop the cluster, it won't stop
 unless I have my cluster.conf file in place.  How can I dump my cib to
 my cluster.conf file so my cluster will start with the conf file in place?
 
  
 
 Michelle Streeter
 
 ASC2 MCS – SDE/ACL/SDL/EDL OKC Software Engineer
 The Boeing Company
 
  
 
 
 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Cluster.conf

2015-08-24 Thread Jan Pokorný
On 24/08/15 17:52 +, Streeter, Michelle N wrote:
 If I have a cluster.conf file in /etc/cluster, my cluster will not
 start (Pacemaker 1.1.11, Corosync 1.4.7, cman 3.0.12).  But if I do
 not have a cluster.conf file then my cluster does start with my
 current configuration.

I don't think the CMAN component can operate without that file (the
location can possibly be overridden with the $COROSYNC_CLUSTER_CONFIG_FILE
environment variable).  What distro are you on, or at least what
commands do you use to bring the cluster up?

 However, when I try to stop the cluster, it won't stop unless I have
 my cluster.conf file in place.  How can I dump my cib to my
 cluster.conf file

Note that cluster.conf and the CIB serve different purposes, at least
in a Pacemaker+CMAN setup (as in RHEL 6.x for x being 5+), so you don't
want to interchange them.
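
To inspect the two layers separately, something like this works
(sketch):

ccs_config_dump    # the CMAN side, i.e. the running cluster.conf
cibadmin --query   # the Pacemaker side, i.e. the CIB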

 so my cluster will start with the conf file in place?

-- 
Jan (Poki)




[ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Tom Yates
I've got a failover firewall pair where the external interface is ADSL; 
that is, PPPoE.  I've defined the service thus:


primitive ExternalIP lsb:hb-adsl-helper \
op monitor interval=60s

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

#!/bin/bash
RETVAL=0
start() {
/sbin/pppoe-start
}
stop() {
/sbin/pppoe-stop
}
case $1 in
  start)
start
;;
  stop)
stop
;;
  status)
/sbin/ifconfig ppp0 > /dev/null && exit 0
exit 1
;;
  *)
echo $"Usage: $0 {start|stop|status}"
exit 3
esac
exit $?

The problem is that sometimes the ADSL connection falls over, as they do, 
eg:


Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 
bytes.
Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
Aug 20 11:42:13 positron pppd[2469]: Modem hangup
Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: 
Input/output error
Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
Aug 20 11:42:13 positron pppd[2469]: Exit.
Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting 
re-connection.

CRMd then logs a bunch of stuff, followed by

Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no 
additional parameters are needed.
[...]
Aug 20 11:42:18 positron pppoe-stop: Killing pppd
Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 
28357 exited with return code 1.


At this point, the PPPoE connection is down, and stays down.  CRMd doesn't 
fail the group which contains both internal and external interfaces over 
to the other node, but nor does it try to restart the service.  I'm fairly 
sure this is because I've done something boneheaded, but I can't get my 
bone head around what it might be.


Any light anyone can shed is much appreciated.


--

  Tom Yates  -  http://www.teaparty.net



Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Andrei Borzenkov

On 24.08.2015 12:35, Tom Yates wrote:

I've got a failover firewall pair where the external interface is ADSL;
that is, PPPoE.  I've defined the service thus:

primitive ExternalIP lsb:hb-adsl-helper \
 op monitor interval=60s

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

#!/bin/bash
RETVAL=0
start() {
 /sbin/pppoe-start
}
stop() {
 /sbin/pppoe-stop
}
case $1 in
   start)
 start
 ;;
   stop)
 stop
 ;;
   status)
 /sbin/ifconfig ppp0 > /dev/null && exit 0
 exit 1
 ;;
   *)
 echo $"Usage: $0 {start|stop|status}"
 exit 3
esac
exit $?

The problem is that sometimes the ADSL connection falls over, as they
do, eg:

Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
164420300 bytes.
Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
Aug 20 11:42:13 positron pppd[2469]: Modem hangup
Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
1735: Input/output error
Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
Aug 20 11:42:13 positron pppd[2469]: Exit.
Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
attempting re-connection.

CRMd then logs a bunch of stuff, followed by

Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
additional parameters are needed.
[...]
Aug 20 11:42:18 positron pppoe-stop: Killing pppd
Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
process 28357 exited with return code 1.


At this point, the PPPoE connection is down, and stays down.  CRMd
doesn't fail the group which contains both internal and external
interfaces over to the other node, but nor does it try to restart the
service.  I'm fairly sure this is because I've done something
boneheaded, but I can't get my bone head around what it might be.

Any light anyone can shed is much appreciated.




If the stop operation fails, the resource state is undefined; pacemaker
won't do anything further with this resource. Either make sure the
script returns success when appropriate, or the only remaining option
is to fence the node where the resource was active.
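
In crm shell syntax, escalating a failed stop to fencing explicitly
might look like the sketch below (this only has an effect if working
stonith is configured and enabled; with stonith enabled, fencing on a
failed stop is in fact the default):

primitive ExternalIP lsb:hb-adsl-helper \
        op monitor interval=60s \
        op stop interval=0 timeout=60s on-fail=fence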





Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Andrei Borzenkov

On 24.08.2015 13:32, Tom Yates wrote:

On Mon, 24 Aug 2015, Andrei Borzenkov wrote:


On 24.08.2015 12:35, Tom Yates wrote:

I've got a failover firewall pair where the external interface is ADSL;
that is, PPPoE.  i've defined the service thus:


If stop operation failed resource state is undefined; pacemaker won't
do anything with this resource. Either make sure script returns
success when appropriate or the only option is to make it fence node
where resource was active.


andrei, thank you for your prompt and helpful response.

If I understand you aright, my problem is that the stop script didn't
return a 0 (OK) exit status, so CRM didn't know where to go.  Is the
exit status of the stop script how CRM determines the status of the stop
operation?


correct


 and if that gives exit code 0, it will then try to do a
/etc/init.d/script start?



If the resource was previously active and the stop was attempted as
cleanup after a resource failure - yes, it should attempt to start it again.




Does CRM also use the output of /etc/init.d/script status to determine
continuing successful operation?



It definitely does not use the *output* of the script - only the return
code. If the question is whether it probes the resource in addition to
checking the stop exit code - I do not think so (I know it does this in
some cases for systemd resources).
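
A quick way to see exactly what the cluster sees is to run the script
by hand and check the codes (sketch; note that "stop" on an
already-stopped LSB service must still give rc=0):

/etc/init.d/hb-adsl-helper status; echo "status rc=$?"
/etc/init.d/hb-adsl-helper stop;   echo "stop rc=$?"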






Re: [ClusterLabs] [Slightly OT] OCFS2 over LVM

2015-08-24 Thread Kai Dupke
On 08/24/2015 06:20 PM, Digimer wrote:
 Cluster locking comes at a performance cost. All locks need to be
 coordinated between the nodes, and that will always be slower than local
 locking only. They are also far less commonly used than options like nfs.

right.

 Using a pair of nodes with a traditional file system exported by NFS and
 made accessible by a floating (virtual) IP address gives you redundancy
 without incurring the complexity and performance overhead of cluster
 locking.

Then you have to copy all data over the network, which limits data throughput.


 Also, you won't need clvmd either. The trade-off, though, is
 that if/when the primary fails, the nfs daemon will appear to restart
 from the users' perspective, and that may require a reconnection (not
 sure, I use nfs sparingly).

AFAIK NFS failover involves an NFS timeout, which can be tuned, but it
may add extra time before the failover is complete from the client's
perspective.
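
For example, a client mount tuned for quicker failover detection might
look like this (values purely illustrative; timeo is in tenths of a
second):

mount -o hard,timeo=50,retrans=3 nfs-vip.example.com:/export /mnt/data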

 Generally speaking, I recommend always avoiding cluster FSes unless
 they're really required.

Full ACK.

greetings
Kai Dupke
Senior Product Manager
Server Product Line
-- 
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone:  +49-(0)5102-9310828 Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766  WWW:  www.suse.com

SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
