Re: [ClusterLabs] Resource ocf:heartbeat:asterisk fails to start

2016-06-17 Thread Digimer
On 17/06/16 03:05 PM, FreeSoftwareServers wrote:
> Just wanted to share!
> 
> This misinformation got me started down the wrong path, which was
> running user/group root/root. Good old internet!
> 
> http://www.klaverstyn.com.au/david/wiki/index.php?title=Asterisk_Cluster

They disable stonith, so ya, not a great resource.


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Resource ocf:heartbeat:asterisk fails to start

2016-06-17 Thread FreeSoftwareServers
Just wanted to share!

 

This misinformation got me started down the wrong path, which was running
user/group root/root. Good old internet!

 

http://www.klaverstyn.com.au/david/wiki/index.php?title=Asterisk_Cluster

 

 

pcs resource create Asterisk ocf:heartbeat:asterisk params user="root"
group="root" op monitor timeout="30"
 
pcs constraint location Asterisk prefers node01
 
pcs resource delete Asterisk
 

Testing :

 

/usr/lib/ocf/resource.d/heartbeat/asterisk start ; echo $?

1st Error :

/usr/lib/ocf/resource.d/heartbeat/asterisk: line 38:
/lib/heartbeat/ocf-shellfuncs: No such file or directory

Resolution:

nano /usr/lib/ocf/resource.d/heartbeat/asterisk

~asterisk

# Initialization: 
 
. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs 
 
#: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat} 
#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
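For reference, the same edit can be done non-interactively. A minimal sketch (the file contents below are reproduced from the agent's initialization block above; a temp copy stands in for the real agent so this can be tried safely, and the sed expressions are mine):

```shell
# Sketch: apply the same fix without opening nano. Point "agent" at
# /usr/lib/ocf/resource.d/heartbeat/asterisk to apply it for real.
agent=$(mktemp)
cat > "$agent" <<'EOF'
# Initialization:
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
EOF
# Comment out the OCF_FUNCTIONS_DIR default and source the absolute path.
sed -i \
  -e 's|^: ${OCF_FUNCTIONS_DIR=|#&|' \
  -e 's|^\. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs|. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs|' \
  "$agent"
grep '^\.' "$agent"
```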

Testing again:

2nd Error :

INFO: Asterisk PBX is not running
ERROR: Directory /var/run/asterisk is not writable by asterisk

Resolution:

chmod 777 /var/run/asterisk
chown asterisk:asterisk /var/run/asterisk
ls -la /var/run/ | grep asterisk

drwxrwxrwx 2 asterisk asterisk 60 Jun 17 14:22 asterisk
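A side note on the fix above: mode 777 is broader than the agent needs; correct ownership with group write is usually enough. The agent's check can be re-run by hand, roughly like this (user and path taken from the error above; this assumes the asterisk system user exists):

```shell
# Re-run the writability check the agent performs for its run directory.
sudo -u asterisk test -w /var/run/asterisk \
    && echo "writable by asterisk" \
    || echo "NOT writable - fix ownership/mode"
```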

Success?:

root@node1:/var/run $ /usr/lib/ocf/resource.d/heartbeat/asterisk start ;
echo $?
INFO: Asterisk PBX is not running
INFO: 0 active channels 0 active calls 0 calls processed
DEBUG: Asterisk PBX monitor succeeded
INFO: Asterisk PBX started
0

Failover :

root@node2:~ $ /usr/lib/ocf/resource.d/heartbeat/asterisk start ; echo $?
INFO: Asterisk PBX is not running
INFO: 0 active channels 0 active calls 0 calls processed
DEBUG: Asterisk PBX monitor succeeded
INFO: Asterisk PBX started

Except :

Asterisk (ocf::heartbeat:asterisk): Stopped

Failed Actions:

* Asterisk_start_0 on node2 'unknown error' (1): call=73,
status=Timed Out, exitreason='none',
last-rc-change='Fri Jun 17 14:25:57 2016', queued=0ms, exec=20001ms
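A note on the failure above: exec=20001ms lines up with Pacemaker's default 20s operation timeout, so a slow Asterisk start can time out in the cluster even when the agent starts fine by hand. A hedged sketch of giving the operations more headroom (the 60s values are examples, not from this thread):

```shell
# Raise the start/stop operation timeouts above the ~20s default
# (resource name from this thread; timeout values are examples).
pcs resource update Asterisk op start timeout=60s op stop timeout=60s
```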

Delete Resource and try again :

Success ?! :

Asterisk (ocf::heartbeat:asterisk): Started node2

Failover :

Failed Actions:

* Asterisk_start_0 on node1 'unknown error' (1): call=23,
status=Timed Out, exitreason='none',
last-rc-change='Fri Jun 17 14:31:06 2016', queued=0ms, exec=20002ms

Starting and stopping the cluster = same thing; it failed on nodes 1 and 2

root@node1:/var/run $ /usr/lib/ocf/resource.d/heartbeat/asterisk start ;
echo $?
INFO: Asterisk PBX is not running
INFO: 0 active channels 0 active calls 0 calls processed
DEBUG: Asterisk PBX monitor succeeded
INFO: Asterisk PBX started
0
root@node1:/var/run $ service asterisk status
● asterisk.service - LSB: Asterisk PBX
Loaded: loaded (/etc/rc.d/init.d/asterisk)
Active: inactive (dead) since Fri 2016-06-17 14:03:45 EDT; 29min ago

 

Solution !!

 

http://manpages.ubuntu.com/manpages/wily/man7/ocf_heartbeat_asterisk.7.html

wget https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/asterisk -O /usr/lib/ocf/resource.d/heartbeat/asterisk
chmod +x /usr/lib/ocf/resource.d/heartbeat/asterisk
nano /usr/lib/ocf/resource.d/heartbeat/asterisk
# Initialization: 
 
. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs 
 
#: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat} 
#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
chmod 777 /var/run/asterisk
chown asterisk:asterisk /var/run/asterisk
pcs resource create Asterisk ocf:heartbeat:asterisk params binary="asterisk"
canary_binary="astercany" config="/etc/asterisk/asterisk.conf"
user="asterisk" group="asterisk" additional_parameters="-g -vvv" op monitor
timeout="30"
 
pcs constraint location Asterisk prefers node01
 
pcs resource delete Asterisk
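One caveat worth adding to this solution: on systemd-based systems /var/run is typically a tmpfs, so the chown/chmod above is lost at reboot. A sketch of making it persistent with a tmpfiles.d entry (the file name is my choice; mode and owner mirror the fix above):

```shell
# Recreate /run/asterisk with the right owner at every boot.
cat > /etc/tmpfiles.d/asterisk.conf <<'EOF'
d /run/asterisk 0775 asterisk asterisk -
EOF
# Apply immediately without rebooting:
systemd-tmpfiles --create /etc/tmpfiles.d/asterisk.conf
```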

 



Re: [ClusterLabs] Alert notes

2016-06-17 Thread Jan Pokorný
On 15/06/16 18:45 +0200, Klaus Wenninger wrote:
> On 06/15/2016 06:11 PM, Ferenc Wágner wrote:
>> Did you think about filtering the environment variables passed to the
>> alert scripts?  NOTIFY_SOCKET probably shouldn't be present, and PATH
>> probably shouldn't contain sbin directories; I guess all these are
>> inherited from systemd in my case.
> 
> It is just what crmd comes along with ... but interesting point ...

... and having the Shellshock vulnerability in mind, also a little bit
worrying (yes, even nowadays).

(That being said, I've already presented my subversive opinion that
shell introduces more headaches than is reasonable: using it may be
the most natural choice, with almost no barriers to entry, but it's actually
quite hard to make scripts bullet-proof; say, the chances that a script will be
derailed just by a space-containing [not talking about quotes] parameter are
quite high: http://clusterlabs.org/pipermail/users/2015-May/000403.html)
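A two-line illustration of that derailing:

```shell
# An unquoted expansion word-splits a parameter containing a space,
# silently changing the number of arguments a script sees.
count_args() { echo $#; }
broken() { count_args $1; }     # unquoted: "with space" splits into 2 words
safe()   { count_args "$1"; }   # quoted: arrives as a single argument
broken "with space"             # prints 2
safe   "with space"             # prints 1
```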

-- 
Jan (Poki)




Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-17 Thread Vladislav Bogdanov

17.06.2016 15:05, Vladislav Bogdanov wrote:

03.05.2016 01:14, Ken Gaillot wrote:

On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:

Hi,

Just found an issue where a node is silently unfenced.

That is a quite large setup (2 cluster nodes and 8 remote ones) with
plenty of slowly starting resources (Lustre filesystem).

Fencing was initiated due to a resource stop failure.
Lustre often starts very slowly due to internal recovery, and some such
resources were starting in the transition in which another resource
failed to stop.
And, as the transition did not finish in the time specified by the
"failure-timeout" (set to 9 min) and was not aborted, that stop
failure was successfully cleaned.
There were transition aborts due to attribute changes after that
stop failure happened, but fencing was not initiated for some reason.


Unfortunately, that makes sense with the current code. Failure timeout
changes the node attribute, which aborts the transition, which causes a
recalculation based on the new state, and the fencing is no longer


Ken, could this one be considered to be fixed before 1.1.15 is released?


I created https://github.com/ClusterLabs/pacemaker/pull/1072 for this.
That is an RFC, tested only to compile.
I hope it should be correct; please tell me if I did something damn
wrong, or if there could be a better way.


Best,
Vladislav




Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-17 Thread Vladislav Bogdanov

03.05.2016 01:14, Ken Gaillot wrote:

On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:

Hi,

Just found an issue where a node is silently unfenced.

That is a quite large setup (2 cluster nodes and 8 remote ones) with
plenty of slowly starting resources (Lustre filesystem).

Fencing was initiated due to a resource stop failure.
Lustre often starts very slowly due to internal recovery, and some such
resources were starting in the transition in which another resource failed to
stop.
And, as the transition did not finish in the time specified by the
"failure-timeout" (set to 9 min) and was not aborted, that stop failure was
successfully cleaned.
There were transition aborts due to attribute changes after that stop failure
happened, but fencing was not initiated for some reason.


Unfortunately, that makes sense with the current code. Failure timeout
changes the node attribute, which aborts the transition, which causes a
recalculation based on the new state, and the fencing is no longer


Ken, could this one be considered to be fixed before 1.1.15 is released?
I was just hit by the same issue in a completely different setup.
Two-node cluster: one node fails to stop a resource and is fenced.
Right after that, the second node fails to activate a clvm volume (a different
story, which needs investigation) and then fails to stop it. The node is
scheduled to be fenced, but it cannot be, because the first node hasn't come up yet.
Any cleanup (automatic or manual) of a resource that failed to stop clears
the node state, removing the "unclean" state from the node. That is probably
not what I would expect (resource cleanup acting as a node unfence)...

Honestly, this potentially leads to a data corruption...

Also (probably not related), there was one more resource stop failure (in
that case, a timeout) prior to the failed stop mentioned above. And that stop
timeout did not lead to fencing by itself.
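Until the behaviour is fixed, one workaround sketch (untested here, and the resource name is only an example) is to disable automatic failure expiry on resources whose stop failures must escalate to fencing, so the failure record is never cleaned behind the scheduler's back:

```shell
# failure-timeout=0 disables automatic expiry of the failure record;
# failures then stay until an explicit "pcs resource cleanup".
pcs resource update lustre-fs meta failure-timeout=0
```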


I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so 
any additional information from them can be easily provided.


Best regards,
Vladislav




Re: [ClusterLabs] Cluster administration from non-root users

2016-06-17 Thread Auer, Jens
Thanks a lot. Everything works as expected.
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com
Our mandatory disclosures according to § 35a GmbHG / §§ 161, 125a HGB can be
found at de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI 
Group Inc. and its affiliates may be contained in this message. If you are not 
a recipient indicated or intended in this message (or responsible for delivery 
of this message to such person), or you think for any reason that this message 
may have been addressed to you in error, you may not use or copy or deliver 
this message to anyone else. In such case, you should destroy this message and 
are asked to notify the sender by reply e-mail.


From: Tomas Jelinek [tojel...@redhat.com]
Sent: Monday, 13 June 2016 14:32
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Cluster administration from non-root users

On 13.6.2016 at 13:57, Auer, Jens wrote:
> Hi,
>
> I am trying to give admin rights to my clusters to non-root users. I
> have two users which need to be able to control the cluster. Both are
> members of the haclient group, and I have created acl roles granting
> write-access. I can query the cluster status, but I am unable to perform
> any commands:
> id
> uid=1000(mdaf) gid=1000(mdaf)
> groups=1000(mdaf),10(wheel),189(haclient),801(mdaf),802(mdafkey),803(mdafmaintain)
>
> pcs acl
> ACLs are enabled
>
> User: mdaf
>Roles: admin
> User: mdafmaintain
>Roles: admin
> Role: admin
>Permission: write xpath /cib (admin-write)
>
> pcs cluster status
> Cluster Status:
>   Last updated: Mon Jun 13 11:46:45 2016Last change: Mon Jun 13
> 11:46:38 2016 by root via cibadmin on MDA2PFP-S02
>   Stack: corosync
>   Current DC: MDA2PFP-S01 (version 1.1.13-10.el7-44eb2dd) - partition
> with quorum
>   2 nodes and 9 resources configured
>   Online: [ MDA2PFP-S01 MDA2PFP-S02 ]
>
> PCSD Status:
>MDA2PFP-S01: Online
>MDA2PFP-S02: Online
>
> pcs cluster stop
> Error: localhost: Permission denied - (HTTP error: 403)
>
> pcs cluster start
> Error: localhost: Permission denied - (HTTP error: 403)

Hi Jens,

You configured permissions to edit the CIB, but it is also required to
assign permissions to use pcsd (only root is allowed to start and stop
services, so the request goes through pcsd).

This can be done using pcs web UI:
- open the web UI in your browser at https://:2224
- login as hacluster user
- add existing cluster
- go to permissions
- set permissions for your cluster
- don't forget to apply changes

Regards,
Tomas

>
> I tried to use sudo instead, but this is also not working:
> sudo pcs status
> Permission denied
> Error: unable to locate command: /usr/sbin/crm_mon
>
> Any help would be greatly appreciated.
>
> Best wishes,
>Jens
>
>
>
>



Re: [ClusterLabs] Corosync with passive rrp, udpu - Unable to reset after "Marking ringid 1 interface 127.0.0.1 FAULTY"

2016-06-17 Thread Jan Friesse

Martin,



Hi Jan

Thanks for your super quick response !

We do not use a Network Manager - it's all static on these Ubuntu 14.04 nodes
(/etc/network/interfaces).


Good



I do not think we did an ifdown on the network interface manually. However, the
IP addresses are assigned to bond0 and bond1 - we use 4x physical network
interfaces, with 2x bonded into a public (bond1) and 2x bonded into a private
network (bond0).

Could this have anything to do with it ?


I don't think so. The problem really happens only when corosync is
configured with an IP address which disappears, so it has to rebind to
127.0.0.1. You would then see "The network interface is down" in the
logs. Try to find that message, to confirm it's really the problem I was
referring to.
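To check for that message on the Ubuntu nodes from this thread, something like the following works (the log paths are assumptions, they vary with the logging configuration):

```shell
# Look for corosync's rebind message in the usual log locations.
grep -i "network interface is down" \
    /var/log/syslog /var/log/corosync/corosync.log 2>/dev/null
```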


Regards,
  Honza


Regards,
Martin Schlegel

___

 From /etc/network/interfaces, i.e.

auto bond0
iface bond0 inet static
#pre-up /sbin/ethtool -s bond0 speed 1000 duplex full autoneg on
post-up ifenslave bond0 eth0 eth2
pre-down ifenslave -d bond0 eth0 eth2
bond-slaves none
bond-mode 4
bond-lacp-rate fast
bond-miimon 100
bond-downdelay 0
bond-updelay 0
bond-xmit_hash_policy 1
address  [...]


Jan Friesse wrote on 16 June 2016 at 17:55:

Martin Schlegel wrote:


Hello everyone,

We have been running a 3-node Pacemaker (1.1.14) / Corosync (2.3.5) cluster
successfully for a couple of months, and we have started seeing a faulty ring
with an unexpected 127.0.0.1 binding that we cannot reset via
"corosync-cfgtool -r".


This is the problem. Bind to 127.0.0.1 = ifdown happened = problem, and with
RRP it means a BIG problem.


We have had this once before, and only restarting Corosync (and everything
else) on the node showing the unexpected 127.0.0.1 binding made the problem
go away. However, in production we obviously would like to avoid this if
possible.


Just don't do ifdown. Never. If you are using NetworkManager (which does
an ifdown by default if the cable is disconnected), use something like the
NetworkManager-config-server package (it's just a change of configuration,
so you can adapt it to whatever distribution you are using).

Regards,
  Honza


So, from the following description - how can I troubleshoot this issue, and/or
does anybody have a good idea what might be happening here?

We run 2x passive rrp rings across different IP subnets via udpu, and we get
the following output (all IPs obfuscated) - please notice the unexpected
interface binding 127.0.0.1 for host pg2.

If we reset via "corosync-cfgtool -r" on each node, heartbeat ring id 1 briefly
shows "no faults" but goes back to "FAULTY" seconds later.

Regards,
Martin Schlegel
_

root@pg1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id = A.B.C1.5
  status = ring 0 active with no faults
RING ID 1
  id = D.E.F1.170
  status = Marking ringid 1 interface D.E.F1.170 FAULTY

root@pg2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
  id = A.B.C2.88
  status = ring 0 active with no faults
RING ID 1
  id = 127.0.0.1
  status = Marking ringid 1 interface 127.0.0.1 FAULTY

root@pg3:~# corosync-cfgtool -s
Printing ring status.
Local node ID 3
RING ID 0
  id = A.B.C3.236
  status = ring 0 active with no faults
RING ID 1
  id = D.E.F3.112
  status = Marking ringid 1 interface D.E.F3.112 FAULTY

_

/etc/corosync/corosync.conf from pg1 (other nodes use different subnets and
IPs, but are otherwise identical):
===
quorum {
  provider: corosync_votequorum
  expected_votes: 3
}

totem {
  version: 2

  crypto_cipher: none
  crypto_hash: none

  rrp_mode: passive
  interface {
  ringnumber: 0
  bindnetaddr: A.B.C1.0
  mcastport: 5405
  ttl: 1
  }
  interface {
  ringnumber: 1
  bindnetaddr: D.E.F1.64
  mcastport: 5405
  ttl: 1
  }
  transport: udpu
}

nodelist {
  node {
  ring0_addr: pg1
  ring1_addr: pg1p
  nodeid: 1
  }
  node {
  ring0_addr: pg2
  ring1_addr: pg2p
  nodeid: 2
  }
  node {
  ring0_addr: pg3
  ring1_addr: pg3p
  nodeid: 3
  }
}

logging {
  to_syslog: yes
}

===







