[ClusterLabs] pacemaker startup problem

2020-07-24 Thread Gabriele Bulfon
Hello,
 
after a long time I'm back to run heartbeat/pacemaker/corosync on our 
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release (probably 
a bit outdated, but I want to start from where I left).
Looks like pacemaker is having trouble starting up, showing these logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15 (e174ec8)
Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in state 
S_STARTING from crmd_init
Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active directory 
to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying cluster 
type: 'heartbeat'
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an active 
'heartbeat' cluster
Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect: Connecting to 
cluster infrastructure: heartbeat
Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not start 
lrmd IPC server: Operation not supported (-48)
Jul 24 18:21:32 [969] lrmd: error: main: Failed to create IPC server: shutting 
down and inhibiting respawn
Jul 24 18:21:32 [969] lrmd: info: crm_xml_cleanup: Cleaning up memory from 
libxml2
Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Verifying cluster type: 
'heartbeat'
Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Assuming an active 
'heartbeat' cluster
Jul 24 18:21:32 [971] crmd: info: start_subsystem: Starting sub-system "pengine"
Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Created entry 
25bc5492-a49e-40d7-ae60-fd8f975a294a/80886f0 for node xstorage1/0 (1 total)
Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Node 0 has uuid 
d426a730-5229-6758-853a-99d4d491514a
Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn: Hostname: 
xstorage1
Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn: UUID: 
d426a730-5229-6758-853a-99d4d491514a
Jul 24 18:21:32 [970] attrd: notice: crm_cluster_connect: Connecting to cluster 
infrastructure: heartbeat
Jul 24 18:21:32 [970] attrd: error: mainloop_add_ipc_server: Could not start 
attrd IPC server: Operation not supported (-48)
Jul 24 18:21:32 [970] attrd: error: attrd_ipc_server_init: Failed to create 
attrd servers: exiting and inhibiting respawn.
Jul 24 18:21:32 [970] attrd: warning: attrd_ipc_server_init: Verify pacemaker 
and pacemaker_remote are not both enabled.
Jul 24 18:21:32 [972] pengine: info: crm_log_init: Changed active directory to 
/sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [972] pengine: error: mainloop_add_ipc_server: Could not start 
pengine IPC server: Operation not supported (-48)
Jul 24 18:21:32 [972] pengine: error: main: Failed to create IPC server: 
shutting down and inhibiting respawn
Jul 24 18:21:32 [972] pengine: info: crm_xml_cleanup: Cleaning up memory from 
libxml2
Jul 24 18:21:33 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:33 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 1 times... pause and retry
Jul 24 18:21:33 [971] crmd: error: crmd_child_exit: Child process pengine 
exited (pid=972, rc=100)
Jul 24 18:21:35 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:36 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:36 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 2 times... pause and retry
Jul 24 18:21:38 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:39 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:39 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 3 times... pause and retry
Jul 24 18:21:41 [971] crmd: info: crm_timer_popped: Wait Timer (I_NULL) just 
popped (2000ms)
Jul 24 18:21:42 [971] crmd: info: do_cib_control: Could not connect to the CIB 
service: Transport endpoint is not connected
Jul 24 18:21:42 [971] crmd: warning: do_cib_control: Couldn't complete CIB 
registration 4 times... pause and retry
Jul 24 18:21:42 [968] stonith-ng: error: setup_cib: Could not connect to the 
CIB service: Transport endpoint is not connected (-134)
Jul 24 18:21:42 [968] stonith-ng: error: mainloop_add_ipc_server: Could not 
start stonith-ng IPC server: Operation not supported (-48)
Jul 24 18:21:42 [968] stonith-ng: error: stonith_ipc_server_init: Failed to 
create stonith-ng servers: exiting and inhibiting respawn.
Jul 2

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Thanks, I was running it manually, which is why I got those errors; when run from the 
service script it correctly sets PCMK_ipc_type to socket.
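For reference, when launching pacemakerd by hand for debugging, the same IPC type the service script sets can be exported first; a minimal sketch (the pacemakerd path comes from the environment dump later in this thread, the rest is illustrative):

    # Minimal sketch: force socket-based IPC when running pacemakerd by hand,
    # mirroring what the service script exports.  Without this, libqb may pick
    # an IPC mechanism that is not supported on this platform, which is what
    # the "Operation not supported (-48)" errors above suggest.
    export PCMK_ipc_type=socket
    /usr/sbin/pacemakerd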
 
But now I see this:
Jul 26 11:08:16 [4039] pacemakerd: info: crm_log_init: Changed active directory 
to /sonicle/var/cluster/lib/pacemaker/cores
Jul 26 11:08:16 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 1s
Jul 26 11:08:17 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 2s
Jul 26 11:08:19 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 3s
Jul 26 11:08:22 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 4s
Jul 26 11:08:26 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 5s
Jul 26 11:08:31 [4039] pacemakerd: warning: mcp_read_config: Could not connect 
to Cluster Configuration Database API, error 2
Jul 26 11:08:31 [4039] pacemakerd: notice: main: Could not obtain corosync 
config data, exiting
Jul 26 11:08:31 [4039] pacemakerd: info: crm_xml_cleanup: Cleaning up memory 
from libxml2
 
So I think I need to start corosync first (right?) but it dies with this:
 
Jul 26 11:07:06 [4027] xstorage1 corosync notice [MAIN ] Corosync Cluster 
Engine ('2.4.1'): started and ready to provide service.
Jul 26 11:07:06 [4027] xstorage1 corosync info [MAIN ] Corosync built-in 
features: bindnow
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: none hash: none
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] The network interface 
[10.100.100.1] is now up.
Jul 26 11:07:06 [4027] xstorage1 corosync notice [SERV ] Service engine loaded: 
corosync configuration map access [0]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync configuration service [1]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync cluster closed process group service v1.01 [2]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync profile loading service [4]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [QUORUM] Using quorum provider 
corosync_votequorum
Jul 26 11:07:06 [4027] xstorage1 corosync crit [QUORUM] Quorum provider: 
corosync_votequorum failed to initialize.
Jul 26 11:07:06 [4027] xstorage1 corosync error [SERV ] Service engine 
'corosync_quorum' failed to load for reason 'configuration error: nodelist or 
quorum.expected_votes must be configured!'
Jul 26 11:07:06 [4027] xstorage1 corosync error [MAIN ] Corosync Cluster Engine 
exiting with status 20 at 
/data/sources/sonicle/xstream-storage-gate/components/cluster/corosync/corosync-2.4.1/exec/service.c:356.
My corosync conf has nodelist configured! Here it is:
 
service {
    ver: 1
    name: pacemaker
    use_mgmtd: no
    use_logd: no
}
totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.100.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: xstorage1
        nodeid: 1
    }
    node {
        ring0_addr: xstorage2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /sonicle/var/log/cluster/corosync.log
    to_syslog: no
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
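As the follow-up below notes, the actual cause was hostname resolution rather than the nodelist itself: each ring0_addr name must resolve, on the local node, to an address on the totem network, otherwise corosync cannot match "this node" against the nodelist. A hypothetical /etc/hosts sketch for this setup (only 10.100.100.1/xstorage1 is confirmed by the logs; the 10.100.100.2 address for xstorage2 is assumed):

    # /etc/hosts sketch -- the .2 address is an assumption, not from the thread
    10.100.100.1    xstorage1
    10.100.100.2    xstorage2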
 
 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 25 July 2020 0:46:52 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello,
after a long time I'm back to run heartbeat/pacemaker/corosync on our
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release
(probably a bit outdated, but I want to start from where I left).
Looks like pacemaker is having trouble starting up, showing these logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
(e174ec8)
Jul 24 18:21:32 [971] crmd: info: 

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Sorry, I was using the wrong hostnames for those networks; with debug logging I found 
that corosync was not finding "this node" in the config file.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Gabriele Bulfon
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 26 July 2020 11:23:53 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
 
Thanks, I was running it manually, which is why I got those errors; when run from the 
service script it correctly sets PCMK_ipc_type to socket.
 
But now I see this:
Jul 26 11:08:16 [4039] pacemakerd: info: crm_log_init: Changed active directory 
to /sonicle/var/cluster/lib/pacemaker/cores
Jul 26 11:08:16 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 1s
Jul 26 11:08:17 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 2s
Jul 26 11:08:19 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 3s
Jul 26 11:08:22 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 4s
Jul 26 11:08:26 [4039] pacemakerd: info: mcp_read_config: cmap connection setup 
failed: CS_ERR_LIBRARY. Retrying in 5s
Jul 26 11:08:31 [4039] pacemakerd: warning: mcp_read_config: Could not connect 
to Cluster Configuration Database API, error 2
Jul 26 11:08:31 [4039] pacemakerd: notice: main: Could not obtain corosync 
config data, exiting
Jul 26 11:08:31 [4039] pacemakerd: info: crm_xml_cleanup: Cleaning up memory 
from libxml2
 
So I think I need to start corosync first (right?) but it dies with this:
 
Jul 26 11:07:06 [4027] xstorage1 corosync notice [MAIN ] Corosync Cluster 
Engine ('2.4.1'): started and ready to provide service.
Jul 26 11:07:06 [4027] xstorage1 corosync info [MAIN ] Corosync built-in 
features: bindnow
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: none hash: none
Jul 26 11:07:06 [4027] xstorage1 corosync notice [TOTEM ] The network interface 
[10.100.100.1] is now up.
Jul 26 11:07:06 [4027] xstorage1 corosync notice [SERV ] Service engine loaded: 
corosync configuration map access [0]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync configuration service [1]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync cluster closed process group service v1.01 [2]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [YKD ] Service engine loaded: 
corosync profile loading service [4]
Jul 26 11:07:06 [4027] xstorage1 corosync notice [QUORUM] Using quorum provider 
corosync_votequorum
Jul 26 11:07:06 [4027] xstorage1 corosync crit [QUORUM] Quorum provider: 
corosync_votequorum failed to initialize.
Jul 26 11:07:06 [4027] xstorage1 corosync error [SERV ] Service engine 
'corosync_quorum' failed to load for reason 'configuration error: nodelist or 
quorum.expected_votes must be configured!'
Jul 26 11:07:06 [4027] xstorage1 corosync error [MAIN ] Corosync Cluster Engine 
exiting with status 20 at 
/data/sources/sonicle/xstream-storage-gate/components/cluster/corosync/corosync-2.4.1/exec/service.c:356.
My corosync conf has nodelist configured! Here it is:
 
service {
    ver: 1
    name: pacemaker
    use_mgmtd: no
    use_logd: no
}
totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.100.100.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: xstorage1
        nodeid: 1
    }
    node {
        ring0_addr: xstorage2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /sonicle/var/log/cluster/corosync.log
    to_syslog: no
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
 
 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 25 July 2020 0:46:52 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello,
after a long time I'm back to run heartbeat/pacemaker/corosync on our
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release
(probably a bit ou

Re: [ClusterLabs] pacemaker startup problem

2020-07-26 Thread Gabriele Bulfon
Sorry, actually the problem is not gone yet.
Now corosync and pacemaker are running happily, but those IPC errors are coming 
out of heartbeat and crmd as soon as I start it.
The pacemakerd process has PCMK_ipc_type=socket, what's wrong with heartbeat or 
crmd?
 
Here's the env of the process:
 
sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
4222: /usr/sbin/pacemakerd
envp[0]: PCMK_respawned=true
envp[1]: PCMK_watchdog=false
envp[2]: HA_LOGFACILITY=none
envp[3]: HA_logfacility=none
envp[4]: PCMK_logfacility=none
envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
envp[7]: HA_debug=0
envp[8]: PCMK_debug=0
envp[9]: HA_quorum_type=corosync
envp[10]: PCMK_quorum_type=corosync
envp[11]: HA_cluster_type=corosync
envp[12]: PCMK_cluster_type=corosync
envp[13]: HA_use_logd=off
envp[14]: PCMK_use_logd=off
envp[15]: HA_mcp=true
envp[16]: PCMK_mcp=true
envp[17]: HA_LOGD=no
envp[18]: LC_ALL=C
envp[19]: PCMK_service=pacemakerd
envp[20]: PCMK_ipc_type=socket
envp[21]: SMF_ZONENAME=global
envp[22]: PWD=/
envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
envp[24]: _=/usr/sbin/pacemakerd
envp[25]: TZ=Europe/Rome
envp[26]: LANG=en_US.UTF-8
envp[27]: SMF_METHOD=start
envp[28]: SHLVL=2
envp[29]: PATH=/usr/sbin:/usr/bin
envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[31]: A__z="*SHLVL
 
 
Here are crmd complaints:
 
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: Node 
xstorage1 state is now member
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
start crmd IPC server: Operation not supported (-48)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Failed to 
create IPC server: shutting down and inhibiting respawn
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: The 
local CRM is operational
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_ERROR received in state S_STARTING from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: State 
transition S_STARTING -> S_RECOVERY
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: 
Fast-tracking shutdown in response to errors
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: Input 
I_PENDING received in state S_RECOVERY from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_TERMINATE received in state S_RECOVERY from do_recover
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: 
Disconnected from the LRM
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Child 
process pengine exited (pid=4316, rc=100)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
recover from internal error
Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]: WARN: 
Managed /usr/libexec/pacemaker/crmd process 4315 exited with return code 201.
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 25 July 2020 0:46:52 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello,
after a long time I'm back to run heartbeat/pacemaker/corosync on our
XStreamOS/illumos distro.
I rebuilt the original components I did in 2016 on our latest release
(probably a bit outdated, but I want to start from where I left).
Looks like pacemaker is having trouble starting up, showing these logs:
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
(e174ec8)
Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in
state S_STARTING from crmd_init
Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active
directory to /sonicle/var/cluster/lib/pacemaker/cores
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying
cluster type: 'heartbeat'
Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an
active 'heartbeat' cluster
Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect:
Connecting to cluster infrastructure: heartbeat
Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not
start lrmd IPC server: Operation not supported (-48)
This is repeated for all the subdaemons ... the error is coming from
qb_ipcs

Re: [ClusterLabs] pacemaker startup problem

2020-07-27 Thread Gabriele Bulfon
Solved: it turns out I don't need the heartbeat component and service running at all.
I just use corosync and pacemaker, and this seems to work.
Now moving on to the crm configuration.
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Reid Wahl
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 26 July 2020 12:25:20 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
Hmm. If it's reading PCMK_ipc_type and matching the server type to 
QB_IPC_SOCKET, then the only other place I see it could be coming from is 
qb_ipc_auth_creds.
 
qb_ipcs_run -> qb_ipcs_us_publish -> qb_ipcs_us_connection_acceptor ->
qb_ipcs_uc_recv_and_auth -> process_auth -> qb_ipc_auth_creds ->
 
static int32_t
qb_ipc_auth_creds(struct ipc_auth_data *data)
{
...
#ifdef HAVE_GETPEERUCRED
        /*
         * Solaris and some BSD systems
...
#elif defined(HAVE_GETPEEREID)
        /*
        * Usually MacOSX systems
...
#elif defined(SO_PASSCRED)
        /*
        * Usually Linux systems
...
#else /* no credentials */
        data->ugp.pid = 0;
        data->ugp.uid = 0;
        data->ugp.gid = 0;
        res = -ENOTSUP;
#endif /* no credentials */
        return res;
 
I'll leave it to Ken to say whether that's likely and what it implies if so.
On Sun, Jul 26, 2020 at 2:53 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
Sorry, actually the problem is not gone yet.
Now corosync and pacemaker are running happily, but those IPC errors are coming 
out of heartbeat and crmd as soon as I start it.
The pacemakerd process has PCMK_ipc_type=socket, what's wrong with heartbeat or 
crmd?
 
Here's the env of the process:
 
sonicle@xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
4222: /usr/sbin/pacemakerd
envp[0]: PCMK_respawned=true
envp[1]: PCMK_watchdog=false
envp[2]: HA_LOGFACILITY=none
envp[3]: HA_logfacility=none
envp[4]: PCMK_logfacility=none
envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
envp[7]: HA_debug=0
envp[8]: PCMK_debug=0
envp[9]: HA_quorum_type=corosync
envp[10]: PCMK_quorum_type=corosync
envp[11]: HA_cluster_type=corosync
envp[12]: PCMK_cluster_type=corosync
envp[13]: HA_use_logd=off
envp[14]: PCMK_use_logd=off
envp[15]: HA_mcp=true
envp[16]: PCMK_mcp=true
envp[17]: HA_LOGD=no
envp[18]: LC_ALL=C
envp[19]: PCMK_service=pacemakerd
envp[20]: PCMK_ipc_type=socket
envp[21]: SMF_ZONENAME=global
envp[22]: PWD=/
envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
envp[24]: _=/usr/sbin/pacemakerd
envp[25]: TZ=Europe/Rome
envp[26]: LANG=en_US.UTF-8
envp[27]: SMF_METHOD=start
envp[28]: SHLVL=2
envp[29]: PATH=/usr/sbin:/usr/bin
envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[31]: A__z="*SHLVL
 
 
Here are crmd complaints:
 
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: Node 
xstorage1 state is now member
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
start crmd IPC server: Operation not supported (-48)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Failed to 
create IPC server: shutting down and inhibiting respawn
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: The 
local CRM is operational
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_ERROR received in state S_STARTING from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: State 
transition S_STARTING -> S_RECOVERY
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: 
Fast-tracking shutdown in response to errors
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning: Input 
I_PENDING received in state S_RECOVERY from do_started
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Input 
I_TERMINATE received in state S_RECOVERY from do_recover
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice: 
Disconnected from the LRM
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Child 
process pengine exited (pid=4316, rc=100)
Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error: Could not 
recover from internal error
Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]: WARN: 
Managed /usr/libexec/pacemaker/crmd process 4315 exited with return code 201.
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot kgail...@redhat.com
To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 25 July 2020 0:46:52 CEST
Subject: Re: [ClusterLabs] pacemaker startup problem
On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
Hello

[ClusterLabs] ip address configuration problem

2020-07-27 Thread Gabriele Bulfon
Hello,
 
after configuring automatic IP configuration in crm, I stumbled upon a problem 
with the IPaddr utility that I don't understand:
IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem: 
couldn't find command: /usr/gnu/bin/awk
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such file or 
directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such file 
or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such file 
or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such file 
or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem: couldn't 
find command: /usr/gnu/bin/awk ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such file 
or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished: 
xstha2_san0_IP_start_0:10439:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such file 
or directory] ]
Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished - rsc:xstha2_san0_IP 
action:start call_id:22 pid:10439 exit-code:5 exec-time:91ms queue-time:0ms
 
It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
 
sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
-r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
 
sonicle@xstorage1:/sonicle/home# file /usr/gnu/bin/awk
/usr/gnu/bin/awk: ELF 64-bit LSB executable AMD64 Version 1, dynamically 
linked, not stripped, no debugging information available
 
what may be happening??
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem

2020-07-28 Thread Gabriele Bulfon
Thanks, I patched all the scripts in the build to have "#!/bin/bash" at the top, and I 
receive no errors now.
However, the IP is not configured :( I'm looking into it...
Is there any easy way to debug what the IP script is doing?
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ulrich Windl
To: users@clusterlabs.org
Date: 28 July 2020 9:12:41 CEST
Subject: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
You could try replacing "local" with "typeset", also.
Reid Wahl wrote on 28.07.2020 at 09:05 in message:
By the way, it doesn't necessarily have to be bash. Upon looking further, a
lot of shells support the `local` keyword, even though it's not required by
the POSIX standard. Plain ksh, however, does not :(
On Monday, July 27, 2020, Reid Wahl
wrote:
Hi, Gabriele. The `local` keyword is a bash built-in and not available in
some other shells (e.g., ksh). It's used in `have_binary()`, so it's
causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all the
"local: not found" errors. I just reproduced it to make sure.
check_binary () {
if ! have_binary "$1"; then
if [ "$OCF_NOT_RUNNING" = 7 ]; then
# Chances are we have a fully setup OCF environment
ocf_exit_reason "Setup problem: couldn't find command: $1"
else
echo "Setup problem: couldn't find command: $1"
fi
exit $OCF_ERR_INSTALLED
fi
}
have_binary () {
if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
false
else
local bin=`echo $1 | sed -e 's/ -.*//'`
test -x "`which $bin 2>/dev/null`"
fi
}
Is bash available on your system?
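For what it's worth, have_binary can also be written without `local` at all; a minimal sketch that should work in plain ksh/POSIX sh (an illustration only, not the shipped ocf-shellfuncs code):

    # Sketch of a `local`-free have_binary for plain ksh/POSIX sh.
    have_binary () {
        if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
            false
        else
            # positional parameters are function-scoped in POSIX sh/ksh,
            # so `set --` stands in for a local variable
            set -- "`echo $1 | sed -e 's/ -.*//'`"
            test -x "`which $1 2>/dev/null`"
        fi
    }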
On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon
wrote:
Hello,
after configuring crm for IP automatic configuration, I stumbled upon a
problem with the IPaddr utility that I don't understand:
IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
couldn't find command: /usr/gnu/bin/awk
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
couldn't find command: /usr/gnu/bin/awk ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
exec-time:91ms queue-time:0ms
It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
-r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
sonicle@xstorage1:/sonicle/home# file /usr/gnu/bin/awk
/usr/gnu/bin/awk: ELF 64-bit LSB executable AMD64 Version 1, dynamically
linked, not stripped, no debugging information available
what may be happening??
Thanks!
Gabriele
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem

2020-07-28 Thread Gabriele Bulfon
Sorry, found the reason: I have to patch all the scripts; there were others I missed.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Gabriele Bulfon
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 28 July 2020 9:35:49 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
 
Thanks, I patched all the scripts in the build to have "#!/bin/bash" at the top, and I 
receive no errors now.
However, the IP is not configured :( I'm looking into it...
Is there any easy way to debug what the IP script is doing?
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ulrich Windl
To: users@clusterlabs.org
Date: 28 July 2020 9:12:41 CEST
Subject: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
You could try replacing "local" with "typeset", also.
Reid Wahl wrote on 28.07.2020 at 09:05 in message:
By the way, it doesn't necessarily have to be bash. Upon looking further, a
lot of shells support the `local` keyword, even though it's not required by
the POSIX standard. Plain ksh, however, does not :(
On Monday, July 27, 2020, Reid Wahl
wrote:
Hi, Gabriele. The `local` keyword is a bash built-in and not available in
some other shells (e.g., ksh). It's used in `have_binary()`, so it's
causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all the
"local: not found" errors. I just reproduced it to make sure.
check_binary () {
if ! have_binary "$1"; then
if [ "$OCF_NOT_RUNNING" = 7 ]; then
# Chances are we have a fully setup OCF environment
ocf_exit_reason "Setup problem: couldn't find command: $1"
else
echo "Setup problem: couldn't find command: $1"
fi
exit $OCF_ERR_INSTALLED
fi
}
have_binary () {
if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
false
else
local bin=`echo $1 | sed -e 's/ -.*//'`
test -x "`which $bin 2>/dev/null`"
fi
}
Is bash available on your system?
On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon
wrote:
Hello,
after configuring crm for IP automatic configuration, I stumbled upon a
problem with the IPaddr utility that I don't understand:
IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
couldn't find command: /usr/gnu/bin/awk
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
couldn't find command: /usr/gnu/bin/awk ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
exec-time:91ms queue-time:0ms
It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/bin/awk
-r-xr-xr-x 1 root bin 881864 Jun 1 12:25 /usr/gnu/bin/awk
sonicle@xstorage1:/sonicle/home# file /usr/gnu/bin/awk
/usr/gnu/bin/awk: ELF 64-bit LSB executable AMD64 Version 1, dynamically
linked, not stripped, no debugging information available
what may be happening??
Thanks!
Gabriele
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer

Re: [ClusterLabs] ip address configuration problem

2020-07-28 Thread Gabriele Bulfon
Working great! IP addresses configured, now I go for the ZFS pools ;)
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Reid Wahl
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 28 July 2020 9:44:26 CEST
Subject: Re: [ClusterLabs] ip address configuration problem
Great! And it would be --force-start --verbose --verbose.
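Another generic way to see what a resource agent is doing is to run it by hand with a minimal OCF environment and watch its output; a rough sketch, with illustrative values (the agent path appears in the logs above, everything else is assumed):

    # Rough sketch: exercise the IPaddr agent outside the cluster.
    # The ip value and OCF_ROOT path are illustrative, not taken from the thread.
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_ip=10.100.100.10
    /usr/lib/ocf/resource.d/heartbeat/IPaddr start
    echo "exit code: $?"    # 0 = success; other OCF return codes indicate the failure class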
On Tuesday, July 28, 2020, Gabriele Bulfon
gbul...@sonicle.com
wrote:
Sorry, found the reason: I have to patch all the scripts; there were others I missed.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon

From: Gabriele Bulfon gbul...@sonicle.com
To: Cluster Labs - All topics related to open-source clustering welcomed users@clusterlabs.org
Date: 28 July 2020 9:35:49 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
 
Thanks, I patched all the scripts in the build to have "#!/bin/bash" at the top, and I 
receive no errors now.
However, the IP is not configured :( I'm looking into it...
Is there any easy way to debug what the IP script is doing?
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de
To: users@clusterlabs.org
Date: 28 July 2020 9:12:41 CEST
Subject: [ClusterLabs] Antw: [EXT] Re: ip address configuration problem
You could try replacing "local" with "typeset", also.
Reid Wahl nw...@redhat.com wrote on 28.07.2020 at 09:05 in message
capiuu9_wln77dmk_+wo_gvqzsyb5kxc94fztk1iqutwi7xa...@mail.gmail.com:
By the way, it doesn't necessarily have to be bash. Upon looking further, a
lot of shells support the `local` keyword, even though it's not required by
the POSIX standard. Plain ksh, however, does not :(
On Monday, July 27, 2020, Reid Wahl
nw...@redhat.com
wrote:
Hi, Gabriele. The `local` keyword is a bash built-in and not available in
some other shells (e.g., ksh). It's used in `have_binary()`, so it's
causing `check_binary(/usr/gnu/bin/awk)` to fail. It's also causing all the
"local: not found" errors. I just reproduced it to make sure.
check_binary () {
if ! have_binary "$1"; then
if [ "$OCF_NOT_RUNNING" = 7 ]; then
# Chances are we have a fully setup OCF environment
ocf_exit_reason "Setup problem: couldn't find command: $1"
else
echo "Setup problem: couldn't find command: $1"
fi
exit $OCF_ERR_INSTALLED
fi
}
have_binary () {
if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
false
else
local bin=`echo $1 | sed -e 's/ -.*//'`
test -x "`which $bin 2>/dev/null`"
fi
}
Is bash available on your system?
On Mon, Jul 27, 2020 at 8:34 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
Hello,
after configuring crm for IP automatic configuration, I stumbled upon a
problem with the IPaddr utility that I don't understand:
IPaddr(xstha2_san0_IP)[10439]: 2020/07/27_17:26:17 ERROR: Setup problem:
couldn't find command: /usr/gnu/bin/awk
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [ ocf-exit-reason:Setup problem:
couldn't find command: /usr/gnu/bin/awk ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[185]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: notice: operation_finished:
xstha2_san0_IP_start_0:10439:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[186]: local: not found [No such
file or directory] ]
Jul 27 17:26:17 [10258] lrmd: info: log_finished: finished -
rsc:xstha2_san0_IP action:start call_id:22 pid:10439 exit-code:5
exec-time:91ms queue-time:0ms
It says it cannot find /usr/gnu/bin/awk but this is absolutely not true!
sonicle@xstorage1:/sonicle/home# ls -l /usr/gnu/

[ClusterLabs] Stonith failing

2020-07-28 Thread Gabriele Bulfon
Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by Corosync.
To check how stonith would work, I turned off the Corosync service on the second node.
The first node then attempts to stonith the second node and take over its resources, 
but this fails.
The stonith action is configured to run a custom script that issues ssh commands; both 
machines have reciprocal authorized keys to allow passwordless ssh.
The script does not implement the on/off commands, so it just returns 1 in those cases.
What I don't get is the "no route to host": who or what is it trying to reach?
 
Jul 28 10:48:18 [9636] pengine: warning: stage6: Scheduling Node xstha2 for 
STONITH
Jul 28 10:48:18 [9636] pengine: info: native_stop_constraints: 
xstha1-stonith_stop_0 is implicit after xstha2 is fenced
Jul 28 10:48:18 [9636] pengine: info: native_stop_constraints: 
xstha2_san0_IP_stop_0 is implicit after xstha2 is fenced
Jul 28 10:48:18 [9636] pengine: notice: LogActions: Stop xstha1-stonith (xstha2)
Jul 28 10:48:18 [9636] pengine: info: LogActions: Leave xstha2-stonith (Started 
xstha1)
Jul 28 10:48:18 [9636] pengine: info: LogActions: Leave xstha1_san0_IP (Started 
xstha1)
Jul 28 10:48:18 [9636] pengine: notice: LogActions: Move xstha2_san0_IP 
(Started xstha2 -> xstha1)
Jul 28 10:48:18 [9636] pengine: warning: process_pe_message: Calculated 
transition 15 (with warnings), saving inputs in 
/sonicle/var/cluster/lib/pacemaker/pengine/pe-warn-10.bz2
Jul 28 10:48:18 [9637] crmd: info: do_state_transition: State transition 
S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response
Jul 28 10:48:18 [9637] crmd: info: do_te_invoke: Processing graph 15 
(ref=pe_calc-dc-1595926098-89) derived from 
/sonicle/var/cluster/lib/pacemaker/pengine/pe-warn-10.bz2
Jul 28 10:48:18 [9637] crmd: notice: te_fence_node: Requesting fencing 
(poweroff) of node xstha2 | action=11 timeout=6
Jul 28 10:48:18 [9633] stonith-ng: notice: handle_request: Client 
crmd.9637.bb39d7c9 wants to fence (poweroff) 'xstha2' with device '(any)'
Jul 28 10:48:18 [9633] stonith-ng: notice: initiate_remote_stonith_op: 
Requesting peer fencing (poweroff) of xstha2 | 
id=6fe78117-0075-c655-927d-f9326e9a6630 state=0
Jul 28 10:48:19 [9633] stonith-ng: info: process_remote_stonith_query: Query 
result 1 of 1 from xstha1 for xstha2/poweroff (1 devices) 
6fe78117-0075-c655-927d-f9326e9a6630
Jul 28 10:48:19 [9633] stonith-ng: info: call_remote_stonith: Total timeout set 
to 60 for peer's fencing of xstha2 for 
crmd.9637|id=6fe78117-0075-c655-927d-f9326e9a6630
Jul 28 10:48:19 [9633] stonith-ng: info: call_remote_stonith: Requesting that 
'xstha1' perform op 'xstha2 poweroff' for crmd.9637 (72s, 0s)
Jul 28 10:48:20 [9633] stonith-ng: info: stonith_fence_get_devices_cb: Found 1 
matching devices for 'xstha2'
Jul 28 10:48:21 xstorage1 stonith: [12141]: CRIT: external_reset_req: 
'ssh-sonicle off' for host xstha2 failed with rc 1
Jul 28 10:48:21 [9633] stonith-ng: info: internal_stonith_action_execute: 
Attempt 2 to execute fence_legacy (poweroff). remaining timeout is 59
Jul 28 10:48:22 [9632] cib: info: cib_process_ping: Reporting our current 
digest to xstha1: 91ad9245488736582038cd758d58c08a for 0.8.75 (825f610 0)
Jul 28 10:48:23 xstorage1 stonith: [12152]: CRIT: external_reset_req: 
'ssh-sonicle off' for host xstha2 failed with rc 1
Jul 28 10:48:23 [9633] stonith-ng: info: update_remaining_timeout: Attempted to 
execute agent fence_legacy (poweroff) the maximum number of times (2) allowed
Jul 28 10:48:23 [9633] stonith-ng: error: log_operation: Operation 'poweroff' 
[12150] (call 14 from crmd.9637) for host 'xstha2' with device 'xstha2-stonith' 
returned: -61 (No data available)
Jul 28 10:48:23 [9633] stonith-ng: warning: log_operation: xstha2-stonith:12150 
[ Performing: stonith -t external/ssh-sonicle -T off xstha2 ]
Jul 28 10:48:23 [9633] stonith-ng: warning: log_operation: xstha2-stonith:12150 
[ failed: xstha2 5 ]
Jul 28 10:48:23 [9633] stonith-ng: notice: stonith_choose_peer: Couldn't find 
anyone to fence (poweroff) xstha2 with any device
Jul 28 10:48:23 [9633] stonith-ng: info: call_remote_stonith: None of the 1 
peers are capable of fencing (poweroff) xstha2 for crmd.9637 (1)
Jul 28 10:48:23 [9633] stonith-ng: error: remote_op_done: Operation poweroff of 
xstha2 by
for crmd.9637@xstha1.6fe78117: No route to host
Jul 28 10:48:23 [9637] crmd: notice: tengine_stonith_callback: Stonith 
operation 14/11:15:0:5814817b-c10a-c931-fd0e-e9ee3b3a8e59: No route to host 
(-148)
Jul 28 10:48:23 [9637] crmd: notice: tengine_stonith_callback: Stonith 
operation 14 for xstha2 failed (No route to host): aborting transition.
Jul 28 10:48:23 [9637] crmd: notice: abort_transition_graph: Transition 
aborted: Stonith failed | source=tengine_stonith_callback:749 complete=false
Jul 28 10:48:23 [9637] crmd: notice: tengine_stonith_notify: Peer xstha2 was 
not terminated (poweroff) by
for xstha1: No route to host (ref=6fe78117-0075-c655-927d-f9326e9a6630) by 
c

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-28 Thread Gabriele Bulfon
Thanks, I attach the script here.
It basically runs commands over ssh on the other node with no password (authorized 
keys must be preconfigured).
It was taken from an OpenIndiana script (I think).
As stated in its comments, we don't want to halt or boot via ssh, only reboot.
Maybe this is the problem; we should at least have it shut down when asked to.
 
Actually, if I stop corosync on node 2 I don't want it to shut down the system; I just 
want node 1 to keep control of all resources.
The same if I simply shut down node 2 manually: node 1 should keep control of all 
resources and release them back on reboot.
Instead, when I stopped corosync on node 2, the log showed an attempt to stonith 
node 2: why?
 
Thanks!
Gabriele
 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Reid Wahl
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 28 July 2020 12:03:46 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Gabriele,
 
"No route to host" is a somewhat generic error message when we can't find 
anyone to fence the node. It doesn't mean there's necessarily a network routing 
issue at fault; no need to focus on that error message.
 
I agree with Ulrich about needing to know what the script does. But based on 
your initial message, it sounds like your custom fence agent returns 1 in 
response to "on" and "off" actions. Am I understanding correctly? If so, why 
does it behave that way? Pacemaker is trying to run a poweroff action based on 
the logs, so it needs your script to support an off action.
On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de
wrote:
Gabriele Bulfon gbul...@sonicle.com wrote on 28.07.2020 at 10:56 in message:
Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
Corosync.
To check how stonith would work, I turned off Corosync service on second
node.
First node try to attempt to stonith 2nd node and take care of its
resources, but this fails.
Stonith action is configured to run a custom script to run ssh commands,
I think you should explain what that script does exactly.
[...]
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home:
https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/


ssh-sonicle
Description: Binary data
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-28 Thread Gabriele Bulfon
That one was taken from a specific implementation on Solaris 11.
The situation is a dual-node server with a shared storage controller: both nodes 
see the same disks concurrently.
Here we must be sure that the two nodes never import/mount the same zpool at the 
same time, or we will encounter data corruption: node 1 will be preferred for pool 1 
and node 2 for pool 2; only when one of the nodes goes down or is taken offline 
should the resources first be freed by the leaving node and then taken over by the 
other node.
 
Would you suggest one of the available stonith in this case?
 
Thanks!
Gabriele
 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Strahil Nikolov
To: Cluster Labs - All topics related to open-source clustering welcomed, Gabriele Bulfon
Date: 29 July 2020 6:39:08 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Do you have a reason not to use any stonith already available ?
Best Regards,
Strahil Nikolov
On 28 July 2020 13:26:52 GMT+03:00, Gabriele Bulfon wrote:
Thanks, I attach here the script.
It basically runs ssh on the other node with no password (must be
preconfigured via authorization keys) with commands.
This was taken from a script by OpenIndiana (I think).
As it stated in the comments, we don't want to halt or boot via ssh,
only reboot.
Maybe this is the problem, we should at least have it shutdown when
asked for.
 
Actually if I stop corosync in node 2, I don't want it to shutdown the
system but just let node 1 keep control of all resources.
Same if I just shutdown manually node 2, 
node 1 should keep control of all resources and release them back on
reboot.
Instead, when I stopped corosync on node 2, log was showing the
temptative to stonith node 2: why?
 
Thanks!
Gabriele
 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Reid Wahl
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 28 July 2020 12:03:46 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Gabriele,
 
"No route to host" is a somewhat generic error message when we can't
find anyone to fence the node. It doesn't mean there's necessarily a
network routing issue at fault; no need to focus on that error message.
 
I agree with Ulrich about needing to know what the script does. But
based on your initial message, it sounds like your custom fence agent
returns 1 in response to "on" and "off" actions. Am I understanding
correctly? If so, why does it behave that way? Pacemaker is trying to
run a poweroff action based on the logs, so it needs your script to
support an off action.
On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de
wrote:
Gabriele Bulfon gbul...@sonicle.com wrote on 28.07.2020 at 10:56 in message:
Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
Corosync.
To check how stonith would work, I turned off Corosync service on
second
node.
First node try to attempt to stonith 2nd node and take care of its
resources, but this fails.
Stonith action is configured to run a custom script to run ssh
commands,
I think you should explain what that script does exactly.
[...]
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home:
https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
Hi, it's a single controller shared by both nodes (an SM server).
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ulrich Windl
To: users@clusterlabs.org
Date: 29 July 2020 9:26:39 CEST
Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing
Gabriele Bulfon wrote on 29.07.2020 at 08:01 in message:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both
nodes see the same disks concurrently.
You mean you have a dual-controller setup (one controller on each node, both
connected to the same bus)? If so, use sbd!
Here we must be sure that the two nodes are not going to import/mount the
same zpool at the same time, or we will encounter data corruption: node 1
will be perferred for pool 1, node 2 for pool 2, only in case one of the
node
goes down or is taken offline the resources should be first free by the
leaving node and taken by the other node.
Would you suggest one of the available stonith in this case?
Thanks!
Gabriele
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon

--
From: Strahil Nikolov
To: Cluster Labs - All topics related to open-source clustering welcomed, Gabriele Bulfon
Date: 29 July 2020 6:39:08 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Do you have a reason not to use any stonith already available ?
Best Regards,
Strahil Nikolov
On 28 July 2020 13:26:52 GMT+03:00, Gabriele Bulfon wrote:
Thanks, I attach here the script.
It basically runs ssh on the other node with no password (must be
preconfigured via authorization keys) with commands.
This was taken from a script by OpenIndiana (I think).
As it stated in the comments, we don't want to halt or boot via ssh,
only reboot.
Maybe this is the problem, we should at least have it shutdown when
asked for.
Actually if I stop corosync in node 2, I don't want it to shutdown the
system but just let node 1 keep control of all resources.
Same if I just shutdown manually node 2,
node 1 should keep control of all resources and release them back on
reboot.
Instead, when I stopped corosync on node 2, log was showing the
temptative to stonith node 2: why?
Thanks!
Gabriele
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Reid Wahl
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 28 July 2020 12:03:46 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Gabriele,
"No route to host" is a somewhat generic error message when we can't
find anyone to fence the node. It doesn't mean there's necessarily a
network routing issue at fault; no need to focus on that error message.
I agree with Ulrich about needing to know what the script does. But
based on your initial message, it sounds like your custom fence agent
returns 1 in response to "on" and "off" actions. Am I understanding
correctly? If so, why does it behave that way? Pacemaker is trying to
run a poweroff action based on the logs, so it needs your script to
support an off action.
On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de
wrote:
Gabriele Bulfon gbul...@sonicle.com wrote on 28.07.2020 at 10:56 in message:
Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
Corosync.
To check how stonith would work, I turned off Corosync service on
second
node.
First node try to attempt to stonith 2nd node and take care of its
resources, but this fails.
Stonith action is configured to run a custom script to run ssh
commands,
I think you should explain what that script does exactly.
[...]
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home:
https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
It is a ZFS based illumos system.
I don't think SBD is an option.
Is there a reliable ZFS based stonith?
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Andrei Borzenkov
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
29 luglio 2020 9.46.09 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
 
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both nodes 
see the same disks concurrently.
Here we must be sure that the two nodes are not going to import/mount the same 
zpool at the same time, or we will encounter data corruption:
 
ssh based "stonith" cannot guarantee it.
 
node 1 will be preferred for pool 1, node 2 for pool 2; only in case one of the 
nodes goes down or is taken offline should the resources first be freed by the 
leaving node and then taken over by the other node.
 
Would you suggest one of the available stonith in this case?
 
 
IPMI, managed PDU, SBD ...
In practice, the only stonith method that works in case of complete node outage 
including any power supply is SBD.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Gabriele Bulfon
Thanks a lot for the extensive explanation!
Any idea about a ZFS stonith?
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Reid Wahl
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
29 luglio 2020 11.39.35 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
"As it stated in the comments, we don't want to halt or boot via ssh, only 
reboot."
 
Generally speaking, a stonith reboot action consists of the following basic 
sequence of events:
Execute the fence agent with the "off" action.
Poll the power status of the fenced node until it is powered off.
Execute the fence agent with the "on" action.
Poll the power status of the fenced node until it is powered on.
So a custom fence agent that supports reboots actually needs to support off 
and on actions.
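For illustration only, with an IPMI-style agent the same off/poll/on/poll
sequence might look roughly like this (credentials are hypothetical; the BMC
address matches the one used later in this thread):
 
    # "off" leg of the reboot, then poll until the BMC reports the node powered off
    ipmitool -I lanplus -H 192.168.221.19 -U ADMIN -P secret chassis power off
    until ipmitool -I lanplus -H 192.168.221.19 -U ADMIN -P secret chassis power status | grep -q 'is off'; do sleep 2; done
    # "on" leg, then poll until it is powered on again
    ipmitool -I lanplus -H 192.168.221.19 -U ADMIN -P secret chassis power on
    until ipmitool -I lanplus -H 192.168.221.19 -U ADMIN -P secret chassis power status | grep -q 'is on'; do sleep 2; done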
 
 
As Andrei noted, ssh is **not** a reliable method by which to ensure a node 
gets rebooted or stops using cluster-managed resources. You can't depend on the 
ability to SSH to an unhealthy node that needs to be fenced.
 
The only way to guarantee that an unhealthy or unresponsive node stops all 
access to shared resources is to power off or reboot the node. (In the case of 
resources that rely on shared storage, I/O fencing instead of power fencing can 
also work, but that's not ideal.)
 
As others have said, SBD is a great option. Use it if you can. There are also 
power fencing methods (one example is fence_ipmilan, but the options available 
depend on your hardware or virt platform) that are reliable under most 
circumstances.
 
You said that when you stop corosync on node 2, Pacemaker tries to fence node 
2. There are a couple of possible reasons for that. One possibility is that you 
stopped or killed corosync without stopping Pacemaker first. (If you use pcs, 
then try `pcs cluster stop`.) Another possibility is that resources failed to 
stop during cluster shutdown on node 2, causing node 2 to be fenced.
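(With crmsh, the rough equivalent should be the following, assuming the crmsh
cluster commands are available on your platform:
 
    crm cluster stop
)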
On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov
arvidj...@gmail.com
wrote:
 
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both nodes 
see the same disks concurrently.
Here we must be sure that the two nodes are not going to import/mount the same 
zpool at the same time, or we will encounter data corruption:
 
ssh based "stonith" cannot guarantee it.
 
node 1 will be preferred for pool 1, node 2 for pool 2; only in case one of the 
nodes goes down or is taken offline should the resources first be freed by the 
leaving node and then taken over by the other node.
 
Would you suggest one of the available stonith in this case?
 
 
IPMI, managed PDU, SBD ...
In practice, the only stonith method that works in case of complete node outage 
including any power supply is SBD.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home:
https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing

2020-07-30 Thread Gabriele Bulfon
It is this system:
https://www.supermicro.com/products/system/1u/1029/SYS-1029TP-DC0R.cfm
 
It has a SAS3 backplane with hot-swap SAS disks that are visible to both nodes 
at the same time.
 
Gabriele 
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ulrich Windl
A: users@clusterlabs.org
Data: 29 luglio 2020 15.15.17 CEST
Oggetto: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing
Gabriele Bulfon
schrieb am 29.07.2020 um 14:18 in
Nachricht
:
Hi, it's a single controller, shared by both nodes, in an SM server.
You mean an external controller, like a NAS or SAN? I thought you were talking about
an internal controller like SCSI...
I don't know what an "SM server" is.
Regards,
Ulrich
Thanks!
Gabriele
Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon

--
Da: Ulrich Windl
A: users@clusterlabs.org
Data: 29 luglio 2020 9.26.39 CEST
Oggetto: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing
Gabriele Bulfon
schrieb am 29.07.2020 um 08:01 in
Nachricht
:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both
nodes see the same disks concurrently.
You mean you have a dual-controller setup (one controller on each node, both
connected to the same bus)? If so, use SBD!
Here we must be sure that the two nodes are not going to import/mount the
same zpool at the same time, or we will encounter data corruption: node 1
will be preferred for pool 1, node 2 for pool 2; only in case one of the
nodes goes down or is taken offline should the resources first be freed by the
leaving node and then taken over by the other node.
Would you suggest one of the available stonith in this case?
Thanks!
Gabriele
Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon

--
Da: Strahil Nikolov
A: Cluster Labs - All topics related to open-source clustering welcomed
Gabriele Bulfon
Data: 29 luglio 2020 6.39.08 CEST
Oggetto: Re: [ClusterLabs] Antw: [EXT] Stonith failing
Do you have a reason not to use any stonith already available ?
Best Regards,
Strahil Nikolov
На 28 юли 2020 г. 13:26:52 GMT+03:00, Gabriele Bulfon
написа:
Thanks, I attach the script here.
It basically runs commands over ssh on the other node without a password
(authorized keys must be preconfigured).
This was taken from a script by OpenIndiana (I think).
As stated in the comments, we don't want to halt or boot via ssh,
only reboot.
Maybe this is the problem; we should at least have it shut down when
asked to.
Actually, if I stop corosync on node 2, I don't want it to shut down the
system, but just let node 1 keep control of all resources.
The same applies if I just shut down node 2 manually:
node 1 should keep control of all resources and release them back on
reboot.
Instead, when I stopped corosync on node 2, the log showed an attempt
to stonith node 2: why?
Thanks!
Gabriele
Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Reid Wahl
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
28 luglio 2020 12.03.46 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
Gabriele,
"No route to host" is a somewhat generic error message when we can't
find anyone to fence the node. It doesn't mean there's necessarily a
network routing issue at fault; no need to focus on that error message.
I agree with Ulrich about needing to know what the script does. But
based on your initial message, it sounds like your custom fence agent
returns 1 in response to "on" and "off" actions. Am I understanding
correctly? If so, why does it behave that way? Pacemaker is trying to
run a poweroff action based on the logs, so it needs your script to
support an off action.
On Tue, Jul 28, 2020 at 2:47 AM Ulrich Windl
ulrich.wi...@rz.uni-regensburg.de
wrote:
Gabriele Bulfon
gbul...@sonicle.com
schrieb am 28.07.2020 um 10:56 in
Nachricht
:
Hi, now I have my two nodes (xstha1 and xstha2) with IPs configured by
Corosync.
To check how stonith would work, I turned off the Corosync service on the
second node.
The first node attempts to stonith the 2nd node and take over its
resources, but this fails.
The stonith action is configured to run a custom script that runs ssh
commands,
I think you should explain what that script does exactly.
[...]
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-30 Thread Gabriele Bulfon
Reading about sbd from SuSE, I saw that it requires a special block device to 
write information to; I don't think this is possible here.
 
It's a dual-node ZFS storage system running our own XStreamOS/illumos distribution, 
and here we're trying to add HA capabilities.
We can move IPs, ZFS pools and COMSTAR/iSCSI/FC, and we are now looking for a stable 
way to manage stonith.
 
The hardware system is this:
 
https://www.supermicro.com/products/system/1u/1029/SYS-1029TP-DC0R.cfm
 
and it features a shared SAS3 backplane, so both nodes can see all the disks 
concurrently.
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Reid Wahl
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
30 luglio 2020 6.38.58 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
I don't know of a stonith method that acts upon a filesystem directly. You'd 
generally want to act upon the power state of the node or upon the underlying 
shared storage.
 
What kind of hardware or virtualization platform are these systems running on? 
If there is a hardware watchdog timer, then sbd is possible. The fence_sbd 
agent (poison-pill fencing via block device) requires shared block storage, but 
sbd itself only requires a hardware watchdog timer.
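For context, on platforms where sbd is available, watchdog-only sbd needs little 
more than a watchdog device and one cluster property; a rough sketch (file path, 
variable names and timeouts are illustrative and vary by distribution):
 
    # /etc/sysconfig/sbd -- no SBD_DEVICE is needed in watchdog-only mode
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5
 
    # tell Pacemaker to rely on the watchdog for self-fencing
    crm configure property stonith-watchdog-timeout=10s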
 
Additionally, there may be an existing fence agent that can connect to the 
controller you mentioned. What kind of controller is it?
On Wed, Jul 29, 2020 at 5:24 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
Thanks a lot for the extensive explanation!
Any idea about a ZFS stonith?
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Reid Wahl
nw...@redhat.com
A:
Cluster Labs - All topics related to open-source clustering welcomed
users@clusterlabs.org
Data:
29 luglio 2020 11.39.35 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
"As it stated in the comments, we don't want to halt or boot via ssh, only 
reboot."
 
Generally speaking, a stonith reboot action consists of the following basic 
sequence of events:
Execute the fence agent with the "off" action.
Poll the power status of the fenced node until it is powered off.
Execute the fence agent with the "on" action.
Poll the power status of the fenced node until it is powered on.
So a custom fence agent that supports reboots actually needs to support off 
and on actions.
 
 
As Andrei noted, ssh is **not** a reliable method by which to ensure a node 
gets rebooted or stops using cluster-managed resources. You can't depend on the 
ability to SSH to an unhealthy node that needs to be fenced.
 
The only way to guarantee that an unhealthy or unresponsive node stops all 
access to shared resources is to power off or reboot the node. (In the case of 
resources that rely on shared storage, I/O fencing instead of power fencing can 
also work, but that's not ideal.)
 
As others have said, SBD is a great option. Use it if you can. There are also 
power fencing methods (one example is fence_ipmilan, but the options available 
depend on your hardware or virt platform) that are reliable under most 
circumstances.
 
You said that when you stop corosync on node 2, Pacemaker tries to fence node 
2. There are a couple of possible reasons for that. One possibility is that you 
stopped or killed corosync without stopping Pacemaker first. (If you use pcs, 
then try `pcs cluster stop`.) Another possibility is that resources failed to 
stop during cluster shutdown on node 2, causing node 2 to be fenced.
On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov
arvidj...@gmail.com
wrote:
 
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both nodes 
see the same disks concurrently.
Here we must be sure that the two nodes are not going to import/mount the same 
zpool at the same time, or we will encounter data corruption:
 
ssh based "stonith" cannot guarantee it.
 
node 1 will be preferred for pool 1, node 2 for pool 2; only in case one of the 
nodes goes down or is taken offline should the resources first be freed by the 
leaving node and then taken over by the other node.
 
Would you suggest one of the available stonith in this case?
 
 
IPMI, managed PDU, SBD ...
In practice, the only stonith method that works in case of complete node outage 
including any power supply is SBD.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home:
https://www.clusterlabs.org/
--
Regards,
Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-14 Thread Gabriele Bulfon
Thanks to all your suggestions, I now have the systems with stonith configured 
on IPMI.
 
Two questions:
- how can I simulate a stonith situation to check that everything is OK?
- considering that each node has stonith configured against the other, if 
the two nodes lose communication with each other, how can I be sure they will 
not try to stonith each other?
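One way to exercise the first point by hand, assuming crmsh and stonith_admin 
are available here, is to request a fence explicitly and watch what happens:
 
    crm node fence xstha2
    # or go through the fencer directly
    stonith_admin --reboot xstha2 --verbose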
 
:)
Thanks!
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
29 luglio 2020 14.22.42 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
 
It is a ZFS based illumos system.
I don't think SBD is an option.
Is there a reliable ZFS based stonith?
 
Gabriele
 
 
Sonicle S.r.l. 
: 
http://www.sonicle.com
Music: 
http://www.gabrielebulfon.com
Quantum Mechanics : 
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Andrei Borzenkov
A:
Cluster Labs - All topics related to open-source clustering welcomed
Data:
29 luglio 2020 9.46.09 CEST
Oggetto:
Re: [ClusterLabs] Antw: [EXT] Stonith failing
 
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon
gbul...@sonicle.com
wrote:
That one was taken from a specific implementation on Solaris 11.
The situation is a dual node server with shared storage controller: both nodes 
see the same disks concurrently.
Here we must be sure that the two nodes are not going to import/mount the same 
zpool at the same time, or we will encounter data corruption:
 
ssh based "stonith" cannot guarantee it.
 
node 1 will be preferred for pool 1, node 2 for pool 2; only in case one of the 
nodes goes down or is taken offline should the resources first be freed by the 
leaving node and then taken over by the other node.
 
Would you suggest one of the available stonith in this case?
 
 
IPMI, managed PDU, SBD ...
In practice, the only stonith method that works in case of complete node outage 
including any power supply is SBD.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Recoveing from node failure

2020-12-11 Thread Gabriele Bulfon
Hi, I finally managed to get stonith with IPMI working in my 2-node XStreamOS/illumos 
storage cluster.
I have NFS IPs and the shared storage zpool moving from one node to the other, and 
stonith controlling IPMI power-off when something is not clear.
 
What happens now is that if I shutdown 2nd node, I see the OFFLINE status from 
node 1 and everything is up and running, and this is ok:
 
Online: [ xstha1 ]
OFFLINE: [ xstha2 ]
Full list of resources:
 xstha1_san0_IP      (ocf::heartbeat:IPaddr):        Started xstha1
 xstha2_san0_IP      (ocf::heartbeat:IPaddr):        Started xstha1
 xstha1-stonith      (stonith:external/ipmi):        Started xstha1
 xstha2-stonith      (stonith:external/ipmi):        Started xstha1
 zpool_data  (ocf::heartbeat:ZFS):   Started xstha1
But if I also reboot the 1st node, it comes up seeing node 2 in the UNCLEAN state and nothing is 
running, so I clear the state of node 2, but the resources are not started:

Online: [ xstha1 ]
OFFLINE: [ xstha2 ]
Full list of resources:
 xstha1_san0_IP      (ocf::heartbeat:IPaddr):        Stopped
 xstha2_san0_IP      (ocf::heartbeat:IPaddr):        Stopped
 xstha1-stonith      (stonith:external/ipmi):        Stopped
 xstha2-stonith      (stonith:external/ipmi):        Stopped
 zpool_data  (ocf::heartbeat:ZFS):   Stopped
I tried restarting zpool_data or other resources:
# crm resource start zpool_data
but nothing happens!
How can I recover from this state? Node2 needs to stay down, but I want node1 
to work.
Thanks!
Gabriele 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-11 Thread Gabriele Bulfon
That's what I suspect:
 
sonicle@xstorage1:/sonicle/home$ pfexec crm_mon -1Arfj
Stack: corosync
Current DC: xstha1 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Fri Dec 11 11:49:50 2020          Last change: Fri Dec 11 
11:00:38 2020 by hacluster via cibadmin on xstha1
2 nodes and 5 resources configured
Online: [ xstha1 ]
OFFLINE: [ xstha2 ]
Full list of resources:
 xstha1_san0_IP (ocf::heartbeat:IPaddr):        Stopped
 xstha2_san0_IP (ocf::heartbeat:IPaddr):        Stopped
 xstha1-stonith (stonith:external/ipmi):        Stopped
 xstha2-stonith (stonith:external/ipmi):        Stopped
 zpool_data     (ocf::heartbeat:ZFS):   Stopped
Node Attributes:
* Node xstha1:
Migration Summary:
* Node xstha1:

 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 11 dicembre 2020 11.35.44 CET
Oggetto: [ClusterLabs] Antw: [EXT] Recoveing from node failure


Hi!

Did you take care for special "two node" settings (quorum I mean)?
When I use "crm_mon -1Arfj", I see something like
" * Current DC: h19 (version 
2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with 
quorum"

What do you see?

Regards,
Ulrich

>>> Gabriele Bulfon  schrieb am 11.12.2020 um 11:23 in
Nachricht <350849824.6300.1607682209284@www>:
> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> storage cluster.
> I have NFS IPs and shared storage zpool moving from one node or the other, 
> and stonith controllin ipmi powering off when something is not clear.
> 
> What happens now is that if I shutdown 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha1-stonith (stonith:external/ipmi): Started xstha1
> xstha2-stonith (stonith:external/ipmi): Started xstha1
> zpool_data (ocf::heartbeat:ZFS): Started xstha1
> But if also reboot 1st node, it starts with the UNCLEAN state, nothing is 
> running, so I clearstate of node 2, but resources are not started:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha1-stonith (stonith:external/ipmi): Stopped
> xstha2-stonith (stonith:external/ipmi): Stopped
> zpool_data (ocf::heartbeat:ZFS): Stopped
> I tried restarting zpool_data or other resources:
> # crm resource start zpool_data
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want 
> node1 to work.
> Thanks!
> Gabriele 
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Recoveing from node failure

2020-12-11 Thread Gabriele Bulfon
I tried setting wait_for_all: 0, but then when I start only the 1st node, it 
powers itself off after a few minutes! :O :O :O
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Reid Wahl 
A: Cluster Labs - All topics related to open-source clustering welcomed 
 
Data: 11 dicembre 2020 11.40.16 CET
Oggetto: Re: [ClusterLabs] Recoveing from node failure


Hi, Gabriele. It sounds like you don't have quorum on node 1.
Resources won't start unless the node is part of a quorate cluster
partition.

You probably have "two_node: 1" configured by default in
corosync.conf. This setting automatically enables wait_for_all.

>From the votequorum(5) man page:

NOTES: enabling two_node: 1 automatically enables
wait_for_all. It is still possible to override wait_for_all by
explicitly setting it to 0. If more than 2 nodes join the cluster,
the two_node
option is automatically disabled.

wait_for_all: 1

Enables Wait For All (WFA) feature (default: 0).

The general behaviour of votequorum is to switch a cluster from
inquorate to quorate as soon as possible. For example, in an 8 node
cluster, where every node has 1 vote, expected_votes is set to 8
and quorum is (50% + 1) 5. As soon as 5 (or more) nodes are
visible to each other, the partition of 5 (or more) becomes quorate
and can start operating.

When WFA is enabled, the cluster will be quorate for the first
time only after all nodes have been visible at least once at the same
time.

This feature has the advantage of avoiding some startup race
conditions, with the cost that all nodes need to be up at the same
time at least once before the cluster can operate.

You can either unblock quorum (`pcs quorum unblock` with pcs -- not
sure how to do it with crmsh) or set `wait_for_all: 0` in
corosync.conf and restart the cluster services.
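For reference, the corresponding quorum stanza in corosync.conf would look 
something like this (illustrative, matching the defaults described above):
 
    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implicitly turns wait_for_all on; set it to 0 to override
        wait_for_all: 0
    }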

On Fri, Dec 11, 2020 at 2:23 AM Gabriele Bulfon  wrote:
>
> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> storage cluster.
> I have NFS IPs and shared storage zpool moving from one node or the other, 
> and stonith controllin ipmi powering off when something is not clear.
>
> What happens now is that if I shutdown 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
>
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha1-stonith (stonith:external/ipmi): Started xstha1
> xstha2-stonith (stonith:external/ipmi): Started xstha1
> zpool_data (ocf::heartbeat:ZFS): Started xstha1
>
> But if also reboot 1st node, it starts with the UNCLEAN state, nothing is 
> running, so I clearstate of node 2, but resources are not started:
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha1-stonith (stonith:external/ipmi): Stopped
> xstha2-stonith (stonith:external/ipmi): Stopped
> zpool_data (ocf::heartbeat:ZFS): Stopped
>
> I tried restarting zpool_data or other resources:
>
> # crm resource start zpool_data
>
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want node1 
> to work.
>
> Thanks!
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-11 Thread Gabriele Bulfon
I cannot "use wait_for_all: 0", cause this would move automatically a powered 
off node from UNCLEAN to OFFLINE and mount the ZFS pool (total risk!): I want 
to manually move from UNCLEAN to OFFLINE, when I know that 2nd node is actually 
off!
 
Actually with wait_for_all to default (1) that was the case, so node1 would 
wait for my intervention when booting and node2 is down.
So what think I need is some way to manually override the quorum in such a case 
(node 2 down for maintenance, node 1 reboot), so I would manually turn OFFLINE 
node2 from UNCLEAN, manually override quorum and have zpool mount and NFS ip up.
 
Any idea?
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 11 dicembre 2020 11.35.44 CET
Oggetto: [ClusterLabs] Antw: [EXT] Recoveing from node failure


Hi!

Did you take care for special "two node" settings (quorum I mean)?
When I use "crm_mon -1Arfj", I see something like
" * Current DC: h19 (version 
2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with 
quorum"

What do you see?

Regards,
Ulrich

>>> Gabriele Bulfon  schrieb am 11.12.2020 um 11:23 in
Nachricht <350849824.6300.1607682209284@www>:
> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> storage cluster.
> I have NFS IPs and shared storage zpool moving from one node or the other, 
> and stonith controllin ipmi powering off when something is not clear.
> 
> What happens now is that if I shutdown 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha1-stonith (stonith:external/ipmi): Started xstha1
> xstha2-stonith (stonith:external/ipmi): Started xstha1
> zpool_data (ocf::heartbeat:ZFS): Started xstha1
> But if also reboot 1st node, it starts with the UNCLEAN state, nothing is 
> running, so I clearstate of node 2, but resources are not started:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha1-stonith (stonith:external/ipmi): Stopped
> xstha2-stonith (stonith:external/ipmi): Stopped
> zpool_data (ocf::heartbeat:ZFS): Stopped
> I tried restarting zpool_data or other resources:
> # crm resource start zpool_data
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want 
> node1 to work.
> Thanks!
> Gabriele 
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-11 Thread Gabriele Bulfon
I found I can do this temporarily:
 
crm config property cib-bootstrap-options: no-quorum-policy=ignore
 
then once node 2 is up again:
 
crm config property cib-bootstrap-options: no-quorum-policy=stop
 
so that I make sure the nodes will not mount the pool in another strange situation.
 
Is there any better way? (such as ignoring quorum until everything is back to normal, 
then considering stop again)
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

 


Da: Gabriele Bulfon 
A: Cluster Labs - All topics related to open-source clustering welcomed 

Data: 11 dicembre 2020 15.51.28 CET
Oggetto: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure



 
I cannot "use wait_for_all: 0", cause this would move automatically a powered 
off node from UNCLEAN to OFFLINE and mount the ZFS pool (total risk!): I want 
to manually move from UNCLEAN to OFFLINE, when I know that 2nd node is actually 
off!
 
Actually with wait_for_all to default (1) that was the case, so node1 would 
wait for my intervention when booting and node2 is down.
So what think I need is some way to manually override the quorum in such a case 
(node 2 down for maintenance, node 1 reboot), so I would manually turn OFFLINE 
node2 from UNCLEAN, manually override quorum and have zpool mount and NFS ip up.
 
Any idea?
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 11 dicembre 2020 11.35.44 CET
Oggetto: [ClusterLabs] Antw: [EXT] Recoveing from node failure


Hi!

Did you take care for special "two node" settings (quorum I mean)?
When I use "crm_mon -1Arfj", I see something like
" * Current DC: h19 (version 
2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with 
quorum"

What do you see?

Regards,
Ulrich

>>> Gabriele Bulfon  schrieb am 11.12.2020 um 11:23 in
Nachricht <350849824.6300.1607682209284@www>:
> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> storage cluster.
> I have NFS IPs and shared storage zpool moving from one node or the other, 
> and stonith controllin ipmi powering off when something is not clear.
> 
> What happens now is that if I shutdown 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha1-stonith (stonith:external/ipmi): Started xstha1
> xstha2-stonith (stonith:external/ipmi): Started xstha1
> zpool_data (ocf::heartbeat:ZFS): Started xstha1
> But if also reboot 1st node, it starts with the UNCLEAN state, nothing is 
> running, so I clearstate of node 2, but resources are not started:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha1-stonith (stonith:external/ipmi): Stopped
> xstha2-stonith (stonith:external/ipmi): Stopped
> zpool_data (ocf::heartbeat:ZFS): Stopped
> I tried restarting zpool_data or other resources:
> # crm resource start zpool_data
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want 
> node1 to work.
> Thanks!
> Gabriele 
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-12 Thread Gabriele Bulfon
Thanks, I will experiment with this.
 
Now, I have one last issue about stonith.
I tried to reproduce a stonith situation by disabling the network interface 
used for HA on node 1.
Stonith is configured with IPMI power-off.
What happens is that once the interface is down, both nodes try to stonith 
the other node, causing both to power off...
I would like the node running all the resources (zpool and NFS IP) to be the first 
to try to stonith the other node.
Or is there a better approach?
 
Here is the current crm config show:
 
node 1: xstha1 \
        attributes standby=off maintenance=off
node 2: xstha2 \
        attributes standby=off maintenance=off
primitive xstha1-stonith stonith:external/ipmi \
        params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="**" interface=lanplus \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha1_san0_IP IPaddr \
        params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
primitive xstha2-stonith stonith:external/ipmi \
        params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="**" interface=lanplus \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha2_san0_IP IPaddr \
        params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
primitive zpool_data ZFS \
        params pool=test \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
location xstha1-stonith-pref xstha1-stonith -inf: xstha1
location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
location xstha2-stonith-pref xstha2-stonith -inf: xstha2
location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
location zpool_data_pref zpool_data 100: xstha1
colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        stonith-action=poweroff \
        no-quorum-policy=stop
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Andrei Borzenkov 
A: users@clusterlabs.org 
Data: 11 dicembre 2020 18.30.29 CET
Oggetto: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure


11.12.2020 18:37, Gabriele Bulfon пишет:
> I found I can do this temporarily:
>  
> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>  

All the two-node clusters I remember run with this setting forever :)

> then once node 2 is up again:
>  
> crm config property cib-bootstrap-options: no-quorum-policy=stop
>  
> so that I make sure nodes will not mount in another strange situation.
>  
> Is there any better way? 

"better" us subjective, but ...

> (such as ignore until everything is back to normal then conisder top again)
>  

That is what stonith does. Because quorum is pretty much useless in a two-node
cluster, as I already said, all the clusters I have seen used
no-quorum-policy=ignore and stonith-enabled=true. It means that when a node
boots and the other node is not available, stonith is attempted; if stonith
succeeds, pacemaker continues with starting resources; if stonith fails, the
node is stuck.
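With crmsh, that combination would be set roughly like this (a sketch, not a
recommendation for every setup):
 
    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=true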

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-14 Thread Gabriele Bulfon
I understand, but I cannot implement the third node at the moment.
I will think about it later.
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 14 dicembre 2020 8.52.16 CET
Oggetto: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


>>> Gabriele Bulfon  schrieb am 11.12.2020 um 15:51 in
Nachricht <1053095478.6540.1607698288628@www>:
> I cannot "use wait_for_all: 0", cause this would move automatically a powered 
> off node from UNCLEAN to OFFLINE and mount the ZFS pool (total risk!): I want 
> to manually move from UNCLEAN to OFFLINE, when I know that 2nd node is 
> actually off!

Personally I think that if you have to confirm manually that a node is down, you need no 
cluster, because all actions would wait until the node is no longer unclean. I 
wouldn't want to be alerted in the middle of the night or at weekends just to 
confirm that there was some problem, when the cluster could handle it 
automatically while I sleep.

> 
> Actually with wait_for_all to default (1) that was the case, so node1 would 
> wait for my intervention when booting and node2 is down.
> So what think I need is some way to manually override the quorum in such a 
> case (node 2 down for maintenance, node 1 reboot), so I would manually turn 
> OFFLINE node2 from UNCLEAN, manually override quorum and have zpool mount and 
> NFS ip up.
> 
> Any idea?
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 
> 
> 
> 
> 
> 
> --
> 
> Da: Ulrich Windl 
> A: users@clusterlabs.org 
> Data: 11 dicembre 2020 11.35.44 CET
> Oggetto: [ClusterLabs] Antw: [EXT] Recoveing from node failure
> 
> 
> Hi!
> 
> Did you take care for special "two node" settings (quorum I mean)?
> When I use "crm_mon -1Arfj", I see something like
> " * Current DC: h19 (version 
> 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with 
> quorum"
> 
> What do you see?
> 
> Regards,
> Ulrich
> 
>>>> Gabriele Bulfon  schrieb am 11.12.2020 um 11:23 in
> Nachricht <350849824.6300.1607682209284@www>:
>> Hi, I finally could manage stonith with IPMI in my 2 nodes XStreamOS/illumos 
> 
>> storage cluster.
>> I have NFS IPs and shared storage zpool moving from one node or the other, 
>> and stonith controllin ipmi powering off when something is not clear.
>> 
>> What happens now is that if I shutdown 2nd node, I see the OFFLINE status 
>> from node 1 and everything is up and running, and this is ok:
>> 
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha1-stonith (stonith:external/ipmi): Started xstha1
>> xstha2-stonith (stonith:external/ipmi): Started xstha1
>> zpool_data (ocf::heartbeat:ZFS): Started xstha1
>> But if also reboot 1st node, it starts with the UNCLEAN state, nothing is 
>> running, so I clearstate of node 2, but resources are not started:
>> 
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha1-stonith (stonith:external/ipmi): Stopped
>> xstha2-stonith (stonith:external/ipmi): Stopped
>> zpool_data (ocf::heartbeat:ZFS): Stopped
>> I tried restarting zpool_data or other resources:
>> # crm resource start zpool_data
>> but nothing happens!
>> How can I recover from this state? Node2 needs to stay down, but I want 
>> node1 to work.
>> Thanks!
>> Gabriele 
>> 
>> 
>> Sonicle S.r.l. : http://www.sonicle.com 
>> Music: http://www.gabrielebulfon.com 
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>> 
> 
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-14 Thread Gabriele Bulfon
Thanks!

I tried the first option, by adding pcmk_delay_base to the two stonith primitives.
The first has 1 second, the second has 5 seconds.
It didn't work :( they still killed each other :(
Is there anything wrong with the way I did it?
 
Here's the config:
 
node 1: xstha1 \
        attributes standby=off maintenance=off
node 2: xstha2 \
        attributes standby=off maintenance=off
primitive xstha1-stonith stonith:external/ipmi \
        params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="***" 
interface=lanplus pcmk_delay_base=1 \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha1_san0_IP IPaddr \
        params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
primitive xstha2-stonith stonith:external/ipmi \
        params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="***" 
interface=lanplus pcmk_delay_base=5 \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha2_san0_IP IPaddr \
        params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
primitive zpool_data ZFS \
        params pool=test \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
location xstha1-stonith-pref xstha1-stonith -inf: xstha1
location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
location xstha2-stonith-pref xstha2-stonith -inf: xstha2
location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
location zpool_data_pref zpool_data 100: xstha1
colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        stonith-action=poweroff \
        no-quorum-policy=stop
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Andrei Borzenkov 
A: users@clusterlabs.org 
Data: 13 dicembre 2020 7.50.57 CET
Oggetto: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure


12.12.2020 20:30, Gabriele Bulfon пишет:
> Thanks, I will experiment this.
>  
> Now, I have a last issue about stonith.
> I tried to reproduce a stonith situation, by disabling the network interface 
> used for HA on node 1.
> Stonith is configured with ipmi poweroff.
> What happens, is that once the interface is down, both nodes tries to stonith 
> the other node, causing both to poweroff...

Yes, this is expected. The options are basically

1. Have separate stonith resource for each node and configure static
(pcmk_delay_base) or random dynamic (pcmk_delay_max) delays to avoid
both nodes starting stonith at the same time. This does not take
resources into account.

2. Use fencing topology and create pseudo-stonith agent that does not
attempt to do anything but just delays for some time before continuing
with actual fencing agent. Delay can be based on anything including
resources running on node.

3. If you are using pacemaker 2.0.3+, you could use new
priority-fencing-delay feature that implements resource-based priority
fencing:

+ controller/fencing/scheduler: add new feature 'priority-fencing-delay'
Optionally derive the priority of a node from the
resource-priorities
of the resources it is running.
In a fencing-race the node with the highest priority has a certain
advantage over the others as fencing requests for that node are
executed with an additional delay.
controlled via cluster option priority-fencing-delay (default = 0)


See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html
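A rough crmsh sketch of option 3 (only meaningful on Pacemaker 2.0.3 or later;
names as documented there, values purely illustrative):
 
    # give the node running the high-priority resources a head start in a fencing race
    crm configure property priority-fencing-delay=15s
    # derive node priority from the resources it runs, e.g. from the pool resource
    crm resource meta zpool_data set priority 10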

> I would like the node running all resources (zpool and nfs ip) to be the 
> first trying to stonith the other node.
> Or is there anything else better?
>  
> Here is the current crm config show:
>  

It is unreadable

> node 1: xstha1 \ attributes standby=off maintenance=offnode 2: xstha2 \ 
> attributes standby=off maintenance=offprimitive xstha1-stonith 
> stonith:external/ipmi \ params hostname=xstha1 ipaddr=192.168.221.18 
> userid=ADMIN passwd="**" interface=lanplus \ op monitor interval=25 
> timeout=25 start-delay=25 \ meta target-role=Startedprimitive xstha1_san0_IP 
> IPaddr \ params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0primitive 
> xstha2-stonith stonith:external/ipmi \ params hostname=xstha2 
> ipaddr=192.168.221.19 userid=ADMIN passwd="**" interface=lanplus \ op 
> monitor interval=25 timeout=25 start-delay=25 \ meta 
> target-role=Startedprimitive xstha2_san0_IP IPaddr \ params ip=10.10.10.2 
> cidr_netmask=255.255.255.0 nic=san0primitive zpool_data ZFS \ params 
> pool=test \ op start timeout=90 interval=0 \ op stop timeout=90 inte

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-14 Thread Gabriele Bulfon
I isolated the log from when everything happens (when I disable the HA interface); 
it is attached here.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Ulrich Windl 
A: users@clusterlabs.org 
Data: 14 dicembre 2020 11.53.22 CET
Oggetto: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


>>> Gabriele Bulfon  schrieb am 14.12.2020 um 11:48 in
Nachricht <1065144646.7212.1607942889206@www>:
> Thanks!
> 
> I tried first option, by adding pcmk_delay_base to the two stonith 
> primitives.
> First has 1 second, second has 5 seconds.
> It didn't work :( they still killed each other :(
> Anything wrong with the way I did it?

Hard to say without seeing the logs...

> 
> Here's the config:
> 
> node 1: xstha1 \
> attributes standby=off maintenance=off
> node 2: xstha2 \
> attributes standby=off maintenance=off
> primitive xstha1-stonith stonith:external/ipmi \
> params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN 
> passwd="***" interface=lanplus pcmk_delay_base=1 \
> op monitor interval=25 timeout=25 start-delay=25 \
> meta target-role=Started
> primitive xstha1_san0_IP IPaddr \
> params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
> primitive xstha2-stonith stonith:external/ipmi \
> params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN 
> passwd="***" interface=lanplus pcmk_delay_base=5 \
> op monitor interval=25 timeout=25 start-delay=25 \
> meta target-role=Started
> primitive xstha2_san0_IP IPaddr \
> params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
> primitive zpool_data ZFS \
> params pool=test \
> op start timeout=90 interval=0 \
> op stop timeout=90 interval=0 \
> meta target-role=Started
> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
> location zpool_data_pref zpool_data 100: xstha1
> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.15-e174ec8 \
> cluster-infrastructure=corosync \
> stonith-action=poweroff \
> no-quorum-policy=stop
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 
> 
> 
> 
> 
>
----
> --
> 
> Da: Andrei Borzenkov 
> A: users@clusterlabs.org 
> Data: 13 dicembre 2020 7.50.57 CET
> Oggetto: Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure
> 
> 
> 12.12.2020 20:30, Gabriele Bulfon пишет:
>> Thanks, I will experiment this.
>> 
>> Now, I have a last issue about stonith.
>> I tried to reproduce a stonith situation, by disabling the network
interface 
> used for HA on node 1.
>> Stonith is configured with ipmi poweroff.
>> What happens, is that once the interface is down, both nodes tries to 
> stonith the other node, causing both to poweroff...
> 
> Yes, this is expected. The options are basically
> 
> 1. Have separate stonith resource for each node and configure static
> (pcmk_delay_base) or random dynamic (pcmk_delay_max) delays to avoid
> both nodes starting stonith at the same time. This does not take
> resources in account.
> 
> 2. Use fencing topology and create pseudo-stonith agent that does not
> attempt to do anything but just delays for some time before continuing
> with actual fencing agent. Delay can be based on anything including
> resources running on node.
> 
> 3. If you are using pacemaker 2.0.3+, you could use new
> priority-fencing-delay feature that implements resource-based priority
> fencing:
> 
> + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
> Optionally derive the priority of a node from the
> resource-priorities
> of the resources it is running.
> In a fencing-race the node with the highest priority has a certain
> advantage over the others as fencing requests for that node are
> executed with an additional delay.
> controlled via cluster option priority-fencing-delay (default = 0)
> 
> 
> See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html 
> 
>> I would like the node running all resources (zpool and nfs ip) to be the 
> first trying to stonith the other node.
>> Or is there anyth

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-15 Thread Gabriele Bulfon
Here it is, thanks!

Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

Da: Andrei Borzenkov 
A: Cluster Labs - All topics related to open-source clustering welcomed 
 
Data: 14 dicembre 2020 15.56.32 CET
Oggetto: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon  wrote:
>
> I isolated the log when everything happens (when I disable the ha interface), 
> attached here.
>

And where are matching logs from the second node?
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Dec 14 12:35:26 [652] xstorage2 corosync notice  [TOTEM ] A processor failed, 
forming new configuration.
Dec 14 12:35:27 [652] xstorage2 corosync notice  [TOTEM ] A new membership 
(10.100.100.2:352) was formed. Members left: 1
Dec 14 12:35:27 [652] xstorage2 corosync notice  [TOTEM ] Failed to receive the 
leave message. failed: 1
Dec 14 12:35:27 [676]  attrd: info: pcmk_cpg_membership:Node 1 
left group attrd (peer=xstha1, counter=2.0)
Dec 14 12:35:27 [678]   crmd: info: pcmk_cpg_membership:Node 1 
left group crmd (peer=xstha1, counter=2.0)
Dec 14 12:35:27 [672] pacemakerd: info: pcmk_cpg_membership:Node 1 
left group pacemakerd (peer=xstha1, counter=2.0)
Dec 14 12:35:27 [678]   crmd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 14 12:35:27 [673]cib: info: pcmk_cpg_membership:Node 1 
left group cib (peer=xstha1, counter=2.0)
Dec 14 12:35:27 [676]  attrd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 14 12:35:27 [673]cib: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 14 12:35:27 [676]  attrd:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 14 12:35:27 [674] stonith-ng: info: pcmk_cpg_membership:Node 1 
left group stonith-ng (peer=xstha1, counter=2.0)
Dec 14 12:35:27 [672] pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 14 12:35:27 [678]   crmd: info: peer_update_callback:   Client 
xstha1/peer now has status [offline] (DC=true, changed=400)
Dec 14 12:35:27 [652] xstorage2 corosync notice  [QUORUM] Members[1]: 2
Dec 14 12:35:27 [673]cib:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 14 12:35:27 [678]   crmd: info: peer_update_callback:   Peer 
xstha1 left us
Dec 14 12:35:27 [676]  attrd:   notice: attrd_peer_remove:  Removing all 
xstha1 attributes for peer loss
Dec 14 12:35:27 [673]cib: info: crm_reap_dead_member:   
Removing node with name xstha1 and id 1 from membership cache
Dec 14 12:35:27 [678]   crmd: info: erase_status_tag:   Deleting xpath: 
//node_state[@uname='xstha1']/transient_attributes
Dec 14 12:35:27 [652] xstorage2 corosync notice  [MAIN  ] Completed service 
synchronization, ready to provide service.
Dec 14 12:35:27 [676]  attrd:   notice: attrd_peer_change_cb:   Lost 
attribute writer xstha1
Dec 14 12:35:27 [672] pacemakerd: info: pcmk_cpg_membership:Node 2 
still member of group pacemakerd (peer=xstha2, counter=2.0)
Dec 14 12:35:27 [673]cib:   notice: reap_crm_member:Purged 1 peers 
with id=1 and/or uname=xstha1 from the membership cache
Dec 14 12:35:27 [674] stonith-ng: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 14 12:35:27 [673]cib: info: pcmk_cpg_membership:Node 2 
still member of group cib (peer=xstha2, counter=2.0)
Dec 14 12:35:27 [674] stonith-ng:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 14 12:35:27 [678]   crmd:  warning: match_down_event:   No reason to 
expect node 1 to be down
Dec 14 12:35:27 [672] pacemakerd: info: pcmk_quorum_notification:   Quorum 
retained | membership=352 members=1
Dec 14 12:35:27 [676]  attrd: info: crm_reap_dead_member:   
Removing node with name xstha1 and id 1 from membership cache
Dec 14 12:35:27 [672] pacemakerd:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_reap_unseen_nodes
Dec 14 12:35:27 [676]  attrd:   notice: reap_crm_member:Purged 1 peers 
with id=1 and/or uname=xstha1 from the membership cache
Dec 14 12:

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-16 Thread Gabriele Bulfon
Thanks! I updated the package to version 1.1.24, but now I receive an error in 
pacemaker.log:
 
Dec 16 09:08:23 [5090] pacemakerd:    error: mcp_read_config:   Could not 
verify authenticity of CMAP provider: Operation not supported (48)
 
Any idea what's wrong? This was not happening on version 1.1.17.
What I noticed is that 1.1.24 mcp/corosync.c has a new section at line 334:
 
#if HAVE_CMAP
    rc = cmap_fd_get(local_handle, &fd);
    if (rc != CS_OK) {
.
and this is the failing part at line 343:
 
    if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0,(gid_t) 0, &found_pid,
                                            &found_uid, &found_gid))) {
 
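As a rough diagnostic, the corosync-cmapctl tool shipped with corosync 2.x can confirm 
whether corosync's CMAP IPC answers at all on this node (the key below is only a common 
example and exists only if cluster_name is set in corosync.conf; the failing call above 
is Pacemaker verifying the peer credentials of that connection, which can still fail 
even when plain CMAP access works):
 
    # list all CMAP keys; if this works, corosync and its CMAP service are reachable
    corosync-cmapctl | head
    # or read a single key
    corosync-cmapctl -g totem.cluster_name
 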
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: Cluster Labs - All topics related to open-source clustering welcomed 
 
Date: 15 December 2020 10:52:46 CET
Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


pcmk_delay_base was introduced in 1.1.17 and you apparently have
1.1.15 (unless it was backported by your distribution). Sorry.

pcmk_delay_max may work, I cannot find in changelog when it appeared.


On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon  wrote:
>
> Here it is, thanks!
>
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
> --
>
> From: Andrei Borzenkov 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Date: 14 December 2020 15:56:32 CET
> Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure
>
> On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon  wrote:
> >
> > I isolated the log when everything happens (when I disable the ha 
> > interface), attached here.
> >
>
> And where are matching logs from the second node?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-16 Thread Gabriele Bulfon
Ok, I used some OpenIndiana patches and now it works, and also accepts the 
pcmk_delay_base param.
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

 


From: Gabriele Bulfon 
To: Cluster Labs - All topics related to open-source clustering welcomed 

Date: 16 December 2020 9:27:31 CET
Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure



 
Thanks! I updated the package to version 1.1.24, but now I see an error in 
pacemaker.log:
 
Dec 16 09:08:23 [5090] pacemakerd:    error: mcp_read_config:   Could not 
verify authenticity of CMAP provider: Operation not supported (48)
 
Any idea what's wrong? This was not happening on version 1.1.17.
What I noticed is that 1.1.24 mcp/corosync.c has a new section at line 334:
 
#if HAVE_CMAP
    rc = cmap_fd_get(local_handle, &fd);
    if (rc != CS_OK) {
.
and this is the failing part at line 343:
 
    if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0,(gid_t) 0, &found_pid,
                                            &found_uid, &found_gid))) {
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: Cluster Labs - All topics related to open-source clustering welcomed 
 
Date: 15 December 2020 10:52:46 CET
Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


pcmk_delay_base was introduced in 1.1.17 and you apparently have
1.1.15 (unless it was backported by your distribution). Sorry.

pcmk_delay_max may work, I cannot find in changelog when it appeared.


On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon  wrote:
>
> Here it is, thanks!
>
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
> --
>
> From: Andrei Borzenkov 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Date: 14 December 2020 15:56:32 CET
> Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure
>
> On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon  wrote:
> >
> > I isolated the log when everything happens (when I disable the ha 
> > interface), attached here.
> >
>
> And where are matching logs from the second node?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Hi, I have now a two node cluster using stonith with different pcmk_delay_base, 
so that node 1 has priority to stonith node 2 in case of problems.
 
Though, there is still one problem: once node 2 delays its stonith action for 
10 seconds, and node 1 just 1, node 2 does not delay start of resources, so it 
happens that while it's not yet powered off by node 1 (and waiting its delay to 
power off node 1) it actually starts resources, causing a moment of few seconds 
where both NFS IP and ZFS pool (!) is mounted by both!
How can I delay node 2 resource start until the delayed stonith action is done? 
Or how can I just delay the resource start so I can make it larger than its 
pcmk_delay_base?
 
Also, I was suggested to set "stonith-enabled=true", but I don't know where to 
set this flag (cib-bootstrap-options is not happy with it...).
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Thanks, here are the logs, there are infos about how it tried to start 
resources on the nodes.
Keep in mind the node1 was already running the resources, and I simulated a 
problem by turning down the ha interface.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 16 December 2020 15:45:36 CET
Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:32 in
message <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> few seconds where both NFS IP and ZFS pool (!) is mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".
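Spelled out, that is roughly the following (just a sketch of the crm shell usage; the
value shown is the one being discussed, not a recommendation):
 
    crm configure property stonith-enabled=true
    # or interactively:
    crm configure
    crm(live)configure# property stonith-enabled=true
    crm(live)configure# verify
    crm(live)configure# commit
 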

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Dec 16 15:08:54 [642] xstorage2 corosync notice  [TOTEM ] A processor failed, 
forming new configuration.
Dec 16 15:08:56 [642] xstorage2 corosync notice  [TOTEM ] A new membership 
(10.100.100.2:408) was formed. Members left: 1
Dec 16 15:08:56 [642] xstorage2 corosync notice  [TOTEM ] Failed to receive the 
leave message. failed: 1
Dec 16 15:08:56 [666]  attrd: info: pcmk_cpg_membership:Group 
attrd event 2: xstha1 (node 1 pid 710) left via cluster exit
Dec 16 15:08:56 [663]cib: info: pcmk_cpg_membership:Group 
cib event 2: xstha1 (node 1 pid 707) left via cluster exit
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_cpg_membership:Group 
pacemakerd event 2: xstha1 (node 1 pid 687) left via cluster exit
Dec 16 15:08:56 [642] xstorage2 corosync notice  [QUORUM] Members[1]: 2
Dec 16 15:08:56 [662] pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [666]  attrd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_cpg_membership:Group 
pacemakerd event 2: xstha2 (node 2 pid 662) is member
Dec 16 15:08:56 [642] xstorage2 corosync notice  [MAIN  ] Completed service 
synchronization, ready to provide service.
Dec 16 15:08:56 [668]   crmd: info: pcmk_cpg_membership:Group 
crmd event 2: xstha1 (node 1 pid 712) left via cluster exit
Dec 16 15:08:56 [664] stonith-ng: info: pcmk_cpg_membership:Group 
stonith-ng event 2: xstha1 (node 1 pid 708) left via cluster exit
Dec 16 15:08:56 [663]cib: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [668]   crmd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [666]  attrd:   notice: attrd_remove_voter: Lost attribute 
writer xstha1
Dec 16 15:08:56 [664] stonith-ng: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha1[1] - corosync-cpg is now offline
Dec 16 15:08:56 [662] pacemakerd: info: pcmk_quorum_notification:   Quorum 
retained | membership=408 members=1
Dec 16 15:08:56 [663]cib:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 16 15:08:56 [664] stonith-ng:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Dec 16 15:08:56 [662] pacemakerd:   notice: crm_update_peer_state_iter: Node 
xstha1 state is now lost | nodeid=1 previous=member source=crm_reap_unseen_no

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Looking at the two logs, looks like corosync decided that xst1 was offline, 
while xst was still online.
I just issued an "ifconfig ha0 down" on xst1, so I expect both nodes cannot see 
other one, while I see these same lines both on xst1 and xst2 log:
 
Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      Cluster node 
xstha1 will be fenced: peer is no longer part of the cluster
Dec 16 15:08:56 [667]    pengine:  warning: determine_online_status:    Node 
xstha1 is unclean
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status_fencing:    
Node xstha2 is active
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status:    Node 
xstha2 is online
 
why xst2 and not xst1?
I would expect no action at all in this case, until stonith is done...
While it goes on with :
 
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
zpool_data_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
 
trying to stop everything on xst1 (but it's not runnable).
Then:
 
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
xstha1_san0_IP     ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha2_san0_IP  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
zpool_data         ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha1-stonith  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Stop       
xstha2-stonith     (           xstha1 )   due to node availability
 
as if xst2 has been elected to be the running node, not knowing that xst1 will kill 
xst2 within a few seconds.
 
What is wrong here?
 
Thanks!
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 

 


From: Gabriele Bulfon 
To: Cluster Labs - All topics related to open-source clustering welcomed 

Date: 16 December 2020 15:56:28 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource



 
Thanks, here are the logs, there are infos about how it tried to start 
resources on the nodes.
Keep in mind the node1 was already running the resources, and I simulated a 
problem by turning down the ha interface.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 16 December 2020 15:45:36 CET
Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:32 in
message <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> few seconds where both NFS IP and ZFS pool (!) is mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

<>
<>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Gabriele Bulfon
Looks like my installation doesn't like this attribute:
 
sonicle@xstorage1:/sonicle/var/log/cluster# crm configure property 
stonith-enabled=true
ERROR: Warnings found during check: config may not be valid
Do you still want to commit (y/n)? n
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 16 December 2020 15:45:36 CET
Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:32 in
message <1523391015.734.1608129155836@www>:
> Hi, I have now a two node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> Though, there is still one problem: once node 2 delays its stonith action 
> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> so it happens that while it's not yet powered off by node 1 (and waiting its 
> delay to power off node 1) it actually starts resources, causing a moment of 
> few seconds where both NFS IP and ZFS pool (!) is mounted by both!

AFAIK pacemaker will not start resources on a node that is scheduled for 
stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
for stonith to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
> to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
Sorry, I just meant xstha1 and xstha2, shortened as xst1 and xst2, the two 
nodes :)
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: users@clusterlabs.org 
Date: 17 December 2020 6:57:34 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource


16.12.2020 19:05, Gabriele Bulfon wrote:
> Looking at the two logs, looks like corosync decided that xst1 was offline, 
> while xst was still online.
> I just issued an "ifconfig ha0 down" on xst1, so I expect both nodes cannot 
> see other one, while I see these same lines both on xst1 and xst2 log:
>  
> Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      Cluster node 
> xstha1 will be fenced: peer is no longer part of the cluster

You should pay more attention to what you write. While occasional typos
of course happen, you consistently use wrong names for nodes which do
not match actual logs. If you continue this way nobody will be able to
follow.

> 
> why xst2 and not xst1?

Logs mention neither xst1 nor xst2.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-17 Thread Gabriele Bulfon
These:
 
https://github.com/OpenIndiana/oi-userland/tree/oi/hipster/components/cluster/pacemaker/patches
 
and some specifics of the build are here:
 
https://github.com/OpenIndiana/oi-userland/blob/oi/hipster/components/cluster/pacemaker/Makefile
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ken Gaillot 
To: Cluster Labs - All topics related to open-source clustering welcomed 
 
Date: 16 December 2020 19:57:31 CET
Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure


On Wed, 2020-12-16 at 15:16 +0100, Gabriele Bulfon wrote:
> Ok, I used some OpenIndiana patches and now it works, and also
> accepts the pcmk_delay_base param.

Good to know. Beyond ensuring Pacemaker builds, we don't target or test
BSDish distros much, so they're a bit more likely to see issues. We're
happy to take pull requests with fixes though. What patches did you
have to apply?

You may have run into something like this:

https://bugs.clusterlabs.org/show_bug.cgi?id=5397 

> 
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets

^^^ wow, love it

> 
> 
> 
> From: Gabriele Bulfon 
> To: Cluster Labs - All topics related to open-source clustering
> welcomed 
> Date: 16 December 2020 9:27:31 CET
> Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node
> failure
> 
> 
> > 
> > Thanks! I updated package to version 1.1.24, but now I receive an
> > error on pacemaker.log :
> > 
> > Dec 16 09:08:23 [5090] pacemakerd: error: mcp_read_config: 
> > Could not verify authenticity of CMAP provider: Operation not
> > supported (48)
> > 
> > Any idea what's wrong? This was not happening on version 1.1.17.
> > What I noticed is that 1.1.24 mcp/corosync.c has a new section at
> > line 334:
> > 
> > #if HAVE_CMAP
> > rc = cmap_fd_get(local_handle, &fd);
> > if (rc != CS_OK) {
> > .
> > and this is the failing part at line 343:
> > 
> > if (!(rv = crm_ipc_is_authentic_process(fd, (uid_t) 0,(gid_t)
> > 0, &found_pid,
> > &found_uid,
> > &found_gid))) {
> > 
> > Gabriele
> > 
> > 
> > Sonicle S.r.l. : http://www.sonicle.com
> > Music: http://www.gabrielebulfon.com
> > eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
> > 
> > 
> > 
> > 
> > -
> > -
> > 
> > > From: Andrei Borzenkov 
> > > To: Cluster Labs - All topics related to open-source clustering
> > > welcomed  
> > > Date: 15 December 2020 10:52:46 CET
> > > Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from
> > > node failure
> > 
> > > pcmk_delay_base was introduced in 1.1.17 and you apparently have
> > > 1.1.15 (unless it was backported by your distribution). Sorry.
> > > 
> > > pcmk_delay_max may work, I cannot find in changelog when it
> > > appeared.
> > > 
> > > 
> > > On Tue, Dec 15, 2020 at 11:52 AM Gabriele Bulfon <
> > > gbul...@sonicle.com> wrote:
> > > >
> > > > Here it is, thanks!
> > > >
> > > > Gabriele
> > > >
> > > >
> > > > Sonicle S.r.l. : http://www.sonicle.com
> > > > Music: http://www.gabrielebulfon.com
> > > > eXoplanets : 
> > > https://gabrielebulfon.bandcamp.com/album/exoplanets
> > > >
> > > >
> > > >
> > > >
> > > > -
> > > -
> > > >
> > > > From: Andrei Borzenkov 
> > > > To: Cluster Labs - All topics related to open-source clustering
> > > welcomed 
> > > > Date: 14 December 2020 15:56:32 CET
> > > > Subject: Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from
> > > node failure
> > > >
> > > > On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon <
> > > gbul...@sonicle.com> wrote:
> > > > >
> > > > > I isolated the log when everything happens (when I disable
> > > the ha interface), attached here.
> > > > >
> > > >
> > > > And where are matching logs from the second node?
> > > > 

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
Yes, sorry took same bash by mistake...here are the correct logs.
 
Yes, xstha1 has delay 10s so that I'm giving him precedence, xstha2 has delay 
1s and will be stonished earlier.
During the short time before xstha2 got powered off, I saw it had time to turn 
on NFS IP (I saw duplicated IP on xstha1).
And because configuration has "order zpool_data_order inf: zpool_data ( 
xstha1_san0_IP )", that means xstha2 had imported the zpool for a small time 
before being stonished, and this must never happen.
 
What suggests to me that resources were started on xstha2 (with the duplicated IP as 
an effect) are these log portions from xstha2.
These tell me it could not turn off resources on xstha1 (correct, it couldn't 
contact xstha1):

Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
zpool_data_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action 
xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
 
These tell me xstha2 took control of resources that were actually running on 
xstha1:

Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
xstha1_san0_IP     ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha2_san0_IP  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       
zpool_data         ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha1-stonith  
(Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Stop       
xstha2-stonith     (           xstha1 )   due to node availability
 
That stonith request turned out to be the last entry because xstha2 was killed by 
xstha1 before its 10s delay expired, which is what I wanted.
 
Gabriele
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: users@clusterlabs.org 
Date: 17 December 2020 6:38:33 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource


16.12.2020 17:56, Gabriele Bulfon wrote:
> Thanks, here are the logs, there are infos about how it tried to start 
> resources on the nodes.

Both logs are from the same node.

> Keep in mind the node1 was already running the resources, and I simulated a 
> problem by turning down the ha interface.
>  

There is no attempt to start resources in these logs. Logs end with
stonith request. As this node had delay 10s, it probably was
successfully eliminated by another node, but there are no logs from
another node.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Dec 16 15:08:07 [660] xstorage1 corosync notice  [TOTEM ] A processor failed, 
forming new configuration.
Dec 16 15:08:07 [660] xstorage1 corosync notice  [TOTEM ] The network interface 
is down.
Dec 16 15:08:08 [660] xstorage1 corosync notice  [TOTEM ] A new membership 
(127.0.0.1:408) was formed. Members left: 2
Dec 16 15:08:08 [660] xstorage1 corosync notice  [TOTEM ] Failed to receive the 
leave message. failed: 2
Dec 16 15:08:08 [710]  attrd: info: pcmk_cpg_membership:Group 
attrd event 2: xstha2 (node 2 pid 666) left via cluster exit
Dec 16 15:08:08 [707]cib: info: pcmk_cpg_membership:Group 
cib event 2: xstha2 (node 2 pid 663) left via cluster exit
Dec 16 15:08:08 [710]  attrd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha2[2] - corosync-cpg is now offline
Dec 16 15:08:08 [687] pacemakerd: info: pcmk_cpg_membership:Group 
pacemakerd event 2: xstha2 (node 2 pid 662) left via cluster exit
Dec 16 15:08:08 [708] stonith-ng: info: pcmk_cpg_membership:Group 
stonith-ng event 2: xstha2 (node 2 pid 664) left via cluster exit
Dec 16 15:08:08 [687] pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha2[2] - corosync-cpg is now offline
Dec 16 15:08:08 [708] stonith-ng: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node xstha2[2] - corosync-cpg is now offline
Dec 16 15:08:08 [710]  attrd:   notice: crm_update_peer_state_iter: Node 
xstha2 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Dec 16 15:08:08 [660] xstorage1 corosync notice  [QUORUM] Members[1]: 1
Dec 16 15:08:08 [710]  attrd:   notice: attrd_peer_remove:  Removing all 
xstha2 attributes for peer loss
Dec 16 15:08:08 [708] stonith-ng:   notice: crm_update_peer_state_iter: Node 
xstha2 state is n

Re: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
I see, but then I have two issues:
 
1. it is a dual node server, the HA interface is internal, I have no way to 
unplug it, that's why I tried turning it down
 
2. even in case I could test it by unplugging it, there is still the 
possibility that someone turns the interface down, causing a bad situation for 
the zpool...so I would like to understand why xstha2 decided to turn on IP and 
zpool when stonith of xstha1 was not yet done...
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 17 December 2020 7:48:46 CET
Subject: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource


>>> Gabriele Bulfon  wrote on 17.12.2020 at 09:14 in
message <2080536991.1106.1608192888030@www>:
> Thanks, here are the logs, there are infos about how it tried to start 
> resources on the nodes.
> Keep in mind the node1 was already running the resources, and I simulated a 
> problem by turning down the ha interface.

Please note that "turning down" an interface is NOT a realistic test; realistic 
would be to unplug the cable.

> 
> Gabriele
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 
> 
> 
> 
> 
> 
> --
> 
> From: Ulrich Windl 
> To: users@clusterlabs.org 
> Date: 17 December 2020 7:48:46 CET
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource
> 
> 
>>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:56 in
> message <386755316.773.1608130588146@www>:
>> Hi, I have now a two node cluster using stonith with different 
>> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
>> problems.
>> 
>> Though, there is still one problem: once node 2 delays its stonith action 
>> for 10 seconds, and node 1 just 1, node 2 does not delay start of resources, 
> 
>> so it happens that while it's not yet powered off by node 1 (and waiting its 
> 
>> delay to power off node 1) it actually starts resources, causing a moment of 
> 
>> few seconds where both NFS IP and ZFS pool (!) is mounted by both!
> 
> AFAIK pacemaker will not start resources on a node that is scheduled for 
> stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
> for stonith to start them elsewhere.
> 
>> How can I delay node 2 resource start until the delayed stonith action is 
>> done? Or how can I just delay the resource start so I can make it larger 
> than 
>> its pcmk_delay_base?
> 
> We probably need to see logs and configs to understand.
> 
>> 
>> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
>> to set this flag (cib-bootstrap-options is not happy with it...).
> 
> I think it's on by default, so you must have set it to false.
> In crm shell it is "configure# property stonith-enabled=...".
> 
> Regards,
> Ulrich
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
Sorry, sometimes I try to keep it simple, and maybe I'm leaving out information.
I think I found what happened, and actually xstha2 DID NOT mount the zpool, nor 
start the IP address.
 
Let's take a step back.
We have two IP resources: one is normally for xstha1, the other is normally for 
xstha2.
The zpool is normally for xstha1.
The two IPs are used to share NFS resources to a Proxmox cluster (that's why I 
call them NFS IPs).
The logic moves the zpool and then the IP to node xstha2, when xstha1 is not 
available, and vice versa.
 
I was confused by the "duplicated IP" message I've seen on xstha1 while xstha2 
was going to be stonished.
I was worried that xstha2 may have mounted the zpool when xstha1 had already 
mounted it.
Actually, reading again the "duplicated IP" message, it was xstha1 that (having 
the pool mounted and not seeing xstha2 anymore) got the xstha2 IP for NFS.
 
So I think there is no worry about the zpool!
I will now try to play with the "no-quorum-policy=ignore" to see if that 
actually works correctly with stonith enabled.
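For the record, a minimal sketch of how that property is set from the crm shell, plus 
the corosync-side alternative usually suggested for two-node clusters (both are only 
illustrations of the syntax, not a tested recipe for this particular setup):
 
    # pacemaker cluster property
    crm configure property no-quorum-policy=ignore
 
    # corosync.conf alternative (corosync 2.x votequorum)
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }
 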
 
Thanks for your help!
Gabriele 
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Andrei Borzenkov 
To: Cluster Labs - All topics related to open-source clustering welcomed 
 
Date: 17 December 2020 9:50:54 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource


On Thu, Dec 17, 2020 at 11:11 AM Gabriele Bulfon  wrote:
>
> Yes, sorry took same bash by mistake...here are the correct logs.
>
> Yes, xstha1 has delay 10s so that I'm giving him precedence, xstha2 has delay 
> 1s and will be stonished earlier.
> During the short time before xstha2 got powered off, I saw it had time to 
> turn on NFS IP (I saw duplicated IP on xstha1).

Again - please write so that others can understand you. How should we
know what "NFS IP" is supposed to be? You have two resources that
looks like they are IP related and neither of them has NFS in its
name: xstha1_san0_IP, xstha2_san0_IP. And even if they had NFS in
their names - which of two resources are you talking about?

According to logs from xstha1, it started to activate resources only
after stonith was confirmed

Dec 16 15:08:12 [708] stonith-ng: notice: log_operation:
Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2' with
device 'xstha2-stonith' returned: 0 (OK)
Dec 16 15:08:12 [708] stonith-ng: notice: remote_op_done:
Operation 'off' targeting xstha2 on xstha1 for
crmd.712@xstha1.e487e7cc: OK

It is possible that your IPMI/BMC/whatever implementation responds
with success before it actually completes this action. I have seen at
least some delays in the past. There is not really much that can be
done here except adding artificial delay to stonith resource agent.
You need to test IPMI functionality before using it in pacemaker.

In this case xstha1 may have configured xstha2_san0_IP resource before
xstha2 was down. This would explain duplicated IP.

> And because configuration has "order zpool_data_order inf: zpool_data ( 
> xstha1_san0_IP )", that means xstha2 had imported the zpool for a small time 
> before being stonished, and this must never happen.

There is no indication in logs that pacemaker started or attempted to
start either of xstha1_san0_IP or zpool_data on xstha2.

>
> What suggests me that resources were started on xstha2 (and duplicated IP is 
> an effect) are these logs portions of xstha2.

The resources xstha2_san0_IP *remained* started on xstha2. pacemaker
did not try to stop them at all, it had no reasons to do so.

> These tells me it could not turn off resources on xstha1 (correct, it 
> couldn't contact xstha1):
>
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action 
> xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action 
> zpool_data_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action 
> xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
> Dec 16 15:08:56 [667] pengine: warning: custom_action: Action 
> xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
>
> These tells me xstha2 took control of resources, that were actually running 
> on xstha1:
>
> Dec 16 15:08:56 [667] pengine: notice: LogAction: * Move xstha1_san0_IP ( 
> xstha1 -> xstha2 )
> Dec 16 15:08:56 [667] pengine: info: LogActions: Leave xstha2_san0_IP 
> (Started xstha2)
> Dec 16 15:08:56 [667] pengine: notice: LogAction: * Move zpool_data ( xstha1 
> -> xstha2 )
> Dec 16 15:08:56 [667] pengine: info: LogActions: Leave xstha1-stonith 
> (Started xstha2)

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Gabriele Bulfon
Would a change of network class on one node be ok?
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




--

From: Ulrich Windl 
To: users@clusterlabs.org 
Date: 17 December 2020 12:26:29 CET
Subject: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a 
resource


>>> Gabriele Bulfon  wrote on 17.12.2020 at 09:14 in
message <2080536991.1106.1608192888030@www>:
> I see, but then I have two issues:
> 
> 1. it is a dual node server, the HA interface is internal, I have no way to 
> unplug it, that's why I tried turning it down

You could block traffic using iptables or a "blackhole" route for example.
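For example, something along these lines on a Linux node; on illumos an ipfilter rule 
would be the rough equivalent, so treat it only as a sketch of the idea:
 
    # drop all traffic on the HA interface instead of taking the interface down
    iptables -A INPUT  -i ha0 -j DROP
    iptables -A OUTPUT -o ha0 -j DROP
 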

> 
> 2. even in case I could test it by unplugging it, there is still the 
> possibility that someone turns the interface down, causing a bad situation 
> for the zpool...so I would like to understand why xstha2 decided to turn on 
> IP and zpool when stonith of xstha1 was not yet done...

What should HA software do when an admin turns down the interface?
I'm afraid there is no HA software that protects against administrator errors.
It's important to understand that HA software helps against errors from 
hardware or software, but not against configuration errors (which an ifdown is).

> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 
> 
> 
> 
> 
> 
> --
> 
> From: Ulrich Windl 
> To: users@clusterlabs.org 
> Date: 17 December 2020 7:48:46 CET
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource
> 
> 
>>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:56 in
> message <386755316.773.1608130588146@www>:
>> Thanks, here are the logs, there are infos about how it tried to start 
>> resources on the nodes.
>> Keep in mind the node1 was already running the resources, and I simulated a 
>> problem by turning down the ha interface.
> 
> Please note that "turning down" an interface is NOT a realistic test; 
> realistic would be to unplug the cable.
> 
>> 
>> Gabriele
>> 
>> 
>> Sonicle S.r.l. : http://www.sonicle.com 
>> Music: http://www.gabrielebulfon.com 
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>> 
>> 
>> 
>> 
>> 
>> ----
>> --
>> 
>> From: Ulrich Windl 
>> To: users@clusterlabs.org 
>> Date: 16 December 2020 15:45:36 CET
>> Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource
>> 
>> 
>>>>> Gabriele Bulfon  wrote on 16.12.2020 at 15:32 in
>> message <1523391015.734.1608129155836@www>:
>>> Hi, I have now a two node cluster using stonith with different 
>>> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
>>> problems.
>>> 
>>> Though, there is still one problem: once node 2 delays its stonith action 
>>> for 10 seconds, and node 1 just 1, node 2 does not delay start of 
>>> resources, 
> 
>> 
>>> so it happens that while it's not yet powered off by node 1 (and waiting 
>>> its 
> 
>> 
>>> delay to power off node 1) it actually starts resources, causing a moment 
>>> of 
> 
>> 
>>> few seconds where both NFS IP and ZFS pool (!) is mounted by both!
>> 
>> AFAIK pacemaker will not start resources on a node that is scheduled for 
>> stonith. Even more: Pacemaker will try to stop resources on a node scheduled 
> 
>> for stonith to start them elsewhere.
>> 
>>> How can I delay node 2 resource start until the delayed stonith action is 
>>> done? Or how can I just delay the resource start so I can make it larger 
>> than 
>>> its pcmk_delay_base?
>> 
>> We probably need to see logs and configs to understand.
>> 
>>> 
>>> Also, I was suggested to set "stonith-enabled=true", but I don't know where 
>>> to set this flag (cib-bootstrap-options is not happy with it...).
>> 
>> I think it's on by default, so you must have set it to false.
>> In crm shell it is "configure# property stonith-enabled=...".
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] problems on XStreamOS / illumos

2016-07-27 Thread Gabriele Bulfon
Hi,
I have built all the HA packages for our XStreamOS/illumos distribution:
- libqb-1.0
- clusterglue 1.0.12 [0a7add1d9996]
- corosync-2.3.6 , just to have needed libcfg, libcorosync_common, libcpg
- heartbeat-3.0.6 [958e11be8686]
- pacemaker-1.1.14
- crm-shell 2.1.2
- resource-agents-3.9.6
After installing the packages, going through the various initial configuration 
steps, and starting the ha_logd and heartbeat services successfully,
the system started to reboot every time, a few seconds after heartbeat service 
startup.
I cannot seem to find the reason for the various segfaults I see both in the 
system log and the pacemaker log.
I attached the two log excerpts here, any idea?
Thanks,
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.warning] [840]: WARN: 
node xstream2: is dead
Jul 27 18:03:03 xstream1 stonith-ng[921]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 stonith-ng[921]: [ID 702911 daemon.notice]   notice: 
Connecting to cluster infrastructure: heartbeat
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.notice]   notice: Using 
new config location: /sonicle/var/cluster/lib/pacemaker/cib
Jul 27 18:03:03 xstream1 attrd[923]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 attrd[923]: [ID 702911 daemon.notice]   notice: 
Connecting to cluster infrastructure: heartbeat
Jul 27 18:03:03 xstream1 lrmd[922]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.warning]  warning: Could 
not verify cluster configuration file 
/sonicle/var/cluster/lib/pacemaker/cib/cib.xml: No such file or directory (2)
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.warning]  warning: Primary 
configuration corrupt or unusable, trying backups in 
/sonicle/var/cluster/lib/pacemaker/cib
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.warning]  warning: 
Continuing with an empty configuration.
Jul 27 18:03:03 xstream1 crmd[924]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 crmd[924]: [ID 702911 daemon.notice]   notice: CRM Git 
Version: 1.1.14 (70404b0)
Jul 27 18:03:03 xstream1 pengine[925]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:03 xstream1 cib[920]: [ID 702911 daemon.notice]   notice: 
Connecting to cluster infrastructure: heartbeat
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.warning] [840]: WARN: 
adjust sndbuf size to 1048576 failed: Invalid argument
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.warning] [840]: WARN: 
Managed /usr/libexec/pacemaker/attrd process 923 killed by signal 11 [SIGSEGV - 
Segmentation violation].
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.error] [840]: ERROR: 
Respawning client "/usr/libexec/pacemaker/attrd":
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.warning] [840]: WARN: 
Managed /usr/libexec/pacemaker/lrmd process 922 killed by signal 11 [SIGSEGV - 
Segmentation violation].
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.error] [840]: ERROR: 
Managed /usr/libexec/pacemaker/lrmd process 922 dumped core
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.error] [840]: ERROR: 
Respawning client "/usr/libexec/pacemaker/lrmd":
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.warning] [840]: WARN: 
Managed /usr/libexec/pacemaker/cib process 920 killed by signal 11 [SIGSEGV - 
Segmentation violation].
Jul 27 18:03:03 xstream1 heartbeat: [ID 996084 daemon.emerg] [840]: EMERG: 
Rebooting system.  Reason: /usr/libexec/pacemaker/cib
Jul 27 18:03:04 xstream1 crmd[924]: [ID 702911 daemon.warning]  warning: 
Couldn't complete CIB registration 1 times... pause and retry
Jul 27 18:03:04 xstream1 crmd[924]: [ID 702911 daemon.notice]   notice: Child 
process pengine terminated with signal 11 (pid=925, core=0)
Jul 27 18:03:04 xstream1 attrd[926]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:04 xstream1 attrd[926]: [ID 702911 daemon.notice]   notice: 
Connecting to cluster infrastructure: heartbeat
Jul 27 18:03:04 xstream1 lrmd[927]: [ID 702911 daemon.notice]   notice: 
Additional logging available in /var/log/pacemaker.log
Jul 27 18:03:04 xstream1 attrd[926]: [ID 702911 daemon.error]error: Cannot 
sign on with heartbeat:
Jul 27 18:03:04 xstream1 attrd[926]: [ID 702911 daemon.error]error: HA 
Signon failed
Jul 27 18:03:04 xstream1 attrd[926]: [ID 702911 daemon.error]error

[ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Hi,
I built corosync/pacemaker for our XStreamOS/illumos: corosync starts fine and 
logs correctly, but pacemakerd quits after a few seconds with the attached log.
Any idea where is the issue?
Thanks,
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon


corosync.log
Description: binary/octet-stream
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 built with 
"--with-corosync".
How does Corosync look up its own version?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 23 August 2016 4:54:44 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 12:20 AM, Ken Gaillot wrote:
On 08/22/2016 12:17 PM, Gabriele Bulfon wrote:
Hi,
I built corosync/pacemaker for our XStreamOS/illumos : corosync starts
fine and log correctly, pacemakerd quits after some seconds with the
attached log.
Any idea where is the issue?
Pacemaker is not able to communicate with corosync for some reason.
Aug 22 19:13:02 [1324] xstorage1 corosync notice  [MAIN  ] Corosync
Cluster Engine ('UNKNOWN'): started and ready to provide service.
'UNKNOWN' should show the corosync version. I'm wondering if maybe you
have an older corosync without configuring the pacemaker plugin. It
would be much better to use corosync 2 instead, if you can.
If corosync is not able to determine its own version, the
pacemaker build might not have been able to either. So it
might have made some weird decisions/assumptions ...
like e.g. not building the plugin at all ... assuming you
are not using corosync 2+ ...
Thanks,
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Thanks! I found it myself before reading this :) Using that URL I got the 
correct tar.gz, and PACKAGE_VERSION is fine ;)
Going on now hoping it's going to work :)
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Jan Pokorný
To: users@clusterlabs.org
Date: 23 August 2016 7:59:37 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the 
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
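Concretely, that means grabbing an official release tarball instead of the GitHub 
auto-generated one and then checking the version string the built binary reports 
(the exact filename below is only an assumed example of the naming pattern under 
that directory):
    wget http://build.clusterlabs.org/corosync/releases/corosync-2.3.6.tar.gz
    corosync -v    # after building/installing, should print the real version instead of 'UNKNOWN'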
--
Jan (Poki)
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Ok, looks like Corosync now runs fine with its version, but then pacemakerd 
fails again with new errors on attrd and other daemons it tries to fork.
The main reason seems to be around HA signon and the cluster process group API.
Any idea?
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Jan Pokorný
To: users@clusterlabs.org
Date: 23 August 2016 7:59:37 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the 
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
Sure I did: I created the new corosync package and installed it on the dev machine 
before building and creating the new pacemaker package on that same machine.

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 23 August 2016 9:07:03 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 08:50 AM, Gabriele Bulfon wrote:
Ok, looks like Corosync now runs fine with its version, but then
pacemakerd fails again with new errors on attrd and other daemons it
tries to fork.
The main reason seems around ha signon and cluster process group api.
Any idea?
Just to be sure: You recompiled pacemaker against your new corosync?
Klaus
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Jan Pokorný
To: users@clusterlabs.org
Date: 23 August 2016 7:59:37 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with
some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
About the hacluster/haclient user/group, I start to think that cib can't 
connect because it's started by pacemakerd as user hacluster, even though 
pacemakerd itself is started as root.
Instead, just before that, pacemakerd is able to connect with the same call, but 
there it is the root user.
So I tried to run pacemakerd as hacluster, and in fact it can't start that way.
I then tried to add the uidgid spec in corosync.conf, but that does not seem to 
work either.
So ...should I start also corosync as hacluster? Is it safe to run everything 
as root? How can I force pacemakerd to run every child as root?
...if this is the problem...
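For reference, the uidgid spec I mean is the documented corosync.conf block below 
(just a sketch of the syntax; whether it helps with this IPC problem on illumos is 
exactly what is in question):
    uidgid {
        uid: hacluster
        gid: haclient
    }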

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Klaus Wenninger
A: users@clusterlabs.org
Data: 23 agosto 2016 9.07.03 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 08:50 AM, Gabriele Bulfon wrote:
Ok, looks like Corosync now runs fine with its version, but then
pacemakerd fails again with new errors on attrd and other daemons it
tries to fork.
The main reason seems around ha signon and cluster process group api.
Any idea?
Just to be sure: You recompiled pacemaker against your new corosync?
Klaus
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Jan Pokorný
A: users@clusterlabs.org
Data: 23 agosto 2016 7.59.37 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with
some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
I found that pacemakerd leaves a core file where I launch it, and here is the 
output from "mdb core":
sonicle@xstorage1:/sonicle/etc/cluster/corosync# mdb core
Loading modules: [ libc.so.1 ld.so.1 ]
$C
08047a48 libqb.so.0.18.0`qb_thread_lock+0x16(0, feef9875, 8047a9c, fe9eb842, 
fe9ff000, 806fc78)
08047a68 libqb.so.0.18.0`qb_atomic_int_add+0x22(806fd84, 1, 8047a9c, 773)
08047a88 libqb.so.0.18.0`qb_ipcs_ref+0x23(806fc78, fea30960, feef9865, 
fe9de139, fede608f, 806fb58)
08047ab8 libqb.so.0.18.0`qb_ipcs_create+0x68(8057fd9, 0, 0, 8069470, 805302e, 
20)
08047ae8 libcrmcommon.so.3.5.0`mainloop_add_ipc_server+0x77(8057fd9, 0, 
8069470, 8047b64, 0, feffb0a8)
08047b28 main+0x18e(8047b1c, fef726a8, 8047b58, 8052d2f, 1, 8047b64)
08047b58 _start+0x83(1, 8047c70, 0, 8047c8c, 8047ca0, 8047cb4)

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering 
welcomed
Data:
23 agosto 2016 14.30.20 CEST
Oggetto:
Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
About the hacluster/haclient user/group, I start to think that cib can't 
connect because it's started by pacemakerd as user hacluster, even though 
pacemakerd itself is started as root.
Just before that, pacemakerd is able to connect with the same call, but it does 
so as the root user.
So I tried to run pacemakerd as hacluster, and in fact it can't start that way.
I then tried to add the uidgid spec in corosync.conf, but that doesn't seem to 
work either.
So... should I also start corosync as hacluster? Is it safe to run everything 
as root? How can I force pacemakerd to run every child as root?
...if this is the problem...

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Klaus Wenninger
A: users@clusterlabs.org
Data: 23 agosto 2016 9.07.03 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 08:50 AM, Gabriele Bulfon wrote:
Ok, looks like Corosync now runs fine with its version, but then
pacemakerd fails again with new errors on attrd and other daemons it
tries to fork.
The main reason seems around ha signon and cluster process group api.
Any idea?
Just to be sure: You recompiled pacemaker against your new corosync?
Klaus
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Jan Pokorný
A: users@clusterlabs.org
Data: 23 agosto 2016 7.59.37 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with
some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)

[ClusterLabs] Solved: pacemakerd quits after few seconds with some errors

2016-08-23 Thread Gabriele Bulfon
Found the 2 reasons:
1) I had to build libqb with gcc 4.8 so that it uses its internal memory barriers;
this still did not solve the crash, but it changed the way the subdaemons crashed.
2) /usr/var/run is not writable by everyone, but the pacemakerd subdaemons want to 
create their socket files there as the hacluster user, and fail!
I will see if I can create these files in advance with the correct permissions 
during installation, but: how can I change this directory? It looks like it's 
libqb, but how can I drive this folder from the daemons? That way I could 
create a /usr/var/run/cluster with the right permissions and let everything run there.
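For now the plan in the packaging postinstall is something like this sketch
(names are assumptions: hacluster/haclient as owner/group, and the cluster
subdirectory is only useful if libqb can be pointed at it at build time, which
I still have to verify):

  # make a runtime socket directory writable for the cluster group
  mkdir -p /usr/var/run/cluster
  chown hacluster:haclient /usr/var/run/cluster
  chmod 0770 /usr/var/run/cluster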
Thanks
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
Cluster Labs - All topics related to open-source clustering welcomed
kwenn...@redhat.com
Data:
23 agosto 2016 15.05.32 CEST
Oggetto:
Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
I found that pacemakerd leaves a core file where I launch it, and here is the 
output from "mdb core":
sonicle@xstorage1:/sonicle/etc/cluster/corosync# mdb core
Loading modules: [ libc.so.1 ld.so.1 ]
$C
08047a48 libqb.so.0.18.0`qb_thread_lock+0x16(0, feef9875, 8047a9c, fe9eb842, 
fe9ff000, 806fc78)
08047a68 libqb.so.0.18.0`qb_atomic_int_add+0x22(806fd84, 1, 8047a9c, 773)
08047a88 libqb.so.0.18.0`qb_ipcs_ref+0x23(806fc78, fea30960, feef9865, 
fe9de139, fede608f, 806fb58)
08047ab8 libqb.so.0.18.0`qb_ipcs_create+0x68(8057fd9, 0, 0, 8069470, 805302e, 
20)
08047ae8 libcrmcommon.so.3.5.0`mainloop_add_ipc_server+0x77(8057fd9, 0, 
8069470, 8047b64, 0, feffb0a8)
08047b28 main+0x18e(8047b1c, fef726a8, 8047b58, 8052d2f, 1, 8047b64)
08047b58 _start+0x83(1, 8047c70, 0, 8047c8c, 8047ca0, 8047cb4)

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering 
welcomed
Data:
23 agosto 2016 14.30.20 CEST
Oggetto:
Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
About the hacluster/haclient user/group, I start to think that cib can't 
connect because it's started by pacemakerd as user hacluster, even though 
pacemakerd itself is started as root.
Just before that, pacemakerd is able to connect with the same call, but it does 
so as the root user.
So I tried to run pacemakerd as hacluster, and in fact it can't start that way.
I then tried to add the uidgid spec in corosync.conf, but that doesn't seem to 
work either.
So... should I also start corosync as hacluster? Is it safe to run everything 
as root? How can I force pacemakerd to run every child as root?
...if this is the problem...

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Klaus Wenninger
A: users@clusterlabs.org
Data: 23 agosto 2016 9.07.03 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 08:50 AM, Gabriele Bulfon wrote:
Ok, looks like Corosync now runs fine with its version, but then
pacemakerd fails again with new errors on attrd and other daemons it
tries to fork.
The main reason seems around ha signon and cluster process group api.
Any idea?
Just to be sure: You recompiled pacemaker against your new corosync?
Klaus
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Jan Pokorný
A: users@clusterlabs.org
Data: 23 agosto 2016 7.59.37 CEST
Oggetto: Re: [ClusterLabs] pacemakerd quits after few seconds with
some errors
On 23/08/16 07:23 +0200, Gabriele Bulfon wrote:
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4 using the
"--with-corosync".
How is Corosync looking for his own version?
The situation may be as easy as building corosync from GitHub-provided
automatic tarball, which is never a good idea if upstream has its own
way of proper release delivery:
http://build.clusterlabs.org/corosync/releases/
(specific URLs are also being part of the corosync announcements
on this list)
The issue with automatic tarballs already reported:
https://github.com/corosync/corosync/issues/116
--
Jan (Poki)

[ClusterLabs] pacemaker validate-with

2016-08-24 Thread Gabriele Bulfon
Hi,
now I've got my pacemaker 1.1.14 and corosync 2.4.1 working together running an 
empty configuration of just 2 nodes.
So I run crm configure but I get this error:
ERROR: CIB not supported: validator 'pacemaker-2.4', release '3.0.10'
ERROR: You may try the upgrade command
ERROR: configure: Missing requirements
Looking around I found this :
http://unix.stackexchange.com/questions/269635/cib-not-supported-validator-pacemaker-2-0-release-3-0-9
but before going on this way, my questions are:
- why does cib try the validator of pacemaker-2.4 when the latest pacemaker available 
is 1.1.15??
- what is 'release 3.0.10'? of what??
- may I deliver a preconfigured configuration file to have it work fine with 
pacemaker 1.1.14??
Thanks for any help.
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon


Re: [ClusterLabs] pacemaker validate-with

2016-08-24 Thread Gabriele Bulfon
Yes, the cib folder was empty at first install, and it has now been written just 
after running pacemakerd, with a cib.xml containing that wrong validator. What 
does it mean? Should I deliver a preconfigured cib.xml file on install?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Klaus Wenninger
A: users@clusterlabs.org
Data: 24 agosto 2016 16.34.55 CEST
Oggetto: Re: [ClusterLabs] pacemaker validate-with
On 08/24/2016 04:17 PM, Gabriele Bulfon wrote:
Hi,
now I've got my pacemaker 1.1.14 and corosync 2.4.1 working together
running an empty configuration of just 2 nodes.
So I run crm configure but I get this error:
ERROR: CIB not supported: validator 'pacemaker-2.4', release '3.0.10'
ERROR: You may try the upgrade command
ERROR: configure: Missing requirements
Looking around I found this :
http://unix.stackexchange.com/questions/269635/cib-not-supported-validator-pacemaker-2-0-release-3-0-9
this link is really about incompatibility of the cib with the currently
running pacemaker.
in this case I could imagine that it is some kind of rights problem
again, so that
the cib is totally empty / non-existent - even missing the header.
but before going on this way, my questions are:
- why does cib try validator of pacemaker-2.4 when latest pacemaker
available is 1.1.15??
that is referring to the schema-version - have a look at the xml-subdir
of the sourcetree
- what is "release 3.0.10'? of what??
- may I daliver any preconfigure configuration file to have it work
fine with pacemaker 1.1.14??
yes, it should be upgraded when needed (triggered by e.g. pcs when you
are configuring
something the current version doesn't support)
Thanks for any help.
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon


Re: [ClusterLabs] pacemaker validate-with

2016-08-24 Thread Gabriele Bulfon
I built and installed from sources pacemaker 1.1.14, corosync 2.4.1 and 
crm-shell 2.1.2.
Should I use different versions?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 24 agosto 2016 16.53.15 CEST
Oggetto: Re: [ClusterLabs] pacemaker validate-with
Gabriele Bulfon
writes:
Hi,
now I've got my pacemaker 1.1.14 and corosync 2.4.1 working together running an 
empty configuration of just 2 nodes.
So I run crm configure but I get this error:
ERROR: CIB not supported: validator 'pacemaker-2.4', release '3.0.10'
ERROR: You may try the upgrade command
ERROR: configure: Missing requirements
Looking around I found this :
http://unix.stackexchange.com/questions/269635/cib-not-supported-validator-pacemaker-2-0-release-3-0-9
but before going on this way, my questions are:
- why does cib try validator of pacemaker-2.4 when latest pacemaker available 
is 1.1.15??
- what is "release 3.0.10'? of what??
- may I daliver any preconfigure configuration file to have it work fine with 
pacemaker 1.1.14??
'pacemaker-2.4' is the CIB schema version, while release '3.0.10' is the
crm_feature_set version. Most likely, you are getting this error because
the pacemaker version you have installed needs an updated CIB
configuration. These version numbers are not in sync with the pacemaker
release version number. Slightly confusing, but that is the way it is...
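If it really is the CIB that lags behind the installed pacemaker, something
like this should bump it (a sketch; run it on one node and check the result
before going further):

  # show the schema the live CIB currently declares (validate-with attribute)
  cibadmin --query | head -n 2
  # let pacemaker rewrite the CIB using the newest schema it supports
  cibadmin --upgrade --force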
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] pacemaker validate-with

2016-08-24 Thread Gabriele Bulfon
Done it meanwhile ;) and it works guys ;) thanks a lot!

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 24 agosto 2016 17.55.21 CEST
Oggetto: Re: [ClusterLabs] pacemaker validate-with
Gabriele Bulfon
writes:
I built and installed from sources pacemaker 1.1.14, corosync 2.4.1 and 
crm-shell 2.1.2.
Should I use different versions?
Hmm yeah, crmsh 2.1.2 is quite old, I'd recommend at least going up to
2.1.6 if you want to stay on the 2.1 branch, or 2.3.0 for the latest
release.
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


[ClusterLabs] converting configuration

2016-08-24 Thread Gabriele Bulfon
Hi,
In my previous tests I used a prebuilt older pacemaker/heartbeat package with a 
configuration like:
primitive xstor2-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor2"
primitive xstor3-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor3"
location xstor2-stonith-pref xstor2-stonith -inf: xstor2
location xstor3-stonith-pref xstor3-stonith -inf: xstor3
property stonith-action=poweroff
commit
Now that I upgraded everything from sources and moved over to corosync 2, these 
commands are not recognized, refused
with "primitive not supported by the RNG schema".
Is there any way I can easily convert my old commands into the new ones?
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
Yes I'm packaging for our distro from sources, pacemaker 1.1.14, corosync 2.4.1 
and crm-shell 2.2.1
Our distro is an illumos distro, XStreamOS.
How can I check where it's looking for available primitives?
Here's the output from crm -dR:
sonicle@xstorage1:~# crm -dR
.EXT /usr/libexec/pacemaker/crmd version
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
crm(live)# configure
.INP: configure
.EXT cibadmin -Ql
crm(live)configure# show
.INP: show
node 1: xstorage1
node 2: xstorage2
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync
crm(live)configure# primitive xstorage1-stonith stonith:external/ssh-sonicle \
.INP: primitive xstorage1-stonith stonith:external/ssh-sonicle \
op monitor interval="25" timeout="25" start-delay="25" \
.INP: op monitor interval="25" timeout="25" start-delay="25" \
params hostlist="xstorage1"
.INP: params hostlist="xstorage1"
ERROR: primitive not supported by the RNG schema
crm(live)configure#

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Cluster Labs - All topics related to open-source clustering welcomed
Data: 24 agosto 2016 20.40.08 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
Hi,
In my previous tests I used a prebuilt older pacemaker/heartbeat package with a 
configuration like:
primitive xstor2-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor2"
primitive xstor3-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor3"
location xstor2-stonith-pref xstor2-stonith -inf: xstor2
location xstor3-stonith-pref xstor3-stonith -inf: xstor3
property stonith-action=poweroff
commit
Now that I upgraded everything from sources and moved over to corosync 2, these 
commands are not recognized, refused
with "primitive not supported by the RNG schema".
Is there any way I can easily convert my old commands into the new ones?
Gabriele
Hmm, that is a misleading error message. It sounds like crmsh isn't
finding the Pacemaker schema correctly. Try running it with -dR
arguments and see if you get any strange errors.
Did you build from source yourself? What distribution are you running?
You may need some different arguments to configure for it to locate
everything correctly.
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
Also found this :
sonicle@xstorage1:~# crm_verify -LV
xmlRelaxNGParseElement: element has no content
error: crm_abort: validate_with_relaxng: Triggered assert at xml.c:5285 : 
ctx->rng != NULL
error: validate_with_relaxng: Could not find/parse 
/usr/share/pacemaker/pacemaker-2.4.rng
error: unpack_resources: Resource start-up disabled since no STONITH 
resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the 
stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to 
ensure data integrity
Errors found during check: config not valid
while pacemaker-2.4.rng is there, containing:
datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'
and fencing-2.4.rng is there, containing elements for stonith...

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
Kristoffer Grönlund
Cluster Labs - All topics related to open-source clustering welcomed
Data:
25 agosto 2016 10.01.17 CEST
Oggetto:
Re: [ClusterLabs] converting configuration
Yes I'm packaging for our distro from sources, pacemaker 1.1.14, corosync 2.4.1 
and crm-shell 2.2.1
Our distro is an illumos distro, XStreamOS.
How can I check where it's looking for available primitives?
Here's the output from crm -dR:
sonicle@xstorage1:~# crm -dR
.EXT /usr/libexec/pacemaker/crmd version
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
crm(live)# configure
.INP: configure
.EXT cibadmin -Ql
crm(live)configure# show
.INP: show
node 1: xstorage1
node 2: xstorage2
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync
crm(live)configure# primitive xstorage1-stonith stonith:external/ssh-sonicle \
.INP: primitive xstorage1-stonith stonith:external/ssh-sonicle \
op monitor interval="25" timeout="25" start-delay="25" \
.INP: op monitor interval="25" timeout="25" start-delay="25" \
params hostlist="xstorage1"
.INP: params hostlist="xstorage1"
ERROR: primitive not supported by the RNG schema
crm(live)configure#

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Cluster Labs - All topics related to open-source clustering welcomed
Data: 24 agosto 2016 20.40.08 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
Hi,
In my previous tests I used a prebuilt older pacemaker/heartbeat package with a 
configuration like:
primitive xstor2-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor2"
primitive xstor3-stonith stonith:external/ssh-sonicle \
    op monitor interval="25" timeout="25" start-delay="25" \
    params hostlist="xstor3"
location xstor2-stonith-pref xstor2-stonith -inf: xstor2
location xstor3-stonith-pref xstor3-stonith -inf: xstor3
property stonith-action=poweroff
commit
Now that I upgraded everything from sources and moved over to corosync 2, these 
commands are not recognized, refused
with "primitive not supported by the RNG schema".
Is there any way I can easily convert my old commands into the new ones?
Gabriele
Hmm, that is a misleading error message. It sounds like crmsh isn't
finding the Pacemaker schema correctly. Try running it with -dR
arguments and see if you get any strange errors.
Did you build from source yourself? What distribution are you running?
You may need some different arguments to configure for it to locate
everything correctly.
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
Ok, now I deploy a custom crm.conf with paths as on my system.
I also had to patch config.py with my crm.conf location, as the build is not 
considering sysconfdir and localstatedir.
Here is the output; I mark with an asterisk what is not present on my system 
(but not important, I think):
core.editor = /usr/bin/vim
core.pager = /usr/bin/less -ins
core.ptest = /usr/sbin/crm_simulate
path.cache = /sonicle/var/cluster/cache/crm
path.crm_config = /sonicle/var/cluster/lib/pacemaker/cib
path.crm_daemon_dir = /usr/libexec/pacemaker
path.crm_dtd_dir = /usr/share/pacemaker
*path.hawk_wizards = /srv/www/hawk/config/wizard
path.hb_delnode = /usr/share/heartbeat/hb_delnode
path.heartbeat_dir = /var/lib/heartbeat (contains only 'cores', but I'm not 
using heartbeat any more)
*path.nagios_plugins = /usr/lib64/nagios/plugins
path.ocf_root = /usr/lib/ocf
path.pe_state_dir = /sonicle/var/cluster/lib/pacemaker/pengine
path.sharedir = /usr/share/crmsh
here is the list of files in my /usr/share/pacemaker:
acls-1.2.rng
acls-2.0.rng
cib-1.0.rng
cib-1.2.rng
constraints-1.0.rng
constraints-1.2.rng
constraints-2.1.rng
constraints-2.2.rng
constraints-2.3.rng
constraints-next.rng
crm_mon.rng
crm-transitional.dtd
crm.dtd
fencing-1.2.rng
fencing-2.4.rng
nodes-1.0.rng
nodes-1.2.rng
nodes-1.3.rng
nvset-1.3.rng
nvset.rng
options-1.0.rng
pacemaker-1.0.rng
pacemaker-1.2.rng
pacemaker-1.3.rng
pacemaker-2.0.rng
pacemaker-2.1.rng
pacemaker-2.2.rng
pacemaker-2.3.rng
pacemaker-2.4.rng
pacemaker-next.rng
pacemaker.rng
report.collector
report.common
resources-1.0.rng
resources-1.2.rng
resources-1.3.rng
rule.rng
score.rng
status-1.0.rng
tags-1.3.rng
tests
upgrade-1.3.xsl
upgrade06.xsl
versions.rng
Thanks for your help!

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 10.17.01 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
Yes I'm packaging for our distro from sources, pacemaker 1.1.14, corosync 2.4.1 
and crm-shell 2.2.1
Our distro is an illumos distro, XStreamOS.
How can I check where it's looking for available primitives?
Aah, I see.
Look at the output from
crm options list all
There should be an entry for path.crm_dtd_dir. This should point to the
directory where the pacemaker .rng schema files are installed. On my
system, this is /usr/share/pacemaker, where I find files like
pacemaker-2.5.rng.
If this doesn't point to the correct location, you can set this in
/etc/crm/crm.conf. If you are packaging, you probably want to install
this file with the correct path already set. Also check the other path
configurations, and let me know what your installation location is, so I
can make crmsh look there as well by default.
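For instance, a packaged crm.conf could pre-set the relevant paths, roughly like
this (a sketch assuming crmsh's usual INI-style layout, with values taken from
the listing earlier in this thread):

  [path]
  crm_config = /sonicle/var/cluster/lib/pacemaker/cib
  crm_daemon_dir = /usr/libexec/pacemaker
  crm_dtd_dir = /usr/share/pacemaker
  ocf_root = /usr/lib/ocf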
Thank you!
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
only this:
-_SYSTEMWIDE = '/etc/crm/crm.conf'
+_SYSTEMWIDE = '/sonicle/etc/cluster/crm/crm.conf'
I didn't find any way to change this via configure options.
And still I receive the same error with crm configure primitive...
I attach here my latest corosync.log file, with a clean complete session of 
startup corosync/pacemaker (the second node is off).
As you can see there are many errors regarding xml:
- line 34, again about RNG parser and pacemaker-2.4.rng (the path is there, 
checked)
- line 108 and later, internal libxml errors during validation
Maybe if we can find the reason for these errors, that will solve it!
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 11.10.26 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
Ok, now I deploy a custom crm.conf with path as in my system.
I also had to patch config.py with my crm.conf position as build is not 
considering sysconfdir and localstatedir.
Here is the output, I check with an asterisk what is not present in my system 
(but not important, I think):
Patching config.py shouldn't be necessary; what was the change you had
to make there?
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


corosync.log
Description: binary/octet-stream


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
It's 2.7.6, heavily used by the whole system.
Do you need a more recent one?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 11.59.54 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
only this:
-_SYSTEMWIDE = '/etc/crm/crm.conf'
+_SYSTEMWIDE = '/sonicle/etc/cluster/crm/crm.conf'
I didn't find any way to change this via configure options.
And still I receive the same error with crm configure primitive...
I attach here my latest corosync.log file, with a clean complete session of 
startup corosync/pacemaker (the second node is off).
As you can see there are many errors regarding xml:
- line 34, again about RNG parser and pacemaker-2.4.rng (the path is there, 
checked)
- line 108 and later, internal libxml error during validation
Yes, I am wondering if perhaps the problem is not locating the RNG, but
parsing it. What is your libxml version?
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] converting configuration

2016-08-25 Thread Gabriele Bulfon
upgraded libxml2 to 2.9.3, no way :(
though I found this post that shows the very same problems, some versions ago:
https://lists.opensuse.org/opensuse-bugs/2014-04/msg00648.html
how can I force more debugging in corosync.log?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 13.51.43 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
It's 2.7.6 , heavily used by the whole system.
Do you need a more recent one?
Well, I don't know for sure.. but that version was released in 2009, so
it's not exactly recent.
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com


[ClusterLabs] Solved: converting configuration

2016-08-25 Thread Gabriele Bulfon
YESSS!!! That was it! :)))
Upgraded to 1.1.15, rebuilt and the rng files contain a lot more stuff.
Packaged, published, installed on the test machine: got all my instructions as 
is!!! :)))
...now the last steps... making our custom agents/shells work on this new setup ;)
For example, what does the yellow "stopped" state mean?
Here are the last rows of crm output after my config instructions:
Full list of resources:
xstorage1-stonith   (stonith:external/ssh-sonicle): Stopped
xstorage2-stonith   (stonith:external/ssh-sonicle): Stopped

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: users@clusterlabs.org
Data: 25 agosto 2016 15.55.35 CEST
Oggetto: Re: [ClusterLabs] converting configuration
On 08/25/2016 03:07 AM, Gabriele Bulfon wrote:
Also found this :
sonicle@xstorage1:~# crm_verify -LV
xmlRelaxNGParseElement: element has no content
error: crm_abort: validate_with_relaxng: Triggered assert at xml.c:5285
: ctx-rng != NULL
error: validate_with_relaxng: Could not find/parse
/usr/share/pacemaker/pacemaker-2.4.rng
error: unpack_resources: Resource start-up disabled since no STONITH
resources have been defined
error: unpack_resources: Either configure some or disable STONITH with
the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to
ensure data integrity
Errors found during check: config not valid
while pacemaker-2.4.rng is there, containing:
datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'
and fencing-2.4.rng is there containing elements for stonith...
That looks odd ... there should be a lot more there than just fencing
(options, nodes, constraints, etc.).
The pacemaker-*.rng files are automatically generated by the Makefile
from the other *.rng files. Something might be going wrong in that
process on your system. It uses sed and sort, so most likely the syntax
usage is different from what's on Linux.
You might try pacemaker-1.1.15 first; it had some non-Linux
compatibility improvements that might apply to your OS. It also has a
good number of bug fixes.
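A rough way to check whether the aggregate schema came out right is to look at
the component schemas it pulls in (just a sanity check; the exact element names
are assumed from how the shipped pacemaker-*.rng files are normally put
together):

  # a healthy aggregate should reference several component schemas
  # (nodes, resources, constraints, fencing, ...); a broken one is nearly empty
  grep -c externalRef /usr/share/pacemaker/pacemaker-2.4.rng
  grep href= /usr/share/pacemaker/pacemaker-2.4.rng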

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--------
*Da:* Gabriele Bulfon
*A:* Kristoffer Grönlund
Cluster Labs - All topics
related to open-source clustering welcomed
*Data:* 25 agosto 2016 10.01.17 CEST
*Oggetto:* Re: [ClusterLabs] converting configuration
Yes I'm packaging for our distro from sources, pacemaker 1.1.14,
corosync 2.4.1 and crm-shell 2.2.1
Our distro is an illumos distro, XStreamOS.
How can I check where it's looking for available primitives?
Here's the output from crm -dR:
sonicle@xstorage1:~# crm -dR
.EXT /usr/libexec/pacemaker/crmd version
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
crm(live)# configure
.INP: configure
.EXT cibadmin -Ql
crm(live)configure# show
.INP: show
node 1: xstorage1
node 2: xstorage2
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync
crm(live)configure# primitive xstorage1-stonith
stonith:external/ssh-sonicle \
.INP: primitive xstorage1-stonith stonith:external/ssh-sonicle \
op monitor interval="25" timeout="25" start-delay="25" \
.INP: op monitor interval="25" timeout="25" start-delay="25" \
params hostlist="xstorage1"
.INP: params hostlist="xstorage1"
ERROR: primitive not supported by the RNG schema
crm(live)configure#

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Kristoffer Grönlund
A: gbul...@sonicle.com Cluster Labs - All topics related to
open-source clustering welcomed
Cluster Labs
- All topics related to open-source clustering welcomed
Data: 24 agosto 2016 20.40.08 CEST
Oggetto: Re: [ClusterLabs] converting configuration
Gabriele Bulfon
writes:
Hi,
In my previous tests I used a prebuilt older
pacemaker/heartbeat package with a configuration like:
primitive xstor2-stonith stonith:external/ssh-sonicle \ op
monitor interval="25" timeout="25" start-delay="25" \ params
hostlist="xstor2"primitive xstor3-stonith
stonith:external/ssh-sonicle \ op monitor interval="25"
timeou

Re: [ClusterLabs] Solved: converting configuration

2016-08-25 Thread Gabriele Bulfon
Thanks, stonithd is running, and I can see it working in the corosync log.
Though it says that my external/ssh-sonicle does not work.
Where do I see a log of these executions?
Maybe there is a problem with the ssh-with-no-password and the certificates.

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Dan Swartzendruber
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 16.27.18 CEST
Oggetto: Re: [ClusterLabs] Solved:  converting configuration
On 2016-08-25 10:24, Gabriele Bulfon wrote:
YESSS!!! That was it! :)))
Upgraded to 1.1.15, rebuilt and the rng files contain a lot more
stuff.
Packaged, published, installed on the test machine: got all my
instructions as is!!! :)))
...now last stepsmaking our custom agents/shells work on this new
setup ;)
For example, what does the yellow "stopped" state mean?
Here is the last rows of crm output after my config instructions :
Full list of resources:
xstorage1-stonith (stonith:external/ssh-sonicle): Stopped
xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
I believe you will see this if the cluster is in maintenance mode or
stonith is disabled.  Possibly other reasons, but these I have seen...


Re: [ClusterLabs] Solved: converting configuration

2016-08-25 Thread Gabriele Bulfon
...silly me, the other node was down, so it said Stopped for both... once I 
powered it on, updated the software and ran it, it says Started ;)

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Dan Swartzendruber
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 25 agosto 2016 16.27.18 CEST
Oggetto: Re: [ClusterLabs] Solved:  converting configuration
On 2016-08-25 10:24, Gabriele Bulfon wrote:
YESSS!!! That was it! :)))
Upgraded to 1.1.15, rebuilt and the rng files contain a lot more
stuff.
Packaged, published, installed on the test machine: got all my
instructions as is!!! :)))
...now last stepsmaking our custom agents/shells work on this new
setup ;)
For example, what does the yellow "stopped" state mean?
Here is the last rows of crm output after my config instructions :
Full list of resources:
xstorage1-stonith (stonith:external/ssh-sonicle): Stopped
xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
I believe you will see this if the cluster is in maintenance mode or
stonith is disabled.  Possibly other reasons, but these I have seen...


[ClusterLabs] ocf::heartbeat:IPaddr

2016-08-25 Thread Gabriele Bulfon
Hi,
I'm advancing with this monster cluster on XStreamOS/illumos ;)
In the previous older tests I used heartbeat, and I had these lines to take 
care of swapping the public IP addresses:
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4" 
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5" 
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
They get configured, but then I get this in crm status:
xstorage1_wan1_IP   (ocf::heartbeat:IPaddr):Stopped
xstorage2_wan2_IP   (ocf::heartbeat:IPaddr):Stopped
Failed Actions:
* xstorage1_wan1_IP_start_0 on xstorage1 'not installed' (5): call=20, 
status=complete, exitreason='Setup problem: couldn't find command: 
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:32 2016', queued=1ms, exec=158ms
* xstorage2_wan2_IP_start_0 on xstorage1 'not installed' (5): call=22, 
status=complete, exitreason='Setup problem: couldn't find command: 
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:33 2016', queued=1ms, exec=29ms
* xstorage1_wan1_IP_start_0 on xstorage2 'not installed' (5): call=22, 
status=complete, exitreason='Setup problem: couldn't find command: 
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:30 2016', queued=1ms, exec=36ms
* xstorage2_wan2_IP_start_0 on xstorage2 'not installed' (5): call=20, 
status=complete, exitreason='Setup problem: couldn't find command: 
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:29 2016', queued=0ms, exec=150ms
The crm configure process already checked for the presence of the required 
IPaddr shell script, and it was ok.
Now it looks like it's looking for "/usr/bin/gawk", and that is actually there!
Is there any known incompatibility with the mixed heartbeat ocf? Should I use 
corosync-specific ocf files or something else?
Thanks again!
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon


Re: [ClusterLabs] ocf::heartbeat:IPaddr

2016-08-26 Thread Gabriele Bulfon
I looked around at what you suggested, inside ocf-binaries and ocf-shellfuncs etc.
So I found also these logs in corosync.log :
Aug 25 17:50:33 [2250]   crmd:   notice: process_lrm_event: 
xstorage1-xstorage2_wan2_IP_start_0:22 [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such file or 
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No 
such file or directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: 
not found [No such file or 
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No 
such file or directory]\nocf-exit-reason:Setup problem: coul
Aug 25 17:50:33 [2246]   lrmd:   notice: operation_finished:
xstorage2_wan2_IP_start_0:3613:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such file or 
directory] ]
Looks like the shell is not happy with the "local" variable definition.
I tried running ocf-shellfuncs manually with sh and bash and they all run 
without errors.
How can I see what shell is running these scripts?
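The only checks I've come up with so far are along these lines (paths taken
from the log above; nothing conclusive yet):

  # what /bin/sh resolves to on this box
  ls -l /bin/sh
  # what the resource agent itself asks for
  head -1 /usr/lib/ocf/resource.d/heartbeat/IPaddr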

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: users@clusterlabs.org
Data: 25 agosto 2016 18.07.42 CEST
Oggetto: Re: [ClusterLabs] ocf::heartbeat:IPaddr
On 08/25/2016 10:51 AM, Gabriele Bulfon wrote:
Hi,
I'm advancing with this monster cluster on XStreamOS/illumos ;)
In the previous older tests I used heartbeat, and I had these lines to
take care of the swapping public IP addresses:
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
They get configured, but then I get this in crm status:
xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Stopped
Failed Actions:
* xstorage1_wan1_IP_start_0 on xstorage1 'not installed' (5): call=20,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:32 2016', queued=1ms, exec=158ms
* xstorage2_wan2_IP_start_0 on xstorage1 'not installed' (5): call=22,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:33 2016', queued=1ms, exec=29ms
* xstorage1_wan1_IP_start_0 on xstorage2 'not installed' (5): call=22,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:30 2016', queued=1ms, exec=36ms
* xstorage2_wan2_IP_start_0 on xstorage2 'not installed' (5): call=20,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:29 2016', queued=0ms, exec=150ms
The crm configure process already checked of the presence of the
required IPaddr shell, and it was ok.
Now looks like it's looking for "/usr/bin/gawk", and that is actually there!
Is there any known incompatibility with the mixed heartbeat ocf ? Should
I use corosync specific ocf files or something else?
"heartbeat" in this case is just an OCF provider name, and has nothing
to do with the heartbeat messaging layer, other than having its origin
in the same project. There actually has been a recent proposal to rename
the provider to "clusterlabs" to better reflect the current reality.
The "couldn't find command" message comes from the ocf-binaries shell
functions. If you look at have_binary() there, it uses sed and which,
and I'm guessing that fails on your OS somehow. You may need to patch it.
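As a quick check, reproducing roughly what that helper ends up doing (only an
approximation of have_binary(), not its exact code) from the same environment
the agents run in may show where it breaks:

  # does 'which' find gawk and exit zero here?
  which gawk; echo "exit: $?"
  # and the kind of sed call such helpers use to strip options from a command
  echo "/usr/bin/gawk --some-option" | sed -e 's/ -.*//'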
Thanks again!
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon


[ClusterLabs] ocf scripts shell and local variables

2016-08-26 Thread Gabriele Bulfon
I tried adding some debug to ocf-shellfuncs, dumping env and ps -ef output into 
the corosync.log.
I suspect it's always using ksh, because in the env output I produced I find 
this: KSH_VERSION=.sh.version
This is normally not present in the environment, unless ksh is the shell that 
is running.
I also tried modifying all the ocf shell scripts with "#!/usr/bin/bash" at the 
beginning; no way, same output.
Any idea how I can change the shell that is used, to support "local" variables?
Gabriele
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
Da:
Gabriele Bulfon
A:
kgail...@redhat.com Cluster Labs - All topics related to open-source clustering 
welcomed
Data:
26 agosto 2016 10.12.13 CEST
Oggetto:
Re: [ClusterLabs] ocf::heartbeat:IPaddr
I looked around what you suggested, inside ocf-binaris and ocf-shellfuncs etc.
So I found also these logs in corosync.log :
Aug 25 17:50:33 [2250]   crmd:   notice: process_lrm_event: 
xstorage1-xstorage2_wan2_IP_start_0:22 [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such file or 
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local: not found [No 
such file or directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local: 
not found [No such file or 
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local: not found [No 
such file or directory]\nocf-exit-reason:Setup problem: coul
Aug 25 17:50:33 [2246]   lrmd:   notice: operation_finished:
xstorage2_wan2_IP_start_0:3613:stderr [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No such file or 
directory] ]
Looks like the shell is not happy with the "local" variable definition.
I tried running ocf-shellfuncs manually with sh and bash and they all run 
without errors.
How can I see what shell is running these scripts?

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: users@clusterlabs.org
Data: 25 agosto 2016 18.07.42 CEST
Oggetto: Re: [ClusterLabs] ocf::heartbeat:IPaddr
On 08/25/2016 10:51 AM, Gabriele Bulfon wrote:
Hi,
I'm advancing with this monster cluster on XStreamOS/illumos ;)
In the previous older tests I used heartbeat, and I had these lines to
take care of the swapping public IP addresses:
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
They get configured, but then I get this in crm status:
xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Stopped
Failed Actions:
* xstorage1_wan1_IP_start_0 on xstorage1 'not installed' (5): call=20,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:32 2016', queued=1ms, exec=158ms
* xstorage2_wan2_IP_start_0 on xstorage1 'not installed' (5): call=22,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:33 2016', queued=1ms, exec=29ms
* xstorage1_wan1_IP_start_0 on xstorage2 'not installed' (5): call=22,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:30 2016', queued=1ms, exec=36ms
* xstorage2_wan2_IP_start_0 on xstorage2 'not installed' (5): call=20,
status=complete, exitreason='Setup problem: couldn't find command:
/usr/bin/gawk',
last-rc-change='Thu Aug 25 17:50:29 2016', queued=0ms, exec=150ms
The crm configure process already checked of the presence of the
required IPaddr shell, and it was ok.
Now looks like it's looking for "/usr/bin/gawk", and that is actually there!
Is there any known incompatibility with the mixed heartbeat ocf ? Should
I use corosync specific ocf files or something else?
"heartbeat" in this case is just an OCF provider name, and has nothing
to do with the heartbeat messaging layer, other than having its origin
in the same project. There actually has been a recent proposal to rename
the provider to "clusterlabs" to better reflect the current reality.
The "couldn't find command" message comes from the ocf-binaries shell
functions. If you loo

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Gabriele Bulfon
Thanks, though this does not work :)

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Dmitri Maziuk
A: users@clusterlabs.org
Data: 26 agosto 2016 17.02.12 CEST
Oggetto: Re: [ClusterLabs] ocf scripts shell and local variables
On 2016-08-26 08:56, Ken Gaillot wrote:
On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
I tried adding some debug in ocf-shellfuncs, showing env and ps -ef into
the corosync.log
I suspect it's always using ksh, because in the env output I produced I
find this: KSH_VERSION=.sh.version
This is normally not present in the environment, unless ksh is running
the shell.
The RAs typically start with #!/bin/sh, so whatever that points to on
your system is what will be used.
ISTR exec() family will ignore the shebang and run whatever shell's in
user's /etc/passwd. Or something. Try changing that one.
Dima


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Gabriele Bulfon
Hi Ken,
I have been talking with the illumos guys about the shell problem.
They all agreed that ksh (and especially the ksh93 used in illumos) is 
absolutely Bourne-compatible, and that the "local" variables used in the ocf 
shell scripts are not Bourne syntax, but probably a bash-specific extension.
This means that pointing the scripts at "#!/bin/sh" is portable only as long as 
the scripts really use Bourne-shell-only syntax, since any Unix variant may 
link /bin/sh to whatever Bourne shell it likes.
In this case, they should point to "#!/bin/bash" or whatever shell the scripts 
were written for.
Also, in this case, the starting point is not the ocf-* script, but the 
original RA (IPaddr, but almost all of them).
What about making the code base of the RAs and ocf-* portable?
It may be just a matter of changing them to point to bash, or adding some kind 
of configure switch to be able to specify the shell to use.
Meanwhile, changing the scripts by hand to #!/bin/bash worked like a charm, 
and I will start patching.
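For the package I'll probably script it along these lines (a rough sketch; it
assumes GNU sed is available on the build host, and only covers the agent
directory mentioned earlier in the thread):

  for f in /usr/lib/ocf/resource.d/heartbeat/*; do
      # only rewrite regular files whose first line asks for /bin/sh
      [ -f "$f" ] && sed -i '1s|^#!/bin/sh$|#!/bin/bash|' "$f"
  done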
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
Da: Ken Gaillot
A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
clustering welcomed
Data: 26 agosto 2016 15.56.02 CEST
Oggetto: Re: ocf scripts shell and local variables
On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
I tried adding some debug in ocf-shellfuncs, showing env and ps -ef into
the corosync.log
I suspect it's always using ksh, because in the env output I produced I
find this: KSH_VERSION=.sh.version
This is normally not present in the environment, unless ksh is running
the shell.
The RAs typically start with #!/bin/sh, so whatever that points to on
your system is what will be used.
I also tried modifiying all ocf shells with "#!/usr/bin/bash" at the
beginning, no way, same output.
You'd have to change the RA that includes them.
Any idea how can I change the used shell to support "local" variables?
You can either edit the #!/bin/sh line at the top of each RA, or figure
out how to point /bin/sh to a Bourne-compatible shell. ksh isn't
Bourne-compatible, so I'd expect lots of #!/bin/sh scripts to fail with
it as the default shell.
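A minimal sketch of that first option (hypothetical: the path is the usual resource-agents install location, GNU sed is assumed, and pfexec/sudo should be used as appropriate):

# rewrite the shebang of the installed heartbeat resource agents to bash
for ra in /usr/lib/ocf/resource.d/heartbeat/*; do
    pfexec sed -i '1s|^#!/bin/sh$|#!/bin/bash|' "$ra"
done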
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
----
*From:* Gabriele Bulfon
*To:* kgail...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
*Date:* 26 August 2016 10.12.13 CEST
*Subject:* Re: [ClusterLabs] ocf::heartbeat:IPaddr
I looked around at what you suggested, inside ocf-binaries and
ocf-shellfuncs etc.
So I found also these logs in corosync.log :
Aug 25 17:50:33 [2250] crmd: notice: process_lrm_event:
xstorage1-xstorage2_wan2_IP_start_0:22 [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No
such file or
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[354]: local:
not found [No such file or
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[355]: local:
not found [No such file or
directory]\n/usr/lib/ocf/resource.d/heartbeat/IPaddr[356]: local:
not found [No such file or directory]\nocf-exit-reason:Setup
problem: coul
Aug 25 17:50:33 [2246] lrmd: notice: operation_finished:
xstorage2_wan2_IP_start_0:3613:stderr [
/usr/lib/ocf/resource.d/heartbeat/IPaddr[71]: local: not found [No
such file or directory] ]
Looks like the shell is not happy with the "local" variable definition.
I tried running ocf-shellfuncs manually with sh and bash and they
all run without errors.
How can I see what shell is running these scripts?
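One way to answer that (a hypothetical debugging aid, not part of the agents): log the interpreter of the running process from inside ocf-shellfuncs, for example

# temporary debug line near the top of ocf-shellfuncs (remove it afterwards);
# $$ is the PID of the shell that is interpreting the script
ps -o comm= -p $$ >> /tmp/ra-shell.log 2>&1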

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: users@clusterlabs.org
Date: 25 August 2016 18.07.42 CEST
Subject: Re: [ClusterLabs] ocf::heartbeat:IPaddr
On 08/25/2016 10:51 AM, Gabriele Bulfon wrote:
Hi,
I'm advancing with this monster cluster on XStreamOS/illumos ;)
In my previous tests I used heartbeat, and I had these lines to
take care of swapping the public IP addresses:
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params
ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params
ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_p

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Gabriele Bulfon
I think the main issue is the use of the "local" keyword in the ocf-* scripts.
I'm not an expert on it (I have never used it!) and don't know how hard it would be
to replace it with a standard alternative.
Happy to contribute, if it is still the case.
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Kristoffer Grönlund
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed
kgail...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 August 2016 14.36.23 CEST
Subject: Re: [ClusterLabs] ocf scripts shell and local variables
Gabriele Bulfon
writes:
Hi Ken,
I have been talking with the illumos guys about the shell problem.
They all agreed that ksh (and specially the ksh93 used in illumos) is 
absolutely Bourne-compatible, and that the "local" variables used in the ocf 
shells is not a Bourne syntax, but probably a bash specific.
This means that pointing the scripts to "#!/bin/sh" is portable as long as the 
scripts are really Bourne-shell only syntax, as any Unix variant may link 
whatever Bourne-shell they like.
In this case, it should point to "#!/bin/bash" or whatever shell the script was 
written for.
Also, in this case, the starting point is not the ocf-* script, but the 
original RA (IPaddr, but almost all of them).
What about making the code base of RA and ocf-* portable?
It may be just by changing them to point to bash, or with some kind of 
configure modifier to be able to specify the shell to use.
Meanwhile, changing the scripts by hands into #!/bin/bash worked like a charm, 
and I will start patching.
Gabriele
Hi Gabriele,
Yes, your observation is correct: The resource scripts are not fully
POSIX compatible in this respect. We have been fixing these issues as
they come up, but since we all use bash-like shells it has never become
a pressing issue (IIRC Debian did have some of the same issues since
/bin/sh there is dash, which is also not fully bash-compatible).
It would be fantastic if you could file issues or submit patches at
https://github.com/ClusterLabs/resource-agents for the resource agents
where you still find these problems.
Cheers,
Kristoffer
--
// Kristoffer Grönlund
// kgronl...@suse.com
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Gabriele Bulfon
Sure, in fact I can change all the shebangs to point to /bin/bash and it works.
The question is about the current #!/bin/sh shebang, which can run into trouble (much as
if one pointed at a generic python while using features specific to one version of
python).
Also, there is the question of whether bash is a good option for RAs, since it is much
heavier.
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Dejan Muhamedagic
To: kgail...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 August 2016 16.43.52 CEST
Subject: Re: [ClusterLabs] ocf scripts shell and local variables
Hi,
On Mon, Aug 29, 2016 at 08:47:43AM -0500, Ken Gaillot wrote:
On 08/29/2016 04:17 AM, Gabriele Bulfon wrote:
Hi Ken,
I have been talking with the illumos guys about the shell problem.
They all agreed that ksh (and specially the ksh93 used in illumos) is
absolutely Bourne-compatible, and that the "local" variables used in the
ocf shells is not a Bourne syntax, but probably a bash specific.
This means that pointing the scripts to "#!/bin/sh" is portable as long
as the scripts are really Bourne-shell only syntax, as any Unix variant
may link whatever Bourne-shell they like.
In this case, it should point to "#!/bin/bash" or whatever shell the
script was written for.
Also, in this case, the starting point is not the ocf-* script, but the
original RA (IPaddr, but almost all of them).
What about making the code base of RA and ocf-* portable?
It may be just by changing them to point to bash, or with some kind of
configure modifier to be able to specify the shell to use.
Meanwhile, changing the scripts by hands into #!/bin/bash worked like a
charm, and I will start patching.
Gabriele
Interesting, I thought local was posix, but it's not. It seems everyone
but solaris implemented it:
http://stackoverflow.com/questions/18597697/posix-compliant-way-to-scope-variables-to-a-function-in-a-shell-script
Please open an issue at:
https://github.com/ClusterLabs/resource-agents/issues
The simplest solution would be to require #!/bin/bash for all RAs that
use local,
This issue was raised many times, but note that /bin/bash is a
shell not famous for being lean: it's great for interactive use,
but not so great if you need to run a number of scripts. The
complexity in bash, which is superfluous for our use case,
doesn't go well with the basic principles of HA clusters.
but I'm not sure that's fair to the distros that support
local in a non-bash default shell. Another possibility would be to
modify all RAs to avoid local entirely, by using unique variable
prefixes per function.
I doubt that we could write moderately complex shell scripts
without the ability to limit a variable's scope and still retain
our sanity.
Or, it may be possible to guard every instance of
local with a check for ksh, which would use typeset instead. Raising the
issue will allow some discussion of the possibilities.
Just to mention that this is the first time someone has reported
running a shell which doesn't support local. Perhaps one option is
for them to install a shell which does.
Thanks,
Dejan
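For reference, the "unique variable prefixes per function" alternative mentioned above could look like this (a hypothetical fragment, not taken from any agent):

# relies on "local" -- fails under ksh93 with "local: not found":
#   add_two() { local sum; sum=$(( $1 + 2 )); echo "$sum"; }
# prefix-based rewrite -- plain POSIX, no "local" needed:
add_two() {
    add_two_sum=$(( $1 + 2 ))
    echo "$add_two_sum"
    unset add_two_sum
}
add_two 40    # prints 42 under bash, dash and ksh93 alike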

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 26 August 2016 15.56.02 CEST
Subject: Re: ocf scripts shell and local variables
On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
I tried adding some debug in ocf-shellfuncs, showing env and ps
-ef into
the corosync.log
I suspect it's always using ksh, because in the env output I
produced I
find this: KSH_VERSION=.sh.version
This is normally not present in the environment, unless ksh is running
the shell.
The RAs typically start with #!/bin/sh, so whatever that points to on
your system is what will be used.
I also tried modifying all the ocf shells with "#!/usr/bin/bash" at the
beginning; no luck, same output.
You'd have to change the RA that includes them.
Any idea how can I change the used shell to support "local" variables?
You can either edit the #!/bin/sh line at the top of each RA, or figure
out how to point /bin/sh to a Bourne-compatible shell. ksh isn't
Bourne-compatible, so I'd expect lots of #!/bin/sh scripts to fail with
it as the default shell.
Gabriele

*Sonicle S.r

[ClusterLabs] ip clustering strange behaviour

2016-08-29 Thread Gabriele Bulfon
Hi,
now that I have IPaddr working, I see some strange behaviour on my 2-node test
setup; here is my configuration:
===STONITH/FENCING===
primitive xstorage1-stonith stonith:external/ssh-sonicle op monitor 
interval="25" timeout="25" start-delay="25" params hostlist="xstorage1"
primitive xstorage2-stonith stonith:external/ssh-sonicle op monitor 
interval="25" timeout="25" start-delay="25" params hostlist="xstorage2"
location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
property stonith-action=poweroff
===IP RESOURCES===
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4" 
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5" 
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
===
So I plumbed e1000g1 with no IP configured on both machines and started
corosync/pacemaker, and after some time I got both nodes online and started,
with the IPs configured as virtual interfaces (e1000g1:1 and e1000g1:2), one on host1
and one on host2.
Then I halted host2, and I expected host1 to end up with both IPs
configured.
Instead, host1 had its IP stopped and removed (only the unconfigured e1000g1 left),
while host2, though down, was still reported as having its IP started (!?).
Not exactly what I expected...
What's wrong?
Here is the crm status after I stopped host 2:
2 nodes and 4 resources configured
Node xstorage2: UNCLEAN (offline)
Online: [ xstorage1 ]
Full list of resources:
xstorage1-stonith   (stonith:external/ssh-sonicle): Started xstorage2 (UNCLEAN)
xstorage2-stonith   (stonith:external/ssh-sonicle): Stopped
xstorage1_wan1_IP   (ocf::heartbeat:IPaddr):Stopped
xstorage2_wan2_IP   (ocf::heartbeat:IPaddr):Started xstorage2 (UNCLEAN)
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ip clustering strange behaviour

2016-08-29 Thread Gabriele Bulfon
Ok, got it, I hadn't gracefully shut down pacemaker on node2.
Now I restarted, everything was up, I stopped the pacemaker service on host2, and I
got host1 with both IPs configured. ;)
But, though I understand that if I halt host2 without a graceful pacemaker shutdown
it will not move IP2 to host1, I don't expect host1 to lose its own IP!
Why?
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 29 August 2016 17.26.49 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
Hi,
now that I have IPaddr work, I have a strange behaviour on my test
setup of 2 nodes, here is my configuration:
===STONITH/FENCING===
primitive xstorage1-stonith stonith:external/ssh-sonicle op monitor
interval="25" timeout="25" start-delay="25" params hostlist="xstorage1"
primitive xstorage2-stonith stonith:external/ssh-sonicle op monitor
interval="25" timeout="25" start-delay="25" params hostlist="xstorage2"
location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
property stonith-action=poweroff
===IP RESOURCES===
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
===
So I plumbed e1000g1 with unconfigured IP on both machines and started
corosync/pacemaker, and after some time I got all nodes online and
started, with IP configured as virtual interfaces (e1000g1:1 and
e1000g1:2) one in host1 and one in host2.
Then I halted host2, and I expected to have host1 started with both
IPs configured on host1.
Instead, I got host1 started with the IP stopped and removed (only
e1000g1 unconfigured), host2 stopped saying IP started (!?).
Not exactly what I expected...
What's wrong?
How did you stop host2? A graceful shutdown of pacemaker? If not ...
Anyway, ssh-fencing only works if the machine is still running ...
So the node stays unclean, and pacemaker therefore thinks that
the IP might still be running on it. This is actually the expected
behavior.
You might add a watchdog via sbd if you don't have other fencing
hardware at hand ...
Here is the crm status after I stopped host 2:
2 nodes and 4 resources configured
Node xstorage2: UNCLEAN (offline)
Online: [ xstorage1 ]
Full list of resources:
xstorage1-stonith (stonith:external/ssh-sonicle): Started xstorage2
(UNCLEAN)
xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Started xstorage2 (UNCLEAN)
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ip clustering strange behaviour

2016-08-29 Thread Gabriele Bulfon
Sorry for reiterating, but my main question was:
why does node 1 remove its own IP if I shut down node 2 abruptly?
I understand that it does not take over node 2's IP (because the ssh-fencing has
no clue about what happened on the 2nd node), but I wouldn't expect it to shut
down its own IP... this would kill the services on both nodes... what am I getting wrong?

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
From: Gabriele Bulfon
To: kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 August 2016 17.37.36 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
Ok, got it, I hadn't gracefully shut pacemaker on node2.
Now I restarted, everything was up, stopped pacemaker service on host2 and I 
got host1 with both IPs configured. ;)
But, though I understand that if I halt host2 with no grace shut of pacemaker, 
it will not move the IP2 to Host1, I don't expect host1 to loose its own IP! 
Why?
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 29 August 2016 17.26.49 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
Hi,
now that I have IPaddr work, I have a strange behaviour on my test
setup of 2 nodes, here is my configuration:
===STONITH/FENCING===
primitive xstorage1-stonith stonith:external/ssh-sonicle op monitor
interval="25" timeout="25" start-delay="25" params hostlist="xstorage1"
primitive xstorage2-stonith stonith:external/ssh-sonicle op monitor
interval="25" timeout="25" start-delay="25" params hostlist="xstorage2"
location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
property stonith-action=poweroff
===IP RESOURCES===
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
===
So I plumbed e1000g1 with unconfigured IP on both machines and started
corosync/pacemaker, and after some time I got all nodes online and
started, with IP configured as virtual interfaces (e1000g1:1 and
e1000g1:2) one in host1 and one in host2.
Then I halted host2, and I expected to have host1 started with both
IPs configured on host1.
Instead, I got host1 started with the IP stopped and removed (only
e1000g1 unconfigured), host2 stopped saying IP started (!?).
Not exactly what I expected...
What's wrong?
How did you stop host2? Graceful shutdown of pacemaker? If not ...
Anyway ssh-fencing is just working if the machine is still running ...
So it will stay unclean and thus pacemaker is thinking that
the IP might still be running on it. So this is actually the expected
behavior.
You might add a watchdog via sbd if you don't have other fencing
hardware at hand ...
Here is the crm status after I stopped host 2:
2 nodes and 4 resources configured
Node xstorage2: UNCLEAN (offline)
Online: [ xstorage1 ]
Full list of resources:
xstorage1-stonith (stonith:external/ssh-sonicle): Started xstorage2
(UNCLEAN)
xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Started xstorage2 (UNCLEAN)
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Gabriele Bulfon
Not the RAs themselves, but the ocf-* include files do, because they use "local".

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Dejan Muhamedagic
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 30 August 2016 10.44.49 CEST
Subject: Re: [ClusterLabs] ocf scripts shell and local variables
Hi,
On Mon, Aug 29, 2016 at 10:13:18AM -0500, Dmitri Maziuk wrote:
On 2016-08-29 04:06, Gabriele Bulfon wrote:
Thanks, though this does not work :)
Uhm... right. Too many languages, sorry: perl's system() will call the login
shell, the C library's system() uses /bin/sh, and the exec() family will run whatever the
programmer tells them to. The point is that none of them cares what shell is in
the shebang line, AFAIK.
The kernel reads the shebang line and it is what defines the
interpreter which is to be invoked to run the script.
But anyway, you're correct; a lot of linux "shell" scripts are bash-only and
pacemaker RAs are no exception.
None of the /bin/sh RAs requires bash.
Thanks,
Dejan
Dima
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Gabriele Bulfon
illumos (and Solaris 11) delivers ksh93, which is fully Bourne-compatible, but
not with the bash extension of "local" variables, which is not Bourne shell. ksh93
supports this with the "typeset" builtin instead of "local".
"local" is what is used inside the "ocf-*" scripts.
Gabriele
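A minimal illustration of that difference (assuming a stock illumos ksh93; note that typeset only scopes the variable when the function is declared with the ksh "function" keyword):

# under ksh93, typeset gives function-local scope in the "function" form
function demo {
    typeset x=inside
    echo "$x"
}
x=outside
demo           # prints: inside
echo "$x"      # prints: outside, the global was not clobbered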

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Dejan Muhamedagic
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 30 August 2016 12.20.19 CEST
Subject: Re: [ClusterLabs] ocf scripts shell and local variables
Hi,
On Mon, Aug 29, 2016 at 05:08:35PM +0200, Gabriele Bulfon wrote:
Sure, infact I can change all shebang to point to /bin/bash and it's ok.
The question is about current shebang /bin/sh which may go into trouble (as if 
one would point to a generic python but uses many specific features of a 
version of python).
Also, the question is about bash being a good option for RAs, being much more 
heavy.
I'd really suggest installing a smaller shell such as dash
and using that as /bin/sh. Isn't there a Bourne shell in Solaris?
If you modify the RAs, it could cause trouble on subsequent updates.
Thanks,
Dejan
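A sketch of what that could look like on illumos (hypothetical paths: dash is assumed to already be installed, and since /bin is a link to /usr/bin there, this changes the interpreter for every #!/bin/sh script on the system, so test carefully):

# see what /bin/sh currently points to (usually ksh93 on illumos)
ls -l /usr/bin/sh
# repoint it at dash (assumed to live in /usr/bin)
pfexec ln -sf /usr/bin/dash /usr/bin/sh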
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Dejan Muhamedagic
To: kgail...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 August 2016 16.43.52 CEST
Subject: Re: [ClusterLabs] ocf scripts shell and local variables
Hi,
On Mon, Aug 29, 2016 at 08:47:43AM -0500, Ken Gaillot wrote:
On 08/29/2016 04:17 AM, Gabriele Bulfon wrote:
Hi Ken,
I have been talking with the illumos guys about the shell problem.
They all agreed that ksh (and specially the ksh93 used in illumos) is
absolutely Bourne-compatible, and that the "local" variables used in the
ocf shells is not a Bourne syntax, but probably a bash specific.
This means that pointing the scripts to "#!/bin/sh" is portable as long
as the scripts are really Bourne-shell only syntax, as any Unix variant
may link whatever Bourne-shell they like.
In this case, it should point to "#!/bin/bash" or whatever shell the
script was written for.
Also, in this case, the starting point is not the ocf-* script, but the
original RA (IPaddr, but almost all of them).
What about making the code base of RA and ocf-* portable?
It may be just by changing them to point to bash, or with some kind of
configure modifier to be able to specify the shell to use.
Meanwhile, changing the scripts by hands into #!/bin/bash worked like a
charm, and I will start patching.
Gabriele
Interesting, I thought local was posix, but it's not. It seems everyone
but solaris implemented it:
http://stackoverflow.com/questions/18597697/posix-compliant-way-to-scope-variables-to-a-function-in-a-shell-script
Please open an issue at:
https://github.com/ClusterLabs/resource-agents/issues
The simplest solution would be to require #!/bin/bash for all RAs that
use local,
This issue was raised many times, but note that /bin/bash is a
shell not famous for being lean: it's great for interactive use,
but not so great if you need to run a number of scripts. The
complexity in bash, which is superfluous for our use case,
doesn't go well with the basic principles of HA clusters.
but I'm not sure that's fair to the distros that support
local in a non-bash default shell. Another possibility would be to
modify all RAs to avoid local entirely, by using unique variable
prefixes per function.
I doubt that we could do a moderately complex shell scripts
without capability of limiting the variables' scope and retaining
sanity at the same time.
Or, it may be possible to guard every instance of
local with a check for ksh, which would use typeset instead. Raising the
issue will allow some discussion of the possibilities.
Just to mention that this is the first time someone reported
running a shell which doesn't support local. Perhaps there's an
option that they install a shell which does.
Thanks,
Dejan

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed

Re: [ClusterLabs] ip clustering strange behaviour

2016-08-31 Thread Gabriele Bulfon
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested elsewhere,
"no-quorum-policy=stop"?
About fencing: the machine where I'm going to implement the 2-node cluster is a dual
machine with a shared-disk backend.
Each node has two 10Gb ethernets dedicated to the public IP and the admin
console.
Then there is a third, 100Mb ethernet connecting the two machines internally.
I was going to use this last one for fencing via ssh, but it looks like this way
I won't get IP/pool/zone movements if one of the nodes freezes or halts
without shutting down pacemaker cleanly.
What should I use instead?
Thanks for your help,
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: users@clusterlabs.org
Date: 31 August 2016 17.25.05 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
Sorry for reiterating, but my main question was:
why does node 1 removes its own IP if I shut down node 2 abruptly?
I understand that it does not take the node 2 IP (because the
ssh-fencing has no clue about what happened on the 2nd node), but I
wouldn't expect it to shut down its own IP...this would kill any service
on both nodes...what am I wrong?
Assuming you're using corosync 2, be sure you have "two_node: 1" in
corosync.conf. That will tell corosync to pretend there is always
quorum, so pacemaker doesn't need any special quorum settings. See the
votequorum(5) man page for details. Of course, you need fencing in this
setup, to handle when communication between the nodes is broken but both
are still up.
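For reference, a minimal corosync.conf quorum stanza for a corosync 2 two-node setup (two_node implicitly enables wait_for_all, which is what requires both nodes at first start):

quorum {
    provider: corosync_votequorum
    two_node: 1
}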

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
----
*From:* Gabriele Bulfon
*To:* kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
*Date:* 29 August 2016 17.37.36 CEST
*Subject:* Re: [ClusterLabs] ip clustering strange behaviour
Ok, got it, I hadn't gracefully shut pacemaker on node2.
Now I restarted, everything was up, stopped pacemaker service on
host2 and I got host1 with both IPs configured. ;)
But, though I understand that if I halt host2 with no grace shut of
pacemaker, it will not move the IP2 to Host1, I don't expect host1
to loose its own IP! Why?
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 29 August 2016 17.26.49 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
Hi,
now that I have IPaddr work, I have a strange behaviour on my test
setup of 2 nodes, here is my configuration:
===STONITH/FENCING===
primitive xstorage1-stonith stonith:external/ssh-sonicle op
monitor
interval="25" timeout="25" start-delay="25" params
hostlist="xstorage1"
primitive xstorage2-stonith stonith:external/ssh-sonicle op
monitor
interval="25" timeout="25" start-delay="25" params
hostlist="xstorage2"
location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
property stonith-action=poweroff
===IP RESOURCES===
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params
ip="1.2.3.4"
cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params
ip="1.2.3.5"
cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
===
So I plumbed e1000g1 with unconfigured IP on both machines and
started
corosync/pacemaker, and after some time I got all nodes online and
started, with IP configured as virtual interfaces (e1000g1:1 and
e1000g1:2) one in host1 and one in host2.
Then I halted host2, and I expected to have host1 started with
both
IPs configured on host1.
Instead, I got host1 started with the IP stopped and removed (only
e1000g1 unconfigured), host2 stopped saying IP started (!?).
Not exactly what I expected...
What's wrong?
How did you stop host2? Graceful shutdown of pacemaker? If not ...
Anyway ssh-fenci

Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Gabriele Bulfon
The dual machine is equipped with a Syncro controller (LSI 3008 MPT SAS3).
Both nodes can see the same JBOD disks (10 at the moment, up to 24).
The systems are XStreamOS / illumos, with ZFS.
Each system has one ZFS pool of 5 disks, with different pool names (data1,
data2).
When active / active, the two machines run different zones and services on
their own pools and networks.
I have custom resource agents (tested on pacemaker/heartbeat, now being ported to
pacemaker/corosync) for ZFS pool and zone migration.
When I was testing pacemaker/heartbeat, once ssh-fencing discovered the other
node to be down (cleanly or after an abrupt halt), it automatically used IPaddr and
our ZFS agents to take control of everything, mounting the other pool and
running any zone configured in it.
I would like to do the same with pacemaker/corosync.
The two nodes of the dual machine have an internal LAN connecting them, a 100Mb
ethernet: is this reliable enough to trust ssh-fencing? Or is there
anything I can do at the controller level to ensure that the pool is not in use
on the other node?
Gabriele
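One heuristic such an agent can apply before importing the peer's pool (a sketch only, using the pool names above and standard zpool behaviour; it is not a substitute for real fencing, since a hung node can still hold the pool):

# a plain import (no -f) fails if ZFS still believes the pool is active
# on another host, i.e. it was last written under a different hostid
if ! zpool list data2 >/dev/null 2>&1; then
    zpool import data2 || {
        echo "data2 still looks active on the other node, not forcing" >&2
        exit 1
    }
fi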

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 1 September 2016 15.49.04 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested else where, or
"no-quorum-policy=stop"?
I'd prefer "two_node: 1" and letting pacemaker's options default. But
see the votequorum(5) man page for what two_node implies -- most
importantly, both nodes have to be available when the cluster starts
before it will start any resources. Node failure is handled fine once
the cluster has started, but at start time, both nodes must be up.
About fencing, the machine I'm going to implement the 2-nodes cluster is
a dual machine with shared disks backend.
Each node has two 10Gb ethernets dedicated to the public ip and the
admin console.
Then there is a third 100Mb ethernet connecing the two machines internally.
I was going to use this last one as fencing via ssh, but looks like this
way I'm not gonna have ip/pool/zone movements if one of the nodes
freezes or halts without shutting down pacemaker clean.
What should I use instead?
I'm guessing as a dual machine, they share a power supply, so that rules
out a power switch. If the box has IPMI that can individually power
cycle each host, you can use fence_ipmilan. If the disks are shared via
iSCSI, you could use fence_scsi. If the box has a hardware watchdog
device that can individually target the hosts, you could use sbd. If
none of those is an option, probably the best you could do is run the
cluster nodes as VMs on each host, and use fence_xvm.
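As an illustration of the IPMI option in the crm syntax used earlier in the thread (a hypothetical sketch: addresses and credentials are placeholders, and parameter names vary between fence-agents versions):

primitive xstorage1-ipmi stonith:fence_ipmilan \
   params ipaddr="10.0.0.11" login="admin" passwd="secret" lanplus="true" \
          pcmk_host_list="xstorage1" \
   op monitor interval="60s"
location xstorage1-ipmi-pref xstorage1-ipmi -inf: xstorage1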
Thanks for your help,
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: users@clusterlabs.org
Date: 31 August 2016 17.25.05 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
Sorry for reiterating, but my main question was:
why does node 1 removes its own IP if I shut down node 2 abruptly?
I understand that it does not take the node 2 IP (because the
ssh-fencing has no clue about what happened on the 2nd node), but I
wouldn't expect it to shut down its own IP...this would kill any
service
on both nodes...what am I wrong?
Assuming you're using corosync 2, be sure you have "two_node: 1" in
corosync.conf. That will tell corosync to pretend there is always
quorum, so pacemaker doesn't need any special quorum settings. See the
votequorum(5) man page for details. Of course, you need fencing in this
setup, to handle when communication between the nodes is broken but both
are still up.

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--------
*From:* Gabriele Bulfon
*To:* kwenn...@redhat.com Cluster Labs - All topics related to open-source clustering welcomed
*Date:* 29 August 2016 17.37.36 CEST
*Subject:* Re: [ClusterLabs] ip clustering strange behaviour
Ok, got it, I hadn't gracefully shut pacemaker on node2.
Now I 

Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Gabriele Bulfon
I read the docs; it looks like sbd fencing is mostly about iSCSI/FC-exposed storage
resources.
Here I have real shared disks (seen from Solaris with the format utility as
normal SAS disks, but on both nodes).
They are all JBOD disks that ZFS organizes into raidz/mirror pools, so I have 5
disks in one pool on one node, and the other 5 disks in another pool on the other
node.
How can sbd work in this situation? Has it already been used/tested in a
Solaris environment with ZFS?
BTW, is there any possibility other than sbd?
Last but not least, is there any way to make ssh-fencing good enough?
At the moment, with ssh-fencing, if I shut down the second node, I get all of the
second node's resources in UNCLEAN state, not taken over by the first one.
If I reboot the second node, I only get the node online again, but the resources remain
stopped.
I remember my tests with heartbeat reacted differently (a halt would move everything
to node1, and everything would come back on restart).
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 5 September 2016 12.21.25 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
The dual machine is equipped with a syncro controller LSI 3008 MPT SAS3.
Both nodes can see the same jbod disks (10 at the moment, up to 24).
Systems are XStreamOS / illumos, with ZFS.
Each system has one ZFS pool of 5 disks, with different pool names
(data1, data2).
When in active / active, the two machines run different zones and
services on their pools, on their networks.
I have custom resource agents (tested on pacemaker/heartbeat, now
porting to pacemaker/corosync) for ZFS pools and zones migration.
When I was testing pacemaker/heartbeat, when ssh-fencing discovered
the other node to be down (cleanly or abrupt halt), it was
automatically using IPaddr and our ZFS agents to take control of
everything, mounting the other pool and running any configured zone in it.
I would like to do the same with pacemaker/corosync.
The two nodes of the dual machine have an inernal lan connecting them,
a 100Mb ethernet: maybe this is enough reliable to trust ssh-fencing?
Or is there anything I can do to ensure at the controller level that
the pool is not in use on the other node?
It is not just the reliability of the network connection that makes
ssh-fencing suboptimal. Something in the IP stack config (dynamic due to moving
resources) might have gone wrong. And resources might be hanging in a way
that prevents the node from being brought down gracefully. Hence my suggestion
to add a watchdog (where available)
via sbd.
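A sketch of the diskless, watchdog-only variant (this assumes sbd and a watchdog driver are actually available on the platform, which may not be the case on illumos):

# /etc/sysconfig/sbd -- watchdog-only (no shared disk) setup
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
# and in the cluster configuration (crm syntax, as above):
# property stonith-watchdog-timeout=10s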
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source clustering welcomed
Date: 1 September 2016 15.49.04 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested else
where, or
"no-quorum-policy=stop"?
I'd prefer "two_node: 1" and letting pacemaker's options default. But
see the votequorum(5) man page for what two_node implies -- most
importantly, both nodes have to be available when the cluster starts
before it will start any resources. Node failure is handled fine once
the cluster has started, but at start time, both nodes must be up.
About fencing, the machine I'm going to implement the 2-nodes
cluster is
a dual machine with shared disks backend.
Each node has two 10Gb ethernets dedicated to the public ip and the
admin console.
Then there is a third 100Mb ethernet connecing the two machines
internally.
I was going to use this last one as fencing via ssh, but looks
like this
way I'm not gonna have ip/pool/zone movements if one of the nodes
freezes or halts without shutting down pacemaker clean.
What should I use instead?
I'm guessing as a dual machine, they share a power supply, so that
rules
out a power switch. If the box has IPMI that can individually power
cycle each host, you can use fence_ipmilan. If the disks are
shared via
iSCSI, you could use fence_scsi. If the box has a hardware watchdog
device that can individually target the hosts, you could use sbd. If
none of those is an option, probably the best you could do is run the
cluster nodes as VMs on 

[ClusterLabs] clustering fiber channel pools on multiple wwns

2016-09-06 Thread Gabriele Bulfon
Hi,
on illumos, I have a way to cluster a ZFS pool on two nodes, by moving the
IP, the pool, and its shares to the other node at once.
This works for iSCSI too: the IP of the target is migrated together with
the pool, so the iSCSI resource is still there running on the same IP (just on a
different node).
Now I was thinking of doing the same with Fibre Channel: two nodes, each with its
own QLogic FC HBA connected to an FC switch, with VMware clients whose FC cards are
connected to the same switch.
I can't see how I can do this with FC, because with iSCSI I can migrate the
hosting IP, but with FC I can't migrate the hosting WWN!
What I need is to tell VMware that the target volume may be reachable on two
different WWNs, so a failing WWN should trigger a retry on the other WWN: the
pool and shared volumes will be moving from one WWN to the other.
Am I dreaming??
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org