Re: [ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Priyanka Balotra
Sure,
Here are the logs:


63138:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(post_cache_update)debug: Updated cache after membership event 44.
63139:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x2
(A_ELECTION_CHECK) for controller set by post_cache_update:81
63140:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0002 (an_action)
for controller cleared by do_fsa_action:108
63141:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (do_started)
info: Delaying start, Config not read (0040)
63142:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(register_fsa_input_adv)   debug: Stalling the FSA pending further input:
source=do_started cause=C_FSA_INTERNAL data=(nil) queue=0
63143:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x0002
(with_actions) for controller set by register_fsa_input_adv:88
63144:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (s_crmd_fsa)
debug: Exiting the FSA: queue=0, fsa_actions=0x20002, stalled=true
63145:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(config_query_callback)debug: Call 3 : Parsing CIB options
63146:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(config_query_callback)debug: Shutdown escalation occurs if DC has not
responded to request in 120ms
63147:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(config_query_callback)debug: Re-run scheduler after 90ms of
inactivity
63148:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pe_unpack_alerts) debug: Alert pf-ha-alert:
path=/usr/lib/ocf/resource.d/pacemaker/pf_ha_alert.sh timeout=3ms
tstamp-format='%H:%M:%S.%06N' 0 vars
63149:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0002 (an_action)
for controller cleared by do_fsa_action:108
63150:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (do_started)
debug: Init server comms
63151:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(qb_ipcs_us_publish)   info: server name: crmd
63152:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (do_started)
notice: Pacemaker controller successfully started and accepting
connections
63153:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x2 (an_action)
for controller cleared by do_fsa_action:108
63154:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(do_election_check)debug: Ignoring election check because we are
not in an election
63155:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x10100100
(new_actions) for controller set by s_crmd_fsa:198
63156:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (s_crmd_fsa)
debug: Processing I_PENDING: [ state=S_STARTING cause=C_FSA_INTERNAL
origin=do_started ]
63157:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x1000
(an_action) for controller cleared by do_fsa_action:108
63158:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962] (do_log)
info: Input I_PENDING received in state S_STARTING from do_started
63159:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(do_state_transition)  notice: State transition S_STARTING -> S_PENDING
| input=I_PENDING cause=C_FSA_INTERNAL origin=do_started
63160:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x0020
(A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
63161:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x0080
(A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
63162:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0020 (an_action)
for controller cleared by do_fsa_action:108
63163:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0080 (an_action)
for controller cleared by do_fsa_action:108
63164:Jul 17 14:16:25.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0010 (an_action)
for controller cleared by do_fsa_action:108
63165:Jul 17 14:16:26.132 FILE-2 pacemaker-controld  [15962]
(do_cl_join_query) debug: Querying for a DC
63166:Jul 17 14:16:26.132 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0100 (an_action)
for controller cleared by do_fsa_action:108
63167:Jul 17 14:16:26.132 FILE-2 pacemaker-controld  [15962]
(controld_start_timer) debug: Started Election Trigger (inject
I_DC_TIMEOUT if pops after 2ms, source=18)
63168:Jul 17 

Re: [ClusterLabs] Multi site pgsql

2023-07-19 Thread ProfiVPS Support

On 2023-07-19 at 11:50, ProfiVPS Support wrote:


Hello everyone,

I'm stuck with a multi-site cluster configuration and I'd really 
appreciate some help/pointers.


The scenario: I'd like to build two (whole) clusters in two different 
DCs, one of them acting as Master and the other as standby/Slave. Both 
clusters have a proxy, some web nodes and only one DB node. I'm not 
planning to add HA for pgsql inside the clusters; failing over to the 
other cluster if something goes south is acceptable.


So I ended up with pcs + booth.

- Currently both clusters start up; the main one starts a master psql 
service when the ticket is granted, and the slave starts a replicating 
psql.


- When the ticket is revoked, the master DB is stopped.

- The slave DB is not getting promoted, and I understand it doesn't even 
have a reason to do so.


Basically what I need is the slave to be promoted to master when the 
ticket is granted to it.


I'm not fully sure I'm not doing something wrong; to achieve this I had to 
use two different pgsql resource configurations on the two clusters.


Master cluster:

pcs -f cib.xml resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/lib/postgresql/15/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/postgresql/15/main/" \
node_list="sa_psql" \
logfile="/var/log/postgresql/postgresql-15-main.log" \
socketdir="/var/run/postgresql/" \
op monitor interval="11s" \
op monitor interval="10s" role="Master" \
op start timeout="60s" \
op stop timeout="60s" \
--group pgsql_group

# Force pgsql to reside only on the psql node
pcs -f cib.xml constraint location pgsql prefers la_psql=INFINITY
pcs -f cib.xml constraint location pgsql prefers la_worker1=-INFINITY
pcs -f cib.xml constraint location pgsql prefers la_hagw=-INFINITY

# Fence on ticket loss if we were promoted
pcs -f cib.xml constraint ticket set pgsql role=Promoted setoptions 
loss-policy=fence ticket=sqlticket


Slave cluster :

pcs -f cib.xml resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/lib/postgresql/15/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/postgresql/15/main/" \
node_list="sa_psql" \
logfile="/var/log/postgresql/postgresql-15-main.log" \
socketdir="/var/run/postgresql/" \
restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
master_ip="_master_ip_" \
repuser="repuser" \
rep_mode="slave" \
replication_slot_name="replica_1_slot" \
primary_conninfo_opt="password=* keepalives_idle=60 
keepalives_interval=5 keepalives_count=5" \
op monitor interval="31s" \
op monitor interval="30s" role="Promoted" \
op start timeout="60s" \
op stop timeout="60s" \
op promote timeout="120s" \
--group pgsql_group

# Force pgsql to only run on the sql node
pcs -f cib.xml constraint location pgsql prefers sa_psql=INFINITY
pcs -f cib.xml constraint location pgsql prefers sa_worker1=-INFINITY
pcs -f cib.xml constraint location pgsql prefers sa_hagw=-INFINITY

# Without this, the service wouldn't start
pcs -f cib.xml constraint ticket set pgsql role=Promoted setoptions 
loss-policy=demote ticket=sqlticket


When the configuration is pushed and the ticket granted, they start up in 
master/slave streaming replication mode. However, debug-promote on the slave 
psql returns:
Operation force-promote for pgsql (ocf:heartbeat:pgsql) returned 6 (not 
configured: Not in a replication mode.)


Which is strange, because:
is_replication() {
  if [ "$OCF_RESKEY_rep_mode" != "none" -a "$OCF_RESKEY_rep_mode" != "slave" ]; then
    return 0
  fi
  return 1
}

So I'm pretty much stuck.
I'm also not sure that booth is a definite must here; sometimes I feel 
I'd be better off putting all of them into one big cluster with an 
external tie-breaker. But now I've got sooo much time invested, I'd love 
to see it through.


All help is greatly appreciated!

Thank you,
András

---
Affordable virtual servers:
http://www.ProfiVPS.hu

Support: supp...@profivps.hu


Anyone, pretty please? :)

---
Affordable virtual servers:
http://www.ProfiVPS.hu

Support: supp...@profivps.hu


Re: [ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Ken Gaillot
On Wed, 2023-07-19 at 23:49 +0530, Priyanka Balotra wrote:
> Hi All, 
> I am using SLES 15 SP4. One of the nodes of the cluster was brought
> down and booted up after some time. The Pacemaker service came up first,
> but later it faced a fatal shutdown. Because of that, the crm service is down.
> 
> The logs from /var/log/pacemaker/pacemaker.log are as follows:
> 
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> (pcmk_child_exit)warning: Shutting cluster down because
> pacemaker-controld[15962] had fatal failure

The interesting messages will be before this. The ones with "pacemaker-
controld" will be the most relevant, at least initially.

> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> (pcmk_shutdown_worker)   notice: Shutting down Pacemaker
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> (pcmk_shutdown_worker)   debug: pacemaker-controld confirmed stopped
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)  
>   notice: Stopping pacemaker-schedulerd | sent signal 15 to process
> 15961
> Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> (crm_signal_dispatch)notice: Caught 'Terminated' signal | 15
> (invoking handler)
> Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> (qb_ipcs_us_withdraw)info: withdrawing server sockets
> Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> (qb_ipcs_unref)  debug: qb_ipcs_unref() - destroying
> Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> (crm_xml_cleanup)info: Cleaning up memory from libxml2
> Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
>   info: Exiting pacemaker-schedulerd | with status 0
> Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
> (qb_ipcs_event_sendv)debug: new_event_notification (/dev/shm/qb-
> 15957-15962-12-RDPw6O/qb): Broken pipe (32)
> Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
> (cib_notify_send_one)warning: Could not notify client crmd:
> Broken pipe | id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
> Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
> (cib_process_request)info: Completed cib_delete operation for
> section //node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-
> 6/crmd/74, version=0.24.75)
> Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958]
> (xml_patch_version_check)debug: Can apply patch 0.24.75 to
> 0.24.74
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> (pcmk_child_exit)info: pacemaker-schedulerd[15961] exited
> with status 0 (OK)
> Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
> (cib_process_request)info: Completed cib_modify operation for
> section status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> (pcmk_shutdown_worker)   debug: pacemaker-schedulerd confirmed
> stopped
> Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)  
>   notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
> Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960]
> (crm_signal_dispatch)notice: Caught 'Terminated' signal | 15
> (invoking handler)
> 
> Could you please help me understand the issue here?
> 
> Regards
> Priyanka
-- 
Ken Gaillot 



[ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Priyanka Balotra
Hi All,
I am using SLES 15 SP4. One of the nodes of the cluster was brought down and
booted up after some time. The Pacemaker service came up first, but later it
faced a fatal shutdown. Because of that, the crm service is down.

The logs from /var/log/pacemaker/pacemaker.log are as follows:

Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 warning: Shutting cluster down because pacemaker-controld[15962] had
fatal failure
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   notice: Shutting down Pacemaker
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-controld confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(qb_ipcs_us_withdraw)info: withdrawing server sockets
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_unref)
 debug: qb_ipcs_unref() - destroying
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_xml_cleanup)
 info: Cleaning up memory from libxml2
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
info: Exiting pacemaker-schedulerd | with status 0
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(qb_ipcs_event_sendv)debug: new_event_notification
(/dev/shm/qb-15957-15962-12-RDPw6O/qb): Broken pipe (32)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_notify_send_one)warning: Could not notify client crmd: Broken pipe
| id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_delete operation for section
//node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-6/crmd/74,
version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958]
(xml_patch_version_check)debug: Can apply patch 0.24.75 to 0.24.74
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 info: pacemaker-schedulerd[15961] exited with status 0 (OK)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_modify operation for section
status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-schedulerd confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)

Could you please help me understand the issue here?

Regards
Priyanka


Re: [ClusterLabs] General questions on fencing code

2023-07-19 Thread Oyvind Albrigtsen

On 19/07/23 11:24 +0300, Or Raz wrote:

Hi all,
I was looking at the fencing code and I have two questions:

  1. What is the use of the autodetect agent? I didn't see any
  fence_autodetect agent.

That's a project that is on hold.

  2. What is the use of the *.py.py* extension in the files under the lib
  directory?

This is just like the fence agents themselves: the final .py gets removed 
during the build, when the Python binary (@PYTHON@), libdir 
(@FENCEAGENTSLIBDIR@), and other @..@ placeholders are replaced with whatever 
configure detected.
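Conceptually it's something like this (illustration only -- the real rule is
in the Makefiles, and the substituted paths are whatever configure detected
on your system):

  sed -e 's|@PYTHON@|/usr/bin/python3|' \
      -e 's|@FENCEAGENTSLIBDIR@|/usr/share/fence|' \
      lib/fencing.py.py > lib/fencing.py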


Oyvind


Best regards,
*OR*





[ClusterLabs] Multi site pgsql

2023-07-19 Thread ProfiVPS Support

Hello everyone,

  I'm stuck with a multi-site cluster configuration and I'd really 
appreciate some help/pointers.


  The scenario: I'd like to build two (whole) clusters in two different 
DCs, one of them acting as Master and the other as standby/Slave. Both 
clusters have a proxy, some web nodes and only one DB node. I'm not 
planning to add HA for pgsql inside the clusters; failing over to the 
other cluster if something goes south is acceptable.


  So I ended up with pcs + booth.

- Currently both clusters start up; the main one starts a master 
psql service when the ticket is granted, and the slave starts a replicating 
psql.


- When the ticket is revoked, the master DB is stopped.

- The slave DB is not getting promoted, and I understand it doesn't even 
have a reason to do so.


  Basically what I need is the slave to be promoted to master when the 
ticket is granted to it.
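For reference, my understanding is that the ticket is moved between the 
sites with the usual pcs booth commands -- roughly:

  # grant the ticket to the local site, or take it away again
  pcs booth ticket grant sqlticket
  pcs booth ticket revoke sqlticket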


 I'm not fully sure I'm not doing something wrong; to achieve this I had to 
use two different pgsql resource configurations on the two clusters.


 Master cluster:

pcs -f cib.xml resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/lib/postgresql/15/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/postgresql/15/main/" \
node_list="sa_psql" \
logfile="/var/log/postgresql/postgresql-15-main.log" \
socketdir="/var/run/postgresql/" \
op monitor interval="11s" \
op monitor interval="10s" role="Master" \
op start timeout="60s" \
op stop timeout="60s" \
--group pgsql_group

# Force pgsql to reside only on the psql node
pcs -f cib.xml constraint location pgsql prefers la_psql=INFINITY
pcs -f cib.xml constraint location pgsql prefers la_worker1=-INFINITY
pcs -f cib.xml constraint location pgsql prefers la_hagw=-INFINITY

# Fence on ticket loss if we were promoted
pcs -f cib.xml constraint ticket set pgsql role=Promoted setoptions 
loss-policy=fence ticket=sqlticket


 Slave cluster :

pcs -f cib.xml resource create pgsql ocf:heartbeat:pgsql \
pgctl="/usr/lib/postgresql/15/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/postgresql/15/main/" \
node_list="sa_psql" \
logfile="/var/log/postgresql/postgresql-15-main.log" \
socketdir="/var/run/postgresql/" \
restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
master_ip="_master_ip_" \
repuser="repuser" \
rep_mode="slave" \
replication_slot_name="replica_1_slot" \
primary_conninfo_opt="password=* keepalives_idle=60 
keepalives_interval=5 keepalives_count=5" \
op monitor interval="31s" \
op monitor interval="30s" role="Promoted" \
op start timeout="60s" \
op stop timeout="60s" \
op promote timeout="120s" \
--group pgsql_group

# Force pgsql to only run on the sql node
pcs -f cib.xml constraint location pgsql prefers sa_psql=INFINITY
pcs -f cib.xml constraint location pgsql prefers sa_worker1=-INFINITY
pcs -f cib.xml constraint location pgsql prefers sa_hagw=-INFINITY

# Without this, the service wouldn't start
pcs -f cib.xml constraint ticket set pgsql role=Promoted setoptions 
loss-policy=demote ticket=sqlticket


When the configuration is pushed and the ticket granted, they start up in 
master/slave streaming replication mode. However, debug-promote on the slave 
psql returns:
Operation force-promote for pgsql (ocf:heartbeat:pgsql) returned 6 (not 
configured: Not in a replication mode.)


Which is strange, because:
is_replication() {
  if [ "$OCF_RESKEY_rep_mode" != "none" -a "$OCF_RESKEY_rep_mode" != "slave" ]; then
    return 0
  fi
  return 1
}
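If I trace that test with my setting (unless I'm misreading the shell), the
second comparison is false for rep_mode="slave", so the agent concludes it is
not in a replication mode -- which at least matches the error:

  # my reading of the check above, with the value from my config
  OCF_RESKEY_rep_mode="slave"
  if [ "$OCF_RESKEY_rep_mode" != "none" -a "$OCF_RESKEY_rep_mode" != "slave" ]; then
    echo "is_replication: true"    # only for rep_mode values other than none/slave
  else
    echo "is_replication: false"   # this branch fires with rep_mode=slave
  fi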

 So I'm pretty much stuck.
 I'm also not sure that booth is a definite must here; sometimes I feel 
I'd be better off putting all of them into one big cluster with an 
external tie-breaker. But now I've got sooo much time invested, I'd love to 
see it through.


All help is greatly appreciated!

Thank you,
András

---
Affordable virtual servers:
http://www.ProfiVPS.hu

Support: supp...@profivps.hu


[ClusterLabs] General questions on fencing code

2023-07-19 Thread Or Raz
Hi all,
I was looking at the fencing code and I have two questions:

   1. What is the use of the autodetect agent? I didn't see any
   fence_autodetect agent.
   2. What is the use of the *.py.py* extension in the files under the lib
   directory?

Best regards,
*OR*