Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly
OK,
Thank you very much for your help!
_Vitaly

> On 07/05/2022 8:47 PM Reid Wahl  wrote:
> 
>  
> On Tue, Jul 5, 2022 at 3:03 PM vitaly  wrote:
> >
> > Hello,
> > Yes, the snippet has everything there was for the full second of Jul 05 
> > 11:54:34. I did not cut anything between the last line of 11:54:33 and 
> > first line of 11:54:35.
> >
> > Here is grep from pacemaker config:
> >
> > d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
> > PCMK_logfile=/var/log/pacemaker.log
> > SBD_SYNC_RESOURCE_STARTUP="no"
> > PCMK_trace_functions=services_action_sync,svc_read_output
> > d19-25-left.lab.archivas.com ~ #
> >
> > I also grepped the CURRENT pacemaker.log for services_action_sync and got just 
> > 4 records, at a time that does not seem to match the failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep services_action_sync 
> > /var/log/pacemaker.log
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> > /usr/sbin/fence_ipmilan = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> > /usr/sbin/fence_sbd = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
> >
> > This is grep of messages for failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" 
> > /var/log/messages
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > d19-25-left.lab.archivas.com ~ #
> >
> > Sorry, these logs are not from the same time as this morning, as I reinstalled the cluster a couple of times today.

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread Reid Wahl
On Tue, Jul 5, 2022 at 3:03 PM vitaly  wrote:
>
> Hello,
> Yes, the snippet has everything there was for the full second of Jul 05 
> 11:54:34. I did not cut anything between the last line of 11:54:33 and first 
> line of 11:54:35.
>
> Here is grep from pacemaker config:
>
> d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
> PCMK_logfile=/var/log/pacemaker.log
> SBD_SYNC_RESOURCE_STARTUP="no"
> PCMK_trace_functions=services_action_sync,svc_read_output
> d19-25-left.lab.archivas.com ~ #
>
> I also grepped the CURRENT pacemaker.log for services_action_sync and got just 4 
> records, at a time that does not seem to match the failures:
>
> d19-25-left.lab.archivas.com ~ # grep services_action_sync 
> /var/log/pacemaker.log
> Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> /usr/sbin/fence_ipmilan = 0
> Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
> Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> /usr/sbin/fence_sbd = 0
> Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
>
> This is grep of messages for failures:
>
> d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" 
> /var/log/messages
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
> metadata for postgres (ocf:heartbeat:pgsql-rhino)
> d19-25-left.lab.archivas.com ~ #
>
> Sorry, these logs are not from the same time as this morning, as I reinstalled the 
> cluster a couple of times today.
>
> Thanks,
> _Vitaly
>

Strange. If we reach "Failed to receive meta-data", that means
services_action_sync() returned true... and if services_action_sync()
returned true, then we should hit a crm_trace() line no matter what.
```
lrmd_api_
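
One quick check in the meantime: see whether pacemaker-controld itself ever hit the
traced functions, or whether the four hits above really all came from
pacemaker-fenced (a sketch, assuming the log path shown in the grep above):

    # Did pacemaker-controld itself ever hit the traced functions?
    grep -E 'pacemaker-controld ?\[[0-9]+\].*(services_action_sync|svc_read_output)' \
        /var/log/pacemaker.log

    # Total trace hits across all daemons, for comparison
    grep -cE 'services_action_sync|svc_read_output' /var/log/pacemaker.log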

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly via Users
Hello,
Yes, the snippet has everything there was for the full second of Jul 05 
11:54:34. I did not cut anything between the last line of 11:54:33 and the first 
line of 11:54:35.

Here is a grep of the pacemaker config:

d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
PCMK_logfile=/var/log/pacemaker.log
SBD_SYNC_RESOURCE_STARTUP="no"
PCMK_trace_functions=services_action_sync,svc_read_output
d19-25-left.lab.archivas.com ~ # 

I also grepped the CURRENT pacemaker.log for services_action_sync and got just 4 
records, at a time that does not seem to match the failures:

d19-25-left.lab.archivas.com ~ # grep services_action_sync 
/var/log/pacemaker.log
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
/usr/sbin/fence_ipmilan = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
/usr/sbin/fence_sbd = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>

This is a grep of /var/log/messages for the failures:

d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" 
/var/log/messages 
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
d19-25-left.lab.archivas.com ~ # 

Sorry, these logs are not from the same time as this morning, as I reinstalled the 
cluster a couple of times today.

Thanks,
_Vitaly


> On 07/05/2022 3:19 PM Reid Wahl  wrote:
> 
>  
> On Tue, Jul 5, 2022 at 5:17 AM vitaly  wrote:
> >
> > Hello,
> > Thanks for looking at this issue!
> > Snippets from /var/log/messages and /var/log/pacemaker.log are below.
> > _Vitaly
> >
> > Here is /var/log/pacemaker.log snippet around the failure:
> >
> > Jul 05 11:54:33 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> > (svc_read_output@services_linux.c:277)

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread Reid Wahl
On Tue, Jul 5, 2022 at 5:17 AM vitaly  wrote:
>
> Hello,
> Thanks for looking at this issue!
> Snippets from /var/log/messages and /var/log/pacemaker.log are below.
> _Vitaly
>
> Here is /var/log/pacemaker.log snippet around the failure:
>
> Jul 05 11:54:33 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:277)  trace: Reading M_IP_monitor_1 
> stdout into offset 177
> Jul 05 11:54:34  tomcat-rhino(tomcat-instance)[2295103]:INFO: [tomcat] 
> Leave tomcat start 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading 
> tomcat-instance_start_0 stderr into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:277)  trace: Reading 
> tomcat-instance_start_0 stdout into offset 10505
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_finished@execd_commands.c:214) info: tomcat-instance start (call 
> 59, PID 2295103) exited with status 0 (execution time 110997ms, queue time 
> 0ms)
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_execute@execd_commands.c:232)  info: executing - rsc:N1F1 action:start 
> call_id:66
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_execute@execd_commands.c:232)  info: executing - rsc:fs_monitor 
> action:start call_id:67
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading fs_monitor_start_0 
> stdout into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:287)  trace: Got 54 chars: 2298369 
> (process ID) old priority 0, new priority -10
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading N1F1_start_0 
> stdout into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:287)  trace: Got 175 chars: 8: bond0: 
>  mtu 1500 qdisc noqueue state
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading 
> tomcat-instance_monitor_1 stderr into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading 
> tomcat-instance_monitor_1 stdout into offset 0
> Jul 05 11:54:34  fs_monitor-rhino(fs_monitor)[2298359]:INFO: Started 
> fs_monitor.sh, pid=2298369
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading fs_monitor_start_0 
> stderr into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:277)  trace: Reading fs_monitor_start_0 
> stdout into offset 54
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_finished@execd_commands.c:214) info: fs_monitor start (call 67, 
> PID 2298359) exited with status 0 (execution time 31ms, queue time 0ms)
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_execute@execd_commands.c:232)  info: executing - rsc:ClusterMonitor 
> action:start call_id:69
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading 
> fs_monitor_monitor_1 stderr into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading 
> fs_monitor_monitor_1 stdout into offset 0
> Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: Adding inet address 
> 172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 (with 
> label bond0:N1F1)
> Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: Bringing device bond0 
> up
> Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
> Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: 
> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
> /run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used 
> not_used
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:280)  trace: Reading N1F1_start_0 
> stderr into offset 0
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (svc_read_output@services_linux.c:277)  trace: Reading N1F1_start_0 
> stdout into offset 175
> Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
> (log_finished@execd_commands.c:214) info: N1F1 start (call 66, PID 
> > 2298357) exited with status 0 (execution time 68ms, queue time 0ms)

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly
Hello,
Thanks for looking at this issue!
Snippets from /var/log/messages and /var/log/pacemaker.log are below.
_Vitaly

Here is /var/log/pacemaker.log snippet around the failure:

Jul 05 11:54:33 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:277)  trace: Reading M_IP_monitor_1 
stdout into offset 177
Jul 05 11:54:34  tomcat-rhino(tomcat-instance)[2295103]:INFO: [tomcat] 
Leave tomcat start 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading 
tomcat-instance_start_0 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:277)  trace: Reading 
tomcat-instance_start_0 stdout into offset 10505
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_finished@execd_commands.c:214) info: tomcat-instance start (call 
59, PID 2295103) exited with status 0 (execution time 110997ms, queue time 0ms)
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_execute@execd_commands.c:232)  info: executing - rsc:N1F1 action:start 
call_id:66
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_execute@execd_commands.c:232)  info: executing - rsc:fs_monitor 
action:start call_id:67
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading fs_monitor_start_0 
stdout into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:287)  trace: Got 54 chars: 2298369 
(process ID) old priority 0, new priority -10
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading N1F1_start_0 stdout 
into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:287)  trace: Got 175 chars: 8: bond0: 
 mtu 1500 qdisc noqueue state
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading 
tomcat-instance_monitor_1 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading 
tomcat-instance_monitor_1 stdout into offset 0
Jul 05 11:54:34  fs_monitor-rhino(fs_monitor)[2298359]:INFO: Started 
fs_monitor.sh, pid=2298369
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading fs_monitor_start_0 
stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:277)  trace: Reading fs_monitor_start_0 
stdout into offset 54
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_finished@execd_commands.c:214) info: fs_monitor start (call 67, 
PID 2298359) exited with status 0 (execution time 31ms, queue time 0ms)
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_execute@execd_commands.c:232)  info: executing - rsc:ClusterMonitor 
action:start call_id:69
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading 
fs_monitor_monitor_1 stderr into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading 
fs_monitor_monitor_1 stdout into offset 0
Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: Adding inet address 
172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 (with 
label bond0:N1F1)
Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: Bringing device bond0 up
Set r/w permissions for uid=189, gid=189 on /var/log/pacemaker.log
Jul 05 11:54:34  IPaddr2-rhino(N1F1)[2298357]:INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used 
not_used
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:280)  trace: Reading N1F1_start_0 stderr 
into offset 0
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(svc_read_output@services_linux.c:277)  trace: Reading N1F1_start_0 stdout 
into offset 175
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_finished@execd_commands.c:214) info: N1F1 start (call 66, PID 
2298357) exited with status 0 (execution time 68ms, queue time 0ms)
Jul 05 11:54:34  cluster_monitor-rhino(ClusterMonitor)[2298481]:INFO: 
Started cluster_monitor.sh, pid=2298549
Jul 05 11:54:34 d19-25-right.lab.archivas.com pacemaker-execd [2294543] 
(log_fi

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-04 Thread Reid Wahl
On Mon, Jul 4, 2022 at 7:19 AM vitaly  wrote:
>
> I get printout of metadata as follows:
> d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf 
> /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data
> 
> 
> 
> 1.0
>
> 
> Resource script for PostgreSQL. It manages a PostgreSQL as an HA resource.
> 
> Manages a PostgreSQL database instance
>
> 
> 
> 
> Path to pg_ctl command.
> 
> pgctl
> 
> 
>
> 
> 
> Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.
> 
> start_opt
> 
>
> 
> 
> 
> Additional pg_ctl options (-w, -W etc..).
> 
> ctl_opt
> 
> 
>
> 
> 
> Path to psql command.
> 
> psql
> 
> 
>
> 
> 
> Path to PostgreSQL data directory.
> 
> pgdata
> 
> 
>
> 
> 
> User that owns PostgreSQL.
> 
> pgdba
> 
> 
>
> 
> 
> Hostname/IP address where PostgreSQL is listening
> 
> pghost
> 
> 
>
> 
> 
> Port where PostgreSQL is listening
> 
> pgport
> 
> 
>
> 
> 
> PostgreSQL user that pgsql RA will user for monitor operations. If it's not 
> set
> pgdba user will be used.
> 
> monitor_user
> 
> 
>
> 
> 
> Password for monitor user.
> 
> monitor_password
> 
> 
>
> 
> 
> SQL script that will be used for monitor operations.
> 
> monitor_sql
> 
> 
>
> 
> 
> Path to the PostgreSQL configuration file for the instance.
> 
> Configuration file
> 
> 
>
> 
> 
> Database that will be used for monitoring.
> 
> pgdb
> 
> 
>
> 
> 
> Path to PostgreSQL server log output file.
> 
> logfile
> 
> 
>
> 
> 
> Unix socket directory for PostgeSQL
> 
> socketdir
> 
> 
>
> 
> 
> Number of shutdown retries (using -m fast) before resorting to -m immediate
> 
> stop escalation
> 
> 
>
> 
> 
> Replication mode(none(default)/async/sync).
> "async" and "sync" require PostgreSQL 9.1 or later.
> If you use async or sync, it requires node_list, master_ip, restore_command
> parameters, and needs setting postgresql.conf, pg_hba.conf up for
> replication.
> Please delete "include /../../rep_mode.conf" line in postgresql.conf
> when you switch from sync to async.
> 
> rep_mode
> 
> 
>
> 
> 
> All node names. Please separate each node name with a space.
> This is required for replication.
> 
> node list
> 
> 
>
> 
> 
> restore_command for recovery.conf.
> This is required for replication.
> 
> restore_command
> 
> 
>
> 
> 
> Master's floating IP address to be connected from hot standby.
> This parameter is used for "primary_conninfo" in recovery.conf.
> This is required for replication.
> 
> master ip
> 
> 
>
> 
> 
> User used to connect to the master server.
> This parameter is used for "primary_conninfo" in recovery.conf.
> This is required for replication.
> 
> repuser
> 
> 
>
> 
> 
> Location of WALS archived by the other node
> 
> remote_wals_dir
> 
> 
>
> 
> 
> Location of WALS on current node in Rhino before 2.2.0
> 
> xlogs_dir
> 
> 
>
> 
> 
> Location of WALS on current node in Rhino 2.2.0 and later
> 
> wals_dir
> 
> 
>
> 
> 
> User used to connect to the master server.
> This parameter is used for "primary_conninfo" in recovery.conf.
> This is required for replication.
> 
> reppassword
> 
> 
>
> 
> 
> primary_conninfo options of recovery.conf except host, port, user and 
> application_name.
> This is optional for replication.
> 
> primary_conninfo_opt
> 
> 
>
> 
> 
> Path to temporary directory.
> This is optional for replication.
> 
> tmpdir
> 
> 
>
> 
> 
> Number of checking xlog on monitor before promote.
> This is optional for replication.
> 
> xlog check count
> 
> 
>
> 
> 
> The timeout of crm_attribute forever update command.
> Default value is 5 seconds.
> This is optional for replication.
> 
> The timeout of crm_attribute forever update 
> command.
> 
> 
>
> 
> 
> Number of shutdown retries (using -m fast) before resorting to -m immediate
> in Slave state.
> This is optional for replication.
> 
> stop escalation_in_slave
> 
> 
>
> 
> 
> Number of seconds to wait for a postgreSQL process to be running but not 
> necessarilly usable
> 
> Seconds to wait for a process to be running
> 
> 
>
> 
> 
> Number of failed starts before the system forces a recovery from the master 
> database
> 
> Start failures before recovery
> 
> 
>
> 
> 
> Configuration file with overrides for pgsql-rhino.
> 
> Rhino configuration file
> 
> 
>
> 

Hmm, seems reasonable. No permissions issues, and it looks like we
should only print the "Failed to receive" message if we don't receive
any stdout at all from the meta-data action.
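
One extra sanity check that may help: if the meta-data action ends up being run by
pacemaker-controld itself, it executes as the cluster user rather than root, so it
can be worth re-running the agent as that user and confirming it still prints XML
on stdout and exits 0. A sketch, assuming the daemon user is hacluster (the usual
default):

    # Run the agent's meta-data action as the cluster daemon user
    sudo -u hacluster env OCF_ROOT=/usr/lib/ocf \
        /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data > /tmp/pgsql-rhino.meta
    echo "exit code: $?"
    head -n 2 /tmp/pgsql-rhino.meta   # expect the XML declaration and <resource-agent ...>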

Can you add the following to /etc/sysconfig/pacemaker and restart
pacemaker? Then monitor /var/log/pacemaker/pacemaker.log for relevant
trace-level messages around the same time as the "Failed to receive
meta-data" messages.

PCMK_trace_functions=services_action_sync,svc_read_output

This will get fairly verbose if you have more than a couple of
resources, so after you've grabbed any relevant logs, comment that
line out and restart pacemaker again.
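
For reference, the whole sequence might look roughly like this (a sketch; it assumes
a systemd-managed cluster, and that PCMK_logfile has not been pointed somewhere else):

    # 1. Enable function-level tracing for the metadata code path
    echo 'PCMK_trace_functions=services_action_sync,svc_read_output' \
        >> /etc/sysconfig/pacemaker

    # 2. Restart Pacemaker so the daemons pick up the new environment
    systemctl restart pacemaker

    # 3. After reproducing the "Failed to receive meta-data" errors,
    #    pull out the trace lines from around that time
    grep -E 'services_action_sync|svc_read_output' /var/log/pacemaker/pacemaker.log

    # 4. When done, comment the line out again and restart once more
    sed -i 's/^PCMK_trace_functions=/#&/' /etc/sysconfig/pacemaker
    systemctl restart pacemaker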

>
> > On 07/04/2022 5:39 AM Reid Wahl  wrote:
> >
> >
> > On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl  wrote:
> > >
> > > On Sat, J

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-04 Thread vitaly
I get a printout of the metadata as follows:
d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf 
/usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data



1.0


Resource script for PostgreSQL. It manages a PostgreSQL as an HA resource.

Manages a PostgreSQL database instance




Path to pg_ctl command.

pgctl





Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.

start_opt





Additional pg_ctl options (-w, -W etc..).

ctl_opt





Path to psql command.

psql





Path to PostgreSQL data directory.

pgdata





User that owns PostgreSQL.

pgdba





Hostname/IP address where PostgreSQL is listening

pghost





Port where PostgreSQL is listening

pgport





PostgreSQL user that pgsql RA will user for monitor operations. If it's not set
pgdba user will be used.

monitor_user





Password for monitor user.

monitor_password





SQL script that will be used for monitor operations.

monitor_sql





Path to the PostgreSQL configuration file for the instance.

Configuration file





Database that will be used for monitoring.

pgdb





Path to PostgreSQL server log output file.

logfile





Unix socket directory for PostgeSQL

socketdir





Number of shutdown retries (using -m fast) before resorting to -m immediate

stop escalation





Replication mode(none(default)/async/sync).
"async" and "sync" require PostgreSQL 9.1 or later.
If you use async or sync, it requires node_list, master_ip, restore_command
parameters, and needs setting postgresql.conf, pg_hba.conf up for
replication.
Please delete "include /../../rep_mode.conf" line in postgresql.conf
when you switch from sync to async.

rep_mode





All node names. Please separate each node name with a space.
This is required for replication.

node list





restore_command for recovery.conf.
This is required for replication.

restore_command





Master's floating IP address to be connected from hot standby.
This parameter is used for "primary_conninfo" in recovery.conf.
This is required for replication.

master ip





User used to connect to the master server.
This parameter is used for "primary_conninfo" in recovery.conf.
This is required for replication.

repuser





Location of WALS archived by the other node

remote_wals_dir





Location of WALS on current node in Rhino before 2.2.0

xlogs_dir





Location of WALS on current node in Rhino 2.2.0 and later

wals_dir





User used to connect to the master server.
This parameter is used for "primary_conninfo" in recovery.conf.
This is required for replication.

reppassword





primary_conninfo options of recovery.conf except host, port, user and 
application_name.
This is optional for replication.

primary_conninfo_opt





Path to temporary directory.
This is optional for replication.

tmpdir





Number of checking xlog on monitor before promote.
This is optional for replication.

xlog check count





The timeout of crm_attribute forever update command.
Default value is 5 seconds.
This is optional for replication.

The timeout of crm_attribute forever update 
command.





Number of shutdown retries (using -m fast) before resorting to -m immediate
in Slave state.
This is optional for replication.

stop escalation_in_slave





Number of seconds to wait for a postgreSQL process to be running but not 
necessarilly usable

Seconds to wait for a process to be running





Number of failed starts before the system forces a recovery from the master 
database

Start failures before recovery





Configuration file with overrides for pgsql-rhino.

Rhino configuration file


> On 07/04/2022 5:39 AM Reid Wahl  wrote:
> 
>  
> On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl  wrote:
> >
> > On Sat, Jul 2, 2022 at 1:12 PM vitaly  wrote:
> > >
> > > Sorry, I noticed that I am missing meta "notice=true" and after adding it 
> > > to postgres-ms configuration "notice" events started to come through.
> > > Item 1 still needs explanation. As pacemaker-controld keeps complaining.
> >
> > What happens when you run `OCF_ROOT=/usr/lib/ocf
> > /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?
> 
> This may also be relevant:
> https://lists.clusterlabs.org/pipermail/users/2022-June/030391.html
> 
> >
> > > Thanks!
> > > _Vitaly
> > >
> > > > On 07/02/2022 2:04 PM vitaly  wrote:
> > > >
> > > >
> > > > Hello Everybody.
> > > > I have a 2 node cluster with clone resource “postgres-ms”. We are 
> > > > running following versions of pacemaker/corosync:
> > > > d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> > > > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-cli-2.0.5-9.el8.x86_64
> > > > corosynclib-3.1.0-5.el8.x86_64
> > > > pacemaker-schemas-2.0.5-9.el8.noarch
> > > > corosync-3.1.0-5.el8.x86_64
> > > > pacemaker-2.0.5-9.el8.x86_64
> > > >
> > > > There are couple of issues that could be related.
> > > > 1. There are following messages in the logs coming from 
> > > > pacemaker-controld:
> > > > Jul  2 1

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-04 Thread Reid Wahl
On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl  wrote:
>
> On Sat, Jul 2, 2022 at 1:12 PM vitaly  wrote:
> >
> > Sorry, I noticed that I am missing meta "notice=true" and after adding it 
> > to postgres-ms configuration "notice" events started to come through.
> > Item 1 still needs explanation. As pacemaker-controld keeps complaining.
>
> What happens when you run `OCF_ROOT=/usr/lib/ocf
> /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?

This may also be relevant:
https://lists.clusterlabs.org/pipermail/users/2022-June/030391.html

>
> > Thanks!
> > _Vitaly
> >
> > > On 07/02/2022 2:04 PM vitaly  wrote:
> > >
> > >
> > > Hello Everybody.
> > > I have a 2 node cluster with clone resource “postgres-ms”. We are running 
> > > following versions of pacemaker/corosync:
> > > d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> > > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > > pacemaker-libs-2.0.5-9.el8.x86_64
> > > pacemaker-cli-2.0.5-9.el8.x86_64
> > > corosynclib-3.1.0-5.el8.x86_64
> > > pacemaker-schemas-2.0.5-9.el8.noarch
> > > corosync-3.1.0-5.el8.x86_64
> > > pacemaker-2.0.5-9.el8.x86_64
> > >
> > > There are couple of issues that could be related.
> > > 1. There are following messages in the logs coming from 
> > > pacemaker-controld:
> > > Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed 
> > > to receive meta-data for ocf:heartbeat:pgsql-rhino
> > > Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed 
> > > to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > >
> > > 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations which 
> > > causes multiple issues with postgres synchronization during availability 
> > > events.
> > >
> > > 3. Item 2 raises another question. Who is setting these values:
> > > ${OCF_RESKEY_CRM_meta_notify_type}
> > > ${OCF_RESKEY_CRM_meta_notify_operation}
> > >
> > > Here is excerpt from cluster config:
> > >
> > > d19-25-left.lab.archivas.com ~ # pcs config
> > >
> > > Cluster Name:
> > > Corosync Nodes:
> > >  d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> > > Pacemaker Nodes:
> > >  d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
> > >
> > > Resources:
> > >  Clone: postgres-ms
> > >   Meta Attrs: promotable=true target-role=started
> > >   Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
> > >Attributes: master_ip=172.16.1.6 
> > > node_list="d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com" 
> > > pgdata=/pg_data remote_wals_dir=/remote/walarchive rep_mode=sync 
> > > reppassword=XX repuser=XXX 
> > > restore_command="/opt/rhino/sil/bin/script_wrapper.sh wal_restore.py  %f 
> > > %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
> > > xlogs_dir=/pg_data/pg_xlog
> > >Meta Attrs: is-managed=true
> > >Operations: demote interval=0 on-fail=restart timeout=120s 
> > > (postgres-demote-interval-0)
> > >methods interval=0s timeout=5 
> > > (postgres-methods-interval-0s)
> > >monitor interval=10s on-fail=restart timeout=300s 
> > > (postgres-monitor-interval-10s)
> > >monitor interval=5s on-fail=restart role=Master 
> > > timeout=300s (postgres-monitor-interval-5s)
> > >notify interval=0 on-fail=restart timeout=90s 
> > > (postgres-notify-interval-0)
> > >promote interval=0 on-fail=restart timeout=120s 
> > > (postgres-promote-interval-0)
> > >start interval=0 on-fail=restart timeout=1800s 
> > > (postgres-start-interval-0)
> > >stop interval=0 on-fail=fence timeout=120s 
> > > (postgres-stop-interval-0)
> > > Thank you very much!
> > > _Vitaly
>
>
>
> --
> Regards,
>
> Reid Wahl (He/Him), RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA



-- 
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA



Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-04 Thread Reid Wahl
On Sat, Jul 2, 2022 at 1:12 PM vitaly  wrote:
>
> Sorry, I noticed that I am missing meta "notice=true" and after adding it to 
> postgres-ms configuration "notice" events started to come through.
> Item 1 still needs explanation. As pacemaker-controld keeps complaining.

What happens when you run `OCF_ROOT=/usr/lib/ocf
/usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?
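
A quick way to capture the result in a reviewable form (a sketch; xmllint comes with
libxml2, and crm_resource ships with Pacemaker):

    # Run the meta-data action, keeping stdout and stderr separate
    OCF_ROOT=/usr/lib/ocf \
        /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data \
        > /tmp/pgsql-rhino.xml 2> /tmp/pgsql-rhino.err
    echo "exit code: $?"

    # Check that the output is well-formed XML
    xmllint --noout /tmp/pgsql-rhino.xml && echo "metadata parses OK"

    # Cross-check against what Pacemaker's own tooling retrieves for the agent
    crm_resource --show-metadata ocf:heartbeat:pgsql-rhino | head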

> Thanks!
> _Vitaly
>
> > On 07/02/2022 2:04 PM vitaly  wrote:
> >
> >
> > Hello Everybody.
> > I have a 2 node cluster with clone resource “postgres-ms”. We are running 
> > following versions of pacemaker/corosync:
> > d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > pacemaker-libs-2.0.5-9.el8.x86_64
> > pacemaker-cli-2.0.5-9.el8.x86_64
> > corosynclib-3.1.0-5.el8.x86_64
> > pacemaker-schemas-2.0.5-9.el8.noarch
> > corosync-3.1.0-5.el8.x86_64
> > pacemaker-2.0.5-9.el8.x86_64
> >
> > There are couple of issues that could be related.
> > 1. There are following messages in the logs coming from pacemaker-controld:
> > Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed 
> > to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> >
> > 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations which 
> > causes multiple issues with postgres synchronization during availability 
> > events.
> >
> > 3. Item 2 raises another question. Who is setting these values:
> > ${OCF_RESKEY_CRM_meta_notify_type}
> > ${OCF_RESKEY_CRM_meta_notify_operation}
> >
> > Here is excerpt from cluster config:
> >
> > d19-25-left.lab.archivas.com ~ # pcs config
> >
> > Cluster Name:
> > Corosync Nodes:
> >  d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> > Pacemaker Nodes:
> >  d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
> >
> > Resources:
> >  Clone: postgres-ms
> >   Meta Attrs: promotable=true target-role=started
> >   Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
> >Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com 
> > d19-25-right.lab.archivas.com" pgdata=/pg_data 
> > remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX 
> > repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh 
> > wal_restore.py  %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
> > xlogs_dir=/pg_data/pg_xlog
> >Meta Attrs: is-managed=true
> >Operations: demote interval=0 on-fail=restart timeout=120s 
> > (postgres-demote-interval-0)
> >methods interval=0s timeout=5 (postgres-methods-interval-0s)
> >monitor interval=10s on-fail=restart timeout=300s 
> > (postgres-monitor-interval-10s)
> >monitor interval=5s on-fail=restart role=Master timeout=300s 
> > (postgres-monitor-interval-5s)
> >notify interval=0 on-fail=restart timeout=90s 
> > (postgres-notify-interval-0)
> >promote interval=0 on-fail=restart timeout=120s 
> > (postgres-promote-interval-0)
> >start interval=0 on-fail=restart timeout=1800s 
> > (postgres-start-interval-0)
> >stop interval=0 on-fail=fence timeout=120s 
> > (postgres-stop-interval-0)
> > Thank you very much!
> > _Vitaly



-- 
Regards,

Reid Wahl (He/Him), RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA



Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-02 Thread vitaly
Sorry, I noticed that I was missing meta "notice=true"; after adding it to the 
postgres-ms configuration, "notice" events started to come through.
Item 1 still needs an explanation, as pacemaker-controld keeps complaining.
Thanks!
_Vitaly

> On 07/02/2022 2:04 PM vitaly  wrote:
> 
>  
> Hello Everybody.
> I have a 2 node cluster with clone resource “postgres-ms”. We are running 
> following versions of pacemaker/corosync:
> d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> pacemaker-libs-2.0.5-9.el8.x86_64
> pacemaker-cli-2.0.5-9.el8.x86_64
> corosynclib-3.1.0-5.el8.x86_64
> pacemaker-schemas-2.0.5-9.el8.noarch
> corosync-3.1.0-5.el8.x86_64
> pacemaker-2.0.5-9.el8.x86_64
> 
> There are couple of issues that could be related. 
> 1. There are following messages in the logs coming from pacemaker-controld:
> Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to 
> get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> 
> 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations which 
> causes multiple issues with postgres synchronization during availability 
> events. 
> 
> 3. Item 2 raises another question. Who is setting these values:
> ${OCF_RESKEY_CRM_meta_notify_type}
> ${OCF_RESKEY_CRM_meta_notify_operation}
> 
> Here is excerpt from cluster config:
> 
> d19-25-left.lab.archivas.com ~ # pcs config 
> 
> Cluster Name: 
> Corosync Nodes:
>  d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> Pacemaker Nodes:
>  d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
> 
> Resources:
>  Clone: postgres-ms
>   Meta Attrs: promotable=true target-role=started
>   Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
>Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com 
> d19-25-right.lab.archivas.com" pgdata=/pg_data 
> remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX 
> repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh 
> wal_restore.py  %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
> xlogs_dir=/pg_data/pg_xlog
>Meta Attrs: is-managed=true
>Operations: demote interval=0 on-fail=restart timeout=120s 
> (postgres-demote-interval-0)
>methods interval=0s timeout=5 (postgres-methods-interval-0s)
>monitor interval=10s on-fail=restart timeout=300s 
> (postgres-monitor-interval-10s)
>monitor interval=5s on-fail=restart role=Master timeout=300s 
> (postgres-monitor-interval-5s)
>notify interval=0 on-fail=restart timeout=90s 
> (postgres-notify-interval-0)
>promote interval=0 on-fail=restart timeout=120s 
> (postgres-promote-interval-0)
>start interval=0 on-fail=restart timeout=1800s 
> (postgres-start-interval-0)
>stop interval=0 on-fail=fence timeout=120s 
> (postgres-stop-interval-0)
> Thank you very much!
> _Vitaly


[ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-02 Thread vitaly
Hello Everybody.
I have a 2-node cluster with a clone resource “postgres-ms”. We are running the 
following versions of pacemaker/corosync:
d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
pacemaker-cluster-libs-2.0.5-9.el8.x86_64
pacemaker-libs-2.0.5-9.el8.x86_64
pacemaker-cli-2.0.5-9.el8.x86_64
corosynclib-3.1.0-5.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
corosync-3.1.0-5.el8.x86_64
pacemaker-2.0.5-9.el8.x86_64

There are a couple of issues that could be related. 
1. The following messages appear in the logs, coming from pacemaker-controld:
Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to 
receive meta-data for ocf:heartbeat:pgsql-rhino
Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to 
get metadata for postgres (ocf:heartbeat:pgsql-rhino)

2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations, which causes 
multiple issues with postgres synchronization during availability events. 

3. Item 2 raises another question: who is setting these values (see the sketch below for how an agent consumes them):
${OCF_RESKEY_CRM_meta_notify_type}
${OCF_RESKEY_CRM_meta_notify_operation}
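
(For context: these are the notification environment variables Pacemaker passes to a
clone's notify action when notifications are enabled. A notify-capable agent
typically consumes them along the following lines; this is a generic sketch, not the
actual pgsql-rhino code.)

    # Generic shape of an OCF agent's notify handler (sketch only)
    pgsql_notify() {
        local n_type="${OCF_RESKEY_CRM_meta_notify_type}"      # "pre" or "post"
        local n_op="${OCF_RESKEY_CRM_meta_notify_operation}"   # start/stop/promote/demote

        case "${n_type}-${n_op}" in
            pre-promote)
                # e.g. note which instance is about to become the primary
                ;;
            post-promote)
                # e.g. repoint replication at the new primary
                ;;
            post-start|post-stop)
                # e.g. refresh the list of active replicas
                ;;
        esac
        return 0   # OCF_SUCCESS
    }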

Here is an excerpt from the cluster config:

d19-25-left.lab.archivas.com ~ # pcs config 

Cluster Name: 
Corosync Nodes:
 d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
Pacemaker Nodes:
 d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com

Resources:
 Clone: postgres-ms
  Meta Attrs: promotable=true target-role=started
  Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
   Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com 
d19-25-right.lab.archivas.com" pgdata=/pg_data 
remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX 
repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh 
wal_restore.py  %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
xlogs_dir=/pg_data/pg_xlog
   Meta Attrs: is-managed=true
   Operations: demote interval=0 on-fail=restart timeout=120s 
(postgres-demote-interval-0)
   methods interval=0s timeout=5 (postgres-methods-interval-0s)
   monitor interval=10s on-fail=restart timeout=300s 
(postgres-monitor-interval-10s)
   monitor interval=5s on-fail=restart role=Master timeout=300s 
(postgres-monitor-interval-5s)
   notify interval=0 on-fail=restart timeout=90s 
(postgres-notify-interval-0)
   promote interval=0 on-fail=restart timeout=120s 
(postgres-promote-interval-0)
   start interval=0 on-fail=restart timeout=1800s 
(postgres-start-interval-0)
   stop interval=0 on-fail=fence timeout=120s 
(postgres-stop-interval-0)
Thank you very much!
_Vitaly
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/