Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-07 Thread Klaus Wenninger

On 6/2/21 10:39 PM, Eric Robinson wrote:

-Original Message-
From: Users  On Behalf Of Andrei
Borzenkov
Sent: Tuesday, June 1, 2021 12:52 PM
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

On 01.06.2021 19:21, Eric Robinson wrote:

-Original Message-
From: Users  On Behalf Of Klaus
Wenninger
Sent: Monday, May 31, 2021 12:54 AM
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

On 5/29/21 12:21 AM, Strahil Nikolov wrote:

I agree -> fencing is mandatory.

Agreed that with proper fencing setup the cluster wouldn't have run
into that state.
But still it might be interesting to find out what has happened.

Thank you for looking past the fencing issue to the real question.

Regardless of whether or not fencing was enabled, there should still be some
indication of what actions the cluster took and why, but it appears that
cluster services just terminated silently.

Not seeing anything in the log snippet either.

Me neither.


Assuming you are running something systemd-based. CentOS 7.

Yes. CentOS Linux release 7.5.1804.


Did you check the journal for pacemaker to see what systemd is thinking?
With the standard unit-file systemd should observe pacemakerd and
restart it if it goes away ungracefully.

The only log entry showing Pacemaker startup that I found in any of the
messages files (current and several days of history) was the one when I
started the cluster manually (see below).

I guess whether systemd logs the detection of a stopped service
to any file is configuration-dependent.
Was more thinking of 'systemctl status pacemaker' or
'journalctl -u pacemaker'.
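
For example (a minimal sketch, assuming the stock CentOS 7 unit name and
using the time window from the logs quoted later in this thread):

  # What systemd last saw for the service
  systemctl status pacemaker

  # Unit history from the journal, narrowed to when the logs went quiet
  journalctl -u pacemaker --since "2021-05-27 09:30" --until "2021-05-27 12:00"

  # The unit's restart policy; the standard unit file sets Restart=on-failure,
  # so an ungraceful death should normally trigger a restart
  systemctl show pacemaker -p Restart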

If cluster processes stopped or crashed you obviously won't see any logs
from them until they are restarted. You need to look at other system logs -
maybe they record something unusual around this time? Any crash dumps?

The messages log shows continued entries for various pacemaker components, as 
mentioned in a previous email. Could not find any crash dumps.
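
A few places worth checking for signs of an external kill around 09:40
(a sketch assuming CentOS 7 defaults, i.e. abrt as the crash catcher and
syslog writing to /var/log/messages):

  # OOM killer or segfaults around the time the logs stopped
  grep -iE 'out of memory|oom-killer|segfault' /var/log/messages

  # Crashes caught by abrt, if it is installed
  abrt-cli list

  # The kernel ring buffer sometimes still holds OOM traces
  dmesg | grep -i 'killed process'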





Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-02 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Tuesday, June 1, 2021 12:52 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> On 01.06.2021 19:21, Eric Robinson wrote:
> >
> >> -Original Message-
> >> From: Users  On Behalf Of Klaus
> >> Wenninger
> >> Sent: Monday, May 31, 2021 12:54 AM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> >>
> >> On 5/29/21 12:21 AM, Strahil Nikolov wrote:
> >>> I agree -> fencing is mandatory.
> >> Agreed that with proper fencing setup the cluster wouldn't have run
> >> into that state.
> >> But still it might be interesting to find out what has happened.
> >
> > Thank you for looking past the fencing issue to the real question.
> Regardless of whether or not fencing was enabled, there should still be some
> indication of what actions the cluster took and why, but it appears that
> cluster services just terminated silently.
> >
> >> Not seeing anything in the log snippet either.
> >
> > Me neither.
> >
> >> Assuming you are running something systemd-based. CentOS 7.
> >
> > Yes. CentOS Linux release 7.5.1804.
> >
> >> Did you check the journal for pacemaker to see what systemd is thinking?
> >> With the standard unit-file systemd should observe pacemakerd and
> >> restart it if it goes away ungracefully.
> >
> > The only log entry showing Pacemaker startup that I found in any of the
> messages files (current and several days of history) was the one when I
> started the cluster manually (see below).
> >
>
> If cluster processes stopped or crashed you obviously won't see any logs
> from them until they are restarted. You need to look at other system logs -
> maybe they record something unusual around this time? Any crash dumps?

The messages log shows continued entries for various pacemaker components, as 
mentioned in a previous email. Could not find any crash dumps.



Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-01 Thread Strahil Nikolov
Did you configure the Pacemaker blackbox?
If not, it could be valuable in cases like this.
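
For reference, a sketch of how it is typically enabled on this Pacemaker
generation (the pid and dump file name below are illustrative, not from
this thread):

  # In /etc/sysconfig/pacemaker, then restart the cluster stack:
  PCMK_blackbox=yes

  # Ask a running daemon to dump its in-memory trace on demand:
  kill -TRAP <pid-of-daemon>

  # Decode a dump with libqb's tool:
  qb-blackbox /var/lib/pacemaker/blackbox/crmd-92171.1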
Also consider updating as soon as possible. Nobody can easily count the bug
fixes introduced between 7.5 and 7.9, nor will anyone be able to help while
you are running such an outdated version (even by RH standards).
Best Regards,
Strahil Nikolov


Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-01 Thread Andrei Borzenkov
On 01.06.2021 19:21, Eric Robinson wrote:
> 
>> -Original Message-
>> From: Users  On Behalf Of Klaus
>> Wenninger
>> Sent: Monday, May 31, 2021 12:54 AM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>>
>> On 5/29/21 12:21 AM, Strahil Nikolov wrote:
>>> I agree -> fencing is mandatory.
>> Agreed that with proper fencing setup the cluster wouldn't have run into that
>> state.
>> But still it might be interesting to find out what has happened.
> 
> Thank you for looking past the fencing issue to the real question. Regardless 
> of whether or not fencing was enabled, there should still be some indication 
> of what actions the cluster took and why, but it appears that cluster 
> services just terminated silently.
> 
>> Not seeing anything in the log snippet either.
> 
> Me neither.
> 
>> Assuming you are running something systemd-based. CentOS 7.
> 
> Yes. CentOS Linux release 7.5.1804.
> 
>> Did you check the journal for pacemaker to see what systemd is thinking?
>> With the standard unit-file systemd should observe pacemakerd and restart
>> it if it goes away ungracefully.
> 
> The only log entry showing Pacemaker startup that I found in any of the 
> messages files (current and several days of history) was the one when I 
> started the cluster manually (see below).
> 

If cluster processes stopped or crashed you obviously won't see any logs
from them until they are restarted. You need to look at other system
logs - maybe they record something unusual around this time? Any crash
dumps?


Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-06-01 Thread Eric Robinson

> -Original Message-
> From: Users  On Behalf Of Klaus
> Wenninger
> Sent: Monday, May 31, 2021 12:54 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> On 5/29/21 12:21 AM, Strahil Nikolov wrote:
> > I agree -> fencing is mandatory.
> Agreed that with proper fencing setup the cluster wouldn't have run into that
> state.
> But still it might be interesting to find out what has happened.

Thank you for looking past the fencing issue to the real question. Regardless 
of whether or not fencing was enabled, there should still be some indication of 
what actions the cluster took and why, but it appears that cluster services 
just terminated silently.

> Not seeing anything in the log snippet either.

Me neither.

> Assuming you are running something systemd-based. CentOS 7.

Yes. CentOS Linux release 7.5.1804.

> Did you check the journal for pacemaker to see what systemd is thinking?
> With the standard unit-file systemd should observe pacemakerd and restart
> it if it goes away ungracefully.

The only log entry showing Pacemaker startup that I found in any of the 
messages files (current and several days of history) was the one when I started 
the cluster manually (see below).


> You should be able to test this behavior sending a SIGKILL to pacemakerd.
> pacemakerd in turn watches out for signals from the sub-daemons it has
> spawned (I'm currently working on more in-depth observation here).
> So just disappearing shouldn't happen that easily.

Agreed. From looking at the cluster log, it appears that it just stopped making
log entries at 9:40. Then at 11:49, I found the cluster services stopped and
started them...

May 27 09:25:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
May 27 09:25:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
May 27 09:25:31 [92171] 001store01a   crmd:   notice: run_graph:
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
May 27 09:25:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
May 27 09:25:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd
May 27 09:40:31 [92171] 001store01a   crmd: info: crm_timer_popped: 
PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped
May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 27 09:40:31 [92170] 001store01apengine: info: process_pe_message:   
Input has not changed since last time, not saving to disk
May 27 09:40:31 [92170] 001store01apengine: info: 
determine_online_status:  Node 001store01a is online
May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a
May 27 09:40:31 [92170] 001store01apengine:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)
May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed
May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_vip_ftpclust01(ocf::heartbeat:IPaddr2):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_replicator(systemd:pure-replicator):  Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-etls(systemd:pure-ftpd-etls):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-itls(systemd:pure-ftpd-itls):   Started 001store01a
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_vip_ftpclust01(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_replicator(Started 001store01a)
May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-etls(Started 001store01a)

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-30 Thread Klaus Wenninger

On 5/29/21 12:21 AM, Strahil Nikolov wrote:

I agree -> fencing is mandatory.

Agreed that with proper fencing setup the cluster
wouldn't have run into that state.
But still it might be interesting to find out what has
happened. Not seeing anything in the log snippet either.
Assuming you are running something systemd-based.
Did you check the journal for pacemaker to see what
systemd is thinking?
With the standard unit-file systemd should observe
pacemakerd and restart it if it goes away ungracefully.
You should be able to test this behavior sending a
SIGKILL to pacemakerd.
pacemakerd in turn watches out for signals from the
sub-daemons it has spawned (I'm currently working
on more in-depth observation here).
So just disappearing shouldn't happen that easily.
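
A minimal way to run that test (assuming the standard unit file, which
sets Restart=on-failure):

  # Simulate an ungraceful death of the main daemon
  pkill -KILL pacemakerd

  # systemd should log the kill and restart the unit; verify with:
  journalctl -u pacemaker -n 50
  systemctl status pacemaker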
Did you find any core-dumps?

Regards,
Klaus


You can enable the debug logs by editing corosync.conf or 
/etc/sysconfig/pacemaker.


In case simple reload doesn't work, you can set the cluster in global 
maintenance, stop and then start the stack.



Best Regards,
Strahil Nikolov

On Fri, May 28, 2021 at 22:13, Digimer wrote:
On 2021-05-28 3:08 p.m., Eric Robinson wrote:
>
>> -Original Message-
>> From: Digimer <li...@alteeve.ca>
>> Sent: Friday, May 28, 2021 12:43 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users@clusterlabs.org>; Eric Robinson <eric.robin...@psmnv.com>; Strahil
>> Nikolov <hunter86...@yahoo.com>
>> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>>
>> Shared storage is not what triggers the need for fencing. Coordinating
>> actions is what triggers the need. Specifically: if you can run a resource
>> on both/all nodes at the same time, you don't need HA. If you can't, you
>> need fencing.
>>
>> Digimer
>
> Thanks. That said, there is no fencing, so any thoughts on why the node
> behaved the way it did?

Without fencing, when a communication or membership issue arises, it's
hard to predict what will happen.

I don't see anything in the short log snippet to indicate what
happened.
What's on the other node during the event? When did the node disappear
and when was it rejoined, to help find relevant log entries?

Going forward, if you want predictable and reliable operation, implement
fencing asap. Fencing is required.


-- 
Digimer

Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal
talent
have lived and died in cotton fields and sweatshops." - Stephen
Jay Gould




Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Strahil Nikolov
I agree -> fencing is mandatory.
You can enable the debug logs by editing corosync.conf or 
/etc/sysconfig/pacemaker.
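
A sketch of both knobs (values here are illustrative; PCMK_debug also
accepts a comma-separated list of daemon names):

  # /etc/sysconfig/pacemaker
  PCMK_debug=yes

  # corosync.conf
  logging {
      to_syslog: yes
      debug: on
  }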
In case simple reload doesn't work, you can set the cluster in global 
maintenance, stop and then start the stack.
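
With the pcs tooling used elsewhere in this thread, that would look
roughly like:

  # Keep resources running unmanaged while the stack is bounced
  pcs property set maintenance-mode=true

  # Restart the cluster stack on this node
  pcs cluster stop && pcs cluster start

  # Hand control back once everything looks sane
  pcs property set maintenance-mode=false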

Best Regards,
Strahil Nikolov

On Fri, May 28, 2021 at 22:13, Digimer wrote:
On 2021-05-28 3:08 p.m., Eric Robinson wrote:
> 
>> -Original Message-
>> From: Digimer 
>> Sent: Friday, May 28, 2021 12:43 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> ; Eric Robinson ; Strahil
>> Nikolov 
>> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>>
>> Shared storage is not what triggers the need for fencing. Coordinating
>> actions is what triggers the need. Specifically: if you can run a resource
>> on both/all nodes at the same time, you don't need HA. If you can't, you
>> need fencing.
>>
>> Digimer
> 
> Thanks. That said, there is no fencing, so any thoughts on why the node 
> behaved the way it did?

Without fencing, when a communication or membership issue arises, it's
hard to predict what will happen.

I don't see anything in the short log snippet to indicate what happened.
What's on the other node during the event? When did the node disappear
and when was it rejoined, to help find relevant log entries?

Going forward, if you want predictable and reliable operation, implement
fencing asap. Fencing is required.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
  


Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Digimer
On 2021-05-28 3:08 p.m., Eric Robinson wrote:
> 
>> -Original Message-
>> From: Digimer 
>> Sent: Friday, May 28, 2021 12:43 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> ; Eric Robinson ; Strahil
>> Nikolov 
>> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>>
>> Shared storage is not what triggers the need for fencing. Coordinating
>> actions is what triggers the need. Specifically: if you can run a resource
>> on both/all nodes at the same time, you don't need HA. If you can't, you
>> need fencing.
>>
>> Digimer
> 
> Thanks. That said, there is no fencing, so any thoughts on why the node 
> behaved the way it did?

Without fencing, when a communication or membership issue arises, it's
hard to predict what will happen.

I don't see anything in the short log snippet to indicate what happened.
What's on the other node during the event? When did the node disappear
and when was it rejoined, to help find relevant log entries?

Going forward, if you want predictable and reliable operation, implement
fencing asap. Fencing is required.
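
To make that concrete, a sketch of a single IPMI-based stonith device
(agent parameters, addresses, and credentials below are placeholders, not
taken from this thread):

  pcs stonith create fence_001store01a fence_ipmilan \
      pcmk_host_list=001store01a ipaddr=10.0.0.10 \
      login=admin passwd=secret lanplus=1 \
      op monitor interval=60s

  pcs property set stonith-enabled=true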

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Strahil Nikolov
What is your fencing agent?
Best Regards,
Strahil Nikolov

On Thu, May 27, 2021 at 20:52, Eric Robinson wrote:
We found one of our cluster nodes down this morning. The server was up but 
cluster services were not running. Upon examination of the logs, we found that 
the cluster just stopped around 9:40:31 and then I started it up manually (pcs 
cluster start) at 11:49:48. I can’t imagine that Pacemaker just randomly 
terminates. Any thoughts why it would behave this way?
 
  
 
  
 
May 27 09:25:31 [92170] 001store01a    pengine:   notice: process_pe_message:   
Calculated transition 91482, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2
 
May 27 09:25:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
 
May 27 09:25:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
 
May 27 09:25:31 [92171] 001store01a   crmd:   notice: run_graph:    
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
 
May 27 09:25:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
 
May 27 09:25:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd
 
May 27 09:40:31 [92171] 001store01a   crmd: info: crm_timer_popped: 
PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
 
May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped
 
May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
 
May 27 09:40:31 [92170] 001store01a    pengine: info: process_pe_message:   
Input has not changed since last time, not saving to disk
 
May 27 09:40:31 [92170] 001store01a    pengine: info: 
determine_online_status:  Node 001store01a is online
 
May 27 09:40:31 [92170] 001store01a    pengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine:  warning: 
unpack_rsc_op_failure:    Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)
 
May 27 09:40:31 [92170] 001store01a    pengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine: info: unpack_node_loop: 
Node 1 is already processed
 
May 27 09:40:31 [92170] 001store01a    pengine: info: unpack_node_loop: 
Node 1 is already processed
 
May 27 09:40:31 [92170] 001store01a    pengine: info: common_print: 
p_vip_ftpclust01    (ocf::heartbeat:IPaddr2):   Started 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine: info: common_print: 
p_replicator    (systemd:pure-replicator):      Started 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine: info: common_print: 
p_pure-ftpd-etls    (systemd:pure-ftpd-etls):   Started 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine: info: common_print: 
p_pure-ftpd-itls    (systemd:pure-ftpd-itls):   Started 001store01a
 
May 27 09:40:31 [92170] 001store01a    pengine: info: LogActions:   Leave   
p_vip_ftpclust01    (Started 001store01a)
 
May 27 09:40:31 [92170] 001store01a    pengine: info: LogActions:   Leave   
p_replicator    (Started 001store01a)
 
May 27 09:40:31 [92170] 001store01a    pengine: info: LogActions:   Leave   
p_pure-ftpd-etls    (Started 001store01a)
 
May 27 09:40:31 [92170] 001store01a    pengine: info: LogActions:   Leave   
p_pure-ftpd-itls    (Started 001store01a)
 
May 27 09:40:31 [92170] 001store01a    pengine:   notice: process_pe_message:   
Calculated transition 91483, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2
 
May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response
 
May 27 09:40:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91483 (ref=pe_calc-dc-1622122831-124397) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2
 
May 27 09:40:31 [92171] 001store01a   crmd:   notice: run_graph:    
Transition 91483 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
 
May 27 09:40:31 [92171] 001store01a   crmd: 

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Eric Robinson

> -Original Message-
> From: Digimer 
> Sent: Friday, May 28, 2021 12:43 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Eric Robinson ; Strahil
> Nikolov 
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
>
> Shared storage is not what triggers the need for fencing. Coordinating actions
> is what triggers the need. Specifically: if you can run a resource on both/all
> nodes at the same time, you don't need HA. If you can't, you need fencing.
>
> Digimer

Thanks. That said, there is no fencing, so any thoughts on why the node behaved 
the way it did?

>
> On 2021-05-28 1:19 p.m., Eric Robinson wrote:
> > There is no fencing agent on this cluster and no shared storage.
> >
> > -Eric
> >
> > From: Strahil Nikolov 
> > Sent: Friday, May 28, 2021 10:08 AM
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed ; Eric Robinson
> > 
> > Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> >
> > What is your fencing agent?
> >
> > Best Regards,
> >
> > Strahil Nikolov
> >
> > On Thu, May 27, 2021 at 20:52, Eric Robinson
> > <eric.robin...@psmnv.com> wrote:
> >
> > We found one of our cluster nodes down this morning. The server was
> > up but cluster services were not running. Upon examination of the
> > logs, we found that the cluster just stopped around 9:40:31 and then
> > I started it up manually (pcs cluster start) at 11:49:48. I can’t
> > imagine that Pacemaker just randomly terminates. Any thoughts why it
> > would behave this way?
> >
> >
> >
> >
> >
> > May 27 09:25:31 [92170] 001store01apengine:   notice:
> > process_pe_message:   Calculated transition 91482, saving inputs in
> > /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_state_transition:  State transition S_POLICY_ENGINE ->
> > S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> > origin=handle_response
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_te_invoke: Processing graph 91482
> > (ref=pe_calc-dc-1622121931-124396) derived from
> > /var/lib/pacemaker/pengine/pe-input-756.bz2
> >
> > May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> > run_graph:Transition 91482 (Complete=0, Pending=0, Fired=0,
> > Skipped=0, Incomplete=0,
> > Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
> >
> > May 27 09:25:31 [92171] 001store01a   crmd: info:
> > do_log:   Input I_TE_SUCCESS received in state
> > S_TRANSITION_ENGINE from notify_crmd
> >
> > May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> > do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE
> > | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> >
> > May 27 09:40:31 [92171] 001store01a   crmd: info:
> > crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped
> > (900000ms)
> >
> > May 27 09:40:31 [92171] 001store01a   crmd:   notice:
> > do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE |
> > input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
> >
> > May 27 09:40:31 [92171] 001store01a   crmd: info:
> > do_state_transition:  Progressed to state S_POLICY_ENGINE after
> > C_TIMER_POPPED
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > process_pe_message:   Input has not changed since last time, not
> > saving to disk
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_online_status:  Node 001store01a is online
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_op_status:  Operation monitor found resource
> > p_pure-ftpd-itls active on 001store01a
> >
> > May 27 09:40:31 [92170] 001store01apengine:  warning:
> > unpack_rsc_op_failure:Processing failed op monitor for
> > p_vip_ftpclust01 on 001store01a: unknown error (1)
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > determine_op_status:  Operation monitor found resource
> > p_pure-ftpd-etls active on 001store01a
> >
> > May 27 09:40:31 [92170] 001store01apengine: info:
> > unpack_node_loop: Node 1 is already processed

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Digimer
Shared storage is not what triggers the need for fencing. Coordinating
actions is what triggers the need. Specifically: if you can run a resource
on both/all nodes at the same time, you don't need HA. If you can't, you
need fencing.

digimer

On 2021-05-28 1:19 p.m., Eric Robinson wrote:
> There is no fencing agent on this cluster and no shared storage.
> 
> -Eric
> 
> From: Strahil Nikolov 
> Sent: Friday, May 28, 2021 10:08 AM
> To: Cluster Labs - All topics related to open-source clustering
> welcomed ; Eric Robinson 
> Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?
> 
> What is your fencing agent?
> 
> Best Regards,
> 
> Strahil Nikolov
> 
> On Thu, May 27, 2021 at 20:52, Eric Robinson
> <eric.robin...@psmnv.com> wrote:
> 
> We found one of our cluster nodes down this morning. The server was
> up but cluster services were not running. Upon examination of the
> logs, we found that the cluster just stopped around 9:40:31 and then
> I started it up manually (pcs cluster start) at 11:49:48. I can’t
> imagine that Pacemaker just randomly terminates. Any thoughts why it
> would behave this way?
> 
>  
> 
>  
> 
> May 27 09:25:31 [92170] 001store01a    pengine:   notice:
> process_pe_message:   Calculated transition 91482, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-756.bz2
> 
> May 27 09:25:31 [92171] 001store01a   crmd: info:
> do_state_transition:  State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response
> 
> May 27 09:25:31 [92171] 001store01a   crmd: info:
> do_te_invoke: Processing graph 91482
> (ref=pe_calc-dc-1622121931-124396) derived from
> /var/lib/pacemaker/pengine/pe-input-756.bz2
> 
> May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> run_graph:    Transition 91482 (Complete=0, Pending=0, Fired=0,
> Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete
> 
> May 27 09:25:31 [92171] 001store01a   crmd: info:
> do_log:   Input I_TE_SUCCESS received in state
> S_TRANSITION_ENGINE from notify_crmd
> 
> May 27 09:25:31 [92171] 001store01a   crmd:   notice:
> do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE
> | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> 
> May 27 09:40:31 [92171] 001store01a   crmd: info:
> crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped
> (900000ms)
> 
> May 27 09:40:31 [92171] 001store01a   crmd:   notice:
> do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE |
> input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
> 
> May 27 09:40:31 [92171] 001store01a   crmd: info:
> do_state_transition:  Progressed to state S_POLICY_ENGINE after
> C_TIMER_POPPED
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> process_pe_message:   Input has not changed since last time, not
> saving to disk
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> determine_online_status:  Node 001store01a is online
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> determine_op_status:  Operation monitor found resource
> p_pure-ftpd-itls active on 001store01a
> 
> May 27 09:40:31 [92170] 001store01a    pengine:  warning:
> unpack_rsc_op_failure:    Processing failed op monitor for
> p_vip_ftpclust01 on 001store01a: unknown error (1)
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> determine_op_status:  Operation monitor found resource
> p_pure-ftpd-etls active on 001store01a
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> unpack_node_loop: Node 1 is already processed
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> unpack_node_loop: Node 1 is already processed
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> common_print: p_vip_ftpclust01   
> (ocf::heartbeat:IPaddr2):   Started 001store01a
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> common_print: p_replicator    (systemd:pure-replicator):  
>    Started 001store01a
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> common_print: p_pure-ftpd-etls   
> (systemd:pure-ftpd-etls):   Started 001store01a
> 
> May 27 09:40:31 [92170] 001store01a    pengine: info:
> common_print: p_pure-ftpd-itls   
> (systemd:pure-ftpd-itls):   Started 001store01a

Re: [ClusterLabs] Cluster Stopped, No Messages?

2021-05-28 Thread Eric Robinson
There is no fencing agent on this cluster and no shared storage.

-Eric


From: Strahil Nikolov 
Sent: Friday, May 28, 2021 10:08 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Eric Robinson 
Subject: Re: [ClusterLabs] Cluster Stopped, No Messages?

What is your fencing agent?

Best Regards,
Strahil Nikolov
On Thu, May 27, 2021 at 20:52, Eric Robinson
<eric.robin...@psmnv.com> wrote:

We found one of our cluster nodes down this morning. The server was up but 
cluster services were not running. Upon examination of the logs, we found that 
the cluster just stopped around 9:40:31 and then I started it up manually (pcs 
cluster start) at 11:49:48. I can’t imagine that Pacemaker just randomly 
terminates. Any thoughts why it would behave this way?





May 27 09:25:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91482, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:25:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response

May 27 09:25:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91482 (ref=pe_calc-dc-1622121931-124396) derived from 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:25:31 [92171] 001store01a   crmd:   notice: run_graph:
Transition 91482 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-756.bz2): Complete

May 27 09:25:31 [92171] 001store01a   crmd: info: do_log:   Input 
I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd

May 27 09:25:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd

May 27 09:40:31 [92171] 001store01a   crmd: info: crm_timer_popped: 
PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)

May 27 09:40:31 [92171] 001store01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_TIMER_POPPED origin=crm_timer_popped

May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED

May 27 09:40:31 [92170] 001store01apengine: info: process_pe_message:   
Input has not changed since last time, not saving to disk

May 27 09:40:31 [92170] 001store01apengine: info: 
determine_online_status:  Node 001store01a is online

May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-itls active on 001store01a

May 27 09:40:31 [92170] 001store01apengine:  warning: 
unpack_rsc_op_failure:Processing failed op monitor for p_vip_ftpclust01 
on 001store01a: unknown error (1)

May 27 09:40:31 [92170] 001store01apengine: info: determine_op_status:  
Operation monitor found resource p_pure-ftpd-etls active on 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed

May 27 09:40:31 [92170] 001store01apengine: info: unpack_node_loop: 
Node 1 is already processed

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_vip_ftpclust01(ocf::heartbeat:IPaddr2):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_replicator(systemd:pure-replicator):  Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-etls(systemd:pure-ftpd-etls):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: common_print: 
p_pure-ftpd-itls(systemd:pure-ftpd-itls):   Started 001store01a

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_vip_ftpclust01(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_replicator(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-etls(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine: info: LogActions:   Leave   
p_pure-ftpd-itls(Started 001store01a)

May 27 09:40:31 [92170] 001store01apengine:   notice: process_pe_message:   
Calculated transition 91483, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-756.bz2

May 27 09:40:31 [92171] 001store01a   crmd: info: do_state_transition:  
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response

May 27 09:40:31 [92171] 001store01a   crmd: info: do_te_invoke: 
Processing graph 91483 (ref=pe_calc-dc-1622122831-124397) derived from 
/var/lib/pacemaker/pe