Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 03:50 +0000, 范国腾 wrote:
> Thank you, Ken.
> 
> We have set the timeout to 10 seconds, but it reports a timeout after
> only 2 seconds. So it seems that setting a higher timeout does not
> help.
> Our application, which is managed by Pacemaker, starts more than 500
> processes when running the performance test. Could that affect the
> result? Which log could help us analyze this?
> 
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

It's not timing out after 2 seconds. The message:

  sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

  sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

indicates that the monitor that timed out had process ID 5606. That
means that there were two separate monitors in progress. I'm not sure
why; I wouldn't expect the second one to be started until after the
first one had timed out. But it's possible with the high load that the
log messages were simply written to the log out of order, since they
were written by different processes.
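
One way to check is to pull the pgsqlms messages out of syslog and
compare PIDs against timestamps. A minimal sketch, assuming syslog-style
timestamps and that the cluster logs to /var/log/messages (adjust the
path for your distribution):

  # Collect every pgsqlms(pgsqld) message, sorted by month/day/time,
  # then check whether two monitor PIDs were active in the same window:
  grep 'pgsqlms(pgsqld)' /var/log/messages | sort -k1,1M -k2,2n -k3,3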

I would just raise the timeout higher than 10s during the test.
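
For example, with pcs (a sketch only; adjust the timeout value to your
test, and keep the interval and role the same so the existing op is
updated rather than a second one added):

  # Raise the slave monitor timeout on the pgsqld primitive to 60s:
  pcs resource update pgsqld op monitor interval=16s role=Slave timeout=60s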

> 
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: January 11, 2018 0:54
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high
> 
> On Wed, 2018-01-10 at 09:40 +0000, 范国腾 wrote:
> > Hello,
> >  
> > This issue only appears when we run a performance test and the CPU
> > load is high. The cluster configuration and log are below. Pacemaker
> > restarts the slave-side pgsql-ha resource about every two minutes.
> >  
> > Take the following scenario for example (when the pgsqlms RA is
> > called, we print the log “execute the command start (command)”; when
> > the command returns, we print the log “execute the command stop
> > (command) (result)”):
> > 1. We can see that Pacemaker calls “pgsqlms monitor” about every
> > 15 seconds, and it returns $OCF_SUCCESS.
> > 2. It calls the monitor command again at 13:56:16, and then reports
> > a timeout error at 13:56:18. Only 2 seconds have passed, but the
> > message says “timeout=1ms”.
> > 3. In other logs, sometimes after 15 minutes, there is no “execute
> > the command start monitor” printed at all, and a timeout error is
> > reported directly.
> >  
> > Could you please tell us how to debug or resolve this issue?
> >  
> > The log:
> >  
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> > Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.77
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
> 
> There's something more going on than in this log snippet. Notice the
> process that timed out (5606) is not one of the processes that logged
> above (5240 and 5477).
> 
> Generally, once load gets that high, it's very difficult to maintain
> responsiveness, and the expectation is that another node will fence
> it.
> But it can often be worked around with high timeouts, and/or you can
> use rules to set higher timeouts or maintenance mode during times
> when high load is expected.
> 
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2

[ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

2018-01-10 Thread 范国腾
Thank you, Ken.

We have set the timeout to 10 seconds, but it reports a timeout after only 2 
seconds. So it seems that setting a higher timeout does not help.
Our application, which is managed by Pacemaker, starts more than 500 processes 
when running the performance test. Could that affect the result? Which log 
could help us analyze this?

> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
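
For reference, one way to double-check the timeout the cluster actually
has configured for that op is to dump the resource definition. A sketch,
assuming the pcs shell is in use:

  # Show the pgsqld primitive, including its operations and timeouts
  # (newer pcs versions use "pcs resource config" instead of "show"):
  pcs resource show pgsqld

  # Or query the raw op definitions straight from the CIB:
  cibadmin --query --xpath "//primitive[@id='pgsqld']//op"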

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: January 11, 2018 0:54
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high

On Wed, 2018-01-10 at 09:40 +0000, 范国腾 wrote:
> Hello,
>  
> This issue only appears when we run a performance test and the CPU
> load is high. The cluster configuration and log are below. Pacemaker
> restarts the slave-side pgsql-ha resource about every two minutes.
>  
> Take the following scenario for example (when the pgsqlms RA is
> called, we print the log “execute the command start (command)”; when
> the command returns, we print the log “execute the command stop
> (command) (result)”):
> 1. We can see that Pacemaker calls “pgsqlms monitor” about every
> 15 seconds, and it returns $OCF_SUCCESS.
> 2. It calls the monitor command again at 13:56:16, and then reports a
> timeout error at 13:56:18. Only 2 seconds have passed, but the message
> says “timeout=1ms”.
> 3. In other logs, sometimes after 15 minutes, there is no “execute
> the command start monitor” printed at all, and a timeout error is
> reported directly.
>  
> Could you please tell us how to debug or resolve this issue?
>  
> The log:
>  
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.77
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

There's something more going on than in this log snippet. Notice the process 
that timed out (5606) is not one of the processes that logged above (5240 and 
5477).

Generally, once load gets that high, it's very difficult to maintain 
responsiveness, and the expectation is that another node will fence it.
But it can often be worked around with high timeouts, and/or you can use rules 
to set higher timeouts or maintenance mode during times when high load is 
expected.
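
For example, maintenance mode can be toggled around the test window (a
sketch using pcs; while it is set, resources keep running but Pacemaker
stops monitoring and recovering them, so use it only if that is
acceptable for the test):

  # Before the performance test:
  pcs property set maintenance-mode=true

  # ... run the test ...

  # Re-enable management afterward:
  pcs property set maintenance-mode=false

A time-based rule on operation defaults might look like the CIB snippet
below (a sketch only; the IDs and the 01:00-03:59 window are made up,
and op_defaults only applies to operations that do not set their own
timeout, so an explicit timeout=10s on the monitor op would still win):

  <op_defaults>
    <meta_attributes id="op-defaults-test-window">
      <rule id="op-defaults-test-window-rule" score="INFINITY">
        <date_expression id="op-defaults-test-window-date"
                         operation="date_spec">
          <date_spec id="op-defaults-test-window-spec" hours="1-3"/>
        </date_expression>
      </rule>
      <nvpair id="op-defaults-test-window-timeout"
              name="timeout" value="120s"/>
    </meta_attributes>
  </op_defaults>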

> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>  
>  
> The Cluster Configuration:
> 2 nodes and 13 resources configured
>  
> Online: [ db1 db2 ]
>  
> Full list of resources:
>  
> Clone Set: dlm-clone [dlm]
>  Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>  Started: [ db1 db2 ]
> ipmi_node1 (stonith:fence_ipmilan):    Started db2
> ipmi_node2 (stonith:fence_ipmilan):    Started db1
> Clone Set: clusterfs-clone [clusterfs]
>  Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ db1 ]
>  Slaves: [ db2 ]
> Resource Group: mastergroup
>  db1-vip    (ocf::heartbeat:IPaddr2):   Started
>  rep-vip    (ocf::heartbeat:IPaddr2):   Started
> Resource Group: