[ClusterLabs] Serialize and symmetrical=true does not work together

2020-03-16 Thread Dileep V Nair


Hi,

I have pacemaker clusters which were working fine with Serialize ordering and
symmetrical=true together. After an upgrade, I see that Serialize does not
work with symmetrical=true. How do I make sure the resources in a Serialize
ordering stop in the reverse order of their start ? All the clusters are
already in production. Please help.
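
For context, here is a minimal sketch of the kind of constraint being
discussed. The resource names are hypothetical and crmsh syntax can vary
slightly between versions:

  # two hypothetical resources started one after the other, never in parallel
  crm configure order serial-start Serialize: rscA rscB symmetrical=true
  # with symmetrical=true the stop sequence is expected to run in reverse
  # (rscB stopped before rscA)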

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Safe way to stop pacemaker on both nodes of a two node cluster

2019-10-20 Thread Dileep V Nair

Hi,

I am confused about the best way to stop pacemaker on both nodes of a
two node cluster. The options I know of are:
1. Put the cluster in maintenance mode, stop the applications manually and
then stop pacemaker on both nodes (see the sketch after this list). For this
I need the applications to be stopped manually.
2. Stop pacemaker on one node, wait for all resources to come up on the second
node, then stop pacemaker on the second node. This might cause a significant
delay because all resources have to come up on the second node.
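
A rough sketch of option 1, assuming crmsh and systemd (adjust to your own
stack):

  crm configure property maintenance-mode=true   # cluster stops managing resources
  # ... stop the applications manually on both nodes ...
  systemctl stop pacemaker    # run on each node
  systemctl stop corosync     # run on each node
  # newer tooling may also offer 'crm cluster stop' or 'pcs cluster stop --all'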

Is there any other way to stop pacemaker on both nodes gracefully ?
Thanks in advance.

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Strange behaviour of group resource

2019-08-01 Thread Dileep V Nair

I see this behavior on Pacemaker 1.1.15 and do not see it on
1.1.15.

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services




From:   Ken Gaillot 
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Date:   07/30/2019 08:03 PM
Subject:[EXTERNAL] Re: [ClusterLabs] Strange behaviour of group
resource
Sent by:"Users" 



On Tue, 2019-07-30 at 16:26 +0530, Dileep V Nair wrote:
> Thanks Ken for the response. I see the below errors. Not sure why it says
> target: 7 vs. rc: 0. Does that mean that pacemaker expects the
> resource to be stopped and, since it is running, it is taking an
> action ?
>
> Jul 30 10:08:59 dntstdb2s0703 cib[90848]: warning: A-Sync reply to
> crmd failed: No message of desired type
>
> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: warning: Action 16 (fs-
> sapdata4_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
> Error
> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: notice: Transition 1445
> aborted by operation fs-sapdata4_monitor_0 'modify' on dntstdb2s0703:
> Event failed

These actually aren't errors, and they're expected after a clean-up. I
recently merged a change to make the message more accurate. As of the
next release, it will look like:

notice: Transition 1445 action 5 (fs-sapdata4_monitor_0 on dntstdb2s0703):
expected 'not running' but got 'ok'

Cleaning up a resource involves clearing its history. That makes the
cluster expect that it is stopped. The cluster then runs probes to find
out the actual status, and if the probe finds it running, the above
situation happens.

So, that's not causing the restarts. An actual failure that could cause
restarts would have a similar message, but the rc would be something
other than 0 or 7.
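
As an illustration only (resource name hypothetical, log path depends on the
distribution), the return codes and operation history behind such messages
can be inspected with:

  # 0 = OCF_SUCCESS (running), 7 = OCF_NOT_RUNNING (stopped)
  crm_resource --resource fs-sapdata4 --list-operations
  grep 'fs-sapdata4_monitor_0' /var/log/messages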

> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: warning: Action 16 (fs-
> sapdata4_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
> Error
> Jul 30 10:09:04 dntstdb2s0703 stonith-ng[90849]: notice: On loss of
> CCM Quorum: Ignore
> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: notice: Result of probe
> operation for fs-saptmp3 on dntstdb2s0703: 0 (ok)
> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: warning: Action 19 (fs-
> saptmp3_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
> Error
> Jul 30 10:09:04 dntstdb2s0703 crmd[90853]: warning: Action 19 (fs-
> saptmp3_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
> Error
>
> Thanks & Regards
>
> Dileep Nair
> Squad Lead - SAP Base
> Togaf Certified Enterprise Architect
> IBM Services for Managed Applications
> +91 98450 22258 Mobile
> dilen...@in.ibm.com
>
> IBM Services
>
>
>
> From: Ken Gaillot 
> To: Cluster Labs - All topics related to open-source clustering
> welcomed 
> Date: 07/30/2019 12:47 AM
> Subject: [EXTERNAL] Re: [ClusterLabs] Strange behaviour of group
> resource
> Sent by: "Users" 
>
>
>
> On Thu, 2019-07-25 at 20:51 +0530, Dileep V Nair wrote:
> > Hi,
> >
> > I have around 10 filesystems in a group. When I do a crm resource
> > refresh, the filesystems are unmounted and remounted, starting from
> > the fourth resource in the group. Any idea what could be going on,
> is
> > it expected ?
>
> No, it sounds like some of the reprobes are failing. The logs may
> have
> more info. Each filesystem will have a probe like RSCNAME_monitor_0
> on
> each node.
>
> >
> > Thanks & Regards
> >
> > Dileep Nair
> > Squad Lead - SAP Base
> > Togaf Certified Enterprise Architect
> > IBM Services for Managed Applications
> > +91 98450 22258 Mobile
> > dilen...@in.ibm.com
> >
> > IBM Services
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

--
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

Re: [ClusterLabs] Strange behaviour of group resource

2019-07-30 Thread Dileep V Nair

Thanks Ken for the response. I see the below errors. Not sure why it says
target: 7 vs. rc: 0. Does that mean that pacemaker expects the resource to
be stopped and, since it is running, it is taking an action ?

Jul 30 10:08:59 dntstdb2s0703 cib[90848]:  warning: A-Sync reply to crmd
failed: No message of desired type

Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:  warning: Action 16
(fs-sapdata4_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
Error
Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:   notice: Transition 1445
aborted by operation fs-sapdata4_monitor_0 'modify' on dntstdb2s0703: Event
failed
Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:  warning: Action 16
(fs-sapdata4_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0):
Error
Jul 30 10:09:04 dntstdb2s0703 stonith-ng[90849]:   notice: On loss of CCM
Quorum: Ignore
Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:   notice: Result of probe
operation for fs-saptmp3 on dntstdb2s0703: 0 (ok)
Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:  warning: Action 19
(fs-saptmp3_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0): Error
Jul 30 10:09:04 dntstdb2s0703 crmd[90853]:  warning: Action 19
(fs-saptmp3_monitor_0) on dntstdb2s0703 failed (target: 7 vs. rc: 0): Error

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services




From:   Ken Gaillot 
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Date:   07/30/2019 12:47 AM
Subject:[EXTERNAL] Re: [ClusterLabs] Strange behaviour of group
resource
Sent by:"Users" 



On Thu, 2019-07-25 at 20:51 +0530, Dileep V Nair wrote:
> Hi,
>
> I have around 10 filesystems in a group. When I do a crm resource
> refresh, the filesystems are unmounted and remounted, starting from
> the fourth resource in the group. Any idea what could be going on, is
> it expected ?

No, it sounds like some of the reprobes are failing. The logs may have
more info. Each filesystem will have a probe like RSCNAME_monitor_0 on
each node.
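
For example (resource name hypothetical), a single group member can be
re-probed and the fail counts checked with something like:

  crm resource refresh fs-sapdata4   # re-probe just this resource
  crm_mon -1 -f                      # one-shot status including fail counts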

>
> Thanks & Regards
>
> Dileep Nair
> Squad Lead - SAP Base
> Togaf Certified Enterprise Architect
> IBM Services for Managed Applications
> +91 98450 22258 Mobile
> dilen...@in.ibm.com
>
> IBM Services
--
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Strange behaviour of group resource

2019-07-25 Thread Dileep V Nair

Hi,

I have around 10 filesystems in a group. When I do a crm resource
refresh, the filesystems are unmounted and remounted, starting from the
fourth resource in the group. Any idea what could be going on ? Is it
expected ?

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Re: Issue with DB2 HADR cluster

2019-04-03 Thread Dileep V Nair

Thanks a lot for all the information. I could see that the issue was indeed
with the PEER_WINDOW. I increased it and now it is working.

Thanks & Regards

Dileep Nair
Togaf Certified Enterprise Architect




From:   Valentin Vidic 
To: users@clusterlabs.org
Date:   04/03/2019 01:26 PM
Subject:Re: [ClusterLabs] Antw: Re: Issue with DB2 HADR cluster
Sent by:"Users" 



On Wed, Apr 03, 2019 at 10:36:52AM +0300, Andrei Borzenkov wrote:
> I assume this is path failover time? As I doubt storage latency can be
> that high?
>
> I wonder, does IBM have official guidelines for integrating SBD with
> their storage? Otherwise where this requirement comes from?

Yes, we had problems with SBD when the timeouts were lower, so it is
now configured based on this info:

# Set SCSI command timeout to 120s (default == 30 or 60) for IBM 2145 devices
https://www.ibm.com/support/knowledgecenter/ST3FR7_8.1.1/com.ibm.storwize.v7000.811.doc/svc_linux_settings.html
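
For illustration only (device path hypothetical), longer SBD timeouts to
match such SCSI timeouts are usually set when the device is created, and can
be verified afterwards:

  sbd -d /dev/disk/by-id/scsi-EXAMPLE -4 240 -1 120 create   # msgwait / watchdog timeouts
  sbd -d /dev/disk/by-id/scsi-EXAMPLE dump                   # verify what is on the device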


--
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Change SBD Disk to VCenter Stonith

2019-02-22 Thread Dileep V Nair

Hello Klaus,

Thanks for the suggestion. I tried both ways, but even then the pacemaker
service is not starting because there is a dependency on the SBD service,
which does not start without the SBD disk. I am planning to try uninstalling
the SBD service itself and see if that removes the dependency. Any suggestion
on how to remove the dependency without actually uninstalling the service
would be helpful.

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services




From:   Klaus Wenninger 
To: Cluster Labs - All topics related to open-source clustering
welcomed , Dileep V Nair

Date:   02/22/2019 12:25 PM
Subject:Re: [ClusterLabs] Change SBD Disk to VCenter Stonith



On 02/22/2019 06:24 AM, Dileep V Nair wrote:


  Hi,

  I have a running cluster with Stonith configured as SBD. Now I would
  like to remove the SBD disks and move to VCenter Stonith. After
  removing the SBD disk, I am not able to start pacemaker because of
  the dependent SBD service which was configured during
  ha-cluster-init. What is the best way to remove the SBD disk from the VM ?



The other way round would probably be handier: first remove the SBD
configuration and afterwards remove the disk from the VM.
But I guess it would probably be quickest to remove the disk from the
SBD config file (e.g. /etc/sysconfig/sbd) to get the pacemaker service up
again. Or you just disable the SBD service (e.g. systemctl disable sbd).
In both cases the SBD fencing resource would probably complain on
monitoring, but that shouldn't prevent you from adapting the stonith
configuration.
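
A minimal sketch of the two options described above, assuming SUSE default
paths:

  # option 1: drop the disk from the SBD config file, then restart the stack
  sed -i 's/^SBD_DEVICE=.*/SBD_DEVICE=""/' /etc/sysconfig/sbd
  # option 2: disable the sbd service entirely
  systemctl disable sbd
  systemctl start pacemaker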

Klaus





  Thanks & Regards

  Dileep Nair
  Squad Lead - SAP Base
  Togaf Certified Enterprise Architect
  IBM Services for Managed Applications
  +91 98450 22258 Mobile
  dilen...@in.ibm.com

  IBM Services





  ___
  Users mailing list: Users@clusterlabs.org
  https://lists.clusterlabs.org/mailman/listinfo/users

  Project Home: http://www.clusterlabs.org
  Getting started:
  http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org













___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Change SBD Disk to VCenter Stonith

2019-02-21 Thread Dileep V Nair


Hi,

I have a running cluster with Stonith configured as SBD. Now I would like
to remove the SBD disks and move to VCenter Stonith. After removing the SBD
disk, I am not able to start pacemaker because of the dependent SBD service
which was configured during ha-cluster-init. What is the best way to remove
the SBD disk from the VM ?


Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker log showing time mismatch after

2019-01-28 Thread Dileep V Nair

Thanks Ken for prompt response.

Yes, it was at system boot. I am yet to find out what caused the reboot.
There was no stonith or any other error in the pacemaker log.

Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified Enterprise Architect
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services




From:   Ken Gaillot 
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Date:   01/28/2019 09:18 PM
Subject:Re: [ClusterLabs] Pacemaker log showing time mismatch after
Sent by:"Users" 



On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
> Hi,
>
> I am seeing that there is a log entry showing Recheck Timer popped
> and the time in pacemaker.log went back in time. After some time the
> time issue corrected itself. Around the same time the resources also
> failed over (the slave became master). Does anyone know why this behaviour
> occurs ?
>
> Jan 23 01:16:48 [9383] pn4ushleccp1 lrmd: notice: operation_finished:
> db_cp1_monitor_2:32476:stderr [ /usr/bin/.: Permission denied. ]
> Jan 23 01:16:48 [9383] pn4ushleccp1 lrmd: notice: operation_finished:
> db_cp1_monitor_2:32476:stderr [ /usr/bin/.: Permission denied. ]
> Jan 22 20:17:03 [9386] pn4ushleccp1 crmd: info: crm_timer_popped:
> PEngine Recheck Timer (I_PE_CALC) just popped (90ms)

Pacemaker can handle the clock jumping forward, but not backward. The
recheck timer here is unrelated to the clock jump, it's just the first
log message to appear since it jumped.

You definitely want to find out what's changing the clock.

If this is at system boot, likely the hardware clock is wrong and some
time manager (ntp, etc.) is adjusting it. Pacemaker's systemd unit file
has "After=time-sync.target" to try to ensure that it doesn't start
until after this has happened, but unfortunately you often have to take
extra steps to make time managers use that target (e.g. enable chronyd-
wait.service if you're using chronyd), and of course if you're not
using systemd it's not any help. But the basic idea is you want to
ensure pacemaker starts after the time has been adjusted at boot.
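
A short sketch of the boot-time workaround described above, assuming systemd
with chrony:

  systemctl enable chronyd-wait.service         # make time-sync.target wait for real synchronisation
  systemctl list-dependencies time-sync.target  # see which units the target now pulls in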

If this isn't at boot, then your host has something weird going on.
Check the system log around the time of the jump, etc.

> Jan 22 20:17:03 [9386] pn4ushleccp1 crmd: notice:
> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE |
> input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
> Jan 22 20:17:03 [9386] pn4ushleccp1 crmd: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> process_pe_message: Input has not changed since last time, not saving
> to disk
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: notice: unpack_config:
> Relying on watchdog integration for fencing
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_online_status_fencing: Node pn4us7leccp1 is active
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_online_status: Node pn4us7leccp1 is online
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_online_status_fencing: Node pn4ushleccp1 is active
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_online_status: Node pn4ushleccp1 is online
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource db_cp1:0 active
> on pn4us7leccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource TSM_DB2 active
> on pn4us7leccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource TSM_DB2 active
> on pn4us7leccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource ip_cp1 active
> on pn4ushleccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource db_cp1:1 active
> in master mode on pn4ushleccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource TSM_DB2log
> active on pn4ushleccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info:
> determine_op_status: Operation monitor found resource KUD_DB2 active
> on pn4ushleccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info: native_print:
> stonith-sbd (stonith:external/sbd): Started pn4ushleccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info: native_print:
> ip_cp1 (ocf::heartbeat:IPaddr2): Started pn4us7leccp1
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info: clone_print:
> Master/Slave Set: ms_db2_cp1 [db_cp1]
> Jan 22 20:17:03 [9385] pn4ushleccp1 pengine: info: short_print:
> Masters: [ pn4us7leccp1 ]
> Jan 22 20:17:03 [9385] p

[ClusterLabs] Pacemaker log showing time mismatch after

2019-01-28 Thread Dileep V Nair


Hi,

I am seeing that there is a log entry showing Recheck Timer popped
and the time in pacemaker.log went back in time. After some time the time
issue corrected itself. Around the same time the resources also failed over
(the slave became master). Does anyone know why this behaviour occurs ?

Jan 23 01:16:48 [9383] pn4ushleccp1   lrmd:   notice:
operation_finished:   db_cp1_monitor_2:32476:stderr [ /usr/bin/.:
Permission denied. ]
Jan 23 01:16:48 [9383] pn4ushleccp1   lrmd:   notice:
operation_finished:   db_cp1_monitor_2:32476:stderr [ /usr/bin/.:
Permission denied. ]
Jan 22 20:17:03 [9386] pn4ushleccp1   crmd: info: crm_timer_popped:
PEngine Recheck Timer (I_PE_CALC) just popped (90ms)
Jan 22 20:17:03 [9386] pn4ushleccp1   crmd:   notice:
do_state_transition:  State transition S_IDLE -> S_POLICY_ENGINE |
input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped
Jan 22 20:17:03 [9386] pn4ushleccp1   crmd: info:
do_state_transition:  Progressed to state S_POLICY_ENGINE after
C_TIMER_POPPED
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
process_pe_message:   Input has not changed since last time, not saving to
disk
Jan 22 20:17:03 [9385] pn4ushleccp1pengine:   notice: unpack_config:
Relying on watchdog integration for fencing
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_online_status_fencing:  Node pn4us7leccp1 is active
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_online_status:  Node pn4us7leccp1 is online
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_online_status_fencing:  Node pn4ushleccp1 is active
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_online_status:  Node pn4ushleccp1 is online
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource db_cp1:0 active on
pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource TSM_DB2 active on
pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource TSM_DB2 active on
pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource ip_cp1 active on
pn4ushleccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource db_cp1:1 active in
master mode on pn4ushleccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource TSM_DB2log active on
pn4ushleccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
determine_op_status:  Operation monitor found resource KUD_DB2 active on
pn4ushleccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: native_print:
stonith-sbd (stonith:external/sbd): Started pn4ushleccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: native_print:
ip_cp1  (ocf::heartbeat:IPaddr2):   Started pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: clone_print:
Master/Slave Set: ms_db2_cp1 [db_cp1]
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: short_print:
Masters: [ pn4us7leccp1 ]
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: short_print:
Slaves: [ pn4ushleccp1 ]
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: native_print:
TSM_DB2 (systemd:dsmcad_db2):   Started pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: native_print:
TSM_DB2log  (systemd:dsmcad_db2log):Started pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: native_print:
KUD_DB2 (systemd:kuddb2_db2):   Started pn4us7leccp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info:
rsc_merge_weights:ms_db2_cp1: Breaking dependency loop at ms_db2_cp1
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: master_color:
Promoting db_cp1:0 (Master pn4us7leccp1)
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: master_color:
ms_db2_cp1: Promoted 1 instances of a possible 1 to master
Jan 22 20:17:03 [9385] pn4ushleccp1pengine: info: LogActions:
Leave   ip_cp1  (Started pn4us7leccp1)


After the transition, the date was shifted back to normal

Jan 22 20:47:03 [9386] pn4ushleccp1   crmd: info: do_log:
Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
Jan 22 20:47:03 [9386] pn4ushleccp1   crmd:   notice:
do_state_transition:  State transition S_TRANSITION_ENGINE -> S_IDLE |
input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
Jan 23 01:47:22 [9383] pn4ushleccp1   lrmd:   notice:
operation_finished:   db_cp1_monitor_2:19518:stderr [ /usr/bin/.:
Permission denied. ]
Jan 23 01:47:22 [9383] pn4ushleccp1   lrmd:   notice:
operation_finished:   db_cp1_monitor_2:19518:stderr [ /usr/bin/.:
Permission denied. ]



Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
Togaf Certified 

[ClusterLabs] Anyone have a document on how to configure VMWare fencing on Suse Linux

2018-12-13 Thread Dileep V Nair


Hi,

I am using pacemaker for my clusters and a shared SBD disk as the stonith
mechanism. Now I have an issue because I am using VMWare SRM for DR and
that does not support shared disks. So I am thinking of configuring
external/vcenter as the stonith mechanism. Is there any document which I
can refer to for configuring this ? Are there specific settings /
configurations to be done on the vCenter side to do this ? Any help is highly
appreciated.
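
This is not an authoritative recipe, but a rough sketch of what such a
stonith primitive tends to look like; all values are hypothetical and the
parameter names should be verified with 'stonith -t external/vcenter -n':

  crm configure primitive st-vcenter stonith:external/vcenter \
      params VI_SERVER="vcenter.example.com" \
             VI_CREDSTORE="/root/.vmware/credstore/vicredentials.xml" \
             HOSTLIST="node1=node1-vm;node2=node2-vm" \
             RESETPOWERON="0" \
      op monitor interval="60s"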

Thanks & Regards

Dileep Nair
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Floating IP active in both nodes

2018-10-26 Thread Dileep V Nair
Hello Gabriel,

I have a similar cluster configuration running fine. I am using the virtual
IP to NFS-mount a filesystem from Node 1 on Node 2. The differences I could
see from your configuration are listed below; a minimal sketch follows the list:

  > primitive site_one_ip ocf:heartbeat:IPaddr \
  > params ip="192.168.2.200" cidr_netmask="255.255.252.0" nic="eth0" \
  > op monitor interval="40s" timeout="20s"


  I use ocf:heartbeat:IPaddr2
  I have given only the IP parameter, no netmask and nic
  I have a virtual hostname associated with the IP addr using /etc/hosts
  and use the virtual hostname to connect.
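
A minimal sketch of the primitive along those lines (IP value kept from your
example):

  primitive site_one_ip ocf:heartbeat:IPaddr2 \
      params ip="192.168.2.200" \
      op monitor interval="40s" timeout="20s"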



Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services



From:   Gabriel Buades 
To: users@clusterlabs.org
Date:   10/26/2018 06:34 PM
Subject:Re: [ClusterLabs] Floating IP active in both nodes
Sent by:"Users" 



Hello Andrei.


I did not add lvs_support at first. I added it later when noticed the
problem to test if something changes, but I got same result

Gabriel


El vie., 26 oct. 2018 a las 11:47, Andrei Borzenkov ()
escribió:
  On 26.10.2018 11:14, Gabriel Buades wrote:
  > Dear cluster labs team.
  >
  > I previously configured a two nodes cluster with replicated maria Db.
  > To use one database as the active, and the other one as failover, I
  > configured a cluster using heartbeat:
  >
  > root@logpmgid01v:~$ sudo crm configure show
  > node $id="59bbdb76-be67-4be0-aedb-9e27d65f371e" logpmgid01v
  > node $id="adbc5972-c491-4fc4-b87d-8170e1b2d4d0" logpmgid02v \
  > attributes standby="off"
  > primitive site_one_ip ocf:heartbeat:IPaddr \
  > params ip="192.168.2.200" cidr_netmask="255.255.252.0" nic="eth0" \
  > op monitor interval="40s" timeout="20s"
  > location site_one_ip_pref site_one_ip 100: logpmgid01v
  > property $id="cib-bootstrap-options" \
  > dc-version="1.1.10-42f2063" \
  > cluster-infrastructure="heartbeat" \
  > stonith-enabled="false"
  >
  > Now, I've done a similar setup using corosync:
  > root@908soffid02:~# crm configure show
  > node 1: 908soffid01
  > node 2: 908soffid02
  > primitive site_one_ip IPaddr \
  > params ip=10.6.12.118 cidr_netmask=255.255.0.0 nic=ens160
  lvs_support=true \

  What is the reason you added lvs_support? Previous configuration did not
  have it.

  > meta target-role=Started is-managed=true
  > location cli-prefer-site_one_ip site_one_ip role=Started inf:
  908soffid01
  > location site_one_ip_pref site_one_ip 100: 908soffid01
  > property cib-bootstrap-options: \
  > have-watchdog=false \
  > dc-version=1.1.14-70404b0 \
  > cluster-infrastructure=corosync \
  > cluster-name=debian \
  > stonith-enabled=false \
  > no-quorum-policy=ignore \
  > maintenance-mode=false
  >
  > Apparently, it works fine, and floating IP address is active in node1:
  > root@908soffid02:~# crm_mon -1
  > Last updated: Fri Oct 26 10:06:12 2018 Last change: Fri Oct 26 10:02:53
  > 2018 by root via cibadmin on 908soffid02
  > Stack: corosync
  > Current DC: 908soffid01 (version 1.1.14-70404b0) - partition with
  quorum
  > 2 nodes and 1 resource configured
  >
  > Online: [ 908soffid01 908soffid02 ]
  >
  >  site_one_ip (ocf::heartbeat:IPaddr): Started 908soffid01
  >
  > But when node2 tries to connect to the floating IP address, it gets
  > connected to itself, despite the IP address is bound to the first node:
  > root@908soffid02:~# ssh root@10.6.12.118 hostname
  > root@soffiddb's password:
  > 908soffid02
  >
  > I'd want the second node connects to the actual floating IP address,
  but I
  > cannot see how to set it up. Any help is welcome.
  >
  > I am using pacemaker 1.1.14-2ubuntu1.4 and corosync 2.3.5-3ubuntu2.1
  >
  > Kind regards.
  >
  >
  >
  > Gabriel Buades
  >
  >
  > ___
  > Users mailing list: Users@clusterlabs.org
  > https://lists.clusterlabs.org/mailman/listinfo/users
  >
  > Project Home: http://www.clusterlabs.org
  > Getting started:
  http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  > Bugs: http://bugs.clusterlabs.org
  >

  ___
  Users mailing list: Users@clusterlabs.org
  https://lists.clusterlabs.org/mailman/listinfo/users

  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
  ___
  Users mailing list: Users@clusterlabs.org
  https://lists.clusterlabs.org/mailman/listinfo/users


  Project Home:
  http://www.clusterlabs.org

  Getting started:
  http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

  Bugs:
  http://bugs.clusterlabs.org



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Encrypted passwords for Resource Agent Scripts

2018-09-21 Thread Dileep V Nair


Hi,

I have written heartbeat resource agent scripts for Oracle and
Sybase. Both scripts take user passwords as parameters. Is there a way
to do some encryption of the passwords so that the plain-text passwords
are not visible in the primitive definition ?


Thanks & Regards

Dileep Nair
Squad Lead - SAP Base
IBM Services for Managed Applications
+91 98450 22258 Mobile
dilen...@in.ibm.com

IBM Services
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker not restarting Resource on same node

2018-06-28 Thread Dileep V Nair


Hi,

I have a cluster with DB2 running in HADR mode. I have used the db2
resource agent. My problem is that whenever DB2 fails on the primary it
migrates to the secondary node. Ideally it should be restarted locally up to
three times (migration-threshold set to 3), but that is not happening. This
is causing extra downtime for the customer. Are there any other settings /
parameters which need to be set ? Did anyone face a similar issue ? I am on
pacemaker version 1.1.15-21.1.
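
For what it is worth, a hedged sketch (resource and node names hypothetical)
of checking where the threshold and fail counts stand:

  crm resource meta db_db2 show migration-threshold
  crm resource failcount db_db2 show node1
  crm_mon -1 -f     # failed actions and fail counts per node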

Dileep V Nair

dilen...@in.ibm.com

IBM Services
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Sybase HADR Resource Agent

2018-02-22 Thread Dileep V Nair

Thanks for the response. I tried that, but I think it does not take care
of the HADR setup.



   
Regards,

Dileep V Nair
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India









From:   Oyvind Albrigtsen <oalbr...@redhat.com>
To: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Date:   02/22/2018 04:07 PM
Subject:Re: [ClusterLabs] Sybase HADR Resource Agent
Sent by:"Users" <users-boun...@clusterlabs.org>



On 21/02/18 21:41 +0530, Dileep V Nair wrote:
>
>
>Hi,
>
>I am trying to configure Pacemaker to automate a Sybase HADR
setup.
>Is anyone aware of a Resource Agent which I can use for this.
There's a Sybase ASE agent available at:
https://github.com/ClusterLabs/resource-agents/pull/

>
>
>
> Regards,
>
> Dileep V Nair
> Senior AIX Administrator
> Cloud Managed Services Delivery (MSD), India
> IBM Cloud
>
>
>
>
> E-mail: dilen...@in.ibm.com
> Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India
>
>
>
>

>___
>Users mailing list: Users@clusterlabs.org
>https://lists.clusterlabs.org/mailman/listinfo/users
>
>Project Home: http://www.clusterlabs.org
>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Sybase HADR Resource Agent

2018-02-21 Thread Dileep V Nair


Hi,

I am trying to configure Pacemaker to automate a Sybase HADR setup.
Is anyone aware of a Resource Agent which I can use for this.



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India





___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-19 Thread Dileep V Nair

Hello Ondrej,

I am still having issues with my DB2 HADR on Pacemaker. When I do a
db2_kill on the primary for testing, initially it restarts DB2 on the
same node. But if I let it run for some days and then try the same test, it
goes into fencing and then reboots the primary node.

I am not sure how exactly it should behave in case my DB2 crashes on
the primary.

Also, if I crash Node 1 (the node itself, not only DB2), it promotes
Node 2 to primary, but once Pacemaker is started again on Node 1, the DB on
Node 1 is also promoted to primary. Is that expected behaviour ?
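
For reference, these are the commands typically used to watch the HADR role
and state during such tests (database name hypothetical):

  db2pd -db SAMPLE -hadr     # shows HADR role, state and connection status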

   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India









From:   Ondrej Famera <ofam...@redhat.com>
To: Dileep V Nair <dilen...@in.ibm.com>
Cc: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Date:   02/12/2018 11:46 AM
Subject:Re: [ClusterLabs] Issues with DB2 HADR Resource Agent



On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000 which
> I guess is a reasonable value. What I am noticing is it does not wait
> for the PEER_WINDOW. Before that itself the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker give an Error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
>
> Regards,
>
> *Dileep V Nair*

Hi Dileep,

sorry for later response. The DB2 should not get into the
'REMOTE_CATCHUP' phase or the DB2 resource agent will indeed not
promote. From my experience it usually gets into that state when the DB2
on standby was restarted during or after PEER_WINDOW timeout.

When the primary DB2 fails then standby should end up in some state that
would match the one on line 770 of DB2 resource agent and the promote
operation is attempted.

  770  STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770


The DB2 on standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher
if this was the case.
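
A hedged fragment only, showing where the promote timeout lives on the DB2
primitive (instance and database names hypothetical):

  primitive db_db2 ocf:heartbeat:db2 \
      params instance="db2inst1" dblist="SAMPLE" \
      op promote interval="0" timeout="120s" \
      op monitor interval="30s" role="Master" timeout="60s"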

So if you see that DB2 was restarted after Primary failed, increase the
promote timeout. If DB2 was not restarted then question is why DB2 has
decided to change the status in this way.

Let me know if above helped.

--
Ondrej Faměra
@Red Hat



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-12 Thread Dileep V Nair

Thanks Ondrej for the response. I also figured out the same and reduced the
HADR_TIMEOUT and increased the promote timeout which helped in resolving
the issue.



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India









From:   Ondrej Famera <ofam...@redhat.com>
To: Dileep V Nair <dilen...@in.ibm.com>
Cc: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Date:   02/12/2018 11:46 AM
Subject:Re: [ClusterLabs] Issues with DB2 HADR Resource Agent



On 02/01/2018 07:24 PM, Dileep V Nair wrote:
> Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000 which
> I guess is a reasonable value. What I am noticing is it does not wait
> for the PEER_WINDOW. Before that itself the DB goes into a
> REMOTE_CATCHUP_PENDING state and Pacemaker give an Error saying a DB in
> STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.
>
>
> Regards,
>
> *Dileep V Nair*

Hi Dileep,

sorry for later response. The DB2 should not get into the
'REMOTE_CATCHUP' phase or the DB2 resource agent will indeed not
promote. From my experience it usually gets into that state when the DB2
on standby was restarted during or after PEER_WINDOW timeout.

When the primary DB2 fails then standby should end up in some state that
would match the one on line 770 of DB2 resource agent and the promote
operation is attempted.

  770  STANDBY/*PEER/DISCONNECTED|Standby/DisconnectedPeer)

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/db2#L770


The DB2 on standby can get restarted when the 'promote' operation times
out, so you can try increasing the 'promote' timeout to something higher
if this was the case.

So if you see that DB2 was restarted after Primary failed, increase the
promote timeout. If DB2 was not restarted then question is why DB2 has
decided to change the status in this way.

Let me know if above helped.

--
Ondrej Faměra
@Red Hat



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-01 Thread Dileep V Nair

Thanks Ondrej for the response. I have set the PEER_WINDOW to 1000 which I
guess is a reasonable value. What I am noticing is it does not wait for the
PEER_WINDOW. Before that itself the DB goes into a REMOTE_CATCHUP_PENDING
state and Pacemaker gives an error saying a DB in
STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted.



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India









From:   Ondrej Famera <ofam...@redhat.com>
To: Dileep V Nair <dilen...@in.ibm.com>
Cc: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Date:   02/01/2018 02:48 PM
Subject:Re: [ClusterLabs] Issues with DB2 HADR Resource Agent



On 02/01/2018 05:57 PM, Dileep V Nair wrote:
> Now the second issue I am facing is that when I crash the node were DB
> is primary, the STANDBY DB is not getting promoted to PRIMARY. I could
> fix that by adding below lines in db2_promote()
>
> 773 *)
> 774 # must take over forced
> 775 force="by force"
> 776
> 777 ;;
>
> But I am not sure of the implications that this can cause.
>
> Can someone suggest whether what I am doing is correct OR will this lead
> to any Data loss.


Hi Dileep,

As for the 'by force' implications you may check the documentation on
what it brings. In short: the data can get corrupted.

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0011553.html#r0011553__byforce


The original 'by force peer window only' is limiting the takeover to
period when DB2 is within PEER_WINDOW which gives a bit more safety.
(the table in link above also explains how much safer it is)

Instead of changing the resource agent I would rather suggest checking
the PEER_WINDOW and HADR_TIMEOUT variables in DB2. They determine how
long it is possible to do takeover 'by force peer window only'.
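
A short sketch of checking and adjusting the two values mentioned above
(database name hypothetical):

  db2 get db cfg for SAMPLE | grep -iE 'HADR_TIMEOUT|HADR_PEER_WINDOW'
  db2 update db cfg for SAMPLE using HADR_PEER_WINDOW 300 HADR_TIMEOUT 120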

--
Ondrej Faměra
@Red Hat



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Issues with DB2 HADR Resource Agent

2018-02-01 Thread Dileep V Nair


Hi,

I am facing multiple issues with the DB2 Resource Agent.


The first issue was that when I start Pacemaker, DB2 is also started and
immediately after that DB2 gets stopped. I fixed that by removing the \ before
the $ in line numbers 686, 689 and 691.


684 CMD="if db2 connect to $db;
685 then
686 db2 select * from sysibm.sysversions ; rc=$?;
687 db2 terminate;
688 else
689 rc=$?;
690 fi;
691 exit $rc"


Now the second issue I am facing is that when I crash the node where the DB is
primary, the STANDBY DB is not getting promoted to PRIMARY. I could fix
that by adding the below lines in db2_promote():


773 *)
774 # must take over forced
775 force="by force"
776
777 ;;


But I am not sure of the implications that this can cause.


Can someone suggest whether what I am doing is correct, or will this lead to
any data loss ?



   
Regards,

Dileep V Nair
Senior AIX Administrator
Cloud Managed Services Delivery (MSD), India
IBM Cloud
E-mail: dilen...@in.ibm.com
Outer Ring Road, Embassy Manya, Bangalore, KA 560045, India





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org